Building a data labeling app with Python, Dash, and Databricks Lakehouse
✍️ Intro
Data labeling is a crucial aspect of machine learning that involves annotating raw data to create labeled datasets that can be used to train models.
However, the process of labeling large volumes of data can be time-consuming and error-prone. Fortunately, there are several tools and technologies available that can simplify the data labeling process and increase its efficiency. In this blog post, we will explore how to build a data labeling app using Python, Dash, and the Databricks Lakehouse platform.
By the end of this post, you will have a better understanding of how to leverage these powerful tools to create a custom data labeling app that can save you time and increase the accuracy of your labeled datasets.
🏗️ Architecture
Let's take a closer look at the architecture of the labeling app. As usual, we have several layers — frontend, backend, and storage:
Some considerations that make this approach distinct from others:
- [Frontend] Data scientists usually prefer to stick to their main language, which is often Python. Therefore we’ll use Dash, a robust framework for building UI applications in pure Python.
- [Backend] This time we’ll use Databricks SQL not as a high-performance query backend (as it’s used in many BI apps), but as a CRUD interface to the table with data.
- [Storage] Since DBSQL is being used and we’ll need updates, Delta is the obvious choice.
To sum up — we’re going to build a CRUD-like application with UI on top of the data stored in the Databricks Lakehouse.
❓What will this app do?
The app described in this post implements a user interface for classifying text blocks. Users open the app in the browser, see the text blocks, and select the relevant class for each block. Clicking Confirm saves the newly selected class back into the Lakehouse.
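To make the read-then-update flow concrete before any Dash code appears, here is a minimal sketch that uses the standard library's sqlite3 module as a local stand-in for Databricks SQL. The table and column names (labels, label_id, text, label) match the model defined later in this post; the sample rows and the in-memory database are assumptions made purely for illustration.

```python
import sqlite3

# Stand-in for the Lakehouse table: an in-memory SQLite database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE labels (label_id INTEGER PRIMARY KEY, text TEXT, label TEXT)"
)
conn.executemany(
    "INSERT INTO labels (label_id, text, label) VALUES (?, ?, ?)",
    [
        (0, "The battery died after a week.", "negative"),  # made-up sample row
        (1, "Works exactly as described.", "unlabeled"),  # made-up sample row
    ],
)


def get_element_by_id(label_id):
    """Read one row: this is what the app would show to the user."""
    return conn.execute(
        "SELECT label_id, text, label FROM labels WHERE label_id = ?",
        (label_id,),
    ).fetchone()


def update_element_by_id(label_id, new_label):
    """Write the confirmed class back to the table."""
    with conn:  # the context manager commits on success
        conn.execute(
            "UPDATE labels SET label = ? WHERE label_id = ?",
            (new_label, label_id),
        )


update_element_by_id(1, "positive")
row = get_element_by_id(1)
```

The real app issues the equivalent statements through SQLAlchemy against a Delta table, so treat this only as a picture of the data contract between the UI and the storage.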
🔷 Frontend
First, let’s quickly sketch out the UI component. What we’re going to build is a simple app that allows the user to see some text and then assign it one of the given labels:
Well, maybe it’s not the most enjoyable UI, but definitely a simple and clear one. Hopefully, professional web designers can forgive me for the UX of this one.
If you have ideas on what such an app should look like, drop a line in the comments!
Now it’s time to code a bit. I’ll omit the package setup; take a look at the developer instructions in this repo for details.
Let’s start by sketching out the layout:
app.layout = html.Div(
    [
        html.Div(
            [
                header,
                guideline,
                current_index_view,
                html.Div(
                    [
                        text_container,
                        html.Div(
                            [
                                class_guideline,
                                dropdown(CLASSES),
                                confirm_button,
                            ],
                            style={"flex-basis": "10%", "padding-left": "1em"},
                        ),
                    ],
                    style={
                        "display": "flex",
                        "flex-direction": "row",
                        "padding": "2em",
                        "margin-bottom": "2em",
                    },
                ),
                html.Div(
                    navigation_buttons,
                    style={
                        "justify-content": "space-evenly",
                        "display": "flex",
                    },
                ),
                dcc.Store(id="current-index", data=choice(ALL_IDS)),
            ],
            style={
                "padding-top": "1em",
                "padding-left": "2em",
                "padding-right": "2em",
                "padding-bottom": "5em",
            },
        ),
    ]
)
I made some minor changes to the look of the original sketch. Let’s see what comes out of it:
Well, this looks kinda functional, but we definitely miss some 💫fancy💫 feeling here.
Let’s fix these things and sugar-coat it a bit with CSS. We’re going to do several things:
- Add bootstrap to make styles nicer
- Switch to another font
- Add a nice background gradient (because it’s cool and easy!)
Adding Bootstrap support is pretty easy:
external_stylesheets = [
    # bootstrap
    {
        "href": "https://cdn.jsdelivr.net/npm/bootstrap@3.4.1/dist/css/bootstrap.min.css",
        "rel": "stylesheet",
        "integrity": "sha384-HSMxcRTRxnN+Bdg0JdbxYKrThecOKuH5zCYotlSAcp1+c8xmyTe9GYg1l9a69psu",
        "crossorigin": "anonymous",
    },
    # fonts
    {"rel": "preconnect", "href": "https://fonts.googleapis.com"},
    {
        "rel": "preconnect",
        "href": "https://fonts.gstatic.com",
        "crossorigin": "anonymous",
    },
    {
        "href": "https://fonts.googleapis.com/css2?family=Inter&family=Noto+Sans&display=swap",
        "rel": "stylesheet",
    },
]
# later in the app - add a reference
app = Dash(
    __name__, title="Data Labeling App", external_stylesheets=external_stylesheets
)
Note: I’m using Bootstrap 3 which is pretty old. Consider using the latest version.
To make buttons look nice, simply add relevant class names to them:
# later in the layout - add some classes to make buttons look better
confirm_button = dcc.Loading(
    id="submit-loading",
    children=[
        html.Button(
            "Confirm",
            id="confirm_btn",
            n_clicks=0,
            className="btn btn-success btn-block btn-md",
        ),
        html.Div(id="output-mock", style={"display": "none"}),
    ],
)
All these steps will make our app look like this:
This already looks softer, but I really don’t like the default fonts of Bootstrap.
Therefore, let’s also add a new folder called assets and put a file custom.css into it. Please note the package layout:
|-- README.md
|-- dbsql_labeling_app_example
| |-- __init__.py
| |-- app.py
| |-- assets
| | `-- custom.css
The assets folder should be in exactly the same folder as the app.
With assets configured, put the font settings into the custom.css file:
body {
    font-family: 'Inter', sans-serif;
    height: 100%;
}
html {
    height: 100%;
}
h1 {
    margin: 0;
}
a {
    color: plum;
}
However, the result looks a bit strange: the whole page is set in one single font. It would be nice to visually separate the text block that users will classify from the rest of the app.
This can be achieved by providing a custom font for a component:
# note the styling part with font-family.
dcc.Markdown(
    id="text-container",
    style={
        "font-size": "1.2em",
        "font-family": "'Noto Sans', sans-serif",
        "height": "50vh",
        "overflow-y": "auto",
    },
),
All this together will end up in the following UI:
This already looks slicker, but still not 💫fancy💫. To ensure the proper level of fanciness a nice background gradient is definitely required. I’ve used the ColorSpace tool to generate a cool gradient and put it into the body definition:
body {
    background-image: linear-gradient(to right top, #1b3139, #164472, #754293, #d41677, #ff3621);
    color: whitesmoke;
    font-family: 'Inter', sans-serif;
    height: 100%;
}
All this together will lead to the following:
Unfortunately, styling is only one part of the frontend story. The other part is the logic that connects buttons and web components. This logic is a bit non-linear, so it helps to write down the requirements first.
The app should be able to show any of the texts provided by the backend, together with their associated labels. The user should be able to change the label of a text and should be offered the list of available classes.
This description leads to the following:
- A state with an index is required. The buttons change the index, and the text and label are displayed based on it.
- A class selector is required. When the user confirms the choice, the chosen class should be sent to the backend via update-by-id logic.
Fortunately, Dash has great capabilities for state management and callbacks.
The index part is straightforward with these capabilities. First, we need to define the buttons, which will act as inputs:
navigation_buttons = [
    html.Button(
        "⬅️ Previous",
        id="prev_btn",
        n_clicks=0,
        className="btn btn-default btn-lg",
    ),
    html.Button(
        "🔀 Random",
        id="random_btn",
        n_clicks=0,
        className="btn btn-default btn-lg",
    ),
    html.Button(
        "Next ➡️",
        id="next_btn",
        n_clicks=0,
        className="btn btn-default btn-lg",
    ),
]
Second, let’s introduce the index as a store component:
dcc.Store(id="current-index", data=choice(ALL_IDS)),
The Store component allows you to keep JSON-serializable data inside the browser (somewhat similar to React useState).
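As a rough way to check whether a value is a good candidate for a Store, the sketch below tries to serialize it with the standard library's json module. This is only an approximation: Dash uses its own encoder, which may accept some types that plain json rejects, and the helper name is_json_serializable and the Sample class are hypothetical.

```python
import json


def is_json_serializable(value) -> bool:
    """Return True if the standard json module can encode the value."""
    try:
        json.dumps(value)
        return True
    except (TypeError, ValueError):
        return False


class Sample:
    """An arbitrary Python object with no JSON representation."""


ok_index = is_json_serializable(42)  # a plain integer index
ok_dict = is_json_serializable({"index": 3, "seen": [1, 2]})
bad_object = is_json_serializable(Sample())  # custom objects are rejected
```

In this app the stored value is a single integer index, so it passes the check without any extra work.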
Finally, it’s time to bind the buttons to the index. In Dash such interactions are organized as callbacks:
@app.callback(
    Output("current-index", "data"),
    Output("text-container", "children"),
    Output("class-selector", "value"),
    Output("prev_btn", "disabled"),
    Output("next_btn", "disabled"),
    inputs=[
        Input("current-index", "data"),
        Input("prev_btn", "n_clicks"),
        Input("next_btn", "n_clicks"),
        Input("random_btn", "n_clicks"),
    ],
)
def navigate_to_element(current_index, _, __, ___):
    if "prev_btn" == ctx.triggered_id:
        current_index -= 1
    elif "random_btn" == ctx.triggered_id:
        current_index = choice(ALL_IDS)
    elif "next_btn" == ctx.triggered_id:
        current_index += 1
    label_data = operator.get_element_by_id(current_index)
    disable_previous = current_index <= 0
    disable_next = current_index >= len(ALL_IDS) - 1
    return (
        current_index,
        label_data.text,
        label_data.label,
        disable_previous,
        disable_next,
    )
This specific callback is triggered whenever any of the navigation buttons is clicked:
The outputs define the side effects of these clicks. As per the code above, there are 5 potential side effects:
- The current index itself will be changed
- The text container component will be filled with the text of the relevant sample, fetched from the Lakehouse
- The label component will be filled with the label of the relevant sample
- The last two outputs disable the Previous/Next buttons when there is no previous or next ID available.
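The index arithmetic in this callback can be pulled out into a plain function, which makes it easy to try without running Dash. The sketch below mirrors the branches of the callback; the function name navigate is hypothetical, and, like the callback itself, it assumes that the IDs are contiguous integers starting at 0.

```python
from random import choice
from typing import Optional, Sequence, Tuple


def navigate(
    current_index: int, triggered_id: Optional[str], all_ids: Sequence[int]
) -> Tuple[int, bool, bool]:
    """Return (new_index, disable_previous, disable_next)."""
    if triggered_id == "prev_btn":
        current_index -= 1
    elif triggered_id == "random_btn":
        current_index = choice(all_ids)
    elif triggered_id == "next_btn":
        current_index += 1
    # Any other trigger (for example the initial page load) keeps the index.
    disable_previous = current_index <= 0
    disable_next = current_index >= len(all_ids) - 1
    return current_index, disable_previous, disable_next


ids = [0, 1, 2]
first = navigate(0, None, ids)
forward = navigate(0, "next_btn", ids)
last = navigate(1, "next_btn", ids)
back = navigate(2, "prev_btn", ids)
shuffled = navigate(0, "random_btn", ids)
```

Because the Previous and Next buttons are disabled at the edges, the function does not clamp the index itself; if the IDs in your table are not contiguous, this logic would need a lookup by position instead.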
Class selection and confirmation follow the same logic.
From the UI perspective, two components are introduced:
def dropdown(classes: List[str]) -> html.Div:
    return html.Div(
        dcc.Loading(
            id="dropdown-loader",
            children=[
                dcc.Dropdown(
                    classes,
                    id="class-selector",
                    placeholder="Select the class",
                    multi=False,
                    clearable=False,
                    style={
                        "color": "black",
                    },
                ),
            ],
        ),
        style={
            "padding-top": "10px",
            "padding-bottom": "10px",
        },
    )


confirm_button = dcc.Loading(
    id="submit-loading",
    children=[
        html.Button(
            "Confirm",
            id="confirm_btn",
            n_clicks=0,
            className="btn btn-success btn-block btn-md",
        ),
        html.Div(id="output-mock", style={"display": "none"}),
    ],
)
And then a callback is added to bind events together:
@app.callback(
    Output("output-mock", "children"),
    Input("confirm_btn", "n_clicks"),
    State("class-selector", "value"),
    State("current-index", "data"),
)
def save_selected_class(_, value, current_index):
    if value:
        operator.update_element_by_id(current_index, value)
    return value
An attentive reader might notice an interesting twist above: what is Output("output-mock", "children") actually used for?
The explanation is fairly simple: Dash versions before 2.17 don’t support callbacks without outputs.
At the same time, it’s possible to create a dummy output object that is not visible to users:
html.Div(id="output-mock", style={"display": "none"}),
Moreover, this object will come in handy later, when a Loading state is introduced.
Loading state handling is fully covered by Dash’s built-in capabilities:
text_container = html.Div(
    dcc.Loading(
        id="text-block-loading",
        children=[
            dcc.Markdown(
                id="text-container",
                style={
                    "font-size": "1.2em",
                    "font-family": "'Noto Sans', sans-serif",
                    "height": "50vh",
                    "overflow-y": "auto",
                },
            ),
        ],
    ),
    style={
        "flex-basis": "90%",
        "padding-right": "1em",
    },
)
Hint: wrap the Loading component in an html.Div with generic layout styles so that it looks stable in the UI.
And that’s it: no need to connect callbacks, if-else, when, etc. Just a simple and concise wrapper around the UI component that’s being loaded.
Now to the cool tricks: the output-mock we’ve used above can be used to add a Loading state to the confirm button itself:
confirm_button = dcc.Loading(
    id="submit-loading",
    children=[
        html.Button(
            "Confirm",
            id="confirm_btn",
            n_clicks=0,
            className="btn btn-success btn-block btn-md",
        ),
        html.Div(id="output-mock", style={"display": "none"}),
    ],
)
Since the button is not reloaded when it’s clicked, it’s possible to simply wrap it into a Loading block together with its mocked output. This adds smooth behavior on confirm click:
The app now has everything required for text labeling; the only thing missing is the backend connection.
💎 Backend
Given the code above, the backend logic is fairly straightforward. To simplify the CRUD-related operations, I’ve put all of the relevant methods into a single class:
from itertools import chain
from typing import List, Optional

from sqlalchemy import select
from sqlalchemy.orm import Session

from dbsql_labeling_app_example.engine import Label, engine


class DataOperator:
    def __init__(self) -> None:
        print("Initializing the session for CRUD operator")
        self._session = Session(bind=engine)
        print("Session initialized")

    def get_all_ids(self) -> List[int]:
        return self._session.execute(select(Label.label_id)).scalars().all()

    def get_element_by_id(self, id: int) -> Optional[Label]:
        return self._session.query(Label).get(id)

    def update_element_by_id(self, id: int, new_label: str):
        self._session.query(Label).filter_by(label_id=id).update(
            {Label.label: new_label}
        )

    def get_all_classes(self) -> List[str]:
        return list(chain(*self._session.query(Label.label).distinct().all()))
This class provides data from the backend to the frontend app (and is therefore used in various callbacks).
The real beauty of SQLAlchemy is that all these standard CRUD operations are already in place.
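For local experiments without a SQL warehouse, one option is a small in-memory stand-in that exposes the same four method names as DataOperator. This is only a sketch: FakeLabel and FakeDataOperator are hypothetical names that are not part of the repo, the sample rows are made up, and the results are sorted for determinism, which the real queries do not guarantee.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FakeLabel:
    """Mirrors the three columns of the labels table."""

    label_id: int
    text: str
    label: str


class FakeDataOperator:
    """In-memory stand-in with the same method names as DataOperator."""

    def __init__(self, rows: List[FakeLabel]) -> None:
        self._rows: Dict[int, FakeLabel] = {row.label_id: row for row in rows}

    def get_all_ids(self) -> List[int]:
        return sorted(self._rows)

    def get_element_by_id(self, id: int) -> Optional[FakeLabel]:
        return self._rows.get(id)

    def update_element_by_id(self, id: int, new_label: str) -> None:
        # Unknown ids are ignored, like an UPDATE that matches no rows.
        if id in self._rows:
            self._rows[id].label = new_label

    def get_all_classes(self) -> List[str]:
        return sorted({row.label for row in self._rows.values()})


fake = FakeDataOperator(
    [FakeLabel(0, "first text", "spam"), FakeLabel(1, "second text", "ham")]
)
fake.update_element_by_id(1, "spam")
```

Since the callbacks only rely on these four methods and on the text and label attributes, swapping such an object in for the real operator should be enough to click through the UI offline.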
At the same time, the Databricks SQL Connector can be easily used together with SQLAlchemy, providing efficient functions and methods to build CRUD operations of any complexity:
from sqlalchemy import create_engine

engine = create_engine(
    f"databricks://token:{endpoint_info.token}@{endpoint_info.server_hostname}?http_path={endpoint_info.http_path}&catalog={endpoint_info.catalog}&schema={endpoint_info.database}",
    echo=debug_mode,
)
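The connection string above is assembled with a single f-string. If any of the values could contain characters that are special in URLs, it may be safer to build it with urllib.parse. The helper below is a sketch under that assumption: build_databricks_url is a hypothetical name, and the hostname, path, and token in the example are placeholders rather than real values.

```python
from urllib.parse import quote, urlencode


def build_databricks_url(
    token: str, server_hostname: str, http_path: str, catalog: str, schema: str
) -> str:
    """Assemble a URL of the same shape as the f-string above."""
    # Keep "/" readable in http_path; escape everything else that needs it.
    query = urlencode(
        {"http_path": http_path, "catalog": catalog, "schema": schema},
        safe="/",
    )
    # Escape the token so characters such as "@" or ":" cannot break parsing.
    return f"databricks://token:{quote(token, safe='')}@{server_hostname}?{query}"


url = build_databricks_url(
    token="tok@en",  # placeholder with a character that must be escaped
    server_hostname="host.example.com",  # placeholder hostname
    http_path="/sql/1.0/warehouses/abc",  # placeholder path
    catalog="main",
    schema="default",
)
```

The resulting string can then be passed to create_engine in the same way as the original f-string.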
An additional bonus of using SQLAlchemy is that the code becomes well-typed. For instance, the objects that are used in the frontend code, as well as their attributes can be defined in a few lines of code:
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

from dbsql_labeling_app_example.engine_provider import get_prepared_engine

print("Preparing the SQL engine")
engine = get_prepared_engine()
Base = declarative_base(bind=engine)
print("SQL Engine prepared")


class Label(Base):
    __tablename__ = "labels"
    label_id = Column(Integer, primary_key=True)
    text = Column(String)
    label = Column(String)
Since the frontend code shares the same codebase and runtime, serialization and deserialization (SerDe) between the “backend” and the “frontend” becomes unnecessary.
✅ Summary
- CRUD with DBSQL and Delta directly on top of the Lakehouse is a real thing
- With the Dash framework, one can quickly build efficient data applications.
- SQLAlchemy and DBSQL work together pretty well, which allows reusing existing ORM tooling and brings a lot of benefits to Python users.
The source code for this app as well as demo can be found here. Feel free to copy it and fiddle around with your use cases 🙌.
Have you tried using Databricks SQL and Dash already? Have an opinion? Feel free to share it in the comments! Also, hit subscribe if you liked the post — it keeps the author motivated to write more.