Building data visualization apps with Streamlit and Databricks SQL
Intro
Data visualization became an inevitable component (and sometimes — the most important result) of data analysis.
The set of tools for data visualizations is growing rapidly. In general, developers of data visualization applications are choosing between two options:
- No-code BI visualization tools, with a drag-n-drop interface and some possible extensions for customization
- Web frameworks with rich visualization capabilities in high-level programming languages, such as JS, Python, and R
As a person preferring the latter option, I was always missing a quick and programmable way to turn my ideas into a simple, nice-looking web page. It’s indeed possible to do this by using JavaScript and its extremely flexible set of libraries and frameworks, but sometimes I was looking towards a more Pythonic way to build such applications. A couple of years ago this meant using Flask/Django and adding all visualization capabilities by using CSS, HTML, and JS. Back then, there was no easy-to-use and widely adopted framework for building a rich web application solely in Python.
The universe does not tolerate the absence of something, therefore projects such as Dash and Streamlit gained their audience and interest among Python developers, data engineers, and data scientists.
In this particular blog post, I would like to show an example Streamlit application that connects to Databricks SQL Endpoint and visualizes the data. If you prefer the “just-show-me-the-code” approach, please find the source code here.
Preparing the data
For demo purposes, I’m using the NYC Taxi Dataset. Databricks users can access this dataset easily in any workspace:
display(dbutils.fs.ls("dbfs:/databricks-datasets/nyctaxi"))
Let’s start by creating a database and a Delta table out of one of the tables:
CREATE DATABASE IF NOT EXISTS streamlit_demo_db;
CREATE TABLE IF NOT EXISTS streamlit_demo_db.nyctaxi_yellow
USING DELTA
LOCATION "dbfs:/databricks-datasets/nyctaxi/tables/nyctaxi_yellow";
You can execute this snippet either via a notebook, attached to an interactive cluster or as a query via Databricks SQL endpoint. This code will create an external table in Hive metastore, allowing users to access the content of the underlying files in cloud storage via a pure SQL interface.
Connecting to the Databricks SQL endpoint
To serve the SQL queries on top of the defined table, we can use multiple different interfaces. This is possible due to the open nature of the Delta Lake format — you can query tables in this format from various systems.
In this particular example, I’m going to use Databricks SQL — a performant and cost-efficient SQL interface on top of the data in object storages, with real-time performance provided by Delta Engine. There is numerous amount of different BI tools which support it, but in this particular example, we’re going to use ODBC driver + pyodbc + pandas to connect to the endpoint and execute our queries.
To unify the solution for different platforms, I’ve packaged the ODBC driver into a Docker container. You can find the reference Dockerfile here. Now, it’s time to use generic Pandas code to access the data.
For convenience, I’ve extracted all low-level code into a DataProvider
class. This class takes care of picking up the environment variables that define the endpoint information (host, token, and HTTP path). Please note that these environment variables shall be kept as secrets.
After defining the connection string, we can easily execute any SQL and return results directly into Pandas (snippet from DataProvider):
Now we have all low-level data transfer details prepared, so we can dive straight into the visualization part.
Building visualizations with Streamlit
Streamlit is a new framework for building web applications in Python. My personal opinion after working with it can be described as: “Flask on steroids with batteries included”.
Here is a quick outline of the application I’ve built with relatively zero knowledge of Streamlit in a couple of days:
My application skeleton is pretty simple:
- header with a warning below
- two sections with 1/4 ratio for counter + time-series plot
- two sections with a 1/1 ratio for density maps
Adding header and warning in Streamlit is as simple as possible, simply call st.write
method with markdown inside:
Making containers side-by-side is also pretty straightforward (I’ve factored out the plotting steps into a separate class for readability):
What was not really straightforward is the height alignment of containers. I’ve fixed it via setting the height of figures in both containers manually, but probably there should be a better way of doing this.
For visualizing a single number (sometimes this is called counter or indicator), I’ve used Indicator
class from plotly:
fig = go.Figure(
go.Indicator(
mode="number",
value=cnt,
align="center",
title={"text": "Total pickups"},
)
)
This code works well with st.plotly_chart
and follows the page style (if you switch to light theme it will follow the theme styling).
Since we have the geolocation data for pickups/dropoffs, I was eager to make a density visualization with animation hour-by-hour. Plotly provides a ready-to-use function for such a map — px.density_mapbox
.
The result looks good and can be easily added to the main web application:
I’ve also decided to make something non-trivial —fork the default st.spinner
code to add the spinner from Bootstrap 4. Even with my moderate knowledge of CSS/HTML, it was a pretty simple task:
Something worth mentioning is indeed the scalability of the density map visualization. For the given use-case, when the daily pickups amount is <500k, my laptop was able to build the density map in 3–4 seconds. I’m not sure if such a solution will work scalably for the 1mln+ dataset, so probably for bigger datasets some pre-aggregation / sampling shall be used.
Summary
By leveraging together the visualization power of Streamlit and Databricks SQL endpoint interface, it’s really simple to build a pretty flexible and feature-rich data visualization application in almost pure Python.
There are some topics I haven’t touched in this post, for example:
- Adding ORM layer on top of SQL endpoint
- Testing Streamlit applications with tools such as Selenium
- Improving the performance of queries with optimizations such as Z-Ordering (in the given use-case this wasn’t even required, most of the queries were executed in 2 seconds or less)
I encourage the audience of this blog post to take a look at the source code and find some inspiration for new data visualization applications.