Useful patterns for Apache Spark application development with Scala

Ivan Trusov
4 min read · Jan 26, 2023


Ju-Jitsu by David Bomberg (image credit: WikiArt)

Scala is an incredibly flexible programming language that enables software engineers to build powerful and easily extensible applications across various domains.

Since a large chunk of Apache Spark is written in Scala, it’s also quite natural to write Spark applications in Scala to fully utilize its type system.

Although Scala is well-known for its functional libraries (and code written in a purely functional, monadic style might look unusual to some developers), it also allows users to write in an object-oriented style, with minor additions of functional programming where needed.

In this blog post, I want to demonstrate some common usage patterns for writing Spark Scala applications in a concise, object-oriented style.

Using HOCON-based configurations

HOCON is a very flexible configuration language, actively used in many Scala-based projects. It supports a wide range of additions on top of plain JSON; the ones I find most notable are:

  • Environment variable substitution
  • In-place variable substitution (including nested configs)
  • Comments
  • Duration and size-like properties, e.g. “10 seconds”

Here is a simple example of writing and reading such a config:

Example application.conf file
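A minimal sketch of such a file (the key names, topic, and paths below are purely illustrative):

```hocon
demo {
  kafka {
    # substituted from an environment variable at resolution time
    bootstrapServers = ${KAFKA_BOOTSTRAP_SERVERS}
    topic = "demo-events"
  }
  # duration-like property
  triggerInterval = "10 seconds"
  outputPath = "/tmp/demos/bronze"
}
```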

Reading the config is trivial if we want a standard untyped Config object:
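A minimal sketch with the Typesafe config library (key names follow the illustrative file above):

```scala
import com.typesafe.config.{Config, ConfigFactory}

object ConfigReadingExample {
  // load and resolve application.conf from the classpath
  val config: Config = ConfigFactory.load()

  val topic: String = config.getString("demo.kafka.topic")
  // duration-like values are parsed into java.time.Duration
  val trigger: java.time.Duration = config.getDuration("demo.triggerInterval")
}
```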

A nice feature of HOCON is the ability to share configs across various applications. For example, if you need to configure a chain of tasks where the inputs of one task depend on the output configuration of the previous one, HOCON can easily resolve such cases, e.g.:
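A small sketch of such a chain (task and path names are illustrative):

```hocon
ingest {
  outputPath = "/tmp/demos/bronze"
}

transform {
  # the input of this task is resolved from the output of the previous one
  inputPath = ${ingest.outputPath}
  outputPath = "/tmp/demos/silver"
}
```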

Another important point about HOCON is that it works nicely with tests. Later in this post, we’ll see how testing properties can be injected into HOCON-based definitions.

Simplifying tests with testcontainers-scala

As you can see in the configuration, we’re going to write an application that reads data from a Kafka topic. Therefore, we need to either mock a Kafka stream or provide one in our local environment for proper local tests. Here we can use a great JVM-based tool called testcontainers, specifically the extension that makes this library easy to use in Scala.

By using scalatest and testcontainers, developers can easily set up a proper environment for local testing:
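A minimal sketch with testcontainers-scala and scalatest (the suite name and assertion are illustrative):

```scala
import com.dimafeng.testcontainers.KafkaContainer
import com.dimafeng.testcontainers.scalatest.TestContainerForAll
import org.scalatest.funsuite.AnyFunSuite

class KafkaIngestTest extends AnyFunSuite with TestContainerForAll {

  // a dockerized Kafka broker is started once for the whole suite
  override val containerDef: KafkaContainer.Def = KafkaContainer.Def()

  test("the local Kafka broker is reachable") {
    withContainers { kafka =>
      // bootstrapServers points at the broker started by testcontainers
      assert(kafka.bootstrapServers.nonEmpty)
    }
  }
}
```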

Here is another useful example — this trait provides support for SparkSession with Delta Lake enabled for test cases:
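A sketch of such a trait (the trait name is an assumption; the two config options are the standard ones for enabling Delta Lake on a local session):

```scala
import org.apache.spark.sql.SparkSession

trait SparkSupport {
  // a local session with Delta Lake extensions, shared across test cases
  implicit lazy val spark: SparkSession = SparkSession
    .builder()
    .master("local[*]")
    .appName("test-session")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
}
```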

An attentive reader has most certainly noticed that the spark object is introduced as an implicit here. Therefore, let’s talk a bit about the application layout that makes the setup easier and more concise from a programming perspective.

The implicit for spark is used for the following reasons:

  • It simplifies access to the SparkSession instance across the codebase
  • Easy injection of this instance across the testing codebase

Following this idea, it’s simple to introduce a package-level object which provides access to the SparkSession:
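A sketch of such a package object for the net.renarde.demos.apps package mentioned below:

```scala
package net.renarde.demos

import org.apache.spark.sql.SparkSession

package object apps {
  // any app in this package picks up the session implicitly;
  // in tests, getOrCreate returns the already-initialized local session
  implicit lazy val spark: SparkSession = SparkSession
    .builder()
    .appName("demo-app")
    .getOrCreate()
}
```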

Now any application inside net.renarde.demos.apps will have access to the SparkSession instance, and in testing environments a proper local session will be initialized.

Generic app trait and reading the configs

Since every application requires a config and a logger instance, a generic app trait can be introduced:
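A sketch of such a trait, together with a minimal stand-in for the ConfigProvider mentioned below (the member names are assumptions):

```scala
import com.typesafe.config.{Config, ConfigFactory}
import org.slf4j.{Logger, LoggerFactory}

// a minimal stand-in for the ConfigProvider utilities
trait ConfigProvider {
  // reads and resolves application.conf; tests can override this
  def getGlobalConfig: Config = ConfigFactory.load()
}

trait GenericApp extends ConfigProvider {
  // each application declares its own name, used to scope its config section
  val appName: String

  lazy val logger: Logger = LoggerFactory.getLogger(this.getClass)
  lazy val config: Config = getGlobalConfig.getConfig(appName)
}
```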

N.B. I’ve omitted the ConfigProvider code for simplicity; take a look at the utils package for more details.

This trait is simple to re-use in various applications, e.g.:
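For instance, a small application object reusing the trait (names are illustrative):

```scala
object DemoApp extends GenericApp with App {
  override val appName: String = "demo"

  logger.info(s"Starting application $appName")
  logger.info(s"Resolved config: $config")
  // application logic goes here
}
```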

This way, we avoid rewriting the boilerplate code that reads configs and initializes the logging and SparkSession instances, providing a smooth and fast start when developing a new application.

Let’s now take a look at the usage of configs. In my example, the KAFKA_BOOTSTRAP_SERVERS environment variable has to be provided to properly resolve the configuration object. Inside tests, this can easily be done by overriding the getGlobalConfig function in the ConfigProvider:
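Building on the ConfigProvider stand-in sketched earlier, a test-side override could look like this (the kafkaBootstrapServers member, filled from the running container, is an assumption):

```scala
import com.typesafe.config.{Config, ConfigFactory}

trait TestConfigProvider extends ConfigProvider {
  // supplied by the test suite, e.g. from the testcontainers Kafka instance
  def kafkaBootstrapServers: String

  // test properties take precedence, application.conf is used as a fallback,
  // and substitutions like ${KAFKA_BOOTSTRAP_SERVERS} are resolved afterwards
  override def getGlobalConfig: Config =
    ConfigFactory
      .parseString(s"KAFKA_BOOTSTRAP_SERVERS = \"$kafkaBootstrapServers\"")
      .withFallback(ConfigFactory.parseResources("application.conf"))
      .resolve()
}
```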

Instead of reading directly from application.conf (as would happen in a production launch), the configuration provider function is overridden with properties coming dynamically from the test environment configuration, which are then resolved with a fallback to the main configuration. This approach allows dynamic overriding and injection of properties for proper unit testing.

Some examples and links

If you know of other useful patterns for Spark Scala application development, feel free to share them in the comments!

Also, hit subscribe if you liked the post — it keeps the author motivated to write more.


Ivan Trusov

Senior Specialist Solutions Architect @ Databricks. All opinions are my own.