Pub/Sub Lite as a source with Spark Structured Streaming on Databricks

Tianzi Cai
Google Cloud - Community
4 min read · Sep 24, 2021

I recently tried a streaming workload of real-time taxi ride data using the Spark connector for Pub/Sub Lite on Databricks Community Edition (free). The connector was easy to set up, and I had my PySpark pipeline up and running in almost no time. In this article, I will share my step-by-step guide and call out some features of interest for new Pub/Sub Lite users.

Databricks Community Edition

After signing up, familiarize yourself with the left navigation pane. Then:

  1. Create a cluster.

    - For Databricks Runtime Version, pick Runtime: 6.4 Extended Support (Scala 2.11, Spark 2.4.5).
    - Wait until the cluster is ready.
    - Go to Libraries and click Install New.
    - With Library Source Upload and Library Type Jar selected, drop in the connector’s uber jar with dependencies (downloadable from Google Cloud Storage at gs://spark-lib/pubsublite/pubsublite-spark-sql-streaming-LATEST-with-dependencies.jar or from Maven Central).
  2. Create a notebook.

    Default Language can be Python or Scala.

Deployed Databricks on GCP

Alternatively, head over to the GCP Marketplace, search for Databricks, and sign up for a 14-day free trial. Then follow these steps to complete the setup.

  1. Create a workspace.
  2. Create a cluster. Use the same cluster configuration as above and install the connector’s jar.
  3. Create a notebook.

Taxi Rides > Pub/Sub Lite > Spark > Console

Pub/Sub Lite as an input source for various kinds of data streams

I introduced a convenient way to set up a real-time data stream of taxi rides using a Dataflow Flex Template in my other article. For the demo below, I used a custom template to set up a Dataflow pipeline that publishes messages to a Pub/Sub Lite topic. I let the pipeline run for about 15 minutes while I experimented with some simple data transforms on the real-time taxi rides data stream.

1. Import the required modules.
2. Provide my GCP project number, Pub/Sub Lite subscription name, and location (both steps are sketched below).
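
Here is a minimal sketch of steps 1 and 2. The project number, zone, and subscription ID below are placeholders; swap in your own values.

```python
from base64 import b64encode

from pyspark.sql.functions import col
from pyspark.sql.types import StringType

# Placeholder values -- replace with your own project, zone, and subscription.
project_number = 123456789012
location = "us-central1-a"
subscription_id = "taxi-rides-lite-subscription"

subscription_path = (
    f"projects/{project_number}/locations/{location}/subscriptions/{subscription_id}"
)
```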

Because I was using the Databricks Community Edition, I had to upload the key file for my GCP service account (which has the pubsublite.subscriber IAM role) to the Databricks file system. You can skip this step if you are using the deployed Databricks edition, because the credentials should already have been configured when you set up your account and created the workspace.

3. Load my service account key. This allows Databricks to authenticate with GCP and use Pub/Sub Lite.
4. Connect to Pub/Sub Lite using the readStream API.
5. Peek into the data field of the streaming output (steps 3–5 are sketched below).
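
A sketch of steps 3 through 5, continuing from the cell above. The DBFS path is hypothetical (use wherever you uploaded your key file), and the base64-encoded key is passed to the connector through its gcp.credentials.key option; display() is the Databricks notebook helper for rendering streaming DataFrames.

```python
# Step 3: read the service account key uploaded to DBFS and base64-encode it
# so it can be passed to the connector as an option.
with open("/dbfs/FileStore/tables/my-key.json", "rb") as f:
    credentials = b64encode(f.read()).decode("utf-8")

# Step 4: connect to Pub/Sub Lite with the readStream API.
sdf = (
    spark.readStream.format("pubsublite")
    .option("pubsublite.subscription", subscription_path)
    .option("gcp.credentials.key", credentials)
    .load()
)

# Step 5: the message payload arrives as bytes in the `data` column,
# so cast it to a string before peeking at it.
taxi_sdf = sdf.withColumn("data", col("data").cast(StringType()))
display(taxi_sdf.select("data"))
```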

Notice that the total count for "passenger_count": 3 increases from 4,627 to 5,295; hovering the cursor over a bar in the chart shows its current count. The live chart updates every 5 seconds or so.

6. Watch passenger_count update in real time (a sketch of this aggregation follows).
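
One way to build the live passenger_count view, assuming the payload is the JSON produced by the taxi rides stream (with a top-level passenger_count field): parse that field, run a streaming aggregation, and let display() refresh the bar chart as new messages arrive.

```python
from pyspark.sql.functions import get_json_object

# Extract passenger_count from the JSON payload and count messages per value.
passenger_counts = (
    taxi_sdf
    .withColumn(
        "passenger_count",
        get_json_object(col("data"), "$.passenger_count").cast("integer"),
    )
    .groupBy("passenger_count")
    .count()
)

# display() keeps updating the aggregation (and its bar chart) in real time.
display(passenger_counts)
```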

The Pub/Sub Lite topic I used here has 2 partitions, each with 30 GiB of storage. It shares publishing and subscribing throughput capacity with my other Pub/Sub Lite topics and subscriptions in the same cloud region and GCP project. I set my throughput capacity to 4 units when I created my Pub/Sub Lite reservation. This means that the total publisher throughput across all my topics cannot exceed 4 MiB/s and the total subscriber throughput cannot exceed 16 MiB/s. Because my other topics weren’t active when I was testing the taxi rides data stream, the provisioned capacity was more than enough.

Dataflow Job Metrics show taxi rides data going into the Dataflow pipeline at less than 400 KiB/s

Pub/Sub Lite supports seeking to the beginning or the end of the message backlog, as well as to a publish or event timestamp. However, the connector does not yet support seek, which means you cannot yet restart a data stream from a specific point in time, whether a publish timestamp or an event timestamp. If you would find this feature useful, or there is any other feature you’d like to see, leave a comment or question here or at Google Cloud Community. You can also visit Buganizer or GitHub to formally file a public feature request.

Thank you for reading!
