Publish a Quarto Report that Accesses Databricks

This guide shows you how to create and publish a simple Quarto report that accesses the sample NYC taxi trip data from a SQL Warehouse hosted in Databricks.

Prerequisites

Before you begin, you must have:

  • Access to a Databricks workspace with a SQL Warehouse and the samples.nyctaxi.trips sample data
  • Quarto and Python (with pip) installed locally
  • Permission to publish content to Posit Connect

For more information about requesting permission to publish your content, see the Publishing section of the User Guide.

Step 1. Create a Quarto project

From a terminal session:

Create a new Quarto project:

quarto create project default fare_by_distance

This creates a fare_by_distance folder that contains a _quarto.yml file and a fare_by_distance.qmd file. Replace the contents of fare_by_distance.qmd with the following code:

---
title: "Fare by Distance"
format:
  html:
    code-fold: true
jupyter: python3
---

## Databricks

Demonstration of Quarto content accessing a SQL Warehouse on Databricks using Python.

## Data Analysis

The chart below shows fare prices by distance for taxi rides, color-coded by the day of the week on which the trip took place.

```{python}
#| echo: false
#| warning: false

from databricks import sql
import plotly.express as px
import pandas as pd
import os

query = """
SELECT
  T.weekday,
  CASE
    WHEN T.weekday = 1 THEN 'Sunday'
    WHEN T.weekday = 2 THEN 'Monday'
    WHEN T.weekday = 3 THEN 'Tuesday'
    WHEN T.weekday = 4 THEN 'Wednesday'
    WHEN T.weekday = 5 THEN 'Thursday'
    WHEN T.weekday = 6 THEN 'Friday'
    WHEN T.weekday = 7 THEN 'Saturday'
    ELSE 'N/A'
  END AS day_of_week,
  T.fare_amount,
  T.trip_distance
FROM
  (
    SELECT
      dayofweek(tpep_pickup_datetime) as weekday,
      *
    FROM
      samples.nyctaxi.trips
  ) T
LIMIT 500;
"""

# Values defined in _environment
host = os.getenv("DATABRICKS_HOST")
path = os.getenv("DATABRICKS_PATH")
token = os.getenv("DATABRICKS_TOKEN")

with sql.connect(server_hostname = host,
                 http_path       = path,
                 access_token    = token) as connection:

  with connection.cursor() as cursor:
    cursor.execute(query)
    df = pd.DataFrame(cursor.fetchall(), columns = ["weekday", "day_of_week", "fare_amount", "trip_distance"])

fig = px.scatter(
  df,
  x = "trip_distance",
  y = "fare_amount",
  color = "day_of_week",
  title = "Fare by Distance",
  labels = {
    "fare_amount": "Fare amount (USD)",
    "trip_distance": "Trip distance (miles)",
    "day_of_week": "Day of the week"
  }
)
fig.show()
```
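
If you prefer not to hard-code the column names when building the DataFrame, the databricks-sql-connector cursor follows PEP 249 and exposes the result schema through cursor.description. The snippet below is a minimal sketch of that variation; it is not part of the tutorial code and assumes the same environment variables and the samples.nyctaxi.trips table:

from databricks import sql
import pandas as pd
import os

query = "SELECT fare_amount, trip_distance FROM samples.nyctaxi.trips LIMIT 5"

with sql.connect(server_hostname = os.getenv("DATABRICKS_HOST"),
                 http_path       = os.getenv("DATABRICKS_PATH"),
                 access_token    = os.getenv("DATABRICKS_TOKEN")) as connection:

  with connection.cursor() as cursor:
    cursor.execute(query)
    # PEP 249: the first field of each cursor.description entry is the column name
    columns = [col[0] for col in cursor.description]
    df = pd.DataFrame(cursor.fetchall(), columns = columns)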

To avoid warnings and package conflicts during deployment, create a requirements.txt file in the project folder with the following contents:

databricks-sql-connector==3.0.1
jupyter==1.0.0
plotly==5.18.0

Step 2. Configure your local environment variables

To configure the report to access Databricks resources, gather a few values from your account:

  • Log in to your Databricks account
  • Navigate to SQL Warehouse > {your-warehouse} > Connection Details
  • Use Server hostname for DATABRICKS_HOST
  • Use HTTP path for DATABRICKS_PATH
  • Use an existing access token for DATABRICKS_TOKEN. If you need to create a token, follow the instructions in the Databricks documentation.

Create an _environment file in your project, in the same folder as the _quarto.yml file, containing these lines:

DATABRICKS_HOST='{server-hostname}'
DATABRICKS_PATH='{http-path}'
DATABRICKS_TOKEN='{access-token}'
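
For reference, a completed _environment file looks similar to the following. These are hypothetical example values; substitute the Server hostname, HTTP path, and access token you gathered above:

DATABRICKS_HOST='dbc-a1b2c3d4-e5f6.cloud.databricks.com'
DATABRICKS_PATH='/sql/1.0/warehouses/abcdef1234567890'
DATABRICKS_TOKEN='dapi0123456789abcdef'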

Step 3. Render the report locally

Check that all Python dependencies are installed:

pip install -r requirements.txt

Make sure everything is working locally before publishing the report to Posit Connect. From inside the project folder, run:

quarto preview fare_by_distance.qmd

This may take a few minutes to complete.

Step 4. Publish the Quarto report to Connect

Since the report relies on environment variables to provide the Databricks connection values, use rsconnect-python to pass them to Connect during deployment, as documented in the Environment Variables section of the documentation.

Quarto knows how to read the contents of the _environment file, but the variables also need to be available in your terminal session so that rsconnect-python can access them. The commands below are one way to accomplish this:

set -o allexport
source _environment
set +o allexport

Use a method that is appropriate for your environment and operating system.
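
To confirm that the variables are visible to new processes, the way rsconnect-python will see them, you can run this optional Python sketch (the variable names match those used above):

import os

# Report which Databricks variables are set, without printing their values
for name in ("DATABRICKS_HOST", "DATABRICKS_PATH", "DATABRICKS_TOKEN"):
  print(name, "is set" if os.getenv(name) else "is MISSING")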

Validate that rsconnect-python is installed:

rsconnect version

Or install it as documented in the installation section of the documentation.

To publish, make sure CONNECT_SERVER and API_KEY have valid values. Then, in a terminal session, enter the following command:

rsconnect deploy quarto . \
  --server $CONNECT_SERVER \
  --api-key $API_KEY \
  --environment DATABRICKS_TOKEN \
  --environment DATABRICKS_PATH \
  --environment DATABRICKS_HOST

Note that the Databricks environment variables are passed by name rather than resolved by the shell, so they do not include the $ prefix.

The Databricks environment variables only need to be set once, unless a value changes. If the values have not changed, you don’t need to provide them again when you publish updates to the document:

rsconnect deploy quarto . \
  --server $CONNECT_SERVER \
  --api-key $API_KEY

Once the report is published, you can view the rendered version on Connect.

For more information on how to deploy Quarto content, see the Quarto documentation.