Publish a Quarto Report that Accesses Databricks
This guide shows you how to create and publish a simple Quarto report that accesses the sample NYC Taxi Trip Duration information from a SQL Warehouse hosted in Databricks.
Prerequisites
Before you begin, you must:
- Have a compatible version of Python and Quarto installed on the Posit Connect Server
- Have the public URL of the Connect server
- Have a valid Connect Publisher or Administrator account
- Have a working Python and Quarto versions installed on the local development environment
- Access to a Databricks account
- Have access to the NYC Taxi Trip Duration sample data set in Databricks
- Have installed the following Python packages:
For more information about requesting permission to publish your content, please see the Publishing section of the User Guide.
Step 1. Create a Quarto project
From a terminal session:
Create a new Quarto project
quarto create project default fare_by_distance
Replace the contents of Fare_by_distance.qmd
with the following code:
---
title: "Fare by Distance"
format:
html:
code-fold: true
jupyter: python3
---
## Databricks
Demonstration of Quarto content accessing a SQL Warehouse on Databricks using Python.
## Data Analysis
The chart below shows fare prices by distance for taxi rides, color coded by day of the week when the trip took place.
```{python}
#| echo: false
#| warning: false
from databricks import sql
import plotly.express as px
import pandas as pd
import os
query = """
SELECT
T.weekday,
CASE
WHEN T.weekday = 1 THEN 'Sunday'
WHEN T.weekday = 2 THEN 'Monday'
WHEN T.weekday = 3 THEN 'Tuesday'
WHEN T.weekday = 4 THEN 'Wednesday'
WHEN T.weekday = 5 THEN 'Thursday'
WHEN T.weekday = 6 THEN 'Friday'
WHEN T.weekday = 7 THEN 'Saturday'
ELSE 'N/A'
END AS day_of_week,
T.fare_amount,
T.trip_distance
FROM
(
SELECT
dayofweek(tpep_pickup_datetime) as weekday,
*
FROM
samples.nyctaxi.trips
) T
LIMIT 500;
"""
# Values defined in _environment
host = os.getenv("DATABRICKS_HOST")
path = os.getenv("DATABRICKS_PATH")
token = os.getenv("DATABRICKS_TOKEN")
with sql.connect(server_hostname = host,
http_path = path,
access_token = token) as connection:
with connection.cursor() as cursor:
tmp = cursor.execute(query)
df = pd.DataFrame(cursor.fetchall(), columns = ["weekday", "day_of_week", "fare_amount", "trip_distance"])
fig = px.scatter(
df,
x = "trip_distance",
y = "fare_amount",
color = "day_of_week",
title = "Fare by Distance",
labels = {
"fare_amount": "Fare amount (USD)",
"trip_distance": "Trip distance (miles)",
"day_of_week": "Day of the week"
}
)
fig.show()
```
To avoid warnings and package conflicts during deployment, it is recommended to create a requirements.txt
file in the project folder with the following contents:
databricks-sql-connector==3.0.1
jupyter==1.0.0 plotly==5.18.0
Step 2. Configure your local environment variables
In order to configure the report to access Databricks resources, you need to gather a few values from your account:
- Login to your Databricks account
- Navigate to SQL Warehouse > {your-warehouse} > Connection Details
- Use
Server hostname
forDATABRICKS_HOST
- Use
HTTP path
forDATABRICKS_PATH
- Use an existing Access Token for
DATABRICKS_TOKEN
. If you need to create a token, follow the instructions in the Databricks documentation
Create an _environment
file in your project in the same folder as the _quarto.yml
file containing these lines:
DATABRICKS_HOST='{server-hostname}'
DATABRICKS_PATH='{http-path}' DATABRICKS_TOKEN='{access-token}'
Step 3. Render the report locally
Check that all python dependencies are installed:
pip install -r requirements.txt
Make sure everything is working locally before publishing the report to Posit Connect. From inside the project folder, run:
quarto preview Fare_by_distance.qmd
This may take a few minutes to complete.
Step 4. Publish the Quarto report to Connect
Since the report relies on environment variables to provide the Databricks connection values, use rsconnect-python
to pass them to Connect during deployment, as documented in the Environment Variables section of the documentation.
Quarto knows how to read the content of the _environment
file, but the variables also need to be available on your terminal session so rsconnect-python
can access them. The commands below are one way to accomplish this:
set -o allexport
source _environment
set +o allexport
Use a method that is appropriate for your environment and operating system.
Validate that rsconnect-python
is installed:
rsconnect version
Or install it as documented in the installation section of the documentation.
To publish, make sure CONNECT_SERVER
and API_KEY
have valid values. Then, on a terminal session, enter the following command:
rsconnect deploy quarto . \
--server $CONNECT_SERVER \
--api-key $API_KEY \
--environment DATABRICKS_TOKEN \
--environment DATABRICKS_PATH \
--environment DATABRICKS_HOST
Note that the Databricks environment variables do not need to be resolved by the shell, so they do not include the $
prefix.
The Databricks environment variables only need to be set once, unless a change needs to be made. If the values have not changed, you don’t need to provide them again when you publish updates to the document.
rsconnect deploy quarto . \
--server $CONNECT_SERVER \
--api-key $API_KEY
Once the report is published, you can view the rendered version on Connect.
For more information on how to deploy Quarto content, see the Quarto documentation