Workbench-managed Databricks Credentials
Posit Workbench has a native integration for Databricks that includes support for managed Databricks OAuth credentials and a dedicated pane in the RStudio Pro IDE.
Workbench-managed Databricks credentials allow you to sign in to a Databricks workspace from the home page and be granted immediate access to data and compute resources using your existing Databricks identity. Managed credentials spare you the burden (and risk) of handling Databricks personal access tokens (PATs) yourself, and they work with the Databricks CLI, most official Databricks packages and SDKs, and database drivers for Python and R. Anything that implements the Databricks client unified authentication standard picks up the ambient credentials supplied by Workbench.
Workbench-managed Databricks credentials are refreshed automatically while your session is active.
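For example, because the Databricks SDK for Python follows the client unified authentication standard, it can discover the ambient Workbench credentials without any explicit configuration. A minimal sketch, assuming the databricks-sdk package is installed in your session:
from databricks.sdk import WorkspaceClient

# No host or token arguments are needed: unified authentication picks up
# the ambient credentials that Workbench supplies to the session.
w = WorkspaceClient()
print(w.current_user.me().user_name)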
Starting a Session with Workbench-managed Databricks credentials
See Starting a Session with Workbench-managed credentials for more information on how to enable Databricks credentials for use in new sessions. If the Databricks selection is not available, then your administrator has not configured the integration.
If your Workbench administrator has enabled managed Databricks credentials, providing your own PATs or host environment variables in a .env, .databrickscfg, .Renviron, or other file may interfere with Workbench-managed credentials and lead to inconsistent behavior. If you want to opt out of managed credentials and instead use your own configuration, do not enable the Databricks Session Credentials button when starting a new session.
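If you are unsure which configuration a session is using, one quick check (a sketch, assuming Workbench sets the DATABRICKS_CONFIG_FILE and DATABRICKS_HOST environment variables as shown in the Python examples later on this page) is to inspect those variables:
import os

# When Workbench-managed Databricks credentials are enabled, these variables
# point at the managed configuration; otherwise you should see your own values, or none.
for var in ("DATABRICKS_CONFIG_FILE", "DATABRICKS_HOST"):
    print(f"{var} = {os.environ.get(var, '<not set>')}")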
Databricks CLI
The Databricks CLI wraps the Databricks REST API and provides access to Databricks account and workspace resources and data. For more information, see the Databricks REST API reference or the Databricks CLI documentation. When you use the CLI inside Posit Workbench with managed Databricks credentials, you do not need to provide any credentials yourself.
# List cluster metadata
databricks clusters get 1234-567890-a12bcde3
Databricks with R
Workbench-managed Databricks credentials have been validated to work with sparklyr, odbc using the Posit Pro Driver for Databricks, and the Databricks R SDK. For more information about using the Databricks pane to manage clusters or sparklyr connections easily, or about the Connections pane in RStudio Pro, reference RStudio Pro: Databricks.
sparklyr version 1.8.4 or higher works with Workbench-managed credentials and modern Databricks Runtimes that make use of Databricks Connect (ML 13+). This is accomplished via reticulate and an additional pysparklyr package that powers sparklyr. For more information and installation instructions, see the sparklyr package documentation.
library(sparklyr)

sc <- spark_connect(
  cluster_id = "[Cluster ID]",
  method = "databricks_connect",
  dbr_version = "14.0"
)
odbc version 1.4.0 or higher provides odbc::databricks(), which automatically handles many common authentication scenarios. Pair odbc with DBI if you want to write your own SQL:
library(DBI)

con <- dbConnect(
  odbc::databricks(),
  httpPath = "value found under Advanced Options > JDBC/ODBC in the Databricks UI"
)

dbGetQuery(con, "
  SELECT passenger_count, AVG(fare_amount) AS avg_fare
  FROM nyctaxi
  WHERE trip_distance > 0.5
  GROUP BY passenger_count
")
Or use dbplyr to generate SQL from your dplyr code:
library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

nyctaxi <- tbl(con, in_catalog("samples", "nyctaxi", "trips"))

nyctaxi |>
  filter(trip_distance > 0.5) |>
  summarise(
    avg_fare = mean(fare_amount),
    .by = passenger_count
  )
The R Databricks SDK can be used for general Databricks operations:
library(databricks)

client <- DatabricksClient()

# List clusters
clustersList(client)[, "cluster_name"]
Databricks with Python
Workbench-managed credentials are expected to work with Python tools that adhere to the Databricks client unified authentication standard. The following have been tested in Posit Workbench: pyodbc, the Databricks VS Code extension, the Databricks CLI, the Databricks SDK for Python, and pyspark via Databricks Connect.
pyodbc has been validated with the Posit Pro Driver for Databricks in VS Code.
import configparser
import os
import pyodbc
import re

# Open the DATABRICKS_CONFIG_FILE to get OAuth token information
config = configparser.ConfigParser()
config.read(os.environ["DATABRICKS_CONFIG_FILE"])

# Replace <table-name> with the name of the database table to query.
table_name = "samples.nyctaxi.trips"

databricks_host = os.environ["DATABRICKS_HOST"]
databricks_workspace_id = "138962681435081"
databricks_cluster_id = "1108-152427-dq9mgl"
databricks_http_path = (
    f"sql/protocolv1/o/{databricks_workspace_id}/{databricks_cluster_id}"
)
databricks_oauth_token = config["workbench"]["token"]

# Set up the connection string for the Databricks instance
connection_string = f"""
Driver = /opt/simba/spark/lib/64/libsparkodbc_sb64.so;
Host = {databricks_host};
HTTPPath = {databricks_http_path};
Port = 443;
Protocol = https;
UID = token;
SparkServerType = 3;
Schema = default;
ThriftTransport = 2;
SSL = 1;
AuthMech = 11;
Auth_Flow = 0;
Auth_AccessToken = {databricks_oauth_token};
"""

# Remove whitespace from the connection string
connection_string = re.sub(r"\s+", "", connection_string)

# Connect to the Databricks instance
conn = pyodbc.connect(connection_string, autocommit=True)
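Once connected, the handle behaves like any other pyodbc connection. As an illustrative follow-up (not part of the original example), you could run a small query against the sample table defined above:
# Run a quick query over the sample table and print the rows
cursor = conn.cursor()
cursor.execute(f"SELECT * FROM {table_name} LIMIT 5")
for row in cursor.fetchall():
    print(row)
cursor.close()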
PySpark is a Python API for working with Apache Spark and can be used with Databricks-managed Spark environments. It supports SQL and DataFrame operations as well as machine learning via MLlib.
from pyspark.sql import SparkSession

spark: SparkSession = spark

print("Hello from Databricks")
spark.sql("select * from samples.nyctaxi.trips").show(3)
The .databrickscfg file that Posit Workbench manages for you has a profile called "workbench" containing the short-lived tokens and other metadata for connecting.
from databricks.connect import DatabricksSession
from databricks.sdk.core import Config

# Optional to specify the profile
config = Config(profile="workbench", cluster_id="1234-abcdef-5678hijk")

spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()

# PySpark code executed on the Databricks cluster
df = spark.read.table("samples.nyctaxi.trips")
df = df.filter(df.trip_distance > 0.5)
df_pd = df.limit(3).toPandas()

# Python code executed on the local machine
print(df_pd)
Python SDK
This brief example prints the available clusters based on your user profile, using the Databricks profile (profile="workbench") that Workbench-managed credentials provide.
from databricks.sdk import WorkspaceClient

# Connect to the Databricks instance
# Optional to specify the profile
w = WorkspaceClient(profile="workbench")

# Retrieve the list of clusters
clusters = w.clusters.list()

# Print the names of the clusters
for c in clusters:
    print(c.cluster_name)