Workbench-managed Databricks Credentials


Posit Workbench offers a native integration for Databricks that supports managed Databricks OAuth credentials and includes a dedicated pane in the RStudio Pro IDE.

With Workbench-managed Databricks credentials, you can sign into a Databricks workspace from the home page and gain immediate access to data and compute resources using your existing Databricks identity.

Managed credentials eliminate the burden and risk of managing Databricks personal access tokens (PATs) yourself. They work seamlessly with the Databricks CLI, most official packages and SDKs, and database drivers for Python and R. Any tool that implements the Databricks client unified authentication standard can use the ambient credentials supplied by Workbench.
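For example, when a session is started with managed credentials, the ambient configuration is surfaced through environment variables such as DATABRICKS_HOST and DATABRICKS_CONFIG_FILE (both used in the examples later in this page). A quick way to confirm from Python that they are present:

import os

# Populated by Workbench when managed Databricks credentials are enabled
print(os.environ.get("DATABRICKS_HOST"))
print(os.environ.get("DATABRICKS_CONFIG_FILE"))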

Workbench-managed Databricks credentials refresh automatically while your session remains active.

Starting a Session with Workbench-managed Databricks credentials

See Starting a Session with Workbench-managed credentials for more information on enabling Databricks credentials for use in new sessions. If the Databricks selection is not available, then your administrator has not configured the integration.

Avoid using both Workbench-managed credentials and PATs

If your Workbench administrator has enabled managed Databricks credentials, supplying your own PATs or host environment variables in files such as .env, .databrickscfg, or .Renviron may interfere with Workbench-managed credentials and lead to inconsistent behavior. If you want to opt out of managed credentials and use your own configuration instead, leave the Databricks Session Credentials button disabled when starting a new session.
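If you do opt out, a manually managed configuration typically lives in one of those files. A minimal sketch of a ~/.databrickscfg that uses a PAT (the host and token values are placeholders for your own workspace URL and token):

[DEFAULT]
host  = https://my-workspace.cloud.databricks.com
token = <your personal access token>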

Databricks CLI

The Databricks CLI wraps the Databricks REST API and provides access to Databricks account and workspace resources and data. You don’t need to provide Databricks credentials when using the CLI with managed Databricks credentials inside Posit Workbench.

# List cluster metadata
databricks clusters get 1234-567890-a12bcde3
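Other commands use the same ambient credentials. For example, to list all clusters in the workspace:

# List all clusters in the workspace
databricks clusters list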

For more information, reference the Databricks REST API reference or the Databricks CLI documentation.

Databricks with R

We validated that Workbench-managed Databricks credentials work with sparklyr, odbc using the Posit Professional Driver for Databricks, and the Databricks R SDK. For more information about using the Databricks pane to manage clusters or sparklyr connections easily, or about the Connections pane in RStudio Pro, reference the RStudio Pro: Databricks documentation.

sparklyr version 1.8.4 or higher is compatible with Workbench-managed credentials and modern Databricks Runtimes that make use of Databricks Connect (ML 13+).

This compatibility is achieved via reticulate and the pysparklyr package, which extends sparklyr. For more information and installation instructions, see the sparklyr package documentation.

library(sparklyr)

# Connect to a Databricks cluster with Databricks Connect;
# replace [Cluster ID] with the ID of your cluster
sc <- spark_connect(
  cluster_id = "[Cluster ID]",
  method = "databricks_connect",
  dbr_version = "14.0"
)

odbc version 1.4.0 or higher provides odbc::databricks(), which automatically handles many common authentication scenarios. Pair odbc with DBI to write your own SQL:

library(DBI)

con <- dbConnect(
  odbc::databricks(),
  httpPath = "value found under ⁠Advanced Options > JDBC/ODBC⁠ in the Databricks UI"
)

dbGetQuery(con, "
  SELECT passenger_count, AVG(fare_amount) AS avg_fare 
  FROM nyctaxi 
  WHERE trip_distance > 0.5
  GROUP BY passenger_count
")

Or use dbplyr to generate SQL from your dplyr code:

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

nyctaxi <- tbl(con, in_catalog("samples", "nyctaxi", "trips"))
nyctaxi |> 
   filter(trip_distance > 0.5) |> 
   summarise(
     avg_fare = mean(fare_amount), 
     .by = passenger_count
   )

The R Databricks SDK can be used for general Databricks operations:

library(databricks)

client <- DatabricksClient()
# List clusters
clustersList(client)[, "cluster_name"]

Databricks with Python

We expect Workbench-managed credentials to work with Python tools that adhere to the Databricks client unified authentication standard. We have tested the following in Posit Workbench: pyodbc, Databricks VS Code extension, Databricks CLI, Databricks SDK for Python, and pyspark via Databricks Connect.

pyodbc has been validated with the Posit Professional Driver for Databricks in VS Code.

import configparser
import os
import pyodbc
import re

# open the DATABRICKS_CONFIG_FILE to get oauth token information
config = configparser.ConfigParser()
config.read(os.environ["DATABRICKS_CONFIG_FILE"])

# Name of the table to query; replace with your own table if needed.
table_name = "samples.nyctaxi.trips"

databricks_host = os.environ["DATABRICKS_HOST"]
# Replace the workspace and cluster IDs with your own values
databricks_workspace_id = "138962681435081"
databricks_cluster_id = "1108-152427-dq9mgl"
databricks_http_path = (
    f"sql/protocolv1/o/{databricks_workspace_id}/{databricks_cluster_id}"
)
databricks_oauth_token = config["workbench"]["token"]

# Setup the connection string for the Databricks instance
connection_string = f"""
   Driver           = /opt/simba/spark/lib/64/libsparkodbc_sb64.so;
   Host             = {databricks_host};
   HTTPPath         = {databricks_http_path};
   Port             = 443;
   Protocol         = https;
   UID              = token;
   SparkServerType  = 3;
   Schema           = default;
   ThriftTransport  = 2;
   SSL              = 1;
   AuthMech         = 11;
   Auth_Flow        = 0;
   Auth_AccessToken = {databricks_oauth_token};
"""
# remove spaces from connection string
connection_string = re.sub(r"\s+", "", connection_string)

# connect to the Databricks instance
conn = pyodbc.connect(connection_string, autocommit=True)
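With the connection open, the table referenced by table_name can be queried through the standard pyodbc cursor interface (a brief sketch):

# Run a small query against the sample table and print the results
cursor = conn.cursor()
cursor.execute(f"SELECT * FROM {table_name} LIMIT 5")
for row in cursor.fetchall():
    print(row)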

PySpark is the Python API for working with Apache Spark; you can use it with Databricks-managed Spark environments to perform SQL or DataFrame operations and implement machine learning algorithms via MLlib.

from pyspark.sql import SparkSession

# `spark` is assumed to already exist in the environment, for example when this
# file is run through the Databricks VS Code extension or inside a Databricks
# notebook; the annotation only gives the IDE type information. To create a
# session yourself, use Databricks Connect as shown in the next example.
spark: SparkSession = spark

print("Hello from Databricks")
spark.sql("select * from samples.nyctaxi.trips").show(3)

The Workbench-managed .databrickscfg file has a profile called workbench containing the short-lived tokens and other metadata for connecting.
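Roughly, the profile has the shape below; the exact fields and values are written and rotated by Workbench, so treat this only as an illustration (the token shown is a placeholder):

[workbench]
host  = https://my-workspace.cloud.databricks.com
token = <short-lived OAuth token managed by Workbench>

With Databricks Connect, reference this profile when building a Spark session: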

from databricks.connect import DatabricksSession
from databricks.sdk.core import Config

# Specifying the workbench profile is optional
config = Config(profile="workbench", cluster_id="1234-abcdef-5678hijk")

spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()

# PySpark code executed on Databricks Cluster
df = spark.read.table("samples.nyctaxi.trips")
df = df.filter(df.trip_distance > 0.5)
df_pd = df.limit(3).toPandas()

# Python code executed on local machine
print(df_pd)

Python SDK

Using the workbench profile (profile="workbench") that Workbench-managed credentials provide, this example lists the clusters available to you:

from databricks.sdk import WorkspaceClient

# Connect to the Databricks instance
# Specifying the workbench profile is optional
w = WorkspaceClient(profile="workbench")

# Retrieve the list of clusters
clusters = w.clusters.list()

# Print the names of the clusters
for c in clusters:
    print(c.cluster_name)