Curated Data

Chronicle provides curated datasets that simplify common reporting and analysis tasks. These datasets are pre-processed, deduplicated, and optimized for efficient querying.

Curated data storage location

Curated datasets are stored in the following locations:

  • By default: /var/lib/posit-chronicle/data/curated/v2/{product}/{dataset}
  • Custom local storage: {[LocalStorage].Location}/curated/v2/{product}/{dataset}
  • S3 storage: {[S3Storage].Location/Prefix}/curated/v2/{product}/{dataset}

Where:

  • {product} is either connect or workbench
  • {dataset} is the dataset name (e.g., user_list, user_totals)
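
The path templates above can be assembled programmatically. The following is a minimal sketch for the default and custom local storage cases; the helper name curated_dataset_path and its parameters are hypothetical and not part of Chronicle.

```python
from pathlib import PurePosixPath

# Default storage root taken from the list above. Replace it with your
# [LocalStorage] Location if you configured a custom one.
DEFAULT_ROOT = "/var/lib/posit-chronicle/data"

VALID_PRODUCTS = {"connect", "workbench"}


def curated_dataset_path(product: str, dataset: str, root: str = DEFAULT_ROOT) -> str:
    """Build {root}/curated/v2/{product}/{dataset} (hypothetical helper)."""
    if product not in VALID_PRODUCTS:
        raise ValueError(f"product must be one of {sorted(VALID_PRODUCTS)}")
    return str(PurePosixPath(root) / "curated" / "v2" / product / dataset)


print(curated_dataset_path("connect", "user_totals"))
```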

Reading curated data

The chronicle.reports R package provides simple functions to read curated data.

Curated data is stored in Apache Parquet format with Hive-style date partitioning and can be read using:

  • R: Use the arrow package with the open_dataset() function
  • Python: Use the pandas, pyarrow, or polars packages
  • DuckDB: Query directly with SQL
  • Any Parquet-compatible tool

The date partition is automatically available as a column when reading with tools that support hive-style partitioned datasets (like the {arrow} function open_dataset()).
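
To illustrate what Hive-style partitioning means here: the partition value lives in the directory name (date=YYYY-MM-DD) rather than inside the Parquet file, and partition-aware readers turn it into a column. The sketch below shows that mapping using only the Python standard library; the function name partition_date is hypothetical.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_date(parquet_path: str) -> date:
    """Extract the date from a .../date=YYYY-MM-DD/... path component."""
    for part in PurePosixPath(parquet_path).parts:
        if part.startswith("date="):
            return date.fromisoformat(part[len("date="):])
    raise ValueError(f"no date= partition in {parquet_path!r}")


path = "curated/v2/connect/user_list/date=2025-01-15/chronicle-data.parquet"
print(partition_date(path))  # 2025-01-15
```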

Example: Reading curated data in R

library(arrow)
library(dplyr)  # provides filter() and collect()

# Read user totals with automatic date column
user_totals <- open_dataset("/var/lib/posit-chronicle/data/curated/v2/connect/user_totals")

# Filter by date range
recent_totals <- user_totals |>
  filter(date >= as.Date("2025-01-01")) |>
  collect()

Example: Reading curated data in Python

import pyarrow.dataset as ds

# Open the content list; partitioning="hive" exposes the date=... directories as a column
content_list = ds.dataset(
    "/var/lib/posit-chronicle/data/curated/v2/connect/content_list",
    format="parquet",
    partitioning="hive",
)

# Filter by date, then convert to a pandas DataFrame
df = content_list.to_table(filter=ds.field("date") >= "2025-01-01").to_pandas()

Automatic backfilling

Chronicle automatically backfills curated datasets for historical dates after an upgrade from Chronicle 2025.08 or earlier.

The backfill process runs in the background after server startup and processes dates in reverse chronological order (most recent first). The server tracks backfill progress in {storage}/upgrade/curation-backfill-state.json. This ensures curated datasets are available for all historical data after upgrading Chronicle.
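
The processing order described above can be pictured with a short sketch. This only illustrates "most recent first"; it is not Chronicle's implementation, and the function name backfill_order is hypothetical.

```python
from datetime import date, timedelta


def backfill_order(first_day: date, last_day: date) -> list[date]:
    """Return every date in [first_day, last_day], most recent first."""
    n_days = (last_day - first_day).days
    return [last_day - timedelta(days=i) for i in range(n_days + 1)]


for day in backfill_order(date(2025, 8, 1), date(2025, 8, 3)):
    print(day)
```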

Available curated datasets

Chronicle provides the following curated datasets:

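Posit Connect datasets

  • Connect User List - List of all named users with details
  • Connect User Totals - Counts of named users, active users for the past day and past 30 days, users in each role (viewer, publisher, administrator), and number of licensed seats
  • Connect Content List - List of all content items with details
  • Connect Content Totals - Daily counts of content grouped by type and environment
  • Connect Content Visits Totals by User - Counts of content visits per content item and user
  • Connect Shiny Usage Totals by User - Counts of Shiny app sessions per app and per user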

Posit Workbench datasets

  • Workbench User List - List of all named users with details
  • Workbench User Totals - Counts of named users, active users for the past day and past 30 days, users in each role (user, administrator, superadministrator), and number of licensed seats

Connect User List

Path: curated/v2/connect/user_list/date={YYYY-MM-DD}/chronicle-data.parquet

List of all named users across Connect environments. Deduplicated by email and environment.

Purpose

Provides a complete list of users for:

  • User directory exports
  • Activity analysis
  • Cross-referencing with content ownership
  • Understanding user roles and permissions

Filtering rules

  • Excludes locked users
  • Excludes unconfirmed users
  • Excludes users without email addresses
  • Excludes users without activity data
  • Excludes users inactive for more than 1 year
  • Deduplicates by (email, environment), keeping the most recent record
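
The rules above can be sketched on plain Python records. This is an illustration under stated assumptions, not Chronicle's code: the raw flags locked and confirmed are hypothetical field names, "more than 1 year" is taken as 365 days, the filters are assumed to run before deduplication, and "most recent" is assumed to be judged by updated_at.

```python
from datetime import datetime, timedelta


def curate_connect_users(records: list[dict], as_of: datetime) -> list[dict]:
    """Apply the filtering rules, then keep one record per (email, environment)."""
    cutoff = as_of - timedelta(days=365)
    latest: dict[tuple, dict] = {}
    for rec in records:
        if rec.get("locked") or not rec.get("confirmed"):
            continue  # locked or unconfirmed (hypothetical raw flags)
        if not rec.get("email"):
            continue  # no email address
        last_active = rec.get("last_active_at")
        if last_active is None or last_active < cutoff:
            continue  # no activity data, or inactive for more than a year
        key = (rec["email"], rec["environment"])
        # Assumption: "most recent" is judged by updated_at.
        if key not in latest or rec["updated_at"] > latest[key]["updated_at"]:
            latest[key] = rec
    return list(latest.values())


base = {"environment": "prod", "locked": False, "confirmed": True,
        "last_active_at": datetime(2025, 5, 1)}
raw = [
    {**base, "email": "a@example.com", "updated_at": datetime(2025, 1, 1)},
    {**base, "email": "a@example.com", "updated_at": datetime(2025, 3, 1)},
    {**base, "email": "b@example.com", "updated_at": datetime(2025, 3, 1), "locked": True},
    {**base, "email": "c@example.com", "updated_at": datetime(2025, 3, 1),
     "last_active_at": datetime(2023, 1, 1)},
]
users = curate_connect_users(raw, as_of=datetime(2025, 6, 1))
print([(u["email"], u["updated_at"].date().isoformat()) for u in users])
```

Only a@example.com survives, represented by its newer record; the locked user and the user inactive since 2023 are dropped.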

Schema

| Column | Type | Description |
| --- | --- | --- |
| environment | string | Connect environment identifier |
| id | string | User GUID (unique identifier) |
| username | string | User’s username |
| email | string | User’s email address |
| first_name | string | User’s first name |
| last_name | string | User’s last name |
| user_role | string | User role: administrator, publisher, or viewer |
| created_at | timestamp | When the user account was created |
| updated_at | timestamp | When the user account was last updated |
| last_active_at | timestamp | User’s last activity timestamp |
| active_today | boolean | Whether the user was active on this date |
| date | date | Partition date (automatically added by Arrow) |

Connect User Totals

Path: curated/v2/connect/user_totals/date={YYYY-MM-DD}/chronicle-data.parquet

Counts of named users, active users for the past day and past 30 days, users in each role (viewer, publisher, administrator), and number of licensed seats. Contains a single row per day with global counts.

Purpose

Provides pre-computed user counts for:

  • License compliance monitoring
  • Historical growth tracking
  • Daily active user (DAU) trends
  • User role distribution

Key definitions

Named Users (Licensing): Users who are not locked and have been active within the past year. This aligns with the Connect licensing model.

Active Users (Operational): Users counted within specific time windows (30 days, 1 day), excluding locked users, providing visibility into product usage.

Role Counts: Include only named users (active within the past year).

Deduplication

Users are deduplicated by email address across all environments. When multiple records exist, the most recent valid record is used.
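
The definitions above can be expressed as a small calculation. This is a sketch only: the input is assumed to be already deduplicated by email, the keys locked, last_active, and user_role are hypothetical, and the window boundaries (365 and 30 days, inclusive) are assumptions. The licensed seat count comes from license data and is not derived here.

```python
from collections import Counter
from datetime import date, timedelta


def connect_user_totals(users: list[dict], day: date) -> dict:
    """Compute named, active, and per-role counts for one partition date."""
    unlocked = [u for u in users if not u["locked"] and u["last_active"] is not None]
    # Named users: not locked and active within the past year.
    named = [u for u in unlocked if u["last_active"] >= day - timedelta(days=365)]
    # Role counts include only named users.
    roles = Counter(u["user_role"] for u in named)
    return {
        "named_users": len(named),
        "active_users_30days": sum(
            u["last_active"] >= day - timedelta(days=30) for u in unlocked
        ),
        "active_users_1day": sum(u["last_active"] == day for u in unlocked),
        "administrators": roles["administrator"],
        "publishers": roles["publisher"],
        "viewers": roles["viewer"],
    }


day = date(2025, 6, 1)
sample_users = [
    {"locked": False, "last_active": date(2025, 6, 1), "user_role": "publisher"},
    {"locked": False, "last_active": date(2025, 5, 20), "user_role": "viewer"},
    {"locked": False, "last_active": date(2024, 9, 1), "user_role": "viewer"},
    {"locked": True, "last_active": date(2025, 6, 1), "user_role": "administrator"},
]
totals = connect_user_totals(sample_users, day)
print(totals)
```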

Schema

| Column | Type | Description |
| --- | --- | --- |
| named_users | int64 | Count of users active within the past year (licensing metric) |
| active_users_30days | int64 | Count of users active within the past 30 days |
| active_users_1day | int64 | Count of users active on this specific date |
| administrators | int64 | Count of named users with administrator role |
| publishers | int64 | Count of named users with publisher role |
| viewers | int64 | Count of named users with viewer role |
| licensed_user_seats | int64 | Maximum licensed seats across all environments |
| date | date | Partition date (automatically added by Arrow) |

Connect Content List

Path: curated/v2/connect/content_list/date={YYYY-MM-DD}/chronicle-data.parquet

List of all content items across Connect environments. Deduplicated by GUID and environment.

Purpose

Provides a complete content inventory for:

  • Content audits and reports
  • Resource allocation analysis (CPU, memory, processes)
  • Deployment tracking
  • Access control reviews

Filtering rules

  • Excludes locked content
  • Deduplicates by (environment, GUID), keeping the most recent unlocked record
  • If the latest record is locked, the content is excluded entirely
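
The interaction between deduplication and the locked rule is easiest to see in code. In this sketch the latest record per (environment, GUID) is selected first and then dropped if it is locked. The observed_at field used to judge recency is a hypothetical name; the source does not say which timestamp Chronicle uses.

```python
def curate_content(records: list[dict]) -> list[dict]:
    """Keep the latest record per (environment, id), unless it is locked."""
    latest: dict[tuple, dict] = {}
    for rec in records:
        key = (rec["environment"], rec["id"])
        # Assumption: recency is judged by a hypothetical observed_at field.
        if key not in latest or rec["observed_at"] > latest[key]["observed_at"]:
            latest[key] = rec
    # If the latest record is locked, the content is excluded entirely.
    return [rec for rec in latest.values() if not rec["locked"]]


records = [
    {"environment": "prod", "id": "a", "locked": False, "observed_at": "2025-01-01"},
    {"environment": "prod", "id": "a", "locked": True, "observed_at": "2025-01-02"},
    {"environment": "prod", "id": "b", "locked": True, "observed_at": "2025-01-01"},
    {"environment": "prod", "id": "b", "locked": False, "observed_at": "2025-01-02"},
]
kept = curate_content(records)
print([(r["id"], r["observed_at"]) for r in kept])  # [('b', '2025-01-02')]
```

Content a is excluded even though an older unlocked record exists, because its latest record is locked.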

Schema

| Column | Type | Description |
| --- | --- | --- |
| environment | string | Connect environment identifier |
| id | string | Content GUID (unique identifier) |
| name | string | Content name (URL-friendly) |
| title | string | Content display title |
| created_time | timestamp | When content was created |
| last_deployed_time | timestamp | When content was last deployed |
| type | string | Content type (e.g., shiny, rmd-static, quarto-static) |
| description | string | Content description |
| access_type | string | Access control type (logged_in, acl, all) |
| locked | boolean | Whether content is locked |
| locked_message | string | Message shown when content is locked |
| connection_timeout | int | Connection timeout in seconds |
| read_timeout | int | Read timeout in seconds |
| init_timeout | int | Initialization timeout in seconds |
| idle_timeout | int | Idle timeout in seconds |
| max_processes | int | Maximum number of processes |
| min_processes | int | Minimum number of processes |
| max_conns_per_process | int | Maximum connections per process |
| load_factor | float64 | Load factor for scaling |
| cpu_request | float64 | CPU request (cores) |
| cpu_limit | float64 | CPU limit (cores) |
| memory_request | int64 | Memory request (bytes) |
| memory_limit | int64 | Memory limit (bytes) |
| amd_gpu_limit | int | AMD GPU limit |
| nvidia_gpu_limit | int | NVIDIA GPU limit |
| bundle_id | string | Current bundle GUID |
| content_category | string | Content category |
| parameterized | boolean | Whether content accepts parameters |
| cluster_name | string | Kubernetes cluster name |
| image_name | string | Container image name |
| default_image_name | string | Default container image |
| default_r_environment_management | boolean | Default R environment management setting |
| default_py_environment_management | boolean | Default Python environment management setting |
| service_account_name | string | Kubernetes service account |
| r_version | string | R version |
| r_environment_management | boolean | R environment management enabled |
| py_version | string | Python version |
| py_environment_management | boolean | Python environment management enabled |
| quarto_version | string | Quarto version |
| run_as | string | Unix user to run as |
| run_as_current_user | boolean | Whether to run as current user |
| owner_guid | string | Owner’s user GUID |
| content_url | string | Content access URL |
| dashboard_url | string | Dashboard URL |
| app_role | string | Application role |
| vanity_url | string | Custom vanity URL |
| tags | list[string] | Content tags |
| extension | boolean | Whether content is an extension |
| date | date | Partition date (automatically added by Arrow) |

Connect Content Totals

Path: curated/v2/connect/content_totals/date={YYYY-MM-DD}/chronicle-data.parquet

Daily counts of content grouped by type and environment.

Purpose

Provides content distribution metrics for:

  • Understanding content type usage
  • Tracking content growth by type
  • Environment-specific content analysis

Filtering rules

  • Excludes locked content
  • Deduplicates by (environment, GUID), keeping the most recent record

Schema

| Column | Type | Description |
| --- | --- | --- |
| count | int64 | Number of content items |
| type | string | Content type |
| environment | string | Connect environment identifier |
| date | date | Partition date (automatically added by Arrow) |
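
The count, type, and environment columns amount to a grouped count. A minimal sketch, assuming the input records have already been filtered and deduplicated as described under the filtering rules; the function name content_totals is hypothetical.

```python
from collections import Counter


def content_totals(content: list[dict]) -> list[dict]:
    """Count content items per (environment, type)."""
    counts = Counter((c["environment"], c["type"]) for c in content)
    return [
        {"count": n, "type": content_type, "environment": env}
        for (env, content_type), n in sorted(counts.items())
    ]


content = [
    {"environment": "prod", "type": "shiny"},
    {"environment": "prod", "type": "shiny"},
    {"environment": "prod", "type": "quarto-static"},
    {"environment": "staging", "type": "shiny"},
]
content_rows = content_totals(content)
for row in content_rows:
    print(row)
```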

Connect Content Visits Totals by User

Path: curated/v2/connect/content_visits_totals_by_user/date={YYYY-MM-DD}/chronicle-data.parquet

Counts of content visits per content item and user. Provides visit metrics for each user-content combination.

Purpose

Provides pre-computed visit counts for:

  • User activity analysis
  • Content popularity by user
  • Access pattern tracking
  • User engagement metrics

Filtering rules

  • Deduplicates visits by (environment, content_guid, user_guid, path, timestamp)
  • Handles duplicate reports from multiple Chronicle agent sidecars in HA deployments
  • Counts unique visit timestamps per user-content-path combination
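
The deduplicate-then-count steps above can be sketched as follows. The event dictionaries and the function name visit_totals are hypothetical; only the five key fields come from the rules above.

```python
from collections import Counter


def visit_totals(events: list[dict]) -> list[dict]:
    """Collapse duplicate reports, then count visits per user, content, and path."""
    # A set drops events reported more than once (for example by several
    # agent sidecars), because duplicates share all five key fields.
    unique = {
        (e["environment"], e["content_guid"], e["user_guid"], e["path"], e["timestamp"])
        for e in events
    }
    # Drop the timestamp and count what remains per combination.
    counts = Counter(key[:4] for key in unique)
    return [
        {"environment": env, "content_guid": content, "user_guid": user,
         "path": path, "visits": n}
        for (env, content, user, path), n in sorted(counts.items())
    ]


visit = {"environment": "prod", "content_guid": "c1", "user_guid": "u1", "path": "/"}
events = [
    {**visit, "timestamp": "2025-01-15T09:00:00Z"},
    {**visit, "timestamp": "2025-01-15T09:00:00Z"},  # same visit, second sidecar
    {**visit, "timestamp": "2025-01-15T10:30:00Z"},
]
visit_rows = visit_totals(events)
print(visit_rows[0]["visits"])  # 2
```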

Schema

| Column | Type | Description |
| --- | --- | --- |
| environment | string | Connect environment identifier |
| content_guid | string | Content GUID being visited |
| user_guid | string | User GUID who visited the content |
| visits | int64 | Total number of visits (unique timestamps) |
| path | string | URL path accessed within the content |
| date | date | Partition date (automatically added by Arrow) |

Connect Shiny Usage Totals by User

Path: curated/v2/connect/shiny_usage_totals_by_user/date={YYYY-MM-DD}/chronicle-data.parquet

Counts of Shiny app sessions per app and per user. Provides session counts and total duration for each user-content combination.

Purpose

Provides pre-computed Shiny usage metrics for:

  • User engagement with Shiny applications
  • Session duration analysis
  • Content usage patterns
  • Resource utilization tracking

Filtering rules

  • Deduplicates sessions by (environment, content_guid, user_guid, timestamp)
  • Handles duplicate reports from multiple Chronicle agent sidecars in HA deployments
  • Counts unique sessions and durations per user-content combination
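
A sketch of the session rules above: sessions are deduplicated on (environment, content_guid, user_guid, timestamp), then counted and their durations summed per user-content combination. The input layout is hypothetical, and the sketch assumes duplicate reports of one session carry the same duration.

```python
def shiny_usage_totals(sessions: list[dict]) -> list[dict]:
    """Deduplicate sessions, then total them per (environment, content, user)."""
    unique: dict[tuple, int] = {}
    for s in sessions:
        key = (s["environment"], s["content_guid"], s["user_guid"], s["timestamp"])
        # Assumption: duplicate reports carry the same duration, so the
        # first one seen is kept.
        unique.setdefault(key, s["duration"])

    totals: dict[tuple, dict] = {}
    for (env, content, user, _timestamp), duration in unique.items():
        row = totals.setdefault(
            (env, content, user),
            {"environment": env, "content_guid": content, "user_guid": user,
             "num_sessions": 0, "duration": 0},
        )
        row["num_sessions"] += 1
        row["duration"] += duration
    return list(totals.values())


session = {"environment": "prod", "content_guid": "app1", "user_guid": "u1"}
sessions = [
    {**session, "timestamp": "2025-01-15T09:00:00Z", "duration": 120},
    {**session, "timestamp": "2025-01-15T09:00:00Z", "duration": 120},  # duplicate report
    {**session, "timestamp": "2025-01-15T14:00:00Z", "duration": 300},
]
shiny_rows = shiny_usage_totals(sessions)
print(shiny_rows[0]["num_sessions"], shiny_rows[0]["duration"])  # 2 420
```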

Schema

| Column | Type | Description |
| --- | --- | --- |
| environment | string | Connect environment identifier |
| content_guid | string | Shiny content GUID |
| user_guid | string | User GUID who used the Shiny app |
| num_sessions | int64 | Total number of unique Shiny sessions |
| duration | int64 | Total session duration in seconds |
| date | date | Partition date (automatically added by Arrow) |

Workbench User List

Path: curated/v2/workbench/user_list/date={YYYY-MM-DD}/chronicle-data.parquet

List of all named users across Workbench environments. Deduplicated by username and environment.

Purpose

Provides a complete list of users for:

  • User directory exports
  • Activity analysis
  • License compliance
  • Understanding user roles

Filtering rules

  • Excludes non-active users (status != "Active")
  • Excludes users without usernames
  • Excludes users without activity data
  • Excludes users inactive for more than 1 year
  • Deduplicates by (username, environment), keeping the most recent record

Schema

| Column | Type | Description |
| --- | --- | --- |
| environment | string | Workbench environment identifier |
| id | string | User GUID (unique identifier) |
| username | string | User’s username |
| email | string | User’s email address |
| user_role | string | User role: admin, superadmin, or user |
| created_at | timestamp | When the user account was created |
| last_active_at | timestamp | User’s last activity timestamp |
| active_today | boolean | Whether the user was active on this date |
| date | date | Partition date (automatically added by Arrow) |

Workbench User Totals

Path: curated/v2/workbench/user_totals/date={YYYY-MM-DD}/chronicle-data.parquet

Counts of named users, active users for the past day and past 30 days, users in each role (user, administrator, superadministrator), and number of licensed seats. Contains a single row per day with global counts.

Purpose

Provides pre-computed user counts for:

  • License compliance monitoring
  • Historical growth tracking
  • Daily active user (DAU) trends
  • User role distribution

Key definitions

Named Users (Licensing): Users with Active status who have been active within the past year. This aligns with the Workbench licensing model.

Active Users (Operational): Users counted within specific time windows (30 days, 1 day), providing visibility into product usage.

Role Counts: Include only named users (active within the past year).

Deduplication

Users are deduplicated by username across all environments. When multiple records exist, the most recent valid record is used.

Schema

| Column | Type | Description |
| --- | --- | --- |
| named_users | int64 | Count of users active within the past year (licensing metric) |
| active_users_30days | int64 | Count of users active within the past 30 days |
| active_users_1day | int64 | Count of users active on this specific date |
| administrators | int64 | Count of named users with administrator role |
| super_administrators | int64 | Count of named users with super administrator role |
| users | int64 | Count of named users with standard user role |
| licensed_user_seats | int64 | Maximum licensed seats across all environments |
| date | date | Partition date (automatically added by Arrow) |
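
For license compliance monitoring, one possible check is to compare named_users against licensed_user_seats for each day. The sketch below works on rows already loaded into dictionaries; the function name days_over_license and the sample numbers are made up for illustration, and whether this comparison matches your license terms is an assumption to verify.

```python
def days_over_license(rows: list[dict]) -> list[dict]:
    """Return the rows whose named_users count exceeds licensed_user_seats."""
    return [r for r in rows if r["named_users"] > r["licensed_user_seats"]]


# Illustrative values only, not real data.
rows = [
    {"date": "2025-01-14", "named_users": 48, "licensed_user_seats": 50},
    {"date": "2025-01-15", "named_users": 52, "licensed_user_seats": 50},
]
over = days_over_license(rows)
print([r["date"] for r in over])  # ['2025-01-15']
```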