Curated Data
Chronicle provides curated datasets that simplify common reporting and analysis tasks. These datasets are pre-processed, deduplicated, and optimized for efficient querying.
Curated data storage location
Curated datasets are stored in the following locations:
- By default: /var/lib/posit-chronicle/data/curated/v2/{product}/{dataset}
- Custom local storage: {[LocalStorage].Location}/curated/v2/{product}/{dataset}
- S3 storage: {[S3Storage].Location/Prefix}/curated/v2/{product}/{dataset}
Where:
- {product} is either connect or workbench
- {dataset} is the dataset name (e.g., user_list, user_totals)
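The path layout above can be sketched as a small helper. This is an illustration only: the function name `curated_dataset_path` and the `base` parameter are hypothetical, and `base` stands in for whichever storage location (default, custom local, or S3 prefix) your deployment uses.

```python
# Default storage location taken from the list above.
DEFAULT_BASE = "/var/lib/posit-chronicle/data"


def curated_dataset_path(product: str, dataset: str, base: str = DEFAULT_BASE) -> str:
    """Build the curated dataset directory for a product and dataset name.

    Hypothetical helper: it only joins path segments and does not check
    that the dataset exists.
    """
    if product not in ("connect", "workbench"):
        raise ValueError(f"unknown product: {product!r}")
    # Plain string joining keeps URI-style bases such as "s3://..." intact.
    return "/".join([base.rstrip("/"), "curated", "v2", product, dataset])


print(curated_dataset_path("connect", "user_totals"))
```

String joining is used instead of pathlib so that an S3-style base keeps its double slash.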
Reading curated data
The chronicle.reports R package provides simple functions to read curated data.
Curated data is stored in Apache Parquet format with Hive-style date partitioning and can be read using:
- R: Use the arrow package with the open_dataset() function
- Python: Use the pandas, pyarrow, or polars packages
- DuckDB: Query directly with SQL
- Any Parquet-compatible tool
The date partition is automatically available as a column when reading with tools that support hive-style partitioned datasets (like the {arrow} function open_dataset()).
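To make the Hive-style layout concrete, the sketch below lists the dates encoded in a dataset's `date=YYYY-MM-DD` subdirectories using only the standard library. The function name `partition_dates` is hypothetical, and the sketch assumes a locally mounted dataset directory; partition-aware readers derive the date column from these same directory names.

```python
from datetime import date
from pathlib import Path


def partition_dates(dataset_dir: str) -> list[date]:
    """Return the sorted dates found in `date=YYYY-MM-DD` subdirectories."""
    dates = []
    for child in Path(dataset_dir).iterdir():
        key, sep, value = child.name.partition("=")
        # Only directories named like "date=2025-01-01" are partitions.
        if child.is_dir() and sep and key == "date":
            dates.append(date.fromisoformat(value))
    return sorted(dates)
```

Calling `partition_dates(...)` on a dataset directory such as the `user_totals` path shows which days have curated data before any Parquet file is opened.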
Example: Reading curated data in R
library(arrow)
library(dplyr)
# Read user totals with automatic date column
user_totals <- open_dataset("/var/lib/posit-chronicle/data/curated/v2/connect/user_totals")
# Filter by date range
recent_totals <- user_totals |>
  filter(date >= as.Date("2025-01-01")) |>
  collect()
Example: Reading curated data in Python
import pyarrow.dataset as ds
# Read content list; partitioning="hive" exposes the date partition as a column
content_list = ds.dataset(
    "/var/lib/posit-chronicle/data/curated/v2/connect/content_list",
    partitioning="hive",
)
# Filter by date and convert to a pandas DataFrame
df = content_list.to_table(filter=ds.field("date") >= "2025-01-01").to_pandas()
Automatic backfilling
Chronicle automatically backfills curated datasets for historical dates after an upgrade from Chronicle 2025.08 or earlier.
The backfill process runs in the background after server startup and processes dates in reverse chronological order (most recent first). The server tracks backfill progress in {storage}/upgrade/curation-backfill-state.json. This ensures curated datasets are available for all historical data after upgrading Chronicle.
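One way to watch backfill progress from the outside is to check which dates still lack a curated partition. The sketch below does this for a locally mounted dataset directory. It does not read the backfill state file, whose format is not described here, and the function name `missing_partitions` is hypothetical. A date may also be missing simply because no data was collected for it.

```python
from datetime import date, timedelta
from pathlib import Path


def missing_partitions(dataset_dir: str, start: date, end: date) -> list[date]:
    """Dates in [start, end] without a curated Parquet file, newest first."""
    missing = []
    day = end
    # Walk backwards, mirroring the most-recent-first backfill order.
    while day >= start:
        parquet = Path(dataset_dir) / f"date={day.isoformat()}" / "chronicle-data.parquet"
        if not parquet.exists():
            missing.append(day)
        day -= timedelta(days=1)
    return missing
```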
Available curated datasets
Chronicle provides the following curated datasets:
Posit Connect datasets
- Connect User List - List of all named users with details
- Connect User Totals - Counts of named users, active users for the past day and past 30 days, users in each role (viewer, publisher, administrator), and number of licensed seats
- Connect Content List - List of all content items with configuration
- Connect Content Totals - Counts of content items grouped by type and environment
- Connect Content Visits Totals by User - Counts of content visits per content item and user
- Connect Shiny Usage Totals by User - Counts of Shiny app sessions per app and per user
Posit Workbench datasets
- Workbench User List - List of all named users with details
- Workbench User Totals - Counts of named users, active users for the past day and past 30 days, users in each role (user, administrator, superadministrator), and number of licensed seats
Connect User List
Path: curated/v2/connect/user_list/date={YYYY-MM-DD}/chronicle-data.parquet
List of all named users across Connect environments. Deduplicated by email and environment.
Purpose
Provides a complete list of users for:
- User directory exports
- Activity analysis
- Cross-referencing with content ownership
- Understanding user roles and permissions
Filtering rules
- Excludes locked users
- Excludes unconfirmed users
- Excludes users without email addresses
- Excludes users without activity data
- Excludes users inactive for more than 1 year
- Deduplicates by (email, environment), keeping the most recent record
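The filtering rules above can be sketched in plain Python. This is an illustration, not Chronicle's implementation: the input keys `locked`, `confirmed`, and `observed_at` are hypothetical names for raw fields, "1 year" is approximated as 365 days, and applying the filters before deduplication is an assumption.

```python
from datetime import datetime, timedelta


def curate_connect_users(records, as_of):
    """Apply the user list filters, then keep the latest record per key."""
    one_year_ago = as_of - timedelta(days=365)  # assumption: 1 year = 365 days
    latest = {}
    for rec in records:
        if rec["locked"] or not rec["confirmed"]:
            continue  # locked or unconfirmed users
        if not rec.get("email"):
            continue  # users without email addresses
        last_active = rec.get("last_active_at")
        if last_active is None or last_active < one_year_ago:
            continue  # no activity data, or inactive for more than 1 year
        key = (rec["email"], rec["environment"])
        # "observed_at" is a hypothetical timestamp used to pick the newest record.
        if key not in latest or rec["observed_at"] > latest[key]["observed_at"]:
            latest[key] = rec
    return list(latest.values())
```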
Schema
| Column | Type | Description |
|---|---|---|
| environment | string | Connect environment identifier |
| id | string | User GUID (unique identifier) |
| username | string | User’s username |
| email | string | User’s email address |
| first_name | string | User’s first name |
| last_name | string | User’s last name |
| user_role | string | User role: administrator, publisher, or viewer |
| created_at | timestamp | When the user account was created |
| updated_at | timestamp | When the user account was last updated |
| last_active_at | timestamp | User’s last activity timestamp |
| active_today | boolean | Whether the user was active on this date |
| date | date | Partition date (automatically added by Arrow) |
Connect User Totals
Path: curated/v2/connect/user_totals/date={YYYY-MM-DD}/chronicle-data.parquet
Counts of named users, active users for the past day and past 30 days, users in each role (viewer, publisher, administrator), and number of licensed seats. Contains a single row per day with global counts.
Purpose
Provides pre-computed user counts for:
- License compliance monitoring
- Historical growth tracking
- Daily active user (DAU) trends
- User role distribution
Key definitions
Named Users (Licensing): Users who are not locked and have been active within the past year. This aligns with the Connect licensing model.
Active Users (Operational): Users counted within specific time windows (30 days, 1 day), excluding locked users, providing visibility into product usage.
Role Counts: Include only named users (active within the past year).
Deduplication
Users are deduplicated by email address across all environments. When multiple records exist, the most recent valid record is used.
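The definitions above can be read as the following sketch, which computes the counts from a list of user records that has already been deduplicated by email. It is an illustration under stated assumptions, not Chronicle's implementation: the function name is hypothetical, the windows are approximated as 365, 30, and 1 days before `as_of`, and the `locked` key is a hypothetical raw field.

```python
from datetime import datetime, timedelta


def connect_user_totals(users, as_of):
    """Compute named, active, and per-role counts for one day."""

    def active_within(user, days):
        last = user.get("last_active_at")
        # Locked users are excluded from every count.
        return (not user["locked"]) and last is not None and last >= as_of - timedelta(days=days)

    # Named users: not locked and active within the past year.
    named = [u for u in users if active_within(u, 365)]
    return {
        "named_users": len(named),
        "active_users_30days": sum(active_within(u, 30) for u in users),
        "active_users_1day": sum(active_within(u, 1) for u in users),
        # Role counts include only named users.
        "administrators": sum(u["user_role"] == "administrator" for u in named),
        "publishers": sum(u["user_role"] == "publisher" for u in named),
        "viewers": sum(u["user_role"] == "viewer" for u in named),
    }
```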
Schema
| Column | Type | Description |
|---|---|---|
| named_users | int64 | Count of users active within the past year (licensing metric) |
| active_users_30days | int64 | Count of users active within the past 30 days |
| active_users_1day | int64 | Count of users active on this specific date |
| administrators | int64 | Count of named users with administrator role |
| publishers | int64 | Count of named users with publisher role |
| viewers | int64 | Count of named users with viewer role |
| licensed_user_seats | int64 | Maximum licensed seats across all environments |
| date | date | Partition date (automatically added by Arrow) |
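For license compliance monitoring, the named_users and licensed_user_seats columns can be compared directly. The sketch below assumes the daily rows have already been loaded as dictionaries keyed by the column names in this schema. The function names are hypothetical, and treating a seat count of 0 as "no limit recorded" is an assumption.

```python
def seat_utilization(row):
    """Fraction of licensed seats used by named users, or None if unknown."""
    seats = row["licensed_user_seats"]
    if not seats:
        return None  # assumption: 0 or missing means no seat limit recorded
    return row["named_users"] / seats


def over_limit_dates(rows):
    """Dates on which named users exceeded the licensed seat count."""
    return [
        r["date"]
        for r in rows
        if r["licensed_user_seats"] and r["named_users"] > r["licensed_user_seats"]
    ]
```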
Connect Content List
Path: curated/v2/connect/content_list/date={YYYY-MM-DD}/chronicle-data.parquet
List of all content items across Connect environments. Deduplicated by GUID and environment.
Purpose
Provides a complete content inventory for:
- Content audits and reports
- Resource allocation analysis (CPU, memory, processes)
- Deployment tracking
- Access control reviews
Filtering rules
- Excludes locked content
- Deduplicates by (environment, GUID), keeping the most recent record
- If the most recent record is locked, the content is excluded entirely
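A minimal sketch of these rules follows, assuming each raw record carries a hypothetical `observed_at` timestamp that identifies the latest record. It takes the latest record per (environment, GUID) first and drops the item if that record is locked. This is one reading of the rules, not Chronicle's implementation.

```python
def curate_content(records):
    """Keep the latest record per (environment, id); drop locked content."""
    latest = {}
    for rec in records:
        key = (rec["environment"], rec["id"])
        # "observed_at" is a hypothetical field used to order records.
        if key not in latest or rec["observed_at"] > latest[key]["observed_at"]:
            latest[key] = rec
    # If the latest record is locked, the content item is excluded entirely.
    return [rec for rec in latest.values() if not rec["locked"]]
```

Note that an item whose older record was unlocked but whose latest record is locked does not appear in the output.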
Schema
| Column | Type | Description |
|---|---|---|
| environment | string | Connect environment identifier |
| id | string | Content GUID (unique identifier) |
| name | string | Content name (URL-friendly) |
| title | string | Content display title |
| created_time | timestamp | When content was created |
| last_deployed_time | timestamp | When content was last deployed |
| type | string | Content type (e.g., shiny, rmd-static, quarto-static) |
| description | string | Content description |
| access_type | string | Access control type (logged_in, acl, all) |
| locked | boolean | Whether content is locked |
| locked_message | string | Message shown when content is locked |
| connection_timeout | int | Connection timeout in seconds |
| read_timeout | int | Read timeout in seconds |
| init_timeout | int | Initialization timeout in seconds |
| idle_timeout | int | Idle timeout in seconds |
| max_processes | int | Maximum number of processes |
| min_processes | int | Minimum number of processes |
| max_conns_per_process | int | Maximum connections per process |
| load_factor | float64 | Load factor for scaling |
| cpu_request | float64 | CPU request (cores) |
| cpu_limit | float64 | CPU limit (cores) |
| memory_request | int64 | Memory request (bytes) |
| memory_limit | int64 | Memory limit (bytes) |
| amd_gpu_limit | int | AMD GPU limit |
| nvidia_gpu_limit | int | NVIDIA GPU limit |
| bundle_id | string | Current bundle GUID |
| content_category | string | Content category |
| parameterized | boolean | Whether content accepts parameters |
| cluster_name | string | Kubernetes cluster name |
| image_name | string | Container image name |
| default_image_name | string | Default container image |
| default_r_environment_management | boolean | Default R environment management setting |
| default_py_environment_management | boolean | Default Python environment management setting |
| service_account_name | string | Kubernetes service account |
| r_version | string | R version |
| r_environment_management | boolean | R environment management enabled |
| py_version | string | Python version |
| py_environment_management | boolean | Python environment management enabled |
| quarto_version | string | Quarto version |
| run_as | string | Unix user to run as |
| run_as_current_user | boolean | Whether to run as current user |
| owner_guid | string | Owner’s user GUID |
| content_url | string | Content access URL |
| dashboard_url | string | Dashboard URL |
| app_role | string | Application role |
| vanity_url | string | Custom vanity URL |
| tags | list[string] | Content tags |
| extension | boolean | Whether content is an extension |
| date | date | Partition date (automatically added by Arrow) |
Connect Content Totals
Path: curated/v2/connect/content_totals/date={YYYY-MM-DD}/chronicle-data.parquet
Daily counts of content grouped by type and environment.
Purpose
Provides content distribution metrics for:
- Understanding content type usage
- Tracking content growth by type
- Environment-specific content analysis
Filtering rules
- Excludes locked content
- Deduplicates by (environment, GUID), keeping the most recent record
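Once locked content is excluded and records are deduplicated, the totals reduce to a grouped count. The sketch below illustrates that step on already-curated content records. The function name is hypothetical and the output keys follow the schema of this dataset.

```python
from collections import Counter


def content_totals(content_items):
    """Count content items per (type, environment)."""
    counts = Counter((item["type"], item["environment"]) for item in content_items)
    # Sorted only to make the output order deterministic.
    return [
        {"count": n, "type": content_type, "environment": env}
        for (content_type, env), n in sorted(counts.items())
    ]
```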
Schema
| Column | Type | Description |
|---|---|---|
| count | int64 | Number of content items |
| type | string | Content type |
| environment | string | Connect environment identifier |
| date | date | Partition date (automatically added by Arrow) |
Connect Content Visits Totals by User
Path: curated/v2/connect/content_visits_totals_by_user/date={YYYY-MM-DD}/chronicle-data.parquet
Counts of content visits per content item and user. Provides visit metrics for each user-content combination.
Purpose
Provides pre-computed visit counts for:
- User activity analysis
- Content popularity by user
- Access pattern tracking
- User engagement metrics
Filtering rules
- Deduplicates visits by (environment, content_guid, user_guid, path, timestamp)
- Handles duplicate reports from multiple Chronicle agent sidecars in HA deployments
- Counts unique visit timestamps per user-content-path combination
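The deduplication above amounts to counting distinct timestamps per key, so a visit reported twice by different agent sidecars is counted once. A minimal sketch follows, assuming raw visit events are dictionaries with the keys shown; the function name and the raw event shape are hypothetical.

```python
from collections import defaultdict


def visit_totals(events):
    """Count unique visit timestamps per (environment, content, user, path)."""
    unique = defaultdict(set)
    for event in events:
        key = (event["environment"], event["content_guid"], event["user_guid"], event["path"])
        # A set drops duplicate reports that share the same timestamp.
        unique[key].add(event["timestamp"])
    return [
        {
            "environment": env,
            "content_guid": content,
            "user_guid": user,
            "path": path,
            "visits": len(timestamps),
        }
        for (env, content, user, path), timestamps in unique.items()
    ]
```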
Schema
| Column | Type | Description |
|---|---|---|
| environment | string | Connect environment identifier |
| content_guid | string | Content GUID being visited |
| user_guid | string | User GUID who visited the content |
| visits | int64 | Total number of visits (unique timestamps) |
| path | string | URL path accessed within the content |
| date | date | Partition date (automatically added by Arrow) |
Connect Shiny Usage Totals by User
Path: curated/v2/connect/shiny_usage_totals_by_user/date={YYYY-MM-DD}/chronicle-data.parquet
Counts of Shiny app sessions per app and per user. Provides session counts and total duration for each user-content combination.
Purpose
Provides pre-computed Shiny usage metrics for:
- User engagement with Shiny applications
- Session duration analysis
- Content usage patterns
- Resource utilization tracking
Filtering rules
- Deduplicates sessions by (environment, content_guid, user_guid, timestamp)
- Handles duplicate reports from multiple Chronicle agent sidecars in HA deployments
- Counts unique sessions and durations per user-content combination
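These rules can be sketched in two steps: drop duplicate session reports, then total sessions and durations per user and app. The raw session shape and the function name are hypothetical, and keeping the first report for a duplicated key is an assumption.

```python
def shiny_usage_totals(sessions):
    """Total unique sessions and duration per (environment, content, user)."""
    unique = {}
    for session in sessions:
        key = (
            session["environment"],
            session["content_guid"],
            session["user_guid"],
            session["timestamp"],
        )
        # Assumption: the first report seen for a key is the one kept.
        unique.setdefault(key, session)

    totals = {}
    for (env, content, user, _timestamp), session in unique.items():
        agg = totals.setdefault((env, content, user), {"num_sessions": 0, "duration": 0})
        agg["num_sessions"] += 1
        agg["duration"] += session["duration"]  # seconds
    return totals
```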
Schema
| Column | Type | Description |
|---|---|---|
| environment | string | Connect environment identifier |
| content_guid | string | Shiny content GUID |
| user_guid | string | User GUID who used the Shiny app |
| num_sessions | int64 | Total number of unique Shiny sessions |
| duration | int64 | Total session duration in seconds |
| date | date | Partition date (automatically added by Arrow) |
Workbench User List
Path: curated/v2/workbench/user_list/date={YYYY-MM-DD}/chronicle-data.parquet
List of all named users across Workbench environments. Deduplicated by username and environment.
Purpose
Provides a complete list of users for:
- User directory exports
- Activity analysis
- License compliance
- Understanding user roles
Filtering rules
- Excludes non-active users (status != “Active”)
- Excludes users without usernames
- Excludes users without activity data
- Excludes users inactive for more than 1 year
- Deduplicates by (username, environment), keeping the most recent record
Schema
| Column | Type | Description |
|---|---|---|
| environment | string | Workbench environment identifier |
| id | string | User GUID (unique identifier) |
| username | string | User’s username |
| email | string | User’s email address |
| user_role | string | User role: admin, superadmin, or user |
| created_at | timestamp | When the user account was created |
| last_active_at | timestamp | User’s last activity timestamp |
| active_today | boolean | Whether the user was active on this date |
| date | date | Partition date (automatically added by Arrow) |
Workbench User Totals
Path: curated/v2/workbench/user_totals/date={YYYY-MM-DD}/chronicle-data.parquet
Counts of named users, active users for the past day and past 30 days, users in each role (user, administrator, superadministrator), and number of licensed seats. Contains a single row per day with global counts.
Purpose
Provides pre-computed user counts for:
- License compliance monitoring
- Historical growth tracking
- Daily active user (DAU) trends
- User role distribution
Key definitions
Named Users (Licensing): Users with an Active status who have been active within the past year. This aligns with the Workbench licensing model.
Active Users (Operational): Users counted within specific time windows (30 days, 1 day), providing visibility into product usage.
Role Counts: Include only named users (active within the past year).
Deduplication
Users are deduplicated by username across all environments. When multiple records exist, the most recent valid record is used.
Schema
| Column | Type | Description |
|---|---|---|
| named_users | int64 | Count of users active within the past year (licensing metric) |
| active_users_30days | int64 | Count of users active within the past 30 days |
| active_users_1day | int64 | Count of users active on this specific date |
| administrators | int64 | Count of named users with administrator role |
| super_administrators | int64 | Count of named users with super administrator role |
| users | int64 | Count of named users with standard user role |
| licensed_user_seats | int64 | Maximum licensed seats across all environments |
| date | date | Partition date (automatically added by Arrow) |