Curated Data
Chronicle provides curated datasets that simplify common reporting and analysis tasks. These datasets are pre-processed, deduplicated, and optimized for efficient querying.
Curated data storage location
Curated datasets are stored in the following locations:
- By default: /var/lib/posit-chronicle/data/curated/v2/{product}/{dataset}
- Custom local storage: {[LocalStorage].Location}/curated/v2/{product}/{dataset}
- S3 storage: {[S3Storage].Location/Prefix}/curated/v2/{product}/{dataset}
Where:
- {product} is either connect or workbench
- {dataset} is the dataset name (e.g., user_list, user_totals)
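As a minimal sketch of how these path templates resolve, the helper below joins a storage root with a product and a dataset name. The function name `curated_dataset_path` is hypothetical and not part of Chronicle; the default root is the default location listed above.

```python
# Default local storage root, taken from the list above.
DEFAULT_ROOT = "/var/lib/posit-chronicle/data"
PRODUCTS = ("connect", "workbench")


def curated_dataset_path(product, dataset, root=DEFAULT_ROOT):
    """Return {root}/curated/v2/{product}/{dataset} (hypothetical helper)."""
    if product not in PRODUCTS:
        raise ValueError(f"product must be one of {PRODUCTS}, got {product!r}")
    # Plain string joining keeps URI-style roots (for example an S3 prefix) intact.
    return "/".join([root.rstrip("/"), "curated", "v2", product, dataset])


print(curated_dataset_path("connect", "user_totals"))
# /var/lib/posit-chronicle/data/curated/v2/connect/user_totals
```

Passing `root` explicitly covers the custom local storage and S3 cases, where the root comes from your Chronicle configuration.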
Reading curated data
The chronicle.reports R package provides simple functions to read curated data.
Curated data is stored in Apache Parquet format with Hive-style date partitioning and can be read using:
- chronicle.reports: Use the chronicle_data() function
- R: Use the arrow package with the open_dataset() function
- Python: Use the pandas, pyarrow, or polars packages
- DuckDB: Query directly with SQL
- Any Parquet-compatible tool
The date partition is automatically available as a column when reading with tools that support hive-style partitioned datasets (like the {arrow} function open_dataset()).
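For tools that do not understand Hive-style partitioning, the date has to be recovered from the directory name. The sketch below does this with only the Python standard library. The helper name `partition_date` is hypothetical; the example path follows the `date={YYYY-MM-DD}/chronicle-data.parquet` layout given in the dataset paths on this page.

```python
import re
from datetime import date

# Matches a Hive-style partition directory such as "date=2025-01-15".
_PARTITION = re.compile(r"(?:^|/)date=(\d{4})-(\d{2})-(\d{2})(?:/|$)")


def partition_date(path):
    """Extract the partition date from a curated data file path (hypothetical helper)."""
    match = _PARTITION.search(path)
    if match is None:
        raise ValueError(f"no date=YYYY-MM-DD partition in {path!r}")
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


example = "curated/v2/connect/user_totals/date=2025-01-15/chronicle-data.parquet"
print(partition_date(example))  # 2025-01-15
```

Partition-aware readers perform this mapping for you and expose the result as the `date` column.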
Example: Reading curated data with the chronicle.reports R package
library(chronicle.reports)
# Read user totals with automatic date column
recent_totals <- chronicle_data("connect/user_totals", base_path = "/var/lib/posit-chronicle/data") |>
  dplyr::filter(date >= as.Date("2025-01-01")) |>
  dplyr::collect()
Example: Reading curated data in R
library(arrow)
# Read user totals with automatic date column
user_totals <- open_dataset("/var/lib/posit-chronicle/data/curated/v2/connect/user_totals")
# Filter by date range
recent_totals <- user_totals |>
  dplyr::filter(date >= as.Date("2025-01-01")) |>
  dplyr::collect()
Example: Reading curated data in Python
import pyarrow.dataset as ds
# Open the content list; partitioning="hive" exposes the date=... directories as a "date" column
content_list = ds.dataset(
    "/var/lib/posit-chronicle/data/curated/v2/connect/content_list",
    partitioning="hive",
)
# Filter on the partition column, then convert to a pandas DataFrame
df = content_list.to_table(filter=ds.field("date") >= "2025-01-01").to_pandas()
Available curated datasets
Chronicle provides the following curated datasets:
Posit Connect datasets
- Connect User List - List of all named users with details
- Connect User Totals - Counts of named users, active users for the past day and past 30 days, users in each role (viewer, publisher, administrator), and number of licensed seats
- Connect Content List - List of all content items with configuration
- Connect Content Totals - Counts of content items grouped by type and environment
- Connect Content Visits Totals by User - Counts of content visits per content item and user
- Connect Shiny Usage Totals by User - Counts of Shiny app sessions per app and per user
- Connect Content Hits Totals - Counts of content hits per content item
- Connect Content Hits Totals by User - Counts of content hits per content item and user
Posit Workbench datasets
- Workbench User List - List of all named users with details
- Workbench User Totals - Counts of named users, active users for the past day and past 30 days, users in each role (user, administrator, superadministrator), and number of licensed seats
Connect user list
Path: curated/v2/connect/user_list/date={YYYY-MM-DD}/chronicle-data.parquet
List of all named users across Connect environments. Deduplicated by email and environment.
Purpose
Provides a complete list of users for:
- User directory exports
- Activity analysis
- Cross-referencing with content ownership
- Understanding user roles and permissions
Filtering rules
- Excludes locked users
- Excludes unconfirmed users
- Excludes users without email addresses
- Excludes users without activity data
- Excludes users inactive for more than 1 year
- Deduplicates by (email, environment), keeping the most recent record
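The filtering rules above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Chronicle's implementation: the raw fields `locked` and `confirmed` are hypothetical, "most recent" is assumed to mean the largest `updated_at`, and the filters are assumed to run before deduplication.

```python
from datetime import datetime, timedelta


def curate_user_list(records, as_of):
    """Apply the filtering rules above to raw user records (illustrative only)."""
    cutoff = as_of - timedelta(days=365)  # "inactive for more than 1 year"
    latest = {}
    for rec in records:
        if rec.get("locked") or not rec.get("confirmed"):
            continue  # locked or unconfirmed (hypothetical raw fields)
        if not rec.get("email"):
            continue  # no email address
        last_active = rec.get("last_active_at")
        if last_active is None or last_active < cutoff:
            continue  # no activity data, or inactive for over a year
        key = (rec["email"], rec["environment"])
        kept = latest.get(key)
        if kept is None or rec["updated_at"] > kept["updated_at"]:
            latest[key] = rec  # keep the most recent record per key
    return list(latest.values())


now = datetime(2025, 6, 1)
raw = [
    {"email": "a@example.com", "environment": "prod", "locked": False, "confirmed": True,
     "last_active_at": datetime(2025, 5, 30), "updated_at": datetime(2025, 1, 1)},
    {"email": "a@example.com", "environment": "prod", "locked": False, "confirmed": True,
     "last_active_at": datetime(2025, 5, 31), "updated_at": datetime(2025, 5, 1)},
    {"email": "b@example.com", "environment": "prod", "locked": True, "confirmed": True,
     "last_active_at": datetime(2025, 5, 30), "updated_at": datetime(2025, 5, 1)},
    {"email": "c@example.com", "environment": "prod", "locked": False, "confirmed": True,
     "last_active_at": datetime(2023, 1, 1), "updated_at": datetime(2025, 5, 1)},
]
curated = curate_user_list(raw, as_of=now)
print([(u["email"], u["updated_at"].date().isoformat()) for u in curated])
```

In this sample only the newer `a@example.com` record survives: the `b` record is locked and the `c` record has been inactive for more than a year.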
Schema
| Column | Type | Description |
|---|---|---|
| environment | string | Connect environment identifier |
| id | string | User GUID (unique identifier) |
| username | string | User’s username |
| email | string | User’s email address |
| first_name | string | User’s first name |
| last_name | string | User’s last name |
| user_role | string | User role: administrator, publisher, or viewer |
| created_at | timestamp | When the user account was created |
| updated_at | timestamp | When the user account was last updated |
| last_active_at | timestamp | User’s last activity timestamp |
| active_today | boolean | Whether the user was active on this date |
| date | date | Partition date (automatically added by Arrow) |
Connect user totals
Path: curated/v2/connect/user_totals/date={YYYY-MM-DD}/chronicle-data.parquet
Counts of named users, active users for the past day and past 30 days, users in each role (viewer, publisher, administrator), and number of licensed seats. Contains a single row per day with global counts.
Purpose
Provides pre-computed user counts for:
- License compliance monitoring
- Historical growth tracking
- Daily active user (DAU) trends
- User role distribution
Key definitions
Named Users (Licensing): Users who are not locked and have been active within the past year. This aligns with the Connect licensing model.
Active Users (Operational): Users counted within specific time windows (30 days, 1 day), excluding locked users, providing visibility into product usage.
Role Counts: Include only named users (active within the past year).
Deduplication
Users are deduplicated by email address across all environments. When multiple records exist, the most recent valid record is used.
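The definitions and deduplication rule above can be sketched as a small aggregation. This is only an illustration under assumptions: the input is taken to be unlocked user records, "most recent" is assumed to mean the largest `updated_at`, and the 30-day and 1-day windows are assumed to be measured back from a reference time `as_of`. The function name `user_totals` is hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def user_totals(users, as_of):
    """Compute one row of totals from unlocked user records (illustrative only)."""
    # Deduplicate by email across environments, keeping the newest record.
    latest = {}
    for user in users:
        kept = latest.get(user["email"])
        if kept is None or user["updated_at"] > kept["updated_at"]:
            latest[user["email"]] = user

    def active_within(days):
        cutoff = as_of - timedelta(days=days)
        return [u for u in latest.values() if u["last_active_at"] >= cutoff]

    named = active_within(365)  # named users: active within the past year
    roles = Counter(u["user_role"] for u in named)  # role counts use named users only
    return {
        "named_users": len(named),
        "active_users_30days": len(active_within(30)),
        "active_users_1day": len(active_within(1)),
        "administrators": roles["administrator"],
        "publishers": roles["publisher"],
        "viewers": roles["viewer"],
    }


def _user(email, role, last_active, updated):
    return {"email": email, "user_role": role,
            "last_active_at": last_active, "updated_at": updated}


sample = [
    _user("a@example.com", "publisher", datetime(2025, 5, 31, 12), datetime(2025, 5, 1)),
    _user("a@example.com", "viewer", datetime(2025, 3, 1), datetime(2025, 2, 1)),
    _user("b@example.com", "viewer", datetime(2025, 5, 15), datetime(2025, 5, 15)),
    _user("c@example.com", "administrator", datetime(2024, 9, 1), datetime(2024, 9, 1)),
    _user("d@example.com", "viewer", datetime(2023, 1, 1), datetime(2023, 1, 1)),
]
totals = user_totals(sample, as_of=datetime(2025, 6, 1))
print(totals)
```

With this sample, the older `a@example.com` record is dropped by deduplication and `d@example.com` falls outside the one-year window, so three named users remain.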
Schema
| Column | Type | Description |
|---|---|---|
| named_users | int64 | Count of users active within the past year (licensing metric) |
| active_users_30days | int64 | Count of users active within the past 30 days |
| active_users_1day | int64 | Count of users active on this specific date |
| administrators | int64 | Count of named users with administrator role |
| publishers | int64 | Count of named users with publisher role |
| viewers | int64 | Count of named users with viewer role |
| licensed_user_seats | int64 | Maximum licensed seats across all environments |
| date | date | Partition date (automatically added by Arrow) |
Connect content list
Path: curated/v2/connect/content_list/date={YYYY-MM-DD}/chronicle-data.parquet
List of all content items across Connect environments. Deduplicated by GUID and environment.
Purpose
Provides a complete content inventory for:
- Content audits and reports
- Resource allocation analysis (CPU, memory, processes)
- Deployment tracking
- Access control reviews
Filtering rules
- Excludes locked content
- Deduplicates by (environment, GUID), keeping the most recent unlocked record
- If the latest record is locked, the content is excluded entirely
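One way to read these rules is "take the latest record per (environment, GUID), then drop it if it is locked". The sketch below implements that reading. It is illustrative only: the `timestamp` field used to order records is hypothetical, and the function name is not part of Chronicle.

```python
from datetime import datetime


def curate_content_list(records):
    """Keep the latest record per (environment, id); drop it if locked (illustrative)."""
    latest = {}
    for rec in records:
        key = (rec["environment"], rec["id"])
        kept = latest.get(key)
        if kept is None or rec["timestamp"] > kept["timestamp"]:
            latest[key] = rec
    # Content whose latest record is locked is excluded entirely.
    return [rec for rec in latest.values() if not rec["locked"]]


records = [
    {"environment": "prod", "id": "g1", "locked": False, "timestamp": datetime(2025, 1, 1)},
    {"environment": "prod", "id": "g1", "locked": False, "timestamp": datetime(2025, 2, 1)},
    {"environment": "prod", "id": "g2", "locked": False, "timestamp": datetime(2025, 1, 1)},
    {"environment": "prod", "id": "g2", "locked": True, "timestamp": datetime(2025, 2, 1)},
]
content = curate_content_list(records)
print([(c["id"], c["timestamp"].date().isoformat()) for c in content])
```

Here `g2` is dropped even though an earlier unlocked record exists, because its latest record is locked.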
Schema
| Column | Type | Description |
|---|---|---|
| environment | string | Connect environment identifier |
| id | string | Content GUID (unique identifier) |
| name | string | Content name (URL-friendly) |
| title | string | Content display title |
| created_time | timestamp | When content was created |
| last_deployed_time | timestamp | When content was last deployed |
| type | string | Content type (e.g., shiny, rmd-static, quarto-static) |
| description | string | Content description |
| access_type | string | Access control type (logged_in, acl, all) |
| locked | boolean | Whether content is locked |
| locked_message | string | Message shown when content is locked |
| connection_timeout | int | Connection timeout in seconds |
| read_timeout | int | Read timeout in seconds |
| init_timeout | int | Initialization timeout in seconds |
| idle_timeout | int | Idle timeout in seconds |
| max_processes | int | Maximum number of processes |
| min_processes | int | Minimum number of processes |
| max_conns_per_process | int | Maximum connections per process |
| load_factor | float64 | Load factor for scaling |
| cpu_request | float64 | CPU request (cores) |
| cpu_limit | float64 | CPU limit (cores) |
| memory_request | int64 | Memory request (bytes) |
| memory_limit | int64 | Memory limit (bytes) |
| amd_gpu_limit | int | AMD GPU limit |
| nvidia_gpu_limit | int | NVIDIA GPU limit |
| bundle_id | string | Current bundle GUID |
| content_category | string | Content category |
| parameterized | boolean | Whether content accepts parameters |
| cluster_name | string | Kubernetes cluster name |
| image_name | string | Container image name |
| default_image_name | string | Default container image |
| default_r_environment_management | boolean | Default R environment management setting |
| default_py_environment_management | boolean | Default Python environment management setting |
| service_account_name | string | Kubernetes service account |
| r_version | string | R version |
| r_environment_management | boolean | R environment management enabled |
| py_version | string | Python version |
| py_environment_management | boolean | Python environment management enabled |
| quarto_version | string | Quarto version |
| run_as | string | Unix user to run as |
| run_as_current_user | boolean | Whether to run as current user |
| owner_guid | string | Owner’s user GUID |
| content_url | string | Content access URL |
| dashboard_url | string | Dashboard URL |
| app_role | string | Application role |
| vanity_url | string | Custom vanity URL |
| tags | list[string] | Content tags |
| extension | boolean | Whether content is an extension |
| date | date | Partition date (automatically added by Arrow) |
Connect content totals
Path: curated/v2/connect/content_totals/date={YYYY-MM-DD}/chronicle-data.parquet
Daily counts of content grouped by type and environment.
Purpose
Provides content distribution metrics for:
- Understanding content type usage
- Tracking content growth by type
- Environment-specific content analysis
Filtering rules
- Excludes locked content
- Deduplicates by (environment, GUID), keeping the most recent record
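Once content has been filtered and deduplicated, the totals amount to a group-and-count over (type, environment). The sketch below is illustrative only, not Chronicle's implementation. It assumes the input is an already-curated list of content items, and the function name `content_totals` is hypothetical.

```python
from collections import Counter


def content_totals(content_items):
    """Count unlocked, deduplicated content items by (type, environment)."""
    counts = Counter((item["type"], item["environment"]) for item in content_items)
    return [
        {"count": n, "type": content_type, "environment": env}
        for (content_type, env), n in sorted(counts.items())
    ]


items = [
    {"type": "shiny", "environment": "prod"},
    {"type": "shiny", "environment": "prod"},
    {"type": "quarto-static", "environment": "prod"},
    {"type": "shiny", "environment": "staging"},
]
rows = content_totals(items)
for row in rows:
    print(row)
```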
Schema
| Column | Type | Description |
|---|---|---|
| count | int64 | Number of content items |
| type | string | Content type |
| environment | string | Connect environment identifier |
| date | date | Partition date (automatically added by Arrow) |
Connect content hits totals
Path: curated/v2/connect/content_hits_totals/date={YYYY-MM-DD}/chronicle-data.parquet
Purpose
Provides pre-computed content hit metrics for:
- User engagement with content
- Access pattern tracking
- Content popularity analysis
Filtering rules
- Counts hits by (environment, content_guid) for the specific date
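The counting rule above, together with the `hits` and `unique_users` columns in this dataset's schema, can be sketched as follows. The sketch assumes a hypothetical list of raw hit events for a single date, each carrying `environment`, `content_guid`, and `user_guid`. Treating a missing `user_guid` as a hit that does not count toward unique users is also an assumption.

```python
from collections import defaultdict


def content_hits_totals(events):
    """Aggregate one day's hit events per (environment, content_guid) (illustrative)."""
    hits = defaultdict(int)
    users = defaultdict(set)
    for event in events:
        key = (event["environment"], event["content_guid"])
        hits[key] += 1
        if event.get("user_guid") is not None:
            users[key].add(event["user_guid"])
    return [
        {"environment": env, "content_guid": guid,
         "hits": hits[(env, guid)], "unique_users": len(users[(env, guid)])}
        for (env, guid) in sorted(hits)
    ]


events = [
    {"environment": "prod", "content_guid": "c1", "user_guid": "u1"},
    {"environment": "prod", "content_guid": "c1", "user_guid": "u1"},
    {"environment": "prod", "content_guid": "c1", "user_guid": "u2"},
    {"environment": "prod", "content_guid": "c2", "user_guid": "u1"},
]
hit_rows = content_hits_totals(events)
for row in hit_rows:
    print(row)
```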
Schema
| Column | Type | Description |
|---|---|---|
| environment | string | Connect environment identifier |
| content_guid | string | Content GUID being accessed |
| hits | int64 | Total number of hits for the content on this date |
| date | date | Partition date (automatically added by Arrow) |
| unique_users | int64 | Total number of unique users who accessed the content on this date |
Connect content hits totals by user
Path: curated/v2/connect/content_hits_totals_by_user/date={YYYY-MM-DD}/chronicle-data.parquet
Purpose
Provides pre-computed content hit metrics by user for:
- User engagement with content
- Access pattern tracking
- Content popularity analysis by user
Filtering rules
- Counts hits by (environment, content_guid, user_guid) for the specific date
Schema
| Column | Type | Description |
|---|---|---|
| environment | string | Connect environment identifier |
| content_guid | string | Content GUID being accessed |
| user_guid | string | User GUID who accessed the content |
| hits | int64 | Total number of hits for the content by this user on this date |
| date | date | Partition date (automatically added by Arrow) |
Workbench user list
Path: curated/v2/workbench/user_list/date={YYYY-MM-DD}/chronicle-data.parquet
List of all named users across Workbench environments. Deduplicated by username and environment.
Purpose
Provides a complete list of users for:
- User directory exports
- Activity analysis
- License compliance
- Understanding user roles
Filtering rules
- Excludes non-active users (status != "Active")
- Excludes users without usernames
- Excludes users without activity data
- Excludes users inactive for more than 1 year
- Deduplicates by (username, environment), keeping the most recent record
Schema
| Column | Type | Description |
|---|---|---|
| environment | string | Workbench environment identifier |
| id | string | User GUID (unique identifier) |
| username | string | User’s username |
| email | string | User’s email address |
| user_role | string | User role: admin, superadmin, or user |
| created_at | timestamp | When the user account was created |
| last_active_at | timestamp | User’s last activity timestamp |
| active_today | boolean | Whether the user was active on this date |
| date | date | Partition date (automatically added by Arrow) |
Workbench user totals
Path: curated/v2/workbench/user_totals/date={YYYY-MM-DD}/chronicle-data.parquet
Counts of named users, active users for the past day and past 30 days, users in each role (user, administrator, superadministrator), and number of licensed seats. Contains a single row per day with global counts.
Purpose
Provides pre-computed user counts for:
- License compliance monitoring
- Historical growth tracking
- Daily active user (DAU) trends
- User role distribution
Key definitions
Named Users (Licensing): Users with Active status who have been active within the past year. This aligns with the Workbench licensing model.
Active Users (Operational): Users counted within specific time windows (30 days, 1 day), providing visibility into product usage.
Role Counts: Include only named users (active within the past year).
Deduplication
Users are deduplicated by username across all environments. When multiple records exist, the most recent valid record is used.
Schema
| Column | Type | Description |
|---|---|---|
| named_users | int64 | Count of users active within the past year (licensing metric) |
| active_users_30days | int64 | Count of users active within the past 30 days |
| active_users_1day | int64 | Count of users active on this specific date |
| administrators | int64 | Count of named users with administrator role |
| super_administrators | int64 | Count of named users with super administrator role |
| users | int64 | Count of named users with standard user role |
| licensed_user_seats | int64 | Maximum licensed seats across all environments |
| date | date | Partition date (automatically added by Arrow) |