Curated Data

Chronicle provides curated datasets that simplify common reporting and analysis tasks. These datasets are pre-processed, deduplicated, and optimized for efficient querying.

Curated data storage location

Curated datasets are stored in the following locations:

By default: /var/lib/posit-chronicle/data/curated/v2/{product}/{dataset}
Custom local storage: {[LocalStorage].Location}/curated/v2/{product}/{dataset}
S3 storage: {[S3Storage].Location/Prefix}/curated/v2/{product}/{dataset}

Where:

{product} is either connect or workbench
{dataset} is the dataset name (e.g., user_list, user_totals)
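
As a minimal sketch of how the path templates above expand for local storage, the helper below joins a storage root with the product and dataset names. The function name `curated_dataset_path` and the `DEFAULT_ROOT` constant are hypothetical and are not part of Chronicle. S3 locations are not handled here.

```python
from pathlib import PurePosixPath

# Default storage root taken from the "By default" path above.
DEFAULT_ROOT = "/var/lib/posit-chronicle/data"


def curated_dataset_path(product, dataset, root=DEFAULT_ROOT):
    """Build the local directory path of a curated dataset (hypothetical helper)."""
    if product not in ("connect", "workbench"):
        raise ValueError(f"unknown product: {product!r}")
    return str(PurePosixPath(root) / "curated" / "v2" / product / dataset)


path = curated_dataset_path("connect", "user_totals")
print(path)
```

Passing a different `root` corresponds to a custom `[LocalStorage].Location` setting.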

Reading curated data

The chronicle.reports R package provides simple functions to read curated data.

Curated data is stored in Apache Parquet format with Hive-style date partitioning and can be read using:

R: Use the arrow package with the open_dataset() function
Python: Use the pandas, pyarrow, or polars packages
DuckDB: Query directly with SQL
Any Parquet-compatible tool

The date partition is automatically available as a column when reading with tools that support hive-style partitioned datasets (like the {arrow} function open_dataset()).
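
To illustrate what Hive-style partitioning means in practice, the sketch below recovers the partition date from a file path using only the Python standard library. Tools such as Arrow do this for you; the function `partition_values` is a hypothetical name used for illustration only.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_values(file_path):
    """Collect key=value directory segments from a Hive-style path."""
    values = {}
    for segment in PurePosixPath(file_path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


example = "curated/v2/connect/user_list/date=2025-01-15/chronicle-data.parquet"
parts = partition_values(example)
partition_date = date.fromisoformat(parts["date"])
print(parts, partition_date)
```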

Example: Reading curated data in R

library(arrow)
library(dplyr)  # provides filter() and collect()

# Read user totals with automatic date column
user_totals <- open_dataset("/var/lib/posit-chronicle/data/curated/v2/connect/user_totals")

# Filter by date range
recent_totals <- user_totals |>
  filter(date >= as.Date("2025-01-01")) |>
  collect()

Example: Reading curated data in Python

import pyarrow.dataset as ds

# Open the content list; partitioning="hive" exposes the date partition as a column
content_list = ds.dataset(
    "/var/lib/posit-chronicle/data/curated/v2/connect/content_list",
    format="parquet",
    partitioning="hive",
)

# Filter by partition date, then convert to a pandas DataFrame
df = content_list.to_table(filter=ds.field("date") >= "2025-01-01").to_pandas()

Automatic backfilling

Chronicle automatically backfills curated datasets for historical dates after an upgrade from Chronicle 2025.08 or earlier.

The backfill process runs in the background after server startup and processes dates in reverse chronological order (most recent first). The server tracks backfill progress in {storage}/upgrade/curation-backfill-state.json. This ensures curated datasets are available for all historical data after upgrading Chronicle.
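
The processing order described above can be sketched as follows. This only illustrates the most-recent-first ordering. The `completed` set is a stand-in for whatever the server records in its state file, whose actual format is not described here, and `backfill_order` is a hypothetical name.

```python
from datetime import date


def backfill_order(available, completed):
    """Return the dates that still need curation, most recent first."""
    return sorted(set(available) - set(completed), reverse=True)


available = [date(2025, 8, 1), date(2025, 8, 3), date(2025, 8, 2)]
completed = {date(2025, 8, 3)}
pending = backfill_order(available, completed)
print(pending)
```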

Available curated datasets

Chronicle provides the following curated datasets:

Posit Connect datasets

Connect User List - List of all named users with details
Connect User Totals - Counts of named users, active users for the past day and past 30 days, users in each role (viewer, publisher, administrator), and number of licensed seats
Connect Content List - List of all content items with configuration
Connect Content Totals - Counts of content items grouped by type and environment
Connect Content Visits Totals by User - Counts of content visits per content item and user
Connect Shiny Usage Totals by User - Counts of Shiny app sessions per app and per user

Posit Workbench datasets

Workbench User List - List of all named users with details
Workbench User Totals - Counts of named users, active users for the past day and past 30 days, users in each role (user, administrator, superadministrator), and number of licensed seats

Connect User List

Path: curated/v2/connect/user_list/date={YYYY-MM-DD}/chronicle-data.parquet

List of all named users across Connect environments. Deduplicated by email and environment.

Purpose

Provides a complete list of users for:

User directory exports
Activity analysis
Cross-referencing with content ownership
Understanding user roles and permissions

Filtering rules

Excludes locked users
Excludes unconfirmed users
Excludes users without email addresses
Excludes users without activity data
Excludes users inactive for more than 1 year
Deduplicates by (email, environment), keeping the most recent record
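
The filtering rules above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not Chronicle's implementation: the raw record fields `locked` and `confirmed` are hypothetical, "most recent" is assumed to mean the latest `updated_at`, and filters are assumed to run before deduplication.

```python
from datetime import datetime, timedelta


def curate_users(records, as_of):
    """Apply the exclusion rules, then keep one record per (email, environment)."""
    cutoff = as_of - timedelta(days=365)
    latest = {}
    for rec in records:
        if rec["locked"] or not rec["confirmed"]:
            continue  # locked or unconfirmed users
        if not rec["email"]:
            continue  # users without email addresses
        if rec["last_active_at"] is None or rec["last_active_at"] < cutoff:
            continue  # no activity data, or inactive for more than 1 year
        key = (rec["email"], rec["environment"])
        if key not in latest or rec["updated_at"] > latest[key]["updated_at"]:
            latest[key] = rec
    return list(latest.values())


def make(email, updated, active, locked=False, confirmed=True):
    return {"email": email, "environment": "prod", "updated_at": updated,
            "last_active_at": active, "locked": locked, "confirmed": confirmed}


as_of = datetime(2025, 6, 1)
records = [
    make("a@example.com", datetime(2025, 5, 1), datetime(2025, 5, 20)),
    make("a@example.com", datetime(2025, 3, 1), datetime(2025, 3, 1)),   # older duplicate
    make("b@example.com", datetime(2025, 5, 1), datetime(2025, 5, 1), locked=True),
    make("c@example.com", datetime(2025, 5, 1), datetime(2023, 1, 1)),   # inactive > 1 year
    make("", datetime(2025, 5, 1), datetime(2025, 5, 1)),                # no email
]
curated = curate_users(records, as_of)
print(len(curated))
```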

Schema

Column	Type	Description
`environment`	string	Connect environment identifier
`id`	string	User GUID (unique identifier)
`username`	string	User’s username
`email`	string	User’s email address
`first_name`	string	User’s first name
`last_name`	string	User’s last name
`user_role`	string	User role: `administrator`, `publisher`, or `viewer`
`created_at`	timestamp	When the user account was created
`updated_at`	timestamp	When the user account was last updated
`last_active_at`	timestamp	User’s last activity timestamp
`active_today`	boolean	Whether the user was active on this date
`date`	date	Partition date (automatically added by Arrow)

Connect User Totals

Path: curated/v2/connect/user_totals/date={YYYY-MM-DD}/chronicle-data.parquet

Counts of named users, active users for the past day and past 30 days, users in each role (viewer, publisher, administrator), and number of licensed seats. Contains a single row per day with global counts.

Purpose

Provides pre-computed user counts for:

License compliance monitoring
Historical growth tracking
Daily active user (DAU) trends
User role distribution

Key definitions

Named Users (Licensing): Users who are not locked and have been active within the past year. This aligns with the Connect licensing model.

Active Users (Operational): Users counted within specific time windows (30 days, 1 day), excluding locked users, providing visibility into product usage.

Role Counts: Include only named users (active within the past year).

Deduplication

Users are deduplicated by email address across all environments. When multiple records exist, the most recent valid record is used.
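
A rough sketch of how these definitions combine into daily counts is shown below. It assumes input records with hypothetical fields `locked` and `last_active` (a date), treats the record with the latest activity as "the most recent valid record", and uses exclusive window boundaries. Chronicle's exact boundary handling is not specified here.

```python
from collections import Counter
from datetime import date, timedelta


def user_totals(users, day):
    """Deduplicate by email, then count named, active, and per-role users."""
    latest = {}
    for u in users:
        if u["locked"] or u["last_active"] is None:
            continue
        prev = latest.get(u["email"])
        if prev is None or u["last_active"] > prev["last_active"]:
            latest[u["email"]] = u
    named = [u for u in latest.values()
             if u["last_active"] > day - timedelta(days=365)]
    roles = Counter(u["user_role"] for u in named)
    return {
        "named_users": len(named),
        "active_users_30days": sum(
            u["last_active"] > day - timedelta(days=30) for u in named),
        "active_users_1day": sum(u["last_active"] == day for u in named),
        "administrators": roles["administrator"],
        "publishers": roles["publisher"],
        "viewers": roles["viewer"],
    }


day = date(2025, 6, 1)
users = [
    {"email": "a", "user_role": "viewer", "last_active": date(2025, 6, 1), "locked": False},
    {"email": "a", "user_role": "viewer", "last_active": date(2025, 5, 1), "locked": False},
    {"email": "b", "user_role": "publisher", "last_active": date(2025, 5, 15), "locked": False},
    {"email": "c", "user_role": "administrator", "last_active": date(2024, 9, 1), "locked": False},
    {"email": "d", "user_role": "viewer", "last_active": date(2023, 1, 1), "locked": False},
    {"email": "e", "user_role": "viewer", "last_active": date(2025, 6, 1), "locked": True},
]
totals = user_totals(users, day)
print(totals)
```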

Schema

Column	Type	Description
`named_users`	int64	Count of users active within the past year (licensing metric)
`active_users_30days`	int64	Count of users active within the past 30 days
`active_users_1day`	int64	Count of users active on this specific date
`administrators`	int64	Count of named users with administrator role
`publishers`	int64	Count of named users with publisher role
`viewers`	int64	Count of named users with viewer role
`licensed_user_seats`	int64	Maximum licensed seats across all environments
`date`	date	Partition date (automatically added by Arrow)
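
For license compliance monitoring, one simple use of this schema is to compare named users against licensed seats. The sketch below works on a single row represented as a plain dictionary. The function `seat_utilization` and the sample values are hypothetical.

```python
def seat_utilization(row):
    """Return the fraction of licensed seats used, or None if no seat count is available."""
    seats = row["licensed_user_seats"]
    if not seats:
        return None
    return row["named_users"] / seats


row = {"named_users": 45, "licensed_user_seats": 60}
utilization = seat_utilization(row)
print(utilization)
```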

Connect Content List

Path: curated/v2/connect/content_list/date={YYYY-MM-DD}/chronicle-data.parquet

List of all content items across Connect environments. Deduplicated by GUID and environment.

Purpose

Provides a complete content inventory for:

Content audits and reports
Resource allocation analysis (CPU, memory, processes)
Deployment tracking
Access control reviews

Filtering rules

Excludes locked content
Deduplicates by (environment, GUID), keeping the most recent record
If that most recent record is locked, the content is excluded entirely
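
One reading of these rules, sketched in Python: pick the latest record per (environment, GUID), then drop the item if that record is locked. The `observed_at` field is a hypothetical ordering key introduced for illustration. This is not Chronicle's actual code.

```python
def curate_content(records):
    """Keep the latest record per (environment, id); drop items whose latest record is locked."""
    latest = {}
    for rec in records:
        key = (rec["environment"], rec["id"])
        if key not in latest or rec["observed_at"] > latest[key]["observed_at"]:
            latest[key] = rec
    return [rec for rec in latest.values() if not rec["locked"]]


records = [
    {"environment": "prod", "id": "c1", "observed_at": 1, "locked": False},
    {"environment": "prod", "id": "c1", "observed_at": 2, "locked": False},
    {"environment": "prod", "id": "c2", "observed_at": 1, "locked": False},
    {"environment": "prod", "id": "c2", "observed_at": 2, "locked": True},  # latest is locked
]
content = curate_content(records)
print(content)
```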

Schema

Column	Type	Description
`environment`	string	Connect environment identifier
`id`	string	Content GUID (unique identifier)
`name`	string	Content name (URL-friendly)
`title`	string	Content display title
`created_time`	timestamp	When content was created
`last_deployed_time`	timestamp	When content was last deployed
`type`	string	Content type (e.g., `shiny`, `rmd-static`, `quarto-static`)
`description`	string	Content description
`access_type`	string	Access control type (`logged_in`, `acl`, `all`)
`locked`	boolean	Whether content is locked
`locked_message`	string	Message shown when content is locked
`connection_timeout`	int	Connection timeout in seconds
`read_timeout`	int	Read timeout in seconds
`init_timeout`	int	Initialization timeout in seconds
`idle_timeout`	int	Idle timeout in seconds
`max_processes`	int	Maximum number of processes
`min_processes`	int	Minimum number of processes
`max_conns_per_process`	int	Maximum connections per process
`load_factor`	float64	Load factor for scaling
`cpu_request`	float64	CPU request (cores)
`cpu_limit`	float64	CPU limit (cores)
`memory_request`	int64	Memory request (bytes)
`memory_limit`	int64	Memory limit (bytes)
`amd_gpu_limit`	int	AMD GPU limit
`nvidia_gpu_limit`	int	NVIDIA GPU limit
`bundle_id`	string	Current bundle GUID
`content_category`	string	Content category
`parameterized`	boolean	Whether content accepts parameters
`cluster_name`	string	Kubernetes cluster name
`image_name`	string	Container image name
`default_image_name`	string	Default container image
`default_r_environment_management`	boolean	Default R environment management setting
`default_py_environment_management`	boolean	Default Python environment management setting
`service_account_name`	string	Kubernetes service account
`r_version`	string	R version
`r_environment_management`	boolean	R environment management enabled
`py_version`	string	Python version
`py_environment_management`	boolean	Python environment management enabled
`quarto_version`	string	Quarto version
`run_as`	string	Unix user to run as
`run_as_current_user`	boolean	Whether to run as current user
`owner_guid`	string	Owner’s user GUID
`content_url`	string	Content access URL
`dashboard_url`	string	Dashboard URL
`app_role`	string	Application role
`vanity_url`	string	Custom vanity URL
`tags`	list[string]	Content tags
`extension`	boolean	Whether content is an extension
`date`	date	Partition date (automatically added by Arrow)

Connect Content Totals

Path: curated/v2/connect/content_totals/date={YYYY-MM-DD}/chronicle-data.parquet

Daily counts of content grouped by type and environment.

Purpose

Provides content distribution metrics for:

Understanding content type usage
Tracking content growth by type
Environment-specific content analysis

Filtering rules

Excludes locked content
Deduplicates by (environment, GUID), keeping the most recent record
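
Once content has been filtered and deduplicated, the totals amount to a grouped count. The sketch below assumes an already curated list of items and produces rows shaped like the schema that follows. The sample data is invented for illustration.

```python
from collections import Counter

# Already filtered and deduplicated content items (sample data).
content_items = [
    {"environment": "prod", "type": "shiny"},
    {"environment": "prod", "type": "shiny"},
    {"environment": "prod", "type": "quarto-static"},
    {"environment": "staging", "type": "shiny"},
]

# Count items per (environment, type) pair.
counts = Counter((c["environment"], c["type"]) for c in content_items)
rows = [
    {"count": n, "type": content_type, "environment": env}
    for (env, content_type), n in sorted(counts.items())
]
print(rows)
```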

Schema

Column	Type	Description
`count`	int64	Number of content items
`type`	string	Content type
`environment`	string	Connect environment identifier
`date`	date	Partition date (automatically added by Arrow)

Connect Content Visits Totals by User

Path: curated/v2/connect/content_visits_totals_by_user/date={YYYY-MM-DD}/chronicle-data.parquet

Counts of content visits per content item and user. Provides visit metrics for each user-content combination.

Purpose

Provides pre-computed visit counts for:

User activity analysis
Content popularity by user
Access pattern tracking
User engagement metrics

Filtering rules

Deduplicates visits by (environment, content_guid, user_guid, path, timestamp)
Handles duplicate reports from multiple Chronicle agent sidecars in HA deployments
Counts unique visit timestamps per user-content-path combination
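
The deduplication described above can be sketched with a set of tuples: identical events reported by more than one agent collapse into one, and the remaining timestamps are counted per user-content-path combination. The event dictionaries and the `timestamp` field name are assumptions made for this example.

```python
from collections import Counter


def visit_totals(events):
    """Count unique visit timestamps per (environment, content_guid, user_guid, path)."""
    unique = {
        (e["environment"], e["content_guid"], e["user_guid"], e["path"], e["timestamp"])
        for e in events
    }
    return Counter(key[:4] for key in unique)


events = [
    {"environment": "prod", "content_guid": "c1", "user_guid": "u1", "path": "/", "timestamp": 100},
    # Same visit reported again by a second agent.
    {"environment": "prod", "content_guid": "c1", "user_guid": "u1", "path": "/", "timestamp": 100},
    {"environment": "prod", "content_guid": "c1", "user_guid": "u1", "path": "/", "timestamp": 200},
    {"environment": "prod", "content_guid": "c1", "user_guid": "u2", "path": "/data", "timestamp": 100},
]
visits = visit_totals(events)
print(visits)
```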

Schema

Column	Type	Description
`environment`	string	Connect environment identifier
`content_guid`	string	Content GUID being visited
`user_guid`	string	User GUID who visited the content
`visits`	int64	Total number of visits (unique timestamps)
`path`	string	URL path accessed within the content
`date`	date	Partition date (automatically added by Arrow)

Connect Shiny Usage Totals by User

Path: curated/v2/connect/shiny_usage_totals_by_user/date={YYYY-MM-DD}/chronicle-data.parquet

Counts of Shiny app sessions per app and per user. Provides session counts and total duration for each user-content combination.

Purpose

Provides pre-computed Shiny usage metrics for:

User engagement with Shiny applications
Session duration analysis
Content usage patterns
Resource utilization tracking

Filtering rules

Deduplicates sessions by (environment, content_guid, user_guid, timestamp)
Handles duplicate reports from multiple Chronicle agent sidecars in HA deployments
Counts unique sessions and durations per user-content combination
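
A comparable sketch for Shiny sessions follows. It assumes each raw session carries a hypothetical `started_at` timestamp and a `duration` in seconds, and that duplicate reports of the same session carry the same duration, so the first one seen is kept.

```python
def shiny_totals(sessions):
    """Deduplicate sessions, then sum counts and durations per user-content pair."""
    unique = {}
    for s in sessions:
        key = (s["environment"], s["content_guid"], s["user_guid"], s["started_at"])
        unique.setdefault(key, s["duration"])  # keep the first report of each session
    totals = {}
    for (env, content, user, _started), duration in unique.items():
        entry = totals.setdefault((env, content, user), {"num_sessions": 0, "duration": 0})
        entry["num_sessions"] += 1
        entry["duration"] += duration
    return totals


sessions = [
    {"environment": "prod", "content_guid": "app1", "user_guid": "u1", "started_at": 100, "duration": 60},
    # Same session reported again by a second agent.
    {"environment": "prod", "content_guid": "app1", "user_guid": "u1", "started_at": 100, "duration": 60},
    {"environment": "prod", "content_guid": "app1", "user_guid": "u1", "started_at": 500, "duration": 30},
    {"environment": "prod", "content_guid": "app1", "user_guid": "u2", "started_at": 100, "duration": 10},
]
usage = shiny_totals(sessions)
print(usage)
```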

Schema

Column	Type	Description
`environment`	string	Connect environment identifier
`content_guid`	string	Shiny content GUID
`user_guid`	string	User GUID who used the Shiny app
`num_sessions`	int64	Total number of unique Shiny sessions
`duration`	int64	Total session duration in seconds
`date`	date	Partition date (automatically added by Arrow)

Workbench User List

Path: curated/v2/workbench/user_list/date={YYYY-MM-DD}/chronicle-data.parquet

List of all named users across Workbench environments. Deduplicated by username and environment.

Purpose

Provides a complete list of users for:

User directory exports
Activity analysis
License compliance
Understanding user roles

Filtering rules

Excludes non-active users (status != "Active")
Excludes users without usernames
Excludes users without activity data
Excludes users inactive for more than 1 year
Deduplicates by (username, environment), keeping the most recent record

Schema

Column	Type	Description
`environment`	string	Workbench environment identifier
`id`	string	User GUID (unique identifier)
`username`	string	User’s username
`email`	string	User’s email address
`user_role`	string	User role: `admin`, `superadmin`, or `user`
`created_at`	timestamp	When the user account was created
`last_active_at`	timestamp	User’s last activity timestamp
`active_today`	boolean	Whether the user was active on this date
`date`	date	Partition date (automatically added by Arrow)

Workbench User Totals

Path: curated/v2/workbench/user_totals/date={YYYY-MM-DD}/chronicle-data.parquet

Counts of named users, active users for the past day and past 30 days, users in each role (user, administrator, superadministrator), and number of licensed seats. Contains a single row per day with global counts.

Purpose

Provides pre-computed user counts for:

License compliance monitoring
Historical growth tracking
Daily active user (DAU) trends
User role distribution

Key definitions

Named Users (Licensing): Users with Active status who have been active within the past year. This aligns with the Workbench licensing model.

Active Users (Operational): Users counted within specific time windows (30 days, 1 day), providing visibility into product usage.

Role Counts: Include only named users (active within the past year).

Deduplication

Users are deduplicated by username across all environments. When multiple records exist, the most recent valid record is used.

Schema

Column	Type	Description
`named_users`	int64	Count of users active within the past year (licensing metric)
`active_users_30days`	int64	Count of users active within the past 30 days
`active_users_1day`	int64	Count of users active on this specific date
`administrators`	int64	Count of named users with administrator role
`super_administrators`	int64	Count of named users with super administrator role
`users`	int64	Count of named users with standard user role
`licensed_user_seats`	int64	Maximum licensed seats across all environments
`date`	date	Partition date (automatically added by Arrow)