Using Chronicle data
Chronicle stores the data it produces in parquet files. The Reports included with Chronicle are the easiest way to access data for most users. If you want to enhance the reports, write your own reports, or use this data for other purposes, this section describes how to access the data that Chronicle stores. You can also reference the code that is in the report QMD files.
Data Directory Structure
The Chronicle data directory is organized into a few subdirectories:
/var/lib/posit-chronicle/data
    /private
    /hourly
        /v1
            /<metric-name>
    /daily
        /v1
            /<metric-name>
The private directory contains transient data. This data is short-lived and should not be accessed by users.
Every hour, the private data is processed and stored in the hourly directory. This data is minimally processed and relatively high volume. It includes "duplicate" values, where a metric does not change over a period of time. This data can be used for custom reporting, but reports must query it efficiently due to the volume of data.
Every day, the hourly data is further processed and aggregated into the daily directory. This processing eliminates duplicate values and significantly reduces the data volume. The specific nature of this aggregation varies by metric. The aggregation strategies are described below. The daily data is used by Chronicle reports, and it can also be used for custom reporting.
The structure within hourly and daily is identical. Each contains one or more top-level vN subdirectories to delineate different versions of Chronicle's internal data schema for each metric. Individual metrics are stored under the appropriate version directory. Within each metric directory, data is organized by the date/time of when it was gathered.
The following is a complete example. Note that daily data is stored in one directory per day, while hourly data is stored in one directory per hour.
├── daily
│   └── v1
│       ├── connect_content
│       │   └── 2024
│       │       └── 12
│       │           ├── 01
│       │           │   └── connect_content.parquet
│       │           ├── 02
│       │           │   └── connect_content.parquet
│       │           └── ...
│       └── connect_license_active_users
│           └── 2024
│               └── 12
│                   ├── 01
│                   │   └── connect_license_active_users.parquet
│                   ├── 02
│                   │   └── connect_license_active_users.parquet
│                   └── ...
└── hourly
    └── v1
        ├── connect_content
        │   └── 2024
        │       └── 12
        │           ├── 01
        │           │   ├── 00
        │           │   │   └── connect_content.parquet
        │           │   ├── 01
        │           │   │   └── connect_content.parquet
        │           │   ├── ...
        │           │   └── 23
        │           │       └── connect_content.parquet
        │           └── 02
        │               ├── 00
        │               │   └── connect_content.parquet
        │               ├── 01
        │               │   └── connect_content.parquet
        │               ├── ...
        │               └── 23
        │                   └── connect_content.parquet
        └── connect_license_active_users
            └── 2024
                └── 12
                    ├── 01
                    └── ...
Reading parquet data
While parquet files are similar in concept to csv files, they are optimized for read/write performance and are therefore unreadable in most text editors without the help of plugins.
The RStudio IDE is a great place to read parquet files and to run the R and Python scripts below. See the Posit download page if you would like to install R and RStudio (open source Desktop edition).
If you are using VSCode, our team recommends the Parquet Explorer plugin to read and query parquet files directly in your editor.
Another common trick is to convert .parquet files into .csv files for easier viewing, leveraging Python and the pandas library:
Terminal
>>> import pandas as pd
>>> df = pd.read_parquet('filename.parquet')
>>> df.to_csv('filename.csv')
Using R
These scripts have been tested with R version 4.2.3. If you run into errors installing packages, you may need to upgrade your R version; in particular, arrow may fail to install if the R version is too old.
The examples in the Apache Arrow documentation on reading parquet files show how to read data stored locally or in S3 into an arrow table class.
Opening local Chronicle data with R
You can read the May, 2024 partition of Chronicle's parquet data into an arrow table with the following:
library(arrow)
# Collecting user data from a file
base_path <- '/var/lib/posit-chronicle/data/hourly'
users <- arrow::open_dataset(paste0(base_path, "/v1/users/2024/05/"))
Opening Chronicle data in S3 with R
You can read parquet contents for May 2024 from an S3 bucket into an arrow table with the following:
# Imports
library(arrow)
library(paws)
library(urltools)

# Set s3 bucket ----
s3_bucket <- "s3://{{YOUR_BUCKET_NAME}}"
svc <- s3(config = list(region = "us-east-2"))
bucket_str <- svc$list_objects(Bucket = urltools::domain(s3_bucket))

# Collecting user data
users_bucket <- paste0(s3_bucket, "/hourly/v1/users/2024/05")
users <- open_dataset(users_bucket,
                      hive_style = FALSE,
                      format = "parquet")
Querying Chronicle data with R
Once you have run one of the above to bring your users parquet data into an arrow table, you can begin querying it:
library(arrow)
library(tidyverse)
# Viewing user data
users_head <- head(users, 5) |>
  collect()
print(users_head)
Metrics Generated by Chronicle
Metrics are gathered and processed on a scheduled basis. This means that you may not see metrics files immediately when first starting Chronicle. It also means that there is a delay before the latest data shows up in the refined metrics files.
By default, the agent retrieves metrics data once every 60 seconds. The metrics data is processed into refined metrics once an hour. This process happens shortly after the top of the hour. The exact timing is not entirely predictable due to processing delays, but the refinement process typically completes by 15 minutes after the top of the hour.
Approaches to aggregation
Based on the type of metric represented in the data, metrics are aggregated according to one of the following strategies. Each individual metric type listed below includes an indication of which aggregation approach is employed to aggregate its data, or N/A if that metric is not currently aggregated.
The examples below reflect an aggregation of this data series:
Timestamp | Value |
---|---|
01:00 | 12 |
01:01 | 12 |
01:02 | 12 |
01:03 | 13 |
01:04 | 15 |
01:05 | 15 |
01:06 | 15 |
01:07 | 15 |
01:08 | 16 |
- Deduplication Aggregation: with this approach, a value is retained if it is either the first or last observation in a consecutive run of that value. With the example dataset above, this approach would aggregate the series to:
Timestamp | Value |
---|---|
01:00 | 12 |
01:02 | 12 |
01:03 | 13 |
01:04 | 15 |
01:07 | 15 |
01:08 | 16 |
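The deduplication rule can be sketched in a few lines of Python (an illustration of the strategy, not Chronicle's actual code): a row survives if it is the first or last observation in a consecutive run of equal values.

```python
def deduplicate(series):
    """Keep a (timestamp, value) pair only if it is the first or last
    observation in a consecutive run of that value."""
    kept = []
    for i, (ts, value) in enumerate(series):
        first = i == 0 or series[i - 1][1] != value
        last = i == len(series) - 1 or series[i + 1][1] != value
        if first or last:
            kept.append((ts, value))
    return kept

# The example data series from above.
series = [
    ("01:00", 12), ("01:01", 12), ("01:02", 12),
    ("01:03", 13),
    ("01:04", 15), ("01:05", 15), ("01:06", 15), ("01:07", 15),
    ("01:08", 16),
]
print(deduplicate(series))
# [('01:00', 12), ('01:02', 12), ('01:03', 13),
#  ('01:04', 15), ('01:07', 15), ('01:08', 16)]
```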
- Delta Aggregation: with this approach, only the difference between consecutive values is considered, and a value is retained only if the difference is not 0. With the example dataset above, this approach would aggregate the series to:
Timestamp | Value |
---|---|
01:00 | 0 |
01:03 | 1 |
01:04 | 2 |
01:08 | 1 |
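A sketch of the delta rule in Python (again an illustration, not Chronicle's code). Note that, as in the table above, the first observation is kept as a baseline with a delta of 0; after that, a row is kept only when its value differs from the previous one.

```python
def delta_aggregate(series):
    """Reduce (timestamp, value) pairs to the points where the value
    changes, storing the size of the change; the first observation is
    kept as a baseline with a delta of 0."""
    kept = []
    for i, (ts, value) in enumerate(series):
        if i == 0:
            kept.append((ts, 0))
        elif value != series[i - 1][1]:
            kept.append((ts, value - series[i - 1][1]))
    return kept

# The example data series from above.
series = [
    ("01:00", 12), ("01:01", 12), ("01:02", 12),
    ("01:03", 13),
    ("01:04", 15), ("01:05", 15), ("01:06", 15), ("01:07", 15),
    ("01:08", 16),
]
print(delta_aggregate(series))
# [('01:00', 0), ('01:03', 1), ('01:04', 2), ('01:08', 1)]
```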
Available metrics
The following is a non-exhaustive list of the product metrics Chronicle produces from Posit Connect, Posit Package Manager, and Posit Workbench. These metrics are stored in separate subfolders of the configured storage location, which is /var/lib/posit-chronicle/data by default.
All metrics files include the following columns:
Name | Description |
---|---|
timestamp | The time in UTC when the observation was recorded by the Chronicle agent. |
type | The metric type (gauge, sum, etc.). |
cluster | Reserved for future use. This column is always empty. |
environment | A user-defined environment label set via the agent configuration. See the Advanced Agent Configuration appendix for setup instructions. |
service | The source of the metric. One of connect, package-manager, or workbench. |
host | The host name where the Chronicle agent that reported the metric observation is running. |
os | Detailed operating system information for the host on which the Chronicle agent that reported this observation is running. |
Refined metrics
In addition to the columns described above, each refined metric includes a column called value, which contains the value observed regardless of the underlying numeric type.
Each of these refined metrics is stored in a separate subfolder named after the refined metric. For example, the data related to the connect_content_hits_total refined metric is stored in the v1/connect_content_hits_total subfolder of the configured Chronicle storage location.
connect_build_info
Build information for Connect. NOTE: The value for this metric is always 1.
- Subfolder:
v1/connect_build_info
- Metric type: gauge
- Requirements: A valid administrator Connect API key.
- Aggregation strategy: Deduplication Aggregation
- Additional columns:
  - version: the current version of Connect
  - build: the version with the build commit hash appended
connect_content
The current number of content items published in Connect.
- Subfolder:
v1/connect_content
- Metric type: gauge
- Requirements: Connect 2024.02.0 or later with metrics enabled.
- Aggregation strategy: Deduplication Aggregation
- Additional columns:
  - content_type: the content type of the item visited
connect_content_app_sessions_current
The current number of active user sessions on a given piece of Shiny content.
- Subfolder:
v1/connect_content_app_sessions_current
- Metric type: gauge
- Requirements: Connect 2024.02.0 or later with metrics enabled; some columns (annotated with * below) also require a valid administrator Connect API key.
- Aggregation strategy: Deduplication Aggregation
- Additional columns:
  - content_id: the internal ID of the content item visited
  - user_id: the internal ID of the user who visited the content item
  - content_name*: the internal name of the content item visited
  - content_title*: the user-visible title of the content item visited
  - content_type*: the content type of the item visited
  - user_name*: the username of the user who visited the content item
connect_content_hits_total
The running total of user visits to a specific piece of content.
- Subfolder:
v1/connect_content_hits_total
- Metric type: sum
- Requirements: Connect 2024.02.0 or later with metrics enabled; some columns (annotated with * below) also require a valid administrator Connect API key.
- Aggregation strategy: Delta Aggregation
- Additional columns:
  - content_id: the internal ID of the content item visited
  - user_id: the internal ID of the user who visited the content item
  - content_name*: the internal name of the content item visited
  - content_title*: the user-visible title of the content item visited
  - content_type*: the content type of the item visited
  - user_name*: the username of the user who visited the content item
connect_installed_versions_python
A count of the versions of Python which are currently installed.
- Subfolder:
v1/connect_installed_versions_python
- Metric type: gauge
- Requirements: A valid administrator Connect API key.
- Aggregation strategy: Deduplication Aggregation
- Additional columns:
  - versions: a list of the versions which are installed.
connect_installed_versions_r
A count of the versions of R which are currently installed.
- Subfolder:
v1/connect_installed_versions_r
- Metric type: gauge
- Requirements: A valid administrator Connect API key.
- Aggregation strategy: Deduplication Aggregation
- Additional columns:
  - versions: a list of the versions which are installed.
connect_license_active_users
The current number of users consuming license seats in Connect.
- Subfolder:
v1/connect_license_active_users
- Metric type: gauge
- Requirements: A valid administrator Connect API key.
- Aggregation strategy: Deduplication Aggregation
- Additional columns: None
connect_license_user_seats
The total number of licensed seats allowed in Connect.
- Subfolder:
v1/connect_license_user_seats
- Metric type: gauge
- Requirements: A valid administrator Connect API key.
- Aggregation strategy: Deduplication Aggregation
- Additional columns: None
connect_users
A metric used to capture a list of users in Connect. The value of this metric is always 1.
- Subfolder:
v1/connect_users
- Metric type: gauge
- Requirements: A valid administrator Connect API key.
- Aggregation strategy: Deduplication Aggregation
- Additional columns:
  - id: The ID of the user.
  - username: The username of the user (the name they use when logging in).
  - email: The email address of the user.
  - first_name: The first name of the user.
  - last_name: The last name of the user.
  - role: The role of the user (e.g., publisher, viewer).
  - created_at: The timestamp when the user was created.
  - updated_at: The timestamp when the user was most recently updated.
  - last_active_at: The timestamp when the user was most recently active (logged in) in Posit Connect.
pwb_license_active_users
The current number of users consuming license seats in Workbench.
- Subfolder:
v1/pwb_license_active_users
- Metric type: gauge
- Requirements: Workbench 2024.04.0 or later with metrics enabled.
- Aggregation strategy: Deduplication Aggregation
- Additional columns: None
pwb_license_user_seats
The total number of licensed seats allowed in Workbench.
- Subfolder:
v1/pwb_license_user_seats
- Metric type: gauge
- Requirements: Workbench 2024.04.0 or later with metrics enabled.
- Aggregation strategy: Deduplication Aggregation
- Additional columns: None
pwb_build_info
Build information for RStudio Server/Workbench. NOTE: The value for this metric is always 1.
- Subfolder:
v1/pwb_build_info
- Metric type: gauge
- Requirements: Workbench 2024.04.0 or later with metrics enabled.
- Aggregation strategy: Deduplication Aggregation
- Additional columns:
  - version: the current version of Workbench
  - release_name: the release name of the Workbench version
pwb_session_startup_duration_seconds_bucket
A running total of counts of session startup durations. These counts are divided into buckets based on the startup duration. Each bucket has a duration threshold called a “limit”, and the value for a given limit indicates how many sessions started up in a duration less than or equal to that limit, and greater than the next smallest limit.
For example, if Workbench reported these 5 session startup durations:
- 8 seconds
- 3 seconds
- 42 seconds
- 4 seconds
- 325 seconds
The stored histogram bucket values would look like this:
value | limit |
---|---|
0 | 0.0 |
0 | 1.0 |
2 | 5.0 |
1 | 10.0 |
0 | 30.0 |
1 | 60.0 |
0 | 300.0 |
1 | Infinity |
The row with limit 5.0 reports a count of 2 as its value (representing the 3 and 4 second durations), the row with limit 10.0 reports a count of 1 (the 8 second duration), and so on.
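This bucketing can be reproduced with a short Python sketch (an illustration of the histogram semantics described above, not Chronicle's implementation): each duration is counted in the smallest bucket whose limit is greater than or equal to it.

```python
import math
from bisect import bisect_left

# Bucket limits from the example table (upper bound of each bucket).
limits = [0.0, 1.0, 5.0, 10.0, 30.0, 60.0, 300.0, math.inf]
# The five example session startup durations, in seconds.
durations = [8, 3, 42, 4, 325]

# Count each duration in the smallest bucket whose limit is >= it.
counts = [0] * len(limits)
for d in durations:
    counts[bisect_left(limits, d)] += 1

for value, limit in zip(counts, limits):
    print(value, limit)
# counts == [0, 0, 2, 1, 0, 1, 0, 1], matching the table above
```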
- Subfolder:
v1/pwb_session_startup_duration_seconds_bucket
- Metric type: histogram
- Requirements: Workbench 2024.04.0 or later with metrics enabled.
- Aggregation strategy: N/A
- Additional columns:
  - limit: The time in seconds which is the upper bound of the associated bucket and the lower bound of the bucket with the next limit value.
  - session_type: The type of session (e.g., vscode, rstudio-pro, etc.) launched by the user.
pwb_session_startup_duration_seconds_count
A running total of the number of sessions launched in Workbench.
- Subfolder:
v1/pwb_session_startup_duration_seconds_count
- Metric type: sum
- Requirements: Workbench 2024.04.0 or later with metrics enabled.
- Aggregation strategy: N/A
- Additional columns:
  - session_type: The type of session (e.g., vscode, rstudio-pro, etc.) launched by the user.
pwb_session_startup_duration_seconds_sum
A running total of all session startup time in Workbench.
- Subfolder:
v1/pwb_session_startup_duration_seconds_sum
- Metric type: sum
- Requirements: Workbench 2024.04.0 or later with metrics enabled.
- Aggregation strategy: N/A
- Additional columns:
  - session_type: The type of session (e.g., vscode, rstudio-pro, etc.) launched by the user.
pwb_sessions_launched_total
A running total of all sessions launched in Workbench.
- Subfolder:
v1/pwb_sessions_launched_total
- Metric type: sum
- Requirements: Workbench 2024.04.0 or later with metrics enabled.
- Aggregation strategy: Delta Aggregation
- Additional columns:
  - session_type: The type of session (e.g., vscode, rstudio-pro, etc.) launched by the user.
pwb_jobs_launched_total
A running total of all jobs launched in Workbench.
- Subfolder:
v1/pwb_jobs_launched_total
- Metric type: sum
- Requirements: Workbench 2024.09.0 or later with metrics enabled.
- Aggregation strategy: Delta Aggregation
- Additional columns:
  - job_type: The type of job (e.g., r) launched by the user.
pwb_users
A list of all users in Workbench. NOTE: The value for this metric is always 1.
- Subfolder:
v1/pwb_users
- Metric type: gauge
- Requirements: Workbench 2024.12.0 or later with a valid administrator API key.
- Aggregation strategy: Deduplication Aggregation
- Additional columns:
  - id: The UID of the Workbench user.
  - guid: The GUID of the Workbench user.
  - username: The username of the Workbench user.
  - email: The email address of the Workbench user.
  - status: The status of the Workbench user (Active, Inactive).
  - is_admin: True if the Workbench user is an administrator.
  - is_super_admin: True if the Workbench user is an administrator superuser.
  - role: The role of the Workbench user (User, Administrator, Superuser).
  - last_active_at: Timestamp of the Workbench user's last sign-in.
  - created_at: Timestamp when the Workbench user was created.