Connect OpenTelemetry operational guide
Core philosophy: “Admins want to know that something is wrong before their users do.”
This guide helps operations teams use OpenTelemetry metrics, traces, and logs to monitor Posit Connect health and diagnose performance issues. The guidance is organized around common operational questions rather than signal types.
Overview
OpenTelemetry provides comprehensive observability for Posit Connect through three complementary signal types:
- Metrics - Quantitative measurements of system behavior (connection pool utilization, request latency, active users)
- Traces - Distributed execution paths showing high-level lifecycle events (scheduled executions, queue operations, content deployments)
- Logs - Structured event records with contextual information
This guide focuses on the most critical operational questions that Connect administrators face daily. Each question includes specific metrics to check, query patterns to use, and interpretation guidance to help you quickly identify and resolve issues.
For a complete catalog of every metric, trace span, and attribute that Connect emits, see the signal reference. For alert threshold guidance, see the alerting recommendations.
If you are setting up OpenTelemetry for the first time, see the getting started guide.
Prerequisites
- Posit Connect 2026.02.0 or later
- OpenTelemetry instrumentation enabled in configuration
- An observability platform that supports OpenTelemetry (Datadog, Grafana, etc.) if you want to export signals beyond the diagnostic bundle (optional, but highly recommended)
Configuration
Enable OpenTelemetry instrumentation in your rstudio-connect.gcfg:
```ini
[OpenTelemetry]
Enabled = true
```

This enables collection of all signals (traces, logs, and metrics). By default, Connect persists signals to disk for local debugging and diagnostic bundles. To export to your observability platform’s local collector, configure one or more [OTLPEndpoint] sections:
```ini
[OTLPEndpoint "mybackend"]
Endpoint = http://collector:4318
```

See the getting started guide for common configuration patterns and the configuration reference for full details on available settings.
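Taken together, a minimal configuration that keeps local persistence and also exports to a collector might look like the following sketch (the section label `"mybackend"` and the endpoint URL are placeholders to adapt to your environment):

```ini
[OpenTelemetry]
Enabled = true

[OTLPEndpoint "mybackend"]
Endpoint = http://collector:4318
```

You can declare additional `[OTLPEndpoint "..."]` sections to export the same signals to more than one destination.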
Trace visibility
Connect exports high-level lifecycle traces to your Application Performance Monitoring (APM) system:
- HTTP requests and API operations — All HTTP endpoints, service request handling, and render request timing
- Scheduled execution lifecycle — Schedule triggers and execution workflow
- Queue operations — Item enqueue and processing lifecycle (with wait time and processing duration)
- Worker provisioning and lifecycle — Worker creation, process startup, full lifecycle tracking, and connection acceptance timing
- Report execution and process startup — Report execution phases, setup, and process startup timing (separating startup from execution)
- Launcher job lifecycle — Off-host execution setup and submission for Kubernetes deployments
- Content deployments — Deployment operations and content launches
- Email operations — Email send operations
Contact Posit Support if you need assistance with trace analysis.
Health and performance
Is Connect healthy right now? — Service availability, database health, schedule success, application stability. Health monitoring →
Why is Connect slow right now? — Database performance, API latency, infrastructure bottlenecks, trace-based root cause analysis. Performance troubleshooting →
Job queue operations
Are scheduled jobs running? — Job completion health, worker availability, timeout monitoring. Job execution health →
Why is this scheduled job taking longer than usual? — Duration analysis, queue wait times, trace investigation. Job duration analysis →
Is the job queue backing up? — Queue size and age monitoring, drain time estimation. Queue backup detection →
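As a rough illustration of the drain-time estimation mentioned above: when completions outpace arrivals, the backlog divides by the net drain rate; otherwise the queue is growing and will not drain. This is a back-of-envelope sketch with made-up numbers, not a Connect API:

```python
def estimated_drain_minutes(queue_size, completions_per_min, arrivals_per_min):
    """Estimate minutes to drain a queue backlog.

    All inputs are per-minute rates observed from your queue metrics;
    the specific metric names depend on your platform.
    """
    net_rate = completions_per_min - arrivals_per_min
    if net_rate <= 0:
        # Arrivals match or exceed completions: the backlog is not shrinking.
        return float("inf")
    return queue_size / net_rate

# 120 queued items, completing 10/min while 4/min arrive -> 20 minutes.
print(estimated_drain_minutes(120, 10, 4))
```

If the estimate trends upward over successive samples, the queue is backing up even if its absolute size still looks manageable.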
Content operations
Which content is using the most resources? — Job pressure, process counts, host metric correlation. Resource investigation →
How do I route content failures to owners? — Owner context in failure events, alert routing patterns. Owner alerting →
Why did this content fail? — Failure detection, log investigation, trace correlation. Failure investigation →
License and capacity
Are we approaching license limits? — Named user utilization, Shiny user limits, license expiration. License capacity monitoring →
Are we rejecting users due to capacity? — Rejection rates, breakdown by reason and type, investigation workflow. Request rejection monitoring →
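To illustrate the rejection-rate calculation referenced above, here is a minimal sketch with invented counts (how you obtain accepted and rejected counts depends on your platform's query syntax):

```python
def rejection_rate(rejected, accepted):
    """Fraction of requests rejected out of all requests seen."""
    total = rejected + accepted
    return rejected / total if total else 0.0

# 15 rejections against 485 accepted requests in the same window.
rate = rejection_rate(rejected=15, accepted=485)
print(f"{rate:.1%}")
```

A sustained nonzero rate, broken down by rejection reason, is the signal to investigate capacity or license limits.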
Query pattern translation
Throughout the detailed guides, you’ll find Query Pattern sections using platform-agnostic pseudo-code alongside Datadog and Grafana equivalents. The pseudo-query patterns describe the logical operations needed, which you can translate to any platform’s syntax.
For example, a pattern like:

```
P95 of (db.sql.latency) over 5 minutes
```

translates to different syntax depending on your platform:

- Datadog:

  ```
  sum:db.sql.latency.bucket{*} by {upper_bound}.as_count()
  ```

- Grafana — Prometheus Query Language (PromQL):

  ```promql
  histogram_quantile(0.95, rate(db_sql_latency_milliseconds_bucket[5m]))
  ```
The patterns describe what to calculate; your observability platform’s query syntax determines how to calculate it.
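To make the PromQL example above concrete, here is a small sketch (with made-up bucket data) of the linear interpolation that `histogram_quantile` performs over cumulative histogram buckets:

```python
def histogram_quantile(q, buckets):
    """Estimate the q-th quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound.
    Mirrors Prometheus-style interpolation: the lowest bucket is assumed
    to start at 0, and values are spread uniformly within each bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation within the bucket containing the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Illustrative latency buckets (ms): 70 requests finished within 10 ms,
# 90 within 50 ms, 98 within 100 ms, all 100 within 250 ms.
buckets = [(10, 70), (50, 90), (100, 98), (250, 100)]
print(histogram_quantile(0.95, buckets))  # P95 falls in the 50-100 ms bucket
```

The estimate's accuracy depends on bucket granularity, which is why histogram-based percentiles from any backend are approximations rather than exact values.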
Starter dashboards
Pre-built dashboards covering health, performance, job queues, content operations, license capacity, and request rejections:
- Grafana dashboard — Import via Dashboards > New > Import in Grafana
- Datadog dashboard — Import via Dashboards > New Dashboard > Import in Datadog
These dashboards use the queries documented throughout this guide. Customize thresholds and layouts to match your environment.