Connect OpenTelemetry operational guide
Core philosophy: “Admins want to know that something is wrong before their users do.”
This guide helps operations teams use OpenTelemetry metrics, traces, and logs to monitor Posit Connect health and diagnose performance issues. The guidance is organized around common operational questions rather than signal types.
Overview
OpenTelemetry provides comprehensive observability for Posit Connect through three complementary signal types:
- Metrics - Quantitative measurements of system behavior (connection pool utilization, request latency, active users)
- Traces - Distributed execution paths showing high-level lifecycle events (scheduled executions, queue operations, content deployments)
- Logs - Structured event records with contextual information
This guide focuses on the most critical operational questions that Connect administrators face daily. Each question includes specific metrics to check, query patterns to use, and interpretation guidance to help you quickly identify and resolve issues.
For a complete catalog of every metric, trace span, and attribute that Connect emits, see the signal reference. For alert threshold guidance, see the alerting recommendations.
If you are setting up OpenTelemetry for the first time, see the getting started guide.
Prerequisites
- Posit Connect 2026.02.0 or later
- OpenTelemetry instrumentation enabled in configuration
- An observability platform that supports OpenTelemetry (Datadog, Grafana, etc.) if you want to export signals beyond the diagnostic bundle (optional, but highly recommended)
Configuration
Enable OpenTelemetry instrumentation in your rstudio-connect.gcfg:
```ini
[OpenTelemetry]
Enabled = true
```

This enables collection of all signals (traces, logs, and metrics). By default, Connect persists signals to disk for local debugging and diagnostic bundles. To export to your observability platform’s local collector, configure one or more [OTLPEndpoint] sections:
```ini
[OTLPEndpoint "mybackend"]
Endpoint = http://collector:4318
```

See the getting started guide for common configuration patterns and the configuration reference for full details on available settings.
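Taken together, a minimal configuration that keeps local persistence and also exports to a collector might look like the following sketch (the section label `"mybackend"` and the endpoint URL are placeholders to adapt to your environment):

```ini
[OpenTelemetry]
Enabled = true

[OTLPEndpoint "mybackend"]
Endpoint = http://collector:4318
```

You can declare additional `[OTLPEndpoint "..."]` sections to export the same signals to more than one destination.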
Trace visibility
Connect exports high-level lifecycle traces to your Application Performance Monitoring (APM) system:
- HTTP requests and API operations — All HTTP endpoints, service request handling, and render request timing
- Scheduled execution lifecycle — Schedule triggers and execution workflow
- Queue operations — Item enqueue and processing lifecycle (with wait time and processing duration)
- Worker provisioning and lifecycle — Worker creation, process startup, full lifecycle tracking, and connection acceptance timing
- Report execution and process startup — Report execution phases, setup, and process startup timing (separating startup from execution)
- Launcher job lifecycle — Off-host execution setup and submission for Kubernetes deployments
- Content deployments — Deployment operations and content launches
- Email operations — Email send operations
Contact Posit Support if you need assistance with trace analysis.
Health and performance
Is Connect healthy right now? — Service availability, database health, schedule success, application stability. Health monitoring →
Why is Connect slow right now? — Database performance, API latency, infrastructure bottlenecks, trace-based root cause analysis. Performance troubleshooting →
Job queue operations
Are scheduled jobs running? — Job completion health, worker availability, timeout monitoring. Job execution health →
Why is this scheduled job taking longer than usual? — Duration analysis, queue wait times, trace investigation. Job duration analysis →
Is the job queue backing up? — Queue size and age monitoring, drain time estimation. Queue backup detection →
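As a rough illustration of the drain-time estimation mentioned above: when completions outpace arrivals, the backlog divides by the net drain rate; otherwise the queue is growing and will not drain. This is a back-of-envelope sketch with made-up numbers, not a Connect API:

```python
def estimated_drain_minutes(queue_size, completions_per_min, arrivals_per_min):
    """Estimate minutes to drain a queue backlog.

    All inputs are per-minute rates observed from your queue metrics;
    the specific metric names depend on your platform.
    """
    net_rate = completions_per_min - arrivals_per_min
    if net_rate <= 0:
        # Arrivals match or exceed completions: the backlog is not shrinking.
        return float("inf")
    return queue_size / net_rate

# 120 queued items, completing 10/min while 4/min arrive -> 20 minutes.
print(estimated_drain_minutes(120, 10, 4))
```

If the estimate trends upward over successive samples, the queue is backing up even if its absolute size still looks manageable.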
Content operations
Which content is using the most resources? — Job pressure, process counts, host metric correlation. Resource investigation →
How do I route content failures to owners? — Owner context in failure events, alert routing patterns. Owner alerting →
Why did this content fail? — Failure detection, log investigation, trace correlation. Failure investigation →
License and capacity
Are we approaching license limits? — Named user utilization, Shiny user limits, license expiration. License capacity monitoring →
Are we rejecting users due to capacity? — Rejection rates, breakdown by reason and type, investigation workflow. Request rejection monitoring →
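To illustrate the rejection-rate calculation referenced above, here is a minimal sketch with invented counts (how you obtain accepted and rejected counts depends on your platform's query syntax):

```python
def rejection_rate(rejected, accepted):
    """Fraction of requests rejected out of all requests seen."""
    total = rejected + accepted
    return rejected / total if total else 0.0

# 15 rejections against 485 accepted requests in the same window.
rate = rejection_rate(rejected=15, accepted=485)
print(f"{rate:.1%}")
```

A sustained nonzero rate, broken down by rejection reason, is the signal to investigate capacity or license limits.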
Query pattern translation
Throughout the detailed guides, you’ll find Query Pattern sections using platform-agnostic pseudo-code alongside Datadog and Grafana equivalents. The pseudo-query patterns describe the logical operations needed, which you can translate to any platform’s syntax.
For example, a pattern like:

```
P95 of (db.sql.latency) over 5 minutes
```

translates to different syntax depending on your platform:

- Datadog:

  ```
  sum:db.sql.latency.bucket{*} by {upper_bound}.as_count()
  ```

- Grafana — Prometheus Query Language (PromQL):

  ```promql
  histogram_quantile(0.95, rate(db_sql_latency_milliseconds_bucket[5m]))
  ```
The patterns describe what to calculate; your observability platform’s query syntax determines how to calculate it.
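To make the PromQL example above concrete, here is a small sketch (with made-up bucket data) of the linear interpolation that `histogram_quantile` performs over cumulative histogram buckets:

```python
def histogram_quantile(q, buckets):
    """Estimate the q-th quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound.
    Mirrors Prometheus-style interpolation: the lowest bucket is assumed
    to start at 0, and values are spread uniformly within each bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation within the bucket containing the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Illustrative latency buckets (ms): 70 requests finished within 10 ms,
# 90 within 50 ms, 98 within 100 ms, all 100 within 250 ms.
buckets = [(10, 70), (50, 90), (100, 98), (250, 100)]
print(histogram_quantile(0.95, buckets))  # P95 falls in the 50-100 ms bucket
```

The estimate's accuracy depends on bucket granularity, which is why histogram-based percentiles from any backend are approximations rather than exact values.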
Starter dashboards
Pre-built dashboards covering health, performance, job queues, content operations, license capacity, and request rejections:
- Grafana dashboard — Import via Dashboards > New > Import in Grafana
- Datadog dashboard — Import via Dashboards > New Dashboard > Import in Datadog
These dashboards use the queries documented throughout this guide. Customize thresholds and layouts to match your environment.