Health and performance guide
This guide provides detailed metrics, query patterns, and troubleshooting workflows for monitoring Connect health and diagnosing performance issues using OpenTelemetry signals.
Overview
This guide focuses on two critical operational questions for Posit Connect administrators:
- Is Connect healthy right now? - Service availability and database connection health
- Why is Connect slow right now? - Database performance (connection pool, query latency, wait times) and host/infrastructure metrics (CPU, memory, storage I/O)
For job queue and content execution monitoring, see the job queue operations guide.
Is Connect healthy right now?
This question addresses immediate health status: Is the service up? Is the database accessible? Is the filesystem responsive?
Health signals to check
Service health
Question: Is the Connect server process running?
Primary Metric: db.sql.connection.open (Gauge)
Dimensions:
- `status`: Connection state (`idle` or `inuse`)
- `pool_name`: Connection pool identifier
Query Pattern:
- Datadog: `sum:db.sql.connection.open{pool.name:core}`
- Prometheus: `sum by (pool_name) (db_sql_connection_open{pool_name="core"})`
- Generic: `db.sql.connection.open WHERE pool_name == "core"`
Interpretation:
- `> 0` = Service is up and running
- No data = Service is down
Usage: Display as a single stat or use your platform’s built-in service health detection based on telemetry presence. This metric is continuously reported regardless of traffic.
Most observability platforms automatically detect service health based on whether telemetry is being received. Any regularly exported metric (like db.sql.connection.open, application.count, or users.active) indicates the service is running. Platforms like Datadog, New Relic, and Grafana Cloud provide built-in service health views based on metric presence.
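The presence-based health detection described above amounts to a staleness window on the last received metric. A minimal sketch (the helper name, timestamps, and 120-second threshold are assumptions for illustration, not part of Connect or any platform API):

```python
import time

def service_is_up(last_metric_timestamp: float, now: float,
                  stale_after_s: float = 120.0) -> bool:
    """Treat the service as up if any regularly exported metric
    (e.g. db.sql.connection.open) arrived within the staleness window."""
    return (now - last_metric_timestamp) <= stale_after_s

now = time.time()
print(service_is_up(now - 30, now))   # metric seen 30s ago -> True
print(service_is_up(now - 600, now))  # no data for 10 minutes -> False
```

Most platforms implement this for you; the sketch only shows why any continuously reported metric works as a liveness signal.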
Database connection pool utilization
Question: Is the database connection pool exhausted?
Primary Metric: db.sql.connection.open (Gauge)
Dimensions:
- `status`: Connection state (`idle` or `inuse`)
- `pool_name`: Connection pool identifier (e.g., “core” or “instrumentation”)
Query Pattern:
- Datadog: `(sum:db.sql.connection.open{status:inuse} by {pool.name}) / (sum:db.sql.connection.max_open{*} by {pool.name}) * 100`
- Prometheus: `(db_sql_connection_open{status="inuse"} / db_sql_connection_max_open) * 100`
- Generic: `(db.sql.connection.open WHERE status == "inuse" AND pool_name == "core") / (db.sql.connection.max_open WHERE pool_name == "core") * 100`
Interpretation:
- `< 70%` = Healthy pool utilization
- `70-90%` = Pool under pressure but functional
- `> 90%` = Pool exhaustion risk, slowness likely
Usage: Display as a gauge or percentage. This is a critical health indicator for database performance.
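The utilization formula and thresholds above can be sketched as a small helper (a hypothetical function with illustrative values, not part of Connect):

```python
def pool_utilization_pct(in_use: int, max_open: int) -> float:
    """(in-use connections / max open connections) * 100, per pool."""
    return 100.0 * in_use / max_open

def classify(pct: float) -> str:
    # Thresholds from the guide: <70% healthy, 70-90% under pressure, >90% risk
    if pct < 70:
        return "healthy"
    if pct <= 90:
        return "under pressure"
    return "exhaustion risk"

pct = pool_utilization_pct(in_use=18, max_open=20)
print(f"{pct:.0f}% -> {classify(pct)}")  # 90% -> under pressure
```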
Scheduled report health
Question: Are scheduled reports running successfully?
Primary Metric: schedule.count (Gauge)
Dimensions:
schedule.status: Status (`queued`, `running`, `success`, `failure`)
Query Pattern:
- Datadog: `sum:schedule.count{schedule.status:failure}`
- Prometheus: `schedule_count{schedule_status="failure"}`
- Generic: `schedule.count WHERE schedule.status == "failure"`
Interpretation:
- Rising failure count indicates issues with scheduled content (locked content, auth failures, or resource problems)
Usage: Track failure counts over time and alert on sustained increases.
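One way to approximate “alert on sustained increases” is a consecutive-rise check over recent samples of the failure count (a crude sketch; the window length and sample values are assumptions, and real platforms offer richer trend alerts):

```python
def sustained_increase(samples, min_consecutive=3):
    """Return True if the failure count rose for min_consecutive
    consecutive intervals -- a simple sustained-increase alert."""
    streak = 0
    for prev, cur in zip(samples, samples[1:]):
        streak = streak + 1 if cur > prev else 0
        if streak >= min_consecutive:
            return True
    return False

print(sustained_increase([2, 2, 3, 4, 5]))  # True: three consecutive rises
print(sustained_increase([2, 3, 2, 3, 2]))  # False: no sustained trend
```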
Running applications
Question: Are applications staying running?
Primary Metric: application.count (Gauge)
Dimensions:
application.type: Application type
Query Pattern:
- Datadog: `sum:application.count{*} by {application.type}`
- Prometheus: `application_count`
- Generic: `application.count`
Interpretation:
- Sudden drops may indicate crash loops or job launcher issues
Usage: Monitor for unexpected decreases in running application count.
Why is Connect slow right now?
When users report slowness or you observe performance degradation, use this decision tree to identify the bottleneck systematically.
Decision tree
Check these areas in order to identify the bottleneck:
- Database metrics (most common) - Check connection pool utilization, wait times, and query latency
- Host and infrastructure metrics - Check CPU, memory, and storage I/O (especially NFS/EFS)
Database performance metrics
Database connection pool utilization
See Is Connect healthy right now? above for full details. Quick check:
- Datadog: `(sum:db.sql.connection.open{status:inuse} by {pool.name}) / (sum:db.sql.connection.max_open{*} by {pool.name}) * 100`
- Prometheus: `(db_sql_connection_open{status="inuse"} / db_sql_connection_max_open) * 100`
- Generic: `(db.sql.connection.open WHERE status == "inuse") / db.sql.connection.max_open * 100`
Interpretation:
- `> 90%` = Pool exhaustion is likely causing slowness
Connection wait time
Question: Are threads waiting for database connections to become available?
Primary Metrics:
- `db.sql.connection.wait_duration` (Counter, cumulative milliseconds)
- `db.sql.connection.wait` (Counter, cumulative count)
Dimensions:
pool_name: Connection pool identifier
Query Pattern:
- Datadog: `sum:db.sql.connection.wait_duration{*} by {pool.name}.as_count()`
- Prometheus: `rate(db_sql_connection_wait_duration_milliseconds_total{pool_name="core"}[5m]) / rate(db_sql_connection_wait_total{pool_name="core"}[5m])`
- Generic: `RATE(db.sql.connection.wait_duration WHERE pool_name == "core") over 5 min / RATE(db.sql.connection.wait WHERE pool_name == "core") over 5 min`
Interpretation:
- `< 10ms` = Healthy, minimal contention
- `10-100ms` = Moderate contention
- `> 100ms` = Significant wait time causing slowness
Usage: Display as a time-series graph. Spikes indicate connection pool pressure. These are cumulative counters that only increment when connections actually wait, so no data or zero values indicate healthy pool utilization with no contention.
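The rate-over-rate query above reduces to dividing the deltas of the two cumulative counters over an interval. A sketch with assumed sample values (the helper is hypothetical, not a Connect API):

```python
def avg_wait_ms(dur_prev, dur_cur, cnt_prev, cnt_cur):
    """Average connection wait over an interval, from two samples each of
    db.sql.connection.wait_duration (cumulative ms) and
    db.sql.connection.wait (cumulative count)."""
    waits = cnt_cur - cnt_prev
    if waits == 0:
        return None  # no waits this interval -- healthy, no contention
    return (dur_cur - dur_prev) / waits

print(avg_wait_ms(dur_prev=1000, dur_cur=1450, cnt_prev=50, cnt_cur=59))  # 50.0
```

A `None` result matches the note above: these counters only move when connections actually wait.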
Query latency
Question: How long are database queries taking?
Primary Metric: db.sql.latency (Histogram, milliseconds)
Dimensions:
- `method`: Database operation method name (e.g., “sql.DB.Query”, “sql.Stmt.Exec”)
- `status`: Operation status (success/error)
Query Pattern:
- Datadog: `sum:db.sql.latency.bucket{*} by {upper_bound}.as_count()`
- Prometheus: `histogram_quantile(0.50, rate(db_sql_latency_milliseconds_bucket[5m]))`
- Generic: `AVG(db.sql.latency) over 5 minutes`
Interpretation:
- `< 100ms` = Fast queries
- `100-1000ms` = Acceptable for complex queries
- `> 1000ms` = Slow queries detected, investigate
Usage: Display as a time-series graph showing average latency over time. Sudden spikes indicate performance issues. You can group by method to identify which types of operations are slowest. For percentile analysis (P95, P99), use histogram queries in your platform.
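Percentile queries work by interpolating within cumulative histogram buckets. A simplified version of that calculation (bucket values are illustrative; PromQL's `histogram_quantile()` uses the same linear interpolation):

```python
def histogram_quantile(q, buckets):
    """Approximate a quantile from cumulative histogram buckets,
    given as [(upper_bound_ms, cumulative_count), ...]."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            # interpolate the rank's position inside this bucket
            bucket_count = count - lower_count
            frac = (rank - lower_count) / bucket_count if bucket_count else 0
            return lower_bound + (upper_bound - lower_bound) * frac
        lower_bound, lower_count = upper_bound, count

buckets = [(10, 40), (100, 90), (1000, 99), (5000, 100)]  # le in ms, cumulative
print(histogram_quantile(0.50, buckets))  # median lands inside the 10-100ms bucket
```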
API endpoint performance
Question: Are specific API endpoints slow?
Primary Metrics:
- `http.server.request.duration` (Histogram, seconds)
- `http.server.active_requests` (Gauge)
Dimensions:
- `http.method`: HTTP method (GET, POST, etc.)
- `http.route`: API endpoint route
- `http.status_code`: Response status code
Query Pattern:
- Datadog: `sum:http.server.request.duration.bucket{*} by {upper_bound}.as_count()`
- Prometheus: `histogram_quantile(0.95, sum by (le, http_route) (rate(http_server_request_duration_seconds_bucket[5m])))`
- Generic: `P95 of (http.server.request.duration) GROUP BY http.route over 5 minutes`
Interpretation:
- `< 0.5s` = Fast endpoints
- `0.5-2.0s` = Acceptable for complex operations
- `> 2.0s` = Slow endpoints requiring investigation
Usage: Group by http.route to identify which specific API endpoints are slow. Use http.server.active_requests to detect request pileup.
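Grouping by route and taking a P95 can be illustrated with a nearest-rank percentile over raw duration samples (a hypothetical sketch; in practice your platform computes this from histogram buckets rather than raw samples):

```python
import math
from collections import defaultdict

def p95_by_route(samples):
    """Nearest-rank P95 per route from (http_route, duration_s) pairs --
    the shape of result you get from grouping
    http.server.request.duration by http.route."""
    by_route = defaultdict(list)
    for route, dur in samples:
        by_route[route].append(dur)
    out = {}
    for route, durs in by_route.items():
        durs.sort()
        idx = math.ceil(0.95 * len(durs)) - 1  # nearest-rank index
        out[route] = durs[idx]
    return out

samples = [("/api/v1/content", d) for d in [0.1] * 18 + [2.5, 3.0]]
print(p95_by_route(samples))  # {'/api/v1/content': 2.5}
```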
Off-host execution health (Kubernetes)
Question: Is the launcher responding?
Primary Metric: launcher.client.retry.exhausted (Counter)
Dimensions:
http.method: HTTP method used
Query Pattern:
- Datadog: `per_second(sum:connect.launcher_client_retry_exhausted.count{*})`
- Prometheus: `rate(launcher_client_retry_exhausted_total[5m])`
- Generic: `RATE(launcher.client.retry.exhausted) over 5 minutes`
Interpretation:
- Any non-zero rate indicates launcher communication failures after all retries
- Common causes: launcher downtime, network issues, or resource exhaustion
Usage: Alert on any exhausted retries. This indicates jobs are failing to submit to Kubernetes.
Process management
Question: Are processes churning or getting stuck?
Primary Metric: process.count (UpDownCounter)
Dimensions:
process.tag: Process type (e.g., `run_shiny_app`, `run_dash_app`)
Query Pattern:
- Datadog: `sum:otel.process.count{*} by {process.tag}`
- Prometheus: `sum by (process_tag) (process_count)`
- Generic: `process.count GROUP BY process.tag`
Interpretation:
- Sudden drops indicate crashed processes or job launcher issues
- Steady values indicate stable processes
- Unexpected increases may indicate processes not terminating properly
Usage: Monitor current process counts by type. Track the rate of change to detect instability or process churn.
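Tracking the rate of change to detect churn can be sketched as a drop detector over successive `process.count` samples for one tag (the helper, threshold, and sample values are assumptions for illustration):

```python
def detect_instability(counts, drop_threshold=0.5):
    """Return the sample indices where the process count fell by more than
    drop_threshold (as a fraction) -- a crude crash-loop/churn detector."""
    events = []
    for i, (prev, cur) in enumerate(zip(counts, counts[1:]), start=1):
        if prev > 0 and (prev - cur) / prev > drop_threshold:
            events.append(i)
    return events

# run_shiny_app process counts sampled once per minute
print(detect_instability([10, 10, 4, 9, 9]))  # [2]: 10 -> 4 is a >50% drop
```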
Queue performance
Question: How long are jobs waiting in queues and taking to process?
Primary Attributes (from traces):
- `queue.wait_time.ms` - Time item waited in queue before processing (milliseconds)
- `queue.processing_duration.ms` - Time spent processing the item (milliseconds)
Available in: Trace span attributes on queue.item.process spans
Query Pattern:
- Datadog: filter `resource_name:queue.item.process @queue.item.type:ScheduledRender`, then `compute avg(@queue.wait_time.ms) by @queue.item.type`
- Prometheus: traces are not available as metrics; use trace analytics in your platform
- Generic: `AVG(queue.wait_time.ms) WHERE resource_name == "queue.item.process" GROUP BY queue.item.type`
Interpretation:
- High wait times indicate queue backlog or insufficient workers
- High processing duration indicates slow job execution
- Both contribute to overall slowness in scheduled reports, deployments, and background tasks
Usage: Use trace analytics to query queue.item.process spans. Filter by queue.item.type (e.g., ScheduledRender, GitFetch) to identify which queue types have the longest wait times.
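The trace-analytics aggregation above (average wait time grouped by `queue.item.type`) is equivalent to this sketch over span-attribute dicts (the data shape and values are hypothetical; your platform performs this over real spans):

```python
from collections import defaultdict

def avg_wait_by_type(spans):
    """Average queue.wait_time.ms per queue.item.type from a list of
    queue.item.process span-attribute dicts."""
    totals = defaultdict(lambda: [0.0, 0])
    for span in spans:
        t = totals[span["queue.item.type"]]
        t[0] += span["queue.wait_time.ms"]
        t[1] += 1
    return {k: total / n for k, (total, n) in totals.items()}

spans = [
    {"queue.item.type": "ScheduledRender", "queue.wait_time.ms": 1200},
    {"queue.item.type": "ScheduledRender", "queue.wait_time.ms": 1800},
    {"queue.item.type": "GitFetch", "queue.wait_time.ms": 50},
]
print(avg_wait_by_type(spans))  # {'ScheduledRender': 1500.0, 'GitFetch': 50.0}
```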
Host and infrastructure metrics
When database metrics look healthy but Connect is still slow, investigate host-level resource utilization and external infrastructure dependencies. The host system where Connect runs collects these metrics, not Connect’s OpenTelemetry instrumentation.
Key areas to monitor
CPU and memory utilization
Monitor the Connect server’s CPU and memory usage through your host monitoring agent. High CPU utilization (>80% sustained) or memory pressure can cause overall application slowness even when database performance is good.
Filesystem I/O and network storage
Connect performs extensive filesystem I/O operations for content management, including:
- Reading and writing content bundles during deployment
- Accessing R package caches and Python virtual environments
- Storing and retrieving content execution artifacts
- Managing logs and temporary files
Network-attached storage (Network File System (NFS), AWS Elastic File System (EFS), Azure Files) is a common bottleneck during periods of high activity. Key indicators include:
- High I/O wait times: Processes spending significant time waiting for disk operations
- Elevated disk latency: Read/write operations taking longer than expected
- Throughput saturation: Reaching the input/output operations per second (IOPS) or bandwidth limits of your storage tier
Why NFS systems become bottlenecks
- Each content deployment involves many small file operations (package installations, bundle extraction)
- Multiple concurrent deployments amplify the I/O load
- Network-attached storage adds latency compared to local disks
- Cloud storage services (EFS, Azure Files) have performance limits based on provisioning tier and burst credits
- High-concurrency scenarios (many users, frequent deployments) can exhaust available IOPS
Recommendations
Set up host metric collection if not already configured:
- Deploy a host monitoring agent on your Connect server
- Collect standard system metrics: CPU, memory, disk I/O, network I/O
- Monitor filesystem-specific metrics for mounted volumes
Monitor storage performance separately:
- AWS EFS: Watch CloudWatch metrics for burst credit balance, throughput, and IOPS
- Azure Files: Monitor storage account metrics for throttling and latency
- On-premises NFS: Track NFS server metrics and network latency
Investigate when you observe:
- Deployments that are slow despite healthy database metrics
- High I/O wait times on the Connect server
- Correlation between slowness and concurrent deployment activity
- Storage throttling events or exhausted burst credits
Optimization strategies if storage is the bottleneck:
- Upgrade storage tier for higher IOPS/throughput
- Switch from burst mode to provisioned throughput (EFS)
- Use local disk for frequently accessed caches when possible
- Implement deployment throttling to reduce concurrent I/O
Debugging
When metrics indicate slowness, traces provide execution context to identify bottlenecks.
Trace analysis workflow
- Identify slow operations in your APM tool’s trace explorer (filter by duration, endpoint, or time range).
- Examine spans to see overall timing:
  - `HTTP {method} {route}` - API endpoint duration
  - `queue.item.process` - Queue wait time (`queue.wait_time.ms`) and processing duration (`queue.processing_duration.ms`)
  - `schedule.render` - Scheduled execution timing
  - `worker.provision` / `worker.process.startup` - Worker lifecycle timing
  - `report.execute` / `report.setup` - Report execution phases
- Use span attributes to identify content (`content.guid`), job keys (`job.key`), and types (`report.type`, `runtime.type`).
- Correlate timing patterns with metrics to identify bottlenecks (database pool exhaustion, worker saturation, etc.).
For assistance with trace analysis, contact Posit Support.
See the OpenTelemetry signal reference for a complete catalog of available spans. Return to the operational guide overview for the full list of operational questions.