Health and performance guide

This guide provides detailed metrics, query patterns, and troubleshooting workflows for monitoring Connect health and diagnosing performance issues using OpenTelemetry signals.

Overview

This guide focuses on two critical operational questions for Posit Connect administrators:

  • Is Connect healthy right now? - Service availability and database connection health
  • Why is Connect slow right now? - Database performance (connection pool, query latency, wait times) and host/infrastructure metrics (CPU, memory, storage I/O)

For job queue and content execution monitoring, see the job queue operations guide.

Is Connect healthy right now?

This question addresses immediate health status: Is the service up? Is the database accessible? Is the filesystem responsive?

Health signals to check

Service health

Question: Is the Connect server process running?

Primary Metric: db.sql.connection.open (Gauge)

Dimensions:

  • status: Connection state (idle or inuse)
  • pool_name: Connection pool identifier

Query Pattern:

sum:db.sql.connection.open{pool.name:core}
sum by(pool_name) (db_sql_connection_open{pool_name="core"})
db.sql.connection.open WHERE pool_name == "core"

Interpretation:

  • > 0 = Service is up and running
  • No data = Service is down

Usage: Display as a single stat or use your platform’s built-in service health detection based on telemetry presence. This metric is continuously reported regardless of traffic.

Note

Most observability platforms automatically detect service health based on whether telemetry is being received. Any regularly exported metric (like db.sql.connection.open, application.count, or users.active) indicates the service is running. Platforms like Datadog, New Relic, and Grafana Cloud provide built-in service health views based on metric presence.
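The presence-based liveness check described above can be sketched as a tiny classifier. This is a minimal illustration, not platform code: it assumes your collector hands you the most recent gauge value for db.sql.connection.open, or None when no datapoint arrived in the evaluation window. The function name and None convention are hypothetical.

```python
from typing import Optional


def service_status(open_connections: Optional[float]) -> str:
    """Classify Connect liveness from db.sql.connection.open.

    None means no datapoint arrived in the evaluation window; because
    the metric is exported continuously regardless of traffic, absence
    of data indicates the service is down.
    """
    if open_connections is None:
        return "down"
    return "up" if open_connections > 0 else "down"
```

Most platforms implement this check natively (as described in the note above), so this logic is only needed for custom tooling.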


Database connection pool utilization

Question: Is the database connection pool exhausted?

Primary Metric: db.sql.connection.open (Gauge)

Dimensions:

  • status: Connection state (idle or inuse)
  • pool_name: Connection pool identifier (e.g., “core” or “instrumentation”)

Query Pattern:

(sum:db.sql.connection.open{status:inuse} by {pool.name}) / (sum:db.sql.connection.max_open{*} by {pool.name}) * 100
sum by (pool_name) (db_sql_connection_open{status="inuse"})
  / sum by (pool_name) (db_sql_connection_max_open) * 100
(db.sql.connection.open WHERE status == "inuse" AND pool_name == "core")
  / (db.sql.connection.max_open WHERE pool_name == "core") * 100

Interpretation:

  • < 70% = Healthy pool utilization
  • 70-90% = Pool under pressure but functional
  • > 90% = Pool exhaustion risk, slowness likely

Usage: Display as a gauge or percentage. This is a critical health indicator for database performance.
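The utilization calculation and the interpretation bands above can be sketched as follows; the function names and the exact band boundaries at 70% and 90% mirror this guide's thresholds, but are otherwise illustrative.

```python
def pool_utilization_pct(inuse: float, max_open: float) -> float:
    """Pool utilization percentage: in-use connections over max_open,
    mirroring the query pattern above."""
    if max_open <= 0:
        raise ValueError("max_open must be positive")
    return inuse / max_open * 100


def pool_health(utilization_pct: float) -> str:
    """Map a utilization percentage onto this guide's interpretation bands."""
    if utilization_pct < 70:
        return "healthy"
    if utilization_pct <= 90:
        return "under pressure"
    return "exhaustion risk"
```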


Scheduled report health

Question: Are scheduled reports running successfully?

Primary Metric: schedule.count (Gauge)

Dimensions:

  • schedule.status: Status (queued, running, success, failure)

Query Pattern:

sum:schedule.count{schedule.status:failure}
schedule_count{schedule_status="failure"}
schedule.count WHERE schedule.status == "failure"

Interpretation:

  • Rising failure count indicates issues with scheduled content (locked content, auth failures, or resource problems)

Usage: Track failure counts over time and alert on sustained increases.
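One way to distinguish a sustained increase from a one-off blip is to require the failure count to rise across several consecutive scrapes. A minimal sketch, assuming `samples` is a chronological series of schedule.count{failure} values; the window size and function name are illustrative choices.

```python
def sustained_failure_increase(samples: list[float], window: int = 3) -> bool:
    """True when the failure count has risen across `window` consecutive
    scrape intervals -- a sustained increase, not a single spike."""
    if len(samples) < window + 1:
        return False
    recent = samples[-(window + 1):]
    return all(b > a for a, b in zip(recent, recent[1:]))
```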


Running applications

Question: Are applications staying running?

Primary Metric: application.count (Gauge)

Dimensions:

  • application.type: Application type

Query Pattern:

sum:application.count{*} by {application.type}
application_count
application.count

Interpretation:

  • Sudden drops may indicate crash loops or job launcher issues

Usage: Monitor for unexpected decreases in running application count.
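A simple way to detect an unexpected decrease is to compare consecutive samples and flag drops beyond a relative threshold. This is an illustrative sketch; the 25% default is an assumed starting point, not a recommendation from Connect.

```python
def sudden_drop(previous: float, current: float,
                threshold_pct: float = 25.0) -> bool:
    """Flag a drop in application.count larger than threshold_pct between
    two samples; such drops may indicate crash loops or launcher issues."""
    if previous <= 0:
        return False
    return (previous - current) / previous * 100 > threshold_pct
```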


Why is Connect slow right now?

When users report slowness or you observe performance degradation, use this decision tree to identify the bottleneck systematically.

Decision tree

Check these areas in order to identify the bottleneck:

  1. Database metrics (most common) - Check connection pool utilization, wait times, and query latency
  2. Host and infrastructure metrics - Check CPU, memory, and storage I/O (especially NFS/EFS)

Database performance metrics


Database connection pool utilization

See Is Connect healthy right now? above for full details. Quick check:

(sum:db.sql.connection.open{status:inuse} by {pool.name}) / (sum:db.sql.connection.max_open{*} by {pool.name}) * 100
sum by (pool_name) (db_sql_connection_open{status="inuse"})
  / sum by (pool_name) (db_sql_connection_max_open) * 100
(db.sql.connection.open WHERE status == "inuse")
  / db.sql.connection.max_open * 100

Interpretation:

  • > 90% = Pool exhaustion is likely causing slowness

Connection wait time

Question: Are threads waiting for database connections to become available?

Primary Metrics:

  • db.sql.connection.wait_duration (Counter, cumulative milliseconds)
  • db.sql.connection.wait (Counter, cumulative count)

Dimensions:

  • pool_name: Connection pool identifier

Query Pattern:

sum:db.sql.connection.wait_duration{*} by {pool.name}.as_count()
rate(db_sql_connection_wait_duration_milliseconds_total{pool_name="core"}[5m])
  / rate(db_sql_connection_wait_total{pool_name="core"}[5m])
RATE(db.sql.connection.wait_duration WHERE pool_name == "core") over 5 min
  / RATE(db.sql.connection.wait WHERE pool_name == "core") over 5 min

Interpretation:

  • < 10ms = Healthy, minimal contention
  • 10-100ms = Moderate contention
  • > 100ms = Significant wait time causing slowness

Usage: Display as a time-series graph. Spikes indicate connection pool pressure. These are cumulative counters that only increment when connections actually wait, so no data or zero values indicate healthy pool utilization with no contention.
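The rate-over-rate queries above reduce to a per-interval calculation on the two cumulative counters. A minimal sketch, assuming you have two scrapes of each counter; the function name is illustrative.

```python
from typing import Optional


def avg_wait_ms(dur_prev: float, dur_curr: float,
                count_prev: float, count_curr: float) -> Optional[float]:
    """Average connection wait (ms) over one scrape interval, from two
    samples of the cumulative counters: delta of wait_duration divided
    by delta of wait count. Returns None when no waits occurred, which
    indicates a healthy, uncontended pool."""
    waits = count_curr - count_prev
    if waits <= 0:
        return None
    return (dur_curr - dur_prev) / waits
```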


Query latency

Question: How long are database queries taking?

Primary Metric: db.sql.latency (Histogram, milliseconds)

Dimensions:

  • method: Database operation method name (e.g., “sql.DB.Query”, “sql.Stmt.Exec”)
  • status: Operation status (success/error)

Query Pattern:

sum:db.sql.latency.bucket{*} by {upper_bound}.as_count()
histogram_quantile(0.50, sum by (le) (rate(db_sql_latency_milliseconds_bucket[5m])))
AVG(db.sql.latency) over 5 minutes

Interpretation:

  • < 100ms = Fast queries
  • 100-1000ms = Acceptable for complex queries
  • > 1000ms = Slow queries detected, investigate

Usage: Display as a time-series graph showing average latency over time. Sudden spikes indicate performance issues. You can group by method to identify which types of operations are slowest. For percentile analysis (P95, P99), use histogram queries in your platform.
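For readers implementing percentile analysis themselves, the estimate a histogram query produces can be sketched as linear interpolation over cumulative buckets, in the style of PromQL's histogram_quantile. This is a simplified illustration (it treats the lowest bucket as starting at 0 and returns the last finite bound for ranks in the +inf bucket), not the exact algorithm of any particular platform.

```python
def histogram_quantile(q: float,
                       buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets given
    as (upper_bound, cumulative_count), sorted by bound and ending with
    a +inf bucket, using linear interpolation within the target bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # rank falls in the open-ended bucket
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound
```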


API endpoint performance

Question: Are specific API endpoints slow?

Primary Metrics:

  • http.server.request.duration (Histogram, seconds)
  • http.server.active_requests (Gauge)

Dimensions:

  • http.method: HTTP method (GET, POST, etc.)
  • http.route: API endpoint route
  • http.status_code: Response status code

Query Pattern:

sum:http.server.request.duration.bucket{*} by {upper_bound}.as_count()
histogram_quantile(0.95, sum by (le, http_route) (rate(http_server_request_duration_seconds_bucket[5m])))
P95 of (http.server.request.duration) GROUP BY http.route over 5 minutes

Interpretation:

  • < 0.5s = Fast endpoints
  • 0.5-2.0s = Acceptable for complex operations
  • > 2.0s = Slow endpoints requiring investigation

Usage: Group by http.route to identify which specific API endpoints are slow. Use http.server.active_requests to detect request pileup.
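If you are working from raw observations rather than pre-bucketed histograms (for example, trace-derived durations), per-route P95 reduces to a group-and-sort. A minimal sketch; the function name and nearest-rank method are illustrative choices.

```python
import math


def p95_by_route(samples: list[tuple[str, float]]) -> dict[str, float]:
    """P95 request duration (seconds) per http.route, computed from raw
    (route, duration) observations using the nearest-rank method."""
    by_route: dict[str, list[float]] = {}
    for route, duration in samples:
        by_route.setdefault(route, []).append(duration)
    result: dict[str, float] = {}
    for route, durations in by_route.items():
        durations.sort()
        idx = max(0, math.ceil(0.95 * len(durations)) - 1)
        result[route] = durations[idx]
    return result
```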


Off-host execution health (Kubernetes)

Question: Is the launcher responding?

Primary Metric: launcher.client.retry.exhausted (Counter)

Dimensions:

  • http.method: HTTP method used

Query Pattern:

per_second(sum:connect.launcher_client_retry_exhausted.count{*})
rate(launcher_client_retry_exhausted_total[5m])
RATE(launcher.client.retry.exhausted) over 5 minutes

Interpretation:

  • Any non-zero rate indicates launcher communication failures after all retries
  • Common causes: launcher downtime, network issues, or resource exhaustion

Usage: Alert on any exhausted retries. This indicates jobs are failing to submit to Kubernetes.
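The alert condition above amounts to a per-second rate over the cumulative counter, with any non-zero value firing. A minimal sketch, assuming two scrapes of the counter; the max(0, ...) guard handles counter resets and is an illustrative simplification.

```python
def retry_exhaustion_rate(prev_total: float, curr_total: float,
                          interval_s: float) -> float:
    """Per-second rate of launcher.client.retry.exhausted between two
    scrapes of the cumulative counter. Any non-zero rate should alert."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return max(0.0, curr_total - prev_total) / interval_s
```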


Process management

Question: Are processes churning or getting stuck?

Primary Metric: process.count (UpDownCounter)

Dimensions:

  • process.tag: Process type (e.g., run_shiny_app, run_dash_app)

Query Pattern:

sum:otel.process.count{*} by {process.tag}
sum by (process_tag) (process_count)
process.count GROUP BY process.tag

Interpretation:

  • Sudden drops indicate crashed processes or job launcher issues
  • Steady values indicate stable processes
  • Unexpected increases may indicate processes not terminating properly

Usage: Monitor current process counts by type. Track the rate of change to detect instability or process churn.
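Churn can hide in a series whose average looks flat: counts that oscillate up and down indicate processes repeatedly starting and dying. One illustrative measure is the mean absolute change between consecutive samples; the function name and metric are assumptions, not a Connect-provided signal.

```python
def churn_rate(counts: list[float]) -> float:
    """Mean absolute change in process.count between consecutive samples.
    A high value signals churn even when the average count looks stable."""
    if len(counts) < 2:
        return 0.0
    deltas = [abs(b - a) for a, b in zip(counts, counts[1:])]
    return sum(deltas) / len(deltas)
```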


Queue performance

Question: How long are jobs waiting in queues and taking to process?

Primary Attributes (from traces):

  • queue.wait_time.ms - Time an item waited in queue before processing (milliseconds)
  • queue.processing_duration.ms - Time spent processing the item (milliseconds)

Available in: Trace span attributes on queue.item.process spans

Query Pattern:

resource_name:queue.item.process @queue.item.type:ScheduledRender
compute avg(@queue.wait_time.ms) by @queue.item.type
# Traces not available as metrics - use trace analytics in your platform
AVG(queue.wait_time.ms) WHERE resource_name == "queue.item.process"
  GROUP BY queue.item.type

Interpretation:

  • High wait times indicate queue backlog or insufficient workers
  • High processing duration indicates slow job execution
  • Both contribute to overall slowness in scheduled reports, deployments, and background tasks

Usage: Use trace analytics to query queue.item.process spans. Filter by queue.item.type (e.g., ScheduledRender, GitFetch) to identify which queue types have the longest wait times.
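Outside a platform's trace analytics, the same aggregation can be done directly over exported span attributes. A minimal sketch, assuming each span is represented as a dict of its attributes; the export format is an assumption.

```python
def avg_wait_by_type(spans: list[dict]) -> dict[str, float]:
    """Average queue.wait_time.ms per queue.item.type, computed from a
    list of queue.item.process span attribute dicts."""
    sums: dict[str, float] = {}
    counts: dict[str, int] = {}
    for span in spans:
        item_type = span["queue.item.type"]
        sums[item_type] = sums.get(item_type, 0.0) + span["queue.wait_time.ms"]
        counts[item_type] = counts.get(item_type, 0) + 1
    return {t: sums[t] / counts[t] for t in sums}
```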


Host and infrastructure metrics

When database metrics look healthy but Connect is still slow, investigate host-level resource utilization and external infrastructure dependencies. These metrics are collected by the host system where Connect runs, not by Connect's OpenTelemetry instrumentation.

Key areas to monitor

CPU and memory utilization

Monitor the Connect server’s CPU and memory usage through your host monitoring agent. High CPU utilization (>80% sustained) or memory pressure can cause overall application slowness even when database performance is good.
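The "sustained" qualifier matters: a single spike above 80% is normal, a run of high samples is not. One illustrative way to encode this in custom tooling; the threshold and run length are assumed defaults, not Connect recommendations.

```python
def cpu_sustained_high(samples: list[float], threshold: float = 80.0,
                       min_consecutive: int = 5) -> bool:
    """True when CPU utilization (%) stays above threshold for at least
    min_consecutive consecutive samples -- sustained pressure, not a spike."""
    run = 0
    for sample in samples:
        run = run + 1 if sample > threshold else 0
        if run >= min_consecutive:
            return True
    return False
```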

Filesystem I/O and network storage

Connect performs extensive filesystem I/O operations for content management, including:

  • Reading and writing content bundles during deployment
  • Accessing R package caches and Python virtual environments
  • Storing and retrieving content execution artifacts
  • Managing logs and temporary files

Network-attached storage, such as the Network File System (NFS), AWS Elastic File System (EFS), or Azure Files, is a common bottleneck during periods of high activity. Key indicators include:

  • High I/O wait times: Processes spending significant time waiting for disk operations
  • Elevated disk latency: Read/write operations taking longer than expected
  • Throughput saturation: Reaching the input/output operations per second (IOPS) or bandwidth limits of your storage tier

Why NFS systems become bottlenecks

  • Each content deployment involves many small file operations (package installations, bundle extraction)
  • Multiple concurrent deployments amplify the I/O load
  • Network-attached storage adds latency compared to local disks
  • Cloud storage services (EFS, Azure Files) have performance limits based on provisioning tier and burst credits
  • High-concurrency scenarios (many users, frequent deployments) can exhaust available IOPS

Recommendations

  1. Set up host metric collection if not already configured:

    • Deploy a host monitoring agent on your Connect server
    • Collect standard system metrics: CPU, memory, disk I/O, network I/O
    • Monitor filesystem-specific metrics for mounted volumes
  2. Monitor storage performance separately:

    • AWS EFS: Watch CloudWatch metrics for burst credit balance, throughput, and IOPS
    • Azure Files: Monitor storage account metrics for throttling and latency
    • On-premises NFS: Track NFS server metrics and network latency
  3. Investigate when you observe:

    • Deployments that are slow despite healthy database metrics
    • High I/O wait times on the Connect server
    • Correlation between slowness and concurrent deployment activity
    • Storage throttling events or exhausted burst credits
  4. Optimization strategies if storage is the bottleneck:

    • Upgrade storage tier for higher IOPS/throughput
    • Switch from burst mode to provisioned throughput (EFS)
    • Use local disk for frequently accessed caches when possible
    • Implement deployment throttling to reduce concurrent I/O

Debugging

When metrics indicate slowness, traces provide execution context to identify bottlenecks.

Trace analysis workflow

  1. Identify slow operations in your APM tool’s trace explorer (filter by duration, endpoint, or time range).
  2. Examine spans to see overall timing:
    • HTTP {method} {route} - API endpoint duration
    • queue.item.process - Queue wait time (queue.wait_time.ms) and processing duration (queue.processing_duration.ms)
    • schedule.render - Scheduled execution timing
    • worker.provision / worker.process.startup - Worker lifecycle timing
    • report.execute / report.setup - Report execution phases
  3. Use span attributes to identify content (content.guid), job keys (job.key), and types (report.type, runtime.type).
  4. Correlate timing patterns with metrics to identify bottlenecks (database pool exhaustion, worker saturation, etc.).

For assistance with trace analysis, contact Posit Support.

See the OpenTelemetry signal reference for a complete catalog of available spans. Return to the operational guide overview for the full list of operational questions.