Alerting recommendations

This guide provides alerting recommendations for monitoring Posit Connect with OpenTelemetry. Every deployment has a different workload profile — rather than prescribing fixed thresholds, the guidance below describes how to derive thresholds that are meaningful for your environment.

Note

Tailor alert thresholds to your specific environment, infrastructure, and operational requirements. The guidance below offers reasonable starting points; adjust them based on your usage patterns, infrastructure capacity, and organizational processes.

Establishing baselines

Before configuring alerts, observe your system under normal conditions for at least one full scheduling cycle (typically one week) to capture daily and weekly patterns. Record:

  • Typical queue depth — How many items sit in the default queue during peak scheduling windows? If you have 50 scheduled reports that run at midnight, a queue size of 50 at that time is expected, not an emergency.
  • Normal job failure rate — Some failure rate may be normal (e.g., content that depends on external APIs). Measure your steady-state failure rate so alerts fire on deviations, not on background noise.
  • Typical job duration — Know the P95 duration for your workload. A 10-minute active_duration is concerning if most jobs finish in 30 seconds, but expected if you run long-running extract, transform, load (ETL) jobs.
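
If you ship these metrics to Prometheus, the weekly baseline failure rate can be captured with a recording rule so that alert thresholds can later be expressed as multiples of it. A minimal sketch, assuming the Prometheus-style metric names used in the examples below; the rule name is hypothetical:

```yaml
groups:
  - name: connect-baselines
    rules:
      # Steady-state job failure rate, averaged over one full
      # scheduling cycle (one week, via a subquery).
      - record: job_completion:failure_rate:avg_1w
        expr: avg_over_time(rate(job_completion_total{job_status="failure"}[5m])[1w:5m])
```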

Critical alerts

Health and performance

Service availability — Alert when the core database connection pool drops to zero or pool utilization exceeds 90%. See Is Connect healthy right now? for the relevant metrics.

Launcher retry exhaustion — Alert on any exhausted retries. This indicates jobs are failing to submit to Kubernetes. See Off-host execution health for query patterns.

Job failures

Job failure spike

  • Datadog: sum:job.completion{job.status:failure}.as_rate()
  • Prometheus: rate(job_completion_total{job_status="failure"}[5m])
  • Other backends: rate(job.completion WHERE job.status == "failure") over 5m

Set the threshold relative to your baseline failure rate. A reasonable starting point is 2-3x the normal rate. If your system normally sees 1 failure per hour, alert when the 5-minute rate exceeds that by a significant margin.
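
As a concrete sketch, a Prometheus alerting rule for the 1-failure-per-hour example above; the rule name and multiplier are placeholders to adapt:

```yaml
groups:
  - name: connect-job-failures
    rules:
      - alert: JobFailureSpike
        # Baseline of 1 failure/hour = 1/3600 failures/sec;
        # alert at 3x that rate, sustained for 10 minutes.
        expr: rate(job_completion_total{job_status="failure"}[5m]) > 3 * (1 / 3600)
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Job failure rate is more than 3x the normal baseline
```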

Out-of-memory (OOM) failures

  • Datadog: sum:job.completion{job.exit_code:137}.as_count()
  • Prometheus: sum(job_completion_total{job_exit_code="137"})
  • Other backends: job.completion WHERE job.exit_code == 137

Exit code 137 corresponds to SIGKILL (128 + 9) and almost always means the OS or container runtime killed the process for exceeding its memory limits. Alert on any occurrence; this is actionable regardless of baseline.
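
Because any occurrence is actionable, the alert can fire on a single exit-code-137 completion. A sketch as a Prometheus rule, assuming job_completion_total is a counter as in the example above:

```yaml
groups:
  - name: connect-oom
    rules:
      - alert: JobKilledOOM
        # Fires whenever at least one job exits with code 137
        # (SIGKILL) in the last 5 minutes.
        expr: increase(job_completion_total{job_exit_code="137"}[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: A job was killed with exit code 137 (likely OOM)
```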

Queue health

Queue backing up

  • Datadog: avg:queue.items.size{queue.name:default}
  • Prometheus: queue_items_size{queue_name="default"}
  • Other backends: queue.items.size WHERE queue.name == "default"

Base this on how many schedules run concurrently or near-simultaneously. If 50 reports are all scheduled for midnight, they’ll hit the queue at the same time and a queue size of 50 at that moment is expected. Alert when queue size exceeds the peak concurrency your scheduling patterns can explain.
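
A Prometheus sketch of this rule; the ceiling of 60 below is a placeholder for the largest burst your own schedules can explain (e.g. 50 simultaneous midnight reports plus headroom):

```yaml
groups:
  - name: connect-queue
    rules:
      - alert: QueueBackingUp
        # 60 is a placeholder: peak explainable concurrency + headroom.
        expr: queue_items_size{queue_name="default"} > 60
        # Sustained for 15m so expected scheduling spikes don't page.
        for: 15m
        labels:
          severity: critical
```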

Stuck job

  • Datadog: avg:queue.items.active_duration{queue.name:default}
  • Prometheus: queue_items_active_duration_seconds{queue_name="default"}
  • Other backends: queue.items.active_duration WHERE queue.name == "default"

Set this above the P95 job duration for your workload. If your longest-running content typically finishes in 5 minutes, alerting at 10-15 minutes is reasonable. If you have ETL jobs that run for 30 minutes, set the threshold accordingly.
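
For example, if P95 job duration is about 5 minutes, a Prometheus rule might alert when active duration passes 15 minutes; both numbers below are placeholders:

```yaml
groups:
  - name: connect-stuck-jobs
    rules:
      - alert: StuckJob
        # 900s = 15 minutes, roughly 3x a hypothetical 5-minute P95.
        expr: queue_items_active_duration_seconds{queue_name="default"} > 900
        for: 5m
        labels:
          severity: critical
```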

Queue stale

  • Datadog: avg:queue.items.age{queue.name:default}
  • Prometheus: queue_items_age_seconds{queue_name="default"}
  • Other backends: queue.items.age WHERE queue.name == "default"

This measures how long items wait before processing starts. If workers are healthy, items should begin processing quickly. Alert when the oldest item has been waiting significantly longer than your typical queue processing time.
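
A sketch in the same style; 300 seconds is a placeholder for "significantly longer than typical queue processing time" in your environment:

```yaml
groups:
  - name: connect-queue-age
    rules:
      - alert: QueueStale
        # Oldest item has waited more than 5 minutes (placeholder)
        # before any worker picked it up.
        expr: queue_items_age_seconds{queue_name="default"} > 300
        for: 5m
        labels:
          severity: critical
```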

License capacity

Named user utilization:

  • Datadog warning: avg:license.users.current{*} / avg:license.users.limit{*} > 0.8 (80%)
  • Datadog critical: avg:license.users.current{*} / avg:license.users.limit{*} > 0.9 (90%)
  • Prometheus warning: license_users_current / license_users_limit > 0.8 (80%)
  • Prometheus critical: license_users_current / license_users_limit > 0.9 (90%)
  • Other backends warning: license.users.current / license.users.limit > 0.8 (80%)
  • Other backends critical: license.users.current / license.users.limit > 0.9 (90%)

Concurrent Shiny user utilization:

  • Datadog warning: avg:license.shiny_users.current{*} / avg:license.shiny_users.limit{*} > 0.8 (80%)
  • Datadog critical: avg:license.shiny_users.current{*} / avg:license.shiny_users.limit{*} > 0.9 (90%)
  • Prometheus warning: license_shiny_users_current / license_shiny_users_limit > 0.8 (80%)
  • Prometheus critical: license_shiny_users_current / license_shiny_users_limit > 0.9 (90%)
  • Other backends warning: license.shiny_users.current / license.shiny_users.limit > 0.8 (80%)
  • Other backends critical: license.shiny_users.current / license.shiny_users.limit > 0.9 (90%)

License expiration:

  • Datadog warning: avg:license.expiration.days_remaining{*} < 30
  • Datadog critical: avg:license.expiration.days_remaining{*} < 7
  • Prometheus warning: license_expiration_days_remaining < 30
  • Prometheus critical: license_expiration_days_remaining < 7
  • Other backends warning: license.expiration.days_remaining < 30
  • Other backends critical: license.expiration.days_remaining < 7
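
The ratios and day counts above translate directly into Prometheus rules. A sketch covering the warning tier (the critical tier follows the same shape with 0.9 and 7; rule names are placeholders):

```yaml
groups:
  - name: connect-license
    rules:
      - alert: NamedUserLicenseWarning
        expr: license_users_current / license_users_limit > 0.8
        for: 30m
        labels:
          severity: warning
      - alert: ShinyUserLicenseWarning
        expr: license_shiny_users_current / license_shiny_users_limit > 0.8
        for: 30m
        labels:
          severity: warning
      - alert: LicenseExpiringSoon
        expr: license_expiration_days_remaining < 30
        labels:
          severity: warning
```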

Warning alerts

Elevated failures

  • Datadog: sum:job.completion{job.status:failure}.as_rate()
  • Prometheus: rate(job_completion_total{job_status="failure"}[5m])
  • Other backends: rate(job.completion WHERE job.status == "failure") over 5m

Set this below your critical threshold but above the baseline. Use it to catch gradual degradation before it becomes critical.

Queue depth growing

  • Datadog: avg:queue.items.size{queue.name:default}
  • Prometheus: delta(queue_items_size{queue_name="default"}[30m])
  • Other backends: increase(queue.items.size WHERE queue.name == "default") over 30m

A sustained increase over 30 minutes outside of known scheduling windows suggests jobs are arriving faster than they can be processed. This is an early signal to investigate capacity before the queue becomes critically backed up.
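
A Prometheus sketch of this warning; delta() is used because queue size is a gauge, and the growth threshold of 20 is a placeholder to tune against your baseline:

```yaml
groups:
  - name: connect-queue-growth
    rules:
      - alert: QueueDepthGrowing
        # delta() (not increase()) because queue_items_size is a gauge.
        expr: delta(queue_items_size{queue_name="default"}[30m]) > 20
        labels:
          severity: warning
```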