Alerting recommendations
This guide provides alerting recommendations for monitoring Posit Connect with OpenTelemetry. Every deployment has a different workload profile, so rather than prescribing fixed thresholds, the guidance below describes how to derive thresholds that are meaningful for your environment.
The values given here are reasonable starting points; tailor them to your own usage patterns, infrastructure capacity, and organizational processes.
Establishing baselines
Before configuring alerts, observe your system under normal conditions for at least one full scheduling cycle (typically one week) to capture daily and weekly patterns. Record:
- Typical queue depth — How many items sit in the default queue during peak scheduling windows? If you have 50 scheduled reports that run at midnight, a queue size of 50 at that time is expected, not an emergency.
- Normal job failure rate — Some failure rate may be normal (e.g., content that depends on external APIs). Measure your steady-state failure rate so alerts fire on deviations, not on background noise.
- Typical job duration — Know the P95 duration for your workload. A 10-minute `active_duration` is concerning if most jobs finish in 30 seconds, but expected if you run long-running extract, transform, load (ETL) jobs.
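If you export the Prometheus metric names used later in this guide, the baselines above can be measured directly from a week of history. These expressions are sketches; note that the 7-day subquery on the failure rate can be expensive on large deployments:

```promql
# Steady-state job failure rate over the past week (failures/sec)
avg_over_time(rate(job_completion_total{job_status="failure"}[5m])[7d:5m])

# Peak queue depth over the past week
max_over_time(queue_items_size{queue_name="default"}[7d])

# P95 of observed active job duration over the past week
quantile_over_time(0.95, queue_items_active_duration_seconds{queue_name="default"}[7d])
```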
Critical alerts
Health and performance
Service availability — Alert when the core database connection pool drops to zero or pool utilization exceeds 90%. See Is Connect healthy right now? for the relevant metrics.
Launcher retry exhaustion — Alert on any exhausted retries. This indicates jobs are failing to submit to Kubernetes. See Off-host execution health for query patterns.
Job failures
Job failure spike
- Datadog: `sum:job.completion{job.status:failure}.as_rate()`
- Prometheus: `rate(job_completion_total{job_status="failure"}[5m])`
- Generic: `rate(job.completion WHERE job.status == "failure") over 5m`
Set the threshold relative to your baseline failure rate. A reasonable starting point is 2-3x the normal rate. If your system normally sees 1 failure per hour, alert when the 5-minute rate exceeds that by a significant margin.
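As a sketch, the Prometheus form of this query can be wired into an alerting rule. The alert name and threshold are placeholders — the threshold here is roughly 3x a baseline of one failure per hour (3/3600 ≈ 0.00083 failures per second):

```yaml
groups:
  - name: connect-job-failures
    rules:
      - alert: JobFailureSpike
        # ~3x a baseline of 1 failure/hour; substitute your own baseline
        expr: rate(job_completion_total{job_status="failure"}[5m]) > 0.00083
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Job failure rate is well above baseline
```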
Out-of-memory (OOM) failures
- Datadog: `sum:job.completion{job.exit_code:137}.as_count()`
- Prometheus: `sum(job_completion_total{job_exit_code="137"})`
- Generic: `job.completion WHERE job.exit_code == 137`
OOM failures cause exit code 137 (SIGKILL), which almost always indicates the OS or container runtime killed the process for exceeding memory limits. Alert on any occurrence. This is actionable regardless of baseline.
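Because any OOM kill is actionable, a rule can simply fire on the first occurrence. This Prometheus sketch uses `increase` over a short window; the alert name is a placeholder:

```yaml
groups:
  - name: connect-oom
    rules:
      - alert: JobOOMKilled
        # Fires whenever a job exits with code 137 (SIGKILL) in the last 5 minutes
        expr: increase(job_completion_total{job_exit_code="137"}[5m]) > 0
        labels:
          severity: critical
```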
Queue health
Queue backing up
- Datadog: `avg:queue.items.size{queue.name:default}`
- Prometheus: `queue_items_size{queue_name="default"}`
- Generic: `queue.items.size WHERE queue.name == "default"`
Base this on how many schedules run concurrently or near-simultaneously. If 50 reports are all scheduled for midnight, they’ll hit the queue at the same time and a queue size of 50 at that moment is expected. Alert when queue size exceeds the peak concurrency your scheduling patterns can explain.
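One way to encode "exceeds explainable peak concurrency" is a threshold plus a `for:` clause that rides out normal scheduling bursts. In this Prometheus sketch, both the threshold of 50 and the 15-minute hold are placeholders to derive from your own schedules:

```yaml
groups:
  - name: connect-queue-depth
    rules:
      - alert: QueueBackedUp
        # 50 = the largest queue depth your scheduling patterns can explain
        expr: queue_items_size{queue_name="default"} > 50
        # Hold for 15m so a scheduling burst that drains normally does not page
        for: 15m
        labels:
          severity: critical
```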
Stuck job
- Datadog: `avg:queue.items.active_duration{queue.name:default}`
- Prometheus: `queue_items_active_duration_seconds{queue_name="default"}`
- Generic: `queue.items.active_duration WHERE queue.name == "default"`
Set this above the P95 job duration for your workload. If your longest-running content typically finishes in 5 minutes, alerting at 10-15 minutes is reasonable. If you have ETL jobs that run for 30 minutes, set the threshold accordingly.
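A Prometheus sketch of this rule; the 900-second threshold is a placeholder that assumes a workload whose longest jobs normally finish in about 5 minutes:

```yaml
groups:
  - name: connect-stuck-jobs
    rules:
      - alert: JobStuck
        # 900s assumes a ~5-minute P95; set this above your own P95 duration
        expr: queue_items_active_duration_seconds{queue_name="default"} > 900
        labels:
          severity: critical
```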
Queue stale
- Datadog: `avg:queue.items.age{queue.name:default}`
- Prometheus: `queue_items_age_seconds{queue_name="default"}`
- Generic: `queue.items.age WHERE queue.name == "default"`
This measures how long items wait before processing starts. If workers are healthy, items should begin processing quickly. Alert when the oldest item has been waiting significantly longer than your typical queue processing time.
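In Datadog terms, the same idea can be expressed as a metric monitor query; the 300-second threshold here is a placeholder to replace with your typical queue processing time:

```
avg(last_5m):avg:queue.items.age{queue.name:default} > 300
```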
License capacity
Named user utilization:
- Datadog:
  - Warning: `avg:license.users.current{*} / avg:license.users.limit{*} > 0.8` (80%)
  - Critical: `avg:license.users.current{*} / avg:license.users.limit{*} > 0.9` (90%)
- Prometheus:
  - Warning: `license_users_current / license_users_limit > 0.8` (80%)
  - Critical: `license_users_current / license_users_limit > 0.9` (90%)
- Generic:
  - Warning: `license.users.current / license.users.limit > 0.8` (80%)
  - Critical: `license.users.current / license.users.limit > 0.9` (90%)
Concurrent Shiny user utilization:
- Datadog:
  - Warning: `avg:license.shiny_users.current{*} / avg:license.shiny_users.limit{*} > 0.8` (80%)
  - Critical: `avg:license.shiny_users.current{*} / avg:license.shiny_users.limit{*} > 0.9` (90%)
- Prometheus:
  - Warning: `license_shiny_users_current / license_shiny_users_limit > 0.8` (80%)
  - Critical: `license_shiny_users_current / license_shiny_users_limit > 0.9` (90%)
- Generic:
  - Warning: `license.shiny_users.current / license.shiny_users.limit > 0.8` (80%)
  - Critical: `license.shiny_users.current / license.shiny_users.limit > 0.9` (90%)
License expiration:
- Datadog:
  - Warning: `avg:license.expiration.days_remaining{*} < 30`
  - Critical: `avg:license.expiration.days_remaining{*} < 7`
- Prometheus:
  - Warning: `license_expiration_days_remaining < 30`
  - Critical: `license_expiration_days_remaining < 7`
- Generic:
  - Warning: `license.expiration.days_remaining < 30`
  - Critical: `license.expiration.days_remaining < 7`
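The license thresholds above translate directly into rules. This Prometheus sketch pairs a warning and a critical severity for expiration; alert names are placeholders, and the capacity rule at 0.8 can be duplicated at 0.9 for a critical tier:

```yaml
groups:
  - name: connect-license
    rules:
      - alert: LicenseExpiringSoon
        expr: license_expiration_days_remaining < 30
        labels:
          severity: warning
      - alert: LicenseExpiringImminently
        expr: license_expiration_days_remaining < 7
        labels:
          severity: critical
      - alert: NamedUserCapacityHigh
        # 80% warning threshold from above
        expr: license_users_current / license_users_limit > 0.8
        labels:
          severity: warning
```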
Warning alerts
Elevated failures
- Datadog: `sum:job.completion{job.status:failure}.as_rate()`
- Prometheus: `rate(job_completion_total{job_status="failure"}[5m])`
- Generic: `rate(job.completion WHERE job.status == "failure") over 5m`
Set this below your critical threshold but above the baseline. Use it to catch gradual degradation before it becomes critical.
Queue depth growing
- Datadog: `avg:queue.items.size{queue.name:default}` (e.g., with a change monitor comparing against the value 30 minutes ago)
- Prometheus: `delta(queue_items_size{queue_name="default"}[30m])` (`delta`, not `increase`, since this metric is a gauge)
- Generic: `increase(queue.items.size WHERE queue.name == "default") over 30m`
A sustained increase over 30 minutes outside of known scheduling windows suggests jobs are arriving faster than they can be processed. This is an early signal to investigate capacity before the queue becomes critically backed up.
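A sketch of this warning in Prometheus terms, using `delta` (the gauge counterpart of `increase`); the growth threshold of 10 items is a placeholder:

```yaml
groups:
  - name: connect-queue-growth
    rules:
      - alert: QueueDepthGrowing
        # Queue grew by more than 10 items over the last 30 minutes
        expr: delta(queue_items_size{queue_name="default"}[30m]) > 10
        # Require the growth to be sustained before alerting
        for: 30m
        labels:
          severity: warning
```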