Job queue operations guide
This guide provides metrics, query patterns, and troubleshooting workflows for monitoring job queue health using OpenTelemetry signals.
Overview
This guide addresses operational questions focused on job queue health and content execution:
- Are scheduled jobs running? - Job execution health, worker availability, and timeout monitoring
- Why is this scheduled job taking longer than usual? - Job duration analysis, queue wait times, and trace-based investigation
- Is the job queue backing up? - Queue size and age monitoring, worker capacity, and backup detection
Are scheduled jobs running?
This question addresses job execution health: Are scheduled jobs completing? Are workers available to process the queue? Are jobs timing out or failing?
Job execution signals to check
Job completions
Question: How many jobs have completed, and what are their outcomes?
Primary Metric: job.duration (Histogram, seconds)
Dimensions:
job.status: Job outcome (success, failure)
Query Pattern:
Datadog:
sum:job.duration.count{*} by {job.status}.as_count().rollup(sum, 300)
Prometheus:
sum by (job_status) (increase(job_duration_seconds_count[5m]))
Generic:
COUNT(job.duration) GROUP BY job.status
Interpretation:
Use the distribution of job outcomes to understand processing patterns. Rising failure counts may warrant investigation. Compare counts across statuses over time to identify trends.
Timeouts and cancellations are tracked by separate counters (job.timeout and job.cancelled) rather than as job.duration status values.
Usage: Display as a stacked bar chart or grouped count by status. Use this to quickly see the distribution of job outcomes and identify if failures are increasing. Combine with job.timeout and job.cancelled counters for the full picture.
Schedule failure tracking
Question: Are scheduled reports failing to execute?
Primary Metric: schedule.count (Gauge)
Dimensions:
schedule.status: Status (queued, running, success, failure)
Query Pattern:
Datadog:
sum:schedule.count{schedule.status:failure}
Prometheus:
schedule_count{schedule_status="failure"}
Generic:
schedule.count WHERE schedule.status == "failure"
Interpretation:
- Rising failure count indicates issues with scheduled content
- Common causes: locked content, authentication failures, resource exhaustion
Usage: Track failure counts over time and alert on sustained increases. See Is Connect healthy right now? for additional details on schedule health.
Worker pool utilization
Question: How busy are the worker pools?
Primary Metric: worker.pool.utilization (Gauge, ratio 0-1)
Dimensions:
application.type: Application type (e.g., shiny, python_dash, rmd)
Query Pattern:
Datadog:
avg:worker.pool.utilization{*} by {application.type}
Prometheus:
avg by (application_type) (worker_pool_utilization)
Generic:
worker.pool.utilization GROUP BY application.type
Interpretation:
Rising utilization, or values near 1, indicates worker pool exhaustion.
Usage: Display as a gauge per application type. Sustained high utilization indicates the need for more worker capacity or investigation into long-running jobs.
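The saturation check described above can be sketched in a few lines. This is an illustrative example only: the function names and the sample busy/size counts are assumptions, not part of the product's API, and the 0.9 threshold is a placeholder you would tune for your workload.

```python
# Hypothetical sketch: derive per-pool utilization (busy / size) and flag
# pools nearing exhaustion. Sample numbers are illustrative only.
SATURATION_THRESHOLD = 0.9  # flag pools that are >=90% busy

def pool_utilization(busy: dict, size: dict) -> dict:
    """Return a utilization ratio (0-1) per application type."""
    return {app: busy[app] / size[app] for app in size if size[app] > 0}

def saturated_pools(utilization: dict, threshold: float = SATURATION_THRESHOLD) -> list:
    """Application types whose utilization meets or exceeds the threshold."""
    return sorted(app for app, u in utilization.items() if u >= threshold)

busy = {"shiny": 9, "python_dash": 2, "rmd": 4}
size = {"shiny": 10, "python_dash": 8, "rmd": 4}

util = pool_utilization(busy, size)
print(util)                   # {'shiny': 0.9, 'python_dash': 0.25, 'rmd': 1.0}
print(saturated_pools(util))  # ['rmd', 'shiny']
```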
Job timeout and cancellation monitoring
Question: Are jobs timing out or being cancelled before completion?
Primary Metrics: job.timeout (Counter), job.cancelled (Counter)
Dimensions:
job.cancel.reason: Why the job was cancelled (user, app_deleted, variant_removed)
Query Pattern:
Datadog:
# Job timeouts
sum:job.timeout.count{*}.as_rate()
# Job cancellations by reason
sum:job.cancelled.count{*} by {job.cancel.reason}.as_rate()
Prometheus:
# Job timeouts
rate(job_timeout_total[5m])
# Job cancellations by reason
sum by (job_cancel_reason) (rate(job_cancelled_total[5m]))
Generic:
# Job timeouts
RATE(job.timeout) over 5 minutes
# Job cancellations by reason
RATE(job.cancelled) GROUP BY job.cancel.reason over 5 minutes
Interpretation:
- Any non-zero timeout rate indicates jobs are exceeding their time limits. Common causes: slow external dependencies, resource contention, misconfigured timeout settings.
- Cancellations by reason:
- user — A user manually killed the job
- app_deleted — The application was deleted while the job was running
- variant_removed — The content variant was removed
Usage: Alert on elevated timeout or cancellation rates. Investigate individual job traces to identify what’s causing slow execution or unexpected cancellations.
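A counter rate like the queries above is just the increase in the counter over a window, divided by the window length. The simplified sketch below illustrates that for cancellations by reason; the function name and snapshot values are hypothetical, and unlike PromQL's rate() it does not handle counter resets.

```python
# Illustrative sketch: per-reason cancellation rates from two snapshots of a
# monotonic counter. Simplified: ignores counter resets, unlike rate().

def counter_rates(earlier: dict, later: dict, interval_s: float) -> dict:
    """Per-second increase for each label value over the interval."""
    return {
        reason: (later.get(reason, 0) - earlier.get(reason, 0)) / interval_s
        for reason in later
    }

# Hypothetical job.cancelled snapshots taken 5 minutes apart
t0 = {"user": 40, "app_deleted": 3, "variant_removed": 1}
t1 = {"user": 46, "app_deleted": 3, "variant_removed": 2}

rates = counter_rates(t0, t1, interval_s=300)
print(rates["user"])  # 0.02 cancellations per second
```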
Common causes of job execution failures
When jobs are not running as expected, investigate these areas:
Worker pool exhaustion: All workers are busy with long-running jobs. Check worker.pool.busy relative to worker.pool.size.
Database issues: Connection pool exhaustion can prevent job state updates. See Is Connect healthy right now? and Why is Connect slow right now? for database health checks.
Content problems: Individual content items may be locked, have authentication issues, or missing dependencies.
Infrastructure: Kubernetes launcher issues (if using off-host execution), storage I/O problems, or network connectivity.
Why is this scheduled job taking longer than usual?
This question addresses job performance: How long are jobs taking? Where is time being spent? Is slowness due to queue wait time or execution time?
Job duration analysis
Job duration percentiles
Question: How long are jobs taking to execute?
Primary Metric: job.duration (Histogram, seconds)
Dimensions:
job.status: Job outcome (success, failure)
Query Pattern:
Datadog:
sum:job.duration.bucket{job.status:success} by {upper_bound}.as_count()
Prometheus:
histogram_quantile(0.50, rate(job_duration_seconds_bucket{job_status="success"}[1h]))
histogram_quantile(0.95, rate(job_duration_seconds_bucket{job_status="success"}[1h]))
Generic:
P50 of (job.duration WHERE job.status == "success") over 1 hour
P95 of (job.duration WHERE job.status == "success") over 1 hour
Interpretation:
- Compare current P50/P95 to historical baselines
- Large gaps between P50 and P95 indicate inconsistent performance
- Rising percentiles indicate gradual degradation
Usage: Display P50 and P95 as time-series graphs. Compare to the same time period in previous days/weeks to identify trends.
Datadog histogram configuration: By default, Datadog may only compute a limited set of percentiles for histogram metrics. To enable P50, P95, and other percentiles for job.duration, configure your Datadog agent or use the histogram aggregation settings in the Datadog UI. See Datadog’s documentation on Distribution metrics for details.
Histogram buckets: The job.duration histogram uses the following bucket boundaries (in seconds): [1, 5, 15, 30, 60, 300, 900, 1800, 3600, 10800, 21600, 43200, 86400]. This provides granularity from 1-second jobs up to 24-hour jobs.
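To make the percentile queries above concrete, the sketch below shows how a quantile is estimated from cumulative bucket counts using the documented job.duration boundaries, with linear interpolation inside the bucket containing the target rank (the same idea as Prometheus's histogram_quantile()). The function name and the sample counts are illustrative assumptions.

```python
# Sketch: estimate a percentile from cumulative histogram bucket counts,
# using the job.duration bucket boundaries documented above.
BOUNDS = [1, 5, 15, 30, 60, 300, 900, 1800, 3600, 10800, 21600, 43200, 86400]

def estimate_quantile(q: float, cumulative_counts: list, bounds: list = BOUNDS) -> float:
    """Linear interpolation within the bucket containing rank q,
    similar to Prometheus's histogram_quantile(). Assumes the last
    cumulative count equals the total (+Inf) observation count."""
    total = cumulative_counts[-1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in zip(bounds, cumulative_counts):
        if count >= rank:
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + fraction * (bound - prev_bound)
        prev_bound, prev_count = bound, count
    return bounds[-1]  # rank falls in the open-ended +Inf bucket

# Hypothetical cumulative counts for buckets le=1, le=5, le=15, ...
counts = [10, 40, 70, 85, 95, 100, 100, 100, 100, 100, 100, 100, 100]
print(estimate_quantile(0.50, counts))  # ~8.33 s, inside the 5-15 s bucket
print(estimate_quantile(0.95, counts))  # 60.0 s
```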
Debugging
When metrics indicate a specific job is slow, use traces to understand where time is being spent.
1. Find the job trace: Filter traces by job.key or content.guid in your APM tool’s trace explorer.
2. Examine the queue.item.process span for timing breakdown:
   - queue.wait_time.ms: Time waiting in queue before processing started
   - queue.processing_duration.ms: Time spent executing the job
   - queue.name: Which queue processed the item
   - queue.item.type: Type of work performed
3. Examine execution phase spans to see lifecycle timing:
   - worker.provision / worker.process.startup: Worker creation and startup timing
   - report.execute / report.setup: Report execution phases (use the report.type attribute to identify format)
   - deploy.contentLaunch: Deployment timing
4. Compare to fast jobs: Find traces for the same content when it ran quickly and compare span durations to identify timing differences.
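The comparison against a fast run can be sketched as a simple diff of span durations. The span names follow this guide, but the function name and the millisecond values are illustrative assumptions, not data from a real trace.

```python
# Hypothetical sketch: diff span durations (ms) between a slow trace and a
# fast baseline of the same content, to see which phase regressed.

def span_regressions(slow: dict, fast: dict) -> list:
    """Spans sorted by how much extra time the slow trace spent in them."""
    deltas = {name: slow.get(name, 0) - fast.get(name, 0) for name in slow}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)

# Made-up durations for illustration
slow = {"worker.provision": 400, "report.execute": 95000, "report.setup": 1200}
fast = {"worker.provision": 350, "report.execute": 8000, "report.setup": 1100}

for name, delta in span_regressions(slow, fast):
    print(f"{name}: +{delta} ms")
# report.execute dominates, pointing at the execution phase itself
```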
For assistance with trace analysis, contact Posit Support.
See the signal reference guide for a complete list of available spans and their attributes.
Is the job queue backing up?
This question addresses queue health: Is the queue growing faster than it’s being processed? How old are items waiting in the queue? Are jobs being cancelled due to backlog?
Queue backup detection
Queue size monitoring
Question: How many items are waiting in each queue?
Primary Metric: queue.items.size (Gauge)
Dimensions:
queue.name: Queue identifier (default, git, memberships, job-finalizer)
Query Pattern:
Datadog:
avg:queue.items.size{*} by {queue.name}
Prometheus:
avg by (queue_name) (queue_items_size)
Generic:
queue.items.size GROUP BY queue.name
Interpretation:
Queue size reflects the number of pending items. Growing values over time may indicate the queue is not draining as fast as items are added. Spikes can occur during burst activity.
Usage: Display as a time-series graph per queue. Establish baseline values for your workload to identify when queue size warrants attention.
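One way to distinguish the sustained growth described above from a harmless burst is to require a consistent upward trend across several samples before flagging. The sketch below is a minimal illustration; the function name, threshold, and sample data are all assumptions to tune against your own baseline.

```python
# Sketch: flag a queue whose size shows sustained growth across recent
# samples, rather than alerting on a single spike. Values are illustrative.

def is_backing_up(samples: list, min_growth: int = 10) -> bool:
    """True if queue size never decreased across the samples and net
    growth exceeds min_growth items."""
    monotonic = all(b >= a for a, b in zip(samples, samples[1:]))
    return monotonic and (samples[-1] - samples[0]) > min_growth

print(is_backing_up([5, 9, 14, 22, 31]))  # True: steady growth of 26 items
print(is_backing_up([5, 40, 6, 4, 5]))    # False: a burst that drained
```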
Queue age monitoring
Question: How long has the oldest item been waiting?
Primary Metric: queue.items.age (Gauge, seconds)
Dimensions:
queue.name: Queue identifier
Query Pattern:
Datadog:
avg:queue.items.age{*} by {queue.name}
Prometheus:
avg by (queue_name) (queue_items_age_seconds)
Generic:
queue.items.age GROUP BY queue.name
Interpretation:
This metric shows how long the oldest item has been waiting in the queue. Compare to your environment’s typical baseline to identify unusual delays.
Usage: Display as a gauge per queue. Establish baseline values for your workload to determine when queue age warrants attention.
Queue drain time
Question: How long will it take to clear the current queue backlog?
Primary Metrics:
- queue.items.size (Gauge) - Current number of items in queue
- job.duration.count (Counter) - Number of completed jobs
Query Pattern:
Datadog:
avg:queue.items.size{*} / clamp_min(job.duration.count{*}.as_rate(), 0.0001)
Prometheus:
queue_items_size / clamp_min(rate(job_duration_seconds_count[5m]), 0.0001)
Generic:
queue.items.size / RATE(job.duration.count) over 5 minutes
Interpretation:
This estimates how long it would take to process all items currently in the queue at the current job completion rate.
Usage: Display as a time-series showing the estimated seconds to drain the queue. Rising drain time indicates the queue is backing up faster than jobs are completing. Spikes are normal when queue size spikes, but sustained increases may indicate a problem.
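The drain-time estimate above reduces to queue size divided by completion rate, with a floor on the rate so an idle period does not divide by zero. The sketch below illustrates the arithmetic; the function name and sample values are hypothetical.

```python
# Sketch of the drain-time estimate: queue size / job completion rate,
# with the rate clamped to a small floor (as clamp_min does in the query).

def drain_time_seconds(queue_size: int, completions: int, window_s: float,
                       min_rate: float = 0.0001) -> float:
    """Estimated seconds to clear the queue at the current completion rate."""
    rate = max(completions / window_s, min_rate)  # jobs per second, clamped
    return queue_size / rate

# 120 queued items, 60 jobs completed in the last 5 minutes (0.2 jobs/s)
print(drain_time_seconds(120, completions=60, window_s=300))  # 600.0 seconds
```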
Common causes and remediation
When queues are backing up, investigate these areas:
Insufficient worker capacity: Add more workers or increase concurrency limits. Check worker.pool.utilization to confirm saturation.
Slow job execution: Jobs taking longer than expected reduce throughput. See Why is this scheduled job taking longer than usual? for job duration analysis.
Burst activity: Many jobs scheduled for the same time. Consider staggering schedules or implementing job throttling.
External dependencies: Slow database, network storage, or external APIs affect all jobs. Check Why is Connect slow right now? for infrastructure bottlenecks.
Stuck workers: Individual workers stuck on problematic jobs. Check for jobs with very long durations in traces.