Job queue operations guide

This guide provides metrics, query patterns, and troubleshooting workflows for monitoring job queue health using OpenTelemetry signals.

Overview

This guide addresses operational questions focused on job queue health and content execution:

  • Are scheduled jobs running? - Job execution health, worker availability, and timeout monitoring
  • Why is this scheduled job taking longer than usual? - Job duration analysis, queue wait times, and trace-based investigation
  • Is the job queue backing up? - Queue size and age monitoring, worker capacity, and backup detection

Are scheduled jobs running?

This question addresses job execution health: Are scheduled jobs completing? Are workers available to process the queue? Are jobs timing out or failing?

Job execution signals to check

Job completions

Question: How many jobs have completed, and what are their outcomes?

Primary Metric: job.duration (Histogram, seconds)

Dimensions:

  • job.status: Job outcome (success, failure)

Query Pattern:

# Datadog
sum:job.duration.count{*} by {job.status}.as_count().rollup(sum, 300)

# Prometheus
sum by (job_status) (increase(job_duration_seconds_count[5m]))

# Generic
COUNT(job.duration) GROUP BY job.status

Interpretation:

Use the distribution of job outcomes to understand processing patterns. Rising failure counts may warrant investigation. Compare counts across statuses over time to identify trends.

Note

Timeouts and cancellations are tracked by separate counters (job.timeout and job.cancelled) rather than as job.duration status values.

Usage: Display as a stacked bar chart or grouped count by status. Use this to quickly see the distribution of job outcomes and identify if failures are increasing. Combine with job.timeout and job.cancelled counters for the full picture.
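The "full picture" combination above can be sketched in Python. This is a minimal illustration: the status list and counter arguments are hypothetical stand-ins for whatever your metrics backend exports.

```python
from collections import Counter

def outcome_summary(job_statuses, timeouts=0, cancellations=0):
    """Tally job.duration outcomes by status, then fold in the separate
    job.timeout and job.cancelled counters (tracked apart from
    job.duration, as noted above)."""
    summary = Counter(job_statuses)
    summary["timeout"] += timeouts
    summary["cancelled"] += cancellations
    return summary

# Hypothetical exported status values for one 5-minute window
statuses = ["success", "success", "failure"]
print(outcome_summary(statuses, timeouts=1))
```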


Schedule failure tracking

Question: Are scheduled reports failing to execute?

Primary Metric: schedule.count (Gauge)

Dimensions:

  • schedule.status: Status (queued, running, success, failure)

Query Pattern:

# Datadog
sum:schedule.count{schedule.status:failure}

# Prometheus
schedule_count{schedule_status="failure"}

# Generic
schedule.count WHERE schedule.status == "failure"

Interpretation:

  • Rising failure count indicates issues with scheduled content
  • Common causes: locked content, authentication failures, resource exhaustion

Usage: Track failure counts over time and alert on sustained increases. See Is Connect healthy right now? for additional details on schedule health.


Worker pool utilization

Question: How busy are the worker pools?

Primary Metric: worker.pool.utilization (Gauge, ratio 0-1)

Dimensions:

  • application.type: Application type (e.g., shiny, python_dash, rmd)

Query Pattern:

# Datadog
avg:worker.pool.utilization{*} by {application.type}

# Prometheus
avg by (application_type) (worker_pool_utilization)

# Generic
worker.pool.utilization GROUP BY application.type

Interpretation:

Rising utilization, or values close to 1, indicates worker pool exhaustion: few or no idle workers remain to pick up new jobs.

Usage: Display as a gauge per application type. Sustained high utilization indicates the need for more worker capacity or investigation into long-running jobs.
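As a rough sketch, utilization can be derived from the worker.pool.busy and worker.pool.size metrics referenced later in this guide. The 0.9 saturation threshold here is an illustrative assumption, not a documented recommendation.

```python
def pool_utilization(busy, size):
    """Utilization as busy workers / pool size (ratio 0-1); returns 0.0
    for an empty pool to avoid division by zero."""
    return busy / size if size else 0.0

def is_saturated(utilization, threshold=0.9):
    """Flag a pool nearing exhaustion; the 0.9 threshold is an
    illustrative default -- tune it to your workload."""
    return utilization >= threshold

print(pool_utilization(9, 10))                 # 0.9
print(is_saturated(pool_utilization(9, 10)))   # True
```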


Job timeout and cancellation monitoring

Question: Are jobs timing out or being cancelled before completion?

Primary Metrics: job.timeout (Counter), job.cancelled (Counter)

Dimensions:

  • job.cancel.reason: Why the job was cancelled (user, app_deleted, variant_removed)

Query Pattern:

# Datadog
# Job timeouts
sum:job.timeout.count{*}.as_rate()
# Job cancellations by reason
sum:job.cancelled.count{*} by {job.cancel.reason}.as_rate()

# Prometheus
# Job timeouts
rate(job_timeout_total[5m])
# Job cancellations by reason
sum by (job_cancel_reason) (rate(job_cancelled_total[5m]))

# Generic
# Job timeouts
RATE(job.timeout) over 5 minutes
# Job cancellations by reason
RATE(job.cancelled) GROUP BY job.cancel.reason over 5 minutes

Interpretation:

  • Any non-zero timeout rate indicates jobs are exceeding their time limits. Common causes: slow external dependencies, resource contention, misconfigured timeout settings.
  • Cancellations by reason:
    • user — A user manually killed the job
    • app_deleted — The application was deleted while the job was running
    • variant_removed — The content variant was removed

Usage: Alert on elevated timeout or cancellation rates. Investigate individual job traces to identify what’s causing slow execution or unexpected cancellations.
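For intuition, the counter rates queried above come from two cumulative samples over a window. This minimal sketch mirrors what PromQL's rate() computes, with simplistic counter-reset handling; the sample values are hypothetical.

```python
def counter_rate(prev_value, curr_value, interval_seconds):
    """Per-second rate from two cumulative counter samples, roughly what
    PromQL's rate() does over a window (simplified reset handling)."""
    if curr_value < prev_value:
        # Counter reset (e.g. process restart): treat the new value as the delta
        return curr_value / interval_seconds
    return (curr_value - prev_value) / interval_seconds

# e.g. job.timeout went from 40 to 46 over a 5-minute (300 s) window
print(counter_rate(40, 46, 300))  # 0.02 timeouts/second
```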


Common causes of job execution failures

When jobs are not running as expected, investigate these areas:

  1. Worker pool exhaustion: All workers are busy with long-running jobs. Check worker.pool.busy relative to worker.pool.size.

  2. Database issues: Connection pool exhaustion can prevent job state updates. See Is Connect healthy right now? and Why is Connect slow right now? for database health checks.

  3. Content problems: Individual content items may be locked, have authentication issues, or be missing dependencies.

  4. Infrastructure: Kubernetes launcher issues (if using off-host execution), storage I/O problems, or network connectivity.


Why is this scheduled job taking longer than usual?

This question addresses job performance: How long are jobs taking? Where is time being spent? Is slowness due to queue wait time or execution time?

Job duration analysis

Job duration percentiles

Question: How long are jobs taking to execute?

Primary Metric: job.duration (Histogram, seconds)

Dimensions:

  • job.status: Job outcome (success, failure)

Query Pattern:

# Datadog
sum:job.duration.bucket{job.status:success} by {upper_bound}.as_count()

# Prometheus (P50 and P95)
histogram_quantile(0.50, rate(job_duration_seconds_bucket{job_status="success"}[1h]))
histogram_quantile(0.95, rate(job_duration_seconds_bucket{job_status="success"}[1h]))

# Generic
P50 of (job.duration WHERE job.status == "success") over 1 hour
P95 of (job.duration WHERE job.status == "success") over 1 hour

Interpretation:

  • Compare current P50/P95 to historical baselines
  • Large gaps between P50 and P95 indicate inconsistent performance
  • Rising percentiles indicate gradual degradation

Usage: Display P50 and P95 as time-series graphs. Compare to the same time period in previous days/weeks to identify trends.

Note

Datadog histogram configuration: By default, Datadog may only compute a limited set of percentiles for histogram metrics. To enable P50, P95, and other percentiles for job.duration, configure your Datadog agent or use the histogram aggregation settings in the Datadog UI. See Datadog’s documentation on Distribution metrics for details.

Histogram buckets: The job.duration histogram uses the following bucket boundaries (in seconds): [1, 5, 15, 30, 60, 300, 900, 1800, 3600, 10800, 21600, 43200, 86400]. This provides granularity from 1-second jobs up to 24-hour jobs.
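The percentile estimates above come from interpolating within cumulative bucket counts. A minimal sketch of that interpolation, in the spirit of PromQL's histogram_quantile (the cumulative counts shown are hypothetical):

```python
def estimate_quantile(q, buckets):
    """Approximate a quantile from cumulative histogram buckets using
    linear interpolation, as PromQL's histogram_quantile does.
    `buckets` is a sorted list of (upper_bound, cumulative_count)
    pairs ending with an infinite bound."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the +Inf bucket
            # Linear interpolation within this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical cumulative counts for the first job.duration buckets (seconds)
buckets = [(1, 10), (5, 40), (15, 70), (30, 90), (60, 100), (float("inf"), 100)]
print(round(estimate_quantile(0.50, buckets), 2))  # P50 ≈ 8.33 s
```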


Debugging

When metrics indicate a specific job is slow, use traces to understand where time is being spent.

  1. Find the job trace: Filter traces by job.key or content.guid in your APM tool’s trace explorer.

  2. Examine the queue.item.process span for timing breakdown:

    • queue.wait_time.ms: Time waiting in queue before processing started
    • queue.processing_duration.ms: Time spent executing the job
    • queue.name: Which queue processed the item
    • queue.item.type: Type of work performed
  3. Examine execution phase spans to see lifecycle timing:

    • worker.provision / worker.process.startup: Worker creation and startup timing
    • report.execute / report.setup: Report execution phases (use report.type attribute to identify format)
    • deploy.contentLaunch: Deployment timing
  4. Compare to fast jobs: Find traces for the same content when it ran quickly and compare span durations to identify timing differences.
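The wait-versus-execution split in step 2 can be sketched from the queue.item.process span attributes listed above (the attribute values here are hypothetical):

```python
def queue_time_breakdown(span_attrs):
    """Fraction of total time spent waiting in the queue, from the
    queue.wait_time.ms and queue.processing_duration.ms attributes
    of a queue.item.process span."""
    wait = span_attrs["queue.wait_time.ms"]
    processing = span_attrs["queue.processing_duration.ms"]
    total = wait + processing
    return wait / total if total else 0.0

# Hypothetical span: most time was queue wait, not execution
span = {"queue.wait_time.ms": 4500, "queue.processing_duration.ms": 500}
print(queue_time_breakdown(span))  # 0.9 -> 90% of the time was queue wait
```

A high waiting fraction points at queue backup rather than slow job execution, which shifts the investigation to the queue-health signals below.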

For assistance with trace analysis, contact Posit Support.

See the signal reference guide for a complete list of available spans and their attributes.


Is the job queue backing up?

This question addresses queue health: Is the queue growing faster than it’s being processed? How old are items waiting in the queue? Are jobs being cancelled due to backlog?

Queue backup detection

Queue size monitoring

Question: How many items are waiting in each queue?

Primary Metric: queue.items.size (Gauge)

Dimensions:

  • queue.name: Queue identifier (default, git, memberships, job-finalizer)

Query Pattern:

# Datadog
avg:queue.items.size{*} by {queue.name}

# Prometheus
avg by (queue_name) (queue_items_size)

# Generic
queue.items.size GROUP BY queue.name

Interpretation:

Queue size reflects the number of pending items. Growing values over time may indicate the queue is not draining as fast as items are added. Spikes can occur during burst activity.

Usage: Display as a time-series graph per queue. Establish baseline values for your workload to identify when queue size warrants attention.
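One way to turn the time-series into a backup signal is to compute the slope of recent queue.items.size samples. A minimal least-squares sketch; the sample values and interval are hypothetical:

```python
def queue_growth_rate(samples):
    """Least-squares slope (items per second) of queue.items.size
    samples; `samples` is a list of (timestamp_seconds, size) pairs.
    A sustained positive slope suggests the queue is not draining
    as fast as items are added."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_s = sum(s for _, s in samples) / n
    num = sum((t - mean_t) * (s - mean_s) for t, s in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den if den else 0.0

samples = [(0, 10), (60, 20), (120, 30), (180, 40)]  # +10 items/minute
print(round(queue_growth_rate(samples), 3))  # 0.167 items/second
```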


Queue age monitoring

Question: How long has the oldest item been waiting?

Primary Metric: queue.items.age (Gauge, seconds)

Dimensions:

  • queue.name: Queue identifier

Query Pattern:

# Datadog
avg:queue.items.age{*} by {queue.name}

# Prometheus
avg by (queue_name) (queue_items_age_seconds)

# Generic
queue.items.age GROUP BY queue.name

Interpretation:

This metric shows how long the oldest item has been waiting in the queue. Compare to your environment’s typical baseline to identify unusual delays.

Usage: Display as a gauge per queue. Establish baseline values for your workload to determine when queue age warrants attention.


Queue drain time

Question: How long will it take to clear the current queue backlog?

Primary Metrics:

  • queue.items.size (Gauge) - Current number of items in queue
  • job.duration.count (Counter) - Number of completed jobs

Query Pattern:

# Datadog
avg:queue.items.size{*} / clamp_min(job.duration.count{*}.as_rate(), 0.0001)

# Prometheus
queue_items_size / clamp_min(rate(job_duration_seconds_count[5m]), 0.0001)

# Generic
queue.items.size / RATE(job.duration.count) over 5 minutes

Interpretation:

An estimate of how long it would take to process all items currently in the queue at the current job completion rate.

Usage: Display as a time-series showing the estimated seconds to drain the queue. Rising drain time indicates the queue is backing up faster than jobs are completing. Brief spikes alongside queue-size spikes are normal, but sustained or significant increases may indicate a problem.
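The drain-time estimate maps directly to a small helper; the floor mirrors the clamp_min in the queries above, guarding against division by zero when no jobs are completing.

```python
def drain_time_seconds(queue_size, completion_rate, rate_floor=0.0001):
    """Estimated seconds to clear the current backlog at the current
    job completion rate (jobs/second). The floor matches the
    clamp_min(…, 0.0001) used in the dashboard queries."""
    return queue_size / max(completion_rate, rate_floor)

# 120 queued items, jobs completing at 0.5/s -> 240 s to drain
print(drain_time_seconds(120, 0.5))  # 240.0
```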


Common causes and remediation

When queues are backing up, investigate these areas:

  1. Insufficient worker capacity: Add more workers or increase concurrency limits. Check worker.pool.utilization to confirm saturation.

  2. Slow job execution: Jobs taking longer than expected reduce throughput. See Why is this scheduled job taking longer than usual? for job duration analysis.

  3. Burst activity: Many jobs scheduled for the same time. Consider staggering schedules or implementing job throttling.

  4. External dependencies: Slow database, network storage, or external APIs affect all jobs. Check Why is Connect slow right now? for infrastructure bottlenecks.

  5. Stuck workers: Individual workers stuck on problematic jobs. Check for jobs with very long durations in traces.