# Content operations guide
This guide provides metrics, query patterns, and troubleshooting workflows for monitoring content execution and job queue health using OpenTelemetry signals.
## Overview
This guide addresses operational questions focused on content execution:
- **Which content is using the most resources?** Use queue metrics to identify resource pressure and correlate with external host metrics.
- **How do I route content failures to owners?** Use owner context in job failure signals for alerting.
- **Why did this content fail?** Investigate job failures using metrics, logs, and traces.
These questions help operations teams monitor job queues, detect stuck jobs, investigate content failures, and route alerts to content owners.
## Which content is using the most resources?
Use job metrics and queue metrics to identify resource pressure, then correlate with external host monitoring tools for detailed resource analysis.
### Job pressure
Job completion and duration metrics reveal which content is consuming the most execution time and how overall job throughput is trending.
Primary Metrics:

- `job.completion` (Counter) - Job completion count by status
- `job.duration` (Histogram, seconds) - Job execution duration by status

Dimensions:

- `job.status`: Status (`success` or `failure`)
- `job.exit_code`: Process exit code (when available)
Query Pattern:

Datadog:

```
# Overall job throughput
sum:job.completion{*}.as_rate()

# Long-running jobs by duration percentile
sum:job.duration.bucket{*} by {upper_bound}.as_count()

# Job failure ratio
sum:job.completion{job.status:failure}.as_rate() / sum:job.completion{*}.as_rate() * 100
```

Prometheus:

```
# Overall job throughput
rate(job_completion_total[5m])

# Long-running jobs by duration percentile
histogram_quantile(0.95, rate(job_duration_seconds_bucket[5m]))

# Job failure ratio
rate(job_completion_total{job_status="failure"}[5m]) / rate(job_completion_total[5m]) * 100
```

Generic:

```
# Overall job throughput
rate(job.completion) over 5m

# Long-running jobs by duration percentile
histogram_quantile(0.95, job.duration)

# Job failure ratio
rate(job.completion WHERE job.status == "failure") / rate(job.completion) over 5m
```
Interpretation:

- High `job.duration` P95 values indicate resource-intensive content
- Rising job throughput with stable queue depth means the system is keeping up
- Rising job throughput with growing queue depth means capacity is insufficient
### Queue health
If job duration is high or throughput is dropping, check whether the queue is backing up. See *Is the job queue backing up?* for queue size, age monitoring, and drain time estimation.
### Running processes and applications
These gauges show what’s currently consuming resources on the server.
Primary Metrics:

- `process.count` (UpDownCounter) - Current running process count
- `application.count` (Gauge) - Current running application count

Dimensions:

- `process.tag`: Process type (e.g., `run_shiny_app`, `run_dash_app`)
- `application.type`: Application type
Query Pattern:

Datadog:

```
# Running processes by type
sum:otel.process.count{*} by {process.tag}

# Running applications by type
sum:application.count{*} by {application.type}
```

Prometheus:

```
# Running processes by type
sum by (process_tag) (process_count)

# Running applications by type
sum by (application_type) (application_count)
```

Generic:

```
# Running processes by type
process.count GROUP BY process.tag

# Running applications by type
application.count GROUP BY application.type
```
Interpretation:

- High `process.count` for a specific type indicates that content type dominates resource usage
- Sudden drops in `application.count` may indicate crash loops
- Unexpected increases in `process.count` may indicate processes not terminating properly
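As a rough illustration of the crash-loop heuristic, a small sketch that flags sudden drops in a gauge series (the `threshold` fraction is an arbitrary assumption for illustration, not a Connect default):

```python
def sudden_drops(samples, threshold=0.5):
    """Return indices where a gauge fell by more than `threshold`
    (as a fraction) relative to the previous sample."""
    drops = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        if prev > 0 and (prev - cur) / prev > threshold:
            drops.append(i)
    return drops

# application.count sampled once per minute
print(sudden_drops([10, 10, 9, 3, 10, 4]))  # [3, 5]
```

A drop followed by a quick recovery, repeated over time, is the signature of a crash loop rather than a deliberate scale-down.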
### Host metrics
Connect does not emit host-level metrics (CPU, memory, disk) via OpenTelemetry. Use your platform’s existing infrastructure monitoring to correlate with Connect’s job and queue metrics. See the health and performance guide for detailed guidance on host metric monitoring, including NFS/EFS bottleneck diagnosis.
Correlation approach:

- Identify time ranges with high `job.duration` or growing `queue.items.size`
- Cross-reference with host CPU, memory, and disk I/O metrics from your infrastructure monitoring (e.g., node_exporter, CloudWatch, Azure Monitor)
- Use `job.hostname` from queue metrics to pinpoint which host is under pressure in multi-node deployments
### Kubernetes execution
When using off-host execution (Kubernetes launcher), per-process metrics are not available from Posit Connect. Instead:
- Use native Kubernetes metrics (metrics-server, kube-state-metrics)
- Correlate using `launcher.job.id` from Connect traces/logs
- Configure pod labels to include the content GUID for easier correlation (requires custom job template configuration)
## How do I route content failures to owners?
Connect emits `content_owner_guid` on job failure events. Use this field to route alerts to content owners via your observability platform.
### Owner context in failure events
Log event: `job.completed` (with `job_status == "failure"`)
Key fields:
| Field | Description |
|---|---|
| `content_guid` | Content identifier |
| `content_owner_guid` | Owner's user GUID (use with Connect API to resolve email) |
| `job_key` | Unique job tracking key |
| `job_tag` | Job type tag |
| `job_status` | `failure` |
| `error_type` | Error classification |
| `error_message` | Error details |
Query Pattern:

```
event == "job.completed" AND job_status == "failure" | fields content_guid, content_owner_guid, error_type
```
### Alert routing options
Option A: External alert router

1. Configure alerts on `job.completed` events where `job_status == "failure"` in your observability platform
2. The alert payload includes `content_owner_guid`
3. A router service calls the Connect API to resolve the owner: `GET /v1/users/{owner_guid}` returns email and other user details
4. Route to the owner's notification channel (email, Slack, PagerDuty)
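A hedged sketch of such a router. The `resolve_owner` and `notify` callables are hypothetical stand-ins for the Connect `GET /v1/users/{owner_guid}` call and your notification integration:

```python
def route_failure(event, resolve_owner, notify):
    """Route a job failure alert to the content owner's channel.

    `resolve_owner` stands in for a call to Connect's
    GET /v1/users/{owner_guid} endpoint; `notify` stands in for
    your email/Slack/PagerDuty integration.
    """
    if event.get("event") != "job.completed" or event.get("job_status") != "failure":
        return None  # only failed job completions are routed
    owner = resolve_owner(event["content_owner_guid"])
    message = (f"Content {event['content_guid']} failed: "
               f"{event.get('error_type', 'unknown error')}")
    notify(owner["email"], message)
    return owner["email"]

# Usage with stand-in resolver and notifier:
sent_to = route_failure(
    {"event": "job.completed", "job_status": "failure",
     "content_guid": "abc", "content_owner_guid": "u1",
     "error_type": "connection_timeout"},
    resolve_owner=lambda guid: {"email": "owner@example.com"},
    notify=lambda email, msg: None,
)
print(sent_to)  # owner@example.com
```

Keeping the resolver and notifier injectable makes the routing logic testable without a live Connect server or alerting backend.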
Option B: Team-based routing
Route by job tag rather than individual owner:
```
# Route Shiny failures to Shiny team
event == "job.completed" AND job_status == "failure" AND job_tag == "shiny" -> #shiny-team

# Route report failures to analytics team
event == "job.completed" AND job_status == "failure" AND job_tag IN ("rmd", "quarto", "jupyter") -> #analytics-team
```
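The same routing table can be sketched in a few lines of Python; the channel names and the `#content-ops` fallback are illustrative, not Connect conventions:

```python
# Channel names and the default fallback are illustrative assumptions.
TEAM_ROUTES = {
    "shiny": "#shiny-team",
    "rmd": "#analytics-team",
    "quarto": "#analytics-team",
    "jupyter": "#analytics-team",
}

def route_by_tag(event, default="#content-ops"):
    """Pick a team channel for a failed job based on its job_tag."""
    if event.get("event") != "job.completed" or event.get("job_status") != "failure":
        return None
    return TEAM_ROUTES.get(event.get("job_tag"), default)

print(route_by_tag({"event": "job.completed", "job_status": "failure",
                    "job_tag": "quarto"}))  # #analytics-team
```

A catch-all default channel prevents failures with unmapped tags from being dropped silently.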
## Why did this content fail?
Use the combination of completion metrics, structured log events, and traces to investigate content failures.
### Step 1: Detect failures
Primary Metrics:

- `job.completion` (Counter) - Job completion count by status
- `job.duration` (Histogram) - Job execution duration by status
- `queue.items.completed` (Counter) - Queue item completion count by status

Dimensions:

- `job.status`: Status (`success` or `failure`)
- `job.exit_code`: Process exit code (when available)
- `queue.name`: Queue identifier (for queue items)
Query Pattern:

Datadog:

```
# Job failure rate over 5 minutes
sum:job.completion{job.status:failure}.as_rate()

# Queue item failure rate
sum:queue.items.completed{job.status:failure} by {queue.name}.as_rate()

# Failures by exit code (identify specific error patterns)
sum:job.completion{job.status:failure} by {job.exit_code}.as_count()
```

Prometheus:

```
# Job failure rate over 5 minutes
rate(job_completion_total{job_status="failure"}[5m])

# Queue item failure rate
sum by (queue_name) (rate(queue_items_completed_total{job_status="failure"}[5m]))

# Failures by exit code (identify specific error patterns)
sum by (job_exit_code) (job_completion_total{job_status="failure"})
```

Generic:

```
# Job failure rate over 5 minutes
rate(job.completion WHERE job.status == "failure") over 5m

# Queue item failure rate
rate(queue.items.completed WHERE job.status == "failure") over 5m

# Failures by exit code (identify specific error patterns)
job.completion WHERE job.status == "failure" GROUP BY job.exit_code
```
Interpretation:
- Rising failure rate indicates systematic issues
- Sudden spikes may indicate infrastructure problems or content issues
- Exit code patterns can reveal specific failure modes (e.g., exit code 137 = out of memory (OOM) killed)
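The failure-ratio queries in this step reduce to a simple percentage over status-keyed counts; a minimal sketch:

```python
def failure_ratio(completions_by_status):
    """Failure percentage over status-keyed job.completion counts."""
    total = sum(completions_by_status.values())
    if total == 0:
        return 0.0  # no completions observed in the window
    return 100.0 * completions_by_status.get("failure", 0) / total

print(failure_ratio({"success": 190, "failure": 10}))  # 5.0
```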
### Step 2: Find failure in logs
Two log events capture failures at different levels:
Job-level event: `job.completed`, emitted by job runners with content and owner context.
| Field | Description |
|---|---|
| `event` | Event name (`job.completed`) |
| `content_guid` | Content identifier |
| `content_owner_guid` | Owner's user GUID |
| `job_key` | Unique job tracking key |
| `job_tag` | Job type tag |
| `job_status` | `success` or `failure` |
| `job_duration_ms` | Execution duration in milliseconds |
| `job_exit_code` | Process exit code (when available) |
| `error_type` | Error classification (failures only) |
| `error_message` | Error details (failures only) |
Queue-level event: `queue.item.completed`, emitted by the queue consumer with queue context.
| Field | Description |
|---|---|
| `event` | Event name (`queue.item.completed`) |
| `queue_name` | Queue where the system processed the item |
| `item_id` | Queue item database ID |
| `item_type` | Type of queued work |
| `duration_seconds` | Processing duration |
For queue-level failures, the system also emits a separate `queue.item.failed` event with `error_type` and `error_message` fields.
Query Pattern:

```
event == "job.completed" AND job_status == "failure"
```
Related log events:
- `job.started` - When job execution began (fields: `event`, `content_guid`, `job_key`, `job_tag`)
- `queue.item.started` - When queue processing began (fields: `event`, `queue_name`, `item_id`, `item_type`)
- `queue.item.failed` - When queue processing failed (fields: `event`, `queue_name`, `item_id`, `item_type`, `error_type`, `error_message`)
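If your platform exports logs as newline-delimited JSON, the failure filter above can be sketched as follows (field names come from the tables in this step; the NDJSON shape is an assumption about your export format):

```python
import json

def failed_jobs(log_lines):
    """Yield (content_guid, error_type) for job.completed failure
    events from newline-delimited JSON log lines."""
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        if event.get("event") == "job.completed" and event.get("job_status") == "failure":
            yield event.get("content_guid"), event.get("error_type")

logs = [
    '{"event": "job.started", "content_guid": "abc"}',
    '{"event": "job.completed", "job_status": "failure",'
    ' "content_guid": "abc", "error_type": "worker_exited"}',
]
print(list(failed_jobs(logs)))  # [('abc', 'worker_exited')]
```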
### Step 3: View execution trace
Search your Application Performance Monitoring (APM) platform using `trace_id` from the failure log, or search by `job.key` to find a specific job's trace.
#### Finding a specific job trace
By `trace_id`: Use the `trace_id` from log events to find the complete execution trace.

By `job.key`: Search spans for the `job.key` attribute to find all spans related to a specific job:

```
job.key == "abc123-def456"
```

By `content.guid`: Find all job executions for specific content:

```
content.guid == "12345678-1234-1234-1234-123456789012"
```
#### Trace context propagation
Trace context flows through async boundaries via queue metadata. When the system queues a job, it serializes the trace context into the queue item metadata. The consumer then extracts this context when processing the item. This maintains a complete trace across:
- Schedule runner (producer) → Queue → Queue consumer
- Queue consumer → Worker lifecycle → Process execution
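The inject-on-enqueue, extract-on-dequeue pattern can be hand-rolled as a sketch using a W3C-style `traceparent` string; Connect's actual serialization format is internal and may differ:

```python
def inject_trace_context(metadata, trace_id, span_id):
    """Producer side: serialize a W3C-style traceparent into
    queue item metadata before enqueueing (illustrative format)."""
    metadata["traceparent"] = f"00-{trace_id}-{span_id}-01"
    return metadata

def extract_trace_context(metadata):
    """Consumer side: recover the trace and span IDs when the
    queue item is processed, keeping the trace continuous."""
    _version, trace_id, span_id, _flags = metadata["traceparent"].split("-")
    return trace_id, span_id

meta = inject_trace_context({}, "0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331")
print(extract_trace_context(meta))
# ('0af7651916cd43dd8448eb211c80319c', 'b7ad6b7169203331')
```

In a real OpenTelemetry setup, the SDK's propagators perform this serialization; the point here is only that the context rides along with the queue item across the async boundary.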
### Step 4: Check queue health
Query Pattern:

Datadog:

```
# Was there queue backlog at failure time?
avg:queue.items.age{queue.name:default}

# Is queue growing?
avg:queue.items.size{queue.name:default}

# Is a job stuck on this host?
avg:queue.items.active_duration{queue.name:default}
```

Prometheus:

```
# Was there queue backlog at failure time?
queue_items_age_seconds{queue_name="default"}

# Is queue growing?
increase(queue_items_size{queue_name="default"}[1h])

# Is a job stuck on this host?
queue_items_active_duration_seconds{queue_name="default"}
```

Generic:

```
# Was there queue backlog at failure time?
queue.items.age WHERE queue.name == "default" > 300

# Is queue growing?
increase(queue.items.size WHERE queue.name == "default") over 1h

# Is a job stuck on this host?
queue.items.active_duration WHERE queue.name == "default" > 300
```
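A back-of-the-envelope drain-time estimate combines the queue size gauge with completion throughput; a sketch, where the inputs are values you would read from `queue.items.size` and a completion-rate query:

```python
def estimated_drain_seconds(queue_size, completions_per_second):
    """Rough drain-time estimate: current backlog divided by
    observed completion throughput. None means not draining."""
    if completions_per_second <= 0:
        return None
    return queue_size / completions_per_second

# 120 queued items draining at 0.5 items/second
print(estimated_drain_seconds(120, 0.5))  # 240.0
```

This assumes steady-state throughput; if new items arrive faster than they complete, the real drain time is unbounded.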
### Step 5: Check launcher health (off-host execution)
If using Kubernetes launcher, check for communication failures:
Primary Metric: `launcher.client.retry.exhausted` (Counter)

Datadog:

```
per_second(sum:connect.launcher_client_retry_exhausted.count{*})
```

Prometheus:

```
rate(launcher_client_retry_exhausted_total[5m])
```

Generic:

```
rate(launcher.client.retry.exhausted) over 5m
```
Any non-zero rate indicates the launcher is unreachable after all retries. Common causes include launcher downtime, network issues, or resource exhaustion on the Kubernetes cluster.
## Exit code reference
| Exit code | Meaning |
|---|---|
| 0 | Success |
| 1 | General error |
| 13 | R environment needs rebuilding |
| 14 | R version changed since last run |
| 15 | Python environment needs rebuilding |
| 16 | Python version changed since last run |
| 130 | Terminated by SIGINT (normal for interactive content) |
| 137 | Killed by SIGKILL (OOM killed, or normal stop for interactive content) |
| 139 | Segmentation fault (SIGSEGV) |
| 143 | Terminated by SIGTERM (normal stop for interactive content) |
| 256 | Generic server error |
| 257 | Job interrupted by server restart |
| 258 | Job cancelled (timeout or content unavailable) |
Exit codes 130, 137, and 143 are expected for interactive content (Shiny apps, Dash apps, etc.) because the system uses signals to stop these long-running processes. These codes only indicate a problem for rendering jobs (scheduled reports, Quarto documents, etc.).
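The rules above can be sketched as a small classifier; the category labels are illustrative, not values Connect emits:

```python
INTERACTIVE_STOP_CODES = {130, 137, 143}  # SIGINT, SIGKILL, SIGTERM

def classify_exit(code, interactive):
    """Classify a job exit code per the reference table above.

    `interactive` is True for Shiny/Dash-style content that the
    system stops with signals.
    """
    if code == 0:
        return "success"
    if interactive and code in INTERACTIVE_STOP_CODES:
        return "normal stop"
    if code >= 256:
        return "server-assigned error"  # 256-258 per the table
    if code == 137:
        return "possible OOM kill"
    if code > 128:
        # Unix convention: 128 + signal number
        return f"killed by signal {code - 128}"
    return "error"

print(classify_exit(137, interactive=False))  # possible OOM kill
print(classify_exit(137, interactive=True))   # normal stop
```

The same exit code can mean different things depending on content type, so any alerting on exit codes should carry the interactive/rendering distinction.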
See Troubleshooting content for additional exit code context from the user perspective.
## Span error types
When a job fails, the `error.type` attribute on spans indicates where in the execution pipeline the failure occurred. These are the actual values emitted by Connect:
| `error.type` | Description | Investigation |
|---|---|---|
| `connection_timeout` | Worker failed to accept connections within the timeout | Check `worker.connection.await` span duration, review content startup time |
| `worker_exited` | Worker process exited before accepting connections | Check `job.exit_code` on the span, review content logs |
| `worker_address_resolution_failed` | Could not resolve worker address | Check network configuration and hostname resolution |
| `connection_failed` | Could not establish connection to worker | Check worker host/port, review network issues |
| `launcher_error` | Off-host execution error from launcher | Check `launcher.client.retry.exhausted`, review Kubernetes cluster health |
The `error_type` field on log events (e.g., `job.completed`, `queue.item.failed`) contains the Go error type name (e.g., `*errors.errorString`), which differs from the span attribute values listed above.
For recommended alert thresholds on job failures, queue health, and OOM detection, see the alerting recommendations.