Content operations guide

This guide provides metrics, query patterns, and troubleshooting workflows for monitoring content execution and job queue health using OpenTelemetry signals.

Overview

This guide addresses operational questions focused on content execution:

  • Which content is using the most resources? - Use queue metrics to identify resource pressure and correlate with external host metrics
  • How do I route content failures to owners? - Use owner context in job failure signals for alerting
  • Why did this content fail? - Investigate job failures using metrics, logs, and traces

These questions help operations teams monitor job queues, detect stuck jobs, investigate content failures, and route alerts to content owners.

Which content is using the most resources?

Use job metrics and queue metrics to identify resource pressure, then correlate with external host monitoring tools for detailed resource analysis.

Job pressure

Job completion and duration metrics reveal which content is consuming the most execution time and how overall job throughput is trending.

Primary Metrics:

  • job.completion (Counter) - Job completion count by status
  • job.duration (Histogram, seconds) - Job execution duration by status

Dimensions:

  • job.status: Job outcome (success or failure)
  • job.exit_code: Process exit code (when available)

Query Pattern:

Datadog:

# Overall job throughput
sum:job.completion{*}.as_rate()

# Long-running jobs by duration percentile
sum:job.duration.bucket{*} by {upper_bound}.as_count()

# Job failure ratio
sum:job.completion{job.status:failure}.as_rate() / sum:job.completion{*}.as_rate() * 100

Prometheus:

# Overall job throughput
rate(job_completion_total[5m])

# Long-running jobs by duration percentile
histogram_quantile(0.95, rate(job_duration_seconds_bucket[5m]))

# Job failure ratio
rate(job_completion_total{job_status="failure"}[5m]) / rate(job_completion_total[5m]) * 100

Generic query syntax:

# Overall job throughput
rate(job.completion) over 5m

# Long-running jobs by duration percentile
histogram_quantile(0.95, job.duration)

# Job failure ratio
rate(job.completion WHERE job.status == "failure") / rate(job.completion) over 5m

Interpretation:

  • High job.duration P95 values indicate resource-intensive content
  • Rising job throughput with stable queue depth means the system is keeping up
  • Rising job throughput with growing queue depth means capacity is insufficient
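The capacity check in the last two bullets can be sketched as a small helper (a hypothetical function, not a Connect API): given the current queue depth plus sampled completion and arrival rates, it estimates whether the queue will drain and how long that takes.

```python
from typing import Optional

def estimate_drain_seconds(
    queue_depth: int,
    completion_rate: float,  # items completed per second
    arrival_rate: float,     # items enqueued per second
) -> Optional[float]:
    """Estimate seconds until the queue drains.

    Returns None when arrivals meet or exceed completions, i.e. the
    queue is not shrinking and capacity is insufficient.
    """
    net_drain = completion_rate - arrival_rate
    if net_drain <= 0:
        return None
    return queue_depth / net_drain

# A queue of 120 items draining at a net 0.5 items/s takes 240 seconds.
print(estimate_drain_seconds(120, 2.0, 1.5))  # → 240.0
```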

Queue health

If job duration is high or throughput is dropping, check whether the queue is backing up. See Is the job queue backing up? for queue size, age monitoring, and drain time estimation.

Running processes and applications

These gauges show what’s currently consuming resources on the server.

Primary Metrics:

  • process.count (UpDownCounter) - Current running process count
  • application.count (Gauge) - Current running application count

Dimensions:

  • process.tag: Process type (e.g., run_shiny_app, run_dash_app)
  • application.type: Application type

Query Pattern:

Datadog:

# Running processes by type
sum:otel.process.count{*} by {process.tag}

# Running applications by type
sum:application.count{*} by {application.type}

Prometheus:

# Running processes by type
sum by (process_tag) (process_count)

# Running applications by type
sum by (application_type) (application_count)

Generic query syntax:

# Running processes by type
process.count GROUP BY process.tag

# Running applications by type
application.count GROUP BY application.type

Interpretation:

  • High process.count for a specific type indicates that content type dominates resource usage
  • Sudden drops in application.count may indicate crash loops
  • Unexpected increases in process.count may indicate processes not terminating properly
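One way to surface the sudden drops mentioned above (a hypothetical helper, not part of Connect): compare consecutive gauge samples and flag any decrease larger than a threshold fraction of the previous value.

```python
def find_sudden_drops(samples, threshold=0.5):
    """Return indices where a gauge fell by more than `threshold`
    (as a fraction of the previous sample) between consecutive points."""
    drops = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        if prev > 0 and (prev - cur) / prev > threshold:
            drops.append(i)
    return drops

# application.count falling from 8 to 2 is a >50% drop at index 3,
# which may indicate a crash loop worth investigating.
print(find_sudden_drops([8, 8, 8, 2, 3]))  # → [3]
```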

Host metrics

Connect does not emit host-level metrics (CPU, memory, disk) via OpenTelemetry. Use your platform’s existing infrastructure monitoring to correlate with Connect’s job and queue metrics. See the health and performance guide for detailed guidance on host metric monitoring, including NFS/EFS bottleneck diagnosis.

Correlation approach:

  1. Identify time ranges with high job.duration or growing queue.items.size
  2. Cross-reference with host CPU, memory, and disk I/O metrics from your infrastructure monitoring (e.g., node_exporter, CloudWatch, Azure Monitor)
  3. Use job.hostname from queue metrics to pinpoint which host is under pressure in multi-node deployments
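The cross-referencing in step 2 amounts to intersecting "hot" timestamps from two metric series. A minimal sketch (the data shapes are illustrative; in practice the series come from your monitoring backends):

```python
def correlate_pressure(job_duration_p95, host_cpu, duration_threshold, cpu_threshold):
    """Given {timestamp: value} dicts for Connect's job.duration P95 (seconds)
    and host CPU utilization (0-1), return timestamps where both exceed
    their thresholds, i.e. candidate windows of resource pressure."""
    return [
        ts for ts, dur in sorted(job_duration_p95.items())
        if dur > duration_threshold and host_cpu.get(ts, 0) > cpu_threshold
    ]

durations = {100: 12.0, 160: 45.0, 220: 50.0}
cpu = {100: 0.30, 160: 0.95, 220: 0.40}
print(correlate_pressure(durations, cpu, 30.0, 0.8))  # → [160]
```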

Kubernetes execution

When using off-host execution (Kubernetes launcher), per-process metrics are not available from Posit Connect. Instead:

  1. Use native Kubernetes metrics (metrics-server, kube-state-metrics)
  2. Correlate using launcher.job.id from Connect traces/logs
  3. Configure pod labels to include content GUID for easier correlation (requires custom job template configuration)

How do I route content failures to owners?

Connect emits content_owner_guid on job failure events. Use this to route alerts to content owners via your observability platform.

Owner context in failure events

Log event: job.completed (with job_status == "failure")

Key fields:

  • content_guid - Content identifier
  • content_owner_guid - Owner’s user GUID (use with the Connect API to resolve the owner's email)
  • job_key - Unique job tracking key
  • job_tag - Job type tag
  • job_status - failure
  • error_type - Error classification
  • error_message - Error details

Query Pattern:

event == "job.completed" AND job_status == "failure" | fields content_guid, content_owner_guid, error_type

Alert routing options

Option A: External alert router

  1. Configure alerts on job.completed events where job_status == "failure" in your observability platform
  2. Alert payload includes content_owner_guid
  3. Router service calls Connect API to resolve owner:
    • GET /v1/users/{owner_guid} returns email and other user details
  4. Route to owner’s notification channel (email, Slack, PagerDuty)
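Step 3 can be sketched as follows (a minimal sketch: the server base URL and helper names are hypothetical, and the parsing assumes the GET /v1/users/{owner_guid} response includes an email field):

```python
def owner_lookup_url(server: str, owner_guid: str) -> str:
    """Build the user-details URL for a failure event's owner.

    Endpoint path from the steps above; `server` is your Connect
    server's API base URL (an assumption for this sketch)."""
    return f"{server.rstrip('/')}/v1/users/{owner_guid}"

def notification_target(user: dict) -> str:
    """Pick the owner's email out of the user-details response
    (assumes the response carries an 'email' field)."""
    return user["email"]

url = owner_lookup_url("https://connect.example.com/__api__", "abc-123")
print(url)  # → https://connect.example.com/__api__/v1/users/abc-123
print(notification_target({"email": "owner@example.com", "username": "jo"}))
```

The router would fetch `url` with an authenticated request, then hand the resolved address to the notification channel in step 4.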

Option B: Team-based routing

Route by job tag rather than individual owner:

# Route Shiny failures to Shiny team
event == "job.completed" AND job_status == "failure" AND job_tag == "shiny" -> #shiny-team

# Route report failures to analytics team
event == "job.completed" AND job_status == "failure" AND job_tag IN ("rmd", "quarto", "jupyter") -> #analytics-team
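The two routing rules above can be expressed as a small dispatch function (channel names are illustrative, and the fallback channel is an assumption):

```python
from typing import Optional

def route_failure(event: dict) -> Optional[str]:
    """Map a failed job.completed event to a team channel by job_tag."""
    if event.get("event") != "job.completed" or event.get("job_status") != "failure":
        return None  # not a failure event; nothing to route
    tag = event.get("job_tag")
    if tag == "shiny":
        return "#shiny-team"
    if tag in ("rmd", "quarto", "jupyter"):
        return "#analytics-team"
    return "#platform-ops"  # fallback channel (assumption, not from the rules above)

print(route_failure({"event": "job.completed", "job_status": "failure", "job_tag": "quarto"}))
# → #analytics-team
```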

Why did this content fail?

Use the combination of completion metrics, structured log events, and traces to investigate content failures.

Step 1: Detect failures

Primary Metrics:

  • job.completion (Counter) - Job completion count by status
  • job.duration (Histogram) - Job execution duration by status
  • queue.items.completed (Counter) - Queue item completion count by status

Dimensions:

  • job.status: Job outcome (success or failure)
  • job.exit_code: Process exit code (when available)
  • queue.name: Queue identifier (for queue items)

Query Pattern:

Datadog:

# Job failure rate
sum:job.completion{job.status:failure}.as_rate()

# Queue item failure rate
sum:queue.items.completed{job.status:failure} by {queue.name}.as_rate()

# Failures by exit code (identify specific error patterns)
sum:job.completion{job.status:failure} by {job.exit_code}.as_count()

Prometheus:

# Job failure rate over 5 minutes
rate(job_completion_total{job_status="failure"}[5m])

# Queue item failure rate
sum by (queue_name) (rate(queue_items_completed_total{job_status="failure"}[5m]))

# Failures by exit code (identify specific error patterns)
sum by (job_exit_code) (job_completion_total{job_status="failure"})

Generic query syntax:

# Job failure rate over 5 minutes
rate(job.completion WHERE job.status == "failure") over 5m

# Queue item failure rate
rate(queue.items.completed WHERE job.status == "failure") over 5m

# Failures by exit code (identify specific error patterns)
job.completion WHERE job.status == "failure" GROUP BY job.exit_code

Interpretation:

  • Rising failure rate indicates systematic issues
  • Sudden spikes may indicate infrastructure problems or content issues
  • Exit code patterns can reveal specific failure modes (e.g., exit code 137 = out of memory (OOM) killed)

Step 2: Find failure in logs

Two log events capture failures at different levels:

Job-level event: job.completed — emitted by job runners with content and owner context.

  • event - Event name (job.completed)
  • content_guid - Content identifier
  • content_owner_guid - Owner’s user GUID
  • job_key - Unique job tracking key
  • job_tag - Job type tag
  • job_status - success or failure
  • job_duration_ms - Execution duration in milliseconds
  • job_exit_code - Process exit code (when available)
  • error_type - Error classification (failures only)
  • error_message - Error details (failures only)

Queue-level event: queue.item.completed — emitted by the queue consumer with queue context.

  • event - Event name (queue.item.completed)
  • queue_name - Queue where the system processed the item
  • item_id - Queue item database ID
  • item_type - Type of queued work
  • duration_seconds - Processing duration, in seconds

For queue-level failures, the system also emits a separate queue.item.failed event with error_type and error_message fields.

Query Pattern:

event == "job.completed" AND job_status == "failure"

Related log events:

  • job.started - When job execution began (fields: event, content_guid, job_key, job_tag)
  • queue.item.started - When queue processing began (fields: event, queue_name, item_id, item_type)
  • queue.item.failed - When queue processing failed (fields: event, queue_name, item_id, item_type, error_type, error_message)

Step 3: View execution trace

Search your Application Performance Monitoring (APM) platform using trace_id from the failure log, or search by job.key to find a specific job’s trace.

Finding a specific job trace

By trace_id: Use the trace_id from log events to find the complete execution trace.

By job.key: Search spans for the job.key attribute to find all spans related to a specific job:

job.key == "abc123-def456"

By content.guid: Find all job executions for specific content:

content.guid == "12345678-1234-1234-1234-123456789012"

Trace context propagation

Trace context flows through async boundaries via queue metadata. When the system queues a job, it serializes the trace context into the queue item metadata. The consumer then extracts this context when processing the item. This maintains a complete trace across:

  1. Schedule runner (producer) → Queue → Queue consumer
  2. Queue consumer → Worker lifecycle → Process execution
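Conceptually, the producer injects a W3C traceparent header into the queue item metadata and the consumer extracts it. A minimal sketch using the W3C format directly (Connect's real implementation uses OpenTelemetry propagators; the metadata field name here is illustrative):

```python
from typing import Optional, Tuple

def inject_trace_context(metadata: dict, trace_id: str, span_id: str,
                         sampled: bool = True) -> dict:
    """Producer side: serialize W3C trace context into item metadata."""
    flags = "01" if sampled else "00"
    metadata["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return metadata

def extract_trace_context(metadata: dict) -> Optional[Tuple[str, str]]:
    """Consumer side: recover (trace_id, span_id), or None if absent."""
    header = metadata.get("traceparent")
    if not header:
        return None
    _version, trace_id, span_id, _flags = header.split("-")
    return trace_id, span_id

meta = inject_trace_context({}, "0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331")
print(extract_trace_context(meta))
# → ('0af7651916cd43dd8448eb211c80319c', 'b7ad6b7169203331')
```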

Step 4: Check queue health

Query Pattern:

Datadog:

# Was there queue backlog at failure time?
avg:queue.items.age{queue.name:default}

# Is queue growing?
avg:queue.items.size{queue.name:default}

# Is a job stuck on this host?
avg:queue.items.active_duration{queue.name:default}

Prometheus:

# Was there queue backlog at failure time?
queue_items_age_seconds{queue_name="default"}

# Is queue growing?
increase(queue_items_size{queue_name="default"}[1h])

# Is a job stuck on this host?
queue_items_active_duration_seconds{queue_name="default"}

Generic query syntax:

# Was there queue backlog at failure time?
queue.items.age WHERE queue.name == "default" > 300

# Is queue growing?
increase(queue.items.size WHERE queue.name == "default") over 1h

# Is a job stuck on this host?
queue.items.active_duration WHERE queue.name == "default" > 300
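The stuck-job check above is just a threshold over active durations. A hypothetical helper (not a Connect API) applying the same 300-second threshold per queue item:

```python
def stuck_items(active_durations: dict, threshold_seconds: float = 300.0) -> list:
    """Given {item_id: active_duration_seconds}, return IDs of items that
    have been actively processing longer than the threshold."""
    return [
        item for item, dur in sorted(active_durations.items())
        if dur > threshold_seconds
    ]

# item-2 has been active for ~15 minutes and is likely stuck.
print(stuck_items({"item-1": 12.0, "item-2": 904.5, "item-3": 299.0}))
# → ['item-2']
```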

Step 5: Check launcher health (off-host execution)

If using Kubernetes launcher, check for communication failures:

Primary Metric: launcher.client.retry.exhausted (Counter)

Datadog:

per_second(sum:connect.launcher_client_retry_exhausted.count{*})

Prometheus:

rate(launcher_client_retry_exhausted_total[5m])

Generic query syntax:

rate(launcher.client.retry.exhausted) over 5m

Any non-zero rate indicates the launcher is unreachable after all retries. Common causes include launcher downtime, network issues, or resource exhaustion on the Kubernetes cluster.

Exit code reference

  • 0 - Success
  • 1 - General error
  • 13 - R environment needs rebuilding
  • 14 - R version changed since last run
  • 15 - Python environment needs rebuilding
  • 16 - Python version changed since last run
  • 130 - Terminated by SIGINT (normal for interactive content)
  • 137 - Killed by SIGKILL (OOM killed, or a normal stop for interactive content)
  • 139 - Segmentation fault (SIGSEGV)
  • 143 - Terminated by SIGTERM (normal stop for interactive content)
  • 256 - Generic server error
  • 257 - Job interrupted by server restart
  • 258 - Job cancelled (timeout or content unavailable)
Note

Exit codes 130, 137, and 143 are expected for interactive content (Shiny apps, Dash apps, etc.) because the system uses signals to stop these long-running processes. These codes only indicate a problem for rendering jobs (scheduled reports, Quarto documents, etc.).

See Troubleshooting content for additional exit code context from the user perspective.
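The 13x codes follow the POSIX convention of 128 + signal number, so they can be decoded mechanically. A sketch (Connect-specific codes such as 13-16 and 256-258 fall outside the signal range and need the table above):

```python
import signal

def describe_exit_code(code: int) -> str:
    """Decode an exit code: 128+N conventionally means the process
    was terminated by signal N (POSIX shells report it this way)."""
    if 128 < code <= 128 + 64:
        try:
            return f"terminated by {signal.Signals(code - 128).name}"
        except ValueError:
            return f"terminated by signal {code - 128}"
    return f"exited with status {code}"

print(describe_exit_code(137))  # SIGKILL: OOM kill, or normal stop for interactive content
print(describe_exit_code(143))  # SIGTERM: normal stop for interactive content
print(describe_exit_code(258))  # Connect-specific code; consult the table above
```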

Span error types

When a job fails, the error.type attribute on spans indicates where in the execution pipeline the failure occurred. These are the actual values emitted by Connect:

  • connection_timeout - Worker failed to accept connections within the timeout. Investigation: check the worker.connection.await span duration and review content startup time.
  • worker_exited - Worker process exited before accepting connections. Investigation: check job.exit_code on the span and review content logs.
  • worker_address_resolution_failed - Could not resolve the worker address. Investigation: check network configuration and hostname resolution.
  • connection_failed - Could not establish a connection to the worker. Investigation: check the worker host/port and review network issues.
  • launcher_error - Off-host execution error from the launcher. Investigation: check launcher.client.retry.exhausted and review Kubernetes cluster health.
Note

The error_type field on log events (e.g., job.completed, queue.item.failed) contains the Go error type name (e.g., *errors.errorString), which differs from the span attribute values listed above.

For recommended alert thresholds on job failures, queue health, and OOM detection, see the alerting recommendations.