Content operations guide

This guide provides metrics, query patterns, and troubleshooting workflows for monitoring content execution and job queue health using OpenTelemetry signals.

Overview

This guide addresses operational questions focused on content execution:

  • Which content is using the most resources? - Use queue metrics to identify resource pressure and correlate with external host metrics
  • How do I route content failures to owners? - Use owner context in job failure signals for alerting
  • Why did this content fail? - Investigate job failures using metrics, logs, and traces

These questions help operations teams monitor job queues, detect stuck jobs, investigate content failures, and route alerts to content owners.

Which content is using the most resources?

Use job metrics and queue metrics to identify resource pressure, then correlate with external host monitoring tools for detailed resource analysis.

Job pressure

Job completion and duration metrics reveal which content is consuming the most execution time and how overall job throughput is trending.

Primary Metrics:

  • job.completion (Counter) - Job completion count by status
  • job.duration (Histogram, seconds) - Job execution duration by status

Dimensions:

  • job.status: Job outcome (success or failure)
  • job.exit_code: Process exit code (when available)

Query Pattern:

Datadog:

# Overall job throughput
sum:job.completion{*}.as_rate()

# Long-running jobs by duration percentile
sum:job.duration.bucket{*} by {upper_bound}.as_count()

# Job failure ratio
sum:job.completion{job.status:failure}.as_rate() / sum:job.completion{*}.as_rate() * 100

Prometheus:

# Overall job throughput
rate(job_completion_total[5m])

# Long-running jobs by duration percentile
histogram_quantile(0.95, rate(job_duration_seconds_bucket[5m]))

# Job failure ratio
rate(job_completion_total{job_status="failure"}[5m]) / rate(job_completion_total[5m]) * 100

Generic query syntax:

# Overall job throughput
rate(job.completion) over 5m

# Long-running jobs by duration percentile
histogram_quantile(0.95, job.duration)

# Job failure ratio
rate(job.completion WHERE job.status == "failure") / rate(job.completion) over 5m

Interpretation:

  • High job.duration P95 values indicate resource-intensive content
  • Rising job throughput with stable queue depth means the system is keeping up
  • Rising job throughput with growing queue depth means capacity is insufficient
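The capacity check in the last two bullets can be sketched as a small helper (a hypothetical function, not a Connect API): given the current queue depth plus sampled completion and arrival rates, it estimates whether the queue will drain and how long that takes.

```python
from typing import Optional

def estimate_drain_seconds(
    queue_depth: int,
    completion_rate: float,  # items completed per second
    arrival_rate: float,     # items enqueued per second
) -> Optional[float]:
    """Estimate seconds until the queue drains.

    Returns None when arrivals meet or exceed completions, i.e. the
    queue is not shrinking and capacity is insufficient.
    """
    net_drain = completion_rate - arrival_rate
    if net_drain <= 0:
        return None
    return queue_depth / net_drain

# A queue of 120 items draining at a net 0.5 items/s takes 240 seconds.
print(estimate_drain_seconds(120, 2.0, 1.5))  # → 240.0
```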

Queue health

If job duration is high or throughput is dropping, check whether the queue is backing up. See Is the job queue backing up? for queue size, age monitoring, and drain time estimation.

Running processes and applications

These gauges show what’s currently consuming resources on the server.

Primary Metrics:

  • process.count (UpDownCounter) - Current running process count
  • application.count (Gauge) - Current running application count

Dimensions:

  • process.tag: Process type (e.g., run_shiny_app, run_dash_app)
  • application.type: Application type

Query Pattern:

Datadog:

# Running processes by type
sum:otel.process.count{*} by {process.tag}

# Running applications by type
sum:application.count{*} by {application.type}

Prometheus:

# Running processes by type
sum by (process_tag) (process_count)

# Running applications by type
sum by (application_type) (application_count)

Generic query syntax:

# Running processes by type
process.count GROUP BY process.tag

# Running applications by type
application.count GROUP BY application.type

Interpretation:

  • High process.count for a specific type indicates that content type dominates resource usage
  • Sudden drops in application.count may indicate crash loops
  • Unexpected increases in process.count may indicate processes not terminating properly
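One way to surface the sudden drops mentioned above (a hypothetical helper, not part of Connect): compare consecutive gauge samples and flag any decrease larger than a threshold fraction of the previous value.

```python
def find_sudden_drops(samples, threshold=0.5):
    """Return indices where a gauge fell by more than `threshold`
    (as a fraction of the previous sample) between consecutive points."""
    drops = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        if prev > 0 and (prev - cur) / prev > threshold:
            drops.append(i)
    return drops

# application.count falling from 8 to 2 is a >50% drop at index 3,
# which may indicate a crash loop worth investigating.
print(find_sudden_drops([8, 8, 8, 2, 3]))  # → [3]
```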

Host metrics

Connect does not emit host-level metrics (CPU, memory, disk) via OpenTelemetry. Use your platform’s existing infrastructure monitoring to correlate with Connect’s job and queue metrics. See the health and performance guide for detailed guidance on host metric monitoring, including NFS/EFS bottleneck diagnosis.

Correlation approach:

  1. Identify time ranges with high job.duration or growing queue.items.size
  2. Cross-reference with host CPU, memory, and disk I/O metrics from your infrastructure monitoring (e.g., node_exporter, CloudWatch, Azure Monitor)
  3. Use job.hostname from queue metrics to pinpoint which host is under pressure in multi-node deployments
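The cross-referencing in step 2 amounts to intersecting "hot" timestamps from two metric series. A minimal sketch (the data shapes are illustrative; in practice the series come from your monitoring backends):

```python
def correlate_pressure(job_duration_p95, host_cpu, duration_threshold, cpu_threshold):
    """Given {timestamp: value} dicts for Connect's job.duration P95 (seconds)
    and host CPU utilization (0-1), return timestamps where both exceed
    their thresholds, i.e. candidate windows of resource pressure."""
    return [
        ts for ts, dur in sorted(job_duration_p95.items())
        if dur > duration_threshold and host_cpu.get(ts, 0) > cpu_threshold
    ]

durations = {100: 12.0, 160: 45.0, 220: 50.0}
cpu = {100: 0.30, 160: 0.95, 220: 0.40}
print(correlate_pressure(durations, cpu, 30.0, 0.8))  # → [160]
```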

Kubernetes execution

When using off-host execution (Kubernetes launcher), per-process metrics are not available from Posit Connect. Instead:

  1. Use native Kubernetes metrics (metrics-server, kube-state-metrics)
  2. Correlate using launcher.job.id from Connect traces/logs
  3. Configure pod labels to include content GUID for easier correlation (requires custom job template configuration)

How do I route content failures to owners?

Connect emits content_owner_guid on job failure events. Use this to route alerts to content owners via your observability platform.

Owner context in failure events

Log event: job.completed (with job_status == "failure")

Key fields:

  • content_guid - Content identifier
  • content_owner_guid - Owner’s user GUID (use with the Connect API to resolve the owner's email)
  • job_key - Unique job tracking key
  • job_tag - Job type tag
  • job_status - failure
  • error_type - Error classification
  • error_message - Error details

Query Pattern:

event == "job.completed" AND job_status == "failure" | fields content_guid, content_owner_guid, error_type

Alert routing options

Option A: External alert router

  1. Configure alerts on job.completed events where job_status == "failure" in your observability platform
  2. Alert payload includes content_owner_guid
  3. Router service calls Connect API to resolve owner:
    • GET /v1/users/{owner_guid} returns email and other user details
  4. Route to owner’s notification channel (email, Slack, PagerDuty)
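Step 3 can be sketched as follows (a minimal sketch: the server base URL and helper names are hypothetical, and the parsing assumes the GET /v1/users/{owner_guid} response includes an email field):

```python
def owner_lookup_url(server: str, owner_guid: str) -> str:
    """Build the user-details URL for a failure event's owner.

    Endpoint path from the steps above; `server` is your Connect
    server's API base URL (an assumption for this sketch)."""
    return f"{server.rstrip('/')}/v1/users/{owner_guid}"

def notification_target(user: dict) -> str:
    """Pick the owner's email out of the user-details response
    (assumes the response carries an 'email' field)."""
    return user["email"]

url = owner_lookup_url("https://connect.example.com/__api__", "abc-123")
print(url)  # → https://connect.example.com/__api__/v1/users/abc-123
print(notification_target({"email": "owner@example.com", "username": "jo"}))
```

The router would fetch `url` with an authenticated request, then hand the resolved address to the notification channel in step 4.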

Option B: Team-based routing

Route by job tag rather than individual owner:

# Route Shiny failures to Shiny team
event == "job.completed" AND job_status == "failure" AND job_tag == "shiny" -> #shiny-team

# Route report failures to analytics team
event == "job.completed" AND job_status == "failure" AND job_tag IN ("rmd", "quarto", "jupyter") -> #analytics-team
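The two routing rules above can be expressed as a small dispatch function (channel names are illustrative, and the fallback channel is an assumption):

```python
from typing import Optional

def route_failure(event: dict) -> Optional[str]:
    """Map a failed job.completed event to a team channel by job_tag."""
    if event.get("event") != "job.completed" or event.get("job_status") != "failure":
        return None  # not a failure event; nothing to route
    tag = event.get("job_tag")
    if tag == "shiny":
        return "#shiny-team"
    if tag in ("rmd", "quarto", "jupyter"):
        return "#analytics-team"
    return "#platform-ops"  # fallback channel (assumption, not from the rules above)

print(route_failure({"event": "job.completed", "job_status": "failure", "job_tag": "quarto"}))
# → #analytics-team
```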

Why did this content fail?

Use the combination of completion metrics, structured log events, and traces to investigate content failures.

Step 1: Detect failures

Primary Metrics:

  • job.completion (Counter) - Job completion count by status
  • job.duration (Histogram) - Job execution duration by status
  • queue.items.completed (Counter) - Queue item completion count by status

Dimensions:

  • job.status: Job outcome (success or failure)
  • job.exit_code: Process exit code (when available)
  • queue.name: Queue identifier (for queue items)

Query Pattern:

Datadog:

# Job failure rate
sum:job.completion{job.status:failure}.as_rate()

# Queue item failure rate
sum:queue.items.completed{job.status:failure} by {queue.name}.as_rate()

# Failures by exit code (identify specific error patterns)
sum:job.completion{job.status:failure} by {job.exit_code}.as_count()

Prometheus:

# Job failure rate over 5 minutes
rate(job_completion_total{job_status="failure"}[5m])

# Queue item failure rate
sum by (queue_name) (rate(queue_items_completed_total{job_status="failure"}[5m]))

# Failures by exit code (identify specific error patterns)
sum by (job_exit_code) (job_completion_total{job_status="failure"})

Generic query syntax:

# Job failure rate over 5 minutes
rate(job.completion WHERE job.status == "failure") over 5m

# Queue item failure rate
rate(queue.items.completed WHERE job.status == "failure") over 5m

# Failures by exit code (identify specific error patterns)
job.completion WHERE job.status == "failure" GROUP BY job.exit_code

Interpretation:

  • Rising failure rate indicates systematic issues
  • Sudden spikes may indicate infrastructure problems or content issues
  • Exit code patterns can reveal specific failure modes (e.g., exit code 137 = out of memory (OOM) killed)

Step 2: Find failure in logs

Two log events capture failures at different levels:

Job-level event: job.completed — emitted by job runners with content and owner context.

  • event - Event name (job.completed)
  • content_guid - Content identifier
  • content_owner_guid - Owner’s user GUID
  • job_key - Unique job tracking key
  • job_tag - Job type tag
  • job_status - success or failure
  • job_duration_ms - Execution duration in milliseconds
  • job_exit_code - Process exit code (when available)
  • error_type - Error classification (failures only)
  • error_message - Error details (failures only)

Queue-level event: queue.item.completed — emitted by the queue consumer with queue context.

  • event - Event name (queue.item.completed)
  • queue_name - Queue where the system processed the item
  • item_id - Queue item database ID
  • item_type - Type of queued work
  • duration_seconds - Processing duration, in seconds

For queue-level failures, the system also emits a separate queue.item.failed event with error_type and error_message fields.

Query Pattern:

event == "job.completed" AND job_status == "failure"

Related log events:

  • job.started - When job execution began (fields: event, content_guid, job_key, job_tag)
  • queue.item.started - When queue processing began (fields: event, queue_name, item_id, item_type)
  • queue.item.failed - When queue processing failed (fields: event, queue_name, item_id, item_type, error_type, error_message)

Step 3: View execution trace

Search your Application Performance Monitoring (APM) platform using trace_id from the failure log, or search by job.key to find a specific job’s trace.

Finding a specific job trace

By trace_id: Use the trace_id from log events to find the complete execution trace.

By job.key: Search spans for the job.key attribute to find all spans related to a specific job:

job.key == "abc123-def456"

By content.guid: Find all job executions for specific content:

content.guid == "12345678-1234-1234-1234-123456789012"

Trace context propagation

Trace context flows through async boundaries via queue metadata. When the system queues a job, it serializes the trace context into the queue item metadata. The consumer then extracts this context when processing the item. This maintains a complete trace across:

  1. Schedule runner (producer) → Queue → Queue consumer
  2. Queue consumer → Worker lifecycle → Process execution
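Conceptually, the producer injects a W3C traceparent header into the queue item metadata and the consumer extracts it. A minimal sketch using the W3C format directly (Connect's real implementation uses OpenTelemetry propagators; the metadata field name here is illustrative):

```python
from typing import Optional, Tuple

def inject_trace_context(metadata: dict, trace_id: str, span_id: str,
                         sampled: bool = True) -> dict:
    """Producer side: serialize W3C trace context into item metadata."""
    flags = "01" if sampled else "00"
    metadata["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return metadata

def extract_trace_context(metadata: dict) -> Optional[Tuple[str, str]]:
    """Consumer side: recover (trace_id, span_id), or None if absent."""
    header = metadata.get("traceparent")
    if not header:
        return None
    _version, trace_id, span_id, _flags = header.split("-")
    return trace_id, span_id

meta = inject_trace_context({}, "0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331")
print(extract_trace_context(meta))
# → ('0af7651916cd43dd8448eb211c80319c', 'b7ad6b7169203331')
```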

Step 4: Check queue health

Query Pattern:

Datadog:

# Was there queue backlog at failure time?
avg:queue.items.age{queue.name:default}

# Is queue growing?
avg:queue.items.size{queue.name:default}

# Is a job stuck on this host?
avg:queue.items.active_duration{queue.name:default}

Prometheus:

# Was there queue backlog at failure time?
queue_items_age_seconds{queue_name="default"}

# Is queue growing?
increase(queue_items_size{queue_name="default"}[1h])

# Is a job stuck on this host?
queue_items_active_duration_seconds{queue_name="default"}

Generic query syntax:

# Was there queue backlog at failure time?
queue.items.age WHERE queue.name == "default" > 300

# Is queue growing?
increase(queue.items.size WHERE queue.name == "default") over 1h

# Is a job stuck on this host?
queue.items.active_duration WHERE queue.name == "default" > 300
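The stuck-job check above is just a threshold over active durations. A hypothetical helper (not a Connect API) applying the same 300-second threshold per queue item:

```python
def stuck_items(active_durations: dict, threshold_seconds: float = 300.0) -> list:
    """Given {item_id: active_duration_seconds}, return IDs of items that
    have been actively processing longer than the threshold."""
    return [
        item for item, dur in sorted(active_durations.items())
        if dur > threshold_seconds
    ]

# item-2 has been active for ~15 minutes and is likely stuck.
print(stuck_items({"item-1": 12.0, "item-2": 904.5, "item-3": 299.0}))
# → ['item-2']
```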

Step 5: Check launcher health (off-host execution)

If using Kubernetes launcher, check for communication failures:

Primary Metric: launcher.client.retry.exhausted (Counter)

Datadog:

per_second(sum:connect.launcher_client_retry_exhausted.count{*})

Prometheus:

rate(launcher_client_retry_exhausted_total[5m])

Generic query syntax:

rate(launcher.client.retry.exhausted) over 5m

Any non-zero rate indicates the launcher is unreachable after all retries. Common causes include launcher downtime, network issues, or resource exhaustion on the Kubernetes cluster.

Exit code reference

  • 0 - Success
  • 1 - General error
  • 13 - R environment needs rebuilding
  • 14 - R version changed since last run
  • 15 - Python environment needs rebuilding
  • 16 - Python version changed since last run
  • 130 - Terminated by SIGINT (normal for interactive content)
  • 137 - Killed by SIGKILL (OOM killed, or a normal stop for interactive content)
  • 139 - Segmentation fault (SIGSEGV)
  • 143 - Terminated by SIGTERM (normal stop for interactive content)
  • 256 - Generic server error
  • 257 - Job interrupted by server restart
  • 258 - Job cancelled (timeout or content unavailable)
Note

Exit codes 130, 137, and 143 are expected for interactive content (Shiny apps, Dash apps, etc.) because the system uses signals to stop these long-running processes. These codes only indicate a problem for rendering jobs (scheduled reports, Quarto documents, etc.).

See Troubleshooting content for additional exit code context from the user perspective.
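The 13x codes follow the POSIX convention of 128 + signal number, so they can be decoded mechanically. A sketch (Connect-specific codes such as 13-16 and 256-258 fall outside the signal range and need the table above):

```python
import signal

def describe_exit_code(code: int) -> str:
    """Decode an exit code: 128+N conventionally means the process
    was terminated by signal N (POSIX shells report it this way)."""
    if 128 < code <= 128 + 64:
        try:
            return f"terminated by {signal.Signals(code - 128).name}"
        except ValueError:
            return f"terminated by signal {code - 128}"
    return f"exited with status {code}"

print(describe_exit_code(137))  # SIGKILL: OOM kill, or normal stop for interactive content
print(describe_exit_code(143))  # SIGTERM: normal stop for interactive content
print(describe_exit_code(258))  # Connect-specific code; consult the table above
```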

Span error types

When a job fails, the error.type attribute on spans indicates where in the execution pipeline the failure occurred. These are the actual values emitted by Connect:

  • connection_timeout - Worker failed to accept connections within the timeout. Investigation: check the worker.connection.await span duration and review content startup time.
  • worker_exited - Worker process exited before accepting connections. Investigation: check job.exit_code on the span and review content logs.
  • worker_address_resolution_failed - Could not resolve the worker address. Investigation: check network configuration and hostname resolution.
  • connection_failed - Could not establish a connection to the worker. Investigation: check the worker host/port and review network issues.
  • launcher_error - Off-host execution error from the launcher. Investigation: check launcher.client.retry.exhausted and review Kubernetes cluster health.
Note

The error_type field on log events (e.g., job.completed, queue.item.failed) contains the Go error type name (e.g., *errors.errorString), which differs from the span attribute values listed above.

For recommended alert thresholds on job failures, queue health, and OOM detection, see the alerting recommendations.