Task & Queue Observatory

ERPNext fails quietly in the background - scheduler tasks, RQ queues, workers, and integrations. The Task & Queue Observatory makes that invisible work measurable and controllable: you can see failures, diagnose them, and take safe actions with evidence.

Built for operators. Designed for failure modes: retries, poisoned jobs, worker death, and silent drift.

  • Core promise - Visibility + control: know what ran, what failed, and why - then act safely.
  • Primary surface - Queues + scheduler: RQ + scheduled tasks, with evidence trails and guardrails.
  • Operator outcome - Predictability: less guessing, faster diagnosis, fewer repeated incidents.
Problem

ERPNext background work is critical - and usually invisible

When background jobs fail quietly, business operations degrade without a clear signal. The UI stays usable while the system accumulates damage.

Silent failure

Emails, notifications, integrations, and scheduled reconciliations fail in queues. Users only notice when it's already expensive.

  • Failures are buried in logs
  • No stable grouping by cause
  • No evidence of what ran vs what didn’t

No operational control

Operators resort to SSH, redis-cli, and restarts - risky actions that can create duplicates or drift under retries.

  • Blind retries cause duplication
  • Manual interventions are not auditable
  • Hard to isolate poisoned job classes

Queues become a single point of failure

Backlogs, worker starvation, and stuck jobs degrade the whole system. Without queue metrics, teams guess and overreact.

  • Oldest-job age rises silently
  • Throughput collapses without alarms
  • One poisoned job blocks progress

How the Observatory solves it

Instrument, cluster, and control - with guardrails

The Observatory reads queues and job execution state, creates stable failure clusters, and provides safe actions tied to evidence and audit.

Operator-grade queue visibility

See queue depth, throughput, oldest job age, and worker health per queue - with baselines and time-window analysis (a read sketch follows the list below).

Included
  • Queue-level metrics: depth, oldest-job age, throughput, failure rate
  • Worker-level metrics: last-seen, busy/idle mix, crash indicators
  • Baselines and anomaly detection (not static thresholds only)
  • Time windows: 5m / 1h / 24h with p95/p99 rollups
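
A minimal sketch of how these queue-level reads can be pulled from RQ with redis-py; the queue names and Redis URL are assumptions to adapt to your bench configuration, and the snippet illustrates the reads rather than the product's internals.

```python
# Minimal sketch: per-queue depth and oldest-job age via RQ + redis-py.
# Queue names and the Redis URL are assumptions; adjust to your deployment.
from datetime import datetime, timezone

from redis import Redis
from rq import Queue

conn = Redis.from_url("redis://localhost:11000")  # assumed queue Redis URL

def queue_snapshot(queue_name: str) -> dict:
    q = Queue(queue_name, connection=conn)
    jobs = q.jobs  # fetches all queued jobs; fine for a sketch, sample for huge queues
    now = datetime.now(timezone.utc)
    ages = []
    for job in jobs:
        ts = job.enqueued_at
        if ts is not None:
            if ts.tzinfo is None:          # older RQ stores naive UTC timestamps
                ts = ts.replace(tzinfo=timezone.utc)
            ages.append((now - ts).total_seconds())
    return {
        "queue": queue_name,
        "depth": q.count,
        "oldest_job_age_s": max(ages) if ages else 0.0,
    }

for name in ("short", "default", "long"):   # typical Frappe/ERPNext queue names
    print(queue_snapshot(name))
```
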
Failure clustering by method and signature

Not all failures are equal. We group failures by method, error signature, topic, and recurrence so the hotspot is obvious (a hashing sketch follows the list below).

Included
  • Group by: method, doctype event, webhook topic, integration channel
  • Signature hashing: stable clusters across deploys and retries
  • Recurring vs new failures (regression detection)
  • Top offenders: failure count, retry churn, capacity impact
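
One way signature hashing can work, sketched under assumptions: volatile parts of the error text (ids, paths, hex) are masked before hashing together with the method name, so retries and redeploys land in the same cluster. The normalization rules shown are illustrative, not the product's exact algorithm.

```python
# Illustrative failure-signature hashing: volatile details (ids, hex, paths,
# numbers) are masked so recurring failures collapse into one stable cluster.
import hashlib
import re

def failure_signature(method: str, exc_type: str, message: str) -> str:
    normalized = message.lower()
    normalized = re.sub(r"0x[0-9a-f]+", "<hex>", normalized)       # memory addresses
    normalized = re.sub(r"/[^\s'\"]+", "<path>", normalized)       # file paths
    normalized = re.sub(r"\d+", "<n>", normalized)                 # ids, counts, ports
    raw = f"{method}|{exc_type}|{normalized}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:12]

# Both of these map to the same cluster key:
a = failure_signature("frappe.email.queue.flush", "SMTPAuthenticationError",
                      "535 Authentication failed for user 1042")
b = failure_signature("frappe.email.queue.flush", "SMTPAuthenticationError",
                      "535 Authentication failed for user 7781")
assert a == b
```
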
Safe actions with guardrails

Retrying blindly is dangerous. Actions must be controlled, auditable, and safe under failure (a guarded-retry sketch follows the list).

Included
  • Retry/requeue/cancel with role-based access
  • Bulk actions only on verified-safe job classes
  • Automatic quarantine for poisoned jobs (optional)
  • Audit trail: who acted, what changed, why, and evidence after
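
A sketch of the guardrail pattern around retry, assuming RQ's FailedJobRegistry; record_audit, ALLOWED_ROLES, and SAFE_METHODS are hypothetical placeholders for the real access-control and audit layer.

```python
# Sketch of a guarded retry: role check, allow-list of verified-safe job
# classes, then requeue through RQ with an audit record either way.
# record_audit(), ALLOWED_ROLES and SAFE_METHODS are hypothetical placeholders.
from redis import Redis
from rq import Queue
from rq.job import Job
from rq.registry import FailedJobRegistry

conn = Redis.from_url("redis://localhost:11000")
ALLOWED_ROLES = {"System Manager", "Queue Operator"}
SAFE_METHODS = {"frappe.email.queue.flush"}   # verified idempotent job classes only

def record_audit(actor, action, job_id, outcome, reason):
    print({"actor": actor, "action": action, "job": job_id,
           "outcome": outcome, "reason": reason})   # stand-in for a real audit store

def guarded_retry(actor: str, roles: set, queue_name: str, job_id: str, reason: str):
    if not roles & ALLOWED_ROLES:
        record_audit(actor, "retry", job_id, "denied", "missing role")
        return False
    job = Job.fetch(job_id, connection=conn)
    if job.func_name not in SAFE_METHODS:
        record_audit(actor, "retry", job_id, "denied", "job class not verified safe")
        return False
    registry = FailedJobRegistry(queue=Queue(queue_name, connection=conn))
    registry.requeue(job_id)                  # moves the failed job back onto its queue
    record_audit(actor, "retry", job_id, "requeued", reason)
    return True
```
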
Evidence trails for operational truth

If a job ran, failed, retried, or was manually requeued - the system records it and you can prove it later (an example record shape follows the list).

Included
  • Execution timeline per job: queued → started → finished/failed
  • Error capture with stack traces + environment tags
  • Link jobs to business artifacts (DocType, integration entity, customer)
  • Exportable evidence for postmortems and compliance
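
The per-execution evidence record might look roughly like this; field names are illustrative, not a published schema.

```python
# Illustrative shape of a per-job evidence record: the timeline, the error
# signature, and the business artifact it touched. Field names are examples.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class JobEvidence:
    job_id: str
    method: str
    queue: str
    enqueued_at: datetime
    started_at: Optional[datetime] = None
    finished_at: Optional[datetime] = None
    status: str = "queued"              # queued -> started -> finished | failed
    error_signature: Optional[str] = None
    stack_trace: Optional[str] = None
    environment: dict = field(default_factory=dict)   # site, worker, release tag
    linked_doc: Optional[str] = None    # e.g. "Sales Invoice/SINV-0042"
    operator_actions: list = field(default_factory=list)  # retries, requeues, cancels
```
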
Alerting tied to playbooks

Alerts without action lead to fatigue. Alerts must point to what to do next (a cool-down sketch follows the list).

Included
  • Alert on backlog growth, oldest-job age, failure bursts, worker death
  • Route alerts by queue and owning team
  • Links to runbooks (Atlas) and remediation actions
  • Suppress repeated noise with cool-down rules
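
A minimal cool-down rule, sketched: the same alert key (queue plus condition) is suppressed until a window has elapsed since it last fired. The 15-minute window is an example value.

```python
# Minimal cool-down sketch: suppress re-firing of the same alert key
# (queue + condition) until the window has elapsed. Window is an example value.
import time

class CooldownGate:
    def __init__(self, window_seconds: int = 900):
        self.window = window_seconds
        self._last_fired = {}             # alert key -> monotonic time it last fired

    def should_fire(self, alert_key: str) -> bool:
        now = time.monotonic()
        last = self._last_fired.get(alert_key)
        if last is not None and now - last < self.window:
            return False                  # still cooling down; drop the repeat
        self._last_fired[alert_key] = now
        return True

gate = CooldownGate(window_seconds=900)
if gate.should_fire("default:oldest_job_age"):
    print("page the owning team with a runbook link")
```
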
Metrics

Operational metrics the platform tracks

These are the signals operators need to detect drift early, diagnose issues quickly, and prevent recurring incidents.

Metric: Queue depth (per queue)

Definition: Number of enqueued jobs waiting to be processed, split by queue name (short/default/long/custom).

Why it matters: A rising depth is a backlog. Backlogs turn into user-visible delays (emails, integrations, scheduler tasks) even when the UI appears “fine”.

Example operator threshold: Alert if depth grows > 3× baseline for 10 minutes, or if depth is monotonically increasing.
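
That rule can be expressed directly, as in this sketch; the per-minute sampling cadence and the baseline source are assumptions.

```python
# Sketch of the "3x baseline for 10 minutes" rule: every sample in the last
# ten minutes must exceed 3x the rolling baseline before the alert fires.
def depth_breaches_baseline(samples, baseline, factor=3.0, window=10):
    """samples: newest-last list of (minute, depth) pairs, one per minute."""
    recent = samples[-window:]
    if len(recent) < window or baseline <= 0:
        return False                      # not enough data to judge
    return all(depth > factor * baseline for _, depth in recent)

# Example: baseline of 40 queued jobs, depth stuck around 150 for 10 minutes.
samples = [(m, 150 + m) for m in range(10)]
print(depth_breaches_baseline(samples, baseline=40))   # True -> alert
```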

Metric: Age of oldest job

Definition: Time since the oldest job in a queue was enqueued (p95/p99 is ideal).

Why it matters: Depth can look normal while jobs silently stall. Oldest-job age reveals starvation, dead workers, or a blocked queue.

Example operator threshold: Alert if oldest-job age > 5–10 minutes in the short queue, or > 30 minutes in the long queue (tune per environment).

Metric: Throughput

Definition: Jobs completed per minute, by queue and by method.

Why it matters: A healthy system drains work predictably. Throughput collapsing with stable depth often indicates worker death or external dependency failures.

Metric: Failure rate

Definition: Failed jobs ÷ total jobs, by method/topic/time window (e.g., 5m/1h/24h).

Why it matters: Small failure rates can hide recurring breakage (emails, webhooks, scheduled reconciliations). Grouping by method reveals hotspots.

Example operator threshold: Alert if any method exceeds 1–2% failures in 1 hour or shows repeating failure signatures.
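
A sketch of this rollup, assuming execution records (method, status, finish time) are already being collected somewhere queryable.

```python
# Sketch: failure rate per method over a time window, from already-collected
# execution records (dicts with "method", "status", "finished_at").
from collections import defaultdict
from datetime import datetime, timedelta, timezone

def failure_rates(records, window=timedelta(hours=1), threshold=0.02):
    cutoff = datetime.now(timezone.utc) - window
    totals, failures = defaultdict(int), defaultdict(int)
    for r in records:
        if r["finished_at"] < cutoff:
            continue
        totals[r["method"]] += 1
        if r["status"] == "failed":
            failures[r["method"]] += 1
    hotspots = {}
    for method, total in totals.items():
        rate = failures[method] / total
        if rate >= threshold:
            hotspots[method] = round(rate, 4)
    return hotspots   # methods breaching the 1-2% example threshold
```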

Metric: Retry churn

Definition: Jobs that repeatedly fail and requeue; a measure of retry loops and wasted capacity.

Why it matters: Retry loops can DDoS your own workers, delay legitimate jobs, and create duplicate side effects if idempotency is missing.

Metric: Worker health

Definition: Active workers by queue, worker heartbeats, last-seen timestamps, and busy/idle distribution.

Why it matters: Queue problems are often worker problems (OOM kills, memory leaks, stuck threads). Health signals let operators act before backlog becomes visible.
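
These signals are readable straight from RQ, roughly as sketched below; the 90-second staleness cutoff is an example value.

```python
# Sketch: worker health straight from RQ - state, queues served, and how
# stale the last heartbeat is. The 90s staleness cutoff is an example value.
from datetime import datetime, timezone

from redis import Redis
from rq import Worker

conn = Redis.from_url("redis://localhost:11000")

def worker_health(stale_after_s: int = 90):
    now = datetime.now(timezone.utc)
    report = []
    for w in Worker.all(connection=conn):
        hb = w.last_heartbeat
        if hb is not None and hb.tzinfo is None:
            hb = hb.replace(tzinfo=timezone.utc)
        age = (now - hb).total_seconds() if hb else None
        report.append({
            "name": w.name,
            "queues": w.queue_names(),
            "state": w.get_state(),          # busy / idle / suspended
            "heartbeat_age_s": age,
            "suspect": age is None or age > stale_after_s,
        })
    return report
```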

Metric: Stuck job count

Definition: Jobs exceeding their expected runtime, showing no progress markers, or holding locked resources; includes a “started but not finished” classification.

Why it matters: Stuck jobs are the fastest path to systemic degradation. One poisoned job can block a whole queue depending on worker config.
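
A sketch of the “started but not finished” check using RQ's StartedJobRegistry; the runtime budget is an example and real budgets differ per job class.

```python
# Sketch: flag "started but not finished" jobs whose runtime exceeds a budget,
# using RQ's StartedJobRegistry. The 600s budget is an example value.
from datetime import datetime, timezone

from redis import Redis
from rq import Queue
from rq.job import Job
from rq.registry import StartedJobRegistry

conn = Redis.from_url("redis://localhost:11000")

def stuck_jobs(queue_name: str, max_runtime_s: int = 600):
    registry = StartedJobRegistry(queue=Queue(queue_name, connection=conn))
    now = datetime.now(timezone.utc)
    flagged = []
    for job_id in registry.get_job_ids():
        job = Job.fetch(job_id, connection=conn)
        started = job.started_at
        if started is None:
            continue
        if started.tzinfo is None:
            started = started.replace(tzinfo=timezone.utc)
        runtime = (now - started).total_seconds()
        if runtime > max_runtime_s:
            flagged.append({"job": job_id, "method": job.func_name,
                            "runtime_s": int(runtime)})
    return flagged
```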

Metric: Scheduler execution evidence

Definition: Verification that scheduled tasks ran (and how long they took), not just that they were configured.

Why it matters: In ERPNext, “scheduler is enabled” is configuration, not proof of execution. Operators need evidence of execution, failures, and drift over time.
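
One way to gather that evidence from inside a Frappe site, sketched below. It assumes Frappe's Scheduled Job Log doctype and its status values, and would run in a site context such as bench console; verify field and status names against your Frappe version before relying on it.

```python
# Sketch, run inside a Frappe site context (e.g. `bench console`).
# Assumes the Scheduled Job Log doctype and its status field; verify the
# field and status names against your Frappe version before relying on this.
import frappe
from frappe.utils import add_to_date, now_datetime

def scheduler_evidence(hours: int = 24):
    since = add_to_date(now_datetime(), hours=-hours)
    rows = frappe.get_all(
        "Scheduled Job Log",
        filters={"creation": (">=", since)},
        fields=["scheduled_job_type", "status", "creation"],
    )
    summary = {}
    for r in rows:
        entry = summary.setdefault(r.scheduled_job_type, {"runs": 0, "failed": 0})
        entry["runs"] += 1
        if r.status == "Failed":
            entry["failed"] += 1
    # Job types with zero rows here were configured but never executed in the window.
    return summary
```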

Failure modes

What breaks in production - and how we handle it

These are common, repeatable failure patterns in ERPNext production. The Observatory is built to surface them fast and make remediation safe.

Failure mode: Silent email failure

Symptom: Users stop receiving emails; the UI still looks normal.

Root cause: Email jobs failing in the background due to SMTP auth, rate limits, DNS issues, or invalid payloads.

Detect → Diagnose → Act safely → Prove outcome
How we detect it
  • Failure clusters for email sending methods (burst patterns)
  • Queue backlog + oldest job age rising in short/default queue
  • Failure signature grouping (auth errors vs payload errors)
How we fix it safely
  • Expose exact failing method + error signature and affected docs
  • Enable safe retry only after dependency health checks pass
  • Add rate-limit aware retry backoff + idempotent send markers
Failure mode: Webhook and integration drift

Symptom: Partial syncs - some records update, others never do.

Root cause: Jobs fail mid-flight; retries create duplicates; missing idempotency causes divergence.

Detect → Diagnose → Act safely → Prove outcome
How we detect it
  • High retry churn on specific integration topics/methods
  • Mismatch between inbound events and processed jobs
  • Recurring failure signatures after deploys or config changes
How we fix it safely
  • Enforce idempotency keys and replay-safe job handlers (see the sketch after this list)
  • Quarantine poisoned messages and surface them for review
  • Add deterministic replay controls with audit trails
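
A sketch of the idempotency pattern referenced above: derive a key from the business event and skip side effects if that key was already applied. The key fields and the in-memory store are illustrative stand-ins.

```python
# Illustrative replay-safe handler: an idempotency key derived from the
# business event is checked before side effects, so retries and webhook
# redeliveries cannot apply the same change twice. The store is a stand-in.
import hashlib
import json

_processed = set()   # stand-in for a durable store (DB table / Redis SET)

def idempotency_key(event: dict) -> str:
    raw = json.dumps(
        {"source": event["source"], "entity": event["entity_id"],
         "action": event["action"], "version": event["version"]},
        sort_keys=True,
    )
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def handle_webhook(event: dict) -> str:
    key = idempotency_key(event)
    if key in _processed:
        return "skipped: already applied"   # replay-safe: no duplicate side effects
    # ... apply the change to the ERPNext document here ...
    _processed.add(key)
    return "applied"

evt = {"source": "shop", "entity_id": "SO-1001", "action": "update", "version": 7}
print(handle_webhook(evt))   # applied
print(handle_webhook(evt))   # skipped: already applied
```
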
Failure mode: Worker starvation / dead workers

Symptom: The system feels slow; background work delays grow over hours.

Root cause: Workers crash (OOM), hang on long calls, or get starved by a poisoned job class.

Detect → Diagnose → Act safely → Prove outcome
How we detect it
  • Worker last-seen timestamps and sudden throughput collapse
  • Oldest-job age rising while depth appears stable
  • Stuck-job classification and runtime outliers
How we fix it safely
  • Restart/scale workers with evidence-based triggers
  • Separate queues by job class and expected runtime
  • Add timeouts and circuit breakers for external calls
Technical design

Designed for correctness under failure

Operators need safety: retries must not create duplicates, actions must be auditable, and the system must be measurable over time.

Idempotency and replay safety

Retries are expected. Job handlers must be deterministic and safe under repeats. The platform enforces patterns that prevent duplicates.

  • Idempotency keys per business event
  • Replay-safe handlers for integrations
  • Guardrails on bulk retries
Evidence trails by default

Every action creates evidence: job state changes, retries, cancellations, and outcomes. Postmortems become factual.

  • Execution timeline per job
  • Error signature clustering
  • Operator actions are logged
Separation of concerns

Queues are split by runtime and blast radius. Long-running jobs and poisoned classes stop taking down everything else.

  • Queue segmentation strategy
  • Timeouts + circuit breakers (sketch after this list)
  • Isolation for risky integrations
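
A minimal circuit-breaker sketch for external calls made from job handlers; the failure threshold and cool-off period are example values.

```python
# Minimal circuit-breaker sketch for external calls made from job handlers:
# after N consecutive failures the circuit opens and calls fail fast until a
# cool-off passes. Threshold and cool-off are example values.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_after_s: float = 60.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None             # monotonic time the circuit opened, or None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast, not calling dependency")
            self.opened_at = None         # cool-off elapsed; allow a probe call
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```
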
What operators can do safely

Actions with guardrails, not hero work

The most dangerous moment in an incident is the “just retry it” phase. The Observatory provides controlled actions that reduce risk and preserve auditability.

  • Retry with backoff
  • Requeue to correct queue
  • Cancel poisoned jobs
  • Quarantine + review
  • Bulk actions (restricted)
  • Export evidence
Next step

Want queues you can trust under pressure?

We’ll review your queue topology, failure hotspots, retry behavior, and worker health - then propose a practical plan to make background operations predictable and safe.

RQ / Redis · Scheduler evidence · Failure clustering · Safe retries · Audit trails