Task & Queue Observatory
ERPNext fails quietly in the background - in scheduler tasks, RQ queues, workers, and integrations. The Task & Queue Observatory makes that invisible work measurable and controllable: you can see failures, diagnose them, and take safe actions backed by evidence.
Built for operators. Designed for failure modes: retries, poisoned jobs, worker death, and silent drift.
ERPNext background work is critical - and usually invisible
When background jobs fail quietly, business operations degrade without a clear signal. The UI stays usable while the system accumulates damage.
Emails, notifications, integrations, and scheduled reconciliations fail in queues. Users only notice when it's already expensive.
- Failures are buried in logs
- No stable grouping by cause
- No evidence of what ran vs what didn’t
Operators resort to SSH, redis-cli, and restarts - risky actions that can create duplicates or drift under retries.
- Blind retries cause duplication
- Manual interventions are not auditable
- Hard to isolate poisoned job classes
Backlogs, worker starvation, and stuck jobs degrade the whole system. Without queue metrics, teams guess and overreact.
- Oldest-job age rises silently
- Throughput collapses without alarms
- One poisoned job blocks progress
Instrument, cluster, and control - with guardrails
The Observatory reads queues and job execution state, creates stable failure clusters, and provides safe actions tied to evidence and audit.
See queue depth, throughput, oldest job age, and worker health per queue - with baselines and time-window analysis.
- Queue-level metrics: depth, oldest-job age, throughput, failure rate
- Worker-level metrics: last-seen, busy/idle mix, crash indicators
- Baselines and anomaly detection (not static thresholds only)
- Time windows: 5m / 1h / 24h with p95/p99 rollups
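A minimal collection sketch for the queue- and worker-level signals above, assuming direct access to the site's queue Redis via python-rq (the library Frappe's background jobs build on); the Redis URL is a placeholder and the queue names are the standard short/default/long split:

```python
# Minimal sketch: snapshot queue depth, oldest-job age, and worker heartbeats
# with python-rq. The Redis URL is a placeholder; adjust to your bench config.
from datetime import datetime, timezone

from redis import Redis
from rq import Queue, Worker

conn = Redis.from_url("redis://localhost:11000")  # hypothetical queue Redis

def queue_snapshot(queue_names=("short", "default", "long")):
    now = datetime.now(timezone.utc)
    snapshot = {}
    for name in queue_names:
        q = Queue(name, connection=conn)
        jobs = q.jobs  # enqueued (not yet started) jobs
        oldest_age = max(
            ((now - j.enqueued_at.replace(tzinfo=timezone.utc)).total_seconds()
             for j in jobs if j.enqueued_at),
            default=0.0,
        )
        snapshot[name] = {"depth": q.count, "oldest_job_age_s": oldest_age}
    return snapshot

def worker_snapshot():
    return [
        {
            "name": w.name,
            "state": w.state,                    # busy / idle / suspended
            "last_heartbeat": w.last_heartbeat,  # None if never seen
            "queues": [q.name for q in w.queues],
        }
        for w in Worker.all(connection=conn)
    ]
```

Persisting these snapshots on a schedule is what makes baselines and 5m/1h/24h rollups possible later.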
Not all failures are equal. We group failures by method, error signature, topic, and recurrence so the hotspot is obvious.
- Group by: method, doctype event, webhook topic, integration channel
- Signature hashing: stable clusters across deploys and retries
- Recurring vs new failures (regression detection)
- Top offenders: failure count, retry churn, capacity impact
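One way such stable signatures can be produced is to normalize away volatile tokens (ids, timestamps, addresses) before hashing the method, exception class, and message; the normalization rules and the method name in the example are illustrative, not the platform's exact scheme:

```python
# Minimal sketch: derive a stable failure signature so retries and redeploys
# land in the same cluster. Normalization rules are illustrative.
import hashlib
import re

def normalize_error(text: str) -> str:
    text = re.sub(r"0x[0-9a-fA-F]+", "<addr>", text)                   # memory addresses
    text = re.sub(r"\b\d{4}-\d{2}-\d{2}[T ][\d:.]+\b", "<ts>", text)   # timestamps
    text = re.sub(r"\b\d+\b", "<n>", text)                             # ids, counters, ports
    return text.strip().lower()

def failure_signature(method: str, exc_type: str, message: str) -> str:
    key = "|".join([method, exc_type, normalize_error(message)])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]

# Two retries of the same underlying problem collapse into one cluster:
a = failure_signature("frappe.email.queue.flush", "SMTPAuthenticationError",
                      "535 Authentication failed for user 42 at 2024-05-01 10:01:02")
b = failure_signature("frappe.email.queue.flush", "SMTPAuthenticationError",
                      "535 Authentication failed for user 97 at 2024-05-02 08:15:44")
assert a == b
```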
Retrying blindly is dangerous. Actions must be controlled, auditable, and safe under failure.
- Retry/requeue/cancel with role-based access
- Bulk actions only on verified-safe job classes
- Automatic quarantine for poisoned jobs (optional)
- Audit trail: who acted, what changed, why, and evidence after
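A sketch of what a guarded bulk retry can look like with python-rq, assuming role checks have already passed; the allowlist, Redis URL, and audit sink are placeholders for your own policy:

```python
# Minimal sketch: requeue failed jobs only for verified-safe job classes and
# record an audit entry for every action taken.
from datetime import datetime, timezone

from redis import Redis
from rq import Queue
from rq.job import Job
from rq.registry import FailedJobRegistry

conn = Redis.from_url("redis://localhost:11000")
SAFE_TO_RETRY = {"frappe.email.queue.flush"}   # verified idempotent job classes (example)

def audit_log(entry: dict) -> None:
    print(entry)  # placeholder; persist this in a real deployment

def guarded_requeue(queue_name: str, operator: str, reason: str) -> list[str]:
    q = Queue(queue_name, connection=conn)
    registry = FailedJobRegistry(queue=q)
    requeued = []
    for job_id in registry.get_job_ids():
        job = Job.fetch(job_id, connection=conn)
        if job.func_name not in SAFE_TO_RETRY:
            continue                   # leave unverified classes for manual review
        registry.requeue(job_id)
        requeued.append(job_id)
        audit_log({
            "at": datetime.now(timezone.utc).isoformat(),
            "operator": operator, "action": "requeue",
            "job": job_id, "method": job.func_name, "reason": reason,
        })
    return requeued
```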
If a job ran, failed, retried, or was manually requeued, the system records it - and you can prove it later.
- Execution timeline per job: queued → started → finished/failed
- Error capture with stack traces + environment tags
- Link jobs to business artifacts (DocType, integration entity, customer)
- Exportable evidence for postmortems and compliance
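A minimal sketch of reconstructing a per-job timeline from rq's own metadata; exact field availability (started_at, ended_at, exc_info) varies with the rq version, so treat missing values as unknown:

```python
# Minimal sketch: pull a job's execution timeline and error from rq metadata.
from redis import Redis
from rq.job import Job

conn = Redis.from_url("redis://localhost:11000")  # placeholder

def job_timeline(job_id: str) -> dict:
    job = Job.fetch(job_id, connection=conn)
    return {
        "id": job.id,
        "method": job.func_name,
        "status": job.get_status(),     # queued / started / finished / failed ...
        "enqueued_at": job.enqueued_at,
        "started_at": job.started_at,
        "ended_at": job.ended_at,
        "error": job.exc_info,          # traceback text for failed jobs (version-dependent)
        # Business linkage: the kwargs passed at enqueue time usually carry the
        # DocType/name or integration entity involved.
        "kwargs": job.kwargs,
    }
```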
Alerts without action lead to fatigue. Alerts must point to what to do next.
- Alert on backlog growth, oldest-job age, failure bursts, worker death
- Route alerts by queue and owning team
- Links to runbooks (Atlas) and remediation actions
- Suppress repeated noise with cool-down rules
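A minimal cool-down sketch: suppress a repeat alert for the same (queue, failure signature) pair inside a configurable window. The window length is illustrative, and real state would live in Redis or the database rather than process memory:

```python
# Minimal sketch: cool-down suppression for repeated alerts.
import time

COOLDOWN_S = 15 * 60
_last_fired = {}   # (queue, signature) -> last fire time

def should_fire(queue, signature, now=None):
    now = time.time() if now is None else now
    key = (queue, signature)
    last = _last_fired.get(key)
    if last is not None and now - last < COOLDOWN_S:
        return False          # still cooling down; swallow the duplicate
    _last_fired[key] = now
    return True
```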
Operational metrics the platform tracks
These are the signals operators need to detect drift early, diagnose issues quickly, and prevent recurring incidents.
Queue depth
Number of enqueued jobs waiting to be processed, split by queue name (short/default/long/custom).
A rising depth is a backlog. Backlogs turn into user-visible delays (emails, integrations, scheduler tasks) even when the UI appears “fine”.
Alert if depth grows > 3× baseline for 10 minutes, or if depth increases monotonically across consecutive samples.
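A sketch of that alert rule, assuming depth samples are collected every minute or so; `samples` is a list of (timestamp, depth) pairs for one queue, oldest first, and `baseline` is whatever rolling baseline you maintain:

```python
# Minimal sketch of the rule above: fire when depth stays above 3x the
# baseline for the whole window, or when it only ever grows within it.
def depth_alert(samples, baseline, window_s=600):
    """samples: list of (unix_timestamp, depth) for one queue, oldest first."""
    if not samples or baseline <= 0:
        return False
    cutoff = samples[-1][0] - window_s
    recent = [depth for ts, depth in samples if ts >= cutoff]
    sustained_spike = all(depth > 3 * baseline for depth in recent)
    monotonic_growth = (
        len(recent) > 1
        and all(b >= a for a, b in zip(recent, recent[1:]))
        and recent[-1] > recent[0]
    )
    return sustained_spike or monotonic_growth
```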
Oldest-job age
Time since the oldest job in a queue was enqueued (tracking p95/p99 over time is ideal).
Depth can look normal while jobs silently stall. Oldest-job age reveals starvation, dead workers, or a blocked queue.
Alert if oldest-job age exceeds 5–10 minutes in the short queue, or 30 minutes in the long queue (tune per environment).
Throughput
Jobs completed per minute, by queue and by method.
A healthy system drains work predictably. Throughput collapsing with stable depth often indicates worker death or external dependency failures.
Failure rate
Failed jobs ÷ total jobs, by method/topic/time window (e.g., 5m/1h/24h).
Small failure rates can hide recurring breakage (emails, webhooks, scheduled reconciliations). Grouping by method reveals hotspots.
Alert if any method exceeds 1–2% failures in 1 hour or shows repeating failure signatures.
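A sketch of the per-method failure rollup via python-rq; the failed-job registry provides the numerator, while the denominator (total jobs per method) needs your own execution log, so only counts are shown here:

```python
# Minimal sketch: failure counts per method within a rough time window.
from collections import Counter
from datetime import datetime, timedelta, timezone

from redis import Redis
from rq import Queue
from rq.job import Job
from rq.registry import FailedJobRegistry

conn = Redis.from_url("redis://localhost:11000")  # placeholder

def failed_by_method(queue_name, window=timedelta(hours=1)):
    q = Queue(queue_name, connection=conn)
    cutoff = datetime.now(timezone.utc) - window
    counts = Counter()
    for job_id in FailedJobRegistry(queue=q).get_job_ids():
        job = Job.fetch(job_id, connection=conn)
        ended = job.ended_at
        if ended and ended.replace(tzinfo=timezone.utc) >= cutoff:
            counts[job.func_name] += 1   # group failures by method
    return counts
```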
Retry churn
Jobs that repeatedly fail and requeue; measures retry loops and wasted capacity.
Retry loops can DDoS your own workers, delay legitimate jobs, and create duplicate side effects if idempotency is missing.
Worker health
Active workers by queue, worker heartbeats, last-seen timestamps, and busy/idle distribution.
Queue problems are often worker problems (OOM kills, memory leaks, stuck threads). Health signals let operators act before backlog becomes visible.
Stuck jobs
Jobs that exceed expected runtime, show no progress markers, or hold locked resources; includes a “started but not finished” classification.
Stuck jobs are the fastest path to systemic degradation. One poisoned job can block a whole queue depending on worker config.
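A sketch of the “started but not finished” classification using rq's started-job registry; the per-queue runtime limits are illustrative and should be tuned:

```python
# Minimal sketch: flag started-but-not-finished jobs exceeding an expected runtime.
from datetime import datetime, timezone

from redis import Redis
from rq import Queue
from rq.job import Job
from rq.registry import StartedJobRegistry

conn = Redis.from_url("redis://localhost:11000")  # placeholder
EXPECTED_RUNTIME_S = {"short": 300, "default": 600, "long": 3600}  # illustrative

def stuck_jobs(queue_name):
    q = Queue(queue_name, connection=conn)
    limit = EXPECTED_RUNTIME_S.get(queue_name, 600)
    now = datetime.now(timezone.utc)
    stuck = []
    for job_id in StartedJobRegistry(queue=q).get_job_ids():
        job = Job.fetch(job_id, connection=conn)
        if not job.started_at:
            continue
        runtime = (now - job.started_at.replace(tzinfo=timezone.utc)).total_seconds()
        if runtime > limit:
            stuck.append({"id": job.id, "method": job.func_name, "runtime_s": runtime})
    return stuck
```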
Scheduler execution evidence
Verification that scheduled tasks ran (and how long they took), not just that they were configured.
An ERPNext “scheduler is enabled” flag is not proof of execution. Operators need evidence of runs, failures, and drift over time.
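A sketch of gathering that evidence from a bench console, assuming the standard Scheduled Job Log doctype that ships with recent Frappe versions; field and status names can differ across versions:

```python
# Minimal sketch: did this scheduled job actually run recently, and did it fail?
import frappe
from frappe.utils import add_to_date, now_datetime

def scheduler_evidence(job_type, hours=24):
    since = add_to_date(now_datetime(), hours=-hours)
    runs = frappe.get_all(
        "Scheduled Job Log",
        filters={"scheduled_job_type": job_type, "creation": (">=", since)},
        fields=["name", "status", "creation"],
        order_by="creation desc",
    )
    return {
        "job_type": job_type,
        "runs": len(runs),                               # executions in the window
        "last_run": runs[0].creation if runs else None,  # None means no evidence
        "failures": sum(1 for r in runs if r.status == "Failed"),
    }
```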
What breaks in production - and how we handle it
These are common, repeatable failure patterns in ERPNext production. The Observatory is built to surface them fast and make remediation safe.
Silent email failure
Symptom: Users stop receiving emails; the UI still looks normal.
Root cause: Email jobs failing in background due to SMTP auth, rate limits, DNS issues, or invalid payloads.
How we detect it:
- Failure clusters for email sending methods (burst patterns)
- Queue backlog + oldest job age rising in short/default queue
- Failure signature grouping (auth errors vs payload errors)
How we remediate:
- Expose exact failing method + error signature and affected docs
- Enable safe retry only after dependency health checks pass
- Add rate-limit aware retry backoff + idempotent send markers
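A sketch of the last remediation point - rate-limit aware backoff plus an idempotent “already sent” marker so a retried job cannot deliver twice. The cache key scheme, exception class, and `deliver` helper are hypothetical stand-ins for your actual SMTP path:

```python
# Minimal sketch: idempotent, rate-limit aware email send for retried jobs.
import random
import time

import frappe

class RateLimitError(Exception):
    """Raised by deliver() when the provider rate-limits us (hypothetical)."""

def deliver(email_queue_name):
    ...  # placeholder: look up the Email Queue row and send via SMTP

def send_once(email_queue_name, attempt=0, max_attempts=5):
    marker = f"email_sent::{email_queue_name}"       # idempotent send marker
    if frappe.cache().get_value(marker):
        return "already-sent"                        # retry of a delivered job
    try:
        deliver(email_queue_name)
    except RateLimitError:
        if attempt + 1 >= max_attempts:
            raise
        # Exponential backoff with jitter stays under the provider's limits
        # instead of hammering them on every retry.
        time.sleep(min(2 ** attempt + random.random(), 300))
        return send_once(email_queue_name, attempt + 1, max_attempts)
    frappe.cache().set_value(marker, 1, expires_in_sec=24 * 3600)
    return "sent"
```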
Webhook and integration drift
Symptom: Partial syncs - some records update, others never do.
Root cause: Jobs fail mid-flight; retries create duplicates; missing idempotency causes divergence.
How we detect it:
- High retry churn on specific integration topics/methods
- Mismatch between inbound events and processed jobs
- Recurring failure signatures after deploys or config changes
How we remediate:
- Enforce idempotency keys and replay-safe job handlers (sketched below)
- Quarantine poisoned messages and surface them for review
- Add deterministic replay controls with audit trails
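A sketch of a replay-safe handler built on an idempotency key; the “Webhook Event” log doctype and `apply_changes` helper are hypothetical, and a real handler would also guard against concurrent duplicates:

```python
# Minimal sketch: replay-safe webhook processing keyed on an idempotency key.
import hashlib
import json

import frappe

def apply_changes(payload):
    ...  # placeholder: create/update the target documents

def handle_webhook(payload):
    # Prefer the provider's event id; fall back to a hash of the payload.
    key = payload.get("event_id") or hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()

    if frappe.db.exists("Webhook Event", {"idempotency_key": key}):
        return "duplicate-skipped"       # replay or retry: do nothing

    apply_changes(payload)               # your actual sync logic
    frappe.get_doc({
        "doctype": "Webhook Event",      # hypothetical log doctype
        "idempotency_key": key,
        "status": "Processed",
    }).insert(ignore_permissions=True)
    frappe.db.commit()
    return "processed"
```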
Worker starvation / dead workers
Symptom: System feels slow; background work delays grow over hours.
Root cause: Workers crash (OOM), hang on long calls, or get starved by a poisoned job class.
How we detect it:
- Worker last-seen timestamps and sudden throughput collapse
- Oldest-job age rising while depth appears stable
- Stuck-job classification and runtime outliers
How we remediate:
- Restart/scale workers with evidence-based triggers
- Separate queues by job class and expected runtime
- Add timeouts and circuit breakers for external calls
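A sketch of the timeout-plus-circuit-breaker pattern for external calls, so one hung dependency cannot pin a worker; the thresholds are illustrative:

```python
# Minimal sketch: hard timeouts plus a per-dependency circuit breaker.
import time

import requests

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=60):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None   # unix time when the circuit opened

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: skipping external call")
            self.opened_at = None        # half-open: allow one probe
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result

breaker = CircuitBreaker()

def fetch_remote(url):
    # connect/read timeouts keep a hung endpoint from blocking the worker
    resp = breaker.call(requests.get, url, timeout=(5, 30))
    resp.raise_for_status()
    return resp.json()
```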
Designed for correctness under failure
Operators need safety: retries must not create duplicates, actions must be auditable, and the system must be measurable over time.
Retries are expected. Job handlers must be deterministic and safe under repeats. The platform enforces patterns that prevent duplicates.
- Idempotency keys per business event
- Replay-safe handlers for integrations
- Guardrails on bulk retries
Every action creates evidence: job state changes, retries, cancellations, and outcomes. Postmortems become factual.
- Execution timeline per job
- Error signature clustering
- Operator actions are logged
Queues are split by runtime and blast radius. Long-running jobs and poisoned classes stop taking down everything else.
- Queue segmentation strategy
- Timeouts + circuit breakers
- Isolation for risky integrations
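A sketch of routing a risky job class to its own queue with an explicit timeout via frappe.enqueue; the handler path and queue name are hypothetical, custom queue names assume a Frappe version that supports them, and the queue needs its own worker (for example `bench worker --queue integrations`):

```python
# Minimal sketch: isolate risky integration jobs on a dedicated queue.
import frappe

def enqueue_integration_sync(entity):
    frappe.enqueue(
        "myapp.integrations.sync.sync_entity",   # hypothetical handler path
        queue="integrations",                    # isolated blast radius
        timeout=600,                             # hard cap on runtime
        job_name=f"sync_entity::{entity}",
        entity=entity,                           # forwarded to the handler
    )
```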
Actions with guardrails, not hero work
The most dangerous moment in an incident is the “just retry it” phase. The Observatory provides controlled actions that reduce risk and preserve auditability.
Want queues you can trust under pressure?
We’ll review your queue topology, failure hotspots, retry behavior, and worker health - then propose a practical plan to make background operations predictable and safe.