Task & Queue Observatory
ERPNext fails quietly in the background - in scheduler tasks, RQ queues, workers, and integrations. The Task & Queue Observatory makes that invisible work measurable and controllable: you can see failures, diagnose them, and take safe actions backed by evidence.
Built for operators. Designed for failure modes: retries, poisoned jobs, worker death, and silent drift.
ERPNext background work is critical - and usually invisible
When background jobs fail quietly, business operations degrade without a clear signal. The UI stays usable while the system accumulates damage.
Emails, notifications, integrations, and scheduled reconciliations fail in queues. Users only notice when it's already expensive.
- Failures are buried in logs
- No stable grouping by cause
- No evidence of what ran vs what didn’t
Operators resort to SSH, redis-cli, and restarts - risky actions that can create duplicates or drift under retries.
- Blind retries cause duplication
- Manual interventions are not auditable
- Hard to isolate poisoned job classes
Backlogs, worker starvation, and stuck jobs degrade the whole system. Without queue metrics, teams guess and overreact.
- Oldest-job age rises silently
- Throughput collapses without alarms
- One poisoned job blocks progress
Instrument, cluster, and control - with guardrails
The Observatory reads queues and job execution state, creates stable failure clusters, and provides safe actions tied to evidence and audit.
See queue depth, throughput, oldest job age, and worker health per queue - with baselines and time-window analysis.
- Queue-level metrics: depth, oldest-job age, throughput, failure rate
- Worker-level metrics: last-seen, busy/idle mix, crash indicators
- Baselines and anomaly detection (not static thresholds only)
- Time windows: 5m / 1h / 24h with p95/p99 rollups
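A minimal collection sketch for the queue- and worker-level signals above, assuming direct access to the site's queue Redis via python-rq (the library Frappe's background jobs build on); the Redis URL is a placeholder and the queue names are the standard short/default/long split:

```python
# Minimal sketch: snapshot queue depth, oldest-job age, and worker heartbeats
# with python-rq. The Redis URL is a placeholder; adjust to your bench config.
from datetime import datetime, timezone

from redis import Redis
from rq import Queue, Worker

conn = Redis.from_url("redis://localhost:11000")  # hypothetical queue Redis

def queue_snapshot(queue_names=("short", "default", "long")):
    now = datetime.now(timezone.utc)
    snapshot = {}
    for name in queue_names:
        q = Queue(name, connection=conn)
        jobs = q.jobs  # enqueued (not yet started) jobs
        oldest_age = max(
            ((now - j.enqueued_at.replace(tzinfo=timezone.utc)).total_seconds()
             for j in jobs if j.enqueued_at),
            default=0.0,
        )
        snapshot[name] = {"depth": q.count, "oldest_job_age_s": oldest_age}
    return snapshot

def worker_snapshot():
    return [
        {
            "name": w.name,
            "state": w.state,                    # busy / idle / suspended
            "last_heartbeat": w.last_heartbeat,  # None if never seen
            "queues": [q.name for q in w.queues],
        }
        for w in Worker.all(connection=conn)
    ]
```

Persisting these snapshots on a schedule is what makes baselines and 5m/1h/24h rollups possible later.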
Not all failures are equal. We group failures by method, error signature, topic, and recurrence so the hotspot is obvious.
- Group by: method, doctype event, webhook topic, integration channel
- Signature hashing: stable clusters across deploys and retries
- Recurring vs new failures (regression detection)
- Top offenders: failure count, retry churn, capacity impact
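One way such stable signatures can be produced is to normalize away volatile tokens (ids, timestamps, addresses) before hashing the method, exception class, and message; the normalization rules and the method name in the example are illustrative, not the platform's exact scheme:

```python
# Minimal sketch: derive a stable failure signature so retries and redeploys
# land in the same cluster. Normalization rules are illustrative.
import hashlib
import re

def normalize_error(text: str) -> str:
    text = re.sub(r"0x[0-9a-fA-F]+", "<addr>", text)                   # memory addresses
    text = re.sub(r"\b\d{4}-\d{2}-\d{2}[T ][\d:.]+\b", "<ts>", text)   # timestamps
    text = re.sub(r"\b\d+\b", "<n>", text)                             # ids, counters, ports
    return text.strip().lower()

def failure_signature(method: str, exc_type: str, message: str) -> str:
    key = "|".join([method, exc_type, normalize_error(message)])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]

# Two retries of the same underlying problem collapse into one cluster:
a = failure_signature("frappe.email.queue.flush", "SMTPAuthenticationError",
                      "535 Authentication failed for user 42 at 2024-05-01 10:01:02")
b = failure_signature("frappe.email.queue.flush", "SMTPAuthenticationError",
                      "535 Authentication failed for user 97 at 2024-05-02 08:15:44")
assert a == b
```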
Retrying blindly is dangerous. Actions must be controlled, auditable, and safe under failure.
- Retry/requeue/cancel with role-based access
- Bulk actions only on verified-safe job classes
- Automatic quarantine for poisoned jobs (optional)
- Audit trail: who acted, what changed, why, and evidence after
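A sketch of what a guarded bulk retry can look like with python-rq, assuming role checks have already passed; the allowlist, Redis URL, and audit sink are placeholders for your own policy:

```python
# Minimal sketch: requeue failed jobs only for verified-safe job classes and
# record an audit entry for every action taken.
from datetime import datetime, timezone

from redis import Redis
from rq import Queue
from rq.job import Job
from rq.registry import FailedJobRegistry

conn = Redis.from_url("redis://localhost:11000")
SAFE_TO_RETRY = {"frappe.email.queue.flush"}   # verified idempotent job classes (example)

def audit_log(entry: dict) -> None:
    print(entry)  # placeholder; persist this in a real deployment

def guarded_requeue(queue_name: str, operator: str, reason: str) -> list[str]:
    q = Queue(queue_name, connection=conn)
    registry = FailedJobRegistry(queue=q)
    requeued = []
    for job_id in registry.get_job_ids():
        job = Job.fetch(job_id, connection=conn)
        if job.func_name not in SAFE_TO_RETRY:
            continue                   # leave unverified classes for manual review
        registry.requeue(job_id)
        requeued.append(job_id)
        audit_log({
            "at": datetime.now(timezone.utc).isoformat(),
            "operator": operator, "action": "requeue",
            "job": job_id, "method": job.func_name, "reason": reason,
        })
    return requeued
```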
If a job ran, failed, retried, or was manually requeued, the system records it - and you can prove it later.
- Execution timeline per job: queued → started → finished/failed
- Error capture with stack traces + environment tags
- Link jobs to business artifacts (DocType, integration entity, customer)
- Exportable evidence for postmortems and compliance
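A minimal sketch of reconstructing a per-job timeline from rq's own metadata; exact field availability (started_at, ended_at, exc_info) varies with the rq version, so treat missing values as unknown:

```python
# Minimal sketch: pull a job's execution timeline and error from rq metadata.
from redis import Redis
from rq.job import Job

conn = Redis.from_url("redis://localhost:11000")  # placeholder

def job_timeline(job_id: str) -> dict:
    job = Job.fetch(job_id, connection=conn)
    return {
        "id": job.id,
        "method": job.func_name,
        "status": job.get_status(),     # queued / started / finished / failed ...
        "enqueued_at": job.enqueued_at,
        "started_at": job.started_at,
        "ended_at": job.ended_at,
        "error": job.exc_info,          # traceback text for failed jobs (version-dependent)
        # Business linkage: the kwargs passed at enqueue time usually carry the
        # DocType/name or integration entity involved.
        "kwargs": job.kwargs,
    }
```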
Alerts without action lead to fatigue. Alerts must point to what to do next.
- Alert on backlog growth, oldest-job age, failure bursts, worker death
- Route alerts by queue and owning team
- Links to runbooks (Atlas) and remediation actions
- Suppress repeated noise with cool-down rules
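A minimal cool-down sketch: suppress a repeat alert for the same (queue, failure signature) pair inside a configurable window. The window length is illustrative, and real state would live in Redis or the database rather than process memory:

```python
# Minimal sketch: cool-down suppression for repeated alerts.
import time

COOLDOWN_S = 15 * 60
_last_fired = {}   # (queue, signature) -> last fire time

def should_fire(queue, signature, now=None):
    now = time.time() if now is None else now
    key = (queue, signature)
    last = _last_fired.get(key)
    if last is not None and now - last < COOLDOWN_S:
        return False          # still cooling down; swallow the duplicate
    _last_fired[key] = now
    return True
```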
Operational metrics the platform tracks
These are the signals operators need to detect drift early, diagnose issues quickly, and prevent recurring incidents.
Queue depth
Number of enqueued jobs waiting to be processed, split by queue name (short/default/long/custom).
A rising depth is a backlog. Backlogs turn into user-visible delays (emails, integrations, scheduler tasks) even when the UI appears “fine”.
Alert if depth grows > 3× baseline for 10 minutes, or if depth increases monotonically across consecutive samples.
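A sketch of that alert rule, assuming depth samples are collected every minute or so; `samples` is a list of (timestamp, depth) pairs for one queue, oldest first, and `baseline` is whatever rolling baseline you maintain:

```python
# Minimal sketch of the rule above: fire when depth stays above 3x the
# baseline for the whole window, or when it only ever grows within it.
def depth_alert(samples, baseline, window_s=600):
    """samples: list of (unix_timestamp, depth) for one queue, oldest first."""
    if not samples or baseline <= 0:
        return False
    cutoff = samples[-1][0] - window_s
    recent = [depth for ts, depth in samples if ts >= cutoff]
    sustained_spike = all(depth > 3 * baseline for depth in recent)
    monotonic_growth = (
        len(recent) > 1
        and all(b >= a for a, b in zip(recent, recent[1:]))
        and recent[-1] > recent[0]
    )
    return sustained_spike or monotonic_growth
```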
Oldest-job age
Time since the oldest job in a queue was enqueued (tracking p95/p99 over time is ideal).
Depth can look normal while jobs silently stall. Oldest-job age reveals starvation, dead workers, or a blocked queue.
Alert if oldest-job age exceeds 5–10 minutes in the short queue, or 30 minutes in the long queue (tune per environment).
Throughput
Jobs completed per minute, by queue and by method.
A healthy system drains work predictably. Throughput collapsing with stable depth often indicates worker death or external dependency failures.
Failure rate
Failed jobs ÷ total jobs, by method/topic/time window (e.g., 5m/1h/24h).
Small failure rates can hide recurring breakage (emails, webhooks, scheduled reconciliations). Grouping by method reveals hotspots.
Alert if any method exceeds 1–2% failures in 1 hour or shows repeating failure signatures.
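A sketch of the per-method failure rollup via python-rq; the failed-job registry provides the numerator, while the denominator (total jobs per method) needs your own execution log, so only counts are shown here:

```python
# Minimal sketch: failure counts per method within a rough time window.
from collections import Counter
from datetime import datetime, timedelta, timezone

from redis import Redis
from rq import Queue
from rq.job import Job
from rq.registry import FailedJobRegistry

conn = Redis.from_url("redis://localhost:11000")  # placeholder

def failed_by_method(queue_name, window=timedelta(hours=1)):
    q = Queue(queue_name, connection=conn)
    cutoff = datetime.now(timezone.utc) - window
    counts = Counter()
    for job_id in FailedJobRegistry(queue=q).get_job_ids():
        job = Job.fetch(job_id, connection=conn)
        ended = job.ended_at
        if ended and ended.replace(tzinfo=timezone.utc) >= cutoff:
            counts[job.func_name] += 1   # group failures by method
    return counts
```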
Retry churn
Jobs that repeatedly fail and requeue; measures retry loops and wasted capacity.
Retry loops can DDoS your own workers, delay legitimate jobs, and create duplicate side effects if idempotency is missing.
Worker health
Active workers by queue, worker heartbeats, last-seen timestamps, and busy/idle distribution.
Queue problems are often worker problems (OOM kills, memory leaks, stuck threads). Health signals let operators act before backlog becomes visible.
Stuck jobs
Jobs that exceed expected runtime, show no progress markers, or hold locked resources; includes a “started but not finished” classification.
Stuck jobs are the fastest path to systemic degradation. One poisoned job can block a whole queue depending on worker config.
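A sketch of the “started but not finished” classification using rq's started-job registry; the per-queue runtime limits are illustrative and should be tuned:

```python
# Minimal sketch: flag started-but-not-finished jobs exceeding an expected runtime.
from datetime import datetime, timezone

from redis import Redis
from rq import Queue
from rq.job import Job
from rq.registry import StartedJobRegistry

conn = Redis.from_url("redis://localhost:11000")  # placeholder
EXPECTED_RUNTIME_S = {"short": 300, "default": 600, "long": 3600}  # illustrative

def stuck_jobs(queue_name):
    q = Queue(queue_name, connection=conn)
    limit = EXPECTED_RUNTIME_S.get(queue_name, 600)
    now = datetime.now(timezone.utc)
    stuck = []
    for job_id in StartedJobRegistry(queue=q).get_job_ids():
        job = Job.fetch(job_id, connection=conn)
        if not job.started_at:
            continue
        runtime = (now - job.started_at.replace(tzinfo=timezone.utc)).total_seconds()
        if runtime > limit:
            stuck.append({"id": job.id, "method": job.func_name, "runtime_s": runtime})
    return stuck
```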
Scheduler execution evidence
Verification that scheduled tasks ran (and how long they took), not just that they were configured.
An ERPNext “scheduler is enabled” flag is not proof of execution. Operators need evidence of runs, failures, and drift over time.
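A sketch of gathering that evidence from a bench console, assuming the standard Scheduled Job Log doctype that ships with recent Frappe versions; field and status names can differ across versions:

```python
# Minimal sketch: did this scheduled job actually run recently, and did it fail?
import frappe
from frappe.utils import add_to_date, now_datetime

def scheduler_evidence(job_type, hours=24):
    since = add_to_date(now_datetime(), hours=-hours)
    runs = frappe.get_all(
        "Scheduled Job Log",
        filters={"scheduled_job_type": job_type, "creation": (">=", since)},
        fields=["name", "status", "creation"],
        order_by="creation desc",
    )
    return {
        "job_type": job_type,
        "runs": len(runs),                               # executions in the window
        "last_run": runs[0].creation if runs else None,  # None means no evidence
        "failures": sum(1 for r in runs if r.status == "Failed"),
    }
```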
What breaks in production - and how we handle it
These are common, repeatable failure patterns in ERPNext production. The Observatory is built to surface them fast and make remediation safe.
Silent email failure
Symptom: Users stop receiving emails; the UI still looks normal.
Root cause: Email jobs failing in background due to SMTP auth, rate limits, DNS issues, or invalid payloads.
How we detect it:
- Failure clusters for email sending methods (burst patterns)
- Queue backlog + oldest job age rising in short/default queue
- Failure signature grouping (auth errors vs payload errors)
How we remediate:
- Expose exact failing method + error signature and affected docs
- Enable safe retry only after dependency health checks pass
- Add rate-limit aware retry backoff + idempotent send markers
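A sketch of the last remediation point - rate-limit aware backoff plus an idempotent “already sent” marker so a retried job cannot deliver twice. The cache key scheme, exception class, and `deliver` helper are hypothetical stand-ins for your actual SMTP path:

```python
# Minimal sketch: idempotent, rate-limit aware email send for retried jobs.
import random
import time

import frappe

class RateLimitError(Exception):
    """Raised by deliver() when the provider rate-limits us (hypothetical)."""

def deliver(email_queue_name):
    ...  # placeholder: look up the Email Queue row and send via SMTP

def send_once(email_queue_name, attempt=0, max_attempts=5):
    marker = f"email_sent::{email_queue_name}"       # idempotent send marker
    if frappe.cache().get_value(marker):
        return "already-sent"                        # retry of a delivered job
    try:
        deliver(email_queue_name)
    except RateLimitError:
        if attempt + 1 >= max_attempts:
            raise
        # Exponential backoff with jitter stays under the provider's limits
        # instead of hammering them on every retry.
        time.sleep(min(2 ** attempt + random.random(), 300))
        return send_once(email_queue_name, attempt + 1, max_attempts)
    frappe.cache().set_value(marker, 1, expires_in_sec=24 * 3600)
    return "sent"
```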
Webhook and integration drift
Symptom: Partial syncs - some records update, others never do.
Root cause: Jobs fail mid-flight; retries create duplicates; missing idempotency causes divergence.
How we detect it:
- High retry churn on specific integration topics/methods
- Mismatch between inbound events and processed jobs
- Recurring failure signatures after deploys or config changes
How we remediate:
- Enforce idempotency keys and replay-safe job handlers (sketched below)
- Quarantine poisoned messages and surface them for review
- Add deterministic replay controls with audit trails
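A sketch of a replay-safe handler built on an idempotency key; the “Webhook Event” log doctype and `apply_changes` helper are hypothetical, and a real handler would also guard against concurrent duplicates:

```python
# Minimal sketch: replay-safe webhook processing keyed on an idempotency key.
import hashlib
import json

import frappe

def apply_changes(payload):
    ...  # placeholder: create/update the target documents

def handle_webhook(payload):
    # Prefer the provider's event id; fall back to a hash of the payload.
    key = payload.get("event_id") or hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()

    if frappe.db.exists("Webhook Event", {"idempotency_key": key}):
        return "duplicate-skipped"       # replay or retry: do nothing

    apply_changes(payload)               # your actual sync logic
    frappe.get_doc({
        "doctype": "Webhook Event",      # hypothetical log doctype
        "idempotency_key": key,
        "status": "Processed",
    }).insert(ignore_permissions=True)
    frappe.db.commit()
    return "processed"
```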
Worker starvation / dead workers
Symptom: System feels slow; background work delays grow over hours.
Root cause: Workers crash (OOM), hang on long calls, or get starved by a poisoned job class.
How we detect it:
- Worker last-seen timestamps and sudden throughput collapse
- Oldest-job age rising while depth appears stable
- Stuck-job classification and runtime outliers
How we remediate:
- Restart/scale workers with evidence-based triggers
- Separate queues by job class and expected runtime
- Add timeouts and circuit breakers for external calls
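A sketch of the timeout-plus-circuit-breaker pattern for external calls, so one hung dependency cannot pin a worker; the thresholds are illustrative:

```python
# Minimal sketch: hard timeouts plus a per-dependency circuit breaker.
import time

import requests

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=60):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None   # unix time when the circuit opened

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: skipping external call")
            self.opened_at = None        # half-open: allow one probe
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result

breaker = CircuitBreaker()

def fetch_remote(url):
    # connect/read timeouts keep a hung endpoint from blocking the worker
    resp = breaker.call(requests.get, url, timeout=(5, 30))
    resp.raise_for_status()
    return resp.json()
```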
Designed for correctness under failure
Operators need safety: retries must not create duplicates, actions must be auditable, and the system must be measurable over time.
Retries are expected. Job handlers must be deterministic and safe under repeats. The platform enforces patterns that prevent duplicates.
- Idempotency keys per business event
- Replay-safe handlers for integrations
- Guardrails on bulk retries
Every action creates evidence: job state changes, retries, cancellations, and outcomes. Postmortems become factual.
- Execution timeline per job
- Error signature clustering
- Operator actions are logged
Queues are split by runtime and blast radius. Long-running jobs and poisoned classes stop taking down everything else.
- Queue segmentation strategy
- Timeouts + circuit breakers
- Isolation for risky integrations
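A sketch of routing a risky job class to its own queue with an explicit timeout via frappe.enqueue; the handler path and queue name are hypothetical, custom queue names assume a Frappe version that supports them, and the queue needs its own worker (for example `bench worker --queue integrations`):

```python
# Minimal sketch: isolate risky integration jobs on a dedicated queue.
import frappe

def enqueue_integration_sync(entity):
    frappe.enqueue(
        "myapp.integrations.sync.sync_entity",   # hypothetical handler path
        queue="integrations",                    # isolated blast radius
        timeout=600,                             # hard cap on runtime
        job_name=f"sync_entity::{entity}",
        entity=entity,                           # forwarded to the handler
    )
```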
Actions with guardrails, not hero work
The most dangerous moment in an incident is the “just retry it” phase. The Observatory provides controlled actions that reduce risk and preserve auditability.
Want queues you can trust under pressure?
We’ll review your queue topology, failure hotspots, retry behavior, and worker health - then propose a practical plan to make background operations predictable and safe.