Backup & Restore System

Backups are not safety - restores are. This capability turns backup pipelines into proven recovery readiness: automated backups, retention discipline, restore workflows, and routine verification with evidence.

Built for production: evidence, guardrails, and measured recovery objectives - not “we think it’s fine.”

Core promise: Proven recovery - if you can’t restore, you don’t have backups.
Primary surface: Restore points - browse, validate, restore, with audits.
Operator outcome: Predictability - measured RPO/RTO, fewer surprises during incidents.
Problem

Most teams have backups - and still fail recovery

The failure is rarely “no backups.” It’s missing verification, weak retention, unclear scope, and no operator control when pressure hits.

Backups without restore proof

Backups run for months - until the first restore attempt fails. Then you learn the pipeline has been rotting silently.

  • No routine restore tests
  • Missing attachments/private files
  • Secret/key drift breaks restores
Retention is inconsistent

Policies are informal and enforcement is weak. When incidents are discovered late, the restore points you need are missing.

  • Weekly/monthly points missing
  • Storage policies delete silently
  • No evidence trail during incidents
Operators lack safe controls

Under pressure, teams resort to ad-hoc scripts and SSH. Risk rises: accidental deletion, restores to the wrong target, and no audit trail.

  • No guarded restore workflows
  • Manual steps are error-prone
  • Hard to prove what happened
How the system solves it

Automate, verify, and prove - before incidents

The system enforces backup policy, validates integrity, runs routine restore verification, and provides a controlled operator UI with evidence trails.

Backup automation with policy enforcement

Automate backups for database and files, enforce retention policy, and maintain evidence of execution (a minimal policy sketch follows the list below).

Included
  • Schedules: daily / hourly / custom windows
  • DB + files coverage with clear scope and exclusions
  • Retention rules per environment and per destination
  • Execution evidence: timestamps, sizes, checksums, logs
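
A minimal sketch of how such a policy and its execution evidence might be represented. The field names and retention counts below are illustrative assumptions, not a fixed schema.

```python
from dataclasses import dataclass, field

@dataclass
class BackupPolicy:
    """Illustrative policy: schedule, scope, and retention per destination."""
    site: str
    schedule_cron: str                   # e.g. "0 2 * * *" for a daily 02:00 window
    include_files: bool = True           # public + private files, not only the database
    exclude_paths: tuple = ()            # explicit exclusions keep scope auditable
    retention: dict = field(default_factory=lambda: {"daily": 7, "weekly": 4, "monthly": 6})
    destinations: tuple = ("local", "offsite")

@dataclass
class BackupEvidence:
    """Execution evidence recorded for every run."""
    site: str
    started_at: str          # ISO-8601 timestamps
    finished_at: str
    size_bytes: int
    sha256: str              # checksum of the produced artifact
    destination: str
    success: bool
    log_path: str

policy = BackupPolicy(site="erp.example.com", schedule_cron="0 2 * * *")
print(policy.retention)      # {'daily': 7, 'weekly': 4, 'monthly': 6}
```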
Operator UI for browse, download, and restore

Give operators a controlled interface to view restore points, validate them, and perform restores safely (an example audit record follows the list below).

Included
  • Browse restore points with metadata and health state
  • Download artifacts when needed (role-gated)
  • Restore workflows with confirmation + guardrails
  • Per-restore audit trail (who/what/when/why)
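
As a sketch of the per-restore audit trail, a record could capture who/what/when/why like this. The record shape and the JSON-lines file are assumptions for illustration, not the system's actual storage format.

```python
import getpass
import json
from datetime import datetime, timezone

def record_restore_audit(restore_point_id, target, reason, approved_by, outcome,
                         path="restore_audit.jsonl"):
    """Append a who/what/when/why record for a restore action (illustrative shape)."""
    entry = {
        "who": getpass.getuser(),
        "what": {"restore_point": restore_point_id, "target": target},
        "when": datetime.now(timezone.utc).isoformat(),
        "why": reason,
        "approved_by": approved_by,
        "outcome": outcome,              # e.g. "completed", "aborted-at-preflight"
    }
    with open(path, "a") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry

record_restore_audit("rp-2024-06-01-0200", target="staging",
                     reason="incident INC-123 data check",
                     approved_by="ops-lead", outcome="completed")
```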
Restore verification as routine

Automate restore tests in a controlled target to prove that backups are usable and complete (a verification-run sketch follows the list below).

Included
  • Automated test restores on a schedule
  • Verification checks: DB opens, schema sanity, file presence, app boot checks
  • Measured restore time (RTO) and recovered point (RPO)
  • Evidence reports exportable for compliance and postmortems
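
A sketch of what a scheduled verification run could look like. The restore step and the probes are placeholders for your own tooling; the point is the measured RTO/RPO and the pass/fail evidence record.

```python
import time
from datetime import datetime, timezone

def verify_restore(restore_point, target_env):
    """Restore a backup into a scratch target and record timings + check results.

    `restore_point` is assumed to carry an id, the artifact path, and the ISO
    timestamp the backup was taken; the restore and probe steps are placeholders.
    """
    started = time.monotonic()

    # Placeholder restore step: call your real restore command here and capture its result.
    restore_ok = True

    # Placeholder verification probes: DB opens, schema sanity, files present, app boots.
    checks = {
        "restore_completed": restore_ok,
        "db_opens": True,
        "schema_sane": True,
        "files_present": True,
        "app_boots": True,
    }

    rto_seconds = time.monotonic() - started                              # measured restore time
    taken_at = datetime.fromisoformat(restore_point["taken_at"])
    rpo_seconds = (datetime.now(timezone.utc) - taken_at).total_seconds() # data-loss window

    return {
        "restore_point": restore_point["id"],
        "target": target_env,
        "passed": all(checks.values()),
        "checks": checks,
        "measured_rto_s": round(rto_seconds, 1),
        "measured_rpo_s": round(rpo_seconds, 1),
    }

report = verify_restore(
    {"id": "rp-001", "artifact": "/backups/rp-001.tar.gz",
     "taken_at": "2024-06-01T02:00:00+00:00"},
    target_env="restore-test",
)
print(report["passed"], report["measured_rto_s"], report["measured_rpo_s"])
```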
Integrity and corruption detection

Detect partial backups, corruption, and drift early - before an incident forces a restore (an integrity-check sketch follows the list below).

Included
  • Checksums + expected file count/size checks
  • DB dump sanity checks + decompression validation
  • Detect coverage gaps such as missing attachments or private files
  • Flag suspicious deltas (size collapse, sudden growth, repeated failures)
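
A sketch of the kinds of checks listed above: checksum comparison, decompression validation, and a crude size-delta flag against the previous backup. The delta ratios are illustrative thresholds.

```python
import gzip
import hashlib
import os

def sha256_of(path):
    """Stream the file so large artifacts don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check_backup(path, expected_sha256, previous_size=0,
                 collapse_ratio=0.5, growth_ratio=3.0):
    """Basic integrity checks for one artifact; the delta ratios are illustrative."""
    size = os.path.getsize(path)
    result = {"checksum_ok": sha256_of(path) == expected_sha256,
              "decompresses": True,
              "suspicious_delta": False}

    # Decompression validation for gzip artifacts (e.g. compressed DB dumps).
    if path.endswith(".gz"):
        try:
            with gzip.open(path, "rb") as fh:
                while fh.read(1024 * 1024):
                    pass
        except (OSError, EOFError):
            result["decompresses"] = False

    # Flag size collapse or sudden growth versus the previous backup of the same scope.
    if previous_size:
        ratio = size / previous_size
        result["suspicious_delta"] = ratio < collapse_ratio or ratio > growth_ratio
    return result
```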
Defense against deletion and blast radius

Prevent “oops” and hostile deletion from wiping your restore points (an access-policy sketch follows the list below).

Included
  • Offsite copy support + freshness tracking
  • Optional immutability / write-once retention patterns (provider-dependent)
  • Least-privilege credentials and scoped access
  • Separation of duties: operators can restore, not delete (policy-driven)
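
One way the separation of duties could be expressed at the control plane, assuming a simple role-to-action mapping. The role names and the second-approver rule are illustrative, not the system's actual policy model.

```python
# Illustrative role-to-action mapping: operators can restore; deleting restore
# points is reserved for a separate admin role and also needs a second approver.
POLICY = {
    "operator": {"browse", "download", "restore", "verify"},
    "backup_admin": {"browse", "download", "restore", "verify", "delete"},
}

def authorize(role, action, second_approver=""):
    """Return True if the role may perform the action; deletions also need a second approver."""
    allowed = action in POLICY.get(role, set())
    if action == "delete":
        allowed = allowed and bool(second_approver)   # separation of duties for destructive ops
    return allowed

assert authorize("operator", "restore")
assert not authorize("operator", "delete")
assert not authorize("backup_admin", "delete")                      # blocked: no second approver
assert authorize("backup_admin", "delete", second_approver="cto")
```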
Metrics

Recovery signals operators actually need

These metrics detect drift early, prove readiness, and turn recovery objectives into measured numbers.

Metric
Backup success rate
Definition

Successful backups ÷ attempted backups, tracked per schedule and destination (local + offsite).

Why it matters

A schedule is not proof. Success rate tells you whether backups actually complete reliably, not just that they are configured.

Example operator threshold

Alert if success rate drops below 99% over 7 days, or if any critical schedule fails twice consecutively.
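
Evaluating that threshold could look like the sketch below, assuming each run is recorded with a success flag per schedule and destination.

```python
def backup_success_alerts(runs, min_rate=0.99, max_consecutive_failures=2):
    """Alert on low success rate or a trailing failure streak for one schedule+destination.

    `runs` is assumed to be the last 7 days of runs, ordered oldest -> newest,
    each a dict with a boolean "success" key.
    """
    alerts = []
    attempted = len(runs)
    successful = sum(1 for r in runs if r["success"])
    if attempted and successful / attempted < min_rate:
        alerts.append(f"success rate {successful}/{attempted} is below {min_rate:.0%}")

    streak = 0
    for r in runs:                              # length of the trailing failure streak
        streak = streak + 1 if not r["success"] else 0
    if streak >= max_consecutive_failures:
        alerts.append(f"{streak} consecutive failures")
    return alerts

print(backup_success_alerts([{"success": True}] * 5 + [{"success": False}] * 2))
# ['success rate 5/7 is below 99%', '2 consecutive failures']
```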

Metric
Time since last successful backup
Definition

Elapsed time since the most recent successful backup for each site (database + files).

Why it matters

RPO silently grows when backups fail. Operators need a simple “how stale is our safety net?” signal.

Example operator threshold

Alert if > 24h for daily backups, > 2h for hourly backups (tune by business criticality).
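
A small sketch of the staleness check, using the example limits above (tune them per policy and business criticality).

```python
from datetime import datetime, timedelta, timezone

# Example limits from above; tune per business criticality.
STALENESS_LIMITS = {"daily": timedelta(hours=24), "hourly": timedelta(hours=2)}

def backup_is_stale(last_success_at, cadence):
    """True if the most recent successful backup is older than the cadence's limit."""
    age = datetime.now(timezone.utc) - last_success_at
    return age > STALENESS_LIMITS[cadence]

last = datetime.now(timezone.utc) - timedelta(hours=30)
print(backup_is_stale(last, "daily"))   # True: the daily backup is 30h old
```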

Metric
Retention coverage
Definition

How many restore points exist across retention windows (daily/weekly/monthly) and where they live.

Why it matters

Most incidents are discovered late. Without retention discipline, the only restore points left are too recent (they already contain the damage) or already deleted.

Example operator threshold

Alert if weekly/monthly restore points fall below policy (e.g., < 4 weekly or < 6 monthly).
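
A sketch of the coverage check: count restore points per retention window and compare against policy. It assumes each point is tagged with the window that produced it; the policy numbers mirror the example threshold above.

```python
# Example policy mirroring the threshold above: at least 4 weekly and 6 monthly points.
RETENTION_POLICY = {"daily": 7, "weekly": 4, "monthly": 6}

def retention_gaps(restore_points, policy=RETENTION_POLICY):
    """Return retention windows where coverage falls below policy.

    Each restore point is assumed to be tagged with the window that produced it,
    e.g. {"id": "rp-031", "window": "weekly"}.
    """
    gaps = {}
    for window, minimum in policy.items():
        have = sum(1 for p in restore_points if p.get("window") == window)
        if have < minimum:
            gaps[window] = {"have": have, "need": minimum}
    return gaps

points = ([{"id": f"d{i}", "window": "daily"} for i in range(7)]
          + [{"id": f"w{i}", "window": "weekly"} for i in range(2)])
print(retention_gaps(points))
# {'weekly': {'have': 2, 'need': 4}, 'monthly': {'have': 0, 'need': 6}}
```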

Metric
Restore verification pass rate
Definition

Percentage of automated restore tests that complete successfully and pass their verification checks.

Why it matters

Backups are not safety. Restore success is. Verification is how you prove recovery readiness before incidents.

Example operator threshold

Alert if pass rate < 95% over 30 days or if any critical restore verification fails.

Metric
Time since last verified restore
Definition

Elapsed time since the last successful restore verification for each environment.

Why it matters

A backup pipeline can slowly rot (permissions, encryption keys, file paths, storage policies). Regular restores catch drift early.

Example operator threshold

Alert if > 7 days for production, > 30 days for lower environments (tune per policy).

Metric
Backup integrity / checksum validation
Definition

Integrity checks such as checksums, expected file counts/sizes, and DB dump sanity verification.

Why it matters

Corrupt backups are common: partial uploads, interrupted dumps, disk errors. Integrity checks detect corruption before you need the backup.

Metric
Recovery objectives (RPO/RTO) evidence
Definition

Measured RPO (data loss window) and RTO (restore time) based on real restore verification runs.

Why it matters

Stated RPO/RTO without evidence is hope. Operators and management need measured numbers to plan and to trust the system.

Metric
Offsite replication freshness
Definition

Lag between latest local backup and offsite copy availability (object storage / remote destination).

Why it matters

Offsite is your last line of defense against host failure, deletion, or ransomware. Tracking replication lag is how you know the offsite copy is actually current.

Example operator threshold

Alert if offsite lag exceeds 2× the backup interval or if offsite copy is missing for a recent restore point.
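
A sketch implementing the example threshold above. It assumes timestamps are recorded for the latest local backup and the latest offsite copy, plus a flag for whether the newest restore point has an offsite copy.

```python
from datetime import datetime, timedelta, timezone

def offsite_alerts(latest_local_at, latest_offsite_at, backup_interval,
                   latest_point_has_offsite_copy):
    """Evaluate offsite freshness; timestamps are timezone-aware datetimes,
    `backup_interval` is a timedelta."""
    alerts = []
    if latest_offsite_at is None:
        alerts.append("no offsite copy recorded")
    elif latest_local_at - latest_offsite_at > 2 * backup_interval:
        alerts.append("offsite lag exceeds 2x the backup interval")
    if not latest_point_has_offsite_copy:
        alerts.append("latest restore point has no offsite copy")
    return alerts

now = datetime.now(timezone.utc)
print(offsite_alerts(now, now - timedelta(hours=60), timedelta(hours=24), False))
# ['offsite lag exceeds 2x the backup interval', 'latest restore point has no offsite copy']
```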

Failure modes

What breaks in recovery - and how we prevent it

These failure patterns are common in ERPNext production. The system is designed to surface them early and make remediation safe.

Failure mode

Backups exist, but restores fail

Symptom: During an incident, restore attempts fail or the environment won’t boot.

Root cause: Backups were never verified; missing files, wrong paths, encryption key drift, or DB dumps that don’t replay cleanly.

How we detect it
  • Restore verification failures with exact failing step recorded
  • Checksum/size anomalies across recent backups
  • “Time since last verified restore” breaching policy
How we fix it safely
  • Run routine restore verification and track pass rates
  • Introduce step-based restore workflow with preflight checks
  • Store and validate required secrets/keys for restores (with rotation awareness)
Failure mode

Silent retention collapse

Symptom: You only have a few recent backups; older restore points are missing.

Root cause: Retention misconfigured, storage policy deletes objects, or cleanup scripts run incorrectly without visibility.

How we detect it
  • Retention coverage dashboard vs policy baseline
  • Missing weekly/monthly restore points flagged
  • Unexpected deletions recorded in evidence timeline (where detectable)
How we fix it safely
  • Enforce retention policy from a single control plane
  • Require explicit approvals for destructive retention changes
  • Add periodic audits that compare expected vs actual restore points
Failure mode

Partial backups and corruption

Symptom: Restore succeeds but data is missing (e.g., attachments) or the DB is incomplete/corrupted.

Root cause: Interrupted dumps, disk issues, partial file sync, or missing directories excluded by mistake.

How we detect it
  • Integrity checks fail (checksum mismatch, missing files)
  • Sharp size deltas compared to baseline trends
  • Verification checks fail on file presence / DB sanity probes
How we fix it safely
  • Add integrity validation to every backup
  • Make backup scope explicit (DB + files + private files + attachments)
  • Quarantine suspicious restore points and alert operators
Failure mode

Offsite copy is stale or missing

Symptom: Primary host fails; offsite restore point isn’t available or is outdated.

Root cause: Replication lag, credential drift, provider policy changes, or failed uploads not visible to operators.

How we detect it
  • Offsite freshness lag metric breaches threshold
  • Destination-level failure clustering and retry churn
  • Missing offsite copy for latest restore points
How we fix it safely
  • Track offsite status per restore point (not per day)
  • Retry uploads with backoff and evidence trails
  • Alert on missing offsite copies within defined time windows
Technical design

Designed for safety under pressure

Recovery is a crisis workflow. The design focuses on correctness, least privilege, and auditable outcomes.

Verification-first recovery

Routine restore tests convert hope into proof and expose drift early.

  • Scheduled restore verification
  • Measured restore time and recovered point
  • Clear failure step + remediation hints
Integrity and scope discipline

Backups must be complete, consistent, and validated. Scope is explicit and verifiable.

  • Checksums + sanity checks
  • Explicit coverage (DB + files + attachments)
  • Anomaly detection on size and file counts
Guardrails + auditability

Restores are high-risk actions. Access is controlled and every action is recorded.

  • Role-gated restore actions
  • Confirmation + safety prompts
  • Evidence trail for postmortems/compliance
What operators can do safely

Restore workflows you can trust

Incidents are not the time for improvised commands. The system provides controlled restore workflows with preflight checks, evidence capture, and predictable outcomes (a guarded-workflow sketch follows the list below).

  • Browse restore points
  • Validate integrity
  • Restore to target
  • Verify environment boot
  • Export evidence
  • Policy enforcement
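
A sketch of the guarded flow those steps imply: preflight checks, an explicit confirmation, and an evidence record. The individual checks and the confirmation convention are placeholders for your environment.

```python
def guarded_restore(restore_point, target, confirm_phrase, audit_log):
    """Run a restore only after preflight checks pass and the operator confirms the target.

    The specific preflight checks and the confirmation phrase are illustrative.
    """
    preflight = {
        "integrity_verified": restore_point.get("checksum_ok", False),
        "target_reachable": True,          # placeholder: probe the target environment
        "sufficient_disk_space": True,     # placeholder: check free space on the target
    }
    if not all(preflight.values()):
        audit_log.append({"action": "restore", "outcome": "aborted-at-preflight",
                          "failed_checks": [k for k, ok in preflight.items() if not ok]})
        return False

    # Explicit confirmation: the operator must type the target name to proceed.
    if confirm_phrase != f"restore to {target}":
        audit_log.append({"action": "restore", "outcome": "aborted-no-confirmation"})
        return False

    # Placeholder for the actual restore and the post-restore boot verification.
    audit_log.append({"action": "restore", "restore_point": restore_point["id"],
                      "target": target, "outcome": "completed"})
    return True

log = []
ok = guarded_restore({"id": "rp-001", "checksum_ok": True}, "staging",
                     "restore to staging", log)
print(ok, log[-1]["outcome"])   # True completed
```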
Next step

Want recovery you can prove before an incident?

We’ll review your backup scope, retention discipline, restore workflow, and verification evidence - then propose a practical plan to make recovery predictable and auditable.