Backup & Restore System

Backups are not safety - restores are. This capability turns backup pipelines into proven recovery readiness: automated backups, retention discipline, restore workflows, and routine verification with evidence.

Built for production: evidence, guardrails, and measured recovery objectives - not “we think it’s fine.”

Core promise: Proven recovery - if you can’t restore, you don’t have backups.
Primary surface: Restore points - browse, validate, restore, with audits.
Operator outcome: Predictability - measured RPO/RTO, fewer surprises during incidents.
Problem

Most teams have backups - and still fail recovery

The failure is rarely “no backups.” It’s missing verification, weak retention, unclear scope, and no operator control when pressure hits.

Backups without restore proof

Backups run for months - until the first restore attempt fails. Then you learn the pipeline has been rotting silently.

  • No routine restore tests
  • Missing attachments/private files
  • Secret/key drift breaks restores
Retention is inconsistent

Policies are informal and enforcement is weak. When incidents are discovered late, the restore points you need are missing.

  • Weekly/monthly points missing
  • Storage policies delete silently
  • No evidence trail during incidents
Operators lack safe controls

Under pressure, teams resort to ad-hoc scripts and SSH. Risk rises: accidental deletion, restores to the wrong target, and no audit trail.

  • No guarded restore workflows
  • Manual steps are error-prone
  • Hard to prove what happened
How the system solves it

Automate, verify, and prove - before incidents

The system enforces backup policy, validates integrity, runs routine restore verification, and provides a controlled operator UI with evidence trails.

Backup automation with policy enforcement

Automate backups for database and files, enforce retention policy, and maintain evidence of execution (a minimal policy sketch follows the list below).

Included
  • Schedules: daily / hourly / custom windows
  • DB + files coverage with clear scope and exclusions
  • Retention rules per environment and per destination
  • Execution evidence: timestamps, sizes, checksums, logs
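
A minimal sketch of how such a policy and its execution evidence might be represented. The field names and retention counts below are illustrative assumptions, not a fixed schema.

```python
from dataclasses import dataclass, field

@dataclass
class BackupPolicy:
    """Illustrative policy: schedule, scope, and retention per destination."""
    site: str
    schedule_cron: str                   # e.g. "0 2 * * *" for a daily 02:00 window
    include_files: bool = True           # public + private files, not only the database
    exclude_paths: tuple = ()            # explicit exclusions keep scope auditable
    retention: dict = field(default_factory=lambda: {"daily": 7, "weekly": 4, "monthly": 6})
    destinations: tuple = ("local", "offsite")

@dataclass
class BackupEvidence:
    """Execution evidence recorded for every run."""
    site: str
    started_at: str          # ISO-8601 timestamps
    finished_at: str
    size_bytes: int
    sha256: str              # checksum of the produced artifact
    destination: str
    success: bool
    log_path: str

policy = BackupPolicy(site="erp.example.com", schedule_cron="0 2 * * *")
print(policy.retention)      # {'daily': 7, 'weekly': 4, 'monthly': 6}
```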
Operator UI for browse, download, and restore

Give operators a controlled interface to view restore points, validate them, and perform restores safely (an example audit record follows the list below).

Included
  • Browse restore points with metadata and health state
  • Download artifacts when needed (role-gated)
  • Restore workflows with confirmation + guardrails
  • Per-restore audit trail (who/what/when/why)
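
As a sketch of the per-restore audit trail, a record could capture who/what/when/why like this. The record shape and the JSON-lines file are assumptions for illustration, not the system's actual storage format.

```python
import getpass
import json
from datetime import datetime, timezone

def record_restore_audit(restore_point_id, target, reason, approved_by, outcome,
                         path="restore_audit.jsonl"):
    """Append a who/what/when/why record for a restore action (illustrative shape)."""
    entry = {
        "who": getpass.getuser(),
        "what": {"restore_point": restore_point_id, "target": target},
        "when": datetime.now(timezone.utc).isoformat(),
        "why": reason,
        "approved_by": approved_by,
        "outcome": outcome,              # e.g. "completed", "aborted-at-preflight"
    }
    with open(path, "a") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry

record_restore_audit("rp-2024-06-01-0200", target="staging",
                     reason="incident INC-123 data check",
                     approved_by="ops-lead", outcome="completed")
```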
Restore verification as routine

Automate restore tests in a controlled target to prove that backups are usable and complete (a verification-run sketch follows the list below).

Included
  • Automated test restores on a schedule
  • Verification checks: DB opens, schema sanity, file presence, app boot checks
  • Measured restore time (RTO) and recovered point (RPO)
  • Evidence reports exportable for compliance and postmortems
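
A sketch of what a scheduled verification run could look like. The restore step and the probes are placeholders for your own tooling; the point is the measured RTO/RPO and the pass/fail evidence record.

```python
import time
from datetime import datetime, timezone

def verify_restore(restore_point, target_env):
    """Restore a backup into a scratch target and record timings + check results.

    `restore_point` is assumed to carry an id, the artifact path, and the ISO
    timestamp the backup was taken; the restore and probe steps are placeholders.
    """
    started = time.monotonic()

    # Placeholder restore step: call your real restore command here and capture its result.
    restore_ok = True

    # Placeholder verification probes: DB opens, schema sanity, files present, app boots.
    checks = {
        "restore_completed": restore_ok,
        "db_opens": True,
        "schema_sane": True,
        "files_present": True,
        "app_boots": True,
    }

    rto_seconds = time.monotonic() - started                              # measured restore time
    taken_at = datetime.fromisoformat(restore_point["taken_at"])
    rpo_seconds = (datetime.now(timezone.utc) - taken_at).total_seconds() # data-loss window

    return {
        "restore_point": restore_point["id"],
        "target": target_env,
        "passed": all(checks.values()),
        "checks": checks,
        "measured_rto_s": round(rto_seconds, 1),
        "measured_rpo_s": round(rpo_seconds, 1),
    }

report = verify_restore(
    {"id": "rp-001", "artifact": "/backups/rp-001.tar.gz",
     "taken_at": "2024-06-01T02:00:00+00:00"},
    target_env="restore-test",
)
print(report["passed"], report["measured_rto_s"], report["measured_rpo_s"])
```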
Integrity and corruption detection

Detect partial backups, corruption, and drift early - before an incident forces a restore (an integrity-check sketch follows the list below).

Included
  • Checksums + expected file count/size checks
  • DB dump sanity checks + decompression validation
  • Detect coverage gaps such as missing attachments or private files
  • Flag suspicious deltas (size collapse, sudden growth, repeated failures)
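
A sketch of the kinds of checks listed above: checksum comparison, decompression validation, and a crude size-delta flag against the previous backup. The delta ratios are illustrative thresholds.

```python
import gzip
import hashlib
import os

def sha256_of(path):
    """Stream the file so large artifacts don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check_backup(path, expected_sha256, previous_size=0,
                 collapse_ratio=0.5, growth_ratio=3.0):
    """Basic integrity checks for one artifact; the delta ratios are illustrative."""
    size = os.path.getsize(path)
    result = {"checksum_ok": sha256_of(path) == expected_sha256,
              "decompresses": True,
              "suspicious_delta": False}

    # Decompression validation for gzip artifacts (e.g. compressed DB dumps).
    if path.endswith(".gz"):
        try:
            with gzip.open(path, "rb") as fh:
                while fh.read(1024 * 1024):
                    pass
        except (OSError, EOFError):
            result["decompresses"] = False

    # Flag size collapse or sudden growth versus the previous backup of the same scope.
    if previous_size:
        ratio = size / previous_size
        result["suspicious_delta"] = ratio < collapse_ratio or ratio > growth_ratio
    return result
```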
Defense against deletion and blast radius

Prevent “oops” and hostile deletion from wiping your restore points (an access-policy sketch follows the list below).

Included
  • Offsite copy support + freshness tracking
  • Optional immutability / write-once retention patterns (provider-dependent)
  • Least-privilege credentials and scoped access
  • Separation of duties: operators can restore, not delete (policy-driven)
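
One way the separation of duties could be expressed at the control plane, assuming a simple role-to-action mapping. The role names and the second-approver rule are illustrative, not the system's actual policy model.

```python
# Illustrative role-to-action mapping: operators can restore; deleting restore
# points is reserved for a separate admin role and also needs a second approver.
POLICY = {
    "operator": {"browse", "download", "restore", "verify"},
    "backup_admin": {"browse", "download", "restore", "verify", "delete"},
}

def authorize(role, action, second_approver=""):
    """Return True if the role may perform the action; deletions also need a second approver."""
    allowed = action in POLICY.get(role, set())
    if action == "delete":
        allowed = allowed and bool(second_approver)   # separation of duties for destructive ops
    return allowed

assert authorize("operator", "restore")
assert not authorize("operator", "delete")
assert not authorize("backup_admin", "delete")                      # blocked: no second approver
assert authorize("backup_admin", "delete", second_approver="cto")
```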
Metrics

Recovery signals operators actually need

These metrics detect drift early, prove readiness, and turn recovery objectives into measured numbers.

Metric
Backup success rate
Definition

Successful backups ÷ attempted backups, tracked per schedule and destination (local + offsite).

Why it matters

A schedule is not proof. Success rate tells you whether backups actually complete reliably, not just that they are configured.

Example operator threshold

Alert if success rate drops below 99% over 7 days, or if any critical schedule fails twice consecutively.
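
Evaluating that threshold could look like the sketch below, assuming each run is recorded with a success flag per schedule and destination.

```python
def backup_success_alerts(runs, min_rate=0.99, max_consecutive_failures=2):
    """Alert on low success rate or a trailing failure streak for one schedule+destination.

    `runs` is assumed to be the last 7 days of runs, ordered oldest -> newest,
    each a dict with a boolean "success" key.
    """
    alerts = []
    attempted = len(runs)
    successful = sum(1 for r in runs if r["success"])
    if attempted and successful / attempted < min_rate:
        alerts.append(f"success rate {successful}/{attempted} is below {min_rate:.0%}")

    streak = 0
    for r in runs:                              # length of the trailing failure streak
        streak = streak + 1 if not r["success"] else 0
    if streak >= max_consecutive_failures:
        alerts.append(f"{streak} consecutive failures")
    return alerts

print(backup_success_alerts([{"success": True}] * 5 + [{"success": False}] * 2))
# ['success rate 5/7 is below 99%', '2 consecutive failures']
```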

Metric
Time since last successful backup
Definition

Elapsed time since the most recent successful backup for each site (database + files).

Why it matters

RPO silently grows when backups fail. Operators need a simple “how stale is our safety net?” signal.

Example operator threshold

Alert if > 24h for daily backups, > 2h for hourly backups (tune by business criticality).
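
A small sketch of the staleness check, using the example limits above (tune them per policy and business criticality).

```python
from datetime import datetime, timedelta, timezone

# Example limits from above; tune per business criticality.
STALENESS_LIMITS = {"daily": timedelta(hours=24), "hourly": timedelta(hours=2)}

def backup_is_stale(last_success_at, cadence):
    """True if the most recent successful backup is older than the cadence's limit."""
    age = datetime.now(timezone.utc) - last_success_at
    return age > STALENESS_LIMITS[cadence]

last = datetime.now(timezone.utc) - timedelta(hours=30)
print(backup_is_stale(last, "daily"))   # True: the daily backup is 30h old
```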

Metric
Retention coverage
Definition

How many restore points exist across retention windows (daily/weekly/monthly) and where they live.

Why it matters

Most incidents are discovered late. Without retention discipline, the only restore points left are too recent (they already contain the damage) or already deleted.

Example operator threshold

Alert if weekly/monthly restore points fall below policy (e.g., < 4 weekly or < 6 monthly).
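
A sketch of the coverage check: count restore points per retention window and compare against policy. It assumes each point is tagged with the window that produced it; the policy numbers mirror the example threshold above.

```python
# Example policy mirroring the threshold above: at least 4 weekly and 6 monthly points.
RETENTION_POLICY = {"daily": 7, "weekly": 4, "monthly": 6}

def retention_gaps(restore_points, policy=RETENTION_POLICY):
    """Return retention windows where coverage falls below policy.

    Each restore point is assumed to be tagged with the window that produced it,
    e.g. {"id": "rp-031", "window": "weekly"}.
    """
    gaps = {}
    for window, minimum in policy.items():
        have = sum(1 for p in restore_points if p.get("window") == window)
        if have < minimum:
            gaps[window] = {"have": have, "need": minimum}
    return gaps

points = ([{"id": f"d{i}", "window": "daily"} for i in range(7)]
          + [{"id": f"w{i}", "window": "weekly"} for i in range(2)])
print(retention_gaps(points))
# {'weekly': {'have': 2, 'need': 4}, 'monthly': {'have': 0, 'need': 6}}
```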

Metric
Restore verification pass rate
Definition

Percentage of automated restore tests that complete successfully and pass their verification checks.

Why it matters

Backups are not safety. Restore success is. Verification is how you prove recovery readiness before incidents.

Example operator threshold

Alert if pass rate < 95% over 30 days or if any critical restore verification fails.

Metric
Time since last verified restore
Definition

Elapsed time since the last successful restore verification for each environment.

Why it matters

A backup pipeline can slowly rot (permissions, encryption keys, file paths, storage policies). Regular restores catch drift early.

Example operator threshold

Alert if > 7 days for production, > 30 days for lower environments (tune per policy).

Metric
Backup integrity / checksum validation
Definition

Integrity checks such as checksums, expected file counts/sizes, and DB dump sanity verification.

Why it matters

Corrupt backups are common: partial uploads, interrupted dumps, disk errors. Integrity checks detect corruption before you need the backup.

Metric
Recovery objectives (RPO/RTO) evidence
Definition

Measured RPO (data loss window) and RTO (restore time) based on real restore verification runs.

Why it matters

Stated RPO/RTO without evidence is hope. Operators and management need measured numbers to plan and to trust the system.

Metric
Offsite replication freshness
Definition

Lag between latest local backup and offsite copy availability (object storage / remote destination).

Why it matters

Offsite is your last line of defense against host failure, deletion, or ransomware. Tracking replication lag is how you know the offsite copy is actually current.

Example operator threshold

Alert if offsite lag exceeds 2× the backup interval or if offsite copy is missing for a recent restore point.
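
A sketch implementing the example threshold above. It assumes timestamps are recorded for the latest local backup and the latest offsite copy, plus a flag for whether the newest restore point has an offsite copy.

```python
from datetime import datetime, timedelta, timezone

def offsite_alerts(latest_local_at, latest_offsite_at, backup_interval,
                   latest_point_has_offsite_copy):
    """Evaluate offsite freshness; timestamps are timezone-aware datetimes,
    `backup_interval` is a timedelta."""
    alerts = []
    if latest_offsite_at is None:
        alerts.append("no offsite copy recorded")
    elif latest_local_at - latest_offsite_at > 2 * backup_interval:
        alerts.append("offsite lag exceeds 2x the backup interval")
    if not latest_point_has_offsite_copy:
        alerts.append("latest restore point has no offsite copy")
    return alerts

now = datetime.now(timezone.utc)
print(offsite_alerts(now, now - timedelta(hours=60), timedelta(hours=24), False))
# ['offsite lag exceeds 2x the backup interval', 'latest restore point has no offsite copy']
```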

Failure modes

What breaks in recovery - and how we prevent it

These failure patterns are common in ERPNext production. The system is designed to surface them early and make remediation safe.

Failure mode

Backups exist, but restores fail

Symptom: During an incident, restore attempts fail or the environment won’t boot.

Root cause: Backups were never verified; missing files, wrong paths, encryption key drift, or DB dumps that don’t replay cleanly.

How we detect it
  • Restore verification failures with exact failing step recorded
  • Checksum/size anomalies across recent backups
  • “Time since last verified restore” breaching policy
How we fix it safely
  • Run routine restore verification and track pass rates
  • Introduce step-based restore workflow with preflight checks
  • Store and validate required secrets/keys for restores (with rotation awareness)
Failure mode

Silent retention collapse

Symptom: You only have a few recent backups; older restore points are missing.

Root cause: Retention misconfigured, storage policy deletes objects, or cleanup scripts run incorrectly without visibility.

How we detect it
  • Retention coverage dashboard vs policy baseline
  • Missing weekly/monthly restore points flagged
  • Unexpected deletions recorded in evidence timeline (where detectable)
How we fix it safely
  • Enforce retention policy from a single control plane
  • Require explicit approvals for destructive retention changes
  • Add periodic audits that compare expected vs actual restore points
Failure mode

Partial backups and corruption

Symptom: Restore succeeds but data is missing (e.g., attachments) or the DB is incomplete/corrupted.

Root cause: Interrupted dumps, disk issues, partial file sync, or missing directories excluded by mistake.

How we detect it
  • Integrity checks fail (checksum mismatch, missing files)
  • Sharp size deltas compared to baseline trends
  • Verification checks fail on file presence / DB sanity probes
How we fix it safely
  • Add integrity validation to every backup
  • Make backup scope explicit (DB + files + private files + attachments)
  • Quarantine suspicious restore points and alert operators
Failure mode

Offsite copy is stale or missing

Symptom: Primary host fails; offsite restore point isn’t available or is outdated.

Root cause: Replication lag, credential drift, provider policy changes, or failed uploads not visible to operators.

How we detect it
  • Offsite freshness lag metric breaches threshold
  • Destination-level failure clustering and retry churn
  • Missing offsite copy for latest restore points
How we fix it safely
  • Track offsite status per restore point (not per day)
  • Retry uploads with backoff and evidence trails
  • Alert on missing offsite copies within defined time windows
Technical design

Designed for safety under pressure

Recovery is a crisis workflow. The design focuses on correctness, least privilege, and auditable outcomes.

Verification-first recovery

Routine restore tests convert hope into proof and expose drift early.

  • Scheduled restore verification
  • Measured restore time and recovered point
  • Clear failure step + remediation hints
Integrity and scope discipline

Backups must be complete, consistent, and validated. Scope is explicit and verifiable.

  • Checksums + sanity checks
  • Explicit coverage (DB + files + attachments)
  • Anomaly detection on size and file counts
Guardrails + auditability

Restores are high-risk actions. Access is controlled and every action is recorded.

  • Role-gated restore actions
  • Confirmation + safety prompts
  • Evidence trail for postmortems/compliance
What operators can do safely

Restore workflows you can trust

Incidents are not the time for improvised commands. The system provides controlled restore workflows with preflight checks, evidence capture, and predictable outcomes (a guarded-workflow sketch follows the list below).

  • Browse restore points
  • Validate integrity
  • Restore to target
  • Verify environment boot
  • Export evidence
  • Policy enforcement
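
A sketch of the guarded flow those steps imply: preflight checks, an explicit confirmation, and an evidence record. The individual checks and the confirmation convention are placeholders for your environment.

```python
def guarded_restore(restore_point, target, confirm_phrase, audit_log):
    """Run a restore only after preflight checks pass and the operator confirms the target.

    The specific preflight checks and the confirmation phrase are illustrative.
    """
    preflight = {
        "integrity_verified": restore_point.get("checksum_ok", False),
        "target_reachable": True,          # placeholder: probe the target environment
        "sufficient_disk_space": True,     # placeholder: check free space on the target
    }
    if not all(preflight.values()):
        audit_log.append({"action": "restore", "outcome": "aborted-at-preflight",
                          "failed_checks": [k for k, ok in preflight.items() if not ok]})
        return False

    # Explicit confirmation: the operator must type the target name to proceed.
    if confirm_phrase != f"restore to {target}":
        audit_log.append({"action": "restore", "outcome": "aborted-no-confirmation"})
        return False

    # Placeholder for the actual restore and the post-restore boot verification.
    audit_log.append({"action": "restore", "restore_point": restore_point["id"],
                      "target": target, "outcome": "completed"})
    return True

log = []
ok = guarded_restore({"id": "rp-001", "checksum_ok": True}, "staging",
                     "restore to staging", log)
print(ok, log[-1]["outcome"])   # True completed
```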
Next step

Want recovery you can prove before an incident?

We’ll review your backup scope, retention discipline, restore workflow, and verification evidence - then propose a practical plan to make recovery predictable and auditable.