Backup & Restore System
Backups are not safety - restores are. This capability turns backup pipelines into proven recovery readiness: automated backups, retention discipline, restore workflows, and routine verification with evidence.
Built for production: evidence, guardrails, and measured recovery objectives - not “we think it’s fine.”
Most teams have backups - and still fail recovery
The failure is rarely “no backups.” It’s missing verification, weak retention, unclear scope, and no operator control when pressure hits.
Backups run for months - until the first restore attempt fails. Then you learn the pipeline has been rotting silently.
- No routine restore tests
- Missing attachments/private files
- Secret/key drift breaks restores
Policies are informal and enforcement is weak. When incidents are discovered late, the restore points you need are missing.
- Weekly/monthly points missing
- Storage policies delete silently
- No evidence trail during incidents
Under pressure, teams resort to ad-hoc scripts and SSH. Risk rises: accidental deletion, restores to the wrong target, and no audit trail.
- No guarded restore workflows
- Manual steps are error-prone
- Hard to prove what happened
Automate, verify, and prove - before incidents
The system enforces backup policy, validates integrity, runs routine restore verification, and provides a controlled operator UI with evidence trails.
Automate backups for database and files, enforce retention policy, and maintain evidence of execution (see the sketch after this list).
- Schedules: daily / hourly / custom windows
- DB + files coverage with clear scope and exclusions
- Retention rules per environment and per destination
- Execution evidence: timestamps, sizes, checksums, logs
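To make "execution evidence" concrete, here is a minimal Python sketch that captures timestamp, size, and checksum for one finished backup artifact and appends it to an evidence log. The `BackupEvidence` and `record_evidence` names are illustrative assumptions, not part of any particular tool.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class BackupEvidence:
    """Execution evidence for one backup artifact (illustrative)."""
    artifact: str      # path of the DB dump or files archive
    created_at: str    # ISO-8601 timestamp of evidence capture
    size_bytes: int    # artifact size on disk
    sha256: str        # content checksum, reused later for integrity checks


def record_evidence(artifact_path: Path, evidence_log: Path) -> BackupEvidence:
    # Hash in chunks so large dumps do not exhaust memory.
    digest = hashlib.sha256()
    with artifact_path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)

    evidence = BackupEvidence(
        artifact=str(artifact_path),
        created_at=datetime.now(timezone.utc).isoformat(),
        size_bytes=artifact_path.stat().st_size,
        sha256=digest.hexdigest(),
    )
    # One JSON line per artifact; this append-only file is the evidence trail.
    with evidence_log.open("a") as log:
        log.write(json.dumps(asdict(evidence)) + "\n")
    return evidence
```

Scheduling and retention stay in the orchestrator; the point here is simply that every run leaves a verifiable record.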
Give operators a controlled interface to view restore points, validate them, and perform restores safely (see the sketch after this list).
- Browse restore points with metadata and health state
- Download artifacts when needed (role-gated)
- Restore workflows with confirmation + guardrails
- Per-restore audit trail (who/what/when/why)
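As a rough sketch of the "who/what/when/why" audit trail and the confirmation guardrail, the snippet below captures both before any restore command runs. `RestoreAudit` and `confirm_restore` are hypothetical names; a production workflow would add role checks and preflight validation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class RestoreAudit:
    """One audit entry per restore action (who/what/when/why)."""
    operator: str        # who triggered the restore
    restore_point: str   # what was restored (artifact id or path)
    target: str          # which site/environment received the restore
    reason: str          # justification captured at confirmation time
    started_at: str      # ISO-8601 timestamp


def confirm_restore(operator: str, restore_point: str, target: str,
                    reason: str, typed_confirmation: str) -> RestoreAudit:
    # Guardrail: the operator must retype the target name before anything runs.
    if typed_confirmation != target:
        raise PermissionError("Confirmation text does not match restore target")
    if not reason.strip():
        raise ValueError("A reason is required for the audit trail")
    return RestoreAudit(
        operator=operator,
        restore_point=restore_point,
        target=target,
        reason=reason,
        started_at=datetime.now(timezone.utc).isoformat(),
    )
```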
Automate restore tests in a controlled target to prove that backups are usable and complete (see the sketch after this list).
- Automated test restores on a schedule
- Verification checks: DB opens, schema sanity, file presence, app boot checks
- Measured restore time (RTO) and recovered point (RPO)
- Evidence reports exportable for compliance and postmortems
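A minimal sketch of what a verification run can record, assuming a restored copy is already available on disk. Only a file-presence probe is shown; DB-open, schema-sanity, and app-boot checks would slot in alongside it, and the names used here are assumptions.

```python
from datetime import datetime, timezone
from pathlib import Path


def check_files_present(restored_root: Path, expected: list[str]) -> bool:
    """File-presence probe: every expected relative path must exist in the restored copy."""
    return all((restored_root / rel).exists() for rel in expected)


def run_verification(restored_root: Path, expected_files: list[str]) -> dict:
    """Run the checks against a restored copy and return an evidence record."""
    started = datetime.now(timezone.utc)
    results = {
        # DB-open, schema-sanity, and app-boot probes would be added here.
        "files_present": check_files_present(restored_root, expected_files),
    }
    finished = datetime.now(timezone.utc)
    return {
        "started_at": started.isoformat(),
        "duration_seconds": (finished - started).total_seconds(),  # feeds measured RTO
        "checks": results,
        "passed": all(results.values()),
    }
```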
Detect partial backups, corruption, and drift early - before an incident forces a restore (see the sketch after this list).
- Checksums + expected file count/size checks
- DB dump sanity checks + decompression validation
- Detect coverage gaps for attachments and private files
- Flag suspicious deltas (size collapse, sudden growth, repeated failures)
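Several of these checks are plain Python. The sketch below re-hashes an artifact against its recorded checksum, streams the gzip'd dump end to end, and flags a size collapse against a recent baseline; the 50% threshold is an assumption to tune per site.

```python
import gzip
import hashlib
import zlib
from pathlib import Path


def checksum_matches(artifact: Path, expected_sha256: str) -> bool:
    """Re-hash the artifact and compare with the checksum recorded at backup time."""
    digest = hashlib.sha256()
    with artifact.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256


def dump_decompresses(dump_gz: Path) -> bool:
    """Stream the gzip'd dump end to end; truncated or corrupt uploads fail here."""
    try:
        with gzip.open(dump_gz, "rb") as fh:
            while fh.read(1024 * 1024):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        return False


def size_collapse(current_bytes: int, recent_sizes: list[int], ratio: float = 0.5) -> bool:
    """Flag a suspicious delta: current artifact under `ratio` of the recent average size."""
    if not recent_sizes:
        return False
    return current_bytes < ratio * (sum(recent_sizes) / len(recent_sizes))
```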
Prevent both accidental (“oops”) and hostile deletion from wiping your restore points.
- Offsite copy support + freshness tracking
- Optional immutability / write-once retention patterns (provider-dependent)
- Least-privilege credentials and scoped access
- Separation of duties: operators can restore but not delete (policy-driven)
Recovery signals operators actually need
These metrics detect drift early, prove readiness, and turn recovery objectives into measured numbers.
Backup success rate
Successful backups ÷ attempted backups, tracked per schedule and destination (local + offsite).
A schedule is not proof. Success rate tells you whether backups are actually happening reliably, not just configured.
Alert if success rate drops below 99% over 7 days, or if any critical schedule fails twice consecutively.
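The rule is simple arithmetic. A sketch, treating the 99% threshold and the two-consecutive-failures condition as tunable assumptions:

```python
def backup_success_rate(outcomes: list[bool]) -> float:
    """Successful backups ÷ attempted backups over the evaluation window."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def should_alert(outcomes_7d: list[bool], min_rate: float = 0.99) -> bool:
    """Alert when the 7-day rate drops below threshold or the last two runs both failed."""
    two_consecutive_failures = len(outcomes_7d) >= 2 and not any(outcomes_7d[-2:])
    return backup_success_rate(outcomes_7d) < min_rate or two_consecutive_failures
```

Evaluate it per schedule and per destination so a healthy local target cannot mask a failing offsite one.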
Time since last successful backup
Elapsed time since the most recent successful backup for each site (database + files).
RPO silently grows when backups fail. Operators need a simple “how stale is our safety net?” signal.
Alert if > 24h for daily backups, > 2h for hourly backups (tune by business criticality).
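The staleness probe is a one-line comparison per schedule; the thresholds below mirror the alert rule above and are assumptions to tune per site.

```python
from datetime import datetime, timedelta, timezone

# Assumed thresholds, matching the alert rule above; tune per business criticality.
MAX_AGE = {"daily": timedelta(hours=24), "hourly": timedelta(hours=2)}


def backup_is_stale(last_success: datetime, schedule: str) -> bool:
    """True when the most recent successful backup is older than the schedule allows."""
    return datetime.now(timezone.utc) - last_success > MAX_AGE[schedule]
```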
Retention coverage
How many restore points exist across retention windows (daily/weekly/monthly) and where they live.
Most incidents are discovered late. Without retention discipline, the only restore points left are either too recent (they already contain the damage) or already deleted.
Alert if weekly/monthly restore points fall below policy (e.g., < 4 weekly or < 6 monthly).
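Checking coverage against policy is a small diff over counts per retention window. A sketch with an assumed policy shape:

```python
def retention_shortfall(actual: dict[str, int], policy: dict[str, int]) -> dict[str, int]:
    """Return how many restore points each window is missing versus policy."""
    return {
        window: required - actual.get(window, 0)
        for window, required in policy.items()
        if actual.get(window, 0) < required
    }


# Example: policy requires 7 daily / 4 weekly / 6 monthly points.
# retention_shortfall({"daily": 7, "weekly": 2, "monthly": 6},
#                     {"daily": 7, "weekly": 4, "monthly": 6})  -> {"weekly": 2}
```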
Restore verification pass rate
Percentage of automated restore tests that complete successfully (and meet verification checks).
Backups are not safety. Restore success is. Verification is how you prove recovery readiness before incidents.
Alert if pass rate < 95% over 30 days or if any critical restore verification fails.
Time since last verified restore
Elapsed time since the last successful restore verification for each environment.
A backup pipeline can slowly rot (permissions, encryption keys, file paths, storage policies). Regular restores catch drift early.
Alert if > 7 days for production, > 30 days for lower environments (tune per policy).
Integrity check results
Integrity checks such as checksums, expected file counts/sizes, and DB dump sanity verification.
Corrupt backups are common: partial uploads, interrupted dumps, disk errors. Integrity checks detect corruption before you need the backup.
Measured RPO / RTO
Measured RPO (data loss window) and RTO (restore time) based on real restore verification runs.
Stated RPO/RTO without evidence is hope. Operators and management need measured numbers to plan and to trust the system.
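Both numbers fall straight out of verification evidence: RTO is the wall-clock duration of a passing test restore, and worst-case RPO for a schedule is the largest gap between consecutive successful backups. A minimal sketch (function names are illustrative):

```python
from datetime import datetime, timedelta


def measured_rto(restore_started: datetime, restore_verified: datetime) -> timedelta:
    """Wall-clock time from starting the test restore to passing verification."""
    return restore_verified - restore_started


def worst_case_rpo(successful_backup_times: list[datetime]) -> timedelta:
    """Largest gap between consecutive successful backups = maximum data-loss window."""
    ordered = sorted(successful_backup_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return max(gaps) if gaps else timedelta(0)
```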
Offsite replication lag
Lag between latest local backup and offsite copy availability (object storage / remote destination).
Offsite is your last line of defense against host failure, deletion, or ransomware. Replication lag is how you know it’s actually current.
Alert if offsite lag exceeds 2× the backup interval or if offsite copy is missing for a recent restore point.
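The lag rule above reduces to a timestamp comparison per restore point. A sketch, assuming the orchestrator records when the local artifact was created and when its offsite copy became available:

```python
from datetime import datetime, timedelta


def offsite_lag_alert(local_backup_at: datetime,
                      offsite_copy_at: datetime | None,
                      backup_interval: timedelta) -> bool:
    """Alert when the offsite copy is missing or lags more than 2x the backup interval."""
    if offsite_copy_at is None:
        return True  # no offsite copy for this restore point yet
    return offsite_copy_at - local_backup_at > 2 * backup_interval
```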
What breaks in recovery - and how we prevent it
These failure patterns are common in ERPNext production. The system is designed to surface them early and make remediation safe.
Backups exist, but restores fail
Symptom: During an incident, restore attempts fail or the environment won’t boot.
Root cause: Backups were never verified; missing files, wrong paths, encryption key drift, or DB dumps that don’t replay cleanly.
How we detect it:
- Restore verification failures with exact failing step recorded
- Checksum/size anomalies across recent backups
- “Time since last verified restore” breaching policy
How we prevent it:
- Run routine restore verification and track pass rates
- Introduce step-based restore workflow with preflight checks
- Store and validate required secrets/keys for restores (with rotation awareness)
Silent retention collapse
Symptom: You only have a few recent backups; older restore points are missing.
Root cause: Retention misconfigured, storage policy deletes objects, or cleanup scripts run incorrectly without visibility.
How we detect it:
- Retention coverage dashboard vs policy baseline
- Missing weekly/monthly restore points flagged
- Unexpected deletions recorded in evidence timeline (where detectable)
How we prevent it:
- Enforce retention policy from a single control plane
- Require explicit approvals for destructive retention changes
- Add periodic audits that compare expected vs actual restore points (see the sketch below)
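The periodic audit can be as small as a set difference between the restore points policy says should exist and what the destination actually holds. A sketch with illustrative identifiers:

```python
def audit_restore_points(expected_ids: set[str], actual_ids: set[str]) -> dict[str, set[str]]:
    """Diff policy-expected restore points against what the destination actually holds."""
    return {
        "missing": expected_ids - actual_ids,      # deleted or never created: investigate
        "unexpected": actual_ids - expected_ids,   # present but unaccounted for: reconcile
    }
```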
Partial backups and corruption
Symptom: Restore succeeds but data is missing (attachments) or DB is incomplete/corrupted.
Root cause: Interrupted dumps, disk issues, partial file sync, or missing directories excluded by mistake.
How we detect it:
- Integrity checks fail (checksum mismatch, missing files)
- Sharp size deltas compared to baseline trends
- Verification checks fail on file presence / DB sanity probes
How we prevent it:
- Add integrity validation to every backup
- Make backup scope explicit (DB + files + private files + attachments)
- Quarantine suspicious restore points and alert operators
Offsite copy is stale or missing
Symptom: Primary host fails; offsite restore point isn’t available or is outdated.
Root cause: Replication lag, credential drift, provider policy changes, or failed uploads not visible to operators.
How we detect it:
- Offsite freshness lag metric breaches threshold
- Destination-level failure clustering and retry churn
- Missing offsite copy for latest restore points
How we prevent it:
- Track offsite status per restore point (not per day)
- Retry uploads with backoff and evidence trails (see the sketch below)
- Alert on missing offsite copies within defined time windows
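A retry loop with exponential backoff that records every attempt gives both resilience and an evidence trail. The sketch below assumes the caller supplies the actual upload callable (for example, a wrapped object-storage client); nothing provider-specific is implied.

```python
import time
from collections.abc import Callable
from datetime import datetime, timezone


def upload_with_backoff(upload: Callable[[str], None], artifact: str,
                        attempts: int = 5, base_delay: float = 2.0) -> list[dict]:
    """Retry an offsite upload with exponential backoff, recording each attempt."""
    evidence = []
    for attempt in range(1, attempts + 1):
        record = {"attempt": attempt, "at": datetime.now(timezone.utc).isoformat()}
        try:
            upload(artifact)           # caller-supplied upload; raises on failure
            record["ok"] = True
            evidence.append(record)
            return evidence
        except Exception as exc:       # record the failure, then back off and retry
            record["ok"] = False
            record["error"] = str(exc)
            evidence.append(record)
            if attempt < attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))
    return evidence
```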
Designed for safety under pressure
Recovery is a crisis workflow. The design focuses on correctness, least privilege, and auditable outcomes.
Routine restore tests convert hope into proof and expose drift early.
- Scheduled restore verification
- Measured restore time and recovered point
- Clear failure step + remediation hints
Backups must be complete, consistent, and validated. Scope is explicit and verifiable.
- Checksums + sanity checks
- Explicit coverage (DB + files + attachments)
- Anomaly detection on size and file counts
Restores are high-risk actions. Access is controlled and every action is recorded.
- Role-gated restore actions
- Confirmation + safety prompts
- Evidence trail for postmortems/compliance
Restore workflows you can trust
Incidents are not the time for improvised commands. The system provides controlled restore workflows with preflight checks, evidence capture, and predictable outcomes.
Want recovery you can prove before an incident?
We’ll review your backup scope, retention discipline, restore workflow, and verification evidence - then propose a practical plan to make recovery predictable and auditable.