feat(backup): revise WP-0004 — integrated backup per capability (D4)
Some checks failed
railiance-tests / smoke (push) Has been cancelled
WP-0004 rewritten: scope narrowed to S2-owned assets (etcd snapshots, Helm
values, kubeconfig). No external dependencies. age encryption reuses SOPS
key pair. Output to /opt/backup/railiance/cluster/.

DECISIONS.md D4: integrated backup per capability, not centralized.

EP-RAIL-005 registered in state hub: custodian orchestration deferred until
all layers implement the standard interface.

The old monolithic backup (custodian DB + operator config) was not S2's
concern and has been removed from this workplan scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
50
DECISIONS.md
@@ -64,3 +64,53 @@ has been tested before it matters.
See: `workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md`

---

## D4 — Integrated backup per capability, not centralized backup service

**Date:** 2026-03-10
**Decided by:** Tegwick

**Decision:** Each railiance repo implements its own backup for the
infrastructure it owns. There is no central backup service.

**Rationale:**

A centralized backup service (e.g., in railiance-enablement) couples every
stack layer to a shared component. As each layer matures and evolves at its
own pace, this coupling repeatedly breaks the backup. A service that breaks
when the thing it is supposed to protect is being changed is not a safety net.

Integrated backup per repo means:
- The backup for S1 lives in railiance-infra and knows exactly what S1 owns
- The backup for S2 lives in railiance-cluster and knows what S2 owns
- Each repo can be backed up independently, without any other repo, service,
  or network connection being available
- Each backup implementation matures with its layer

**Standard interface (Q3 Operability & Resilience):**

Every railiance repo that manages persistent state must provide:

1. `make backup` — creates an encrypted backup of what this layer owns,
   writes to a local directory on the server (`/opt/backup/railiance/<layer>/`)
2. `make restore` — restores from the most recent local backup
3. Encryption: age, reusing the same key pair used for SOPS secrets
4. No runtime dependencies: must work without custodian, state-hub, network
   file share, or any other external service being available
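
The interface above leaves file naming and the exact age invocation to each repo. As a minimal sketch only, and not the implementation in any railiance repo, the helpers below show one way a `make backup` / `make restore` pair could agree on paths. The timestamped file name, the `.tar.gz.age` suffix, and all function names are assumptions for illustration; only the `/opt/backup/railiance/<layer>/` directory and the use of age come from the decision text.

```python
from __future__ import annotations

from datetime import datetime, timezone
from pathlib import Path

# Directory layout taken from the standard interface; everything below it is assumed.
BACKUP_ROOT = Path("/opt/backup/railiance")


def backup_path(layer: str, now: datetime) -> Path:
    """Return the target file for one backup run of `layer` (hypothetical naming)."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return BACKUP_ROOT / layer / f"{layer}-{stamp}.tar.gz.age"


def encrypt_command(recipient: str, target: Path) -> list[str]:
    """Build an age command line; the tarball is expected on stdin.

    `recipient` is the age public key (the one SOPS already encrypts to).
    """
    return ["age", "--encrypt", "--recipient", recipient, "--output", str(target)]


def latest_backup(candidates: list[Path]) -> Path | None:
    """Pick the newest backup; fixed-width UTC stamps sort lexicographically."""
    return max(candidates, key=lambda p: p.name, default=None)


if __name__ == "__main__":
    target = backup_path("cluster", datetime.now(timezone.utc))
    print(target)
    print(" ".join(encrypt_command("age1...", target)))
```

Nothing here touches the network or another service, which is the point of item 4: the only inputs are the layer name, the clock, and the local key.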

**Extension point EP-RAIL-005:** The custodian can provide orchestration
guidelines. If each repo follows the standard interface, the custodian can
call `make backup` across the full stack in dependency order (S1 → S5)
and aggregate results. This is deliberately deferred — integrate first,
orchestrate later.
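
Since the orchestration is deferred, the following is only a sketch of what such a caller might look like once every layer exposes `make backup`. The function names and the choice to continue after a failed layer are assumptions, not part of EP-RAIL-005; the caller supplies the repo checkouts in dependency order.

```python
from __future__ import annotations

import subprocess
from typing import Callable

# A runner takes a repo directory and returns an exit code (0 means success).
Runner = Callable[[str], int]


def make_backup(repo_dir: str) -> int:
    """Run `make backup` inside repo_dir and return its exit code."""
    return subprocess.run(["make", "-C", repo_dir, "backup"], check=False).returncode


def backup_stack(repos: list[str], run: Runner = make_backup) -> dict[str, bool]:
    """Call the runner for each repo in the given order and record success per repo.

    A failing layer does not stop the loop (assumed policy): each backup is
    independent, so later layers are still attempted.
    """
    results: dict[str, bool] = {}
    for repo in repos:
        results[repo] = run(repo) == 0
    return results


if __name__ == "__main__":
    # Only the first two repo names appear in the decision text; paths are hypothetical.
    print(backup_stack(["railiance-infra", "railiance-cluster"]))
```

Passing the runner as a parameter keeps the aggregation logic testable without invoking `make`.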

**What changes from the previous approach (D2):**

D2 established Nextcloud as the backup destination for a single monolithic
script in railiance-cluster. That script backed up the wrong things (custodian
DB and operator config, neither of which is an S2 concern). The Nextcloud
upload becomes an optional extension, not a requirement.

See: `workplans/RAIL-BS-WP-0004-safety-net.md`

---