feat(backup): revise WP-0004 — integrated backup per capability (D4)

WP-0004 rewritten: scope narrowed to S2-owned assets (etcd snapshots,
Helm values, kubeconfig). No external dependencies. age encryption
reuses SOPS key pair. Output to /opt/backup/railiance/cluster/.

DECISIONS.md D4: integrated backup per capability, not centralized.
EP-RAIL-005 registered in state hub: custodian orchestration deferred
until all layers implement the standard interface.

The old monolithic backup (custodian DB + operator config) was not S2's
concern and has been removed from this workplan scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 17:43:30 +01:00
parent 719e4f40d1
commit 5b0cfbf10a
2 changed files with 171 additions and 81 deletions


@@ -64,3 +64,53 @@ has been tested before it matters.
See: `workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md`
---
## D4 — Integrated backup per capability, not centralized backup service
**Date:** 2026-03-10
**Decided by:** Tegwick
**Decision:** Each railiance repo implements its own backup for the
infrastructure it owns. There is no central backup service.
**Rationale:**
A centralized backup service (e.g., in railiance-enablement) couples every
stack layer to a shared component. As each layer matures and evolves at its
own pace, this coupling repeatedly breaks the backup. A service that breaks
when the thing it is supposed to protect is being changed is not a safety net.
Integrated backup per repo means:
- The backup for S1 lives in railiance-infra and knows exactly what S1 owns
- The backup for S2 lives in railiance-cluster and knows what S2 owns
- Each repo can be backed up independently, without any other repo, service,
or network connection being available
- Each backup implementation matures with its layer
**Standard interface (Q3 Operability & Resilience):**
Every railiance repo that manages persistent state must provide:
1. `make backup` — creates an encrypted backup of what this layer owns,
writes to a local directory on the server (`/opt/backup/railiance/<layer>/`)
2. `make restore` — restores from the most recent local backup
3. Encryption: age, reusing the same key pair used for SOPS secrets
4. No runtime dependencies: must work without custodian, state-hub, network
file share, or any other external service being available
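
A minimal sketch of helper functions that `make backup` and `make restore` could call to satisfy the interface above. It is not the WP-0004 implementation: the function names, the `BACKUP_ROOT` and `LAYER` variables, the archive naming scheme, and the use of `tar` are assumptions made for illustration. `AGE_RECIPIENT` is assumed to hold the public key of the SOPS age key pair, and `SOPS_AGE_KEY_FILE` (the variable SOPS itself reads) is assumed to point at the matching private key file.

```shell
#!/usr/bin/env bash
# Sketch only: helpers a layer's `make backup` / `make restore` might call.
# BACKUP_ROOT, LAYER and the function names are hypothetical.

BACKUP_ROOT="${BACKUP_ROOT:-/opt/backup/railiance}"
LAYER="${LAYER:-cluster}"

# Local directory that holds this layer's backups.
backup_dir() {
  printf '%s/%s\n' "$BACKUP_ROOT" "$LAYER"
}

# Archive name with a UTC timestamp, so names sort from oldest to newest.
backup_name() {
  printf '%s-%s.tar.gz.age\n' "$LAYER" "$(date -u +%Y%m%dT%H%M%SZ)"
}

# Print the most recent backup; return non-zero if none exists.
latest_backup() {
  local dir f latest=""
  dir="$(backup_dir)"
  for f in "$dir"/*.age; do        # glob results are sorted by name
    [ -e "$f" ] && latest="$f"
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Archive the given paths and encrypt the stream; no plaintext hits the disk.
create_backup() {
  local out
  mkdir -p "$(backup_dir)" || return 1
  out="$(backup_dir)/$(backup_name)"
  # pipefail (bash) makes a tar failure fail the whole pipeline.
  ( set -o pipefail; tar -czf - "$@" | age -r "$AGE_RECIPIENT" -o "$out" ) || return 1
  printf '%s\n' "$out"
}

# Decrypt the most recent backup and unpack it into the target directory.
restore_latest() {
  local target="$1" src
  src="$(latest_backup)" || { echo "no backup found in $(backup_dir)" >&2; return 1; }
  mkdir -p "$target" || return 1
  ( set -o pipefail; age -d -i "$SOPS_AGE_KEY_FILE" "$src" | tar -xzf - -C "$target" )
}
```

In this sketch a `backup` target would call `create_backup` with the paths the layer owns, and a `restore` target would call `restore_latest` with a staging directory. Both only touch the local filesystem and the `age` binary, which is in line with the no-runtime-dependencies rule.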
**Extension point EP-RAIL-005:** The custodian can provide orchestration
guidelines. If each repo follows the standard interface, the custodian can
call `make backup` across the full stack in dependency order (S1 → S5)
and aggregate results. This is deliberately deferred until every layer
implements the standard interface: integrate first, orchestrate later.
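
The deferred orchestration could look roughly like the sketch below, which runs `make backup` in each repo in the order given and reports a per-repo result. Nothing here is specified by EP-RAIL-005: the function name, the output format, the choice to continue after a failure, and the `MAKE` override are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical orchestrator for EP-RAIL-005.
# Usage: run_stack_backup <repos_root> <repo>...   (repos listed in S1 -> S5 order)
run_stack_backup() {
  local root="$1" repo failed=0
  shift
  for repo in "$@"; do
    # Output of the layer's own backup goes to stderr; stdout carries the summary.
    if "${MAKE:-make}" -C "$root/$repo" backup >&2; then
      printf 'ok    %s\n' "$repo"
    else
      printf 'FAIL  %s\n' "$repo"
      failed=$((failed + 1))
    fi
  done
  # Non-zero exit status if any layer failed.
  [ "$failed" -eq 0 ]
}
```

A call such as `run_stack_backup <repos_root> railiance-infra railiance-cluster` would cover S1 and S2; the repos for the remaining layers are not named in this decision, so they are left out here. Continuing after a failed layer is one possible reading of "aggregate results"; stopping at the first failure would be an equally valid design.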
**What changes from the previous approach (D2):**
D2 established Nextcloud as the backup destination for a single monolithic
script in railiance-cluster. That script backed up the wrong things (custodian
DB and operator config, neither of which is an S2 concern). The Nextcloud
upload becomes an optional extension, not a requirement.
See: `workplans/RAIL-BS-WP-0004-safety-net.md`
---