WP-0004 rewritten: scope narrowed to S2-owned assets (etcd snapshots, Helm values, kubeconfig). No external dependencies. age encryption reuses SOPS key pair. Output to /opt/backup/railiance/cluster/. DECISIONS.md D4: integrated backup per capability, not centralized. EP-RAIL-005 registered in state hub: custodian orchestration deferred until all layers implement the standard interface. The old monolithic backup (custodian DB + operator config) was not S2's concern and has been removed from this workplan scope. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Decision Log
Auto-generated by the Custodian State Hub.
D1 — Ingress controller: Traefik (K3s default) vs Nginx for ThreePhoenix
Date: 2026-02-25
Decided by: Tegwick
I want to go with C and separate concerns. Nginx for external SSL will need security and functional updates on a completely different schedule from Traefik's canary and production workload splitting. The second area of implementation is more complicated and volatile, and will need time to settle.
D2 — Durable offsite backup destination for single-server safety net
Date: 2026-02-25 Decided by: Tegwick
We will use cloud storage. The backup should be encrypted so that it is safe regardless of location and provider. For starters, I will provide a Nextcloud upload space as the backend.
D3 — HA and failover scenarios must be tested before a workplan is considered done
Date: 2026-03-10 Decided by: Tegwick
On 2026-03-10 a PostgreSQL HA failover exposed a bug (missing pgpool-password
secret key) that had been present since initial deployment on 2025-08-31 but was
never discovered because no pod restart had occurred in 20 days. The immediate
symptom was Gitea logins hanging silently for hours.
This incident showed that deploying an HA component and declaring it "done" without ever triggering a failover gives false confidence. Infrastructure that has never failed over is not HA — it is just redundant hardware.
Policy:
Any workplan that deploys or configures a High Availability component (database cluster, replicated storage, redundant ingress, etc.) is not complete until a failover test passes. Specifically:
- A test script in tests/ must exist that deliberately kills the primary component and asserts the service remains available within an acceptable recovery window.
- The test must be run against a live cluster and exit 0 before the workplan status is set to completed.
- Smoke tests (tests/smoke_kube.sh or equivalent) must include a health check for each HA component's connection pooler, proxy, or load balancer, not just the backing nodes.
- Any Helm chart values required to make HA work correctly (secrets, passwords, topology settings) must be present in the versioned values file before the workplan is closed, so that a helm upgrade cannot silently regress the fix.
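A failover test of the kind this policy asks for could be sketched as follows. This is a minimal illustration, not the script used in WP-0003: the function names wait_until_healthy and run_failover_test are hypothetical, and the kill and health-check commands are passed in by the caller because the real pod names, namespaces, and probes are not given here.

```shell
# Sketch only: all function names and the example commands are assumptions.

# wait_until_healthy CHECK_CMD TIMEOUT_S INTERVAL_S
# Polls CHECK_CMD until it succeeds or TIMEOUT_S seconds have elapsed.
wait_until_healthy() {
    wuh_check=$1
    wuh_timeout=$2
    wuh_interval=$3
    wuh_elapsed=0
    while [ "$wuh_elapsed" -le "$wuh_timeout" ]; do
        if sh -c "$wuh_check" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$wuh_interval"
        wuh_elapsed=$((wuh_elapsed + wuh_interval))
    done
    return 1
}

# run_failover_test KILL_CMD CHECK_CMD WINDOW_S
# Returns 0 if the service is healthy again within WINDOW_S seconds,
# 1 if it did not recover, 2 if the test could not be carried out.
run_failover_test() {
    rft_kill=$1
    rft_check=$2
    rft_window=$3

    # The service must be healthy before the test, otherwise the result is meaningless.
    if ! sh -c "$rft_check" >/dev/null 2>&1; then
        echo "FAIL: service unhealthy before failover" >&2
        return 2
    fi

    # Deliberately kill the primary component.
    if ! sh -c "$rft_kill" >/dev/null 2>&1; then
        echo "FAIL: could not kill primary" >&2
        return 2
    fi

    # Check through the pooler/proxy, polling every 2 seconds.
    if wait_until_healthy "$rft_check" "$rft_window" 2; then
        echo "PASS: service recovered within ${rft_window}s"
    else
        echo "FAIL: service did not recover within ${rft_window}s" >&2
        return 1
    fi
}

# Hypothetical usage (namespace, pod, and deployment names are placeholders):
# run_failover_test \
#   "kubectl -n databases delete pod postgresql-ha-postgresql-0 --wait=false" \
#   "kubectl -n databases exec deploy/postgresql-ha-pgpool -- pg_isready -h localhost" \
#   60
```

The health check should go through the connection pooler rather than straight to a database node, since the 2026-03-10 incident was a pooler-level fault that node-level checks would have missed.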
Rationale: A failure that only surfaces on the first real event (restart, failover, node loss) is a deployment bug, not an operational surprise. Railiance aims for calm ops — and calm ops requires that every failure mode we know about has been tested before it matters.
See: workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md
D4 — Integrated backup per capability, not centralized backup service
Date: 2026-03-10 Decided by: Tegwick
Decision: Each railiance repo implements its own backup for the infrastructure it owns. There is no central backup service.
Rationale:
A centralized backup service (e.g., in railiance-enablement) couples every stack layer to a shared component. As each layer matures and evolves at its own pace, this coupling repeatedly breaks the backup. A service that breaks when the thing it is supposed to protect is being changed is not a safety net.
Integrated backup per repo means:
- The backup for S1 lives in railiance-infra and knows exactly what S1 owns
- The backup for S2 lives in railiance-cluster and knows what S2 owns
- Each repo can be backed up independently, without any other repo, service, or network connection being available
- Each backup implementation matures with its layer
Standard interface (Q3 Operability & Resilience):
Every railiance repo that manages persistent state must provide:
- make backup: creates an encrypted backup of what this layer owns, writes to a local directory on the server (/opt/backup/railiance/<layer>/)
- make restore: restores from the most recent local backup
- Encryption: age, reusing the same key pair used for SOPS secrets
- No runtime dependencies: must work without custodian, state-hub, network file share, or any other external service being available
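The helpers behind make backup and make restore could look roughly like the sketch below. It is an illustration under assumptions, not the WP-0004 implementation: the function names, the archive naming scheme, and the BACKUP_ROOT override are hypothetical, and the key location falls back to the default SOPS age key path when SOPS_AGE_KEY_FILE is unset. It uses only local tools (tar, age, age-keygen), in line with the no-runtime-dependencies rule.

```shell
# Sketch only: function names and file naming are assumptions.
BACKUP_ROOT="${BACKUP_ROOT:-/opt/backup/railiance}"
SOPS_AGE_KEY_FILE="${SOPS_AGE_KEY_FILE:-$HOME/.config/sops/age/keys.txt}"

# backup_path LAYER TIMESTAMP -> full path of the encrypted archive
backup_path() {
    printf '%s/%s/%s-%s.tar.gz.age\n' "$BACKUP_ROOT" "$1" "$1" "$2"
}

# latest_backup LAYER -> path of the newest archive, empty if none exists.
# UTC timestamps of the form YYYYMMDDTHHMMSSZ sort chronologically by name.
latest_backup() {
    ls -1 "$BACKUP_ROOT/$1"/*.tar.gz.age 2>/dev/null | sort | tail -n 1
}

# create_backup LAYER PATH... -> archives the given paths and encrypts them
create_backup() {
    cb_layer=$1
    shift
    cb_out=$(backup_path "$cb_layer" "$(date -u +%Y%m%dT%H%M%SZ)")
    mkdir -p "$(dirname "$cb_out")" || return 1
    # Derive the public recipient from the existing SOPS age identity
    # (assumes the key file holds a single identity).
    cb_recipient=$(age-keygen -y "$SOPS_AGE_KEY_FILE") || return 1
    # Note: the pipeline's exit status is that of age; under bash,
    # `set -o pipefail` would also surface a tar failure.
    tar -czf - "$@" | age -r "$cb_recipient" -o "$cb_out" || return 1
    echo "$cb_out"
}

# restore_backup LAYER TARGET_DIR -> decrypts the newest archive into TARGET_DIR
restore_backup() {
    rb_src=$(latest_backup "$1")
    if [ -z "$rb_src" ]; then
        echo "no backup found for layer $1" >&2
        return 1
    fi
    mkdir -p "$2" || return 1
    age -d -i "$SOPS_AGE_KEY_FILE" "$rb_src" | tar -xzf - -C "$2"
}
```

A Makefile target would then only need to call create_backup with the layer name and the paths that layer owns, for S2 the etcd snapshots, Helm values, and kubeconfig.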
Extension point EP-RAIL-005: The custodian can provide orchestration
guidelines. If each repo follows the standard interface, the custodian can
call make backup across the full stack in dependency order (S1 → S5)
and aggregate results. This is deliberately deferred — integrate first,
orchestrate later.
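Once every layer follows the standard interface, the deferred orchestration could be as small as the loop sketched here. This is a hypothetical illustration of EP-RAIL-005, not an existing custodian feature: the function name run_stack_backups and the BACKUP_CMD override are assumptions, and the caller supplies the repo checkouts in dependency order because only the S1 and S2 repo names are known from this log.

```shell
# Sketch only: run_stack_backups and BACKUP_CMD are hypothetical names.

# run_stack_backups REPO_DIR...
# Runs the backup command in each repo directory in the order given,
# prints one result line per repo, and returns non-zero if any failed.
run_stack_backups() {
    rsb_failed=""
    for rsb_dir in "$@"; do
        # BACKUP_CMD is deliberately left unquoted so that the default
        # "make backup" splits into a command and its argument.
        if (cd "$rsb_dir" && ${BACKUP_CMD:-make backup}) >/dev/null 2>&1; then
            echo "OK   $rsb_dir"
        else
            echo "FAIL $rsb_dir"
            rsb_failed="$rsb_failed $rsb_dir"
        fi
    done
    [ -z "$rsb_failed" ]
}

# Hypothetical usage (checkout paths are placeholders; S3 to S5 would follow):
# run_stack_backups /srv/railiance-infra /srv/railiance-cluster
```

The loop continues after a failure so that the results can be aggregated; whether a failed lower layer should instead stop the run is a design choice this log leaves open.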
What changes from the previous approach (D2):
D2 established Nextcloud as the backup destination for a single monolithic script in railiance-cluster. That script backed up the wrong things: the custodian DB and operator config, neither of which is an S2 concern. The Nextcloud upload becomes an optional extension, not a requirement.
See: workplans/RAIL-BS-WP-0004-safety-net.md