# Decision Log

_Auto-generated by the Custodian State Hub._

## D1 — Ingress controller: Traefik (K3s default) vs Nginx for ThreePhoenix

**Date:** 2026-02-25
**Decided by:** Tegwick

I want to go with option C and separate concerns. Nginx for external SSL will need security and functional updates on a completely different schedule from Traefik's canary and production workload splitting. That second area of implementation is more complicated and volatile, and will need time to settle.

---

## D2 — Durable offsite backup destination for single-server safety net

**Date:** 2026-02-25
**Decided by:** Tegwick

We will use cloud storage. The backup should be encrypted so that it is safe regardless of location and provider. For starters, I will provide a Nextcloud upload space as a backend.

---

## D3 — HA and failover scenarios must be tested before a workplan is considered done

**Date:** 2026-03-10
**Decided by:** Tegwick

On 2026-03-10 a PostgreSQL HA failover exposed a bug (missing `pgpool-password`
secret key) that had been present since initial deployment on 2025-08-31 but was
never discovered because no pod restart had occurred in 20 days. The immediate
symptom was Gitea logins hanging silently for hours.

This incident showed that deploying an HA component and declaring it "done"
without ever triggering a failover gives false confidence. Infrastructure that
has never failed over is not HA; it is just redundant hardware.

**Policy:**

Any workplan that deploys or configures a High Availability component
(database cluster, replicated storage, redundant ingress, etc.) is **not
complete** until a failover test passes. Specifically:

1. A test script in `tests/` must exist that deliberately kills the primary
   component and asserts the service remains available within an acceptable
   recovery window.

2. The test must be run against a live cluster and exit 0 before the workplan
   status is set to `completed`.

3. Smoke tests (`tests/smoke_kube.sh` or equivalent) must include a health
   check for each HA component's connection pooler, proxy, or load balancer,
   not just the backing nodes.

4. Any Helm chart values required to make HA work correctly (secrets,
   passwords, topology settings) must be present in the versioned values file
   before the workplan is closed, so that a `helm upgrade` cannot silently
   regress the fix.

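The first two policy points could be sketched as a small helper for a script in `tests/`: trigger the failure, then poll a health check until it passes or the recovery window expires. This is a minimal sketch under stated assumptions, not the project's actual test. The function name `wait_for_recovery`, the 60-second window, and the commented `kubectl` and `pg_isready` wiring are illustrative choices that do not come from this log.

```shell
#!/usr/bin/env bash
# Hypothetical helper for a failover test. The function name, arguments and
# timings are assumptions for illustration, not taken from the repo.

# wait_for_recovery WINDOW_SECONDS INTERVAL_SECONDS CHECK_CMD [ARGS...]
# Polls CHECK_CMD until it succeeds. Returns 0 if the check passes before the
# recovery window expires, 1 otherwise.
wait_for_recovery() {
  local window="$1" interval="$2"
  shift 2
  local deadline=$(( SECONDS + window ))
  while true; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep "$interval"
  done
}

# Example wiring (commented out; pod name, host and port are placeholders):
#   kubectl delete pod <primary-pod> --wait=false
#   wait_for_recovery 60 2 pg_isready -h <pgpool-service> -p 5432 || exit 1
```

Pointing the check at the pooler or proxy rather than at a backing node matches point 3: the incident described above was a pooler-level failure that node-level checks would have missed.
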
**Rationale:** A failure that only surfaces on the first real event (restart,
failover, node loss) is a deployment bug, not an operational surprise. Railiance
aims for calm ops, and calm ops requires that every failure mode we know about
has been tested before it matters.

See: `workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md`

---

## D4 — Integrated backup per capability, not centralized backup service

**Date:** 2026-03-10
**Decided by:** Tegwick

**Decision:** Each railiance repo implements its own backup for the
infrastructure it owns. There is no central backup service.

**Rationale:**

A centralized backup service (e.g., in railiance-enablement) couples every
stack layer to a shared component. As each layer matures and evolves at its
own pace, this coupling repeatedly breaks the backup. A service that breaks
when the thing it is supposed to protect is being changed is not a safety net.

Integrated backup per repo means:
- The backup for S1 lives in railiance-infra and knows exactly what S1 owns
- The backup for S2 lives in railiance-cluster and knows what S2 owns
- Each repo can be backed up independently, without any other repo, service,
  or network connection being available
- Each backup implementation matures with its layer

**Standard interface (Q3 Operability & Resilience):**

Every railiance repo that manages persistent state must provide:

1. `make backup`: creates an encrypted backup of what this layer owns and
   writes it to a local directory on the server (`/opt/backup/railiance/<layer>/`)
2. `make restore`: restores from the most recent local backup
3. Encryption: age, reusing the same key pair used for SOPS secrets
4. No runtime dependencies: must work without custodian, state-hub, network
   file share, or any other external service being available

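As a rough illustration of this interface, the script behind a layer's `make backup` and `make restore` targets might look like the sketch below. Only the output directory layout and the use of age come from the decision above. The helper names, the `BACKUP_ROOT`, `AGE_RECIPIENT` and `AGE_KEY_FILE` variables, the timestamped file naming, and the use of `tar` are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, variables and the archive naming scheme are
# assumptions, not the repo's actual implementation.
set -o pipefail  # a failing tar or age must fail the whole pipeline

BACKUP_ROOT="${BACKUP_ROOT:-/opt/backup/railiance}"

# backup_path LAYER TIMESTAMP
# Prints the target file for one encrypted archive of the given layer.
backup_path() {
  printf '%s/%s/%s.tar.gz.age\n' "$BACKUP_ROOT" "$1" "$2"
}

# latest_backup LAYER
# Prints the most recent archive for a layer. Timestamps in the assumed
# YYYYmmddTHHMMSSZ format sort chronologically, so the last match wins.
latest_backup() {
  local f latest=""
  for f in "$BACKUP_ROOT/$1"/*.tar.gz.age; do
    [ -e "$f" ] && latest="$f"
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}

# create_backup LAYER SOURCE_DIR
# Archives SOURCE_DIR and encrypts it to the age public key in AGE_RECIPIENT
# (assumed to belong to the key pair SOPS uses). Uses local tools only.
create_backup() {
  local layer="$1" src="$2" out
  out="$(backup_path "$layer" "$(date -u +%Y%m%dT%H%M%SZ)")"
  mkdir -p "$(dirname "$out")"
  tar -czf - -C "$src" . | age -r "$AGE_RECIPIENT" -o "$out"
}

# restore_backup LAYER TARGET_DIR
# Decrypts the most recent archive with the private key file in AGE_KEY_FILE.
restore_backup() {
  local layer="$1" target="$2" archive
  archive="$(latest_backup "$layer")" || return 1
  mkdir -p "$target"
  age -d -i "$AGE_KEY_FILE" "$archive" | tar -xzf - -C "$target"
}
```

Nothing in the sketch contacts a network service, which is the point of item 4: `tar`, `age` and the local filesystem are the only things that need to be available.
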
**Extension point EP-RAIL-005:** The custodian can provide orchestration
guidelines. If each repo follows the standard interface, the custodian can
call `make backup` across the full stack in dependency order (S1 → S5)
and aggregate results. This is deliberately deferred: integrate first,
orchestrate later.

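Because EP-RAIL-005 is deferred, no orchestration exists yet. Purely to illustrate what the standard interface makes possible, a later orchestrator could be little more than the loop below. The function name `backup_stack`, the `MAKE_BIN` override and the repo paths are hypothetical. Only `make backup` and the S1-to-S5 ordering come from the text.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the deferred orchestration; nothing here exists in
# the repos. MAKE_BIN and backup_stack are hypothetical names.

# backup_stack REPO_DIR...
# Runs `make backup` in each repo in the order given (intended: S1 first,
# S5 last), keeps going after a failure, and prints one result line per repo.
# Returns 0 only if every layer succeeded.
backup_stack() {
  local make_bin="${MAKE_BIN:-make}" repo failed=0
  for repo in "$@"; do
    if "$make_bin" -C "$repo" backup >/dev/null 2>&1; then
      printf 'OK   %s\n' "$repo"
    else
      printf 'FAIL %s\n' "$repo"
      failed=$(( failed + 1 ))
    fi
  done
  [ "$failed" -eq 0 ]
}

# Example call; the paths are placeholders. Only railiance-infra (S1) and
# railiance-cluster (S2) are named in this log.
#   backup_stack /srv/railiance-infra /srv/railiance-cluster
```

The loop continues past a failed layer rather than aborting, on the assumption that independent per-layer backups should not block each other. Whether a real orchestrator should stop at the first failure is left open by the decision.
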
**What changes from the previous approach (D2):**

D2 established Nextcloud as the backup destination for a single monolithic
script in railiance-cluster. That script backed up the wrong things (custodian
DB and operator config, neither of which is an S2 concern). The Nextcloud
upload becomes an optional extension, not a requirement.

See: `workplans/RAIL-BS-WP-0004-safety-net.md`

---