bug(gitea): report pgpool CrashLoopBackOff on HA failover + D3 testing policy
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Add RAIL-BS-WP-0003 documenting the 2026-03-10 incident where a PostgreSQL HA failover caused pgpool to enter CrashLoopBackOff due to a missing pgpool-password key in the gitea-postgresql-ha-postgresql secret. The bug had been present since initial deployment but was hidden by the lack of any pod restart.

Add Decision D3: HA and failover scenarios must be tested before a workplan is considered done. Any HA component deployment requires a passing failover test script in tests/ and complete Helm values before status = completed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
49
DECISIONS.md
@@ -13,9 +13,54 @@ I want to go with C and separate concerns. Nginx for external SSL will need secu

## D2 — Durable offsite backup destination for single-server safety net

**Date:** 2026-02-25
**Decided by:** Tegwick

We will use cloud storage. The backup should be encrypted so that it is safe regardless of location and provider. For starters, I will provide a Nextcloud upload space as the backend.

---

## D3 — HA and failover scenarios must be tested before a workplan is considered done

**Date:** 2026-03-10
**Decided by:** Tegwick

On 2026-03-10 a PostgreSQL HA failover exposed a bug (missing `pgpool-password`
secret key) that had been present since initial deployment on 2025-08-31 but was
never discovered because no pod restart had occurred in 20 days. The immediate
symptom was Gitea logins hanging silently for hours.
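
A pre-flight check for the missing key described above could look roughly like the following. This is a minimal sketch, not the fix applied in the workplan: `secret_has_key` is a hypothetical helper, and the namespace is an assumption the caller must supply. It asks `kubectl` to render a Go template that prints `present` only when the key exists with a non-empty value.

```shell
#!/usr/bin/env bash
# Sketch: fail fast if a Kubernetes secret lacks a required key.
# secret_has_key is a hypothetical helper name, not part of any existing script.

# secret_has_key NAMESPACE SECRET KEY
# Returns 0 if KEY exists in SECRET with a non-empty value, 1 otherwise
# (including when the secret itself cannot be read).
secret_has_key() {
  local namespace="$1" secret="$2" key="$3" result
  # "index" is used because keys such as pgpool-password contain hyphens.
  # The template prints "present" only for an existing, non-empty value.
  result="$(kubectl -n "$namespace" get secret "$secret" \
    -o go-template="{{ if index .data \"$key\" }}present{{ end }}" 2>/dev/null)" || return 1
  [ "$result" = "present" ]
}
```

A smoke test could then call `secret_has_key gitea gitea-postgresql-ha-postgresql pgpool-password || exit 1`, where the `gitea` namespace is assumed for illustration.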

This incident showed that deploying an HA component and declaring it "done"
without ever triggering a failover gives false confidence. Infrastructure that
has never failed over is not HA; it is just redundant hardware.

**Policy:**

Any workplan that deploys or configures a High Availability component
(database cluster, replicated storage, redundant ingress, etc.) is **not
complete** until a failover test passes. Specifically:

1. A test script in `tests/` must exist that deliberately kills the primary
   component and asserts that the service is available again within an
   acceptable recovery window.

2. The test must be run against a live cluster and exit 0 before the workplan
   status is set to `completed`.

3. Smoke tests (`tests/smoke_kube.sh` or equivalent) must include a health
   check for each HA component's connection pooler, proxy, or load balancer,
   not just the backing nodes.

4. Any Helm chart values required to make HA work correctly (secrets,
   passwords, topology settings) must be present in the versioned values file
   before the workplan is closed, so that a `helm upgrade` cannot silently
   regress the fix.
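
The shape of a failover test satisfying points 1 and 2 might be sketched as below. This is an illustration under stated assumptions, not the script shipped in `tests/`: `wait_until` and `failover_test` are hypothetical helpers, and the namespace, pod name, recovery window, and health command are all supplied by the caller. The health command should go through the pooler or proxy, as point 3 requires.

```shell
#!/usr/bin/env bash
# Sketch of a failover test. Helper names and all arguments are hypothetical.

# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND about once per second until it succeeds or the timeout expires.
wait_until() {
  local timeout="$1"; shift
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if [ "$SECONDS" -ge "$deadline" ]; then
      return 1
    fi
    sleep 1
  done
}

# failover_test NAMESPACE PRIMARY_POD WINDOW_SECONDS HEALTH_COMMAND [ARGS...]
# Deletes the primary pod, then requires HEALTH_COMMAND to pass within the window.
# Returns 0 on recovery, 1 on timeout, 2 if the test could not be set up.
failover_test() {
  local namespace="$1" primary_pod="$2" window="$3"; shift 3
  # Refuse to start from an already broken service.
  "$@" || { echo "service unhealthy before failover" >&2; return 2; }
  # --wait=false returns immediately so polling starts right after the kill.
  kubectl -n "$namespace" delete pod "$primary_pod" --wait=false || return 2
  if wait_until "$window" "$@"; then
    echo "service recovered within ${window}s"
  else
    echo "service did not recover within ${window}s" >&2
    return 1
  fi
}
```

For example, `failover_test gitea gitea-postgresql-ha-postgresql-0 120 curl -fsS https://gitea.example/api/healthz` would exit 0 only if the health endpoint answers within 120 seconds of the kill. The namespace, pod name, window, and URL here are placeholders. A stricter variant would also confirm that the health check actually failed at least once, since a check that never notices the kill proves little.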

**Rationale:** A failure that only surfaces on the first real event (restart,
failover, node loss) is a deployment bug, not an operational surprise. Railiance
aims for calm ops, and calm ops requires that every failure mode we know about
has been tested before it matters.

See: `workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md`

---