bug(gitea): report pgpool CrashLoopBackOff on HA failover + D3 testing policy

Add RAIL-BS-WP-0003 documenting the 2026-03-10 incident where a PostgreSQL HA failover caused pgpool to enter CrashLoopBackOff due to a missing pgpool-password key in the gitea-postgresql-ha-postgresql secret. The bug had been present since initial deployment but stayed hidden because no pod had restarted.

Add Decision D3: HA and failover scenarios must be tested before a workplan is considered done. Any HA component deployment requires a passing failover test script in tests/ and complete Helm values before status = completed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 13:03:36 +00:00
parent ada406f327
commit 359d5b8b5b
2 changed files with 236 additions and 2 deletions
--- a/DECISIONS.md
+++ b/DECISIONS.md
@@ -19,3 +19,48 @@ I want to go with C and separate concerns. Nginx for external SSL will need secu
 We will use cloud storage. The backup should be encrypted so that it is safe regardless of location and provider. For starters I will provide a Nextcloud upload space as a backend.
 ---
 ## D3 — HA and failover scenarios must be tested before a workplan is considered done
 **Date:** 2026-03-10
 **Decided by:** Tegwick
 On 2026-03-10 a PostgreSQL HA failover exposed a bug (missing `pgpool-password`
 secret key) that had been present since initial deployment on 2025-08-31 but was
 never discovered because no pod restart had occurred in 20 days. The immediate
 symptom was Gitea logins hanging silently for hours.
 This incident showed that deploying an HA component and declaring it "done"
 without ever triggering a failover gives false confidence. Infrastructure that
 has never failed over is not HA — it is just redundant hardware.
 **Policy:**
 Any workplan that deploys or configures a High Availability component
 (database cluster, replicated storage, redundant ingress, etc.) is **not
 complete** until a failover test passes. Specifically:
 1. A test script in `tests/` must exist that deliberately kills the primary
   component and asserts the service remains available within an acceptable
   recovery window.
 2. The test must be run against a live cluster and exit 0 before the workplan
   status is set to `completed`.
 3. Smoke tests (`tests/smoke_kube.sh` or equivalent) must include a health
   check for each HA component's connection pooler, proxy, or load balancer —
   not just the backing nodes.
 4. Any Helm chart values required to make HA work correctly (secrets,
   passwords, topology settings) must be present in the versioned values file
   before the workplan is closed, so that a `helm upgrade` cannot silently
   regress the fix.
 **Rationale:** A failure that only surfaces on the first real event (restart,
 failover, node loss) is a deployment bug, not an operational surprise. Railiance
 aims for calm ops — and calm ops requires that every failure mode we know about
 has been tested before it matters.
 See: `workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md`
 ---
--- a/workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md
+++ b/workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md
@@ -0,0 +1,189 @@
 ---
 id: RAIL-BS-WP-0003
 type: bug-report
 title: "pgpool CrashLoopBackOff on PostgreSQL HA failover — missing secret key"
 domain: railiance
 repo: railiance-cluster
 status: open
 owner: tegwick
 created: "2026-03-10"
 updated: "2026-03-10"
 ---
 # Bug Report: pgpool CrashLoopBackOff on PostgreSQL HA failover
 ## Summary
 On 2026-03-10 a PostgreSQL HA failover caused all three postgresql pods to
 restart. pgpool — the connection pooler between Gitea and PostgreSQL — then
 entered CrashLoopBackOff and produced no logs. As a result Gitea's login
 and all write operations hung indefinitely. The root page was still served
 (from Valkey cache) which masked the failure.
 The fix was to patch a missing key in a Kubernetes secret. The root cause is
 that the `gitea-12.2.0` Helm chart (postgresql-ha subchart v16.2.2) does not
 populate the `pgpool-password` key in the `gitea-postgresql-ha-postgresql`
 secret, even though the pgpool pod requires it at startup.
 ---
 ## Timeline
 | Time (UTC) | Event |
 |---|---|
 | ~09:45 | postgresql-0, postgresql-2 pods restarted (repmgr failover) |
 | ~09:45 | pgpool pod restarted and entered CrashLoopBackOff |
 | ~11:00 | User noticed Gitea login hanging; home page still loading |
 | ~13:00 | Root cause identified: missing `pgpool-password` secret key |
 | ~13:10 | Secret patched; pgpool pod deleted and restarted cleanly |
 | ~13:15 | Gitea fully operational |
 ---
 ## Root Cause
 The Bitnami `pgpool` container startup script reads the file
 `/opt/bitnami/pgpool/secrets/pgpool-password`, which is mounted from the
 `gitea-postgresql-ha-postgresql` Kubernetes Secret via a `subPath` volume
 mount. That secret key was never created by the Helm chart, so the file did
 not exist. The container exited immediately with no logs.
 The pod had been running for 20 days without a restart, so this gap was
 never discovered during initial deployment.
 ---
 ## Evidence
 ```bash
 # Secret was missing the pgpool-password key
 sudo k3s kubectl get secret -n default gitea-postgresql-ha-postgresql -o yaml
 # data: keys were password, postgres-password, repmgr-password only
 # pgpool-password was absent
 # pgpool pod describe showed 824 back-off restarts over 173 minutes
 # No logs in either current or --previous output
 sudo k3s kubectl logs -n default <pgpool-pod> --previous
 # (empty)
 # Gitea process had zero TCP connections to PostgreSQL port 5432
 # but many connections to Valkey port 6379
 cat /proc/<gitea-pid>/net/tcp | grep 1538  # 1538 = 5432 hex — no results
 ```
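The port check above works because `/proc/<pid>/net/tcp` lists each socket as hex `IP:PORT` pairs, so port 5432 appears as `1538`. A minimal sketch that makes the conversion explicit and matches only the remote-address column; the function names `port_to_hex` and `count_conns_to_port` are illustrative and not part of this repo, and only the IPv4 table is covered (IPv6 sockets live in `net/tcp6`). Note that the file reflects every socket in the process's network namespace, not just those owned by that PID.

```bash
#!/usr/bin/env bash
# Convert a decimal TCP port to the 4-digit uppercase hex form used in /proc/net/tcp.
port_to_hex() {
  printf '%04X\n' "$1"
}

# Count rows whose remote address (3rd field) ends in the given port.
# Reads /proc/<pid>/net/tcp formatted text on stdin; the header row is skipped.
count_conns_to_port() {
  local hex
  hex="$(port_to_hex "$1")"
  awk -v suffix=":${hex}" '
    NR > 1 && substr($3, length($3) - length(suffix) + 1) == suffix { n++ }
    END { print n + 0 }
  '
}
```

Usage would look like `count_conns_to_port 5432 < /proc/<gitea-pid>/net/tcp`. Matching on the remote column avoids false hits from a plain `grep 1538`, which also matches local ports and inode numbers.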
 ---
 ## Immediate Fix Applied
 ```bash
 # Add the missing key (value = sr-check-password = changeme4 = base64: Y2hhbmdlbWU0)
 sudo k3s kubectl patch secret -n default gitea-postgresql-ha-postgresql \
  --type='json' \
  -p='[{"op":"add","path":"/data/pgpool-password","value":"Y2hhbmdlbWU0"}]'
 # Restart pgpool
 sudo k3s kubectl delete pod -n default <pgpool-pod-name>
 ```
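The patch above hard-codes the base64 value. A small sketch of how the same JSON Patch could be generated from a plaintext value; `encode_secret_value` and `build_secret_patch` are hypothetical helper names, and the key is interpolated as-is (a key containing `/` or `~` would need JSON Pointer escaping).

```bash
#!/usr/bin/env bash
# Base64-encode a secret value without a trailing newline.
# printf is used instead of echo so that no "\n" ends up inside the encoded value.
encode_secret_value() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Build a JSON Patch that adds one key to a Secret's .data map.
build_secret_patch() {
  local key="$1" value="$2"
  printf '[{"op":"add","path":"/data/%s","value":"%s"}]\n' \
    "$key" "$(encode_secret_value "$value")"
}
```

It could then be passed to the same command, for example `kubectl patch secret -n default gitea-postgresql-ha-postgresql --type=json -p "$(build_secret_patch pgpool-password "$PGPOOL_PASSWORD")"`, where `PGPOOL_PASSWORD` is an assumed environment variable holding the plaintext.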
 ---
 ## Risk: Fix Will Be Lost on helm upgrade
 The patched secret is managed by Helm (annotation:
 `meta.helm.sh/release-name: gitea`). A `helm upgrade` will regenerate the
 secret from the chart template, which does not include `pgpool-password`,
 and the bug will recur.
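Until the key is part of the versioned values, a check run after every `helm upgrade` can at least detect the regression. A minimal sketch, assuming the secret's key names are supplied one per line on stdin; `missing_secret_keys` is an illustrative name, not an existing script in this repo.

```bash
#!/usr/bin/env bash
# Print each required key that is absent from the key list read on stdin
# (one key per line). Returns 1 if anything is missing, 0 otherwise.
missing_secret_keys() {
  local present key rc=0
  present="$(cat)"
  for key in "$@"; do
    # -x: whole-line match, -F: literal string, so "password" does not match "pgpool-password"
    if ! grep -qxF -- "$key" <<<"$present"; then
      printf '%s\n' "$key"
      rc=1
    fi
  done
  return "$rc"
}
```

One way to feed it is `kubectl get secret -n default gitea-postgresql-ha-postgresql -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}' | missing_secret_keys pgpool-password`, which prints `pgpool-password` and fails if the upgrade dropped the key.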
 ---
 ## Tasks
 ### T01 — Add pgpool-password to Helm values
 ```task
 id: T01
 status: open
 priority: high
 ```
 Create or update `helm/gitea-values.yaml` (or equivalent) to permanently
 include the pgpool-password so it survives `helm upgrade`:
 ```yaml
 postgresql-ha:
   postgresql:
     pgpoolPassword: <value matching sr-check-password>
 ```
 **Done when:** `helm upgrade gitea` completes and pgpool starts cleanly
 without manual secret patching.
 ---
 ### T02 — Add pgpool health check to smoke test
 ```task
 id: T02
 status: open
 priority: high
 ```
 Extend `tests/smoke_kube.sh` to assert:
 ```bash
 # All postgresql-ha pods Running (any non-Running line fails the test)
 if kubectl get pods -n default --no-headers | grep gitea-postgresql-ha | grep -v Running; then
   exit 1
 fi
 # pgpool specifically not in CrashLoopBackOff (case-insensitive match on the state JSON)
 if kubectl get pod -n default -l app.kubernetes.io/component=pgpool \
   -o jsonpath='{.items[0].status.containerStatuses[0].state}' | grep -qi crashloopbackoff; then
   exit 1
 fi
 ```
 **Done when:** the smoke test catches a pgpool failure within 5 minutes.
 ---
 ### T03 — Add HA failover test
 ```task
 id: T03
 status: open
 priority: high
 ```
 Create `tests/test_ha_failover.sh` that:
 1. Records Gitea login response time (baseline)
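 2. Kills the primary PostgreSQL pod: `kubectl delete pod gitea-postgresql-ha-postgresql-0 -n default` (this assumes pod `-0` currently holds the primary role; after an earlier failover the primary may be a different pod)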
 3. Waits for repmgr to promote a replica (max 60s)
 4. Asserts Gitea login POST still succeeds within 10s
 5. Asserts pgpool pod is Running (not CrashLoopBackOff)
 6. Asserts all postgresql pods return to Running
 This test must pass before any PostgreSQL HA deployment is considered done.
 **Done when:** script exits 0 against a live cluster.
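The steps above can be sketched roughly as follows. This is a starting point under stated assumptions, not the final `tests/test_ha_failover.sh`: `GITEA_URL`, `PRIMARY_POD`, and `RUN_FAILOVER_TEST` are hypothetical variables, the `app.kubernetes.io/component=postgresql` label is assumed by analogy with the pgpool label, and the request to `/api/healthz` is a stand-in for the real login POST (it assumes the instance exposes that endpoint).

```bash
#!/usr/bin/env bash
# Hypothetical defaults; adjust to the real cluster.
NAMESPACE="${NAMESPACE:-default}"
PRIMARY_POD="${PRIMARY_POD:-gitea-postgresql-ha-postgresql-0}"
GITEA_URL="${GITEA_URL:-http://localhost:3000}"

# Retry a command about once per second until it succeeds or the timeout (seconds) expires.
wait_until() {
  local timeout="$1"; shift
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep 1
  done
}

# Stand-in for the login POST: a request that must answer within 10s.
gitea_responds() {
  curl -fsS --max-time 10 -o /dev/null "${GITEA_URL}/api/healthz"
}

# True when every pod matching the label selector reports Ready.
pods_ready() {
  kubectl wait --for=condition=Ready pod -n "$NAMESPACE" -l "$1" --timeout=1s >/dev/null 2>&1
}

main() {
  gitea_responds || { echo "baseline request failed" >&2; return 1; }                 # step 1
  kubectl delete pod "$PRIMARY_POD" -n "$NAMESPACE" || return 1                        # step 2
  wait_until 60 gitea_responds || { echo "no recovery within 60s" >&2; return 1; }     # steps 3-4
  wait_until 60 pods_ready app.kubernetes.io/component=pgpool \
    || { echo "pgpool not Ready" >&2; return 1; }                                      # step 5
  wait_until 120 pods_ready app.kubernetes.io/component=postgresql \
    || { echo "postgresql pods not Ready" >&2; return 1; }                             # step 6
  echo "failover test passed"
}

# Only touch the cluster when explicitly asked to.
if [ "${RUN_FAILOVER_TEST:-0}" = "1" ]; then
  main
  exit "$?"
fi
```

Running it as `RUN_FAILOVER_TEST=1 PRIMARY_POD=<current-primary> ./tests/test_ha_failover.sh` keeps the destructive step opt-in. The sketch uses service availability as a proxy for repmgr promotion; a stricter version would query repmgr for the new primary directly.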
 ---
 ### T04 — Document the incident in docs/
 ```task
 id: T04
 status: open
 priority: medium
 ```
 Add `docs/incidents/2026-03-10-pgpool-missing-secret.md` with the full
 timeline, root cause, and fix, so future operators understand what happened
 and how to recover.
 **Done when:** doc committed and linked from `docs/README.md`.
 ---
 ## References
 - Bitnami postgresql-ha chart v16.2.2
 - Gitea Helm chart v12.2.0
 - Related decision: D3 (HA testing policy) in `DECISIONS.md`