bug(gitea): report pgpool CrashLoopBackOff on HA failover + D3 testing policy

Add RAIL-BS-WP-0003 documenting the 2026-03-10 incident where a PostgreSQL HA failover caused pgpool to enter CrashLoopBackOff due to a missing pgpool-password key in the gitea-postgresql-ha-postgresql secret. The bug had been present since initial deployment but stayed hidden because no pod had restarted.

Add Decision D3: HA and failover scenarios must be tested before a workplan is considered done. Any HA component deployment requires a passing failover test script in tests/ and complete Helm values before status = completed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 13:03:36 +00:00
parent ada406f327
commit 359d5b8b5b
2 changed files with 236 additions and 2 deletions
--- a/workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md
+++ b/workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md
@@ -0,0 +1,189 @@
+---
+id: RAIL-BS-WP-0003
+type: bug-report
+title: "pgpool CrashLoopBackOff on PostgreSQL HA failover — missing secret key"
+domain: railiance
+repo: railiance-cluster
+status: open
+owner: tegwick
+created: "2026-03-10"
+updated: "2026-03-10"
+---
+
+# Bug Report: pgpool CrashLoopBackOff on PostgreSQL HA failover
+
+## Summary
+
+On 2026-03-10 a PostgreSQL HA failover caused all three postgresql pods to
+restart. pgpool, the connection pooler between Gitea and PostgreSQL, then
+entered CrashLoopBackOff and produced no logs. As a result, Gitea's login
+and all write operations hung indefinitely. The root page was still served
+from the Valkey cache, which masked the failure.
+
+The fix was to patch a missing key in a Kubernetes secret. The root cause is
+that the `gitea-12.2.0` Helm chart (postgresql-ha subchart v16.2.2) does not
+populate the `pgpool-password` key in the `gitea-postgresql-ha-postgresql`
+secret, even though the pgpool pod requires it at startup.
+
+---
+
+## Timeline
+
+| Time (UTC) | Event |
+|---|---|
+| ~09:45 | postgresql-0, postgresql-2 pods restarted (repmgr failover) |
+| ~09:45 | pgpool pod restarted and entered CrashLoopBackOff |
+| ~11:00 | User noticed Gitea login hanging; home page still loading |
+| ~13:00 | Root cause identified: missing `pgpool-password` secret key |
+| ~13:10 | Secret patched; pgpool pod deleted and restarted cleanly |
+| ~13:15 | Gitea fully operational |
+
+---
+
+## Root Cause
+
+The Bitnami `pgpool` container startup script reads the file
+`/opt/bitnami/pgpool/secrets/pgpool-password`, which is mounted from the
+`gitea-postgresql-ha-postgresql` Kubernetes Secret via a `subPath` volume
+mount. That secret key was never created by the Helm chart, so the file did
+not exist. The container exited immediately with no logs.
+
+The pgpool pod had been running for 20 days without a restart, so the gap
+went unnoticed from the initial deployment until the failover.
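Review note: the missing-key condition described above can be checked mechanically. The following is a minimal sketch, not part of the workplan; the function name `missing_secret_keys` is hypothetical, and the cluster usage in the trailing comment assumes `jq` is installed next to `kubectl`.

```bash
# Print every required key (given as arguments) that does not appear
# in the newline-separated list of present keys read from stdin.
missing_secret_keys() {
  local present key
  present=$(cat)
  for key in "$@"; do
    # -x matches whole lines only, -F treats the key as a literal string
    printf '%s\n' "$present" | grep -qxF -- "$key" || printf '%s\n' "$key"
  done
}

# Possible use against the cluster (assumes jq is available):
#   sudo k3s kubectl get secret -n default gitea-postgresql-ha-postgresql -o json \
#     | jq -r '.data | keys[]' \
#     | missing_secret_keys password postgres-password repmgr-password pgpool-password
```

Any key name printed by the function is absent from the secret; empty output means every required key is present.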
+
+---
+
+## Evidence
+
+```bash
+# Secret was missing the pgpool-password key
+sudo k3s kubectl get secret -n default gitea-postgresql-ha-postgresql -o yaml
+# data: keys were password, postgres-password, repmgr-password only
+# pgpool-password was absent
+
+# pgpool pod describe showed 824 back-off restarts over 173 minutes
+# No logs in either current or --previous output
+sudo k3s kubectl logs -n default <pgpool-pod> --previous
+# (empty)
+
+# Gitea process had zero TCP connections to PostgreSQL port 5432
+# but many connections to Valkey port 6379
+grep ':1538 ' /proc/<gitea-pid>/net/tcp  # 0x1538 = 5432 decimal; no results
+```
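Review note: the hex port lookup in the evidence above relies on `/proc/<pid>/net/tcp` listing addresses as `HEXIP:HEXPORT`. As a rough sketch (the helper names `port_hex` and `count_remote_port` are hypothetical), the conversion and the count could be scripted like this:

```bash
# Convert a decimal TCP port to the 4-digit uppercase hex form used in /proc/net/tcp.
port_hex() {
  printf '%04X\n' "$1"
}

# Count rows on stdin (in /proc/net/tcp format) whose remote address
# (third column) ends with the given decimal port. Counts all socket states.
count_remote_port() {
  local suffix
  suffix=":$(port_hex "$1")"
  awk -v s="$suffix" '
    NR > 1 && substr($3, length($3) - 4) == s { n++ }   # skip the header row
    END { print n + 0 }
  '
}

# Possible use on the node:
#   count_remote_port 5432 < /proc/<gitea-pid>/net/tcp
```

Note that this file lists every socket in the process's network namespace, so the count is only specific to Gitea when Gitea runs alone in its pod.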
+
+---
+
+## Immediate Fix Applied
+
+```bash
+# Add the missing key, reusing the sr-check-password value
+# (changeme4, base64-encoded: Y2hhbmdlbWU0)
+sudo k3s kubectl patch secret -n default gitea-postgresql-ha-postgresql \
+  --type='json' \
+  -p='[{"op":"add","path":"/data/pgpool-password","value":"Y2hhbmdlbWU0"}]'
+
+# Restart pgpool
+sudo k3s kubectl delete pod -n default <pgpool-pod-name>
+```
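Review note: Secret `data` values must be base64-encoded without a trailing newline, which is easy to get wrong with `echo`. The sketch below shows one way the value and the JSON patch above could be generated; both function names are hypothetical.

```bash
# Base64-encode a secret value. printf '%s' avoids the trailing newline that
# echo would add; tr strips the line wrapping some base64 builds apply.
encode_secret_value() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Build the JSON patch that adds the pgpool-password key with the encoded value.
pgpool_password_patch() {
  printf '[{"op":"add","path":"/data/pgpool-password","value":"%s"}]\n' \
    "$(encode_secret_value "$1")"
}

# Possible use:
#   sudo k3s kubectl patch secret -n default gitea-postgresql-ha-postgresql \
#     --type='json' -p="$(pgpool_password_patch changeme4)"
```

Encoding with `echo changeme4 | base64` would instead yield `Y2hhbmdlbWU0Cg==`, a password with a newline appended.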
+
+---
+
+## Risk: Fix Will Be Lost on helm upgrade
+
+The patched secret is managed by Helm (annotation:
+`meta.helm.sh/release-name: gitea`). A `helm upgrade` will regenerate the
+secret from the chart template, which does not include `pgpool-password`,
+and the bug will recur.
+
+---
+
+## Tasks
+
+### T01 — Add pgpool-password to Helm values
+
+```task
+id: T01
+status: open
+priority: high
+```
+
+Create or update `helm/gitea-values.yaml` (or equivalent) to permanently
+include the pgpool-password so it survives `helm upgrade`:
+
+```yaml
+postgresql-ha:
+  postgresql:
+    pgpoolPassword: <value matching sr-check-password>
+```
+
+**Done when:** `helm upgrade gitea` completes and pgpool starts cleanly
+without manual secret patching.
+
+---
+
+### T02 — Add pgpool health check to smoke test
+
+```task
+id: T02
+status: open
+priority: high
+```
+
+Extend `tests/smoke_kube.sh` to assert:
+
+```bash
+# Fail if any postgresql-ha pod is not Running
+if kubectl get pods -n default --no-headers | grep gitea-postgresql-ha | grep -v Running; then
+  exit 1
+fi
+
+# Fail if any pgpool container is waiting in CrashLoopBackOff
+if kubectl get pod -n default -l app.kubernetes.io/component=pgpool \
+  -o jsonpath='{.items[*].status.containerStatuses[*].state.waiting.reason}' \
+  | grep -q CrashLoopBackOff; then
+  exit 1
+fi
+```
+
+**Done when:** the smoke test catches a pgpool failure within 5 minutes.
+
+---
+
+### T03 — Add HA failover test
+
+```task
+id: T03
+status: open
+priority: high
+```
+
+Create `tests/test_ha_failover.sh` that:
+
+1. Records Gitea login response time (baseline)
+2. Kills the primary PostgreSQL pod (confirm which pod is currently primary; after a failover it may not be `-0`): `kubectl delete pod gitea-postgresql-ha-postgresql-0 -n default`
+3. Waits for repmgr to promote a replica (max 60s)
+4. Asserts Gitea login POST still succeeds within 10s
+5. Asserts pgpool pod is Running (not CrashLoopBackOff)
+6. Asserts all postgresql pods return to Running
+
+This test must pass before any PostgreSQL HA deployment is considered done.
+
+**Done when:** script exits 0 against a live cluster.
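Review note: steps 3 to 6 above all amount to "poll until a condition holds or a deadline passes". A minimal sketch of such a helper follows; `wait_until` is a hypothetical name, and the checks named in the usage comment (`primary_promoted`, `pgpool_running`) are assumed helpers that the test script would still need to define.

```bash
# wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL seconds until it succeeds (returns 0)
# or TIMEOUT seconds have elapsed (returns 1).
wait_until() {
  local timeout interval deadline
  timeout=$1
  interval=$2
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}

# Possible use inside tests/test_ha_failover.sh (hypothetical check functions):
#   wait_until 60 5 primary_promoted || { echo "no promotion within 60s"; exit 1; }
#   wait_until 60 5 pgpool_running   || { echo "pgpool not Running"; exit 1; }
```

Keeping the timeout as an explicit argument makes the 60s and 10s limits from the list visible at each call site.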
+
+---
+
+### T04 — Document the incident in docs/
+
+```task
+id: T04
+status: open
+priority: medium
+```
+
+Add `docs/incidents/2026-03-10-pgpool-missing-secret.md` with the full
+timeline, root cause, and fix, so future operators understand what happened
+and how to recover.
+
+**Done when:** doc committed and linked from `docs/README.md`.
+
+---
+
+## References
+
+- Bitnami postgresql-ha chart v16.2.2
+- Gitea Helm chart v12.2.0
+- Related decision: D3 (HA testing policy) in `DECISIONS.md`