diff --git a/DECISIONS.md b/DECISIONS.md
index 4ad1c4c..49f4a70 100644
--- a/DECISIONS.md
+++ b/DECISIONS.md
@@ -13,9 +13,54 @@ I want to go with C and separate concerns. Nginx for external SSL will need secu
 
 ## D2 — Durable offsite backup destination for single-server safety net
 
-**Date:** 2026-02-25
-**Decided by:** Tegwick
+**Date:** 2026-02-25
+**Decided by:** Tegwick
 
 We will use cloud storage the backup should be encypted to be safe regardless of the location and provider and for starters I will provide a nextcloud upload space as a backend.
 
 ---
+
+## D3 — HA and failover scenarios must be tested before a workplan is considered done
+
+**Date:** 2026-03-10
+**Decided by:** Tegwick
+
+On 2026-03-10 a PostgreSQL HA failover exposed a bug (missing `pgpool-password`
+secret key) that had been present since initial deployment on 2025-08-31 but was
+never discovered because no pod restart had occurred in 20 days. The immediate
+symptom was Gitea logins hanging silently for hours.
+
+This incident showed that deploying an HA component and declaring it "done"
+without ever triggering a failover gives false confidence. Infrastructure that
+has never failed over is not HA — it is just redundant hardware.
+
+**Policy:**
+
+Any workplan that deploys or configures a High Availability component
+(database cluster, replicated storage, redundant ingress, etc.) is **not
+complete** until a failover test passes. Specifically:
+
+1. A test script in `tests/` must exist that deliberately kills the primary
+   component and asserts the service remains available within an acceptable
+   recovery window.
+
+2. The test must be run against a live cluster and exit 0 before the workplan
+   status is set to `completed`.
+
+3. Smoke tests (`tests/smoke_kube.sh` or equivalent) must include a health
+   check for each HA component's connection pooler, proxy, or load balancer —
+   not just the backing nodes.
+
+4. Any Helm chart values required to make HA work correctly (secrets,
+   passwords, topology settings) must be present in the versioned values file
+   before the workplan is closed, so that a `helm upgrade` cannot silently
+   regress the fix.
+
+**Rationale:** A failure that only surfaces on the first real event (restart,
+failover, node loss) is a deployment bug, not an operational surprise. Railiance
+aims for calm ops — and calm ops requires that every failure mode we know about
+has been tested before it matters.
+
+See: `workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md`
+
+---
diff --git a/workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md b/workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md
new file mode 100644
index 0000000..40be6ea
--- /dev/null
+++ b/workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md
@@ -0,0 +1,189 @@
+---
+id: RAIL-BS-WP-0003
+type: bug-report
+title: "pgpool CrashLoopBackOff on PostgreSQL HA failover — missing secret key"
+domain: railiance
+repo: railiance-cluster
+status: open
+owner: tegwick
+created: "2026-03-10"
+updated: "2026-03-10"
+---
+
+# Bug Report: pgpool CrashLoopBackOff on PostgreSQL HA failover
+
+## Summary
+
+On 2026-03-10 a PostgreSQL HA failover caused all three postgresql pods to
+restart. pgpool — the connection pooler between Gitea and PostgreSQL — then
+entered CrashLoopBackOff and produced no logs. As a result Gitea's login
+and all write operations hung indefinitely. The root page was still served
+(from Valkey cache) which masked the failure.
+
+The fix was to patch a missing key in a Kubernetes secret. The root cause is
+that the `gitea-12.2.0` Helm chart (postgresql-ha subchart v16.2.2) does not
+populate the `pgpool-password` key in the `gitea-postgresql-ha-postgresql`
+secret, even though the pgpool pod requires it at startup.
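Editor's sketch: the gap summarised above (a key the pod mounts but the chart never writes) could in principle be detected before a restart exposes it. The helper below is a minimal illustration, not part of the repository. The function name `missing_secret_keys` is hypothetical, and the set of required keys is an assumption built from the three keys observed in the secret plus `pgpool-password`.

```bash
#!/usr/bin/env bash
# Sketch: report which required keys are absent from a list of secret keys.
# missing_secret_keys is a hypothetical helper, not an existing repo script.
# stdin: the keys that are present, one per line.
# args:  the keys that must be present.
# Prints each missing key and returns 1 if any key is missing.
missing_secret_keys() {
  local present key status=0
  present="$(cat)"
  for key in "$@"; do
    # -x matches the whole line, -F treats the key as a fixed string
    if ! grep -qxF -- "$key" <<<"$present"; then
      echo "$key"
      status=1
    fi
  done
  return "$status"
}

# Example use against the cluster (assumes kubectl access to the namespace):
#   kubectl get secret -n default gitea-postgresql-ha-postgresql \
#     -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}' \
#   | missing_secret_keys password postgres-password repmgr-password pgpool-password
```

Keeping the comparison separate from the `kubectl` call means the check can be exercised with a plain list of key names, without a cluster.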
+
+---
+
+## Timeline
+
+| Time (UTC) | Event |
+|---|---|
+| ~09:45 | postgresql-0, postgresql-2 pods restarted (repmgr failover) |
+| ~09:45 | pgpool pod restarted and entered CrashLoopBackOff |
+| ~11:00 | User noticed Gitea login hanging; home page still loading |
+| ~13:00 | Root cause identified: missing `pgpool-password` secret key |
+| ~13:10 | Secret patched; pgpool pod deleted and restarted cleanly |
+| ~13:15 | Gitea fully operational |
+
+---
+
+## Root Cause
+
+The Bitnami `pgpool` container startup script reads the file
+`/opt/bitnami/pgpool/secrets/pgpool-password`, which is mounted from the
+`gitea-postgresql-ha-postgresql` Kubernetes Secret via a `subPath` volume
+mount. That secret key was never created by the Helm chart, so the file did
+not exist. The container exited immediately with no logs.
+
+The pod had been running for 20 days without a restart, so this gap was
+never discovered during initial deployment.
+
+---
+
+## Evidence
+
+```bash
+# Secret was missing the pgpool-password key
+sudo k3s kubectl get secret -n default gitea-postgresql-ha-postgresql -o yaml
+# data: keys were password, postgres-password, repmgr-password only
+# pgpool-password was absent
+
+# pgpool pod describe showed 824 back-off restarts over 173 minutes
+# No logs in either current or --previous output
+sudo k3s kubectl logs -n default --previous
+# (empty)
+
+# Gitea process had zero TCP connections to PostgreSQL port 5432
+# but many connections to Valkey port 6379
+cat /proc//net/tcp | grep 1538 # 1538 = 5432 hex — no results
+```
+
+---
+
+## Immediate Fix Applied
+
+```bash
+# Add the missing key (value = sr-check-password = changeme4 = base64: Y2hhbmdlbWU0)
+sudo k3s kubectl patch secret -n default gitea-postgresql-ha-postgresql \
+  --type='json' \
+  -p='[{"op":"add","path":"/data/pgpool-password","value":"Y2hhbmdlbWU0"}]'
+
+# Restart pgpool
+sudo k3s kubectl delete pod -n default
+```
+
+---
+
+## Risk: Fix Will Be Lost on helm upgrade
+
+The patched secret is managed by Helm (annotation:
+`meta.helm.sh/release-name: gitea`). A `helm upgrade` will regenerate the
+secret from the chart template, which does not include `pgpool-password`,
+and the bug will recur.
+
+---
+
+## Tasks
+
+### T01 — Add pgpool-password to Helm values
+
+```task
+id: T01
+status: open
+priority: high
+```
+
+Create or update `helm/gitea-values.yaml` (or equivalent) to permanently
+include the pgpool-password so it survives `helm upgrade`:
+
+```yaml
+postgresql-ha:
+  postgresql:
+    pgpoolPassword:
+```
+
+**Done when:** `helm upgrade gitea` completes and pgpool starts cleanly
+without manual secret patching.
+
+---
+
+### T02 — Add pgpool health check to smoke test
+
+```task
+id: T02
+status: open
+priority: high
+```
+
+Extend `tests/smoke_kube.sh` to assert:
+
+```bash
+# All postgresql-ha pods Running
+kubectl get pods -n default | grep gitea-postgresql-ha | grep -v Running && exit 1
+
+# pgpool specifically not in CrashLoopBackOff
+kubectl get pod -n default -l app.kubernetes.io/component=pgpool \
+  -o jsonpath='{.items[0].status.containerStatuses[0].state}' | grep -qi crashloopbackoff && exit 1
+```
+
+**Done when:** the smoke test catches a pgpool failure within 5 minutes.
+
+---
+
+### T03 — Add HA failover test
+
+```task
+id: T03
+status: open
+priority: high
+```
+
+Create `tests/test_ha_failover.sh` that:
+
+1. Records Gitea login response time (baseline)
+2. Kills the primary PostgreSQL pod: `kubectl delete pod gitea-postgresql-ha-postgresql-0 -n default`
+3. Waits for repmgr to promote a replica (max 60s)
+4. Asserts Gitea login POST still succeeds within 10s
+5. Asserts pgpool pod is Running (not CrashLoopBackOff)
+6. Asserts all postgresql pods return to Running
+
+This test must pass before any PostgreSQL HA deployment is considered done.
+
+**Done when:** script exits 0 against a live cluster.
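Editor's sketch: the six steps of T03 could be laid out roughly as follows. This is an illustration under stated assumptions, not the script the workplan calls for. `GITEA_URL`, `GITEA_USER`, `GITEA_PASS`, the `app.kubernetes.io/component=postgresql` label, the 120-second pod wait, and every helper name are hypothetical. The sketch also swaps the login POST for an authenticated `GET /api/v1/user` request, since the web login form needs a CSRF token, and it treats "a database-backed request succeeds again within 60s" as a proxy for repmgr promotion rather than querying repmgr directly.

```bash
#!/usr/bin/env bash
# Sketch of a failover test following the six steps of T03.
# Assumed, not taken from the workplan: GITEA_URL, GITEA_USER, GITEA_PASS,
# the postgresql component label, the 120s pod wait, and all helper names.

NAMESPACE="${NAMESPACE:-default}"
PRIMARY_POD="${PRIMARY_POD:-gitea-postgresql-ha-postgresql-0}"
GITEA_URL="${GITEA_URL:-http://localhost:3000}"  # hypothetical address

# retry_until TIMEOUT INTERVAL CMD...: rerun CMD every INTERVAL seconds
# until it succeeds; give up once TIMEOUT seconds of waiting have accrued.
retry_until() {
  local timeout="$1" interval="$2" waited=0
  shift 2
  until "$@"; do
    waited=$((waited + interval))
    if [ "$waited" -ge "$timeout" ]; then return 1; fi
    sleep "$interval"
  done
}

# pods_all_ready SELECTOR: succeeds only if at least one pod matches and
# every container in every matching pod reports ready=true. Pod phase is
# not used because a pod in CrashLoopBackOff can still report phase Running.
pods_all_ready() {
  local ready
  ready="$(kubectl get pods -n "$NAMESPACE" -l "$1" \
    -o jsonpath='{range .items[*]}{.status.containerStatuses[*].ready}{"\n"}{end}')" || return 1
  [ -n "$ready" ] && ! grep -qvE '^true( true)*$' <<<"$ready"
}

# gitea_db_request_ok: one authenticated request that needs the database,
# capped at 10s. Stand-in for the login POST named in step 4.
gitea_db_request_ok() {
  curl -fsS --max-time 10 -o /dev/null \
    -u "${GITEA_USER:?set GITEA_USER}:${GITEA_PASS:?set GITEA_PASS}" \
    "$GITEA_URL/api/v1/user"
}

main() {
  local baseline
  # Step 1: baseline response time before any disruption.
  baseline="$(curl -fsS --max-time 10 -o /dev/null -w '%{time_total}' \
    -u "${GITEA_USER:?set GITEA_USER}:${GITEA_PASS:?set GITEA_PASS}" \
    "$GITEA_URL/api/v1/user")" || { echo "FAIL: baseline request"; return 1; }
  echo "baseline: ${baseline}s"

  # Step 2: kill the primary PostgreSQL pod (blocks until the pod is deleted).
  kubectl delete pod "$PRIMARY_POD" -n "$NAMESPACE" || return 1

  # Steps 3 and 4: allow up to 60s for a database-backed request to succeed.
  retry_until 60 5 gitea_db_request_ok \
    || { echo "FAIL: Gitea not serving database requests"; return 1; }

  # Step 5: pgpool must be ready, not crash-looping.
  retry_until 120 5 pods_all_ready "app.kubernetes.io/component=pgpool" \
    || { echo "FAIL: pgpool not ready"; return 1; }

  # Step 6: all PostgreSQL pods must come back.
  retry_until 120 5 pods_all_ready "app.kubernetes.io/component=postgresql" \
    || { echo "FAIL: postgresql pods not ready"; return 1; }

  echo "PASS"
}

# Run only when asked, so the helpers can be sourced and tested without a cluster.
if [ "${RUN_FAILOVER_TEST:-0}" = "1" ]; then
  main
  exit $?
fi
```

Running it would look like `RUN_FAILOVER_TEST=1 GITEA_USER=... GITEA_PASS=... bash tests/test_ha_failover.sh`. Checking container readiness rather than pod phase is a deliberate choice here: during this incident the pgpool pod existed the whole time, so a phase-only check might not have flagged it.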
+
+---
+
+### T04 — Document the incident in docs/
+
+```task
+id: T04
+status: open
+priority: medium
+```
+
+Add `docs/incidents/2026-03-10-pgpool-missing-secret.md` with the full
+timeline, root cause, and fix, so future operators understand what happened
+and how to recover.
+
+**Done when:** doc committed and linked from `docs/README.md`.
+
+---
+
+## References
+
+- Bitnami postgresql-ha chart v16.2.2
+- Gitea Helm chart v12.2.0
+- Related decision: D3 (HA testing policy) in `DECISIONS.md`