feat(pgpool): implement WP-0003 T01-T04 — permanent fix for pgpool-password bug
Some checks failed
railiance-tests / smoke (push) Has been cancelled
T01: helm/gitea-values.yaml with postgresql-ha.pgpool.adminPassword
(fill REPLACE_WITH_PGPOOL_ADMIN_PASSWORD before helm upgrade)
T02: tests/smoke_kube.sh — add pgpool and postgresql-ha pod health checks
T03: tests/test_ha_failover.sh — D3 HA failover test script
T04: docs/incidents/2026-03-10-pgpool-missing-secret.md + README link
Also: make test-ha-failover target, Makefile .PHONY updated.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -67,6 +67,10 @@ From two bare Linux servers, a Git repo, and valid credentials, you can rebuild
4. **Deploy services**
Install baseline services via Helm from the helm/ directory.

## Incidents

- [2026-03-10: pgpool CrashLoopBackOff on PostgreSQL HA failover](incidents/2026-03-10-pgpool-missing-secret.md)

## 👥 Contributing

See CONTRIBUTING.md for rules, coding style, and workflow.
127
docs/incidents/2026-03-10-pgpool-missing-secret.md
Normal file
@@ -0,0 +1,127 @@
# Incident: pgpool CrashLoopBackOff on PostgreSQL HA Failover

**Date:** 2026-03-10
**Severity:** High (Gitea write operations unavailable for ~4 hours)
**Component:** postgresql-ha subchart (Bitnami v16.2.2) via Gitea Helm chart v12.2.0
**Status:** Resolved; permanent fix pending `helm upgrade` with correct values

---
## Summary

A PostgreSQL HA failover caused the pgpool connection pooler to enter
CrashLoopBackOff. Gitea logins and all write operations hung silently for
approximately 4 hours. The root page continued to load (served from Valkey
cache), masking the failure.

Root cause: the `pgpool-password` key was absent from the
`gitea-postgresql-ha-postgresql` Kubernetes Secret. The Bitnami postgresql-ha
subchart does not populate this key automatically. The key had been missing
since initial deployment (2025-08-31) but was never discovered because
the pgpool pod had not restarted in 20 days.

---
## Timeline

| Time (UTC) | Event |
|---|---|
| ~09:45 | `postgresql-0`, `postgresql-2` pods restarted (repmgr failover) |
| ~09:45 | pgpool pod restarted → CrashLoopBackOff (silent, no logs) |
| ~11:00 | User noticed Gitea login hanging; home page still loading |
| ~13:00 | Root cause identified: missing `pgpool-password` secret key |
| ~13:10 | Secret patched manually; pgpool pod deleted and restarted |
| ~13:15 | Gitea fully operational |

---
## Root Cause

The Bitnami `pgpool` container startup script reads
`/opt/bitnami/pgpool/secrets/pgpool-password`, mounted from the
`gitea-postgresql-ha-postgresql` Secret via `subPath`. That key was never
written by the Helm chart. The container exited immediately with no log output,
making it appear as a silent crash.

---
## Evidence

```bash
# Secret was missing pgpool-password; only these keys existed:
kubectl get secret -n default gitea-postgresql-ha-postgresql -o jsonpath='{.data}' | python3 -m json.tool
# password, postgres-password, repmgr-password (pgpool-password absent)

# pgpool had 824 back-off restarts over 173 minutes with no logs
kubectl logs -n default <pgpool-pod> --previous
# (empty output)

# Gitea process had zero TCP connections to PostgreSQL (5432 = 0x1538)
grep ':1538 ' /proc/<gitea-pid>/net/tcp   # no results
# All connections were to Valkey (6379 = 0x18EB)
```
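The hexadecimal values in the comments above follow from how `/proc/<pid>/net/tcp` prints ports: as four uppercase hex digits. A minimal sketch of the conversion, using a hypothetical helper `port_to_hex` that is not part of this repository:

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in this repo): format a TCP port the way
# /proc/<pid>/net/tcp prints it, as four uppercase hex digits.
port_to_hex() {
  printf '%04X\n' "$1"
}

port_to_hex 5432   # prints 1538 (PostgreSQL / pgpool)
port_to_hex 6379   # prints 18EB (Valkey)
```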
---

## Immediate Fix (manual; will regress on helm upgrade)

```bash
# Base64 of the pgpool admin password
PASSWORD_B64=$(echo -n "<pgpool-admin-password>" | base64)

kubectl patch secret -n default gitea-postgresql-ha-postgresql \
  --type='json' \
  -p="[{\"op\":\"add\",\"path\":\"/data/pgpool-password\",\"value\":\"${PASSWORD_B64}\"}]"

kubectl delete pod -n default <pgpool-pod-name>
```
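The password has to be encoded without a trailing newline, otherwise the newline becomes part of the stored value. A minimal sketch of the encoding and patch-building steps using `printf`, which behaves the same across shells; the function names `encode_secret` and `build_patch` are illustrative assumptions, not part of this repository:

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not in this repo).

# Base64-encode a value with no trailing newline in the input.
# tr removes the line wrapping some base64 builds add to long output.
encode_secret() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Emit the JSON Patch document that adds the pgpool-password key.
build_patch() {
  printf '[{"op":"add","path":"/data/pgpool-password","value":"%s"}]\n' \
    "$(encode_secret "$1")"
}

# Against a live cluster, the patch would be passed as:
#   kubectl patch secret -n default gitea-postgresql-ha-postgresql \
#     --type='json' -p="$(build_patch '<pgpool-admin-password>')"

# Example with a dummy value:
build_patch 'secret'
# prints [{"op":"add","path":"/data/pgpool-password","value":"c2VjcmV0"}]
```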
---

## Permanent Fix

Add `postgresql-ha.pgpool.adminPassword` to `helm/gitea-values.yaml` so that
the `pgpool-password` key is present after every `helm upgrade`, then apply
the values file:

```bash
helm upgrade gitea gitea/gitea --values helm/gitea-values.yaml
```
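The commit ships the values file with a `REPLACE_WITH_PGPOOL_ADMIN_PASSWORD` placeholder, so a pre-flight check could refuse to upgrade while the placeholder is still present. A minimal sketch; `values_ready` is a hypothetical helper, not part of this repository:

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check (not in this repo): succeed only if the
# values file exists and no longer contains the placeholder string.
values_ready() {
  local file="$1"
  [ -f "$file" ] && ! grep -q 'REPLACE_WITH_PGPOOL_ADMIN_PASSWORD' "$file"
}

# Usage sketch:
#   values_ready helm/gitea-values.yaml \
#     && helm upgrade gitea gitea/gitea --values helm/gitea-values.yaml
```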
See `helm/gitea-values.yaml`, which must be filled with the actual pgpool
password before running the upgrade.

---
## Decisions Triggered

**D3: HA and failover scenarios must be tested before a workplan is considered done.**

Any workplan deploying an HA component is not complete until:
1. A failover test script in `tests/` passes against a live cluster
2. Smoke tests check the connection pooler/proxy, not just backing nodes
3. All required Helm values are in the versioned values file

See: `DECISIONS.md` and `workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md`

---
## Recovery Checklist

If pgpool enters CrashLoopBackOff again:

```bash
# 1. Verify the secret key exists
kubectl get secret -n default gitea-postgresql-ha-postgresql \
  -o jsonpath='{.data.pgpool-password}'
# Empty output = key missing → apply the patch from the Immediate Fix section

# 2. After patching, force pgpool restart
#    (delete by label; passing `pod` together with `pod/<name>` output is rejected by kubectl)
kubectl delete pod -n default -l app.kubernetes.io/component=pgpool

# 3. Confirm Running state
kubectl get pods -n default | grep pgpool

# 4. Confirm Gitea can reach PostgreSQL
# In the Gitea pod: nc -zv gitea-postgresql-ha-pgpool 5432
```
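Step 3 can also be scripted so that it fails loudly instead of relying on a visual check. A minimal sketch, assuming the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...); `pods_not_running` is a hypothetical helper and not the contents of `tests/smoke_kube.sh`. It inspects the STATUS column only, not readiness:

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in this repo): read `kubectl get pods --no-headers`
# output on stdin and print the name of every pod whose STATUS column
# (third field) is not "Running".
pods_not_running() {
  awk '$3 != "Running" { print $1 }'
}

# Usage sketch against a live cluster:
#   bad=$(kubectl get pods -n default --no-headers \
#           -l app.kubernetes.io/component=pgpool | pods_not_running)
#   [ -z "$bad" ] || { echo "pgpool not healthy: $bad" >&2; exit 1; }
```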