---
id: RAIL-BS-WP-0003
type: bug-report
title: "pgpool CrashLoopBackOff on PostgreSQL HA failover — missing secret key"
domain: railiance
repo: railiance-cluster
status: completed
owner: tegwick
created: "2026-03-10"
updated: "2026-03-10"
state_hub_workstream_id: "7ee9ee22-1fae-4567-9194-8d70a9e0f45b"
---

# Bug Report: pgpool CrashLoopBackOff on PostgreSQL HA failover

## Summary

On 2026-03-10 a PostgreSQL HA failover caused all three postgresql pods to
restart. pgpool, the connection pooler between Gitea and PostgreSQL, then
entered CrashLoopBackOff and produced no logs. As a result, Gitea's login
and all write operations hung indefinitely. The root page was still served
(from Valkey cache), which masked the failure.

The fix was to patch a missing key in a Kubernetes secret. The root cause is
that the `gitea-12.2.0` Helm chart (postgresql-ha subchart v16.2.2) does not
populate the `pgpool-password` key in the `gitea-postgresql-ha-postgresql`
secret, even though the pgpool pod requires it at startup.

---

## Timeline

| Time (UTC) | Event |
|---|---|
| ~09:45 | postgresql-0, postgresql-2 pods restarted (repmgr failover) |
| ~09:45 | pgpool pod restarted and entered CrashLoopBackOff |
| ~11:00 | User noticed Gitea login hanging; home page still loading |
| ~13:00 | Root cause identified: missing `pgpool-password` secret key |
| ~13:10 | Secret patched; pgpool pod deleted and restarted cleanly |
| ~13:15 | Gitea fully operational |

---

## Root Cause

The Bitnami `pgpool` container startup script reads the file
`/opt/bitnami/pgpool/secrets/pgpool-password`, which is mounted from the
`gitea-postgresql-ha-postgresql` Kubernetes Secret via a `subPath` volume
mount. That secret key was never created by the Helm chart, so the file did
not exist. The container exited immediately with no logs.

The pod had been running for 20 days without a restart, so this gap was
never discovered during initial deployment.

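
A check along the following lines could surface a gap like this before a restart does. It is a minimal sketch, not something the chart or this repo provides: `missing_keys` is a hypothetical helper that reads the key names present in a secret (one per line) on stdin and prints each required key that is absent.

```bash
# missing_keys KEY...  (hypothetical helper, not part of the chart)
# Reads the key names that exist in a secret from stdin, one per line,
# and prints every required KEY that is absent.
# Returns 1 if at least one key is missing, 0 otherwise.
missing_keys() {
  local present key rc=0
  present=$(cat)
  for key in "$@"; do
    # -x matches the whole line, -F treats the key as a fixed string
    if ! printf '%s\n' "$present" | grep -qxF -- "$key"; then
      printf '%s\n' "$key"
      rc=1
    fi
  done
  return "$rc"
}

# Example (assumes kubectl access to the cluster):
#   kubectl get secret -n default gitea-postgresql-ha-postgresql \
#     -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}' \
#     | missing_keys password postgres-password repmgr-password pgpool-password
```

Keeping the parsing separate from the `kubectl` call means the helper can be exercised against a fixed list of key names without a cluster.
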

---

## Evidence

```bash
# Secret was missing the pgpool-password key
sudo k3s kubectl get secret -n default gitea-postgresql-ha-postgresql -o yaml
# data: keys were password, postgres-password, repmgr-password only
# pgpool-password was absent

# pgpool pod describe showed 824 back-off restarts over 173 minutes
# No logs in either current or --previous output
sudo k3s kubectl logs -n default <pgpool-pod> --previous
# (empty)

# Gitea process had zero TCP connections to PostgreSQL port 5432
# but many connections to Valkey port 6379
cat /proc/<gitea-pid>/net/tcp | grep 1538  # 1538 = 5432 hex; no results
```

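
The hex-port lookup above can be made less error-prone by matching only the remote-address column, since a bare `grep 1538` also matches local ports and other fields. The following is a sketch under that assumption; `count_conns_to_port` is a hypothetical helper, and the optional file argument exists only so it can be pointed at a saved copy of the table.

```bash
# count_conns_to_port PORT [FILE]  (hypothetical helper)
# Counts rows in a /proc/<pid>/net/tcp-style table whose remote address
# (third column, hex "ADDR:PORT") ends in the given decimal port.
count_conns_to_port() {
  local port_hex
  port_hex=$(printf '%04X' "$1")   # decimal port to 4-digit uppercase hex
  awk -v p=":${port_hex}" '
    # skip the header row, then compare the last 5 characters (":PORT")
    NR > 1 && substr($3, length($3) - 4) == p { n++ }
    END { print n + 0 }
  ' "${2:-/proc/net/tcp}"
}

# Example:
#   count_conns_to_port 5432 /proc/<gitea-pid>/net/tcp
#   count_conns_to_port 6379 /proc/<gitea-pid>/net/tcp
```

Note that `/proc/<pid>/net/tcp` lists every socket in that process's network namespace, so the count covers the whole pod rather than the Gitea process alone.
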

---

## Immediate Fix Applied

```bash
# Add the missing key. Its value matches sr-check-password:
# "changeme4", base64-encoded as Y2hhbmdlbWU0.
sudo k3s kubectl patch secret -n default gitea-postgresql-ha-postgresql \
  --type='json' \
  -p='[{"op":"add","path":"/data/pgpool-password","value":"Y2hhbmdlbWU0"}]'

# Restart pgpool
sudo k3s kubectl delete pod -n default <pgpool-pod-name>
```

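
Values under `.data` in a Secret must be base64-encoded, and the encoding is easy to get subtly wrong. A small sketch of how the value in the patch can be derived; `encode_secret_value` is a hypothetical helper name, not something the repo provides.

```bash
# encode_secret_value PLAINTEXT  (hypothetical helper)
# Prints the base64 form expected under .data in a Secret manifest.
# printf '%s' avoids the trailing newline that a plain echo would encode,
# and tr strips any line wrapping added by base64.
encode_secret_value() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Example:
#   encode_secret_value changeme4
```

By contrast, `echo changeme4 | base64` encodes the trailing newline as well and yields `Y2hhbmdlbWU0Cg==`, which would store a different password.
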

---

## Risk: Fix Will Be Lost on helm upgrade

The patched secret is managed by Helm (annotation:
`meta.helm.sh/release-name: gitea`). A `helm upgrade` will regenerate the
secret from the chart template, which does not include `pgpool-password`,
and the bug will recur.

---

## Tasks

### T01 — Add pgpool-password to Helm values

```task
id: T01
status: done
priority: high
state_hub_task_id: "6841c93a-f146-47eb-9f7c-8fa0e02c1bbc"
```

Create or update `helm/gitea-values.yaml` (or equivalent) to permanently
include the pgpool-password so it survives `helm upgrade`:

```yaml
postgresql-ha:
  postgresql:
    pgpoolPassword: <value matching sr-check-password>
```

**Done when:** `helm upgrade gitea` completes and pgpool starts cleanly
without manual secret patching.

---

### T02 — Add pgpool health check to smoke test

```task
id: T02
status: done
priority: high
state_hub_task_id: "ab166073-30a7-4702-a037-4091e8706e20"
```

Extend `tests/smoke_kube.sh` to assert:

```bash
# All postgresql-ha pods Running: fail if any matching pod has another status
kubectl get pods -n default | grep gitea-postgresql-ha | grep -v Running && exit 1

# pgpool specifically not in CrashLoopBackOff.
# Match the waiting reason exactly: a case-sensitive `grep -v crash` never
# matches "CrashLoopBackOff" and therefore never fails.
kubectl get pod -n default -l app.kubernetes.io/component=pgpool \
  -o jsonpath='{.items[0].status.containerStatuses[0].state.waiting.reason}' \
  | grep -q CrashLoopBackOff && exit 1
```

**Done when:** the smoke test catches a pgpool failure within 5 minutes.

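
The first assertion can also be written as a function that is testable without a cluster. This is a sketch only; `not_running` is a hypothetical helper, and it assumes the default `kubectl get pods` column order (NAME, READY, STATUS, ...).

```bash
# not_running PREFIX  (hypothetical helper)
# Reads `kubectl get pods --no-headers` output on stdin and prints every pod
# whose name starts with PREFIX and whose STATUS column is not "Running".
# Returns 1 if any such pod was found, so it can gate a smoke test.
not_running() {
  awk -v prefix="$1" '
    index($1, prefix) == 1 && $3 != "Running" { print $1, $3; bad = 1 }
    END { exit bad + 0 }
  '
}

# Example (assumes kubectl access to the cluster):
#   kubectl get pods -n default --no-headers \
#     | not_running gitea-postgresql-ha || exit 1
```

Comparing the STATUS column directly avoids false passes when the word "Running" happens to appear elsewhere on the line, and the printed pod names make the failure easier to read in CI output.
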

---

### T03 — Add HA failover test

```task
id: T03
status: done
priority: high
state_hub_task_id: "140da396-8e30-4f4d-b88c-c42c0cd46c01"
```

Create `tests/test_ha_failover.sh` that:

1. Records Gitea login response time (baseline)
2. Kills the primary PostgreSQL pod: `kubectl delete pod gitea-postgresql-ha-postgresql-0 -n default`
3. Waits for repmgr to promote a replica (max 60s)
4. Asserts Gitea login POST still succeeds within 10s
5. Asserts pgpool pod is Running (not CrashLoopBackOff)
6. Asserts all postgresql pods return to Running

This test must pass before any PostgreSQL HA deployment is considered done.

**Done when:** script exits 0 against a live cluster.

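
The bounded waits in steps 3, 4, and 6 all need the same polling primitive. One possible shape is sketched below; `wait_until` is a hypothetical helper, and the readiness checks passed to it would have to be written for the actual cluster.

```bash
# wait_until TIMEOUT_SECONDS COMMAND [ARG...]  (hypothetical helper)
# Re-runs COMMAND once per second until it succeeds or the timeout elapses.
# Returns 0 on success, 1 on timeout. The timeout counts polling intervals,
# so a slow COMMAND stretches the real wall-clock wait.
wait_until() {
  local timeout_s=$1 elapsed=0
  shift
  until "$@"; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
}

# Example, where replica_promoted is a placeholder for a real check:
#   wait_until 60 replica_promoted || { echo "no promotion within 60s"; exit 1; }
```

Passing the check as a command keeps the timeout logic in one place, so each numbered step reduces to a single `wait_until` call with its own limit.
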

---

### T04 — Document the incident in docs/

```task
id: T04
status: done
priority: medium
state_hub_task_id: "d8a3ba40-fda0-4c1f-a9f1-ffcd621a5b3d"
```

Add `docs/incidents/2026-03-10-pgpool-missing-secret.md` with the full
timeline, root cause, and fix, so future operators understand what happened
and how to recover.

**Done when:** doc committed and linked from `docs/README.md`.

---

## References

- Bitnami postgresql-ha chart v16.2.2
- Gitea Helm chart v12.2.0
- Related decision: D3 (HA testing policy) in `DECISIONS.md`