railiance-cluster/docs/incidents/2026-03-10-pgpool-missing-secret.md
tegwick 660a63c674
feat(pgpool): implement WP-0003 T01-T04 — permanent fix for pgpool-password bug
T01: helm/gitea-values.yaml with postgresql-ha.pgpool.adminPassword
     (fill REPLACE_WITH_PGPOOL_ADMIN_PASSWORD before helm upgrade)
T02: tests/smoke_kube.sh — add pgpool and postgresql-ha pod health checks
T03: tests/test_ha_failover.sh — D3 HA failover test script
T04: docs/incidents/2026-03-10-pgpool-missing-secret.md + README link

Also: make test-ha-failover target, Makefile .PHONY updated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 14:16:22 +01:00

Incident: pgpool CrashLoopBackOff on PostgreSQL HA Failover

Date: 2026-03-10
Severity: High (Gitea write operations unavailable for ~4 hours)
Component: postgresql-ha subchart (Bitnami v16.2.2) via Gitea Helm chart v12.2.0
Status: Resolved (permanent fix pending helm upgrade with correct values)


Summary

A PostgreSQL HA failover caused the pgpool connection pooler to enter CrashLoopBackOff. Gitea logins and all write operations hung silently for approximately 4 hours. The root page continued to load (served from Valkey cache), masking the failure.

Root cause: the pgpool-password key was absent from the gitea-postgresql-ha-postgresql Kubernetes Secret. The Bitnami postgresql-ha subchart does not populate this key automatically. The key had been missing since initial deployment (2025-08-31), but the gap was never discovered because the pgpool pod had not restarted in 20 days.


Timeline

| Time (UTC) | Event |
|------------|-------|
| ~09:45 | postgresql-0, postgresql-2 pods restarted (repmgr failover) |
| ~09:45 | pgpool pod restarted → CrashLoopBackOff (silent, no logs) |
| ~11:00 | User noticed Gitea login hanging; home page still loading |
| ~13:00 | Root cause identified: missing pgpool-password secret key |
| ~13:10 | Secret patched manually; pgpool pod deleted and restarted |
| ~13:15 | Gitea fully operational |

Root Cause

The Bitnami pgpool container startup script reads /opt/bitnami/pgpool/secrets/pgpool-password, mounted from the gitea-postgresql-ha-postgresql Secret via subPath. That key was never written by the Helm chart. The container exited immediately with no log output, so the crash was silent and gave no hint of its cause.
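
Because the failure leaves no logs, it can help to check the Secret for missing keys directly. The following is a minimal sketch, not part of the repository: the function name `missing_secret_keys` is hypothetical, and it assumes the Secret's `.data` object arrives as JSON on stdin (as it does when `kubectl ... -o jsonpath='{.data}'` is piped to a JSON tool).

```shell
#!/bin/sh
# Hypothetical helper (not from the repository): print every required key
# that is absent from a Secret's .data JSON object.
# stdin: the JSON object; arguments: the key names that must be present.
missing_secret_keys() {
  data=$(cat)
  for key in "$@"; do
    # A present key appears in the JSON as "name": so match the quoted
    # name followed by a colon; print the key only when no match is found.
    printf '%s' "$data" | grep -q "\"$key\"[[:space:]]*:" || printf '%s\n' "$key"
  done
}

# Example against a live cluster (not executed here):
#   kubectl get secret -n default gitea-postgresql-ha-postgresql \
#     -o jsonpath='{.data}' \
#     | missing_secret_keys password postgres-password repmgr-password pgpool-password
```

Empty output means every listed key is present. The match is a plain text search, so it is only a rough check and does not validate the JSON.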


Evidence

# Secret was missing pgpool-password — only these keys existed:
kubectl get secret -n default gitea-postgresql-ha-postgresql -o jsonpath='{.data}' | python3 -m json.tool
# password, postgres-password, repmgr-password — pgpool-password absent

# pgpool had 824 back-off restarts over 173 minutes with no logs
kubectl logs -n default <pgpool-pod> --previous
# (empty output)

# Gitea process had zero TCP connections to PostgreSQL (5432 = 0x1538)
cat /proc/<gitea-pid>/net/tcp | grep 1538   # no results
# All connections were to Valkey (6379 = 0x18EB)
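
The port arithmetic above (5432 = 0x1538, 6379 = 0x18EB) can be reproduced with a small helper. This is a sketch under stated assumptions: both function names are hypothetical, and the input is assumed to follow the standard /proc/net/tcp layout, with a header row and the remote address in the third column.

```shell
#!/bin/sh
# Convert a decimal TCP port to the 4-digit uppercase hex form that
# appears in /proc/net/tcp (for example 5432 becomes 1538).
port_to_hex() {
  printf '%04X\n' "$1"
}

# Count rows whose remote address (third column) ends in the given port.
# $1 = decimal port, $2 = path to a file in /proc/net/tcp format.
count_conns_to_port() {
  hex=$(port_to_hex "$1")
  awk -v suffix=":$hex" '
    NR > 1 && substr($3, length($3) - length(suffix) + 1) == suffix { n++ }
    END { print n + 0 }
  ' "$2"
}

# Example (not executed here):
#   count_conns_to_port 5432 /proc/<gitea-pid>/net/tcp
```

Matching on the remote-address column avoids false positives from other columns that happen to contain the same digits.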

Immediate Fix (manual — will regress on helm upgrade)

# Base64 of the pgpool admin password
# (printf avoids a trailing newline portably; tr removes any line wrapping)
PASSWORD_B64=$(printf '%s' "<pgpool-admin-password>" | base64 | tr -d '\n')

kubectl patch secret -n default gitea-postgresql-ha-postgresql \
  --type='json' \
  -p="[{\"op\":\"add\",\"path\":\"/data/pgpool-password\",\"value\":\"${PASSWORD_B64}\"}]"

kubectl delete pod -n default <pgpool-pod-name>
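
The escaped quotes in the inline patch are easy to get wrong. One way to reduce that risk is to build the patch body in a function, sketched below. The name `build_pgpool_patch` is hypothetical, and the password is assumed to contain no newline characters.

```shell
#!/bin/sh
# Hypothetical helper: emit the JSON Patch body that adds pgpool-password
# to the Secret's data. $1 = plaintext password (assumed single-line).
build_pgpool_patch() {
  # printf writes the password without a trailing newline; tr strips the
  # line wrapping that some base64 implementations add to long input.
  b64=$(printf '%s' "$1" | base64 | tr -d '\n')
  printf '[{"op":"add","path":"/data/pgpool-password","value":"%s"}]\n' "$b64"
}

# Example (not executed here; PGPOOL_PASSWORD is an assumed variable):
#   kubectl patch secret -n default gitea-postgresql-ha-postgresql \
#     --type='json' -p "$(build_pgpool_patch "$PGPOOL_PASSWORD")"
```

Base64 output contains only characters that are safe inside a JSON string, so the value needs no further escaping.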

Permanent Fix

Set postgresql-ha.pgpool.adminPassword in helm/gitea-values.yaml so the key is present after every helm upgrade, then apply the values:

helm upgrade gitea gitea/gitea --values helm/gitea-values.yaml

See helm/gitea-values.yaml. The REPLACE_WITH_PGPOOL_ADMIN_PASSWORD placeholder must be replaced with the actual pgpool password before running the upgrade.
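
Since the values file ships with the REPLACE_WITH_PGPOOL_ADMIN_PASSWORD placeholder, a guard in front of the upgrade can catch a forgotten edit. This is a minimal sketch; the function name `values_file_ready` is hypothetical and not part of the repository.

```shell
#!/bin/sh
# Hypothetical pre-upgrade guard: succeed only if the values file is
# readable and no longer contains the placeholder. $1 = values file path.
values_file_ready() {
  if [ ! -r "$1" ]; then
    echo "cannot read values file: $1" >&2
    return 2
  fi
  if grep -q 'REPLACE_WITH_PGPOOL_ADMIN_PASSWORD' "$1"; then
    echo "refusing to upgrade: placeholder still present in $1" >&2
    return 1
  fi
}

# Example (not executed here):
#   values_file_ready helm/gitea-values.yaml \
#     && helm upgrade gitea gitea/gitea --values helm/gitea-values.yaml
```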


Decisions Triggered

D3 — HA and failover scenarios must be tested before a workplan is considered done.

Any workplan deploying an HA component is not complete until:

  1. A failover test script in tests/ passes against a live cluster
  2. Smoke tests check the connection pooler/proxy, not just backing nodes
  3. All required Helm values are in the versioned values file

See: DECISIONS.md and workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md
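
As an illustration of item 2, a pooler check can be expressed as a filter over `kubectl get pods --no-headers` output. This is only a sketch and not the contents of tests/smoke_kube.sh: the function name is hypothetical, and it assumes pgpool pod names contain the string "pgpool".

```shell
#!/bin/sh
# Hypothetical smoke check: read `kubectl get pods --no-headers` output on
# stdin and succeed only if at least one pgpool pod exists and every pgpool
# pod is Running with all of its containers ready.
pgpool_pods_healthy() {
  awk '
    $1 ~ /pgpool/ {
      seen++
      split($2, ready, "/")              # READY column, e.g. "1/1"
      if ($3 != "Running" || ready[1] != ready[2]) bad++
    }
    END {
      if (seen > 0 && bad == 0) exit 0
      exit 1
    }
  '
}

# Example (not executed here):
#   kubectl get pods -n default --no-headers | pgpool_pods_healthy \
#     || echo "pgpool is not healthy" >&2
```

Treating "no pgpool pod found" as a failure matters here: a check that passes when the pooler is absent would have missed this incident as well.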


Recovery Checklist

If pgpool enters CrashLoopBackOff again:

# 1. Verify the secret key exists
kubectl get secret -n default gitea-postgresql-ha-postgresql \
  -o jsonpath='{.data.pgpool-password}'
# Empty output = key missing → apply patch above

# 2. After patching, force pgpool restart
kubectl delete pod -n default -l app.kubernetes.io/component=pgpool

# 3. Confirm Running state
kubectl get pods -n default | grep pgpool

# 4. Confirm Gitea can reach PostgreSQL
# In the Gitea pod: nc -zv gitea-postgresql-ha-pgpool 5432