Files
railiance-cluster/workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md
tegwick 441a37c5ae chore(workplan): mark WP-0003 completed — pgpool fix deployed and verified
helm upgrade confirmed pgpool starts cleanly with adminPassword in values.
SOPS encryption applied. Smoke test passes. D3 failover test pending.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 14:45:35 +01:00

5.4 KiB

id, type, title, domain, repo, status, owner, created, updated, state_hub_workstream_id
id type title domain repo status owner created updated state_hub_workstream_id
RAIL-BS-WP-0003 bug-report pgpool CrashLoopBackOff on PostgreSQL HA failover — missing secret key railiance railiance-cluster completed tegwick 2026-03-10 2026-03-10 7ee9ee22-1fae-4567-9194-8d70a9e0f45b

Bug Report: pgpool CrashLoopBackOff on PostgreSQL HA failover

Summary

On 2026-03-10 a PostgreSQL HA failover caused all three postgresql pods to restart. pgpool — the connection pooler between Gitea and PostgreSQL — then entered CrashLoopBackOff and produced no logs. As a result Gitea's login and all write operations hung indefinitely. The root page was still served (from Valkey cache) which masked the failure.

The fix was to patch a missing key in a Kubernetes secret. The root cause is that the gitea-12.2.0 Helm chart (postgresql-ha subchart v16.2.2) does not populate the pgpool-password key in the gitea-postgresql-ha-postgresql secret, even though the pgpool pod requires it at startup.


Timeline

Time (UTC) Event
~09:45 postgresql-0, postgresql-2 pods restarted (repmgr failover)
~09:45 pgpool pod restarted and entered CrashLoopBackOff
~11:00 User noticed Gitea login hanging; home page still loading
~13:00 Root cause identified: missing pgpool-password secret key
~13:10 Secret patched; pgpool pod deleted and restarted cleanly
~13:15 Gitea fully operational

Root Cause

The Bitnami pgpool container startup script reads the file /opt/bitnami/pgpool/secrets/pgpool-password, which is mounted from the gitea-postgresql-ha-postgresql Kubernetes Secret via a subPath volume mount. That secret key was never created by the Helm chart, so the file did not exist. The container exited immediately with no logs.

The pod had been running for 20 days without a restart, so this gap was never discovered during initial deployment.


Evidence

# Secret was missing the pgpool-password key
sudo k3s kubectl get secret -n default gitea-postgresql-ha-postgresql -o yaml
# data: keys were password, postgres-password, repmgr-password only
# pgpool-password was absent

# pgpool pod describe showed 824 back-off restarts over 173 minutes
# No logs in either current or --previous output
sudo k3s kubectl logs -n default <pgpool-pod> --previous
# (empty)

# Gitea process had zero TCP connections to PostgreSQL port 5432
# but many connections to Valkey port 6379
cat /proc/<gitea-pid>/net/tcp | grep 1538  # 1538 = 5432 hex — no results

Immediate Fix Applied

# Add the missing key (value = sr-check-password = changeme4 = base64: Y2hhbmdlbWU0)
sudo k3s kubectl patch secret -n default gitea-postgresql-ha-postgresql \
  --type='json' \
  -p='[{"op":"add","path":"/data/pgpool-password","value":"Y2hhbmdlbWU0"}]'

# Restart pgpool
sudo k3s kubectl delete pod -n default <pgpool-pod-name>

Risk: Fix Will Be Lost on helm upgrade

The patched secret is managed by Helm (annotation: meta.helm.sh/release-name: gitea). A helm upgrade will regenerate the secret from the chart template, which does not include pgpool-password, and the bug will recur.


Tasks

T01 — Add pgpool-password to Helm values

id: T01
status: done
priority: high
state_hub_task_id: "6841c93a-f146-47eb-9f7c-8fa0e02c1bbc"

Create or update helm/gitea-values.yaml (or equivalent) to permanently include the pgpool-password so it survives helm upgrade:

postgresql-ha:
  postgresql:
    pgpoolPassword: <value matching sr-check-password>

Done when: helm upgrade gitea completes and pgpool starts cleanly without manual secret patching.


T02 — Add pgpool health check to smoke test

id: T02
status: done
priority: high
state_hub_task_id: "ab166073-30a7-4702-a037-4091e8706e20"

Extend tests/smoke_kube.sh to assert:

# All postgresql-ha pods Running
kubectl get pods -n default | grep gitea-postgresql-ha | grep -v Running && exit 1

# pgpool specifically not in CrashLoopBackOff
kubectl get pod -n default -l app.kubernetes.io/component=pgpool \
  -o jsonpath='{.items[0].status.containerStatuses[0].state}' | grep -v crash

Done when: the smoke test catches a pgpool failure within 5 minutes.


T03 — Add HA failover test

id: T03
status: done
priority: high
state_hub_task_id: "140da396-8e30-4f4d-b88c-c42c0cd46c01"

Create tests/test_ha_failover.sh that:

  1. Records Gitea login response time (baseline)
  2. Kills the primary PostgreSQL pod: kubectl delete pod gitea-postgresql-ha-postgresql-0 -n default
  3. Waits for repmgr to promote a replica (max 60s)
  4. Asserts Gitea login POST still succeeds within 10s
  5. Asserts pgpool pod is Running (not CrashLoopBackOff)
  6. Asserts all postgresql pods return to Running

This test must pass before any PostgreSQL HA deployment is considered done.

Done when: script exits 0 against a live cluster.


T04 — Document the incident in docs/

id: T04
status: done
priority: medium
state_hub_task_id: "d8a3ba40-fda0-4c1f-a9f1-ffcd621a5b3d"

Add docs/incidents/2026-03-10-pgpool-missing-secret.md with the full timeline, root cause, and fix, so future operators understand what happened and how to recover.

Done when: doc committed and linked from docs/README.md.


References

  • Bitnami postgresql-ha chart v16.2.2
  • Gitea Helm chart v12.2.0
  • Related decision: D3 (HA testing policy) in DECISIONS.md