feat(pgpool): implement WP-0003 T01-T04 — permanent fix for pgpool-password bug
Some checks failed
railiance-tests / smoke (push) Has been cancelled

T01: helm/gitea-values.yaml with postgresql-ha.pgpool.adminPassword
     (fill REPLACE_WITH_PGPOOL_ADMIN_PASSWORD before helm upgrade)
T02: tests/smoke_kube.sh — add pgpool and postgresql-ha pod health checks
T03: tests/test_ha_failover.sh — D3 HA failover test script
T04: docs/incidents/2026-03-10-pgpool-missing-secret.md + README link

Also: make test-ha-failover target, Makefile .PHONY updated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
2026-03-10 14:16:22 +01:00
parent 42391c3b61
commit 660a63c674
7 changed files with 333 additions and 5 deletions

View File

@@ -10,6 +10,9 @@ k3s-install: ## Install k3s and Helm on all inventory hosts
smoke: ## Run Kubernetes smoke tests
bash tests/smoke_kube.sh
test-ha-failover: ## Run HA failover test (D3) — kills primary PG pod, asserts recovery
bash tests/test_ha_failover.sh $(if $(GITEA_URL),$(GITEA_URL),)
##@ Help
help: ## Show this help

View File

@@ -67,6 +67,10 @@ From two bare Linux servers, a Git repo, and valid credentials, you can rebuild
4. **Deploy services**
Install baseline services via Helm from the helm/ directory.
## Incidents
- [2026-03-10 — pgpool CrashLoopBackOff on PostgreSQL HA failover](incidents/2026-03-10-pgpool-missing-secret.md)
## 👥 Contributing
See CONTRIBUTING.md for rules, coding style, and workflow.

View File

@@ -0,0 +1,127 @@
# Incident: pgpool CrashLoopBackOff on PostgreSQL HA Failover
**Date:** 2026-03-10
**Severity:** High (Gitea write operations unavailable for ~4 hours)
**Component:** postgresql-ha subchart (Bitnami v16.2.2) via Gitea Helm chart v12.2.0
**Status:** Resolved — permanent fix pending `helm upgrade` with correct values
---
## Summary
A PostgreSQL HA failover caused the pgpool connection pooler to enter
CrashLoopBackOff. Gitea logins and all write operations hung silently for
approximately 4 hours. The root page continued to load (served from Valkey
cache), masking the failure.
Root cause: the `pgpool-password` key was absent from the
`gitea-postgresql-ha-postgresql` Kubernetes Secret. The Bitnami postgresql-ha
subchart does not populate this key automatically. The missing key had been
present since initial deployment (2025-08-31) but was never discovered because
the pgpool pod had not restarted in 20 days.
---
## Timeline
| Time (UTC) | Event |
|---|---|
| ~09:45 | `postgresql-0`, `postgresql-2` pods restarted (repmgr failover) |
| ~09:45 | pgpool pod restarted → CrashLoopBackOff (silent, no logs) |
| ~11:00 | User noticed Gitea login hanging; home page still loading |
| ~13:00 | Root cause identified: missing `pgpool-password` secret key |
| ~13:10 | Secret patched manually; pgpool pod deleted and restarted |
| ~13:15 | Gitea fully operational |
---
## Root Cause
The Bitnami `pgpool` container startup script reads
`/opt/bitnami/pgpool/secrets/pgpool-password`, mounted from the
`gitea-postgresql-ha-postgresql` Secret via `subPath`. That key was never
written by the Helm chart. The container exited immediately with no log output,
making it appear as a silent crash.
---
## Evidence
```bash
# Secret was missing pgpool-password — only these keys existed:
kubectl get secret -n default gitea-postgresql-ha-postgresql -o jsonpath='{.data}' | python3 -m json.tool
# password, postgres-password, repmgr-password — pgpool-password absent
# pgpool had 824 back-off restarts over 173 minutes with no logs
kubectl logs -n default <pgpool-pod> --previous
# (empty output)
# Gitea process had zero TCP connections to PostgreSQL (5432 = 0x1538)
cat /proc/<gitea-pid>/net/tcp | grep 1538 # no results
# All connections were to Valkey (6379 = 0x18EB)
```
---
## Immediate Fix (manual — will regress on helm upgrade)
```bash
# Base64 of the pgpool admin password
PASSWORD_B64=$(echo -n "<pgpool-admin-password>" | base64)
kubectl patch secret -n default gitea-postgresql-ha-postgresql \
--type='json' \
-p="[{\"op\":\"add\",\"path\":\"/data/pgpool-password\",\"value\":\"${PASSWORD_B64}\"}]"
kubectl delete pod -n default <pgpool-pod-name>
```
---
## Permanent Fix
Add `pgpool.adminPassword` to `helm/gitea-values.yaml` so the key is
present after every `helm upgrade`:
```bash
helm upgrade gitea gitea/gitea --values helm/gitea-values.yaml
```
See: `helm/gitea-values.yaml` — must be filled with the actual pgpool password
before running the upgrade.
---
## Decisions Triggered
**D3 — HA and failover scenarios must be tested before a workplan is considered done.**
Any workplan deploying an HA component is not complete until:
1. A failover test script in `tests/` passes against a live cluster
2. Smoke tests check the connection pooler/proxy, not just backing nodes
3. All required Helm values are in the versioned values file
See: `DECISIONS.md` and `workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md`
---
## Recovery Checklist
If pgpool enters CrashLoopBackOff again:
```bash
# 1. Verify the secret key exists
kubectl get secret -n default gitea-postgresql-ha-postgresql \
-o jsonpath='{.data.pgpool-password}'
# Empty output = key missing → apply patch above
# 2. After patching, force pgpool restart
kubectl delete pod -n default \
$(kubectl get pod -n default -l app.kubernetes.io/component=pgpool -o name)
# 3. Confirm Running state
kubectl get pods -n default | grep pgpool
# 4. Confirm Gitea can reach PostgreSQL
# In the Gitea pod: nc -zv gitea-postgresql-ha-pgpool 5432
```

19
helm/gitea-values.yaml Normal file
View File

@@ -0,0 +1,19 @@
# Gitea Helm values — railiance-cluster
# Chart: gitea v12.2.0 / postgresql-ha subchart v16.2.2
#
# SECURITY: This file contains sensitive values.
# Encrypt before committing: sops --encrypt --in-place helm/gitea-values.yaml
# Usage: helm upgrade gitea gitea/gitea --values helm/gitea-values.yaml
#
# To find current values on the cluster:
# sudo k3s kubectl get secret -n default gitea-postgresql-ha-postgresql -o yaml
postgresql-ha:
pgpool:
# FIX for WP-0003 / D3:
# The Bitnami postgresql-ha subchart (v16.2.2) does not write pgpool-password
# into the postgresql secret automatically. Without this key, pgpool enters
# CrashLoopBackOff on any pod restart (including HA failover).
# Value must match the sr-check-password used during initial deployment.
# Decode current value: kubectl get secret gitea-postgresql-ha-postgresql -o jsonpath='{.data.pgpool-password}' | base64 -d
adminPassword: "REPLACE_WITH_PGPOOL_ADMIN_PASSWORD"

View File

@@ -35,6 +35,30 @@ else
fail "Traefik ingress controller not running in kube-system"
fi
# ── postgresql-ha pods ───────────────────────────────────────────────────────
PG_NOT_RUNNING=$(kubectl get pods -n default -l app.kubernetes.io/name=postgresql-ha 2>/dev/null \
| grep -v "^NAME" | grep -v " Running " | wc -l)
if kubectl get pods -n default -l app.kubernetes.io/name=postgresql-ha 2>/dev/null | grep -q "Running"; then
if [[ "$PG_NOT_RUNNING" -eq 0 ]]; then
ok "All postgresql-ha pods Running"
else
fail "${PG_NOT_RUNNING} postgresql-ha pod(s) not in Running state"
fi
else
fail "No postgresql-ha pods found (is Gitea deployed?)"
fi
# ── pgpool (D3 requirement) ───────────────────────────────────────────────────
# pgpool CrashLoopBackOff is silent and only surfaces on pod restart/failover.
# A passing check here means the pgpool-password secret key is present.
PGPOOL_STATE=$(kubectl get pods -n default -l app.kubernetes.io/component=pgpool 2>/dev/null \
| grep -v "^NAME" | awk '{print $3}' | head -1)
if [[ "$PGPOOL_STATE" == "Running" ]]; then
ok "pgpool pod Running"
else
fail "pgpool pod not Running (state: ${PGPOOL_STATE:-not found}) — check pgpool-password secret key"
fi
# ── Summary ──────────────────────────────────────────────────────────────────
echo ""
echo "Results: ${PASS} passed, ${FAIL} failed"

151
tests/test_ha_failover.sh Executable file
View File

@@ -0,0 +1,151 @@
#!/usr/bin/env bash
# HA Failover Test — Decision D3
#
# Deliberately kills the primary PostgreSQL pod and asserts that:
# 1. Gitea remains accessible during failover
# 2. pgpool recovers to Running state
# 3. All postgresql-ha pods return to Running
#
# Must be run against a live cluster. Exits 0 on full pass.
# Run: bash tests/test_ha_failover.sh [GITEA_URL]
#
# GITEA_URL defaults to http://localhost:3000 — override for your ingress:
# bash tests/test_ha_failover.sh https://git.example.com
set -uo pipefail
GITEA_URL="${1:-http://localhost:3000}"
NAMESPACE="default"
FAILOVER_TIMEOUT=60 # seconds to wait for repmgr promotion
RECOVERY_TIMEOUT=120 # seconds to wait for all pods Running again
PASS=0
FAIL=0
ok() { echo "[OK] $*"; ((PASS++)) || true; }
fail() { echo "[FAIL] $*"; ((FAIL++)) || true; }
info() { echo "[INFO] $*"; }
# ── Pre-flight ────────────────────────────────────────────────────────────────
info "Target cluster: $(kubectl config current-context 2>/dev/null || echo 'default')"
info "Gitea URL: ${GITEA_URL}"
info "Namespace: ${NAMESPACE}"
echo ""
# Confirm postgresql-ha primary pod exists
PRIMARY_POD=$(kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name=postgresql-ha \
-o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
if [[ -z "$PRIMARY_POD" ]]; then
fail "No postgresql-ha pods found — is Gitea deployed?"
exit 1
fi
info "Primary pod to kill: ${PRIMARY_POD}"
# ── Baseline: Gitea accessible before failover ────────────────────────────────
info "Checking Gitea baseline..."
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "${GITEA_URL}" 2>/dev/null || echo "000")
if [[ "$HTTP_CODE" =~ ^[23] ]]; then
ok "Gitea accessible before failover (HTTP ${HTTP_CODE})"
else
fail "Gitea not accessible before failover (HTTP ${HTTP_CODE}) — aborting test"
exit 1
fi
# ── Trigger failover: kill primary pod ───────────────────────────────────────
info "Deleting primary pod ${PRIMARY_POD} to trigger failover..."
kubectl delete pod -n "${NAMESPACE}" "${PRIMARY_POD}" --grace-period=0
FAILOVER_START=$(date +%s)
# ── Wait for repmgr promotion ─────────────────────────────────────────────────
info "Waiting up to ${FAILOVER_TIMEOUT}s for a replica to be promoted..."
PROMOTED=false
while (( $(date +%s) - FAILOVER_START < FAILOVER_TIMEOUT )); do
RUNNING=$(kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name=postgresql-ha 2>/dev/null \
| grep " Running " | wc -l)
if [[ "$RUNNING" -ge 1 ]]; then
PROMOTED=true
ELAPSED=$(( $(date +%s) - FAILOVER_START ))
info "Replica promoted in ${ELAPSED}s"
break
fi
sleep 3
done
if $PROMOTED; then
ok "PostgreSQL replica promoted within ${FAILOVER_TIMEOUT}s"
else
fail "No replica promoted within ${FAILOVER_TIMEOUT}s"
fi
# ── Gitea accessible after failover ──────────────────────────────────────────
info "Checking Gitea accessibility after failover..."
GITEA_OK=false
for i in $(seq 1 10); do
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "${GITEA_URL}" 2>/dev/null || echo "000")
if [[ "$HTTP_CODE" =~ ^[23] ]]; then
GITEA_OK=true
break
fi
sleep 1
done
if $GITEA_OK; then
ok "Gitea accessible after failover (HTTP ${HTTP_CODE})"
else
fail "Gitea not accessible within 10s of failover (last HTTP ${HTTP_CODE})"
fi
# ── pgpool Running after failover ─────────────────────────────────────────────
info "Checking pgpool state..."
PGPOOL_OK=false
for i in $(seq 1 20); do
PGPOOL_STATE=$(kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/component=pgpool 2>/dev/null \
| grep -v "^NAME" | awk '{print $3}' | head -1)
if [[ "$PGPOOL_STATE" == "Running" ]]; then
PGPOOL_OK=true
break
fi
sleep 3
done
if $PGPOOL_OK; then
ok "pgpool pod Running after failover"
else
fail "pgpool not Running after failover (state: ${PGPOOL_STATE:-not found}) — missing pgpool-password?"
fi
# ── All postgresql-ha pods recover ───────────────────────────────────────────
info "Waiting up to ${RECOVERY_TIMEOUT}s for all postgresql-ha pods to return to Running..."
ALL_OK=false
RECOVERY_START=$(date +%s)
while (( $(date +%s) - RECOVERY_START < RECOVERY_TIMEOUT )); do
TOTAL=$(kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name=postgresql-ha 2>/dev/null \
| grep -v "^NAME" | wc -l)
RUNNING=$(kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name=postgresql-ha 2>/dev/null \
| grep " Running " | wc -l)
if [[ "$TOTAL" -gt 0 && "$TOTAL" -eq "$RUNNING" ]]; then
ALL_OK=true
ELAPSED=$(( $(date +%s) - RECOVERY_START ))
info "All ${TOTAL} postgresql-ha pods Running after ${ELAPSED}s"
break
fi
sleep 5
done
if $ALL_OK; then
ok "All postgresql-ha pods recovered to Running"
else
fail "Not all postgresql-ha pods recovered within ${RECOVERY_TIMEOUT}s"
kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name=postgresql-ha 2>/dev/null || true
fi
# ── Summary ───────────────────────────────────────────────────────────────────
echo ""
echo "Results: ${PASS} passed, ${FAIL} failed"
echo ""
if [[ "$FAIL" -gt 0 ]]; then
echo "FAILOVER TEST FAILED — review output above"
exit 1
else
echo "FAILOVER TEST PASSED — cluster is HA-verified (D3 satisfied)"
exit 0
fi

View File

@@ -4,7 +4,7 @@ type: bug-report
title: "pgpool CrashLoopBackOff on PostgreSQL HA failover — missing secret key"
domain: railiance
repo: railiance-cluster
status: open
status: active
owner: tegwick
created: "2026-03-10"
updated: "2026-03-10"
@@ -103,7 +103,7 @@ and the bug will recur.
```task
id: T01
status: open
status: done
priority: high
state_hub_task_id: "6841c93a-f146-47eb-9f7c-8fa0e02c1bbc"
```
@@ -126,7 +126,7 @@ without manual secret patching.
```task
id: T02
status: open
status: done
priority: high
state_hub_task_id: "ab166073-30a7-4702-a037-4091e8706e20"
```
@@ -150,7 +150,7 @@ kubectl get pod -n default -l app.kubernetes.io/component=pgpool \
```task
id: T03
status: open
status: done
priority: high
state_hub_task_id: "140da396-8e30-4f4d-b88c-c42c0cd46c01"
```
@@ -174,7 +174,7 @@ This test must pass before any PostgreSQL HA deployment is considered done.
```task
id: T04
status: open
status: done
priority: medium
state_hub_task_id: "d8a3ba40-fda0-4c1f-a9f1-ffcd621a5b3d"
```