feat(pgpool): implement WP-0003 T01-T04 — permanent fix for pgpool-password bug
Some checks failed
railiance-tests / smoke (push) Has been cancelled
T01: helm/gitea-values.yaml with postgresql-ha.pgpool.adminPassword
(fill REPLACE_WITH_PGPOOL_ADMIN_PASSWORD before helm upgrade)
T02: tests/smoke_kube.sh — add pgpool and postgresql-ha pod health checks
T03: tests/test_ha_failover.sh — D3 HA failover test script
T04: docs/incidents/2026-03-10-pgpool-missing-secret.md + README link
Also: make test-ha-failover target, Makefile .PHONY updated.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3
Makefile
@@ -10,6 +10,9 @@ k3s-install: ## Install k3s and Helm on all inventory hosts
smoke: ## Run Kubernetes smoke tests
	bash tests/smoke_kube.sh

test-ha-failover: ## Run HA failover test (D3) — kills primary PG pod, asserts recovery
	bash tests/test_ha_failover.sh $(if $(GITEA_URL),$(GITEA_URL),)

##@ Help

help: ## Show this help
@@ -67,6 +67,10 @@ From two bare Linux servers, a Git repo, and valid credentials, you can rebuild
4. **Deploy services**
   Install baseline services via Helm from the helm/ directory.

## Incidents

- [2026-03-10 — pgpool CrashLoopBackOff on PostgreSQL HA failover](incidents/2026-03-10-pgpool-missing-secret.md)

## 👥 Contributing

See CONTRIBUTING.md for rules, coding style, and workflow.
127
docs/incidents/2026-03-10-pgpool-missing-secret.md
Normal file
@@ -0,0 +1,127 @@
# Incident: pgpool CrashLoopBackOff on PostgreSQL HA Failover

**Date:** 2026-03-10
**Severity:** High (Gitea write operations unavailable for ~4 hours)
**Component:** postgresql-ha subchart (Bitnami v16.2.2) via Gitea Helm chart v12.2.0
**Status:** Resolved — permanent fix pending `helm upgrade` with correct values

---

## Summary

A PostgreSQL HA failover caused the pgpool connection pooler to enter
CrashLoopBackOff. Gitea logins and all write operations hung silently for
approximately 4 hours. The root page continued to load (served from Valkey
cache), masking the failure.

Root cause: the `pgpool-password` key was absent from the
`gitea-postgresql-ha-postgresql` Kubernetes Secret. The Bitnami postgresql-ha
subchart does not populate this key automatically. The key had been missing
since initial deployment (2025-08-31), but the gap went undiscovered because
the pgpool pod had not restarted in 20 days.

---

## Timeline

| Time (UTC) | Event |
|---|---|
| ~09:45 | `postgresql-0`, `postgresql-2` pods restarted (repmgr failover) |
| ~09:45 | pgpool pod restarted → CrashLoopBackOff (silent, no logs) |
| ~11:00 | User noticed Gitea login hanging; home page still loading |
| ~13:00 | Root cause identified: missing `pgpool-password` secret key |
| ~13:10 | Secret patched manually; pgpool pod deleted and restarted |
| ~13:15 | Gitea fully operational |

---

## Root Cause

The Bitnami `pgpool` container startup script reads
`/opt/bitnami/pgpool/secrets/pgpool-password`, mounted from the
`gitea-postgresql-ha-postgresql` Secret via `subPath`. That key was never
written by the Helm chart. The container exited immediately with no log output,
making it appear as a silent crash.

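A check along the following lines could surface a missing key before a pod restart exposes it. This is only a sketch: `missing_secret_keys` is a hypothetical helper, and the list of required keys is an assumption based on the keys named in this report. Its first argument is the JSON printed by `kubectl get secret ... -o jsonpath='{.data}'`; it prints every required key that is absent.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): print each required key that is
# missing from a Secret's .data JSON.
# $1 = JSON object as a string, remaining arguments = required key names.
missing_secret_keys() {
  local data_json="$1"; shift
  local key
  for key in "$@"; do
    # A key appears in the JSON as "name": so match that exact token.
    if ! grep -qF "\"${key}\":" <<<"${data_json}"; then
      echo "${key}"
    fi
  done
}

# Example using the key set observed during the incident (values are dummies).
sample='{"password":"eA==","postgres-password":"eA==","repmgr-password":"eA=="}'
missing_secret_keys "${sample}" password repmgr-password pgpool-password
```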

---

## Evidence

```bash
# Secret was missing pgpool-password — only these keys existed:
kubectl get secret -n default gitea-postgresql-ha-postgresql -o jsonpath='{.data}' | python3 -m json.tool
# password, postgres-password, repmgr-password — pgpool-password absent

# pgpool had 824 back-off restarts over 173 minutes with no logs
kubectl logs -n default <pgpool-pod> --previous
# (empty output)

# Gitea process had zero TCP connections to PostgreSQL (5432 = 0x1538)
cat /proc/<gitea-pid>/net/tcp | grep 1538 # no results
# All connections were to Valkey (6379 = 0x18EB)
```
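`/proc/<pid>/net/tcp` prints ports in hexadecimal, which is why the commands above search for `1538` rather than `5432`. The conversion can be reproduced as follows; `port_to_hex` is a hypothetical helper name used only for illustration.

```bash
# Hypothetical helper: format a TCP port the way /proc/net/tcp prints it
# (four uppercase hexadecimal digits).
port_to_hex() {
  printf '%04X\n' "$1"
}

port_to_hex 5432   # PostgreSQL / pgpool port -> 1538
port_to_hex 6379   # Valkey port              -> 18EB
```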

---

## Immediate Fix (manual — will regress on helm upgrade)

```bash
# Base64 of the pgpool admin password
PASSWORD_B64=$(echo -n "<pgpool-admin-password>" | base64)

kubectl patch secret -n default gitea-postgresql-ha-postgresql \
  --type='json' \
  -p="[{\"op\":\"add\",\"path\":\"/data/pgpool-password\",\"value\":\"${PASSWORD_B64}\"}]"

kubectl delete pod -n default <pgpool-pod-name>
```
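Building the patch document in a small function keeps the quoting in one place and lets the payload be reviewed before it is sent to the cluster. The sketch below assumes a hypothetical helper named `build_pgpool_patch`; its output could then be passed to `kubectl patch ... --type='json' -p`.

```bash
# Hypothetical helper: build the JSON Patch document for a given password.
build_pgpool_patch() {
  local password_b64
  # printf avoids a trailing newline in the input; tr removes any line
  # wrapping that base64 may add to long output.
  password_b64=$(printf '%s' "$1" | base64 | tr -d '\n')
  printf '[{"op":"add","path":"/data/pgpool-password","value":"%s"}]\n' "${password_b64}"
}

# Print the payload for review; "example-password" is a dummy value.
build_pgpool_patch "example-password"
```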

---

## Permanent Fix

Add `postgresql-ha.pgpool.adminPassword` to `helm/gitea-values.yaml` so the
key is present after every `helm upgrade`:

```bash
helm upgrade gitea gitea/gitea --values helm/gitea-values.yaml
```

See: `helm/gitea-values.yaml` — must be filled with the actual pgpool password
before running the upgrade.

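Because the committed values file ships with a placeholder, a guard like the one below could refuse to proceed while the placeholder is still present. This is a sketch under assumptions: `check_no_placeholder` is a hypothetical function, it matches any `REPLACE_WITH_` token, and it expects a plaintext (decrypted) values file.

```bash
# Hypothetical guard: fail if the values file still contains a REPLACE_WITH_
# placeholder. Assumes the file is plaintext, not SOPS-encrypted.
check_no_placeholder() {
  local values_file="$1"
  if grep -q 'REPLACE_WITH_' "${values_file}"; then
    echo "Refusing to upgrade: ${values_file} still contains a placeholder" >&2
    return 1
  fi
}

# Demonstration on a temporary file rather than the real values file.
demo_file=$(mktemp)
echo 'adminPassword: "REPLACE_WITH_PGPOOL_ADMIN_PASSWORD"' > "${demo_file}"
check_no_placeholder "${demo_file}" || echo "placeholder detected"
rm -f "${demo_file}"
```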

---

## Decisions Triggered

**D3 — HA and failover scenarios must be tested before a workplan is considered done.**

Any workplan deploying an HA component is not complete until:
1. A failover test script in `tests/` passes against a live cluster
2. Smoke tests check the connection pooler/proxy, not just backing nodes
3. All required Helm values are in the versioned values file

See: `DECISIONS.md` and `workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md`

---

## Recovery Checklist

If pgpool enters CrashLoopBackOff again:

```bash
# 1. Verify the secret key exists
kubectl get secret -n default gitea-postgresql-ha-postgresql \
  -o jsonpath='{.data.pgpool-password}'
# Empty output = key missing → apply patch above

# 2. After patching, force pgpool restart
kubectl delete pod -n default \
  $(kubectl get pod -n default -l app.kubernetes.io/component=pgpool -o name)

# 3. Confirm Running state
kubectl get pods -n default | grep pgpool

# 4. Confirm Gitea can reach PostgreSQL
# In the Gitea pod: nc -zv gitea-postgresql-ha-pgpool 5432
```
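Step 3 usually needs a few retries, since a freshly deleted pod takes time to return to Running. A bounded polling helper along these lines could replace manual re-checking. `wait_until` is a hypothetical function, and the example uses `true` as a stand-in condition so the sketch runs without a cluster; on a live cluster the condition would be a command that succeeds only when the pgpool pod reports Running.

```bash
# Hypothetical helper: retry a command until it succeeds or the timeout expires.
# $1 = timeout in seconds, $2 = poll interval in seconds, rest = command to run.
wait_until() {
  local timeout="$1" interval="$2"; shift 2
  local start
  start=$(date +%s)
  while (( $(date +%s) - start < timeout )); do
    if "$@"; then
      return 0
    fi
    sleep "${interval}"
  done
  return 1
}

# Stand-in condition; replace `true` with a real readiness check.
wait_until 5 1 true && echo "condition met"
```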
19
helm/gitea-values.yaml
Normal file
@@ -0,0 +1,19 @@
# Gitea Helm values — railiance-cluster
# Chart: gitea v12.2.0 / postgresql-ha subchart v16.2.2
#
# SECURITY: This file contains sensitive values.
# Encrypt before committing: sops --encrypt --in-place helm/gitea-values.yaml
# Usage: helm upgrade gitea gitea/gitea --values helm/gitea-values.yaml
#
# To find current values on the cluster:
#   sudo k3s kubectl get secret -n default gitea-postgresql-ha-postgresql -o yaml

postgresql-ha:
  pgpool:
    # FIX for WP-0003 / D3:
    # The Bitnami postgresql-ha subchart (v16.2.2) does not write pgpool-password
    # into the postgresql secret automatically. Without this key, pgpool enters
    # CrashLoopBackOff on any pod restart (including HA failover).
    # Value must match the sr-check-password used during initial deployment.
    # Decode current value: kubectl get secret gitea-postgresql-ha-postgresql -o jsonpath='{.data.pgpool-password}' | base64 -d
    adminPassword: "REPLACE_WITH_PGPOOL_ADMIN_PASSWORD"
@@ -35,6 +35,30 @@ else
  fail "Traefik ingress controller not running in kube-system"
fi

# ── postgresql-ha pods ───────────────────────────────────────────────────────
PG_NOT_RUNNING=$(kubectl get pods -n default -l app.kubernetes.io/name=postgresql-ha 2>/dev/null \
  | grep -v "^NAME" | grep -v " Running " | wc -l)
if kubectl get pods -n default -l app.kubernetes.io/name=postgresql-ha 2>/dev/null | grep -q "Running"; then
  if [[ "$PG_NOT_RUNNING" -eq 0 ]]; then
    ok "All postgresql-ha pods Running"
  else
    fail "${PG_NOT_RUNNING} postgresql-ha pod(s) not in Running state"
  fi
else
  fail "No postgresql-ha pods found (is Gitea deployed?)"
fi

# ── pgpool (D3 requirement) ───────────────────────────────────────────────────
# pgpool CrashLoopBackOff is silent and only surfaces on pod restart/failover.
# A passing check here means the pgpool-password secret key is present.
PGPOOL_STATE=$(kubectl get pods -n default -l app.kubernetes.io/component=pgpool 2>/dev/null \
  | grep -v "^NAME" | awk '{print $3}' | head -1)
if [[ "$PGPOOL_STATE" == "Running" ]]; then
  ok "pgpool pod Running"
else
  fail "pgpool pod not Running (state: ${PGPOOL_STATE:-not found}) — check pgpool-password secret key"
fi

# ── Summary ──────────────────────────────────────────────────────────────────
echo ""
echo "Results: ${PASS} passed, ${FAIL} failed"
151
tests/test_ha_failover.sh
Executable file
@@ -0,0 +1,151 @@
#!/usr/bin/env bash
# HA Failover Test — Decision D3
#
# Deliberately kills the primary PostgreSQL pod and asserts that:
#   1. Gitea remains accessible during failover
#   2. pgpool recovers to Running state
#   3. All postgresql-ha pods return to Running
#
# Must be run against a live cluster. Exits 0 on full pass.
# Run: bash tests/test_ha_failover.sh [GITEA_URL]
#
# GITEA_URL defaults to http://localhost:3000 — override for your ingress:
#   bash tests/test_ha_failover.sh https://git.example.com

set -uo pipefail

GITEA_URL="${1:-http://localhost:3000}"
NAMESPACE="default"
FAILOVER_TIMEOUT=60   # seconds to wait for repmgr promotion
RECOVERY_TIMEOUT=120  # seconds to wait for all pods Running again
PASS=0
FAIL=0

ok() { echo "[OK] $*"; ((PASS++)) || true; }
fail() { echo "[FAIL] $*"; ((FAIL++)) || true; }
info() { echo "[INFO] $*"; }

# ── Pre-flight ────────────────────────────────────────────────────────────────
info "Target cluster: $(kubectl config current-context 2>/dev/null || echo 'default')"
info "Gitea URL: ${GITEA_URL}"
info "Namespace: ${NAMESPACE}"
echo ""

# Confirm postgresql-ha primary pod exists
# NOTE: this selects the first pod matching the label; it is not verified to be
# the current repmgr primary.
PRIMARY_POD=$(kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name=postgresql-ha \
  -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
if [[ -z "$PRIMARY_POD" ]]; then
  fail "No postgresql-ha pods found — is Gitea deployed?"
  exit 1
fi
info "Primary pod to kill: ${PRIMARY_POD}"

# ── Baseline: Gitea accessible before failover ────────────────────────────────
info "Checking Gitea baseline..."
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "${GITEA_URL}" 2>/dev/null || echo "000")
if [[ "$HTTP_CODE" =~ ^[23] ]]; then
  ok "Gitea accessible before failover (HTTP ${HTTP_CODE})"
else
  fail "Gitea not accessible before failover (HTTP ${HTTP_CODE}) — aborting test"
  exit 1
fi

# ── Trigger failover: kill primary pod ───────────────────────────────────────
info "Deleting primary pod ${PRIMARY_POD} to trigger failover..."
kubectl delete pod -n "${NAMESPACE}" "${PRIMARY_POD}" --grace-period=0
FAILOVER_START=$(date +%s)

# ── Wait for repmgr promotion ─────────────────────────────────────────────────
# NOTE: this loop only confirms that at least one postgresql-ha pod is Running;
# it does not query repmgr to confirm which node became primary.
info "Waiting up to ${FAILOVER_TIMEOUT}s for a replica to be promoted..."
PROMOTED=false
while (( $(date +%s) - FAILOVER_START < FAILOVER_TIMEOUT )); do
  RUNNING=$(kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name=postgresql-ha 2>/dev/null \
    | grep " Running " | wc -l)
  if [[ "$RUNNING" -ge 1 ]]; then
    PROMOTED=true
    ELAPSED=$(( $(date +%s) - FAILOVER_START ))
    info "Replica promoted in ${ELAPSED}s"
    break
  fi
  sleep 3
done

if $PROMOTED; then
  ok "PostgreSQL replica promoted within ${FAILOVER_TIMEOUT}s"
else
  fail "No replica promoted within ${FAILOVER_TIMEOUT}s"
fi

# ── Gitea accessible after failover ──────────────────────────────────────────
# NOTE: per the incident report, the root page can still load from cache while
# the database is unreachable, so a 2xx/3xx response here does not by itself
# prove that logins and writes work.
info "Checking Gitea accessibility after failover..."
GITEA_OK=false
for i in $(seq 1 10); do
  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "${GITEA_URL}" 2>/dev/null || echo "000")
  if [[ "$HTTP_CODE" =~ ^[23] ]]; then
    GITEA_OK=true
    break
  fi
  sleep 1
done

if $GITEA_OK; then
  ok "Gitea accessible after failover (HTTP ${HTTP_CODE})"
else
  fail "Gitea not accessible within 10s of failover (last HTTP ${HTTP_CODE})"
fi

# ── pgpool Running after failover ─────────────────────────────────────────────
info "Checking pgpool state..."
PGPOOL_OK=false
for i in $(seq 1 20); do
  PGPOOL_STATE=$(kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/component=pgpool 2>/dev/null \
    | grep -v "^NAME" | awk '{print $3}' | head -1)
  if [[ "$PGPOOL_STATE" == "Running" ]]; then
    PGPOOL_OK=true
    break
  fi
  sleep 3
done

if $PGPOOL_OK; then
  ok "pgpool pod Running after failover"
else
  fail "pgpool not Running after failover (state: ${PGPOOL_STATE:-not found}) — missing pgpool-password?"
fi

# ── All postgresql-ha pods recover ───────────────────────────────────────────
info "Waiting up to ${RECOVERY_TIMEOUT}s for all postgresql-ha pods to return to Running..."
ALL_OK=false
RECOVERY_START=$(date +%s)
while (( $(date +%s) - RECOVERY_START < RECOVERY_TIMEOUT )); do
  TOTAL=$(kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name=postgresql-ha 2>/dev/null \
    | grep -v "^NAME" | wc -l)
  RUNNING=$(kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name=postgresql-ha 2>/dev/null \
    | grep " Running " | wc -l)
  if [[ "$TOTAL" -gt 0 && "$TOTAL" -eq "$RUNNING" ]]; then
    ALL_OK=true
    ELAPSED=$(( $(date +%s) - RECOVERY_START ))
    info "All ${TOTAL} postgresql-ha pods Running after ${ELAPSED}s"
    break
  fi
  sleep 5
done

if $ALL_OK; then
  ok "All postgresql-ha pods recovered to Running"
else
  fail "Not all postgresql-ha pods recovered within ${RECOVERY_TIMEOUT}s"
  kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name=postgresql-ha 2>/dev/null || true
fi

# ── Summary ───────────────────────────────────────────────────────────────────
echo ""
echo "Results: ${PASS} passed, ${FAIL} failed"
echo ""
if [[ "$FAIL" -gt 0 ]]; then
  echo "FAILOVER TEST FAILED — review output above"
  exit 1
else
  echo "FAILOVER TEST PASSED — cluster is HA-verified (D3 satisfied)"
  exit 0
fi
@@ -4,7 +4,7 @@ type: bug-report
 title: "pgpool CrashLoopBackOff on PostgreSQL HA failover — missing secret key"
 domain: railiance
 repo: railiance-cluster
-status: open
+status: active
 owner: tegwick
 created: "2026-03-10"
 updated: "2026-03-10"
@@ -103,7 +103,7 @@ and the bug will recur.
 
 ```task
 id: T01
-status: open
+status: done
 priority: high
 state_hub_task_id: "6841c93a-f146-47eb-9f7c-8fa0e02c1bbc"
 ```
@@ -126,7 +126,7 @@ without manual secret patching.
 
 ```task
 id: T02
-status: open
+status: done
 priority: high
 state_hub_task_id: "ab166073-30a7-4702-a037-4091e8706e20"
 ```
@@ -150,7 +150,7 @@ kubectl get pod -n default -l app.kubernetes.io/component=pgpool \
 
 ```task
 id: T03
-status: open
+status: done
 priority: high
 state_hub_task_id: "140da396-8e30-4f4d-b88c-c42c0cd46c01"
 ```
@@ -174,7 +174,7 @@ This test must pass before any PostgreSQL HA deployment is considered done.
 
 ```task
 id: T04
-status: open
+status: done
 priority: medium
 state_hub_task_id: "d8a3ba40-fda0-4c1f-a9f1-ffcd621a5b3d"
 ```