feat(pgpool): implement WP-0003 T01-T04 — permanent fix for pgpool-password bug

T01: helm/gitea-values.yaml with postgresql-ha.pgpool.adminPassword (fill REPLACE_WITH_PGPOOL_ADMIN_PASSWORD before helm upgrade) T02: tests/smoke_kube.sh — add pgpool and postgresql-ha pod health checks T03: tests/test_ha_failover.sh — D3 HA failover test script T04: docs/incidents/2026-03-10-pgpool-missing-secret.md + README link Also: make test-ha-failover target, Makefile .PHONY updated. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 14:16:22 +01:00
parent 42391c3b61
commit 660a63c674
7 changed files with 333 additions and 5 deletions
--- a/3
+++ b/3
@@ -10,6 +10,9 @@ k3s-install: ## Install k3s and Helm on all inventory hosts
 smoke: ## Run Kubernetes smoke tests
 	bash tests/smoke_kube.sh

+test-ha-failover: ## Run HA failover test (D3) — kills primary PG pod, asserts recovery
+	bash tests/test_ha_failover.sh $(if $(GITEA_URL),$(GITEA_URL),)
+
 ##@ Help

 help: ## Show this help
--- a/docs/README.md
+++ b/docs/README.md
@@ -67,6 +67,10 @@ From two bare Linux servers, a Git repo, and valid credentials, you can rebuild
 4. **Deploy services**
   Install baseline services via Helm from the helm/ directory.

+## Incidents
+
+- [2026-03-10 — pgpool CrashLoopBackOff on PostgreSQL HA failover](incidents/2026-03-10-pgpool-missing-secret.md)
+
 ## 👥 Contributing

 See CONTRIBUTING.md for rules, coding style, and workflow.
--- a/docs/incidents/2026-03-10-pgpool-missing-secret.md
+++ b/docs/incidents/2026-03-10-pgpool-missing-secret.md
@@ -0,0 +1,127 @@
+# Incident: pgpool CrashLoopBackOff on PostgreSQL HA Failover
+
+**Date:** 2026-03-10
+**Severity:** High (Gitea write operations unavailable for ~4 hours)
+**Component:** postgresql-ha subchart (Bitnami v16.2.2) via Gitea Helm chart v12.2.0
+**Status:** Resolved — permanent fix pending `helm upgrade` with correct values
+
+---
+
+## Summary
+
+A PostgreSQL HA failover caused the pgpool connection pooler to enter
+CrashLoopBackOff. Gitea logins and all write operations hung silently for
+approximately 4 hours. The root page continued to load (served from Valkey
+cache), masking the failure.
+
+Root cause: the `pgpool-password` key was absent from the
+`gitea-postgresql-ha-postgresql` Kubernetes Secret. The Bitnami postgresql-ha
+subchart does not populate this key automatically. The missing key had been
+present since initial deployment (2025-08-31) but was never discovered because
+the pgpool pod had not restarted in 20 days.
+
+---
+
+## Timeline
+
+| Time (UTC) | Event |
+|---|---|
+| ~09:45 | `postgresql-0`, `postgresql-2` pods restarted (repmgr failover) |
+| ~09:45 | pgpool pod restarted → CrashLoopBackOff (silent, no logs) |
+| ~11:00 | User noticed Gitea login hanging; home page still loading |
+| ~13:00 | Root cause identified: missing `pgpool-password` secret key |
+| ~13:10 | Secret patched manually; pgpool pod deleted and restarted |
+| ~13:15 | Gitea fully operational |
+
+---
+
+## Root Cause
+
+The Bitnami `pgpool` container startup script reads
+`/opt/bitnami/pgpool/secrets/pgpool-password`, mounted from the
+`gitea-postgresql-ha-postgresql` Secret via `subPath`. That key was never
+written by the Helm chart. The container exited immediately with no log output,
+making it appear as a silent crash.
+
+---
+
+## Evidence
+
+```bash
+# Secret was missing pgpool-password — only these keys existed:
+kubectl get secret -n default gitea-postgresql-ha-postgresql -o jsonpath='{.data}' | python3 -m json.tool
+# password, postgres-password, repmgr-password — pgpool-password absent
+
+# pgpool had 824 back-off restarts over 173 minutes with no logs
+kubectl logs -n default <pgpool-pod> --previous
+# (empty output)
+
+# Gitea process had zero TCP connections to PostgreSQL (5432 = 0x1538)
+cat /proc/<gitea-pid>/net/tcp | grep 1538   # no results
+# All connections were to Valkey (6379 = 0x18EB)
+```
+
+---
+
+## Immediate Fix (manual — will regress on helm upgrade)
+
+```bash
+# Base64 of the pgpool admin password
+PASSWORD_B64=$(echo -n "<pgpool-admin-password>" | base64)
+
+kubectl patch secret -n default gitea-postgresql-ha-postgresql \
+  --type='json' \
+  -p="[{\"op\":\"add\",\"path\":\"/data/pgpool-password\",\"value\":\"${PASSWORD_B64}\"}]"
+
+kubectl delete pod -n default <pgpool-pod-name>
+```
+
+---
+
+## Permanent Fix
+
+Add `pgpool.adminPassword` to `helm/gitea-values.yaml` so the key is
+present after every `helm upgrade`:
+
+```bash
+helm upgrade gitea gitea/gitea --values helm/gitea-values.yaml
+```
+
+See: `helm/gitea-values.yaml` — must be filled with the actual pgpool password
+before running the upgrade.
+
+---
+
+## Decisions Triggered
+
+**D3 — HA and failover scenarios must be tested before a workplan is considered done.**
+
+Any workplan deploying an HA component is not complete until:
+1. A failover test script in `tests/` passes against a live cluster
+2. Smoke tests check the connection pooler/proxy, not just backing nodes
+3. All required Helm values are in the versioned values file
+
+See: `DECISIONS.md` and `workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md`
+
+---
+
+## Recovery Checklist
+
+If pgpool enters CrashLoopBackOff again:
+
+```bash
+# 1. Verify the secret key exists
+kubectl get secret -n default gitea-postgresql-ha-postgresql \
+  -o jsonpath='{.data.pgpool-password}'
+# Empty output = key missing → apply patch above
+
+# 2. After patching, force pgpool restart
+kubectl delete pod -n default \
+  $(kubectl get pod -n default -l app.kubernetes.io/component=pgpool -o name)
+
+# 3. Confirm Running state
+kubectl get pods -n default | grep pgpool
+
+# 4. Confirm Gitea can reach PostgreSQL
+# In the Gitea pod: nc -zv gitea-postgresql-ha-pgpool 5432
+```
--- a/helm/gitea-values.yaml
+++ b/helm/gitea-values.yaml
@@ -0,0 +1,19 @@
+# Gitea Helm values — railiance-cluster
+# Chart: gitea v12.2.0 / postgresql-ha subchart v16.2.2
+#
+# SECURITY: This file contains sensitive values.
+# Encrypt before committing: sops --encrypt --in-place helm/gitea-values.yaml
+# Usage: helm upgrade gitea gitea/gitea --values helm/gitea-values.yaml
+#
+# To find current values on the cluster:
+#   sudo k3s kubectl get secret -n default gitea-postgresql-ha-postgresql -o yaml
+
+postgresql-ha:
+  pgpool:
+    # FIX for WP-0003 / D3:
+    # The Bitnami postgresql-ha subchart (v16.2.2) does not write pgpool-password
+    # into the postgresql secret automatically. Without this key, pgpool enters
+    # CrashLoopBackOff on any pod restart (including HA failover).
+    # Value must match the sr-check-password used during initial deployment.
+    # Decode current value: kubectl get secret gitea-postgresql-ha-postgresql -o jsonpath='{.data.pgpool-password}' | base64 -d
+    adminPassword: "REPLACE_WITH_PGPOOL_ADMIN_PASSWORD"
--- a/tests/smoke_kube.sh
+++ b/tests/smoke_kube.sh
@@ -35,6 +35,30 @@ else
  fail "Traefik ingress controller not running in kube-system"
 fi

+# ── postgresql-ha pods ───────────────────────────────────────────────────────
+PG_NOT_RUNNING=$(kubectl get pods -n default -l app.kubernetes.io/name=postgresql-ha 2>/dev/null \
+  | grep -v "^NAME" | grep -v " Running " | wc -l)
+if kubectl get pods -n default -l app.kubernetes.io/name=postgresql-ha 2>/dev/null | grep -q "Running"; then
+  if [[ "$PG_NOT_RUNNING" -eq 0 ]]; then
+    ok "All postgresql-ha pods Running"
+  else
+    fail "${PG_NOT_RUNNING} postgresql-ha pod(s) not in Running state"
+  fi
+else
+  fail "No postgresql-ha pods found (is Gitea deployed?)"
+fi
+
+# ── pgpool (D3 requirement) ───────────────────────────────────────────────────
+# pgpool CrashLoopBackOff is silent and only surfaces on pod restart/failover.
+# A passing check here means the pgpool-password secret key is present.
+PGPOOL_STATE=$(kubectl get pods -n default -l app.kubernetes.io/component=pgpool 2>/dev/null \
+  | grep -v "^NAME" | awk '{print $3}' | head -1)
+if [[ "$PGPOOL_STATE" == "Running" ]]; then
+  ok "pgpool pod Running"
+else
+  fail "pgpool pod not Running (state: ${PGPOOL_STATE:-not found}) — check pgpool-password secret key"
+fi
+
 # ── Summary ──────────────────────────────────────────────────────────────────
 echo ""
 echo "Results: ${PASS} passed, ${FAIL} failed"
--- a/tests/test_ha_failover.sh
+++ b/tests/test_ha_failover.sh
@@ -0,0 +1,151 @@
+#!/usr/bin/env bash
+# HA Failover Test — Decision D3
+#
+# Deliberately kills the primary PostgreSQL pod and asserts that:
+#   1. Gitea remains accessible during failover
+#   2. pgpool recovers to Running state
+#   3. All postgresql-ha pods return to Running
+#
+# Must be run against a live cluster. Exits 0 on full pass.
+# Run: bash tests/test_ha_failover.sh [GITEA_URL]
+#
+# GITEA_URL defaults to http://localhost:3000 — override for your ingress:
+#   bash tests/test_ha_failover.sh https://git.example.com
+
+set -uo pipefail
+
+GITEA_URL="${1:-http://localhost:3000}"
+NAMESPACE="default"
+FAILOVER_TIMEOUT=60   # seconds to wait for repmgr promotion
+RECOVERY_TIMEOUT=120  # seconds to wait for all pods Running again
+PASS=0
+FAIL=0
+
+ok()   { echo "[OK]   $*"; ((PASS++)) || true; }
+fail() { echo "[FAIL] $*"; ((FAIL++)) || true; }
+info() { echo "[INFO] $*"; }
+
+# ── Pre-flight ────────────────────────────────────────────────────────────────
+info "Target cluster: $(kubectl config current-context 2>/dev/null || echo 'default')"
+info "Gitea URL: ${GITEA_URL}"
+info "Namespace: ${NAMESPACE}"
+echo ""
+
+# Confirm postgresql-ha primary pod exists
+PRIMARY_POD=$(kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name=postgresql-ha \
+  -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
+if [[ -z "$PRIMARY_POD" ]]; then
+  fail "No postgresql-ha pods found — is Gitea deployed?"
+  exit 1
+fi
+info "Primary pod to kill: ${PRIMARY_POD}"
+
+# ── Baseline: Gitea accessible before failover ────────────────────────────────
+info "Checking Gitea baseline..."
+HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "${GITEA_URL}" 2>/dev/null || echo "000")
+if [[ "$HTTP_CODE" =~ ^[23] ]]; then
+  ok "Gitea accessible before failover (HTTP ${HTTP_CODE})"
+else
+  fail "Gitea not accessible before failover (HTTP ${HTTP_CODE}) — aborting test"
+  exit 1
+fi
+
+# ── Trigger failover: kill primary pod ───────────────────────────────────────
+info "Deleting primary pod ${PRIMARY_POD} to trigger failover..."
+kubectl delete pod -n "${NAMESPACE}" "${PRIMARY_POD}" --grace-period=0
+FAILOVER_START=$(date +%s)
+
+# ── Wait for repmgr promotion ─────────────────────────────────────────────────
+info "Waiting up to ${FAILOVER_TIMEOUT}s for a replica to be promoted..."
+PROMOTED=false
+while (( $(date +%s) - FAILOVER_START < FAILOVER_TIMEOUT )); do
+  RUNNING=$(kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name=postgresql-ha 2>/dev/null \
+    | grep " Running " | wc -l)
+  if [[ "$RUNNING" -ge 1 ]]; then
+    PROMOTED=true
+    ELAPSED=$(( $(date +%s) - FAILOVER_START ))
+    info "Replica promoted in ${ELAPSED}s"
+    break
+  fi
+  sleep 3
+done
+
+if $PROMOTED; then
+  ok "PostgreSQL replica promoted within ${FAILOVER_TIMEOUT}s"
+else
+  fail "No replica promoted within ${FAILOVER_TIMEOUT}s"
+fi
+
+# ── Gitea accessible after failover ──────────────────────────────────────────
+info "Checking Gitea accessibility after failover..."
+GITEA_OK=false
+for i in $(seq 1 10); do
+  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "${GITEA_URL}" 2>/dev/null || echo "000")
+  if [[ "$HTTP_CODE" =~ ^[23] ]]; then
+    GITEA_OK=true
+    break
+  fi
+  sleep 1
+done
+
+if $GITEA_OK; then
+  ok "Gitea accessible after failover (HTTP ${HTTP_CODE})"
+else
+  fail "Gitea not accessible within 10s of failover (last HTTP ${HTTP_CODE})"
+fi
+
+# ── pgpool Running after failover ─────────────────────────────────────────────
+info "Checking pgpool state..."
+PGPOOL_OK=false
+for i in $(seq 1 20); do
+  PGPOOL_STATE=$(kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/component=pgpool 2>/dev/null \
+    | grep -v "^NAME" | awk '{print $3}' | head -1)
+  if [[ "$PGPOOL_STATE" == "Running" ]]; then
+    PGPOOL_OK=true
+    break
+  fi
+  sleep 3
+done
+
+if $PGPOOL_OK; then
+  ok "pgpool pod Running after failover"
+else
+  fail "pgpool not Running after failover (state: ${PGPOOL_STATE:-not found}) — missing pgpool-password?"
+fi
+
+# ── All postgresql-ha pods recover ───────────────────────────────────────────
+info "Waiting up to ${RECOVERY_TIMEOUT}s for all postgresql-ha pods to return to Running..."
+ALL_OK=false
+RECOVERY_START=$(date +%s)
+while (( $(date +%s) - RECOVERY_START < RECOVERY_TIMEOUT )); do
+  TOTAL=$(kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name=postgresql-ha 2>/dev/null \
+    | grep -v "^NAME" | wc -l)
+  RUNNING=$(kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name=postgresql-ha 2>/dev/null \
+    | grep " Running " | wc -l)
+  if [[ "$TOTAL" -gt 0 && "$TOTAL" -eq "$RUNNING" ]]; then
+    ALL_OK=true
+    ELAPSED=$(( $(date +%s) - RECOVERY_START ))
+    info "All ${TOTAL} postgresql-ha pods Running after ${ELAPSED}s"
+    break
+  fi
+  sleep 5
+done
+
+if $ALL_OK; then
+  ok "All postgresql-ha pods recovered to Running"
+else
+  fail "Not all postgresql-ha pods recovered within ${RECOVERY_TIMEOUT}s"
+  kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name=postgresql-ha 2>/dev/null || true
+fi
+
+# ── Summary ───────────────────────────────────────────────────────────────────
+echo ""
+echo "Results: ${PASS} passed, ${FAIL} failed"
+echo ""
+if [[ "$FAIL" -gt 0 ]]; then
+  echo "FAILOVER TEST FAILED — review output above"
+  exit 1
+else
+  echo "FAILOVER TEST PASSED — cluster is HA-verified (D3 satisfied)"
+  exit 0
+fi
--- a/workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md
+++ b/workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md
@@ -4,7 +4,7 @@ type: bug-report
 title: "pgpool CrashLoopBackOff on PostgreSQL HA failover — missing secret key"
 domain: railiance
 repo: railiance-cluster
-status: open
+status: active
 owner: tegwick
 created: "2026-03-10"
 updated: "2026-03-10"
@@ -103,7 +103,7 @@ and the bug will recur.

 ```task
 id: T01
-status: open
+status: done
 priority: high
 state_hub_task_id: "6841c93a-f146-47eb-9f7c-8fa0e02c1bbc"
 ```
@@ -126,7 +126,7 @@ without manual secret patching.

 ```task
 id: T02
-status: open
+status: done
 priority: high
 state_hub_task_id: "ab166073-30a7-4702-a037-4091e8706e20"
 ```
@@ -150,7 +150,7 @@ kubectl get pod -n default -l app.kubernetes.io/component=pgpool \

 ```task
 id: T03
-status: open
+status: done
 priority: high
 state_hub_task_id: "140da396-8e30-4f4d-b88c-c42c0cd46c01"
 ```
@@ -174,7 +174,7 @@ This test must pass before any PostgreSQL HA deployment is considered done.

 ```task
 id: T04
-status: open
+status: done
 priority: medium
 state_hub_task_id: "d8a3ba40-fda0-4c1f-a9f1-ffcd621a5b3d"
 ```