diff --git a/Makefile b/Makefile
index fb21817..7e0d4c3 100644
--- a/Makefile
+++ b/Makefile
@@ -10,6 +10,9 @@ k3s-install: ## Install k3s and Helm on all inventory hosts
 smoke: ## Run Kubernetes smoke tests
 	bash tests/smoke_kube.sh
 
+test-ha-failover: ## Run HA failover test (D3) — kills primary PG pod, asserts recovery
+	bash tests/test_ha_failover.sh $(if $(GITEA_URL),$(GITEA_URL),)
+
 ##@ Help
 
 help: ## Show this help
diff --git a/docs/README.md b/docs/README.md
index f8608ac..313c3bf 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -67,6 +67,10 @@ From two bare Linux servers, a Git repo, and valid credentials, you can rebuild
 4. **Deploy services**
    Install baseline services via Helm from the helm/ directory.
 
+## Incidents
+
+- [2026-03-10 — pgpool CrashLoopBackOff on PostgreSQL HA failover](incidents/2026-03-10-pgpool-missing-secret.md)
+
 ## 👥 Contributing
 
 See CONTRIBUTING.md for rules, coding style, and workflow.
diff --git a/docs/incidents/2026-03-10-pgpool-missing-secret.md b/docs/incidents/2026-03-10-pgpool-missing-secret.md
new file mode 100644
index 0000000..ad03fad
--- /dev/null
+++ b/docs/incidents/2026-03-10-pgpool-missing-secret.md
@@ -0,0 +1,127 @@
+# Incident: pgpool CrashLoopBackOff on PostgreSQL HA Failover
+
+**Date:** 2026-03-10
+**Severity:** High (Gitea write operations unavailable for ~4 hours)
+**Component:** postgresql-ha subchart (Bitnami v16.2.2) via Gitea Helm chart v12.2.0
+**Status:** Resolved — permanent fix pending `helm upgrade` with correct values
+
+---
+
+## Summary
+
+A PostgreSQL HA failover caused the pgpool connection pooler to enter
+CrashLoopBackOff. Gitea logins and all write operations hung silently for
+approximately 4 hours. The root page continued to load (served from Valkey
+cache), masking the failure.
+
+Root cause: the `pgpool-password` key was absent from the
+`gitea-postgresql-ha-postgresql` Kubernetes Secret. The Bitnami postgresql-ha
+subchart does not populate this key automatically.
+The key had been missing
+since initial deployment (2025-08-31), but this went undetected because
+the pgpool pod had not restarted in 20 days.
+
+---
+
+## Timeline
+
+| Time (UTC) | Event |
+|---|---|
+| ~09:45 | `postgresql-0`, `postgresql-2` pods restarted (repmgr failover) |
+| ~09:45 | pgpool pod restarted → CrashLoopBackOff (silent, no logs) |
+| ~11:00 | User noticed Gitea login hanging; home page still loading |
+| ~13:00 | Root cause identified: missing `pgpool-password` secret key |
+| ~13:10 | Secret patched manually; pgpool pod deleted and restarted |
+| ~13:15 | Gitea fully operational |
+
+---
+
+## Root Cause
+
+The Bitnami `pgpool` container startup script reads
+`/opt/bitnami/pgpool/secrets/pgpool-password`, mounted from the
+`gitea-postgresql-ha-postgresql` Secret via `subPath`. That key was never
+written by the Helm chart. The container exited immediately with no log output,
+making it appear as a silent crash.
+
+---
+
+## Evidence
+
+```bash
+# Secret was missing pgpool-password — only these keys existed:
+kubectl get secret -n default gitea-postgresql-ha-postgresql -o jsonpath='{.data}' | python3 -m json.tool
+# password, postgres-password, repmgr-password — pgpool-password absent
+
+# pgpool had 824 back-off restarts over 173 minutes with no logs
+kubectl logs -n default --previous
+# (empty output)
+
+# Gitea process had zero TCP connections to PostgreSQL (5432 = 0x1538)
+cat /proc//net/tcp | grep 1538 # no results
+# All connections were to Valkey (6379 = 0x18EB)
+```
+
+---
+
+## Immediate Fix (manual — will regress on helm upgrade)
+
+```bash
+# Base64 of the pgpool admin password
+PASSWORD_B64=$(echo -n "" | base64)
+
+kubectl patch secret -n default gitea-postgresql-ha-postgresql \
+  --type='json' \
+  -p="[{\"op\":\"add\",\"path\":\"/data/pgpool-password\",\"value\":\"${PASSWORD_B64}\"}]"
+
+kubectl delete pod -n default
+```
+
+---
+
+## Permanent Fix
+
+Add `pgpool.adminPassword` to `helm/gitea-values.yaml` so the key is
+present after every `helm upgrade`:
+
+```bash
+helm upgrade gitea gitea/gitea --values helm/gitea-values.yaml
+```
+
+See: `helm/gitea-values.yaml` — must be filled with the actual pgpool password
+before running the upgrade.
+
+---
+
+## Decisions Triggered
+
+**D3 — HA and failover scenarios must be tested before a workplan is considered done.**
+
+Any workplan deploying an HA component is not complete until:
+1. A failover test script in `tests/` passes against a live cluster
+2. Smoke tests check the connection pooler/proxy, not just backing nodes
+3. All required Helm values are in the versioned values file
+
+See: `DECISIONS.md` and `workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md`
+
+---
+
+## Recovery Checklist
+
+If pgpool enters CrashLoopBackOff again:
+
+```bash
+# 1. Verify the secret key exists
+kubectl get secret -n default gitea-postgresql-ha-postgresql \
+  -o jsonpath='{.data.pgpool-password}'
+# Empty output = key missing → apply patch above
+
+# 2. After patching, force pgpool restart
+kubectl delete pod -n default \
+  -l app.kubernetes.io/component=pgpool
+
+# 3. Confirm Running state
+kubectl get pods -n default | grep pgpool
+
+# 4. Confirm Gitea can reach PostgreSQL
+# In the Gitea pod: nc -zv gitea-postgresql-ha-pgpool 5432
+```
diff --git a/helm/gitea-values.yaml b/helm/gitea-values.yaml
new file mode 100644
index 0000000..c115f88
--- /dev/null
+++ b/helm/gitea-values.yaml
@@ -0,0 +1,19 @@
+# Gitea Helm values — railiance-cluster
+# Chart: gitea v12.2.0 / postgresql-ha subchart v16.2.2
+#
+# SECURITY: This file contains sensitive values.
+# Encrypt before committing: sops --encrypt --in-place helm/gitea-values.yaml
+# Usage: helm upgrade gitea gitea/gitea --values helm/gitea-values.yaml
+#
+# To find current values on the cluster:
+#   sudo k3s kubectl get secret -n default gitea-postgresql-ha-postgresql -o yaml
+
+postgresql-ha:
+  pgpool:
+    # FIX for WP-0003 / D3:
+    # The Bitnami postgresql-ha subchart (v16.2.2) does not write pgpool-password
+    # into the postgresql secret automatically. Without this key, pgpool enters
+    # CrashLoopBackOff on any pod restart (including HA failover).
+    # Value must match the sr-check-password used during initial deployment.
+    # Decode current value: kubectl get secret gitea-postgresql-ha-postgresql -o jsonpath='{.data.pgpool-password}' | base64 -d
+    adminPassword: "REPLACE_WITH_PGPOOL_ADMIN_PASSWORD"
diff --git a/tests/smoke_kube.sh b/tests/smoke_kube.sh
index 7143946..cd841ca 100644
--- a/tests/smoke_kube.sh
+++ b/tests/smoke_kube.sh
@@ -35,6 +35,30 @@ else
   fail "Traefik ingress controller not running in kube-system"
 fi
 
+# ── postgresql-ha pods ───────────────────────────────────────────────────────
+PG_NOT_RUNNING=$(kubectl get pods -n default -l app.kubernetes.io/name=postgresql-ha 2>/dev/null \
+  | grep -v "^NAME" | grep -v " Running " | wc -l)
+if kubectl get pods -n default -l app.kubernetes.io/name=postgresql-ha 2>/dev/null | grep -q "Running"; then
+  if [[ "$PG_NOT_RUNNING" -eq 0 ]]; then
+    ok "All postgresql-ha pods Running"
+  else
+    fail "${PG_NOT_RUNNING} postgresql-ha pod(s) not in Running state"
+  fi
+else
+  fail "No postgresql-ha pods found (is Gitea deployed?)"
+fi
+
+# ── pgpool (D3 requirement) ───────────────────────────────────────────────────
+# pgpool CrashLoopBackOff is silent and only surfaces on pod restart/failover.
+# A passing check here means the pgpool-password secret key is present.
+PGPOOL_STATE=$(kubectl get pods -n default -l app.kubernetes.io/component=pgpool 2>/dev/null \
+  | grep -v "^NAME" | awk '{print $3}' | head -1)
+if [[ "$PGPOOL_STATE" == "Running" ]]; then
+  ok "pgpool pod Running"
+else
+  fail "pgpool pod not Running (state: ${PGPOOL_STATE:-not found}) — check pgpool-password secret key"
+fi
+
 # ── Summary ──────────────────────────────────────────────────────────────────
 echo ""
 echo "Results: ${PASS} passed, ${FAIL} failed"
diff --git a/tests/test_ha_failover.sh b/tests/test_ha_failover.sh
new file mode 100755
index 0000000..4cec32d
--- /dev/null
+++ b/tests/test_ha_failover.sh
@@ -0,0 +1,151 @@
+#!/usr/bin/env bash
+# HA Failover Test — Decision D3
+#
+# Deliberately kills the primary PostgreSQL pod and asserts that:
+#   1. Gitea remains accessible during failover
+#   2. pgpool recovers to Running state
+#   3. All postgresql-ha pods return to Running
+#
+# Must be run against a live cluster. Exits 0 on full pass.
+# Run: bash tests/test_ha_failover.sh [GITEA_URL]
+#
+# GITEA_URL defaults to http://localhost:3000 — override for your ingress:
+#   bash tests/test_ha_failover.sh https://git.example.com
+
+set -uo pipefail
+
+GITEA_URL="${1:-http://localhost:3000}"
+NAMESPACE="default"
+FAILOVER_TIMEOUT=60 # seconds to wait for repmgr promotion
+RECOVERY_TIMEOUT=120 # seconds to wait for all pods Running again
+PASS=0
+FAIL=0
+
+ok() { echo "[OK] $*"; ((PASS++)) || true; }
+fail() { echo "[FAIL] $*"; ((FAIL++)) || true; }
+info() { echo "[INFO] $*"; }
+
+# ── Pre-flight ────────────────────────────────────────────────────────────────
+info "Target cluster: $(kubectl config current-context 2>/dev/null || echo 'default')"
+info "Gitea URL: ${GITEA_URL}"
+info "Namespace: ${NAMESPACE}"
+echo ""
+
+# Confirm postgresql-ha primary pod exists
+PRIMARY_POD=$(kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name=postgresql-ha \
+  -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
+if [[ -z "$PRIMARY_POD" ]]; then
+  fail "No postgresql-ha pods found — is Gitea deployed?"
+  exit 1
+fi
+info "Primary pod to kill: ${PRIMARY_POD}"
+
+# ── Baseline: Gitea accessible before failover ────────────────────────────────
+info "Checking Gitea baseline..."
+HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "${GITEA_URL}" 2>/dev/null || echo "000")
+if [[ "$HTTP_CODE" =~ ^[23] ]]; then
+  ok "Gitea accessible before failover (HTTP ${HTTP_CODE})"
+else
+  fail "Gitea not accessible before failover (HTTP ${HTTP_CODE}) — aborting test"
+  exit 1
+fi
+
+# ── Trigger failover: kill primary pod ───────────────────────────────────────
+info "Deleting primary pod ${PRIMARY_POD} to trigger failover..."
+kubectl delete pod -n "${NAMESPACE}" "${PRIMARY_POD}" --grace-period=0
+FAILOVER_START=$(date +%s)
+
+# ── Wait for repmgr promotion ─────────────────────────────────────────────────
+info "Waiting up to ${FAILOVER_TIMEOUT}s for a replica to be promoted..."
+PROMOTED=false
+while (( $(date +%s) - FAILOVER_START < FAILOVER_TIMEOUT )); do
+  RUNNING=$(kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name=postgresql-ha 2>/dev/null \
+    | grep " Running " | wc -l)
+  if [[ "$RUNNING" -ge 1 ]]; then
+    PROMOTED=true
+    ELAPSED=$(( $(date +%s) - FAILOVER_START ))
+    info "Replica promoted in ${ELAPSED}s"
+    break
+  fi
+  sleep 3
+done
+
+if $PROMOTED; then
+  ok "PostgreSQL replica promoted within ${FAILOVER_TIMEOUT}s"
+else
+  fail "No replica promoted within ${FAILOVER_TIMEOUT}s"
+fi
+
+# ── Gitea accessible after failover ──────────────────────────────────────────
+info "Checking Gitea accessibility after failover..."
+GITEA_OK=false
+for i in $(seq 1 10); do
+  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "${GITEA_URL}" 2>/dev/null || echo "000")
+  if [[ "$HTTP_CODE" =~ ^[23] ]]; then
+    GITEA_OK=true
+    break
+  fi
+  sleep 1
+done
+
+if $GITEA_OK; then
+  ok "Gitea accessible after failover (HTTP ${HTTP_CODE})"
+else
+  fail "Gitea not accessible within 10s of failover (last HTTP ${HTTP_CODE})"
+fi
+
+# ── pgpool Running after failover ─────────────────────────────────────────────
+info "Checking pgpool state..."
+PGPOOL_OK=false
+for i in $(seq 1 20); do
+  PGPOOL_STATE=$(kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/component=pgpool 2>/dev/null \
+    | grep -v "^NAME" | awk '{print $3}' | head -1)
+  if [[ "$PGPOOL_STATE" == "Running" ]]; then
+    PGPOOL_OK=true
+    break
+  fi
+  sleep 3
+done
+
+if $PGPOOL_OK; then
+  ok "pgpool pod Running after failover"
+else
+  fail "pgpool not Running after failover (state: ${PGPOOL_STATE:-not found}) — missing pgpool-password?"
+fi
+
+# ── All postgresql-ha pods recover ───────────────────────────────────────────
+info "Waiting up to ${RECOVERY_TIMEOUT}s for all postgresql-ha pods to return to Running..."
+ALL_OK=false
+RECOVERY_START=$(date +%s)
+while (( $(date +%s) - RECOVERY_START < RECOVERY_TIMEOUT )); do
+  TOTAL=$(kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name=postgresql-ha 2>/dev/null \
+    | grep -v "^NAME" | wc -l)
+  RUNNING=$(kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name=postgresql-ha 2>/dev/null \
+    | grep " Running " | wc -l)
+  if [[ "$TOTAL" -gt 0 && "$TOTAL" -eq "$RUNNING" ]]; then
+    ALL_OK=true
+    ELAPSED=$(( $(date +%s) - RECOVERY_START ))
+    info "All ${TOTAL} postgresql-ha pods Running after ${ELAPSED}s"
+    break
+  fi
+  sleep 5
+done
+
+if $ALL_OK; then
+  ok "All postgresql-ha pods recovered to Running"
+else
+  fail "Not all postgresql-ha pods recovered within ${RECOVERY_TIMEOUT}s"
+  kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name=postgresql-ha 2>/dev/null || true
+fi
+
+# ── Summary ───────────────────────────────────────────────────────────────────
+echo ""
+echo "Results: ${PASS} passed, ${FAIL} failed"
+echo ""
+if [[ "$FAIL" -gt 0 ]]; then
+  echo "FAILOVER TEST FAILED — review output above"
+  exit 1
+else
+  echo "FAILOVER TEST PASSED — cluster is HA-verified (D3 satisfied)"
+  exit 0
+fi
diff --git a/workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md b/workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md
index db69ba1..4205f47 100644
--- a/workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md
+++ b/workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md
@@ -4,7 +4,7 @@ type: bug-report
 title: "pgpool CrashLoopBackOff on PostgreSQL HA failover — missing secret key"
 domain: railiance
 repo: railiance-cluster
-status: open
+status: active
 owner: tegwick
 created: "2026-03-10"
 updated: "2026-03-10"
@@ -103,7 +103,7 @@ and the bug will recur.
 
 ```task
 id: T01
-status: open
+status: done
 priority: high
 state_hub_task_id: "6841c93a-f146-47eb-9f7c-8fa0e02c1bbc"
 ```
@@ -126,7 +126,7 @@ without manual secret patching.
 
 ```task
 id: T02
-status: open
+status: done
 priority: high
 state_hub_task_id: "ab166073-30a7-4702-a037-4091e8706e20"
 ```
@@ -150,7 +150,7 @@ kubectl get pod -n default -l app.kubernetes.io/component=pgpool \
 
 ```task
 id: T03
-status: open
+status: done
 priority: high
 state_hub_task_id: "140da396-8e30-4f4d-b88c-c42c0cd46c01"
 ```
@@ -174,7 +174,7 @@ This test must pass before any PostgreSQL HA deployment is considered done.
 
 ```task
 id: T04
-status: open
+status: done
 priority: medium
 state_hub_task_id: "d8a3ba40-fda0-4c1f-a9f1-ffcd621a5b3d"
 ```