bug(gitea): report pgpool CrashLoopBackOff on HA failover + D3 testing policy
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Add RAIL-BS-WP-0003 documenting the 2026-03-10 incident where a PostgreSQL HA failover caused pgpool to enter CrashLoopBackOff due to a missing pgpool-password key in the gitea-postgresql-ha-postgresql secret, a bug present since initial deployment but hidden by the lack of any pod restart.

Add Decision D3: HA and failover scenarios must be tested before a workplan is considered done. Any HA component deployment requires a passing failover test script in tests/ and complete Helm values before status = completed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
45
DECISIONS.md
@@ -19,3 +19,48 @@ I want to go with C and separate concerns. Nginx for external SSL will need secu
We will use cloud storage. The backup should be encrypted so that it is safe regardless of location and provider, and for starters I will provide a Nextcloud upload space as a backend.

---

## D3 — HA and failover scenarios must be tested before a workplan is considered done

**Date:** 2026-03-10
**Decided by:** Tegwick

On 2026-03-10 a PostgreSQL HA failover exposed a bug (missing `pgpool-password`
secret key) that had been present since initial deployment on 2025-08-31 but was
never discovered because no pod restart had occurred in 20 days. The immediate
symptom was Gitea logins hanging silently for hours.

This incident showed that deploying an HA component and declaring it "done"
without ever triggering a failover gives false confidence. Infrastructure that
has never failed over is not HA; it is just redundant hardware.

**Policy:**

Any workplan that deploys or configures a High Availability component
(database cluster, replicated storage, redundant ingress, etc.) is **not
complete** until a failover test passes. Specifically:

1. A test script in `tests/` must exist that deliberately kills the primary
   component and asserts the service remains available within an acceptable
   recovery window.

2. The test must be run against a live cluster and exit 0 before the workplan
   status is set to `completed`.

3. Smoke tests (`tests/smoke_kube.sh` or equivalent) must include a health
   check for each HA component's connection pooler, proxy, or load balancer,
   not just the backing nodes.

4. Any Helm chart values required to make HA work correctly (secrets,
   passwords, topology settings) must be present in the versioned values file
   before the workplan is closed, so that a `helm upgrade` cannot silently
   regress the fix.

**Rationale:** A failure that only surfaces on the first real event (restart,
failover, node loss) is a deployment bug, not an operational surprise. Railiance
aims for calm ops, and calm ops requires that every failure mode we know about
has been tested before it matters.

See: `workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md`

---

189
workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md
Normal file
@@ -0,0 +1,189 @@
---
id: RAIL-BS-WP-0003
type: bug-report
title: "pgpool CrashLoopBackOff on PostgreSQL HA failover — missing secret key"
domain: railiance
repo: railiance-cluster
status: open
owner: tegwick
created: "2026-03-10"
updated: "2026-03-10"
---

# Bug Report: pgpool CrashLoopBackOff on PostgreSQL HA failover

## Summary

On 2026-03-10 a PostgreSQL HA failover caused all three postgresql pods to
restart. pgpool, the connection pooler between Gitea and PostgreSQL, then
entered CrashLoopBackOff and produced no logs. As a result, Gitea's login
and all write operations hung indefinitely. The root page was still served
(from Valkey cache), which masked the failure.

The fix was to patch a missing key into a Kubernetes secret. The root cause is
that the `gitea-12.2.0` Helm chart (postgresql-ha subchart v16.2.2) does not
populate the `pgpool-password` key in the `gitea-postgresql-ha-postgresql`
secret, even though the pgpool pod requires it at startup.

---

## Timeline

| Time (UTC) | Event |
|---|---|
| ~09:45 | postgresql-0, postgresql-2 pods restarted (repmgr failover) |
| ~09:45 | pgpool pod restarted and entered CrashLoopBackOff |
| ~11:00 | User noticed Gitea login hanging; home page still loading |
| ~13:00 | Root cause identified: missing `pgpool-password` secret key |
| ~13:10 | Secret patched; pgpool pod deleted and restarted cleanly |
| ~13:15 | Gitea fully operational |

---

## Root Cause

The Bitnami `pgpool` container startup script reads the file
`/opt/bitnami/pgpool/secrets/pgpool-password`, which is mounted from the
`gitea-postgresql-ha-postgresql` Kubernetes Secret via a `subPath` volume
mount. That secret key was never created by the Helm chart, so the file did
not exist. The container exited immediately with no logs.

The pod had been running for 20 days without a restart, so this gap was
never discovered during initial deployment.

---

## Evidence

```bash
# Secret was missing the pgpool-password key
sudo k3s kubectl get secret -n default gitea-postgresql-ha-postgresql -o yaml
# data: keys were password, postgres-password, repmgr-password only
# pgpool-password was absent

# pgpool pod describe showed 824 back-off restarts over 173 minutes
# No logs in either current or --previous output
sudo k3s kubectl logs -n default <pgpool-pod> --previous
# (empty)

# Gitea process had zero TCP connections to PostgreSQL port 5432
# but many connections to Valkey port 6379
cat /proc/<gitea-pid>/net/tcp | grep 1538   # 1538 is 5432 in hex; no results
```

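The `grep 1538` check above works because `/proc/<pid>/net/tcp` prints ports as four-digit uppercase hex. As a minimal sketch, the same check could be wrapped in two helpers; the names `port_to_hex` and `count_remote_port` are hypothetical, and the sketch assumes the usual Linux column layout in which the third field is the remote `IP:PORT`. Matching only the remote-port field avoids false hits from local ports or address bytes that happen to contain `1538`. It would be fed the table on stdin, for example `count_remote_port 5432 < /proc/<gitea-pid>/net/tcp`.

```bash
# Convert a decimal TCP port to the 4-digit uppercase hex form
# used in /proc/<pid>/net/tcp (5432 becomes 1538).
port_to_hex() {
  printf '%04X\n' "$1"
}

# Count sockets whose REMOTE port equals the given decimal port.
# Reads a /proc/net/tcp-style table on stdin; line 1 is the header,
# field 3 is rem_address in HEXIP:HEXPORT form.
count_remote_port() {
  hex="$(port_to_hex "$1")"
  awk -v p="$hex" '
    NR > 1 { n = split($3, a, ":"); if (a[n] == p) c++ }
    END    { print c + 0 }
  '
}
```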
---

## Immediate Fix Applied

```bash
# Add the missing key (value = sr-check-password = changeme4 = base64: Y2hhbmdlbWU0)
sudo k3s kubectl patch secret -n default gitea-postgresql-ha-postgresql \
  --type='json' \
  -p='[{"op":"add","path":"/data/pgpool-password","value":"Y2hhbmdlbWU0"}]'

# Restart pgpool
sudo k3s kubectl delete pod -n default <pgpool-pod-name>
```

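Kubernetes stores Secret `data` values base64-encoded, so the value in the patch has to be the encoding of the exact password bytes. The following is a small sketch of how the value and the patch document used above could be generated rather than typed by hand; `encode_secret_value` and `pgpool_password_patch` are hypothetical helper names. `printf '%s'` is used instead of `echo` so that no trailing newline ends up inside the encoded password. The output of `pgpool_password_patch` is the string passed to `-p` in the command above.

```bash
# Base64-encode a secret value exactly as given (no trailing newline added).
encode_secret_value() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Build the JSON Patch document that adds the pgpool-password key
# to the secret's data map, using the encoded form of the password.
pgpool_password_patch() {
  printf '[{"op":"add","path":"/data/pgpool-password","value":"%s"}]' \
    "$(encode_secret_value "$1")"
}
```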
---

## Risk: Fix Will Be Lost on helm upgrade

The patched secret is managed by Helm (annotation:
`meta.helm.sh/release-name: gitea`). A `helm upgrade` will regenerate the
secret from the chart template, which does not include `pgpool-password`,
and the bug will recur.

---

## Tasks

### T01 — Add pgpool-password to Helm values

```task
id: T01
status: open
priority: high
```

Create or update `helm/gitea-values.yaml` (or equivalent) to permanently
include the pgpool-password so it survives `helm upgrade`:

```yaml
postgresql-ha:
  postgresql:
    pgpoolPassword: <value matching sr-check-password>
```

**Done when:** `helm upgrade gitea` completes and pgpool starts cleanly
without manual secret patching.

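One way to verify the "done when" condition after an upgrade is to check that the secret still carries every key the pods need. The sketch below is illustrative only: `missing_secret_keys` is a hypothetical helper that reads the secret's key names from stdin (one per line) and prints any required key that is absent. The key names could be listed with something like `kubectl get secret -n default gitea-postgresql-ha-postgresql -o go-template='{{range $k, $v := .data}}{{println $k}}{{end}}'`, which is an assumption about how the check would be wired up, not part of the workplan.

```bash
# Print each required key (given as arguments) that is missing from the
# key list on stdin. Returns non-zero if at least one key is missing.
missing_secret_keys() {
  present="$(cat)"
  rc=0
  for key in "$@"; do
    # -x matches whole lines only, so "password" does not match "postgres-password"
    if ! printf '%s\n' "$present" | grep -qxF -- "$key"; then
      printf '%s\n' "$key"
      rc=1
    fi
  done
  return "$rc"
}
```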
---

### T02 — Add pgpool health check to smoke test

```task
id: T02
status: open
priority: high
```

Extend `tests/smoke_kube.sh` to assert:

```bash
# All postgresql-ha pods Running
kubectl get pods -n default | grep gitea-postgresql-ha | grep -v Running && exit 1

# pgpool specifically not in CrashLoopBackOff:
# read the waiting reason of the first container (empty when it is running)
reason=$(kubectl get pod -n default -l app.kubernetes.io/component=pgpool \
  -o jsonpath='{.items[0].status.containerStatuses[0].state.waiting.reason}')
[ "$reason" != "CrashLoopBackOff" ] || exit 1
```

**Done when:** the smoke test catches a pgpool failure within 5 minutes.

---

### T03 — Add HA failover test

```task
id: T03
status: open
priority: high
```

Create `tests/test_ha_failover.sh` that:

1. Records Gitea login response time (baseline)
2. Kills the primary PostgreSQL pod: `kubectl delete pod gitea-postgresql-ha-postgresql-0 -n default`
3. Waits for repmgr to promote a replica (max 60s)
4. Asserts Gitea login POST still succeeds within 10s
5. Asserts pgpool pod is Running (not CrashLoopBackOff)
6. Asserts all postgresql pods return to Running

This test must pass before any PostgreSQL HA deployment is considered done.

**Done when:** script exits 0 against a live cluster.

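The steps above could be sketched roughly as follows. This is a starting point under stated assumptions, not the final script: `GITEA_URL`, `GITEA_USER`, `GITEA_PASS`, and `RUN_FAILOVER_TEST` are hypothetical environment variables, the `app.kubernetes.io/component=postgresql` label is assumed by analogy with the pgpool label used in T02, and an authenticated call to `/api/v1/user` stands in for the login POST as a database-backed probe. Steps 3 and 4 are combined into one wait of up to 60s, with each probe limited to 10s.

```bash
#!/bin/sh
# Sketch of tests/test_ha_failover.sh; values marked (assumed) are illustrative.
NAMESPACE="${NAMESPACE:-default}"
PRIMARY_POD="${PRIMARY_POD:-gitea-postgresql-ha-postgresql-0}"
GITEA_URL="${GITEA_URL:-http://localhost:3000}"   # (assumed) Gitea base URL
KUBECTL="${KUBECTL:-kubectl}"

# Poll a command until it succeeds; give up after timeout_s seconds.
wait_until() {
  timeout_s="$1"; interval_s="$2"; shift 2
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout_s" ]; then return 1; fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  return 0
}

# (assumed) database-backed probe; prints the total request time in seconds.
gitea_probe() {
  curl -fsS --max-time 10 -o /dev/null -w '%{time_total}\n' \
    -u "${GITEA_USER:-}:${GITEA_PASS:-}" "${GITEA_URL}/api/v1/user"
}

# Quiet wrapper so the probe can be used as a polling condition.
gitea_ok() { gitea_probe >/dev/null 2>&1; }

run_failover_test() {
  # 1. Baseline response time
  baseline="$(gitea_probe)" || { echo "baseline probe failed" >&2; return 1; }
  echo "baseline: ${baseline}s"
  # 2. Kill the primary PostgreSQL pod
  $KUBECTL delete pod "$PRIMARY_POD" -n "$NAMESPACE" || return 1
  # 3 + 4. Wait up to 60s for the service to answer again
  wait_until 60 5 gitea_ok || { echo "Gitea did not recover in 60s" >&2; return 1; }
  # 5. pgpool must be Ready (a crash-looping pod never becomes Ready)
  $KUBECTL wait --for=condition=Ready pod -n "$NAMESPACE" \
    -l app.kubernetes.io/component=pgpool --timeout=60s || return 1
  # 6. All postgresql pods must return to Ready
  $KUBECTL wait --for=condition=Ready pod -n "$NAMESPACE" \
    -l app.kubernetes.io/component=postgresql --timeout=300s || return 1
}

# Only run against a live cluster when explicitly requested.
if [ "${RUN_FAILOVER_TEST:-0}" = "1" ]; then
  run_failover_test
  exit $?
fi
```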
---

### T04 — Document the incident in docs/

```task
id: T04
status: open
priority: medium
```

Add `docs/incidents/2026-03-10-pgpool-missing-secret.md` with the full
timeline, root cause, and fix, so future operators understand what happened
and how to recover.

**Done when:** doc committed and linked from `docs/README.md`.

---

## References

- Bitnami postgresql-ha chart v16.2.2
- Gitea Helm chart v12.2.0
- Related decision: D3 (HA testing policy) in `DECISIONS.md`