From 41d239c1664e1d9d4f384ed244512df9da75a5fa Mon Sep 17 00:00:00 2001
From: tegwick
Date: Wed, 25 Mar 2026 11:30:44 +0100
Subject: [PATCH] ops: establish ops/ directory with Gitea runbook and INC-001 incident report
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Create ops/runbooks/gitea-coulombcore.md — recovery checklist for Gitea on
  COULOMBCORE, documents containerd StartError pattern and CPU budget issue
- Create ops/incidents/2026-03-25-gitea-pgpool-crashloop.md — INC-001 post-mortem
  for 13-day Gitea outage (PGPool CrashLoopBackOff + rolling update CPU deadlock)
- Create ops/README.md — index for runbooks and incidents
- state-hub/dashboard/src/docs/connecting.md: add railiance01 tunnel config
  (was previously unsaved)

Co-Authored-By: Claude Sonnet 4.6
---
 ops/README.md                                 |  23 +++
 .../2026-03-25-gitea-pgpool-crashloop.md      | 122 ++++++++++++++
 ops/runbooks/gitea-coulombcore.md             | 159 ++++++++++++++++++
 state-hub/dashboard/src/docs/connecting.md    |  32 ++++
 4 files changed, 336 insertions(+)
 create mode 100644 ops/README.md
 create mode 100644 ops/incidents/2026-03-25-gitea-pgpool-crashloop.md
 create mode 100644 ops/runbooks/gitea-coulombcore.md

diff --git a/ops/README.md b/ops/README.md
new file mode 100644
index 0000000..f09f824
--- /dev/null
+++ b/ops/README.md
@@ -0,0 +1,23 @@
+# Ops Documentation
+
+Operational runbooks and incident reports for the Railiance/Custodian infrastructure.
+
+## Structure
+
+```
+ops/
+  runbooks/   — how-to guides for recurring operational tasks and known issues
+  incidents/  — post-incident reports (append-only, one file per incident)
+```
+
+## Runbooks
+
+| Runbook | Covers |
+|---------|--------|
+| [gitea-coulombcore.md](runbooks/gitea-coulombcore.md) | Gitea on COULOMBCORE k3s — access, known issues, recovery checklist |
+
+## Incidents
+
+| ID | Date | Summary | Status |
+|----|------|---------|--------|
+| [INC-001](incidents/2026-03-25-gitea-pgpool-crashloop.md) | 2026-03-25 | Gitea down 13d — PGPool containerd StartError + CPU exhaustion | Resolved |
diff --git a/ops/incidents/2026-03-25-gitea-pgpool-crashloop.md b/ops/incidents/2026-03-25-gitea-pgpool-crashloop.md
new file mode 100644
index 0000000..8cb9e7e
--- /dev/null
+++ b/ops/incidents/2026-03-25-gitea-pgpool-crashloop.md
@@ -0,0 +1,122 @@
+---
+title: "INC-001: Gitea down — PGPool containerd StartError + CPU exhaustion"
+date: 2026-03-25
+severity: high
+status: resolved
+affected: gitea (http://92.205.130.254:32166)
+environment: COULOMBCORE k3s cluster
+duration: ~13 days (2026-03-12 to 2026-03-25)
+resolved_by: Bernd Worsch / Claude
+---
+
+# INC-001: Gitea down — PGPool CrashLoopBackOff + CPU exhaustion
+
+## Summary
+
+Gitea was completely unavailable for approximately 13 days. Root cause was a containerd
+state corruption on the COULOMBCORE k3s node causing PGPool to fail on every start with
+`StartError: cannot start a stopped process`. This cascaded to Gitea, which crashed when
+unable to reach its database. A concurrent Helm upgrade rollout (gitea 12.2.0 → 12.5.0)
+was additionally blocked because a new Gitea pod could not schedule due to CPU exhaustion.
+
+---
+
+## Timeline
+
+| Time | Event |
+|------|-------|
+| 2026-03-10 | `helm upgrade gitea gitea/gitea` — chart 12.2.0 → 12.5.0 (Gitea 1.24.5 → 1.25.4) |
+| 2026-03-12 | New PGPool pod created by Helm upgrade rolling restart. Pod enters CrashLoopBackOff immediately with `StartError: cannot start a stopped process` |
+| 2026-03-12 | Gitea app pod crashes due to DB unreachable (PGPool down) |
+| ~2026-03-22 | New Gitea pod from upgrade rollout attempts to schedule. Blocked: `Insufficient cpu` (CPU requests at 98%) |
+| 2026-03-25 | Incident detected and diagnosed |
+| 2026-03-25 09:28 | PGPool CPU request reduced from 250m → 100m via `helm upgrade --reuse-values` |
+| 2026-03-25 09:28 | Old crashing Gitea pod deleted to free 100m CPU |
+| 2026-03-25 09:29 | New PGPool pod (100m request) schedules and starts successfully (`1/1 Running`) |
+| 2026-03-25 09:30 | New Gitea pod (pending 3 days) schedules, inits, starts (`1/1 Running`) |
+| 2026-03-25 09:30 | Gitea HTTP endpoint returns 200. Incident resolved. |
+
+---
+
+## Root Causes
+
+### Primary: Containerd state corruption (PGPool)
+
+The PGPool pod was in `CrashLoopBackOff` with exit code 128 and message:
+```
+failed to start containerd task "b6de5dce...": cannot start a stopped process: unknown
+```
+
+This is a known containerd bug where a container task is left in an invalid "stopped" state
+in containerd's internal database. Every restart attempt by Kubernetes immediately fails
+because containerd refuses to start a task it believes is already stopped. The fix is to
+delete the pod — the new pod gets a fresh containerd task ID and starts normally.
+
+### Secondary: CPU exhaustion (rolling update blocked)
+
+The Helm 12.2.0 → 12.5.0 upgrade triggered a rolling update for all Gitea components.
+The new Gitea app pod (`gitea-f4f657c59-cmtdf`) could not schedule for 3+ days because
+CPU requests were at 1975m/2000m (~98%). PGPool's default request of 250m was
+unnecessarily high for a lightweight connection pooler on this resource-constrained
+single-node cluster.
+
+---
+
+## Impact
+
+- Gitea (code hosting) completely unavailable for ~13 days
+- All repos with `remote_url` pointing to `92.205.130.254:32166` were unreachable
+- No data loss — PostgreSQL HA pods remained running throughout
+- Git push/pull and Gitea API calls failed for all consumers during the outage
+
+---
+
+## Resolution Steps
+
+```bash
+# 1. Reduce PGPool CPU request (primary blocker for scheduling)
+helm upgrade gitea gitea/gitea --version 12.5.0 -n default \
+  --reuse-values \
+  --set 'postgresql-ha.pgpool.resources.requests.cpu=100m' \
+  --set 'postgresql-ha.pgpool.resources.limits.cpu=200m'
+
+# 2. Delete old crashing Gitea pod to free CPU and trigger fresh ReplicaSet pod
+kubectl delete pod gitea-79f98f897f-khs26
+
+# 3. Cleanup leftover node-debugger pods from prior investigation attempt
+kubectl delete pod \
+  node-debugger-254.130.205.92.host.secureserver.net-bn7qq \
+  node-debugger-254.130.205.92.host.secureserver.net-vmd79
+```
+
+Within ~90 seconds:
+- New PGPool pod scheduled and running
+- Pending Gitea pod (3 days) scheduled, init containers ran, main container started
+- Gitea HTTP 200 confirmed
+
+---
+
+## Follow-up Actions
+
+- [ ] Add PGPool CPU resource override to `railiance-apps` Helm values file (currently
+      stored only in Helm release; values should be in git)
+- [ ] Set up alerting for `CrashLoopBackOff` pods older than 30 minutes
+- [ ] Review whether 3-node PostgreSQL HA is appropriate for a single-node cluster
+      (provides no actual HA benefit, consumes 750m CPU requests)
+- [ ] Consider adding a CPU request budget dashboard panel to the Observable dashboard
+
+---
+
+## Lessons Learned
+
+1. **Containerd StartError is not a config issue.** The error message looks like a runtime
+   failure but is actually a containerd state corruption. The fix is always "delete the pod".
+   This is now documented in the runbook: `ops/runbooks/gitea-coulombcore.md`
+
+2. **Track Helm values in git.** The only custom value (`pgpool.adminPassword`) was in the
+   Helm release but not in `railiance-apps`. The resource fix applied here (pgpool CPU) would
+   be lost on a future `helm upgrade` run from a clean checkout without `--reuse-values`.
+   All non-secret Helm values should live in `railiance-apps/`.
+
+3. **Single-node CPU budget is tight.** At 98% CPU request allocation, any pod churn causes
+   scheduling failures. Resource requests need to be right-sized for this environment.
diff --git a/ops/runbooks/gitea-coulombcore.md b/ops/runbooks/gitea-coulombcore.md
new file mode 100644
index 0000000..272dd04
--- /dev/null
+++ b/ops/runbooks/gitea-coulombcore.md
@@ -0,0 +1,159 @@
+---
+title: Runbook — Gitea on COULOMBCORE
+tags: [gitea, coulombcore, k3s, postgresql-ha]
+created: 2026-03-25
+updated: 2026-03-25
+---
+
+# Runbook: Gitea on COULOMBCORE
+
+Gitea runs on the single-node k3s cluster at COULOMBCORE (`92.205.130.254`, user `tegwick`).
+It uses Bitnami `postgresql-ha` (PGPool + 3-node PostgreSQL with repmgr) and Valkey cluster for caching.
+
+---
+
+## Access
+
+```bash
+# SSH (requires ~/.ssh/id_ops)
+ssh -i ~/.ssh/id_ops tegwick@92.205.130.254
+
+# Web UI
+http://92.205.130.254:32166   # NodePort 32166 → gitea svc → pod :3000
+
+# Check all Gitea pods
+kubectl get pods -l 'app.kubernetes.io/instance=gitea'
+```
+
+---
+
+## Helm Release
+
+| Field | Value |
+|-------|-------|
+| Release name | `gitea` |
+| Namespace | `default` |
+| Chart | `gitea/gitea` |
+| Current version | 12.5.0 (Gitea 1.25.4) |
+
+```bash
+helm list -n default
+helm history gitea -n default
+helm get values gitea -n default
+```
+
+---
+
+## Known Issues
+
+### 1. PGPool CrashLoopBackOff — containerd `StartError: cannot start a stopped process`
+
+**Symptom:** `gitea-postgresql-ha-pgpool-*` pod is in `CrashLoopBackOff`. Describe shows:
+```
+Last State: Terminated
+  Reason: StartError
+  Message: failed to start containerd task "...": cannot start a stopped process: unknown
+  Exit Code: 128
+```
+
+**Root cause:** Containerd state corruption on the k3s node — the container task is recorded
+as "stopped" in containerd's internal state but the process never actually ran. This causes
+every restart attempt to fail immediately with exit code 128. Not a config or auth issue.
+
+**Fix:** Delete the pod. The ReplicaSet controller recreates it with a fresh containerd task.
+
+```bash
+kubectl delete pod -l 'app.kubernetes.io/component=pgpool'
+```
+
+Wait 30s then confirm it comes up `1/1 Running`.
+
+**Cascade effect:** PGPool down → `gitea-postgresql-ha-pgpool` ClusterIP (`10.43.242.51:5432`)
+unreachable → Gitea app pod exhausts 10 DB connection attempts → exits → `CrashLoopBackOff`.
+Fixing PGPool automatically unblocks Gitea.
+
+---
+
+### 2. Gitea pods Pending — Insufficient CPU
+
+**Symptom:** New pod stuck in `Pending` with scheduler event:
+```
+0/1 nodes are available: 1 Insufficient cpu.
+```
+
+**Root cause:** The single-node cluster has ~2 vCPUs. CPU requests routinely approach 98%
+allocation. PGPool defaults to 250m CPU request; combined with 3x PostgreSQL at 250m each,
+Valkey, SSO stack, and monitoring, the budget is nearly exhausted.
+
+**Check:**
+```bash
+kubectl describe node | grep -A6 "Allocated resources"
+```
+
+**Fix:** Reduce PGPool CPU request via Helm upgrade, then delete any stale crashing pods:
+
+```bash
+# Reduce pgpool from 250m to 100m (safe — pgpool is a lightweight connection pooler)
+helm upgrade gitea gitea/gitea --version <chart-version> -n default \
+  --reuse-values \
+  --set 'postgresql-ha.pgpool.resources.requests.cpu=100m' \
+  --set 'postgresql-ha.pgpool.resources.limits.cpu=200m'
+
+# Delete the stuck old Gitea pod if it's crashlooping
+kubectl delete pod <old-gitea-pod>
+```
+
+This frees ~250m (old pgpool, if crashing) + 100m (old gitea) = 350m, which is enough to
+schedule the new PGPool (100m) + new Gitea (100m via init containers).
+
+**After-fix:** The rolling update from the blocked deployment should self-complete once
+both pods can schedule and Gitea can reach PGPool.
+
+---
+
+## Recovery Checklist
+
+When Gitea is down, work through this in order:
+
+1. **Check PGPool** — most common root cause
+   ```bash
+   kubectl get pod -l 'app.kubernetes.io/component=pgpool'
+   ```
+   - `CrashLoopBackOff` → delete the pod (see issue #1 above)
+   - `Pending` → check CPU budget (see issue #2)
+
+2. **Check PostgreSQL** — should be 3/3 Running; if not, this is a deeper issue
+   ```bash
+   kubectl get pod -l 'app.kubernetes.io/component=postgresql'
+   ```
+
+3. **Check Gitea app pod**
+   ```bash
+   kubectl get pod -l 'app.kubernetes.io/component=gitea'
+   kubectl logs <gitea-pod> --tail=20
+   ```
+   - DB connect errors → PGPool issue (go to step 1)
+   - Init container crash → check `kubectl logs <gitea-pod> -c configure-gitea`
+
+4. **Verify end-to-end**
+   ```bash
+   curl -s -o /dev/null -w "%{http_code}" http://92.205.130.254:32166/
+   # expect: 200
+   ```
+
+---
+
+## Node Resource Budget (approximate)
+
+| Component | CPU Request |
+|-----------|------------|
+| postgresql-ha-postgresql × 3 | 750m |
+| pgpool | 100m (after 2026-03-25 fix, was 250m) |
+| valkey-cluster × 3 | 300m |
+| gitea app | ~100m (init containers) |
+| SSO stack (authelia, lldap, privacyidea, keycape) | ~225m |
+| System (coredns, metrics-server, traefik) | ~200m |
+| **Total** | **~1675m** |
+
+Node capacity: ~2000m. Headroom is tight (~325m). Avoid adding workloads without
+reviewing resource requests first.
diff --git a/state-hub/dashboard/src/docs/connecting.md b/state-hub/dashboard/src/docs/connecting.md
index f2f095e..582e543 100644
--- a/state-hub/dashboard/src/docs/connecting.md
+++ b/state-hub/dashboard/src/docs/connecting.md
@@ -135,6 +135,38 @@ tunnels:
       max_attempts: 0
       backoff_initial: 5
       backoff_max: 60
+
+  state-hub-railiance01:   # API tunnel
+    host: 92.205.62.239
+    remote_port: 18000
+    local_port: 8000
+    ssh_user: tegwick
+    ssh_key: ~/.ssh/id_ops
+    actor: agent.claude-railiance01
+    health_check:
+      url: http://127.0.0.1:8000/state/health
+      interval_seconds: 30
+      timeout_seconds: 5
+    reconnect:
+      max_attempts: 0
+      backoff_initial: 5
+      backoff_max: 60
+
+  state-hub-mcp-railiance01:   # MCP SSE tunnel
+    host: 92.205.62.239
+    remote_port: 18001
+    local_port: 8001
+    ssh_user: tegwick
+    ssh_key: ~/.ssh/id_ops
+    actor: agent.claude-railiance01
+    health_check:
+      url: http://127.0.0.1:18001/sse
+      interval_seconds: 30
+      timeout_seconds: 5
+    reconnect:
+      max_attempts: 0
+      backoff_initial: 5
+      backoff_max: 60
 ```
 
 ops-bridge source: `~/ops-bridge` · SSH key: `~/.ssh/id_ops`
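
The node resource budget table in `ops/runbooks/gitea-coulombcore.md` is plain millicore arithmetic, so it can be cross-checked with a few lines of shell. The sketch below is illustrative only: the function name `cpu_headroom` is hypothetical and not part of this patch, and the inputs are the approximate figures copied from the runbook table rather than live cluster data.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the patch): sum CPU requests given in
# millicores (e.g. "750m") and print the headroom left on the node.
cpu_headroom() {
  local capacity_m=$1; shift
  local total_m=0 req
  for req in "$@"; do
    total_m=$(( total_m + ${req%m} ))   # strip the trailing "m" suffix
  done
  echo "requested=${total_m}m headroom=$(( capacity_m - total_m ))m"
}

# Node capacity, then: postgresql x3, pgpool, valkey x3, gitea, SSO stack, system
cpu_headroom 2000 750m 100m 300m 100m 225m 200m
# prints: requested=1675m headroom=325m
```

For live numbers, the runbook's `kubectl describe node | grep -A6 "Allocated resources"` check remains the source of truth; this sketch only confirms that the table's total and headroom are consistent with each other.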