ops: establish ops/ directory with Gitea runbook and INC-001 incident report

- Create ops/runbooks/gitea-coulombcore.md: recovery checklist for Gitea on
  COULOMBCORE, documents containerd StartError pattern and CPU budget issue
- Create ops/incidents/2026-03-25-gitea-pgpool-crashloop.md: INC-001 post-mortem for
  13-day Gitea outage (PGPool CrashLoopBackOff + rolling update CPU deadlock)
- Create ops/README.md: index for runbooks and incidents
- state-hub/dashboard/src/docs/connecting.md: add railiance01 tunnel config
  (was previously unsaved)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:30:44 +01:00
parent efbbef76b0
commit 41d239c166
4 changed files with 336 additions and 0 deletions
--- a/ops/README.md
+++ b/ops/README.md
@@ -0,0 +1,23 @@
+# Ops Documentation
+
+Operational runbooks and incident reports for the Railiance/Custodian infrastructure.
+
+## Structure
+
+```
+ops/
+  runbooks/   — how-to guides for recurring operational tasks and known issues
+  incidents/  — post-incident reports (append-only, one file per incident)
+```
+
+## Runbooks
+
+| Runbook | Covers |
+|---------|--------|
+| [gitea-coulombcore.md](runbooks/gitea-coulombcore.md) | Gitea on COULOMBCORE k3s — access, known issues, recovery checklist |
+
+## Incidents
+
+| ID | Date | Summary | Status |
+|----|------|---------|--------|
+| [INC-001](incidents/2026-03-25-gitea-pgpool-crashloop.md) | 2026-03-25 | Gitea down 13d — PGPool containerd StartError + CPU exhaustion | Resolved |
--- a/ops/incidents/2026-03-25-gitea-pgpool-crashloop.md
+++ b/ops/incidents/2026-03-25-gitea-pgpool-crashloop.md
@@ -0,0 +1,122 @@
+---
+title: "INC-001: Gitea down — PGPool containerd StartError + CPU exhaustion"
+date: 2026-03-25
+severity: high
+status: resolved
+affected: gitea (http://92.205.130.254:32166)
+environment: COULOMBCORE k3s cluster
+duration: ~13 days (2026-03-12 to 2026-03-25)
+resolved_by: Bernd Worsch / Claude
+---
+
+# INC-001: Gitea down — PGPool CrashLoopBackOff + CPU exhaustion
+
+## Summary
+
+Gitea was completely unavailable for approximately 13 days. The root cause was containerd
+state corruption on the COULOMBCORE k3s node, which made PGPool fail on every start with
+`StartError: cannot start a stopped process`. This cascaded to Gitea, which crashed when
+unable to reach its database. A concurrent Helm upgrade rollout (chart 12.2.0 → 12.5.0)
+was additionally blocked because a new Gitea pod could not schedule due to CPU exhaustion.
+
+---
+
+## Timeline
+
+| Time | Event |
+|------|-------|
+| 2026-03-10 | `helm upgrade gitea gitea/gitea` — chart 12.2.0 → 12.5.0 (Gitea 1.24.5 → 1.25.4) |
+| 2026-03-12 | New PGPool pod created by Helm upgrade rolling restart. Pod enters CrashLoopBackOff immediately with `StartError: cannot start a stopped process` |
+| 2026-03-12 | Gitea app pod crashes due to DB unreachable (PGPool down) |
+| ~2026-03-22 | New Gitea pod from upgrade rollout attempts to schedule. Blocked: `Insufficient cpu` (CPU requests at 98%) |
+| 2026-03-25 | Incident detected and diagnosed |
+| 2026-03-25 09:28 | PGPool CPU request reduced from 250m → 100m via `helm upgrade --reuse-values` |
+| 2026-03-25 09:28 | Old crashing Gitea pod deleted to free 100m CPU |
+| 2026-03-25 09:29 | New PGPool pod (100m request) schedules and starts successfully (`1/1 Running`) |
+| 2026-03-25 09:30 | New Gitea pod (pending 3 days) schedules, inits, starts (`1/1 Running`) |
+| 2026-03-25 09:30 | Gitea HTTP endpoint returns 200. Incident resolved. |
+
+---
+
+## Root Causes
+
+### Primary: Containerd state corruption (PGPool)
+
+The PGPool pod was in `CrashLoopBackOff` with exit code 128 and message:
+```
+failed to start containerd task "b6de5dce...": cannot start a stopped process: unknown
+```
+
+This is a known containerd bug where a container task is left in an invalid "stopped" state
+in containerd's internal database. Every restart attempt by Kubernetes immediately fails
+because containerd refuses to start a task it believes is already stopped. The fix is to
+delete the pod — the new pod gets a fresh containerd task ID and starts normally.
+
+### Secondary: CPU exhaustion (rolling update blocked)
+
+The Helm 12.2.0 → 12.5.0 upgrade triggered a rolling update for all Gitea components.
+The new Gitea app pod (`gitea-f4f657c59-cmtdf`) could not schedule for 3+ days because
+CPU requests were at 1975m/2000m (~98%). PGPool's default request of 250m was
+unnecessarily high for a lightweight connection pooler on this resource-constrained
+single-node cluster.
+
+---
+
+## Impact
+
+- Gitea (code hosting) completely unavailable for ~13 days
+- All repos with `remote_url` pointing to `92.205.130.254:32166` were unreachable
+- No data loss — PostgreSQL HA pods remained running throughout
+- Git push/pull and Gitea API calls failed for all consumers during the outage
+
+---
+
+## Resolution Steps
+
+```bash
+# 1. Reduce PGPool CPU request (primary blocker for scheduling)
+helm upgrade gitea gitea/gitea --version 12.5.0 -n default \
+  --reuse-values \
+  --set 'postgresql-ha.pgpool.resources.requests.cpu=100m' \
+  --set 'postgresql-ha.pgpool.resources.limits.cpu=200m'
+
+# 2. Delete old crashing Gitea pod to free CPU and trigger fresh ReplicaSet pod
+kubectl delete pod gitea-79f98f897f-khs26
+
+# 3. Clean up leftover node-debugger pods from prior investigation attempt
+kubectl delete pod \
+  node-debugger-254.130.205.92.host.secureserver.net-bn7qq \
+  node-debugger-254.130.205.92.host.secureserver.net-vmd79
+```
+
+Within ~90 seconds:
+- New PGPool pod scheduled and running
+- Pending Gitea pod (3 days) scheduled, init containers ran, main container started
+- Gitea HTTP 200 confirmed
+
+---
+
+## Follow-up Actions
+
+- [ ] Add PGPool CPU resource override to `railiance-apps` Helm values file (currently
+      stored only in Helm release; values should be in git)
+- [ ] Set up alerting for `CrashLoopBackOff` pods older than 30 minutes
+- [ ] Review whether 3-node PostgreSQL HA is appropriate for a single-node cluster
+      (provides no actual HA benefit, consumes 750m CPU requests)
+- [ ] Consider adding a CPU request budget dashboard panel to the Observable dashboard
+
+---
+
+## Lessons Learned
+
+1. **Containerd StartError is not a config issue.** The error message looks like an
+   application or configuration failure but is actually containerd state corruption. The fix
+   is to delete the pod. This is now documented in the runbook: `ops/runbooks/gitea-coulombcore.md`
+
+2. **Track Helm values in git.** The only custom value (`pgpool.adminPassword`) was in the
+   Helm release but not in `railiance-apps`. The resource fix applied here (pgpool CPU) would
+   be overwritten by a future `helm upgrade` from a clean checkout without `--reuse-values`.
+   All non-secret Helm values should live in `railiance-apps/`.
+
+3. **Single-node CPU budget is tight.** At 98% CPU request allocation, any pod churn causes
+   scheduling failures. Resource requests need to be right-sized for this environment.
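The scheduling failure in this report comes down to simple arithmetic on CPU requests. Below is a minimal shell sketch that uses the 1975m/2000m figures from the report and assumes the approximate 100m Gitea request listed in the runbook. The variable names are illustrative, and nothing here queries the cluster.

```bash
#!/usr/bin/env bash
# Figures from INC-001; the 100m Gitea request is the runbook's approximate value.
allocated_m=1975       # CPU requests already allocated on the node (millicores)
capacity_m=2000        # allocatable CPU on the single node (millicores)
gitea_request_m=100    # request of the pending Gitea pod (assumed)

headroom_m=$((capacity_m - allocated_m))
percent=$((allocated_m * 100 / capacity_m))   # integer division, rounds down

echo "allocated: ${percent}% (${headroom_m}m free)"
if [ "$gitea_request_m" -gt "$headroom_m" ]; then
  echo "pod needs ${gitea_request_m}m: Insufficient cpu"
else
  echo "pod needs ${gitea_request_m}m: schedulable"
fi
```

With these inputs the free headroom is smaller than the pending pod's request, which matches the `Insufficient cpu` scheduler event recorded in the timeline.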
--- a/ops/runbooks/gitea-coulombcore.md
+++ b/ops/runbooks/gitea-coulombcore.md
@@ -0,0 +1,159 @@
+---
+title: Runbook — Gitea on COULOMBCORE
+tags: [gitea, coulombcore, k3s, postgresql-ha]
+created: 2026-03-25
+updated: 2026-03-25
+---
+
+# Runbook: Gitea on COULOMBCORE
+
+Gitea runs on the single-node k3s cluster at COULOMBCORE (`92.205.130.254`, user `tegwick`).
+It uses Bitnami `postgresql-ha` (PGPool + 3-node PostgreSQL with repmgr) and a Valkey cluster for caching.
+
+---
+
+## Access
+
+```bash
+# SSH (requires ~/.ssh/id_ops)
+ssh -i ~/.ssh/id_ops tegwick@92.205.130.254
+
+# Web UI
+http://92.205.130.254:32166    # NodePort 32166 → gitea svc → pod :3000
+
+# Check all Gitea pods
+kubectl get pods -l 'app.kubernetes.io/instance=gitea'
+```
+
+---
+
+## Helm Release
+
+| Field | Value |
+|-------|-------|
+| Release name | `gitea` |
+| Namespace | `default` |
+| Chart | `gitea/gitea` |
+| Current version | 12.5.0 (Gitea 1.25.4) |
+
+```bash
+helm list -n default
+helm history gitea -n default
+helm get values gitea -n default
+```
+
+---
+
+## Known Issues
+
+### 1. PGPool CrashLoopBackOff — containerd `StartError: cannot start a stopped process`
+
+**Symptom:** `gitea-postgresql-ha-pgpool-*` pod is in `CrashLoopBackOff`. Describe shows:
+```
+Last State: Terminated
+  Reason: StartError
+  Message: failed to start containerd task "...": cannot start a stopped process: unknown
+  Exit Code: 128
+```
+
+**Root cause:** Containerd state corruption on the k3s node — the container task is recorded
+as "stopped" in containerd's internal state but the process never actually ran. This causes
+every restart attempt to fail immediately with exit code 128. Not a config or auth issue.
+
+**Fix:** Delete the pod. The ReplicaSet controller recreates it with a fresh containerd task.
+
+```bash
+kubectl delete pod -l 'app.kubernetes.io/component=pgpool'
+```
+
+Wait 30s then confirm it comes up `1/1 Running`.
+
+**Cascade effect:** PGPool down → `gitea-postgresql-ha-pgpool` ClusterIP (`10.43.242.51:5432`)
+unreachable → Gitea app pod exhausts 10 DB connection attempts → exits → `CrashLoopBackOff`.
+Fixing PGPool automatically unblocks Gitea.
+
+---
+
+### 2. Gitea pods Pending — Insufficient CPU
+
+**Symptom:** New pod stuck in `Pending` with scheduler event:
+```
+0/1 nodes are available: 1 Insufficient cpu.
+```
+
+**Root cause:** The single-node cluster has ~2 vCPUs. CPU requests routinely approach 98%
+allocation. PGPool defaults to 250m CPU request; combined with 3x PostgreSQL at 250m each,
+Valkey, SSO stack, and monitoring, the budget is nearly exhausted.
+
+**Check:**
+```bash
+kubectl describe node | grep -A6 "Allocated resources"
+```
+
+**Fix:** Reduce PGPool CPU request via Helm upgrade, then delete any stale crashing pods:
+
+```bash
+# Reduce pgpool from 250m to 100m (safe — pgpool is a lightweight connection pooler)
+helm upgrade gitea gitea/gitea --version <current> -n default \
+  --reuse-values \
+  --set 'postgresql-ha.pgpool.resources.requests.cpu=100m' \
+  --set 'postgresql-ha.pgpool.resources.limits.cpu=200m'
+
+# Delete the stuck old Gitea pod if it's crashlooping
+kubectl delete pod <old-gitea-pod-name>
+```
+
+This frees ~250m (old pgpool, if crashing) + 100m (old gitea) = 350m, which is enough to
+schedule the new PGPool (100m) + new Gitea (100m via init containers).
+
+**After-fix:** The rolling update from the blocked deployment should self-complete once
+both pods can schedule and Gitea can reach PGPool.
+
+---
+
+## Recovery Checklist
+
+When Gitea is down, work through this in order:
+
+1. **Check PGPool** — most common root cause
+   ```bash
+   kubectl get pod -l 'app.kubernetes.io/component=pgpool'
+   ```
+   - `CrashLoopBackOff` → delete the pod (see issue #1 above)
+   - `Pending` → check CPU budget (see issue #2)
+
+2. **Check PostgreSQL** — should be 3/3 Running; if not, this is a deeper issue
+   ```bash
+   kubectl get pod -l 'app.kubernetes.io/component=postgresql'
+   ```
+
+3. **Check Gitea app pod**
+   ```bash
+   kubectl get pod -l 'app.kubernetes.io/component=gitea'
+   kubectl logs <gitea-pod> --tail=20
+   ```
+   - DB connect errors → PGPool issue (go to step 1)
+   - Init container crash → check `kubectl logs <pod> -c configure-gitea`
+
+4. **Verify end-to-end**
+   ```bash
+   curl -s -o /dev/null -w "%{http_code}" http://92.205.130.254:32166/
+   # expect: 200
+   ```
+
+---
+
+## Node Resource Budget (approximate)
+
+| Component | CPU Request |
+|-----------|------------|
+| postgresql-ha-postgresql × 3 | 750m |
+| pgpool | 100m (after 2026-03-25 fix, was 250m) |
+| valkey-cluster × 3 | 300m |
+| gitea app | ~100m (init containers) |
+| SSO stack (authelia, lldap, privacyidea, keycape) | ~225m |
+| System (coredns, metrics-server, traefik) | ~200m |
+| **Total** | **~1675m** |
+
+Node capacity: ~2000m. Headroom is tight (~325m). Avoid adding workloads without
+reviewing resource requests first.
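The budget table can be re-checked with a few lines of shell. This is a sketch only: the values are the approximate requests from the table above, and `fits` is a hypothetical helper, not part of any tooling on COULOMBCORE.

```bash
#!/usr/bin/env bash
# CPU requests in millicores, in table order:
# postgresql x3, pgpool, valkey x3, gitea app, SSO stack, system.
requests_m=(750 100 300 100 225 200)
capacity_m=2000   # approximate node capacity from the runbook

total_m=0
for r in "${requests_m[@]}"; do
  total_m=$((total_m + r))
done
headroom_m=$((capacity_m - total_m))

# fits REQUEST_M: succeeds if a pod requesting REQUEST_M millicores
# fits into the remaining headroom (hypothetical helper).
fits() {
  [ "$1" -le "$headroom_m" ]
}

echo "requested=${total_m}m headroom=${headroom_m}m"
if fits 250; then
  echo "a 250m pod fits"
else
  echo "a 250m pod does not fit"
fi
```

For live numbers, the `kubectl describe node` check under issue 2 remains the source of truth, since the table is only approximate.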
--- a/state-hub/dashboard/src/docs/connecting.md
+++ b/state-hub/dashboard/src/docs/connecting.md
@@ -135,6 +135,38 @@ tunnels:
      max_attempts: 0
      backoff_initial: 5
      backoff_max: 60
+
+  state-hub-railiance01:        # API tunnel
+    host: 92.205.62.239
+    remote_port: 18000
+    local_port: 8000
+    ssh_user: tegwick
+    ssh_key: ~/.ssh/id_ops
+    actor: agent.claude-railiance01
+    health_check:
+      url: http://127.0.0.1:8000/state/health
+      interval_seconds: 30
+      timeout_seconds: 5
+    reconnect:
+      max_attempts: 0
+      backoff_initial: 5
+      backoff_max: 60
+
+  state-hub-mcp-railiance01:    # MCP SSE tunnel
+    host: 92.205.62.239
+    remote_port: 18001
+    local_port: 8001
+    ssh_user: tegwick
+    ssh_key: ~/.ssh/id_ops
+    actor: agent.claude-railiance01
+    health_check:
+      url: http://127.0.0.1:18001/sse
+      interval_seconds: 30
+      timeout_seconds: 5
+    reconnect:
+      max_attempts: 0
+      backoff_initial: 5
+      backoff_max: 60
 ```

 ops-bridge source: `~/ops-bridge` · SSH key: `~/.ssh/id_ops`