ops: establish ops/ directory with Gitea runbook and INC-001 incident report

- Create ops/runbooks/gitea-coulombcore.md: recovery checklist for Gitea on COULOMBCORE; documents containerd StartError pattern and CPU budget issue
- Create ops/incidents/2026-03-25-gitea-pgpool-crashloop.md: INC-001 post-mortem for 13-day Gitea outage (PGPool CrashLoopBackOff + rolling update CPU deadlock)
- Create ops/README.md: index for runbooks and incidents
- state-hub/dashboard/src/docs/connecting.md: add railiance01 tunnel config (was previously unsaved)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:30:44 +01:00
parent efbbef76b0
commit 41d239c166
4 changed files with 336 additions and 0 deletions
--- a/ops/runbooks/gitea-coulombcore.md
+++ b/ops/runbooks/gitea-coulombcore.md
@@ -0,0 +1,159 @@
+---
+title: Runbook — Gitea on COULOMBCORE
+tags: [gitea, coulombcore, k3s, postgresql-ha]
+created: 2026-03-25
+updated: 2026-03-25
+---
+
+# Runbook: Gitea on COULOMBCORE
+
+Gitea runs on the single-node k3s cluster at COULOMBCORE (`92.205.130.254`, user `tegwick`).
+It uses Bitnami `postgresql-ha` (PGPool in front of 3 repmgr-managed PostgreSQL nodes) and a Valkey cluster for caching.
+
+---
+
+## Access
+
+```bash
+# SSH (requires ~/.ssh/id_ops)
+ssh -i ~/.ssh/id_ops tegwick@92.205.130.254
+
+# Web UI
+http://92.205.130.254:32166    # NodePort 32166 → gitea svc → pod :3000
+
+# Check all Gitea pods
+kubectl get pods -l 'app.kubernetes.io/instance=gitea'
+```
+
+---
+
+## Helm Release
+
+| Field | Value |
+|-------|-------|
+| Release name | `gitea` |
+| Namespace | `default` |
+| Chart | `gitea/gitea` |
+| Current version | 12.5.0 (Gitea 1.25.4) |
+
+```bash
+helm list -n default
+helm history gitea -n default
+helm get values gitea -n default
+```
+
+---
+
+## Known Issues
+
+### 1. PGPool CrashLoopBackOff — containerd `StartError: cannot start a stopped process`
+
+**Symptom:** `gitea-postgresql-ha-pgpool-*` pod is in `CrashLoopBackOff`. Describe shows:
+```
+Last State: Terminated
+  Reason: StartError
+  Message: failed to start containerd task "...": cannot start a stopped process: unknown
+  Exit Code: 128
+```
+
+**Root cause:** Containerd state corruption on the k3s node — the container task is recorded
+as "stopped" in containerd's internal state but the process never actually ran. This causes
+every restart attempt to fail immediately with exit code 128. Not a config or auth issue.
+
+**Fix:** Delete the pod. The ReplicaSet controller recreates it with a fresh containerd task.
+
+```bash
+kubectl delete pod -l 'app.kubernetes.io/component=pgpool'
+```
+
+Wait about 30 seconds, then confirm the new pod reaches `1/1 Running`.
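Instead of a fixed sleep, the wait can be polled. The following is a minimal sketch, assuming bash and the same `app.kubernetes.io/component=pgpool` label used above. `wait_until` and `pgpool_ready` are hypothetical helper names introduced here for illustration; they are not part of kubectl or the chart.

```bash
#!/usr/bin/env bash
# wait_until ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Re-runs CMD until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on success, 1 if every attempt failed.
wait_until() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}

# Hypothetical check: succeeds when a pgpool pod shows READY 1/1 and
# STATUS Running in the default `kubectl get pod` column layout
# (NAME READY STATUS RESTARTS AGE).
pgpool_ready() {
  kubectl get pod -l 'app.kubernetes.io/component=pgpool' --no-headers 2>/dev/null \
    | awk '$2 == "1/1" && $3 == "Running" { ok = 1 } END { exit !ok }'
}
```

A possible invocation is `wait_until 12 5 pgpool_ready && echo "pgpool is up"`, which polls every 5 seconds for up to roughly a minute. The attempt count and delay are illustrative values, not measured ones.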
+
+**Cascade effect:** PGPool down → `gitea-postgresql-ha-pgpool` ClusterIP (`10.43.242.51:5432`)
+unreachable → Gitea app pod exhausts 10 DB connection attempts → exits → `CrashLoopBackOff`.
+Fixing PGPool automatically unblocks Gitea.
+
+---
+
+### 2. Gitea pods Pending — Insufficient CPU
+
+**Symptom:** New pod stuck in `Pending` with scheduler event:
+```
+0/1 nodes are available: 1 Insufficient cpu.
+```
+
+**Root cause:** The single-node cluster has ~2 vCPUs. CPU requests routinely approach 98%
+allocation. PGPool defaults to 250m CPU request; combined with 3x PostgreSQL at 250m each,
+Valkey, SSO stack, and monitoring, the budget is nearly exhausted.
+
+**Check:**
+```bash
+kubectl describe node | grep -A6 "Allocated resources"
+```
+
+**Fix:** Reduce PGPool CPU request via Helm upgrade, then delete any stale crashing pods:
+
+```bash
+# Reduce pgpool from 250m to 100m (safe — pgpool is a lightweight connection pooler)
+helm upgrade gitea gitea/gitea --version <current> -n default \
+  --reuse-values \
+  --set 'postgresql-ha.pgpool.resources.requests.cpu=100m' \
+  --set 'postgresql-ha.pgpool.resources.limits.cpu=200m'
+
+# Delete the stuck old Gitea pod if it's crashlooping
+kubectl delete pod <old-gitea-pod-name>
+```
+
+This frees ~250m (old pgpool, if crashing) + 100m (old gitea) = 350m, which is enough to
+schedule the new PGPool (100m) and the new Gitea pod (100m, requested by its init containers).
+
+**After-fix:** The rolling update from the blocked deployment should self-complete once
+both pods can schedule and Gitea can reach PGPool.
+
+---
+
+## Recovery Checklist
+
+When Gitea is down, work through this in order:
+
+1. **Check PGPool** — most common root cause
+   ```bash
+   kubectl get pod -l 'app.kubernetes.io/component=pgpool'
+   ```
+   - `CrashLoopBackOff` → delete the pod (see issue #1 above)
+   - `Pending` → check CPU budget (see issue #2)
+
+2. **Check PostgreSQL** — should be 3/3 Running; if not, this is a deeper issue
+   ```bash
+   kubectl get pod -l 'app.kubernetes.io/component=postgresql'
+   ```
+
+3. **Check Gitea app pod**
+   ```bash
+   kubectl get pod -l 'app.kubernetes.io/component=gitea'
+   kubectl logs <gitea-pod> --tail=20
+   ```
+   - DB connect errors → PGPool issue (go to step 1)
+   - Init container crash → check `kubectl logs <pod> -c configure-gitea`
+
+4. **Verify end-to-end**
+   ```bash
+   curl -s -o /dev/null -w "%{http_code}" http://92.205.130.254:32166/
+   # expect: 200
+   ```
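The checklist above can be condensed into a lookup from a pod's STATUS column to the next step. This is a sketch only: `next_action` is a hypothetical helper, the component names mirror the label values used in this runbook, and the status strings are limited to the ones the runbook mentions.

```bash
#!/usr/bin/env bash
# next_action COMPONENT STATUS
# Maps a pod STATUS (as printed by `kubectl get pod`) to the runbook step to follow.
next_action() {
  local component=$1 status=$2
  case "$component:$status" in
    *:Running)                   echo "ok" ;;
    pgpool:CrashLoopBackOff)     echo "delete the pgpool pod (issue 1)" ;;
    pgpool:Pending|gitea:Pending) echo "check CPU budget (issue 2)" ;;
    postgresql:*)                echo "deeper issue: inspect the PostgreSQL pods" ;;
    gitea:*)                     echo "check gitea logs; DB connect errors point back to pgpool" ;;
    *)                           echo "unknown: inspect with kubectl describe" ;;
  esac
}
```

One way to feed it, assuming a single pgpool pod, is `next_action pgpool "$(kubectl get pod -l 'app.kubernetes.io/component=pgpool' --no-headers | awk '{print $3}')"`.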
+
+---
+
+## Node Resource Budget (approximate)
+
+| Component | CPU Request |
+|-----------|------------|
+| postgresql-ha-postgresql × 3 | 750m |
+| pgpool | 100m (after 2026-03-25 fix, was 250m) |
+| valkey-cluster × 3 | 300m |
+| gitea app | ~100m (init containers) |
+| SSO stack (authelia, lldap, privacyidea, keycape) | ~225m |
+| System (coredns, metrics-server, traefik) | ~200m |
+| **Total** | **~1675m** |
+
+Node capacity: ~2000m. Headroom is tight (~325m). Avoid adding workloads without
+reviewing resource requests first.
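The total and headroom in the table can be re-derived with a few lines of shell arithmetic, which is handy when a request changes. A minimal sketch: the figures are copied from the table, the 2000m capacity is the approximate value stated above, and `sum_millicores` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# CPU requests in millicores, in table order:
# postgresql x3, pgpool, valkey x3, gitea app, SSO stack, system
requests=(750 100 300 100 225 200)
capacity=2000   # approximate node capacity (~2 vCPUs), an assumption from the text

# sum_millicores N...: prints the sum of its integer arguments.
sum_millicores() {
  local total=0 n
  for n in "$@"; do
    total=$((total + n))
  done
  echo "$total"
}

total=$(sum_millicores "${requests[@]}")
headroom=$((capacity - total))
echo "total=${total}m headroom=${headroom}m"
# prints: total=1675m headroom=325m
```

Before adding a workload, append its request to `requests` and check that `headroom` stays positive.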