---
title: Runbook - Gitea on COULOMBCORE
tags: [gitea, coulombcore, k3s, postgresql-ha]
created: 2026-03-25
updated: 2026-03-26
---

# Runbook: Gitea on COULOMBCORE

Gitea runs on the single-node k3s cluster at COULOMBCORE (`92.205.130.254`, user `tegwick`). It uses the Bitnami `postgresql-ha` chart (PGPool + 3-node PostgreSQL with repmgr) and a Valkey cluster for caching.

---

## Access

```bash
# SSH (requires ~/.ssh/id_ops)
ssh -i ~/.ssh/id_ops tegwick@92.205.130.254

# Web UI: http://92.205.130.254:32166
# NodePort 32166 → gitea svc → pod :3000

# Check all Gitea pods
kubectl get pods -l 'app.kubernetes.io/instance=gitea'
```

---

## Helm Release

| Field | Value |
|-------|-------|
| Release name | `gitea` |
| Namespace | `default` |
| Chart | `gitea/gitea` |
| Current version | 12.5.0 (Gitea 1.25.4) |

```bash
helm list -n default
helm history gitea -n default
helm get values gitea -n default
```

---

## Known Issues

### 1. PGPool CrashLoopBackOff: containerd `StartError: cannot start a stopped process`

**Symptom:** The `gitea-postgresql-ha-pgpool-*` pod is in `CrashLoopBackOff`. `kubectl describe pod` shows:

```
Last State: Terminated
  Reason: StartError
  Message: failed to start containerd task "...": cannot start a stopped process: unknown
  Exit Code: 128
```

**Root cause:** Containerd state corruption on the k3s node. The container task is recorded as "stopped" in containerd's internal state, but the process never actually ran, so every restart attempt fails immediately with exit code 128. This is not a config or auth issue.

**Fix:** Delete the pod. The ReplicaSet controller recreates it with a fresh containerd task.

```bash
# Select by label; passing `-o name` output (pod/<name>) to `kubectl delete pod` is rejected
kubectl delete pod -l 'app.kubernetes.io/component=pgpool'
```

Wait 30s, then confirm the new pod comes up `1/1 Running`.

**Cascade effect:** PGPool down → `gitea-postgresql-ha-pgpool` ClusterIP (`10.43.242.51:5432`) unreachable → Gitea app pod exhausts 10 DB connection attempts → exits → `CrashLoopBackOff`. Fixing PGPool automatically unblocks Gitea.

---

### 2. Gitea pods Pending: Insufficient CPU

**Symptom:** A new pod is stuck in `Pending` with the scheduler event:

```
0/1 nodes are available: 1 Insufficient cpu.
```

**Root cause:** The single-node cluster has ~2 vCPUs, and CPU requests routinely approach 98% of allocatable capacity. PGPool defaults to a 250m CPU request; combined with 3x PostgreSQL at 250m each, Valkey, the SSO stack, and monitoring, the budget is nearly exhausted.

**Check:**

```bash
kubectl describe node | grep -A6 "Allocated resources"
```

**Fix:** Reduce the PGPool CPU request via a Helm upgrade, then delete any stale crashing pods:

```bash
# Reduce pgpool from 250m to 100m (safe: pgpool is a lightweight connection pooler)
helm upgrade gitea gitea/gitea --version -n default \
  --reuse-values \
  --set 'postgresql-ha.pgpool.resources.requests.cpu=100m' \
  --set 'postgresql-ha.pgpool.resources.limits.cpu=200m'

# Delete the stuck old Gitea pod if it's crashlooping
kubectl delete pod
```

This frees ~250m (old pgpool, if crashing) + 100m (old gitea) = 350m, which is enough to schedule the new PGPool (100m) + new Gitea (100m via init containers).

**After the fix:** The rolling update from the blocked deployment should complete on its own once both pods can schedule and Gitea can reach PGPool.

---

## Recovery Checklist

When Gitea is down, work through this in order:

1. **Check PGPool** (most common root cause)
   ```bash
   kubectl get pod -l 'app.kubernetes.io/component=pgpool'
   ```
   - `CrashLoopBackOff` → delete the pod (see issue #1 above)
   - `Pending` → check CPU budget (see issue #2)

2. **Check PostgreSQL**: all 3 pods should be `Running`; if not, this is a deeper issue
   ```bash
   kubectl get pod -l 'app.kubernetes.io/component=postgresql'
   ```

3. **Check Gitea app pod**
   ```bash
   kubectl get pod -l 'app.kubernetes.io/component=gitea'
   kubectl logs --tail=20
   ```
   - DB connect errors → PGPool issue (go to step 1)
   - Init container crash → check `kubectl logs -c configure-gitea`

4. **Verify end-to-end**
   ```bash
   curl -s -o /dev/null -w "%{http_code}" http://92.205.130.254:32166/
   # expect: 200
   ```

---

### 3. Node overload: runaway agent process (SSH dies, k3s unresponsive)

**Symptom:** SSH connections time out during banner exchange. The k3s API returns a TLS handshake timeout. `top` (via console) shows load average >100, 99.8% `sy` CPU, many running tasks, and kswapd0 at high CPU. The state-hub reverse tunnel may still be alive (it was established before the overload and requires no new connections).

**Root cause:** A runaway process (typically a Claude Code agent spawning subprocesses) exhausts the process/memory budget. With no swap, the kernel thrashes continuously.

**Triage (workstation):**

```bash
# Check if node is alive despite SSH being down
curl -s --max-time 5 http://127.0.0.1:8000/state/health   # via reverse tunnel

# k3s API: will time out if node is thrashing
kubectl get nodes   # expect TLS timeout
```

**Fix (requires console/VNC access):**

```bash
# 1. Identify runaway: look for high VIRT, many children, 99.8% sy in top
#    Runaway claude agents: massive VIRT (>50GB), user tegwick

# 2. Kill the offenders
kill -9
kill -9   # apport in D-state amplifies load

# 3. Wait ~60s for load to drop; SSH will start accepting connections

# 4. Check PostgreSQL HA pods; they may need 2-3 min to resync after OOM restarts
kubectl get pods -l 'app.kubernetes.io/name=postgresql-ha'
```

**Gitea does NOT need to be restarted.** It survives node overload. Once load drops and PostgreSQL HA resyncs, Gitea serves requests again.

**Prevention:** See "Robustness" section below.
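
The workstation triage for node overload comes down to two probes and a decision. The following is a minimal sketch of that decision, assuming the probe exit statuses are collected separately; the function name `classify_node` and its verdict strings are hypothetical and not part of any existing tooling on COULOMBCORE.

```shell
#!/usr/bin/env bash
# Sketch only: turn the two triage probe results into a verdict.
# classify_node and the verdict strings are hypothetical helpers.
#   $1 = exit status of the reverse-tunnel health probe (0 = answered)
#   $2 = exit status of the k3s API probe (0 = answered)
classify_node() {
  local tunnel_rc=$1 api_rc=$2
  if [ "$api_rc" -eq 0 ]; then
    # API answers, so the node is not thrashing; use the Recovery Checklist.
    echo "api-ok: k3s API responds, continue with the Recovery Checklist"
  elif [ "$tunnel_rc" -eq 0 ]; then
    # Tunnel alive but API timing out matches the overload symptom.
    echo "overloaded: node alive via tunnel but k3s API unresponsive, console access needed"
  else
    # Neither probe answers; liveness cannot be confirmed from the workstation.
    echo "unreachable: no response via tunnel or k3s API"
  fi
}
```

One way to feed it, under the same assumptions: run `curl -sf --max-time 5 http://127.0.0.1:8000/state/health >/dev/null; t=$?`, then `kubectl get nodes --request-timeout=10s >/dev/null 2>&1; a=$?`, and finally `classify_node "$t" "$a"`.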
---

## Node Resource Budget (approximate)

| Component | CPU Request |
|-----------|-------------|
| postgresql-ha-postgresql × 3 | 750m |
| pgpool | 100m (after 2026-03-25 fix, was 250m) |
| valkey-cluster × 3 | 300m |
| gitea app | ~100m (init containers) |
| SSO stack (authelia, lldap, privacyidea, keycape) | ~225m |
| System (coredns, metrics-server, traefik) | ~200m |
| **Total** | **~1675m** |

Node capacity: ~2000m. Headroom is tight (~325m). Avoid adding workloads without reviewing resource requests first.

---

## Robustness: Hardening Checklist

These changes reduce the blast radius of process/memory overload (INC-002, 2026-03-26):

### 1. Add swap (not yet done, highest priority)

```bash
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab
```

Without swap, any memory spike causes immediate kernel thrash. A 4 GB swapfile buys buffer time.

### 2. Cap tegwick user nproc (not yet done)

```bash
# /etc/security/limits.conf
tegwick hard nproc 512
tegwick soft nproc 256
```

This prevents a single agent from spawning 500+ processes. Claude Code agents run fine within 256 soft / 512 hard.

### 3. Cap tegwick systemd user session memory (not yet done)

```bash
# Create override for the tegwick user slice
mkdir -p /etc/systemd/system/user-$(id -u tegwick).slice.d/
cat > /etc/systemd/system/user-$(id -u tegwick).slice.d/limits.conf </dev/null | awk '$4 ~ /CrashLoop|OOMKill|Error/ && $5+0 > 3 {print}' | grep . && curl -s -X POST -d "k3s pod unhealthy on COULOMBCORE" || true
```

Or via a state-hub progress event so it surfaces in the dashboard.

Threshold: any pod with restart count > 3 and status not Running/Completed warrants a notification. This single check covers the failure mode from INC-001 (PGPool crashlooping for 13 days undetected) without adding tunnel infrastructure that can't help under node overload.
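
The restart threshold can be expressed as a small filter over pod listings. This is a minimal sketch, assuming the input has the column layout of `kubectl get pods -A --no-headers` (namespace, name, ready, status, restarts, age). The function name `unhealthy_pods` and the `RESTART_THRESHOLD` variable are hypothetical. It implements the rule as stated in prose (status not Running/Completed), which is broader than the status regex in the one-liner.

```shell
#!/usr/bin/env bash
# Sketch only: list pods whose restart count exceeds a threshold and whose
# status is neither Running nor Completed.
# Assumes stdin looks like `kubectl get pods -A --no-headers`:
#   NAMESPACE NAME READY STATUS RESTARTS AGE
# unhealthy_pods and RESTART_THRESHOLD are hypothetical names.
RESTART_THRESHOLD="${RESTART_THRESHOLD:-3}"

unhealthy_pods() {
  awk -v max="$RESTART_THRESHOLD" '
    # $4 = STATUS, $5 = restart count (a trailing "(2m ago)" lands in later fields)
    $4 != "Running" && $4 != "Completed" && ($5 + 0) > max {
      print $1 "/" $2 " status=" $4 " restarts=" ($5 + 0)
    }'
}
```

For example, `kubectl get pods -A --no-headers | unhealthy_pods` prints one line per offending pod, and empty output means nothing needs attention. The output can then be piped into whichever notification path is chosen.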