the-custodian/ops/runbooks/gitea-coulombcore.md

---
title: Runbook — Gitea on COULOMBCORE
tags: [gitea, coulombcore, k3s, postgresql-ha]
created: 2026-03-25
updated: 2026-03-26
---

# Runbook: Gitea on COULOMBCORE

Gitea runs on the single-node k3s cluster at COULOMBCORE (`92.205.130.254`, user `tegwick`).
It uses the Bitnami `postgresql-ha` chart (Pgpool-II in front of 3 PostgreSQL nodes managed by repmgr) and a Valkey cluster for caching.

---

## Access

```bash
# SSH (requires ~/.ssh/id_ops)
ssh -i ~/.ssh/id_ops tegwick@92.205.130.254

# Web UI (NodePort 32166 → gitea svc → pod :3000), open in a browser:
#   http://92.205.130.254:32166

# Check all Gitea pods
kubectl get pods -l 'app.kubernetes.io/instance=gitea'
```

---

## Helm Release

| Field | Value |
|-------|-------|
| Release name | `gitea` |
| Namespace | `default` |
| Chart | `gitea/gitea` |
| Current version | 12.5.0 (Gitea 1.25.4) |

```bash
helm list -n default
helm history gitea -n default
helm get values gitea -n default
```

---

## Known Issues

### 1. PGPool CrashLoopBackOff — containerd `StartError: cannot start a stopped process`

**Symptom:** `gitea-postgresql-ha-pgpool-*` pod is in `CrashLoopBackOff`. Describe shows:
```
Last State: Terminated
  Reason: StartError
  Message: failed to start containerd task "...": cannot start a stopped process: unknown
  Exit Code: 128
```

**Root cause:** Containerd state corruption on the k3s node — the container task is recorded
as "stopped" in containerd's internal state but the process never actually ran. This causes
every restart attempt to fail immediately with exit code 128. Not a config or auth issue.
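
Before deleting the pod, it can help to confirm that the failure matches this signature rather than a config or auth error. The following is a minimal sketch: the function name `is_containerd_start_error` is hypothetical, and the usage comment assumes the pgpool container is the first entry in `containerStatuses`.

```bash
# Hypothetical helper: succeeds only when the last termination matches the
# StartError / exit code 128 signature described above.
is_containerd_start_error() {
  local reason="$1" exit_code="$2"
  [ "$reason" = "StartError" ] && [ "$exit_code" = "128" ]
}

# Usage sketch (assumes the pgpool container is containerStatuses[0]):
#   pod=$(kubectl get pod -l 'app.kubernetes.io/component=pgpool' \
#         -o jsonpath='{.items[0].metadata.name}')
#   reason=$(kubectl get pod "$pod" \
#         -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')
#   code=$(kubectl get pod "$pod" \
#         -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
#   is_containerd_start_error "$reason" "$code" && echo "matches issue 1: delete the pod"
```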

**Fix:** Delete the pod. The ReplicaSet controller recreates it with a fresh containerd task.

```bash
# Select by label; passing `-o name` output (pod/<name>) after `delete pod` is rejected by kubectl
kubectl delete pod -l 'app.kubernetes.io/component=pgpool'
```

Wait 30s then confirm it comes up `1/1 Running`.
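
As an alternative to a fixed 30s wait, `kubectl wait` can block until the replacement pod reports Ready. This is a sketch, assuming the label selector matches only the pgpool pod; the function name and the 120s default timeout are illustrative.

```bash
# Hypothetical wrapper: blocks until the pgpool pod is Ready or the timeout expires.
wait_pgpool_ready() {
  local timeout="${1:-120s}"
  kubectl wait --for=condition=Ready pod \
    -l 'app.kubernetes.io/component=pgpool' \
    --timeout="$timeout"
}

# Usage:
#   wait_pgpool_ready 180s && kubectl get pod -l 'app.kubernetes.io/component=pgpool'
```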

**Cascade effect:** PGPool down → `gitea-postgresql-ha-pgpool` ClusterIP (`10.43.242.51:5432`)
unreachable → Gitea app pod exhausts 10 DB connection attempts → exits → `CrashLoopBackOff`.
Fixing PGPool automatically unblocks Gitea.

---

### 2. Gitea pods Pending — Insufficient CPU

**Symptom:** New pod stuck in `Pending` with scheduler event:
```
0/1 nodes are available: 1 Insufficient cpu.
```

**Root cause:** The single-node cluster has ~2 vCPUs. CPU requests routinely approach 98%
allocation. PGPool defaults to 250m CPU request; combined with 3x PostgreSQL at 250m each,
Valkey, SSO stack, and monitoring, the budget is nearly exhausted.

**Check:**
```bash
kubectl describe node | grep -A6 "Allocated resources"
```
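
To turn that check into a single number, the CPU request percentage can be pulled out of the `Allocated resources` table. The helper below is a sketch; `cpu_request_pct` is a hypothetical name, and it assumes the usual `kubectl describe node` layout where the `cpu` row reads like `cpu  1950m (97%)  2 (100%)`.

```bash
# Hypothetical helper: reads `kubectl describe node` output on stdin and prints
# the CPU request percentage from the "Allocated resources" table.
cpu_request_pct() {
  awk '
    /^Allocated resources:/ { in_section = 1; next }
    in_section && $1 == "cpu" {
      gsub(/[()%]/, "", $3)   # strip parentheses and percent sign from field 3
      print $3
      exit
    }
  '
}

# Usage:
#   kubectl describe node | cpu_request_pct
```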

**Fix:** Reduce PGPool CPU request via Helm upgrade, then delete any stale crashing pods:

```bash
# Reduce pgpool from 250m to 100m (safe — pgpool is a lightweight connection pooler)
helm upgrade gitea gitea/gitea --version <current> -n default \
  --reuse-values \
  --set 'postgresql-ha.pgpool.resources.requests.cpu=100m' \
  --set 'postgresql-ha.pgpool.resources.limits.cpu=200m'

# Delete the stuck old Gitea pod if it's crashlooping
kubectl delete pod <old-gitea-pod-name>
```

This frees ~250m (old PGPool, if crashing) + 100m (old Gitea) = 350m, which is enough to
schedule the new PGPool (100m) and the new Gitea pod (100m, requested via its init containers).

**After the fix:** The blocked Deployment's rolling update should complete on its own once
both pods can schedule and Gitea can reach PGPool.

---

## Recovery Checklist

When Gitea is down, work through this in order:

1. **Check PGPool** — most common root cause
   ```bash
   kubectl get pod -l 'app.kubernetes.io/component=pgpool'
   ```
   - `CrashLoopBackOff` → delete the pod (see issue #1 above)
   - `Pending` → check CPU budget (see issue #2)

2. **Check PostgreSQL** — should be 3/3 Running; if not, this is a deeper issue
   ```bash
   kubectl get pod -l 'app.kubernetes.io/component=postgresql'
   ```

3. **Check Gitea app pod**
   ```bash
   kubectl get pod -l 'app.kubernetes.io/component=gitea'
   kubectl logs <gitea-pod> --tail=20
   ```
   - DB connect errors → PGPool issue (go to step 1)
   - Init container crash → check `kubectl logs <pod> -c configure-gitea`

4. **Verify end-to-end**
   ```bash
   curl -s -o /dev/null -w "%{http_code}" http://92.205.130.254:32166/
   # expect: 200
   ```
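
After a fix, Gitea may take a while to answer, so a single `curl` can give a false negative. The sketch below polls until the endpoint returns 200; the function name, attempt count, and delay are illustrative assumptions, not values from the incident.

```bash
# Hypothetical helper: polls a URL until it returns HTTP 200 or attempts run out.
wait_for_http_200() {
  local url="$1" attempts="${2:-12}" delay="${3:-5}"
  local i=1 code=""
  while [ "$i" -le "$attempts" ]; do
    code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$url")
    if [ "$code" = "200" ]; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "no HTTP 200 from $url after $attempts attempts (last code: $code)" >&2
  return 1
}

# Usage:
#   wait_for_http_200 http://92.205.130.254:32166/ 12 5
```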

---

### 3. Node overload — runaway agent process (SSH dies, k3s unresponsive)

**Symptom:** SSH connections time out during banner exchange. k3s API returns TLS handshake
timeout. `top` (via console) shows load average >100, 99.8% `sy` CPU, many running tasks,
kswapd0 at high CPU. State-hub reverse tunnel may still be alive (it was established
before the overload and requires no new connections).

**Root cause:** A runaway process (typically a Claude Code agent spawning subprocesses)
exhausts the process/memory budget. With no swap, the kernel thrashes continuously.

**Triage (workstation):**
```bash
# Check if node is alive despite SSH being down
curl -s --max-time 5 http://127.0.0.1:8000/state/health   # via reverse tunnel

# k3s API — will timeout if node is thrashing
kubectl get nodes                                           # expect TLS timeout
```

**Fix (requires console/VNC access):**
```bash
# 1. Identify runaway: look for high VIRT, many children, 99.8% sy in top
#    Runaway claude agents: massive VIRT (>50GB), user tegwick

# 2. Kill the offenders
kill -9 <runaway-pid>
kill -9 <apport-pid-if-in-D-state>   # apport in D-state amplifies load

# 3. Wait ~60s for load to drop; SSH will start accepting connections
# 4. Check PostgreSQL HA pods — may need 2-3 min to resync after OOM restarts
kubectl get pods -l 'app.kubernetes.io/name=postgresql-ha'
```
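
Step 1 can be narrowed with a filter over `ps` output instead of scanning `top` by eye. This is a sketch under assumptions: `find_runaways` is a hypothetical name, the input columns are `pid user vsz stat comm` with `vsz` in KiB (as procps `ps` reports it), and the 50GB threshold is taken as 52428800 KiB.

```bash
# Hypothetical helper: reads "pid user vsz_kib stat comm" rows on stdin and
# prints rows for the given user whose virtual size exceeds the threshold.
# Default threshold: 50 * 1024 * 1024 KiB = 52428800 KiB.
find_runaways() {
  local user="${1:-tegwick}" min_vsz_kib="${2:-52428800}"
  awk -v user="$user" -v min="$min_vsz_kib" '$2 == user && $3 + 0 > min + 0'
}

# Usage (the header row is dropped because its USER column never equals the user):
#   ps -eo pid,user,vsz,stat,comm | find_runaways tegwick
```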

**Gitea does NOT need to be restarted** — it survives node overload. Once load drops
and PostgreSQL HA resyncs, Gitea serves requests again.

**Prevention:** See "Robustness" section below.

---

## Node Resource Budget (approximate)

| Component | CPU Request |
|-----------|------------|
| postgresql-ha-postgresql × 3 | 750m |
| pgpool | 100m (after 2026-03-25 fix, was 250m) |
| valkey-cluster × 3 | 300m |
| gitea app | ~100m (init containers) |
| SSO stack (authelia, lldap, privacyidea, keycape) | ~225m |
| System (coredns, metrics-server, traefik) | ~200m |
| **Total** | **~1675m** |

Node capacity: ~2000m. Headroom is tight (~325m). Avoid adding workloads without
reviewing resource requests first.
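
The headroom figure is simply capacity minus the sum of the requests in the table. A small sketch for re-running the arithmetic when a request changes (the function name `cpu_headroom_m` is hypothetical; all values are millicores):

```bash
# Hypothetical helper: headroom = capacity minus the sum of requests (millicores).
cpu_headroom_m() {
  local capacity="$1"; shift
  local total=0 request
  for request in "$@"; do
    total=$((total + request))
  done
  echo $((capacity - total))
}

# Values from the table: postgresql x3, pgpool, valkey x3, gitea, SSO, system
cpu_headroom_m 2000 750 100 300 100 225 200   # prints 325
```

With PGPool back at its 250m default, the same call yields 175m, which shows how little room the pre-fix configuration left.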

---

## Robustness — Hardening Checklist

These changes reduce blast radius from process/memory overload (INC-002, 2026-03-26):

### 1. Add swap (not yet done — highest priority)

```bash
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab
```

Without swap, any memory spike sends the kernel straight into thrashing. A 4 GB swapfile buys time to intervene.
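
Once the swapfile is enabled, the result can be verified from `/proc/meminfo`. A minimal sketch; `swap_total_kib` is a hypothetical helper name.

```bash
# Hypothetical helper: prints SwapTotal in KiB from a meminfo-style file.
swap_total_kib() {
  awk '$1 == "SwapTotal:" { print $2 }' "${1:-/proc/meminfo}"
}

# Usage (a 4G swapfile should report roughly 4194304 KiB):
#   if [ "$(swap_total_kib)" -gt 0 ]; then echo "swap active"; else echo "no swap"; fi
```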

### 2. Cap tegwick user nproc (not yet done)

```bash
# /etc/security/limits.conf
tegwick hard nproc 512
tegwick soft nproc 256
```

Prevents a single agent from spawning more than 512 processes (the hard limit). Claude Code
agents run fine within 256 soft / 512 hard.
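
Before applying the limit, it is worth measuring how many tasks the user normally runs, since on Linux the `nproc` limit counts threads as well as processes. A sketch, assuming procps `ps`; `count_user_tasks` is a hypothetical name.

```bash
# Hypothetical helper: counts tasks (processes and threads) owned by a user.
# -L lists one row per thread, -U selects by real user ID, "pid=" drops the header.
count_user_tasks() {
  ps -L -U "${1:-tegwick}" -o pid= | wc -l | tr -d ' '
}

# Usage:
#   echo "tegwick tasks: $(count_user_tasks tegwick) (soft limit 256, hard limit 512)"
```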

### 3. Cap tegwick systemd user session memory (not yet done)

```bash
# Create override for the tegwick user slice
mkdir -p /etc/systemd/system/user-$(id -u tegwick).slice.d/
cat > /etc/systemd/system/user-$(id -u tegwick).slice.d/limits.conf <<EOF
[Slice]
MemoryMax=1500M
MemorySwapMax=512M
EOF
systemctl daemon-reload
```

Prevents a rogue user process from consuming all 3.9GB.
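
After `daemon-reload`, the effective limit can be read back from systemd. This is a sketch; `slice_memory_max` is a hypothetical name, and it assumes a systemd version that supports `--value`. With the drop-in applied, `1500M` should be reported as 1572864000 bytes.

```bash
# Hypothetical helper: prints the MemoryMax systemd reports for a user's slice.
slice_memory_max() {
  local uid
  uid=$(id -u "${1:-tegwick}") || return 1
  systemctl show "user-${uid}.slice" --property=MemoryMax --value
}

# Usage:
#   slice_memory_max tegwick
```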

### 4. Always-on agent guardrails (process hygiene)

- **Never run `/ralph-loop` directly on COULOMBCORE** — use `/ralph-workplan` which
  self-terminates when the workplan is complete (HEUREKA stop condition).
- Set `--max-iterations` explicitly on any Ralph invocation.
- Avoid large parallel agent fans (e.g., spawning 20 sub-agents simultaneously) on
  this resource-constrained node.

### 5. Add cluster health alerting (not yet done)

A per-service tunnel adds passive visibility but no alerting. A single cron covering the
whole cluster is more useful — it catches Gitea, PGPool, and any other crashlooping pod.

```bash
# /etc/cron.d/k3s-pod-health  (on CoulombCore, run as tegwick)
*/5 * * * * tegwick kubectl get pods -A 2>/dev/null | awk '$4 ~ /CrashLoop|OOMKill|Error/ && $5+0 > 3 {print}' | grep . && curl -s -X POST <notify-webhook> -d "k3s pod unhealthy on COULOMBCORE" || true
```

Alternatively, emit a state-hub progress event so the alert surfaces in the dashboard.
Threshold: any pod with a restart count > 3 and a status other than Running/Completed
warrants a notification.

This single check covers the failure mode from INC-001 (PGPool crashlooping 13 days
undetected) without adding tunnel infrastructure that can't help under node overload.
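
The one-line cron filter is hard to test in place, so the same check can be kept as a small function and exercised against sample output. This is a sketch: `unhealthy_pods` and `NOTIFY_WEBHOOK` are hypothetical names, and it assumes the `kubectl get pods -A` column order NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE.

```bash
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and prints one
# line per pod matching the cron filter (bad status and more than 3 restarts).
unhealthy_pods() {
  awk 'NR > 1 && $4 ~ /CrashLoop|OOMKill|Error/ && $5 + 0 > 3 {
    printf "%s/%s %s restarts=%d\n", $1, $2, $4, $5
  }'
}

# Usage sketch (NOTIFY_WEBHOOK stands in for the <notify-webhook> placeholder):
#   report=$(kubectl get pods -A 2>/dev/null | unhealthy_pods)
#   [ -n "$report" ] && curl -s -X POST "$NOTIFY_WEBHOOK" -d "$report"
```

When run from cron, this assumes `kubectl` is on cron's `PATH` and that the user can read the kubeconfig; otherwise the `2>/dev/null` hides the failure and no alert is ever sent.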