---
title: Runbook — Gitea on COULOMBCORE
tags: [gitea, coulombcore, k3s, postgresql-ha]
created: 2026-03-25
updated: 2026-03-26
---

# Runbook: Gitea on COULOMBCORE

Gitea runs on the single-node k3s cluster at COULOMBCORE (`92.205.130.254`, user `tegwick`).
It uses Bitnami `postgresql-ha` (PGPool + 3-node PostgreSQL replicated with repmgr) and a Valkey cluster for caching.

---

## Access

```bash
# SSH (requires ~/.ssh/id_ops)
ssh -i ~/.ssh/id_ops tegwick@92.205.130.254

# Web UI (NodePort 32166 → gitea svc → pod :3000)
# http://92.205.130.254:32166

# Check all Gitea pods
kubectl get pods -l 'app.kubernetes.io/instance=gitea'
```

---

## Helm Release

| Field | Value |
|-------|-------|
| Release name | `gitea` |
| Namespace | `default` |
| Chart | `gitea/gitea` |
| Current version | 12.5.0 (Gitea 1.25.4) |

```bash
helm list -n default
helm history gitea -n default
helm get values gitea -n default
```

---

## Known Issues

### 1. PGPool CrashLoopBackOff — containerd `StartError: cannot start a stopped process`

**Symptom:** `gitea-postgresql-ha-pgpool-*` pod is in `CrashLoopBackOff`. `kubectl describe pod` shows:
```
Last State:     Terminated
  Reason:       StartError
  Message:      failed to start containerd task "...": cannot start a stopped process: unknown
  Exit Code:    128
```

**Root cause:** Containerd state corruption on the k3s node — the container task is recorded
as "stopped" in containerd's internal state but the process never actually ran. This causes
every restart attempt to fail immediately with exit code 128. Not a config or auth issue.

**Fix:** Delete the pod. The ReplicaSet controller recreates it with a fresh containerd task.

```bash
# Select by label; passing `-o name` output (pod/<name>) after `delete pod` is rejected by kubectl
kubectl delete pod -l 'app.kubernetes.io/component=pgpool'
```

Wait 30s then confirm it comes up `1/1 Running`.

**Cascade effect:** PGPool down → `gitea-postgresql-ha-pgpool` ClusterIP (`10.43.242.51:5432`)
unreachable → Gitea app pod exhausts 10 DB connection attempts → exits → `CrashLoopBackOff`.
Fixing PGPool automatically unblocks Gitea.

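The delete-and-confirm step for PGPool can be scripted. This is a minimal sketch, assuming `kubectl` already points at the COULOMBCORE cluster and that the pod lives in the `default` namespace like the Helm release. The function name `restart_pgpool` and the 120s default timeout are illustrative and not part of any existing tooling. It uses `kubectl wait` to block until the replacement pod reports Ready rather than sleeping for a fixed 30s.

```bash
# Recreate the PGPool pod and wait for the replacement to become Ready.
# Hypothetical helper: the name and the default timeout are assumptions.
restart_pgpool() {
  selector='app.kubernetes.io/component=pgpool'
  timeout="${1:-120s}"

  # Deleting the pod forces a fresh containerd task for the replacement.
  kubectl delete pod -n default -l "$selector" || return 1

  # Blocks until the pod is Ready or the timeout expires.
  # If no pod matches the selector yet, kubectl wait exits with an error; re-run it.
  kubectl wait pod -n default -l "$selector" --for=condition=Ready --timeout="$timeout"
}
```

Calling `restart_pgpool 180s` gives a slow node more time; a non-zero exit status means the pod did not become Ready and the CPU budget in issue #2 is worth checking next.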

---

### 2. Gitea pods Pending — Insufficient CPU

**Symptom:** New pod stuck in `Pending` with scheduler event:
```
0/1 nodes are available: 1 Insufficient cpu.
```

**Root cause:** The single-node cluster has ~2 vCPUs. CPU requests routinely approach 98%
allocation. PGPool defaults to a 250m CPU request; combined with 3x PostgreSQL at 250m each,
Valkey, the SSO stack, and monitoring, the budget is nearly exhausted.

**Check:**
```bash
kubectl describe node | grep -A6 "Allocated resources"
```

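To turn that check into a single number, the CPU request percentage can be pulled out of the `Allocated resources` section. The sketch below assumes the usual `kubectl describe node` layout, where the `cpu` row reads like `cpu  1960m (98%)  ...`; the function name `cpu_request_pct` is made up for this example.

```bash
# Print the CPU request percentage from `kubectl describe node` output on stdin.
# Assumes the cpu row has the form: cpu <requests> (<pct>%) <limits> (<pct>%)
cpu_request_pct() {
  awk '
    /Allocated resources:/ { in_section = 1; next }
    in_section && $1 == "cpu" {
      pct = $3                  # third field looks like "(98%)"
      gsub(/[()%]/, "", pct)    # strip the parentheses and percent sign
      print pct
      exit
    }
  '
}
```

Usage: `kubectl describe node | cpu_request_pct`. A value in the high 90s matches the `Insufficient cpu` symptom described here.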

**Fix:** Reduce PGPool CPU request via Helm upgrade, then delete any stale crashing pods:

```bash
# Reduce pgpool from 250m to 100m (safe — pgpool is a lightweight connection pooler)
helm upgrade gitea gitea/gitea --version <current> -n default \
  --reuse-values \
  --set 'postgresql-ha.pgpool.resources.requests.cpu=100m' \
  --set 'postgresql-ha.pgpool.resources.limits.cpu=200m'

# Delete the stuck old Gitea pod if it's crashlooping
kubectl delete pod <old-gitea-pod-name>
```

This frees ~250m (old pgpool, if crashing) + 100m (old gitea) = 350m, which is enough to
schedule the new PGPool (100m) + new Gitea (100m via init containers).

**After-fix:** The rolling update from the blocked deployment should self-complete once
both pods can schedule and Gitea can reach PGPool.

---

## Recovery Checklist

When Gitea is down, work through this in order:

1. **Check PGPool** — most common root cause
   ```bash
   kubectl get pod -l 'app.kubernetes.io/component=pgpool'
   ```
   - `CrashLoopBackOff` → delete the pod (see issue #1 above)
   - `Pending` → check CPU budget (see issue #2)

2. **Check PostgreSQL** — all 3 pods should be Running; if not, this is a deeper issue
   ```bash
   kubectl get pod -l 'app.kubernetes.io/component=postgresql'
   ```

3. **Check Gitea app pod**
   ```bash
   kubectl get pod -l 'app.kubernetes.io/component=gitea'
   kubectl logs <gitea-pod> --tail=20
   ```
   - DB connect errors → PGPool issue (go to step 1)
   - Init container crash → check `kubectl logs <pod> -c configure-gitea`

4. **Verify end-to-end**
   ```bash
   curl -s -o /dev/null -w "%{http_code}" http://92.205.130.254:32166/
   # expect: 200
   ```

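The first three checklist steps can be collapsed into one status pass. The following is a rough sketch rather than an existing script: the function names (`pod_statuses`, `pgpool_next_action`, `triage`) are hypothetical, and it assumes the component labels used in the checklist and the default `kubectl get pod` column order (NAME, READY, STATUS, ...).

```bash
# Print the STATUS column for every pod carrying the given component label.
pod_statuses() {
  kubectl get pod -l "app.kubernetes.io/component=$1" --no-headers | awk '{ print $3 }'
}

# Map a PGPool pod status to the next step in this runbook.
pgpool_next_action() {
  case "$1" in
    CrashLoopBackOff) echo "delete the pod (issue 1)" ;;
    Pending)          echo "check the CPU budget (issue 2)" ;;
    Running)          echo "ok, continue with PostgreSQL and Gitea" ;;
    *)                echo "inspect with kubectl describe pod" ;;
  esac
}

# One line per component, followed by the suggested PGPool action.
triage() {
  for component in pgpool postgresql gitea; do
    echo "$component: $(pod_statuses "$component" | tr '\n' ' ')"
  done
  echo "pgpool next step: $(pgpool_next_action "$(pod_statuses pgpool | head -n 1)")"
}
```

Running `triage` prints the statuses in checklist order, so the PGPool line is read first, matching the "most common root cause" ordering.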

---

### 3. Node overload — runaway agent process (SSH dies, k3s unresponsive)

**Symptom:** SSH connections time out during banner exchange. k3s API returns TLS handshake
timeout. `top` (via console) shows load average >100, 99.8% `sy` CPU, many running tasks,
kswapd0 at high CPU. State-hub reverse tunnel may still be alive (it was established
before the overload and requires no new connections).

**Root cause:** A runaway process (typically a Claude Code agent spawning subprocesses)
exhausts the process/memory budget. With no swap, the kernel thrashes continuously.

**Triage (workstation):**
```bash
# Check if node is alive despite SSH being down
curl -s --max-time 5 http://127.0.0.1:8000/state/health # via reverse tunnel

# k3s API — will timeout if node is thrashing
kubectl get nodes # expect TLS timeout
```

**Fix (requires console/VNC access):**
```bash
# 1. Identify runaway: look for high VIRT, many children, 99.8% sy in top
#    Runaway claude agents: massive VIRT (>50GB), user tegwick

# 2. Kill the offenders
kill -9 <runaway-pid>
# apport in D-state amplifies load; SIGKILL only takes effect once it leaves D-state
kill -9 <apport-pid-if-in-D-state>

# 3. Wait ~60s for load to drop; SSH will start accepting connections
# 4. Check PostgreSQL HA pods — may need 2-3 min to resync after OOM restarts
kubectl get pods -l 'app.kubernetes.io/name=postgresql-ha'
```

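When `top` is too sluggish to read, the "massive VIRT" check from step 1 can be done with a non-interactive `ps` listing. This is a sketch only: `find_runaways` is a hypothetical helper, the 50 threshold mirrors the ">50GB" hint above but is applied in GiB, and it assumes procps `ps`, which reports `vsz` in KiB.

```bash
# Filter `ps -eo pid,user,vsz,stat,comm` output (on stdin) for processes whose
# virtual size exceeds a threshold in GiB (default 50).
find_runaways() {
  min_gib="${1:-50}"
  awk -v min_kib="$((min_gib * 1024 * 1024))" '
    # Skip the header row; VSZ is the third column, in KiB.
    NR > 1 && $3 + 0 > min_kib {
      printf "%s %s %.1fGiB %s %s\n", $1, $2, $3 / 1048576, $4, $5
    }
  '
}
```

Usage on the console: `ps -eo pid,user,vsz,stat,comm | find_runaways 50`. The PIDs in the first column are the candidates for `kill -9`, and the STAT column shows whether a process is stuck in `D` state.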

**Gitea does NOT need to be restarted** — it survives node overload. Once load drops
and PostgreSQL HA resyncs, Gitea serves requests again.

**Prevention:** See the "Robustness" section below.

---

## Node Resource Budget (approximate)

| Component | CPU Request |
|-----------|------------|
| postgresql-ha-postgresql × 3 | 750m |
| pgpool | 100m (after 2026-03-25 fix, was 250m) |
| valkey-cluster × 3 | 300m |
| gitea app | ~100m (init containers) |
| SSO stack (authelia, lldap, privacyidea, keycape) | ~225m |
| System (coredns, metrics-server, traefik) | ~200m |
| **Total** | **~1675m** |

Node capacity: ~2000m. Headroom is tight (~325m). Avoid adding workloads without
reviewing resource requests first.

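The budget arithmetic is easy to re-run when a request changes. The helper below is an illustrative sketch (the name `cpu_headroom` is an assumption); it takes the node capacity followed by each request in millicores, using the approximate figures from the table.

```bash
# Sum CPU requests (millicores) and print the headroom against node capacity.
# First argument: capacity in millicores; remaining arguments: requests such as 750m.
cpu_headroom() {
  capacity="$1"; shift
  total=0
  for req in "$@"; do
    total=$((total + ${req%m}))   # strip the trailing "m" before adding
  done
  echo "requested=${total}m capacity=${capacity}m headroom=$((capacity - total))m"
}

# Values from the table: postgresql x3, pgpool, valkey x3, gitea, SSO, system
cpu_headroom 2000 750m 100m 300m 100m 225m 200m
# prints: requested=1675m capacity=2000m headroom=325m
```

Adding a candidate workload's request as one more argument shows immediately whether it still fits.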

---

## Robustness — Hardening Checklist

These changes reduce the blast radius from process/memory overload (INC-002, 2026-03-26):

### 1. Add swap (not yet done — highest priority)

```bash
# Run as root
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab
```

Without swap, any memory spike causes immediate kernel thrash. A 4GB swapfile buys buffer time.

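Re-running the `echo ... >> /etc/fstab` line appends a duplicate entry each time. A small guard avoids that; this is a sketch with a hypothetical helper name (`ensure_line`), not something already installed on the node.

```bash
# Append a line to a file only if that exact line is not already present.
ensure_line() {
  file="$1"
  line="$2"
  # -x matches whole lines only, -F treats the pattern as a fixed string
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Usage (as root):
#   ensure_line /etc/fstab '/swapfile none swap sw 0 0'
```

After enabling swap, `swapon --show` lists the active swapfile and its size.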
### 2. Cap tegwick user nproc (not yet done)

```bash
# /etc/security/limits.conf
tegwick hard nproc 512
tegwick soft nproc 256
```

Prevents a single agent from spawning 500+ processes. Claude Code agents survive fine
within 256 soft / 512 hard.

### 3. Cap tegwick systemd user session memory (not yet done)

```bash
# Run as root: create an override for the tegwick user slice
mkdir -p /etc/systemd/system/user-$(id -u tegwick).slice.d/
cat > /etc/systemd/system/user-$(id -u tegwick).slice.d/limits.conf <<EOF
[Slice]
MemoryMax=1500M
MemorySwapMax=512M
EOF
systemctl daemon-reload
```

Prevents a rogue user process from consuming all 3.9GB of memory on the node.

### 4. Always-on agent guardrails (process hygiene)

- **Never run `/ralph-loop` directly on COULOMBCORE** — use `/ralph-workplan` which
  self-terminates when the workplan is complete (HEUREKA stop condition).
- Set `--max-iterations` explicitly on any Ralph invocation.
- Avoid large parallel agent fans (e.g., spawning 20 sub-agents simultaneously) on
  this resource-constrained node.

### 5. Add cluster health alerting (not yet done)

A per-service tunnel adds passive visibility but no alerting. A single cron job covering the
whole cluster is more useful — it catches Gitea, PGPool, and any other crashlooping pod.

```bash
# /etc/cron.d/k3s-pod-health (on COULOMBCORE, run as tegwick)
*/5 * * * * tegwick kubectl get pods -A 2>/dev/null | awk '$4 ~ /CrashLoop|OOMKill|Error/ && $5+0 > 3 {print}' | grep . && curl -s -X POST <notify-webhook> -d "k3s pod unhealthy on COULOMBCORE" || true
```

Alternatively, emit a state-hub progress event so the alert surfaces in the dashboard. Threshold: any pod
with restart count > 3 and status not Running/Completed warrants a notification.

This single check covers the failure mode from INC-001 (PGPool crashlooping 13 days
undetected) without adding tunnel infrastructure that can't help under node overload.
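
The cron one-liner is hard to read and hard to test. As a sketch under the same assumptions (the `kubectl get pods -A` column order NAMESPACE, NAME, READY, STATUS, RESTARTS and the restart threshold of 3), the filter can live in a small function; `unhealthy_pods` and the `NOTIFY_WEBHOOK` variable are hypothetical names standing in for the `<notify-webhook>` placeholder.

```bash
# Read `kubectl get pods -A` output on stdin and print pods in a failing state
# with more than N restarts (default 3), one per line.
unhealthy_pods() {
  max_restarts="${1:-3}"
  awk -v max="$max_restarts" '
    # Skip the header; $4 is STATUS and $5 is RESTARTS when -A is used.
    NR > 1 && $4 ~ /CrashLoop|OOMKill|Error/ && $5 + 0 > max {
      print $1 "/" $2, $4, "restarts=" ($5 + 0)
    }
  '
}

# Example wiring (not executed here; NOTIFY_WEBHOOK is an assumed variable):
#   report="$(kubectl get pods -A 2>/dev/null | unhealthy_pods 3)"
#   [ -n "$report" ] && curl -s -X POST "$NOTIFY_WEBHOOK" -d "$report"
```

Sending the filtered lines as the message body tells the recipient which pod is failing, instead of only that something on COULOMBCORE is unhealthy.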