ops: establish ops/ directory with Gitea runbook and INC-001 incident report

- Create ops/runbooks/gitea-coulombcore.md — recovery checklist for Gitea
  on COULOMBCORE, documents containerd StartError pattern and CPU budget issue
- Create ops/incidents/2026-03-25-gitea-pgpool-crashloop.md — INC-001 post-mortem
  for 13-day Gitea outage (PGPool CrashLoopBackOff + rolling update CPU deadlock)
- Create ops/README.md — index for runbooks and incidents
- state-hub/dashboard/src/docs/connecting.md: add railiance01 tunnel config
  (was previously unsaved)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:30:44 +01:00
parent efbbef76b0
commit 41d239c166
4 changed files with 336 additions and 0 deletions


@@ -0,0 +1,159 @@
---
title: Runbook — Gitea on COULOMBCORE
tags: [gitea, coulombcore, k3s, postgresql-ha]
created: 2026-03-25
updated: 2026-03-25
---
# Runbook: Gitea on COULOMBCORE
Gitea runs on the single-node k3s cluster at COULOMBCORE (`92.205.130.254`, user `tegwick`).
It uses Bitnami `postgresql-ha` (PGPool in front of 3 PostgreSQL nodes managed by repmgr) and a Valkey cluster for caching.
---
## Access
```bash
# SSH (requires ~/.ssh/id_ops)
ssh -i ~/.ssh/id_ops tegwick@92.205.130.254
# Web UI (NodePort 32166 → gitea svc → pod :3000); open in a browser:
#   http://92.205.130.254:32166
# Check all Gitea pods
kubectl get pods -l 'app.kubernetes.io/instance=gitea'
```
---
## Helm Release
| Field | Value |
|-------|-------|
| Release name | `gitea` |
| Namespace | `default` |
| Chart | `gitea/gitea` |
| Current version | 12.5.0 (Gitea 1.25.4) |
```bash
helm list -n default
helm history gitea -n default
helm get values gitea -n default
```
---
## Known Issues
### 1. PGPool CrashLoopBackOff — containerd `StartError: cannot start a stopped process`
**Symptom:** `gitea-postgresql-ha-pgpool-*` pod is in `CrashLoopBackOff`. `kubectl describe pod` shows:
```
Last State: Terminated
Reason: StartError
Message: failed to start containerd task "...": cannot start a stopped process: unknown
Exit Code: 128
```
**Root cause:** Containerd state corruption on the k3s node — the container task is recorded
as "stopped" in containerd's internal state but the process never actually ran. This causes
every restart attempt to fail immediately with exit code 128. Not a config or auth issue.
**Fix:** Delete the pod. The ReplicaSet controller recreates it with a fresh containerd task.
```bash
# Select by label; passing `-o name` output (pod/<name>) after `delete pod` is rejected by kubectl
kubectl delete pod -l 'app.kubernetes.io/component=pgpool'
```
Wait about 30s, then confirm the replacement pod reaches `1/1 Running`.
**Cascade effect:** PGPool down → `gitea-postgresql-ha-pgpool` ClusterIP (`10.43.242.51:5432`)
unreachable → Gitea app pod exhausts 10 DB connection attempts → exits → `CrashLoopBackOff`.
Fixing PGPool automatically unblocks Gitea.
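The fix for this issue can be wrapped in a small guard so the pod is only deleted when its last termination reason really is `StartError`. This is a minimal sketch, not part of the deployed tooling: the function names and the 120s timeout are assumptions, and the label selector is the one used in this runbook.

```bash
# Sketch only: function names and the timeout are illustrative assumptions.
SELECTOR='app.kubernetes.io/component=pgpool'

# Print the last termination reason(s) recorded for the pgpool container(s).
pgpool_last_reason() {
  kubectl get pod -l "$SELECTOR" \
    -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}'
}

# Delete the pgpool pod and wait for its replacement, but only on StartError.
recycle_pgpool_if_start_error() {
  case "$(pgpool_last_reason)" in
    *StartError*)
      kubectl delete pod -l "$SELECTOR"
      # May report "no matching resources" if the replacement pod is not created yet.
      kubectl wait --for=condition=Ready pod -l "$SELECTOR" --timeout=120s
      ;;
    *)
      echo "pgpool last state is not StartError; not deleting" >&2
      return 1
      ;;
  esac
}
```

Calling `recycle_pgpool_if_start_error` on the node returns non-zero when the symptom does not match, which is a hint to look at issue 2 or the PostgreSQL pods instead.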
---
### 2. Gitea pods Pending — Insufficient CPU
**Symptom:** New pod stuck in `Pending` with scheduler event:
```
0/1 nodes are available: 1 Insufficient cpu.
```
**Root cause:** The single-node cluster has ~2 vCPUs. CPU requests routinely approach 98%
allocation. PGPool defaults to 250m CPU request; combined with 3x PostgreSQL at 250m each,
Valkey, SSO stack, and monitoring, the budget is nearly exhausted.
**Check:**
```bash
kubectl describe node | grep -A6 "Allocated resources"
```
**Fix:** Reduce PGPool CPU request via Helm upgrade, then delete any stale crashing pods:
```bash
# Reduce pgpool from 250m to 100m (safe — pgpool is a lightweight connection pooler)
helm upgrade gitea gitea/gitea --version <current> -n default \
  --reuse-values \
  --set 'postgresql-ha.pgpool.resources.requests.cpu=100m' \
  --set 'postgresql-ha.pgpool.resources.limits.cpu=200m'
# Delete the stuck old Gitea pod if it's crashlooping
kubectl delete pod <old-gitea-pod-name>
```
This frees ~250m (old pgpool, if crashing) + 100m (old gitea) = 350m, which is enough to
schedule the new PGPool (100m) + new Gitea (100m via init containers).
**After the fix:** The blocked rolling update should complete on its own once both pods
can be scheduled and Gitea can reach PGPool.
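To confirm the Helm change took effect, one option is to read the CPU request back from the PGPool Deployment and then watch both rollouts. The sketch below assumes the Deployments are named `gitea-postgresql-ha-pgpool` and `gitea` (inferred from the pod and service names in this runbook, not verified) and uses hypothetical timeouts.

```bash
# Sketch only: Deployment names and timeouts are assumptions.
PGPOOL_DEPLOY='gitea-postgresql-ha-pgpool'
GITEA_DEPLOY='gitea'

# verify_cpu_fix EXPECTED_REQUEST   (e.g. verify_cpu_fix 100m)
verify_cpu_fix() {
  expected="$1"
  # Read the CPU request of the first container in the pgpool pod template.
  actual="$(kubectl get deployment "$PGPOOL_DEPLOY" -n default \
    -o jsonpath='{.spec.template.spec.containers[0].resources.requests.cpu}')"
  if [ "$actual" != "$expected" ]; then
    echo "pgpool CPU request is '$actual', expected '$expected'" >&2
    return 1
  fi
  # Block until both rollouts finish or the timeouts expire.
  kubectl rollout status "deployment/$PGPOOL_DEPLOY" -n default --timeout=120s &&
    kubectl rollout status "deployment/$GITEA_DEPLOY" -n default --timeout=300s
}
```

Running `verify_cpu_fix 100m` after the upgrade fails fast if the `--set` values were not applied, before any time is spent waiting on a rollout that cannot succeed.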
---
## Recovery Checklist
When Gitea is down, work through this in order:
1. **Check PGPool** — most common root cause
```bash
kubectl get pod -l 'app.kubernetes.io/component=pgpool'
```
- `CrashLoopBackOff` → delete the pod (see issue #1 above)
- `Pending` → check CPU budget (see issue #2)
2. **Check PostgreSQL** — should be 3/3 Running; if not, this is a deeper issue
```bash
kubectl get pod -l 'app.kubernetes.io/component=postgresql'
```
3. **Check Gitea app pod**
```bash
kubectl get pod -l 'app.kubernetes.io/component=gitea'
kubectl logs <gitea-pod> --tail=20
```
- DB connect errors → PGPool issue (go to step 1)
- Init container crash → check `kubectl logs <pod> -c configure-gitea`
4. **Verify end-to-end**
```bash
curl -s -o /dev/null -w "%{http_code}" http://92.205.130.254:32166/
# expect: 200
```
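Steps 1 and 4 of the checklist can be sketched as shell helpers. The function names are hypothetical, and the status parsing assumes the default `kubectl get pod` column order (NAME, READY, STATUS, ...); treat this as a starting point rather than a tested tool.

```bash
# Sketch only: helper names are illustrative; URL is the NodePort from this runbook.
GITEA_URL='http://92.205.130.254:32166/'

# Print the STATUS column of the first pgpool pod.
pgpool_status() {
  kubectl get pod -l 'app.kubernetes.io/component=pgpool' --no-headers |
    awk 'NR==1 {print $3}'
}

# Map a pgpool pod STATUS to the next action in the checklist.
pgpool_next_step() {
  case "$1" in
    Running)          echo "pgpool ok: continue with PostgreSQL (step 2)" ;;
    CrashLoopBackOff) echo "delete the pgpool pod (issue 1)" ;;
    Pending)          echo "check the CPU budget (issue 2)" ;;
    *)                echo "unrecognised status '$1': inspect with kubectl describe" ;;
  esac
}

# Step 4: succeed only if the web UI answers with HTTP 200.
gitea_http_ok() {
  code="$(curl -s -o /dev/null -w '%{http_code}' "$GITEA_URL")"
  [ "$code" = "200" ]
}
```

For example, `pgpool_next_step "$(pgpool_status)"` prints the suggested action, and `gitea_http_ok && echo up` confirms recovery at the end.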
---
## Node Resource Budget (approximate)
| Component | CPU Request |
|-----------|------------|
| postgresql-ha-postgresql × 3 | 750m |
| pgpool | 100m (after 2026-03-25 fix, was 250m) |
| valkey-cluster × 3 | 300m |
| gitea app | ~100m (init containers) |
| SSO stack (authelia, lldap, privacyidea, keycape) | ~225m |
| System (coredns, metrics-server, traefik) | ~200m |
| **Total** | **~1675m** |
Node capacity: ~2000m. Headroom is tight (~325m). Avoid adding workloads without
reviewing resource requests first.
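The headroom figure can be recomputed whenever a request changes. The helper below is a hypothetical sketch that sums millicore requests taken from the table above and subtracts them from the approximate node capacity; the function names are illustrative.

```bash
# Sketch only: function names are illustrative; values are millicores.

# cpu_headroom CAPACITY_M REQUEST_M... -> prints remaining millicores
cpu_headroom() {
  capacity="$1"; shift
  total=0
  for req in "$@"; do
    total=$((total + req))
  done
  echo $((capacity - total))
}

# would_fit NEW_M CAPACITY_M REQUEST_M... -> succeeds if a new request fits
would_fit() {
  new="$1"; shift
  [ "$(cpu_headroom "$@")" -ge "$new" ]
}

# Table values: postgresql x3, pgpool, valkey x3, gitea, SSO stack, system
cpu_headroom 2000 750 100 300 100 225 200   # prints 325
```

With PGPool at its old 250m request the same sum leaves only 175m, so `would_fit 250 2000 750 250 300 100 225 200` fails, which matches the `Insufficient cpu` symptom seen during a rolling update.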