ops: establish ops/ directory with Gitea runbook and INC-001 incident report

- Create ops/runbooks/gitea-coulombcore.md — recovery checklist for Gitea
  on COULOMBCORE, documents containerd StartError pattern and CPU budget issue
- Create ops/incidents/2026-03-25-gitea-pgpool-crashloop.md — INC-001 post-mortem
  for 13-day Gitea outage (PGPool CrashLoopBackOff + rolling update CPU deadlock)
- Create ops/README.md — index for runbooks and incidents
- state-hub/dashboard/src/docs/connecting.md: add railiance01 tunnel config
  (was previously unsaved)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:30:44 +01:00
parent efbbef76b0
commit 41d239c166
4 changed files with 336 additions and 0 deletions

ops/README.md

@@ -0,0 +1,23 @@
# Ops Documentation
Operational runbooks and incident reports for the Railiance/Custodian infrastructure.
## Structure
```
ops/
runbooks/ — how-to guides for recurring operational tasks and known issues
incidents/ — post-incident reports (append-only, one file per incident)
```
## Runbooks
| Runbook | Covers |
|---------|--------|
| [gitea-coulombcore.md](runbooks/gitea-coulombcore.md) | Gitea on COULOMBCORE k3s — access, known issues, recovery checklist |
## Incidents
| ID | Date | Summary | Status |
|----|------|---------|--------|
| [INC-001](incidents/2026-03-25-gitea-pgpool-crashloop.md) | 2026-03-25 | Gitea down 13d — PGPool containerd StartError + CPU exhaustion | Resolved |

ops/incidents/2026-03-25-gitea-pgpool-crashloop.md

@@ -0,0 +1,122 @@
---
title: "INC-001: Gitea down — PGPool containerd StartError + CPU exhaustion"
date: 2026-03-25
severity: high
status: resolved
affected: gitea (http://92.205.130.254:32166)
environment: COULOMBCORE k3s cluster
duration: ~13 days (2026-03-12 to 2026-03-25)
resolved_by: Bernd Worsch / Claude
---
# INC-001: Gitea down — PGPool CrashLoopBackOff + CPU exhaustion
## Summary
Gitea was completely unavailable for approximately 13 days. The root cause was containerd
state corruption on the COULOMBCORE k3s node, which made PGPool fail on every start with
`StartError: cannot start a stopped process`. The failure cascaded to Gitea, which crashed
when it could not reach its database. Separately, the rollout of a concurrent Helm upgrade
(gitea chart 12.2.0 → 12.5.0) stalled because the new Gitea pod could not be scheduled:
CPU requests on the node were nearly exhausted.
---
## Timeline
| Time | Event |
|------|-------|
| 2026-03-10 | `helm upgrade gitea gitea/gitea` — chart 12.2.0 → 12.5.0 (Gitea 1.24.5 → 1.25.4) |
| 2026-03-12 | New PGPool pod created by Helm upgrade rolling restart. Pod enters CrashLoopBackOff immediately with `StartError: cannot start a stopped process` |
| 2026-03-12 | Gitea app pod crashes due to DB unreachable (PGPool down) |
| ~2026-03-22 | New Gitea pod from upgrade rollout attempts to schedule. Blocked: `Insufficient cpu` (CPU requests at 98%) |
| 2026-03-25 | Incident detected and diagnosed |
| 2026-03-25 09:28 | PGPool CPU request reduced from 250m → 100m via `helm upgrade --reuse-values` |
| 2026-03-25 09:28 | Old crashing Gitea pod deleted to free 100m CPU |
| 2026-03-25 09:29 | New PGPool pod (100m request) schedules and starts successfully (`1/1 Running`) |
| 2026-03-25 09:30 | New Gitea pod (pending 3 days) schedules, inits, starts (`1/1 Running`) |
| 2026-03-25 09:30 | Gitea HTTP endpoint returns 200. Incident resolved. |
---
## Root Causes
### Primary: Containerd state corruption (PGPool)
The PGPool pod was in `CrashLoopBackOff` with exit code 128 and message:
```
failed to start containerd task "b6de5dce...": cannot start a stopped process: unknown
```
This is a known containerd bug where a container task is left in an invalid "stopped" state
in containerd's internal database. Every restart attempt by Kubernetes immediately fails
because containerd refuses to start a task it believes is already stopped. The fix is to
delete the pod — the new pod gets a fresh containerd task ID and starts normally.
### Secondary: CPU exhaustion (rolling update blocked)
The Helm 12.2.0 → 12.5.0 upgrade triggered a rolling update for all Gitea components.
The new Gitea app pod (`gitea-f4f657c59-cmtdf`) could not schedule for 3+ days because
CPU requests were at 1975m/2000m (~98%). PGPool's default request of 250m was
unnecessarily high for a lightweight connection pooler on this resource-constrained
single-node cluster.
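The scheduling arithmetic can be checked by hand. The sketch below is illustrative only: `fits_on_node` is a hypothetical helper, not a kubectl feature, and it assumes the figures quoted in this report (1975m requested, 2000m capacity, roughly 100m for the Gitea pod).

```bash
# Hypothetical helper: would a pod with the given CPU request (millicores)
# fit on a node, given what is already requested and the node's capacity?
fits_on_node() {
  requested=$1   # millicores already requested on the node
  capacity=$2    # allocatable millicores on the node
  pod=$3         # CPU request of the pod waiting to be scheduled
  pct=$(( requested * 100 / capacity ))
  if [ $(( requested + pod )) -le "$capacity" ]; then
    echo "fits (node at ${pct}% before scheduling)"
  else
    echo "Insufficient cpu (node at ${pct}% before scheduling)"
    return 1
  fi
}

# The blocked Gitea pod during the incident
fits_on_node 1975 2000 100 || true
```

With only 25m unrequested, any pod asking for more than that stays `Pending` until other requests are reduced or removed.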
---
## Impact
- Gitea (code hosting) completely unavailable for ~13 days
- All repos with `remote_url` pointing to `92.205.130.254:32166` were unreachable
- No data loss — PostgreSQL HA pods remained running throughout
- Git push/pull and Gitea API calls failed for all consumers during the outage
---
## Resolution Steps
```bash
# 1. Reduce PGPool CPU request (primary blocker for scheduling)
helm upgrade gitea gitea/gitea --version 12.5.0 -n default \
  --reuse-values \
  --set 'postgresql-ha.pgpool.resources.requests.cpu=100m' \
  --set 'postgresql-ha.pgpool.resources.limits.cpu=200m'
# 2. Delete old crashing Gitea pod to free CPU and trigger fresh ReplicaSet pod
kubectl delete pod gitea-79f98f897f-khs26
# 3. Cleanup leftover node-debugger pods from prior investigation attempt
kubectl delete pod \
  node-debugger-254.130.205.92.host.secureserver.net-bn7qq \
  node-debugger-254.130.205.92.host.secureserver.net-vmd79
```
Within ~90 seconds:
- New PGPool pod scheduled and running
- Pending Gitea pod (3 days) scheduled, init containers ran, main container started
- Gitea HTTP 200 confirmed
---
## Follow-up Actions
- [ ] Add PGPool CPU resource override to `railiance-apps` Helm values file (currently
stored only in Helm release; values should be in git)
- [ ] Set up alerting for `CrashLoopBackOff` pods older than 30 minutes
- [ ] Review whether 3-node PostgreSQL HA is appropriate for a single-node cluster
(provides no actual HA benefit, consumes 750m CPU requests)
- [ ] Consider adding a CPU request budget dashboard panel to the Observable dashboard
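As a possible first step towards the `CrashLoopBackOff` alert, the sketch below lists crashlooping pods from `kubectl get pods --no-headers` output. `crashlooping_pods` is a hypothetical helper, the sample rows use illustrative restart counts and ages, and the 30-minute threshold is not implemented here.

```bash
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# and print the names of pods whose STATUS column is CrashLoopBackOff.
crashlooping_pods() {
  awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# Intended use on the cluster (not run here):
#   kubectl get pods -n default --no-headers | crashlooping_pods

# Illustrative sample rows (columns: NAME READY STATUS RESTARTS AGE)
printf '%s\n' \
  'gitea-79f98f897f-khs26 0/1 CrashLoopBackOff 12 (3m ago) 13d' \
  'gitea-postgresql-ha-postgresql-0 1/1 Running 0 40d' | crashlooping_pods
```

A real alert would also need the time since the pod first entered the state, which this column-based filter does not provide.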
---
## Lessons Learned
1. **Containerd StartError is not a config issue.** The error can look like a failure of the
workload itself, but it is containerd state corruption on the node. The fix is to delete the
pod so that it is recreated with a fresh containerd task.
This is now documented in the runbook: `ops/runbooks/gitea-coulombcore.md`
2. **Track Helm values in git.** The only custom value (`pgpool.adminPassword`) was in the
Helm release but not in `railiance-apps`. The resource fix applied here (pgpool CPU) also
lives only in the release, so a future `helm upgrade` run from a clean checkout without
`--reuse-values` would silently revert it.
All non-secret Helm values should live in `railiance-apps/`.
3. **Single-node CPU budget is tight.** At 98% CPU request allocation, any pod churn causes
scheduling failures. Resource requests need to be right-sized for this environment.

ops/runbooks/gitea-coulombcore.md

@@ -0,0 +1,159 @@
---
title: Runbook — Gitea on COULOMBCORE
tags: [gitea, coulombcore, k3s, postgresql-ha]
created: 2026-03-25
updated: 2026-03-25
---
# Runbook: Gitea on COULOMBCORE
Gitea runs on the single-node k3s cluster at COULOMBCORE (`92.205.130.254`, user `tegwick`).
It uses Bitnami `postgresql-ha` (PGPool + 3 PostgreSQL nodes managed by repmgr) and a Valkey cluster for caching.
---
## Access
```bash
# SSH (requires ~/.ssh/id_ops)
ssh -i ~/.ssh/id_ops tegwick@92.205.130.254
# Web UI
http://92.205.130.254:32166 # NodePort 32166 → gitea svc → pod :3000
# Check all Gitea pods
kubectl get pods -l 'app.kubernetes.io/instance=gitea'
```
---
## Helm Release
| Field | Value |
|-------|-------|
| Release name | `gitea` |
| Namespace | `default` |
| Chart | `gitea/gitea` |
| Current version | 12.5.0 (Gitea 1.25.4) |
```bash
helm list -n default
helm history gitea -n default
helm get values gitea -n default
```
---
## Known Issues
### 1. PGPool CrashLoopBackOff — containerd `StartError: cannot start a stopped process`
**Symptom:** `gitea-postgresql-ha-pgpool-*` pod is in `CrashLoopBackOff`. Describe shows:
```
Last State: Terminated
Reason: StartError
Message: failed to start containerd task "...": cannot start a stopped process: unknown
Exit Code: 128
```
**Root cause:** Containerd state corruption on the k3s node — the container task is recorded
as "stopped" in containerd's internal state but the process never actually ran. This causes
every restart attempt to fail immediately with exit code 128. Not a config or auth issue.
**Fix:** Delete the pod. The ReplicaSet controller recreates it with a fresh containerd task.
```bash
kubectl delete pod -l 'app.kubernetes.io/component=pgpool'
```
Wait 30s then confirm it comes up `1/1 Running`.
**Cascade effect:** PGPool down → `gitea-postgresql-ha-pgpool` ClusterIP (`10.43.242.51:5432`)
unreachable → Gitea app pod exhausts 10 DB connection attempts → exits → `CrashLoopBackOff`.
Fixing PGPool automatically unblocks Gitea.
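To tell this failure apart from an ordinary crash, the describe output can be checked for both markers shown in the symptom. This is a minimal sketch; `has_containerd_start_error` is a hypothetical helper that only pattern-matches text and never talks to the cluster.

```bash
# Hypothetical helper: read `kubectl describe pod` output on stdin and
# succeed only if it shows the containerd StartError signature from issue 1.
has_containerd_start_error() {
  input=$(cat)
  printf '%s\n' "$input" | grep -q 'Reason:[[:space:]]*StartError' &&
    printf '%s\n' "$input" | grep -q 'cannot start a stopped process'
}

# Intended use on the cluster (not run here):
#   kubectl describe pod <pgpool-pod> | has_containerd_start_error \
#     && echo "containerd StartError: delete the pod"
```

If the check fails, the pod is crashing for some other reason and its logs are the next place to look.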
---
### 2. Gitea pods Pending — Insufficient CPU
**Symptom:** New pod stuck in `Pending` with scheduler event:
```
0/1 nodes are available: 1 Insufficient cpu.
```
**Root cause:** The single-node cluster has ~2 vCPUs. CPU requests routinely approach 98%
allocation. PGPool defaults to 250m CPU request; combined with 3x PostgreSQL at 250m each,
Valkey, SSO stack, and monitoring, the budget is nearly exhausted.
**Check:**
```bash
kubectl describe node | grep -A6 "Allocated resources"
```
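To pull just the CPU requests figure out of that section, a small awk filter is enough. `cpu_requests` is a hypothetical helper, and the sketch assumes the usual `kubectl describe node` layout in which the `cpu` row lists requests first and limits second.

```bash
# Hypothetical helper: read `kubectl describe node` output on stdin and print
# the CPU requests value and percentage from the "Allocated resources" section.
cpu_requests() {
  awk '/^Allocated resources:/ { in_section = 1 }
       in_section && $1 == "cpu" { print $2, $3; exit }'
}

# Intended use on the cluster (not run here):
#   kubectl describe node | cpu_requests
```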
**Fix:** Reduce PGPool CPU request via Helm upgrade, then delete any stale crashing pods:
```bash
# Reduce pgpool from 250m to 100m (safe — pgpool is a lightweight connection pooler)
helm upgrade gitea gitea/gitea --version <current> -n default \
  --reuse-values \
  --set 'postgresql-ha.pgpool.resources.requests.cpu=100m' \
  --set 'postgresql-ha.pgpool.resources.limits.cpu=200m'
# Delete the stuck old Gitea pod if it's crashlooping
kubectl delete pod <old-gitea-pod-name>
```
This frees ~250m (old pgpool, if crashing) + 100m (old gitea) = 350m, which is enough to
schedule the new PGPool (100m) + new Gitea (100m via init containers).
**After-fix:** The rolling update from the blocked deployment should self-complete once
both pods can schedule and Gitea can reach PGPool.
---
## Recovery Checklist
When Gitea is down, work through this in order:
1. **Check PGPool** — most common root cause
```bash
kubectl get pod -l 'app.kubernetes.io/component=pgpool'
```
- `CrashLoopBackOff` → delete the pod (see issue #1 above)
- `Pending` → check CPU budget (see issue #2)
2. **Check PostgreSQL** — should be 3/3 Running; if not, this is a deeper issue
```bash
kubectl get pod -l 'app.kubernetes.io/component=postgresql'
```
3. **Check Gitea app pod**
```bash
kubectl get pod -l 'app.kubernetes.io/component=gitea'
kubectl logs <gitea-pod> --tail=20
```
- DB connect errors → PGPool issue (go to step 1)
- Init container crash → check `kubectl logs <pod> -c configure-gitea`
4. **Verify end-to-end**
```bash
curl -s -o /dev/null -w "%{http_code}" http://92.205.130.254:32166/
# expect: 200
```
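After a fix, Gitea can take a short while to come back, so the final check may need repeating. The sketch below wraps the same curl probe in a polling loop. `probe` and `wait_for_http_200` are hypothetical helpers, and the defaults of 18 attempts at 5-second intervals are assumptions, not values from this runbook.

```bash
# Probe a URL and print the HTTP status code (same curl flags as step 4).
probe() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Hypothetical helper: poll until the probe reports 200 or attempts run out.
wait_for_http_200() {
  url=$1
  attempts=${2:-18}   # assumption: number of tries
  delay=${3:-5}       # assumption: seconds between tries
  i=1
  while [ "$i" -le "$attempts" ]; do
    code=$(probe "$url") || true
    if [ "$code" = "200" ]; then
      echo "up after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$(( i + 1 ))
  done
  echo "still down after $attempts attempt(s), last status: $code"
  return 1
}

# Intended use (not run here):
#   wait_for_http_200 http://92.205.130.254:32166/
```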
---
## Node Resource Budget (approximate)
| Component | CPU Request |
|-----------|------------|
| postgresql-ha-postgresql × 3 | 750m |
| pgpool | 100m (after 2026-03-25 fix, was 250m) |
| valkey-cluster × 3 | 300m |
| gitea app | ~100m (init containers) |
| SSO stack (authelia, lldap, privacyidea, keycape) | ~225m |
| System (coredns, metrics-server, traefik) | ~200m |
| **Total** | **~1675m** |
Node capacity: ~2000m. Headroom is tight (~325m). Avoid adding workloads without
reviewing resource requests first.
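The total and headroom can be recomputed whenever a row changes. This is a minimal sketch: `budget_summary` is a hypothetical helper, the row labels are shortened for illustration, and the approximate values from the table are used as exact millicore figures.

```bash
# CPU requests in millicores, copied from the table above
# (approximate values entered as exact numbers).
budget='postgresql-ha-postgresql 750
pgpool 100
valkey-cluster 300
gitea 100
sso-stack 225
system 200'

capacity=2000   # approximate node capacity in millicores

# Hypothetical helper: sum column 2 on stdin and report headroom against $1.
budget_summary() {
  awk -v cap="$1" '
    { total += $2 }
    END { printf "requested=%dm headroom=%dm\n", total, cap - total }'
}

printf '%s\n' "$budget" | budget_summary "$capacity"
# prints: requested=1675m headroom=325m
```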

state-hub/dashboard/src/docs/connecting.md

@@ -135,6 +135,38 @@ tunnels:
      max_attempts: 0
      backoff_initial: 5
      backoff_max: 60
  state-hub-railiance01: # API tunnel
    host: 92.205.62.239
    remote_port: 18000
    local_port: 8000
    ssh_user: tegwick
    ssh_key: ~/.ssh/id_ops
    actor: agent.claude-railiance01
    health_check:
      url: http://127.0.0.1:8000/state/health
      interval_seconds: 30
      timeout_seconds: 5
    reconnect:
      max_attempts: 0
      backoff_initial: 5
      backoff_max: 60
  state-hub-mcp-railiance01: # MCP SSE tunnel
    host: 92.205.62.239
    remote_port: 18001
    local_port: 8001
    ssh_user: tegwick
    ssh_key: ~/.ssh/id_ops
    actor: agent.claude-railiance01
    health_check:
      url: http://127.0.0.1:18001/sse
      interval_seconds: 30
      timeout_seconds: 5
    reconnect:
      max_attempts: 0
      backoff_initial: 5
      backoff_max: 60
```
ops-bridge source: `~/ops-bridge` · SSH key: `~/.ssh/id_ops`
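The `reconnect` settings are not explained in this excerpt. One common reading, which is an assumption here and not documented ops-bridge behaviour, is exponential backoff: the delay starts at `backoff_initial` seconds, doubles after each failed attempt, and is capped at `backoff_max`. The hypothetical helper `backoff_delays` prints the schedule that reading would produce.

```bash
# Hypothetical helper: print the delay before each of the first N reconnect
# attempts, assuming the delay doubles each time and is capped at the maximum.
backoff_delays() {
  delay=$1   # initial delay in seconds (backoff_initial)
  max=$2     # upper bound in seconds (backoff_max)
  count=$3   # how many attempts to print
  i=1
  while [ "$i" -le "$count" ]; do
    echo "$delay"
    delay=$(( delay * 2 ))
    if [ "$delay" -gt "$max" ]; then
      delay=$max
    fi
    i=$(( i + 1 ))
  done
}

# Values from the config above: 5 10 20 40 60 60, one per line
backoff_delays 5 60 6
```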