the-custodian/ops/README.md
tegwick 41d239c166 ops: establish ops/ directory with Gitea runbook and INC-001 incident report
- Create ops/runbooks/gitea-coulombcore.md — recovery checklist for Gitea
  on COULOMBCORE, documents containerd StartError pattern and CPU budget issue
- Create ops/incidents/2026-03-25-gitea-pgpool-crashloop.md — INC-001 post-mortem
  for 13-day Gitea outage (PGPool CrashLoopBackOff + rolling update CPU deadlock)
- Create ops/README.md — index for runbooks and incidents
- state-hub/dashboard/src/docs/connecting.md: add railiance01 tunnel config
  (was previously unsaved)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:30:44 +01:00

Ops Documentation

Operational runbooks and incident reports for the Railiance/Custodian infrastructure.

Structure

ops/
  runbooks/   — how-to guides for recurring operational tasks and known issues
  incidents/  — post-incident reports (append-only, one file per incident)
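The one-file-per-incident layout can be sketched with a small naming helper. This is only an illustration: the `YYYY-MM-DD-slug.md` pattern is inferred from the single existing report, `ops/incidents/2026-03-25-gitea-pgpool-crashloop.md`, and the function name `incident_filename` is hypothetical, not part of any tooling in this repository.

```python
import re
from datetime import date


def incident_filename(day: date, title: str) -> str:
    """Build a 'YYYY-MM-DD-slug.md' name for a new incident report.

    Assumption: the date-prefixed slug pattern mirrors the existing
    incident file; it is not a documented convention.
    """
    # Lowercase the title and collapse runs of non-alphanumerics into single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    if not slug:
        raise ValueError("title must contain at least one letter or digit")
    return f"{day.isoformat()}-{slug}.md"


if __name__ == "__main__":
    print(incident_filename(date(2026, 3, 25), "Gitea PGPool crashloop"))
```

Deriving the name from the incident date and a short title keeps the directory sorted chronologically and makes the append-only rule easy to check by eye.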

Runbooks

| Runbook | Covers |
| --- | --- |
| gitea-coulombcore.md | Gitea on COULOMBCORE k3s: access, known issues, recovery checklist |

Incidents

| ID | Date | Summary | Status |
| --- | --- | --- | --- |
| INC-001 | 2026-03-25 | Gitea down 13d: PGPool containerd StartError + CPU exhaustion | Resolved |