- Create ops/runbooks/gitea-coulombcore.md: recovery checklist for Gitea on COULOMBCORE, documents containerd StartError pattern and CPU budget issue
- Create ops/incidents/2026-03-25-gitea-pgpool-crashloop.md: INC-001 post-mortem for 13-day Gitea outage (PGPool CrashLoopBackOff + rolling update CPU deadlock)
- Create ops/README.md: index for runbooks and incidents
- state-hub/dashboard/src/docs/connecting.md: add railiance01 tunnel config (was previously unsaved)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Ops Documentation

Operational runbooks and incident reports for the Railiance/Custodian infrastructure.

## Structure

```
ops/
  runbooks/    - how-to guides for recurring operational tasks and known issues
  incidents/   - post-incident reports (append-only, one file per incident)
```

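
As a rough illustration of the layout above, the following Python sketch derives the file path and index row for a new incident report. It assumes conventions inferred from the one existing entry: a zero-padded `INC-NNN` ID and a `YYYY-MM-DD-slug.md` filename under `incidents/`. The function name `incident_entry`, the default status `Open`, and the sample values are hypothetical and not part of this repository.

```python
from datetime import date


def incident_entry(number: int, day: date, slug: str, summary: str,
                   status: str = "Open") -> tuple[str, str]:
    """Return (relative_path, index_row) for a new incident report.

    Assumed conventions (inferred from the existing INC-001 entry):
      - ID is "INC-" followed by a three-digit, zero-padded number
      - file is incidents/<ISO date>-<slug>.md, one file per incident
    """
    inc_id = f"INC-{number:03d}"
    iso_day = day.isoformat()
    path = f"incidents/{iso_day}-{slug}.md"
    # Row layout mirrors the "ID | Date | Summary | Status" index table.
    row = f"| [{inc_id}]({path}) | {iso_day} | {summary} | {status} |"
    return path, row


# Hypothetical example values, for illustration only.
path, row = incident_entry(2, date(2026, 4, 1), "example-outage", "Example summary")
print(path)
print(row)
```

Since incident reports are append-only, a helper along these lines would only ever produce a new file name and a new index row; it would never rewrite an existing report.
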
## Runbooks

| Runbook | Covers |
|---------|--------|
| [gitea-coulombcore.md](runbooks/gitea-coulombcore.md) | Gitea on COULOMBCORE k3s: access, known issues, recovery checklist |

## Incidents

| ID | Date | Summary | Status |
|----|------|---------|--------|
| [INC-001](incidents/2026-03-25-gitea-pgpool-crashloop.md) | 2026-03-25 | Gitea down 13d: PGPool containerd StartError + CPU exhaustion | Resolved |