ops: establish ops/ directory with Gitea runbook and INC-001 incident report

- Create ops/runbooks/gitea-coulombcore.md — recovery checklist for Gitea on COULOMBCORE, documents containerd StartError pattern and CPU budget issue
- Create ops/incidents/2026-03-25-gitea-pgpool-crashloop.md — INC-001 post-mortem for 13-day Gitea outage (PGPool CrashLoopBackOff + rolling update CPU deadlock)
- Create ops/README.md — index for runbooks and incidents
- state-hub/dashboard/src/docs/connecting.md: add railiance01 tunnel config (was previously unsaved)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
23
ops/README.md
Normal file
@@ -0,0 +1,23 @@
# Ops Documentation

Operational runbooks and incident reports for the Railiance/Custodian infrastructure.

## Structure

```
ops/
  runbooks/   — how-to guides for recurring operational tasks and known issues
  incidents/  — post-incident reports (append-only, one file per incident)
```

## Runbooks

| Runbook | Covers |
|---------|--------|
| [gitea-coulombcore.md](runbooks/gitea-coulombcore.md) | Gitea on COULOMBCORE k3s — access, known issues, recovery checklist |

## Incidents

| ID | Date | Summary | Status |
|----|------|---------|--------|
| [INC-001](incidents/2026-03-25-gitea-pgpool-crashloop.md) | 2026-03-25 | Gitea down 13d — PGPool containerd StartError + CPU exhaustion | Resolved |
122
ops/incidents/2026-03-25-gitea-pgpool-crashloop.md
Normal file
@@ -0,0 +1,122 @@
---
title: "INC-001: Gitea down — PGPool containerd StartError + CPU exhaustion"
date: 2026-03-25
severity: high
status: resolved
affected: gitea (http://92.205.130.254:32166)
environment: COULOMBCORE k3s cluster
duration: ~13 days (2026-03-12 to 2026-03-25)
resolved_by: Bernd Worsch / Claude
---

# INC-001: Gitea down — PGPool CrashLoopBackOff + CPU exhaustion

## Summary

Gitea was completely unavailable for approximately 13 days. Root cause was a containerd
state corruption on the COULOMBCORE k3s node causing PGPool to fail on every start with
`StartError: cannot start a stopped process`. This cascaded to Gitea, which crashed when
unable to reach its database. A concurrent Helm upgrade rollout (gitea 12.2.0 → 12.5.0)
was additionally blocked because a new Gitea pod could not schedule due to CPU exhaustion.

---

## Timeline

| Time | Event |
|------|-------|
| 2026-03-10 | `helm upgrade gitea gitea/gitea` — chart 12.2.0 → 12.5.0 (Gitea 1.24.5 → 1.25.4) |
| 2026-03-12 | New PGPool pod created by Helm upgrade rolling restart. Pod enters CrashLoopBackOff immediately with `StartError: cannot start a stopped process` |
| 2026-03-12 | Gitea app pod crashes due to DB unreachable (PGPool down) |
| ~2026-03-22 | New Gitea pod from upgrade rollout attempts to schedule. Blocked: `Insufficient cpu` (CPU requests at 98%) |
| 2026-03-25 | Incident detected and diagnosed |
| 2026-03-25 09:28 | PGPool CPU request reduced from 250m → 100m via `helm upgrade --reuse-values` |
| 2026-03-25 09:28 | Old crashing Gitea pod deleted to free 100m CPU |
| 2026-03-25 09:29 | New PGPool pod (100m request) schedules and starts successfully (`1/1 Running`) |
| 2026-03-25 09:30 | New Gitea pod (pending 3 days) schedules, inits, starts (`1/1 Running`) |
| 2026-03-25 09:30 | Gitea HTTP endpoint returns 200. Incident resolved. |

---

## Root Causes

### Primary: Containerd state corruption (PGPool)

The PGPool pod was in `CrashLoopBackOff` with exit code 128 and message:
```
failed to start containerd task "b6de5dce...": cannot start a stopped process: unknown
```

This is a known containerd bug where a container task is left in an invalid "stopped" state
in containerd's internal database. Every restart attempt by Kubernetes immediately fails
because containerd refuses to start a task it believes is already stopped. The fix is to
delete the pod — the new pod gets a fresh containerd task ID and starts normally.

### Secondary: CPU exhaustion (rolling update blocked)

The Helm 12.2.0 → 12.5.0 upgrade triggered a rolling update for all Gitea components.
The new Gitea app pod (`gitea-f4f657c59-cmtdf`) could not schedule for 3+ days because
CPU requests were at 1975m/2000m (~98%). PGPool's default request of 250m was
unnecessarily high for a lightweight connection pooler on this resource-constrained
single-node cluster.
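The scheduling failure follows from the request arithmetic alone. A minimal sketch in bash, using the 1975m/2000m figures above; the 100m request for the new Gitea pod is the runbook's estimate, and `fits_on_node` is a hypothetical helper, not a kubectl feature:

```bash
#!/usr/bin/env bash
# fits_on_node ALLOCATABLE_M ALLOCATED_M REQUEST_M
# Succeeds when a pod's CPU request (in millicores) fits into the part of the
# node that is not yet claimed by other requests. Only requests are compared;
# actual CPU usage plays no role in this check.
fits_on_node() {
  local allocatable=$1 allocated=$2 request=$3
  (( allocatable - allocated >= request ))
}

free_m=$(( 2000 - 1975 ))   # unallocated millicores during the incident
if fits_on_node 2000 1975 100; then
  echo "100m pod fits (${free_m}m free)"
else
  echo "100m pod does not fit (${free_m}m free): Insufficient cpu"
fi
```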

---

## Impact

- Gitea (code hosting) completely unavailable for ~13 days
- All repos with `remote_url` pointing to `92.205.130.254:32166` were unreachable
- No data loss — PostgreSQL HA pods remained running throughout
- Git push/pull and Gitea API calls failed for all consumers during the outage

---

## Resolution Steps

```bash
# 1. Reduce PGPool CPU request (primary blocker for scheduling)
helm upgrade gitea gitea/gitea --version 12.5.0 -n default \
  --reuse-values \
  --set 'postgresql-ha.pgpool.resources.requests.cpu=100m' \
  --set 'postgresql-ha.pgpool.resources.limits.cpu=200m'

# 2. Delete old crashing Gitea pod to free CPU and trigger fresh ReplicaSet pod
kubectl delete pod gitea-79f98f897f-khs26

# 3. Cleanup leftover node-debugger pods from prior investigation attempt
kubectl delete pod \
  node-debugger-254.130.205.92.host.secureserver.net-bn7qq \
  node-debugger-254.130.205.92.host.secureserver.net-vmd79
```

Within ~90 seconds:
- New PGPool pod scheduled and running
- Pending Gitea pod (3 days) scheduled, init containers ran, main container started
- Gitea HTTP 200 confirmed

---

## Follow-up Actions

- [ ] Add PGPool CPU resource override to `railiance-apps` Helm values file (currently
  stored only in Helm release; values should be in git)
- [ ] Set up alerting for `CrashLoopBackOff` pods older than 30 minutes
- [ ] Review whether 3-node PostgreSQL HA is appropriate for a single-node cluster
  (provides no actual HA benefit, consumes 750m CPU requests)
- [ ] Consider adding a CPU request budget dashboard panel to the Observable dashboard
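The `CrashLoopBackOff` alerting item could start as a small scheduled check. This is only a sketch under assumptions: bash and awk are available, `crashlooping_pods` and `list_pod_reasons` are hypothetical helper names, and the 30-minute age threshold and the notification channel are left out.

```bash
#!/usr/bin/env bash
# crashlooping_pods reads "NAME REASON" lines on stdin and prints the names of
# pods whose first container is waiting with reason CrashLoopBackOff.
crashlooping_pods() {
  awk '$2 == "CrashLoopBackOff" { print $1 }'
}

# list_pod_reasons produces that input from the current namespace of a live
# cluster. Pods without a waiting container show "<none>" in the REASON column.
list_pod_reasons() {
  kubectl get pods --no-headers \
    -o custom-columns='NAME:.metadata.name,REASON:.status.containerStatuses[0].state.waiting.reason'
}

# Usage (needs cluster access):
# list_pod_reasons | crashlooping_pods
```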

---

## Lessons Learned

1. **Containerd StartError is not a config issue.** The error message looks like a runtime
   failure but is actually containerd state corruption. The fix is always "delete the pod".
   This is now documented in the runbook: `ops/runbooks/gitea-coulombcore.md`

2. **Track Helm values in git.** The only custom value (`pgpool.adminPassword`) was in the
   Helm release but not in `railiance-apps`. The resource fix applied here (pgpool CPU) also
   exists only in the release, so a future `helm upgrade` from a clean checkout that passes
   a values file without `--reuse-values` would silently revert it.
   All non-secret Helm values should live in `railiance-apps/`.

3. **Single-node CPU budget is tight.** At 98% CPU request allocation, any pod churn causes
   scheduling failures. Resource requests need to be right-sized for this environment.
159
ops/runbooks/gitea-coulombcore.md
Normal file
@@ -0,0 +1,159 @@
---
title: Runbook — Gitea on COULOMBCORE
tags: [gitea, coulombcore, k3s, postgresql-ha]
created: 2026-03-25
updated: 2026-03-25
---

# Runbook: Gitea on COULOMBCORE

Gitea runs on the single-node k3s cluster at COULOMBCORE (`92.205.130.254`, user `tegwick`).
It uses Bitnami `postgresql-ha` (PGPool + 3-node PostgreSQL managed by repmgr) and a Valkey
cluster for caching.

---

## Access

```bash
# SSH (requires ~/.ssh/id_ops)
ssh -i ~/.ssh/id_ops tegwick@92.205.130.254

# Web UI
http://92.205.130.254:32166   # NodePort 32166 → gitea svc → pod :3000

# Check all Gitea pods
kubectl get pods -l 'app.kubernetes.io/instance=gitea'
```

---

## Helm Release

| Field | Value |
|-------|-------|
| Release name | `gitea` |
| Namespace | `default` |
| Chart | `gitea/gitea` |
| Current version | 12.5.0 (Gitea 1.25.4) |

```bash
helm list -n default
helm history gitea -n default
helm get values gitea -n default
```

---

## Known Issues

### 1. PGPool CrashLoopBackOff — containerd `StartError: cannot start a stopped process`

**Symptom:** `gitea-postgresql-ha-pgpool-*` pod is in `CrashLoopBackOff`. Describe shows:
```
Last State:  Terminated
  Reason:    StartError
  Message:   failed to start containerd task "...": cannot start a stopped process: unknown
  Exit Code: 128
```

**Root cause:** Containerd state corruption on the k3s node — the container task is recorded
as "stopped" in containerd's internal state but the process never actually ran. This causes
every restart attempt to fail immediately with exit code 128. Not a config or auth issue.

**Fix:** Delete the pod. The ReplicaSet controller recreates it with a fresh containerd task.

```bash
# Select by label directly: kubectl rejects `delete pod pod/<name>`, which is
# what combining `delete pod` with `-o name` output would produce.
kubectl delete pod -l 'app.kubernetes.io/component=pgpool'
```

Wait 30s then confirm it comes up `1/1 Running`.
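Instead of a fixed wait, the readiness check can be polled. A minimal sketch, assuming bash; `retry_until` and `pgpool_ready` are hypothetical helper names, and the attempt count and delay are illustrative:

```bash
#!/usr/bin/env bash
# retry_until ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Runs CMD until it exits 0 or ATTEMPTS runs have failed.
retry_until() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}

# pgpool_ready succeeds when at least one pgpool container is listed and none
# of them reports ready=false. Readiness is checked rather than pod phase,
# because a crash-looping pod can still report phase Running.
pgpool_ready() {
  local ready
  ready=$(kubectl get pod -l 'app.kubernetes.io/component=pgpool' \
    -o jsonpath='{.items[*].status.containerStatuses[*].ready}') || return 1
  [[ -n $ready && $ready != *false* ]]
}

# Usage (needs cluster access): poll every 5s, give up after 12 attempts.
# retry_until 12 5 pgpool_ready
```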

**Cascade effect:** PGPool down → `gitea-postgresql-ha-pgpool` ClusterIP (`10.43.242.51:5432`)
unreachable → Gitea app pod exhausts 10 DB connection attempts → exits → `CrashLoopBackOff`.
Fixing PGPool automatically unblocks Gitea.

---

### 2. Gitea pods Pending — Insufficient CPU

**Symptom:** New pod stuck in `Pending` with scheduler event:
```
0/1 nodes are available: 1 Insufficient cpu.
```

**Root cause:** The single-node cluster has ~2 vCPUs. CPU requests routinely approach 98%
allocation. PGPool defaults to 250m CPU request; combined with 3x PostgreSQL at 250m each,
Valkey, SSO stack, and monitoring, the budget is nearly exhausted.

**Check:**
```bash
kubectl describe node | grep -A6 "Allocated resources"
```

**Fix:** Reduce PGPool CPU request via Helm upgrade, then delete any stale crashing pods:

```bash
# Reduce pgpool from 250m to 100m (safe — pgpool is a lightweight connection pooler)
helm upgrade gitea gitea/gitea --version <current> -n default \
  --reuse-values \
  --set 'postgresql-ha.pgpool.resources.requests.cpu=100m' \
  --set 'postgresql-ha.pgpool.resources.limits.cpu=200m'

# Delete the stuck old Gitea pod if it's crashlooping
kubectl delete pod <old-gitea-pod-name>
```

This frees ~250m (old pgpool, if crashing) + 100m (old gitea) = 350m, which is enough to
schedule the new PGPool (100m) + new Gitea (100m via init containers).

**After-fix:** The rolling update from the blocked deployment should self-complete once
both pods can schedule and Gitea can reach PGPool.

---

## Recovery Checklist

When Gitea is down, work through this in order:

1. **Check PGPool** — most common root cause
   ```bash
   kubectl get pod -l 'app.kubernetes.io/component=pgpool'
   ```
   - `CrashLoopBackOff` → delete the pod (see issue #1 above)
   - `Pending` → check CPU budget (see issue #2)

2. **Check PostgreSQL** — all 3 pods should be Running; if not, this is a deeper issue
   ```bash
   kubectl get pod -l 'app.kubernetes.io/component=postgresql'
   ```

3. **Check Gitea app pod**
   ```bash
   kubectl get pod -l 'app.kubernetes.io/component=gitea'
   kubectl logs <gitea-pod> --tail=20
   ```
   - DB connect errors → PGPool issue (go to step 1)
   - Init container crash → check `kubectl logs <pod> -c configure-gitea`

4. **Verify end-to-end**
   ```bash
   curl -s -o /dev/null -w "%{http_code}" http://92.205.130.254:32166/
   # expect: 200
   ```

---

## Node Resource Budget (approximate)

| Component | CPU Request |
|-----------|------------|
| postgresql-ha-postgresql × 3 | 750m |
| pgpool | 100m (after 2026-03-25 fix, was 250m) |
| valkey-cluster × 3 | 300m |
| gitea app | ~100m (init containers) |
| SSO stack (authelia, lldap, privacyidea, keycape) | ~225m |
| System (coredns, metrics-server, traefik) | ~200m |
| **Total** | **~1675m** |

Node capacity: ~2000m. Headroom is tight (~325m). Avoid adding workloads without
reviewing resource requests first.
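The total and headroom in the table can be re-derived whenever a request changes. A minimal sketch in bash; the figures are copied from the table, 2000m is the approximate node capacity stated above, and `sum_millicores` is a hypothetical helper:

```bash
#!/usr/bin/env bash
# sum_millicores adds up CPU requests given as millicore integers.
sum_millicores() {
  local total=0 m
  for m in "$@"; do
    total=$(( total + m ))
  done
  echo "$total"
}

# Order: postgresql x3, pgpool, valkey x3, gitea app, SSO stack, system
total=$(sum_millicores 750 100 300 100 225 200)
headroom=$(( 2000 - total ))
echo "requested=${total}m headroom=${headroom}m"   # requested=1675m headroom=325m
```

Restoring pgpool to its 250m default would raise the total to 1825m and cut the headroom to 175m.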
@@ -135,6 +135,38 @@ tunnels:
      max_attempts: 0
      backoff_initial: 5
      backoff_max: 60

  state-hub-railiance01:          # API tunnel
    host: 92.205.62.239
    remote_port: 18000
    local_port: 8000
    ssh_user: tegwick
    ssh_key: ~/.ssh/id_ops
    actor: agent.claude-railiance01
    health_check:
      url: http://127.0.0.1:8000/state/health
      interval_seconds: 30
      timeout_seconds: 5
    reconnect:
      max_attempts: 0
      backoff_initial: 5
      backoff_max: 60

  state-hub-mcp-railiance01:      # MCP SSE tunnel
    host: 92.205.62.239
    remote_port: 18001
    local_port: 8001
    ssh_user: tegwick
    ssh_key: ~/.ssh/id_ops
    actor: agent.claude-railiance01
    health_check:
      url: http://127.0.0.1:18001/sse
      interval_seconds: 30
      timeout_seconds: 5
    reconnect:
      max_attempts: 0
      backoff_initial: 5
      backoff_max: 60
```

ops-bridge source: `~/ops-bridge` · SSH key: `~/.ssh/id_ops`