---
title: "INC-001: Gitea down — PGPool containerd StartError + CPU exhaustion"
date: 2026-03-25
severity: high
status: resolved
affected: gitea (http://92.205.130.254:32166)
environment: COULOMBCORE k3s cluster
duration: ~13 days (2026-03-12 to 2026-03-25)
resolved_by: Bernd Worsch / Claude
---

# INC-001: Gitea down — PGPool CrashLoopBackOff + CPU exhaustion

## Summary

Gitea was completely unavailable for approximately 13 days. The root cause was containerd
state corruption on the COULOMBCORE k3s node, which caused PGPool to fail on every start with
`StartError: cannot start a stopped process`. This cascaded to Gitea, which crashed when
it could not reach its database. A concurrent Helm upgrade rollout (gitea 12.2.0 → 12.5.0)
was additionally blocked because the new Gitea pod could not be scheduled due to CPU exhaustion.

---

## Timeline

| Time | Event |
|------|-------|
| 2026-03-10 | `helm upgrade gitea gitea/gitea` — chart 12.2.0 → 12.5.0 (Gitea 1.24.5 → 1.25.4) |
| 2026-03-12 | New PGPool pod created by Helm upgrade rolling restart. Pod enters CrashLoopBackOff immediately with `StartError: cannot start a stopped process` |
| 2026-03-12 | Gitea app pod crashes due to DB unreachable (PGPool down) |
| ~2026-03-22 | New Gitea pod from upgrade rollout attempts to schedule. Blocked: `Insufficient cpu` (CPU requests at 98%) |
| 2026-03-25 | Incident detected and diagnosed |
| 2026-03-25 09:28 | PGPool CPU request reduced from 250m → 100m via `helm upgrade --reuse-values` |
| 2026-03-25 09:28 | Old crashing Gitea pod deleted to free 100m CPU |
| 2026-03-25 09:29 | New PGPool pod (100m request) schedules and starts successfully (`1/1 Running`) |
| 2026-03-25 09:30 | New Gitea pod (pending 3 days) schedules, inits, starts (`1/1 Running`) |
| 2026-03-25 09:30 | Gitea HTTP endpoint returns 200. Incident resolved. |

---

## Root Causes

### Primary: Containerd state corruption (PGPool)

The PGPool pod was in `CrashLoopBackOff` with exit code 128 and message:
```
failed to start containerd task "b6de5dce...": cannot start a stopped process: unknown
```

This is a known containerd bug where a container task is left in an invalid "stopped" state
in containerd's internal database. Every restart attempt by Kubernetes immediately fails
because containerd refuses to start a task it believes is already stopped. The fix is to
delete the pod — the new pod gets a fresh containerd task ID and starts normally.
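
The check described above can be sketched as a small shell helper. This is an illustration under assumptions, not the procedure used during the incident: the function name `is_stale_task_error` is hypothetical, and the commented `kubectl` lookup assumes the error text is exposed in the container's last terminated state, which should be verified on the cluster.

```bash
# Succeeds if a container status message matches the stale-task pattern
# seen in this incident ("cannot start a stopped process").
is_stale_task_error() {
  case "$1" in
    *"cannot start a stopped process"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical usage against a live cluster (POD is a placeholder;
# the jsonpath location of the message is an assumption):
#   msg=$(kubectl get pod "$POD" -n default \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.message}')
#   is_stale_task_error "$msg" && kubectl delete pod "$POD" -n default
```

Keeping the match separate from the delete makes it possible to review which pods would be affected before removing anything.
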

### Secondary: CPU exhaustion (rolling update blocked)

The Helm 12.2.0 → 12.5.0 upgrade triggered a rolling update for all Gitea components.
The new Gitea app pod (`gitea-f4f657c59-cmtdf`) could not schedule for 3+ days because
CPU requests were at 1975m/2000m (~98%). PGPool's default request of 250m was
unnecessarily high for a lightweight connection pooler on this resource-constrained
single-node cluster.

---

## Impact

- Gitea (code hosting) completely unavailable for ~13 days
- All repos with `remote_url` pointing to `92.205.130.254:32166` were unreachable
- No data loss — PostgreSQL HA pods remained running throughout
- Git push/pull and Gitea API calls failed for all consumers during the outage

---

## Resolution Steps

```bash
# 1. Reduce PGPool CPU request (primary blocker for scheduling)
helm upgrade gitea gitea/gitea --version 12.5.0 -n default \
  --reuse-values \
  --set 'postgresql-ha.pgpool.resources.requests.cpu=100m' \
  --set 'postgresql-ha.pgpool.resources.limits.cpu=200m'

# 2. Delete old crashing Gitea pod to free CPU and trigger fresh ReplicaSet pod
kubectl delete pod gitea-79f98f897f-khs26

# 3. Clean up leftover node-debugger pods from prior investigation attempt
kubectl delete pod \
  node-debugger-254.130.205.92.host.secureserver.net-bn7qq \
  node-debugger-254.130.205.92.host.secureserver.net-vmd79
```

Within ~90 seconds:
- New PGPool pod scheduled and running
- Pending Gitea pod (3 days) scheduled, init containers ran, main container started
- Gitea HTTP 200 confirmed
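
The scheduling arithmetic behind steps 1 and 2 can be sketched as follows. This is a simplified illustration that ignores the order in which pods are replaced during the rollout. The function names are hypothetical, the 1975m/2000m figures come from the root-cause analysis, and the 100m request for the pending Gitea pod is an assumption inferred from the timeline.

```bash
# Whole-percent share of allocatable CPU already requested (millicores in).
cpu_request_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Succeeds if a new pod fits: requested + new_request <= allocatable.
# Arguments: requested allocatable new_request (all in millicores).
pod_fits() {
  [ $(( $1 + $3 )) -le "$2" ]
}

requested=1975     # from this incident
allocatable=2000   # from this incident
gitea_request=100  # assumption: CPU request of the pending Gitea pod

cpu_request_pct "$requested" "$allocatable"   # prints 98

# Before the fix only 25m was free, so a 100m pod could not be placed.
pod_fits "$requested" "$allocatable" "$gitea_request" || echo "pending: Insufficient cpu"

# Lowering the PGPool request from 250m to 100m releases 150m.
requested=$(( requested - 150 ))
pod_fits "$requested" "$allocatable" "$gitea_request" && echo "schedulable"
```
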

---

## Follow-up Actions

- [ ] Add PGPool CPU resource override to `railiance-apps` Helm values file (currently
      stored only in Helm release; values should be in git)
- [ ] Set up alerting for `CrashLoopBackOff` pods older than 30 minutes
- [ ] Review whether 3-node PostgreSQL HA is appropriate for a single-node cluster
      (provides no actual HA benefit, consumes 750m CPU requests)
- [ ] Consider adding a CPU request budget dashboard panel to the Observable dashboard
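
The `CrashLoopBackOff` alert in the list above could start from a filter like the one below. This is a rough sketch rather than an agreed design: the function name `stale_crashloops` is hypothetical, it assumes the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, RESTARTS, AGE), and it uses pod age as a stand-in for time spent crash-looping.

```bash
# Print names of pods in CrashLoopBackOff whose AGE is at least $1 minutes
# (default 30). Reads `kubectl get pods --no-headers` output on stdin.
stale_crashloops() {
  awk -v threshold="${1:-30}" '
    # Convert a kubectl AGE string ("45m", "3h12m", "13d") to whole minutes.
    function age_minutes(age,    total, n, unit) {
      total = 0
      while (match(age, /^[0-9]+[smhdy]/)) {
        n = substr(age, 1, RLENGTH - 1) + 0
        unit = substr(age, RLENGTH, 1)
        if (unit == "m") total += n
        else if (unit == "h") total += n * 60
        else if (unit == "d") total += n * 1440
        else if (unit == "y") total += n * 525600
        # Seconds are ignored: they never reach the threshold on their own.
        age = substr(age, RLENGTH + 1)
      }
      return total
    }
    $3 == "CrashLoopBackOff" && age_minutes($NF) >= threshold + 0 { print $1 }
  '
}

# Hypothetical usage:
#   kubectl get pods -n default --no-headers | stale_crashloops 30
```
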

---

## Lessons Learned

1. **Containerd StartError is not a config issue.** The error message looks like a runtime
   failure but is actually containerd state corruption. The fix is always "delete the pod".
   This is now documented in the runbook: `ops/runbooks/gitea-coulombcore.md`

2. **Track Helm values in git.** The only custom value (`pgpool.adminPassword`) was stored in
   the Helm release but not in `railiance-apps`. The resource fix applied here (PGPool CPU)
   likewise exists only in the release, so a future `helm upgrade` run from a clean checkout
   without `--reuse-values` would silently drop it.
   All non-secret Helm values should live in `railiance-apps/`.

3. **Single-node CPU budget is tight.** At 98% CPU request allocation, any pod churn causes
   scheduling failures. Resource requests need to be right-sized for this environment.
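
Lesson 2 could be put into practice roughly as follows. This is a minimal sketch: the file name `gitea-values.yaml` is a hypothetical example, and the keys simply mirror the `--set` flags used during the resolution.

```bash
# Write the PGPool CPU override into a values file that can be committed.
# The file name is hypothetical; the keys mirror the --set flags from the incident.
cat > gitea-values.yaml <<'EOF'
postgresql-ha:
  pgpool:
    resources:
      requests:
        cpu: 100m
      limits:
        cpu: 200m
EOF

# Apply from the checkout (not run here). --reuse-values is kept so that the
# secret pgpool.adminPassword, which is not stored in git, is preserved:
#   helm upgrade gitea gitea/gitea --version 12.5.0 -n default \
#     --reuse-values -f gitea-values.yaml
```

Once the secret is supplied from a separate source, `--reuse-values` can be dropped so that the checkout fully describes the non-secret configuration.
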