the-custodian/ops/incidents/2026-03-25-gitea-pgpool-crashloop.md
tegwick 41d239c166 ops: establish ops/ directory with Gitea runbook and INC-001 incident report
- Create ops/runbooks/gitea-coulombcore.md — recovery checklist for Gitea
  on COULOMBCORE, documents containerd StartError pattern and CPU budget issue
- Create ops/incidents/2026-03-25-gitea-pgpool-crashloop.md — INC-001 post-mortem
  for 13-day Gitea outage (PGPool CrashLoopBackOff + rolling update CPU deadlock)
- Create ops/README.md — index for runbooks and incidents
- state-hub/dashboard/src/docs/connecting.md: add railiance01 tunnel config
  (was previously unsaved)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:30:44 +01:00

title: INC-001: Gitea down - PGPool containerd StartError + CPU exhaustion
date: 2026-03-25
severity: high
status: resolved
affected: gitea (http://92.205.130.254:32166)
environment: COULOMBCORE k3s cluster
duration: ~13 days (2026-03-12 to 2026-03-25)
resolved_by: Bernd Worsch / Claude

INC-001: Gitea down — PGPool CrashLoopBackOff + CPU exhaustion

Summary

Gitea was completely unavailable for approximately 13 days. The root cause was containerd state corruption on the COULOMBCORE k3s node, which made PGPool fail on every start with StartError: cannot start a stopped process. The failure cascaded to Gitea, which crashed because it could not reach its database. In addition, the rollout from a concurrent Helm upgrade (gitea chart 12.2.0 → 12.5.0) was blocked because the new Gitea pod could not be scheduled due to CPU exhaustion.


Timeline

| Time | Event |
|---|---|
| 2026-03-10 | helm upgrade gitea gitea/gitea: chart 12.2.0 → 12.5.0 (Gitea 1.24.5 → 1.25.4) |
| 2026-03-12 | New PGPool pod created by Helm upgrade rolling restart. Pod enters CrashLoopBackOff immediately with StartError: cannot start a stopped process |
| 2026-03-12 | Gitea app pod crashes due to DB unreachable (PGPool down) |
| ~2026-03-22 | New Gitea pod from upgrade rollout attempts to schedule. Blocked: Insufficient cpu (CPU requests at 98%) |
| 2026-03-25 | Incident detected and diagnosed |
| 2026-03-25 09:28 | PGPool CPU request reduced from 250m → 100m via helm upgrade --reuse-values |
| 2026-03-25 09:28 | Old crashing Gitea pod deleted to free 100m CPU |
| 2026-03-25 09:29 | New PGPool pod (100m request) schedules and starts successfully (1/1 Running) |
| 2026-03-25 09:30 | New Gitea pod (pending 3 days) schedules, inits, starts (1/1 Running) |
| 2026-03-25 09:30 | Gitea HTTP endpoint returns 200. Incident resolved. |

Root Causes

Primary: Containerd state corruption (PGPool)

The PGPool pod was in CrashLoopBackOff with exit code 128 and message:

failed to start containerd task "b6de5dce...": cannot start a stopped process: unknown

This is a known containerd bug where a container task is left in an invalid "stopped" state in containerd's internal database. Every restart attempt by Kubernetes immediately fails because containerd refuses to start a task it believes is already stopped. The fix is to delete the pod — the new pod gets a fresh containerd task ID and starts normally.
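The pattern described above can be searched for across the cluster. The following is a minimal sketch, assuming that the failed start shows up in each pod's lastState.terminated fields as reason StartError with exit code 128 (as it did in this incident) and that only the first container of each pod needs checking. The function names are hypothetical, and the script only prints the delete commands so they can be reviewed before anything is removed.

```shell
# Pure helper: does this (reason, exit code) pair match the pattern from this incident?
is_start_error() {
  [ "$1" = "StartError" ] && [ "$2" = "128" ]
}

# Reads lines of "namespace pod reason exitcode" on stdin and prints
# a delete command for every pod that matches. Nothing is deleted here.
suggest_deletes() {
  local ns pod reason code
  while read -r ns pod reason code; do
    if is_start_error "$reason" "$code"; then
      echo "kubectl delete pod -n $ns $pod"
    fi
  done
}

# Produces the input lines from the live cluster (requires kubectl access).
# Assumption: only the first container status of each pod is inspected.
list_last_terminations() {
  kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}{end}'
}
```

On a machine with cluster access, `list_last_terminations | suggest_deletes` prints one candidate command per affected pod. Pods that have never terminated produce empty reason and exit-code fields and are skipped.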

Secondary: CPU exhaustion (rolling update blocked)

The Helm 12.2.0 → 12.5.0 upgrade triggered a rolling update for all Gitea components. The new Gitea app pod (gitea-f4f657c59-cmtdf) could not schedule for 3+ days because CPU requests were at 1975m/2000m (~98%). PGPool's default request of 250m was unnecessarily high for a lightweight connection pooler on this resource-constrained single-node cluster.
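To make the budget concrete, here is a small arithmetic sketch using only the figures quoted in this report (1975m requested, 2000m allocatable, 250m and 100m PGPool requests, 100m freed by deleting the old Gitea pod). The helper names are hypothetical, and the exact order in which pods released and claimed CPU is an assumption. The live numbers can be read from the "Allocated resources" section of `kubectl describe node`.

```shell
# All values are CPU requests in millicores (1000m = 1 core).

# Free request budget: allocatable minus already requested.
cpu_headroom() { echo $(( $2 - $1 )); }

# Requested share of allocatable CPU as an integer percent, rounded down.
cpu_pct() { echo $(( $1 * 100 / $2 )); }

# True if a new pod with request $1 fits, given requested $2 and allocatable $3.
fits() { [ "$1" -le "$(cpu_headroom "$2" "$3")" ]; }

cpu_headroom 1975 2000   # prints 25
cpu_pct 1975 2000        # prints 98

# With 25m of headroom, not even a 100m request can be scheduled.
fits 100 1975 2000 || echo "100m does not fit at 1975m requested"

# After freeing 100m (old Gitea pod), a 100m request fits but 250m still would not.
fits 100 1875 2000 && echo "100m fits at 1875m requested"
fits 250 1875 2000 || echo "250m does not fit at 1875m requested"
```

The scheduler compares requests, not actual usage, so a node can be mostly idle and still reject new pods.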


Impact

  • Gitea (code hosting) completely unavailable for ~13 days
  • All repos with remote_url pointing to 92.205.130.254:32166 were unreachable
  • No data loss — PostgreSQL HA pods remained running throughout
  • Git push/pull and Gitea API calls failed for all consumers during the outage

Resolution Steps

# 1. Reduce PGPool CPU request (primary blocker for scheduling)
helm upgrade gitea gitea/gitea --version 12.5.0 -n default \
  --reuse-values \
  --set 'postgresql-ha.pgpool.resources.requests.cpu=100m' \
  --set 'postgresql-ha.pgpool.resources.limits.cpu=200m'

# 2. Delete old crashing Gitea pod to free CPU and trigger fresh ReplicaSet pod
kubectl delete pod gitea-79f98f897f-khs26

# 3. Cleanup leftover node-debugger pods from prior investigation attempt
kubectl delete pod \
  node-debugger-254.130.205.92.host.secureserver.net-bn7qq \
  node-debugger-254.130.205.92.host.secureserver.net-vmd79

Within ~90 seconds:

  • New PGPool pod scheduled and running
  • Pending Gitea pod (3 days) scheduled, init containers ran, main container started
  • Gitea HTTP 200 confirmed
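The final HTTP check can be scripted so that recovery is confirmed rather than eyeballed. This is a minimal sketch, assuming curl is available; the function names and the defaults (18 attempts, 5 seconds apart, roughly the 90-second window observed here) are illustrative choices, not part of the original procedure.

```shell
# Prints the HTTP status code for a URL (curl prints 000 if the connection fails).
http_status() {
  curl -s -o /dev/null -m 5 -w '%{http_code}' "$1" || true
}

# Polls the URL until the probe prints 200 or the attempts run out.
# Arguments: url [attempts] [delay_seconds] [probe_function]
wait_for_200() {
  local url="$1" attempts="${2:-18}" delay="${3:-5}" probe="${4:-http_status}"
  local i=1 code=""
  while [ "$i" -le "$attempts" ]; do
    code="$("$probe" "$url")"
    if [ "$code" = "200" ]; then
      echo "HTTP 200 after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "no HTTP 200 after $attempts attempts (last status: $code)"
  return 1
}
```

For this incident the call would be `wait_for_200 http://92.205.130.254:32166`. The probe is passed as a parameter so the loop can be exercised without network access.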

Follow-up Actions

  • Add PGPool CPU resource override to railiance-apps Helm values file (currently stored only in Helm release; values should be in git)
  • Set up alerting for CrashLoopBackOff pods older than 30 minutes
  • Review whether 3-node PostgreSQL HA is appropriate for a single-node cluster (provides no actual HA benefit, consumes 750m CPU requests)
  • Consider adding a CPU request budget dashboard panel to the Observable dashboard
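As a starting point for the CrashLoopBackOff alert, the check itself is simple enough to sketch in shell. The version below assumes its input has already been reduced to lines of "namespace pod waiting_reason start_epoch" (for example from kubectl JSON output with timestamps converted to epoch seconds), and it approximates "older than 30 minutes" by pod age rather than by time spent in the crash loop. All names are hypothetical.

```shell
# True when more than $3 seconds lie between start time $1 and current time $2 (epoch seconds).
older_than() { [ $(( $2 - $1 )) -gt "$3" ]; }

# stdin: lines of "namespace pod waiting_reason start_epoch"
# $1: current time in epoch seconds, $2: threshold in seconds (default 1800 = 30 min)
crashloop_alerts() {
  local now="$1" threshold="${2:-1800}" ns pod reason started
  while read -r ns pod reason started; do
    [ "$reason" = "CrashLoopBackOff" ] || continue
    [ -n "$started" ] || continue
    if older_than "$started" "$now" "$threshold"; then
      echo "ALERT $ns/$pod CrashLoopBackOff, pod age $(( (now - started) / 60 )) min"
    fi
  done
}
```

A cron job could run this with `"$(date +%s)"` as the first argument and forward any output to a notification channel. Had such a check existed, the PGPool failure on 2026-03-12 would have surfaced within the hour instead of after 13 days.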

Lessons Learned

  1. Containerd StartError is not a config issue. The error message looks like a configuration failure but is actually containerd state corruption. The fix is always "delete the pod". This is now documented in the runbook: ops/runbooks/gitea-coulombcore.md

  2. Track Helm values in git. The only custom value (pgpool.adminPassword) was in the Helm release but not in railiance-apps. The resource fix applied here (PGPool CPU) also lives only in the release, so a future helm upgrade driven by a values file from a clean checkout, without --reuse-values, would silently revert it. All non-secret Helm values should live in railiance-apps/.

  3. Single-node CPU budget is tight. At 98% CPU request allocation, any pod churn causes scheduling failures. Resource requests need to be right-sized for this environment.
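For lesson 2, one possible way to get the release values into git is to export them with `helm get values` and strip secret-looking keys before committing. This is a sketch under assumptions: the function names are hypothetical, the filter is a crude line-based match on key names (it would not handle multi-line secret values), and the target path inside railiance-apps is not specified in this report.

```shell
# Crude filter: drops any YAML line whose key contains password, secret, or token.
# Assumption: secrets are single-line scalar values.
strip_secret_lines() {
  grep -Eiv '^[[:space:]]*[A-Za-z0-9_.-]*(password|secret|token)[A-Za-z0-9_.-]*:' || true
}

# Prints the user-supplied values of a release without secret-looking lines.
# Arguments: release namespace. Requires helm and cluster access.
export_release_values() {
  helm get values "$1" -n "$2" -o yaml | strip_secret_lines
}
```

For this release the call would be `export_release_values gitea default`, redirected into a values file in railiance-apps. Because pgpool.adminPassword is filtered out, it still has to be supplied separately on any upgrade that uses the file.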