ops: establish ops/ directory with Gitea runbook and INC-001 incident report

- Create ops/runbooks/gitea-coulombcore.md: recovery checklist for Gitea on COULOMBCORE; documents the containerd StartError pattern and the CPU budget issue
- Create ops/incidents/2026-03-25-gitea-pgpool-crashloop.md: INC-001 post-mortem for the 13-day Gitea outage (PGPool CrashLoopBackOff + rolling update CPU deadlock)
- Create ops/README.md: index for runbooks and incidents
- state-hub/dashboard/src/docs/connecting.md: add railiance01 tunnel config (was previously unsaved)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ops/runbooks/gitea-coulombcore.md
Normal file
@@ -0,0 +1,159 @@
---
title: "Runbook: Gitea on COULOMBCORE"
tags: [gitea, coulombcore, k3s, postgresql-ha]
created: 2026-03-25
updated: 2026-03-25
---

# Runbook: Gitea on COULOMBCORE

Gitea runs on the single-node k3s cluster at COULOMBCORE (`92.205.130.254`, user `tegwick`).
It uses the Bitnami `postgresql-ha` chart (Pgpool-II in front of three PostgreSQL nodes managed by repmgr) and a Valkey cluster for caching.

---

## Access

```bash
# SSH (requires ~/.ssh/id_ops)
ssh -i ~/.ssh/id_ops tegwick@92.205.130.254

# Web UI: http://92.205.130.254:32166
# (NodePort 32166 → gitea svc → pod :3000)

# Check all Gitea pods
kubectl get pods -l 'app.kubernetes.io/instance=gitea'
```

---

## Helm Release

| Field | Value |
|-------|-------|
| Release name | `gitea` |
| Namespace | `default` |
| Chart | `gitea/gitea` |
| Current version | 12.5.0 (Gitea 1.25.4) |

```bash
helm list -n default
helm history gitea -n default
helm get values gitea -n default
```

---

## Known Issues

### 1. PGPool CrashLoopBackOff: containerd `StartError: cannot start a stopped process`

**Symptom:** The `gitea-postgresql-ha-pgpool-*` pod is in `CrashLoopBackOff`. `kubectl describe pod` shows:
```
Last State: Terminated
Reason: StartError
Message: failed to start containerd task "...": cannot start a stopped process: unknown
Exit Code: 128
```

**Root cause:** Containerd state corruption on the k3s node: the container task is recorded
as "stopped" in containerd's internal state even though the process never actually ran, so
every restart attempt fails immediately with exit code 128. This is not a config or auth issue.

**Fix:** Delete the pod. The ReplicaSet controller recreates it with a fresh containerd task.

```bash
# Select by label; passing `pod/<name>` output from `-o name` after `delete pod` is rejected by kubectl
kubectl delete pod -l 'app.kubernetes.io/component=pgpool'
```

Wait 30s, then confirm the new pod comes up `1/1 Running`.
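As an alternative to a fixed 30s sleep, the confirmation step could block on pod readiness. The following is a minimal sketch, not part of the original procedure: `wait_for_pgpool` is a hypothetical helper name and the 120s default timeout is an assumption. It relies only on `kubectl wait` with the same label selector used above.

```bash
# Hypothetical helper (not from the original runbook): wait until the
# recreated pgpool pod reports the Ready condition, or give up after a timeout.
wait_for_pgpool() {
  local timeout="${1:-120s}"   # assumed default; adjust for the node
  kubectl wait pod \
    -l 'app.kubernetes.io/component=pgpool' \
    --for=condition=Ready \
    --timeout="$timeout"
}
```

Calling `wait_for_pgpool 60s` right after the delete returns non-zero if the pod is not Ready in time, which makes it usable in a script. If no pod matches the selector yet, `kubectl wait` fails immediately, so a short retry may still be needed.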

**Cascade effect:** PGPool down → `gitea-postgresql-ha-pgpool` ClusterIP (`10.43.242.51:5432`)
unreachable → Gitea app pod exhausts 10 DB connection attempts → exits → `CrashLoopBackOff`.
Fixing PGPool automatically unblocks Gitea.

---

### 2. Gitea pods Pending: Insufficient CPU

**Symptom:** New pod stuck in `Pending` with scheduler event:
```
0/1 nodes are available: 1 Insufficient cpu.
```

**Root cause:** The single-node cluster has ~2 vCPUs, and CPU requests routinely approach 98%
of allocatable capacity. PGPool defaults to a 250m CPU request; combined with 3× PostgreSQL at
250m each, Valkey, the SSO stack, and monitoring, the budget is nearly exhausted.

**Check:**
```bash
kubectl describe node | grep -A6 "Allocated resources"
```
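The node summary shows the total but not which pods hold the requests. A per-pod breakdown can be sketched with `kubectl get` and `custom-columns`; `list_cpu_requests` is a hypothetical helper name, and the column only covers regular containers, not init containers.

```bash
# Hypothetical helper (not from the original runbook): list CPU requests
# per pod across all namespaces. Pods with several containers show a
# comma-separated list; containers without a request show <none>.
list_cpu_requests() {
  kubectl get pods --all-namespaces \
    -o custom-columns='NAMESPACE:.metadata.namespace,POD:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu'
}
```

Running `list_cpu_requests` next to the node summary makes it easier to spot which workload to trim before touching the Helm values.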

**Fix:** Reduce the PGPool CPU request via a Helm upgrade, then delete any stale crashing pods:

```bash
# Reduce pgpool from 250m to 100m (safe: pgpool is a lightweight connection pooler)
helm upgrade gitea gitea/gitea --version <current> -n default \
  --reuse-values \
  --set 'postgresql-ha.pgpool.resources.requests.cpu=100m' \
  --set 'postgresql-ha.pgpool.resources.limits.cpu=200m'

# Delete the stuck old Gitea pod if it's crashlooping
kubectl delete pod <old-gitea-pod-name>
```

This frees ~250m (old pgpool, if crashing) + 100m (old gitea) = 350m, which is enough to
schedule the new PGPool (100m) + new Gitea (100m via init containers).
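The arithmetic above can be checked with a few lines of shell. This is only a worked restatement of the figures in this section (all values in millicores); the variable names are illustrative.

```bash
# Figures taken from this runbook, in millicores.
freed=$(( 250 + 100 ))    # old pgpool (if crashing) + old gitea
needed=$(( 100 + 100 ))   # new pgpool + new gitea (init containers)
surplus=$(( freed - needed ))

echo "freed=${freed}m needed=${needed}m surplus=${surplus}m"
# prints: freed=350m needed=200m surplus=150m
```

A positive surplus means both replacement pods can schedule once the old ones are gone.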

**After-fix:** Once both pods can schedule and Gitea can reach PGPool, the rolling update
that was blocked on the deployment should complete on its own.

---

## Recovery Checklist

When Gitea is down, work through these steps in order:

1. **Check PGPool** (most common root cause)
   ```bash
   kubectl get pod -l 'app.kubernetes.io/component=pgpool'
   ```
   - `CrashLoopBackOff` → delete the pod (see Known Issue 1 above)
   - `Pending` → check the CPU budget (see Known Issue 2)

2. **Check PostgreSQL**: all three pods should be `Running`; if not, this is a deeper issue
   ```bash
   kubectl get pod -l 'app.kubernetes.io/component=postgresql'
   ```

3. **Check Gitea app pod**
   ```bash
   kubectl get pod -l 'app.kubernetes.io/component=gitea'
   kubectl logs <gitea-pod> --tail=20
   ```
   - DB connect errors → PGPool issue (go to step 1)
   - Init container crash → check `kubectl logs <pod> -c configure-gitea`

4. **Verify end-to-end**
   ```bash
   curl -s -o /dev/null -w "%{http_code}" http://92.205.130.254:32166/
   # expect: 200
   ```
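Gitea may need a few restarts' worth of time before it answers after PGPool recovers, so the end-to-end check in step 4 can be wrapped in a retry loop. A minimal sketch follows; `wait_for_http_200` is a hypothetical helper, and the default of 10 tries with a 5s delay is an assumption, not a measured value.

```bash
# Hypothetical helper (not from the original runbook): poll a URL until it
# returns HTTP 200, using the same curl invocation as step 4.
wait_for_http_200() {
  local url="$1" tries="${2:-10}" delay="${3:-5}"
  local i=1 code=""
  while [ "$i" -le "$tries" ]; do
    # curl reports 000 when no HTTP response was received at all
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url") || code="000"
    if [ "$code" = "200" ]; then
      echo "ok after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$(( i + 1 ))
  done
  echo "still failing after $tries attempts (last code: $code)" >&2
  return 1
}
```

For example, `wait_for_http_200 http://92.205.130.254:32166/ 12 10` polls for up to about two minutes and exits non-zero if Gitea never answers with 200.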

---

## Node Resource Budget (approximate)

| Component | CPU Request |
|-----------|-------------|
| postgresql-ha-postgresql × 3 | 750m |
| pgpool | 100m (after 2026-03-25 fix, was 250m) |
| valkey-cluster × 3 | 300m |
| gitea app | ~100m (init containers) |
| SSO stack (authelia, lldap, privacyidea, keycape) | ~225m |
| System (coredns, metrics-server, traefik) | ~200m |
| **Total** | **~1675m** |

Node capacity: ~2000m. Headroom is tight (~325m). Avoid adding workloads without
reviewing resource requests first.
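
The total and headroom in the table can be reproduced with a short shell sketch, which also gives a quick way to test whether a candidate workload would fit. The numbers are the approximate requests listed above; `fits` is a hypothetical helper name.

```bash
# Approximate CPU requests from the table above, in millicores:
# postgresql x3, pgpool, valkey x3, gitea, SSO stack, system
capacity_m=2000
total_m=0
for r in 750 100 300 100 225 200; do
  total_m=$(( total_m + r ))
done
headroom_m=$(( capacity_m - total_m ))

echo "total=${total_m}m headroom=${headroom_m}m"
# prints: total=1675m headroom=325m

# Hypothetical check: would a new workload requesting $1 millicores fit?
fits() {
  [ "$1" -le "$headroom_m" ]
}
```

With these figures, `fits 250` succeeds and `fits 400` fails, which matches the roughly 325m of headroom stated above.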