gitea-values.sops.yaml relocated to railiance-apps/helm/ per
ADR-003 boundary rules — Gitea is S5, values belong in S5 repo.
Tombstone left in helm/MOVED.md. SCOPE.md updated to reflect
resolved violation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
k3s runs in SQLite mode (no --cluster-init). Replace etcd-snapshot
with sqlite3 .backup for a WAL-aware hot copy of state.db.
Update restore guide to match. Cron installed under root crontab.
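
A minimal sketch of the sqlite3 .backup step described above. It assumes
the default k3s datastore location /var/lib/rancher/k3s/server/db/state.db;
the function name and the destination file name are hypothetical and are
not taken from the actual script.

```shell
# Sketch only: backup_state_db and the example paths are illustrative assumptions.
# Copies a live SQLite database with the sqlite3 shell's .backup command,
# which uses SQLite's online backup API and is safe while WAL is active.
backup_state_db() {
    src=$1    # e.g. /var/lib/rancher/k3s/server/db/state.db (assumed k3s default)
    dest=$2   # e.g. /opt/backup/railiance/cluster/state.db (hypothetical name)
    # Note: this simple quoting breaks if $dest itself contains a single quote.
    sqlite3 "$src" ".backup '$dest'" || return 1
    # Cheap sanity check: integrity_check prints "ok" for a healthy database.
    [ "$(sqlite3 "$dest" 'PRAGMA integrity_check;')" = "ok" ]
}
```

Checking the copy with PRAGMA integrity_check before encrypting it is an
addition for illustration; the real script may not do this.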
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Captures k3s, helm, kubectl, goss, sops, and age as direct tool
dependencies for railiance-cluster. Versions are unresolved (confidence:
low) — no version pins exist in the repo yet.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- `make backup` now invokes `sudo tools/cmd/railiance-backup-s2` directly
- Move `mkdir -p` in railiance-backup-s2 to after the root check so the
script emits a clear error instead of a raw permission-denied failure
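
The ordering fix can be sketched as follows. The helper name require_root
and the exact error text are assumptions for illustration, not the
script's actual contents.

```shell
# Sketch only: require_root is a hypothetical helper name.
# Returns 0 when the given uid (default: the current user) is root,
# otherwise prints a clear message to stderr and returns 1.
require_root() {
    uid=${1:-$(id -u)}
    if [ "$uid" -ne 0 ]; then
        echo "error: railiance-backup-s2 must be run as root" >&2
        return 1
    fi
}

# Intended call order inside the script (shown as comments so the sketch
# can be sourced safely): check first, create directories only afterwards.
#   require_root || exit 1
#   mkdir -p /opt/backup/railiance/cluster
```

Running the check first means a non-root caller sees the explicit message
rather than mkdir's permission-denied output.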
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
tools/cmd/railiance-backup-s2:
- k3s etcd snapshot (age-encrypted)
- Helm release values for all namespaces (age-encrypted)
- kubeconfig /etc/rancher/k3s/k3s.yaml (age-encrypted)
- output: /opt/backup/railiance/cluster/, keep last 7, .last-backup stamp
- requires root, no network dependency
tools/cmd/railiance-restore-s2:
- lists available backups with sizes
- prints step-by-step restore instructions for each artifact type
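
The "keep last 7" retention and the .last-backup stamp could look roughly
like this. The cluster-*.tar.age naming scheme and the function name are
hypothetical; the sketch only assumes that file names embed a sortable
timestamp.

```shell
# Sketch only: prune_backups and the cluster-*.tar.age pattern are assumptions.
prune_backups() {
    dir=$1
    keep=${2:-7}
    # Timestamped names sort lexically, so reverse order lists newest first;
    # every entry after the first $keep is deleted.
    find "$dir" -maxdepth 1 -type f -name 'cluster-*.tar.age' | sort -r |
        tail -n +"$((keep + 1))" |
        while IFS= read -r old; do
            rm -f -- "$old"
        done
    # Record when the last run finished (UTC, ISO 8601).
    date -u +%Y-%m-%dT%H:%M:%SZ > "$dir/.last-backup"
}
```

Usage under these assumptions: prune_backups /opt/backup/railiance/cluster 7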
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
T02: note to verify etcd is in use before implementing; flag the root requirement
T03: add KUBECONFIG to helm commands; note root access approach
T06: document solution to sudo problem — run cron under root's crontab,
not a sudoers whitelist. Add restore drill commands. Fix cron to use
absolute path (~ unreliable in root crontab).
T01: note to remove old railiance-backup script (wrong scope)
Makefile: fix stale backup description, add restore target, fix .PHONY
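
For the T06 cron fix, a small sketch of building the crontab line while
rejecting non-absolute paths. The helper name, the schedule, and the
install path are hypothetical.

```shell
# Sketch only: cron_entry, the schedule and the example path are assumptions.
# Prints a crontab line, refusing script paths that are not absolute
# (cron does not perform shell tilde expansion reliably for root).
cron_entry() {
    schedule=$1
    script=$2
    case $script in
        /*) printf '%s %s\n' "$schedule" "$script" ;;
        *)  echo "error: use an absolute path, got: $script" >&2
            return 1 ;;
    esac
}

# Example, run as root so the job lands in root's own crontab:
#   ( crontab -l 2>/dev/null
#     cron_entry '0 3 * * *' /opt/railiance-cluster/tools/cmd/railiance-backup-s2
#   ) | crontab -
```

Installing into root's crontab avoids needing any sudoers entry for the
backup script.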
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
WP-0004 rewritten: scope narrowed to S2-owned assets (etcd snapshots,
Helm values, kubeconfig). No external dependencies. age encryption
reuses SOPS key pair. Output to /opt/backup/railiance/cluster/.
DECISIONS.md D4: integrated backup per capability, not centralized.
EP-RAIL-005 registered in state hub: custodian orchestration deferred
until all layers implement the standard interface.
The old monolithic backup (custodian DB + operator config) was not S2's
concern and has been removed from this workplan scope.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The railiance-backup script targets a developer workstation (custodian DB
in Docker + Claude config). It is not applicable to the server.
Server backup (Gitea repos + Zulip data) belongs in railiance-infra as an
Ansible role. T05 now documents this correctly and blocks wiring up a cron
job until the right script exists.
Also removed the incorrectly installed cron job that called the broken script.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
helm upgrade confirmed pgpool starts cleanly with adminPassword in values.
SOPS encryption applied. Smoke test passes. D3 failover test pending.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three bugs fixed:
- GITEA_URL defaulted to localhost:3000; Gitea NodePort is 32166
- Pod label app.kubernetes.io/name=postgresql-ha matched pgpool pod too;
added component=postgresql to target only postgres nodes
- Used bare 'kubectl' which is not on PATH; switched to 'k3s kubectl'
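
The three fixes can be sketched together as below. The message abbreviates
the added label as component=postgresql; the full key
app.kubernetes.io/component used here is an assumption based on common
chart conventions, as are the localhost host name, the --all-namespaces
scope, and the function name.

```shell
# Sketch only: label key, URL host, namespace scope and names are assumptions.
# Default to the Gitea NodePort instead of localhost:3000.
GITEA_URL="${GITEA_URL:-http://localhost:32166}"
# Bare kubectl is not on PATH on the server, so go through k3s.
KUBECTL="${KUBECTL:-k3s kubectl}"
# The name label alone also matches pgpool; the component label narrows
# the match to the postgres nodes.
PG_SELECTOR='app.kubernetes.io/name=postgresql-ha,app.kubernetes.io/component=postgresql'

list_postgres_pods() {
    # $KUBECTL is intentionally unquoted so "k3s kubectl" splits into two words.
    $KUBECTL get pods --all-namespaces -l "$PG_SELECTOR" -o name
}
```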
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
T01: helm/gitea-values.yaml with postgresql-ha.pgpool.adminPassword
(fill REPLACE_WITH_PGPOOL_ADMIN_PASSWORD before helm upgrade)
T02: tests/smoke_kube.sh — add pgpool and postgresql-ha pod health checks
T03: tests/test_ha_failover.sh — D3 HA failover test script
T04: docs/incidents/2026-03-10-pgpool-missing-secret.md + README link
Also: make test-ha-failover target, Makefile .PHONY updated.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add RAIL-BS-WP-0003 documenting the 2026-03-10 incident where a PostgreSQL
HA failover caused pgpool to enter CrashLoopBackOff due to a missing
pgpool-password key in the gitea-postgresql-ha-postgresql secret — a bug
present since initial deployment but hidden by the lack of any pod restart.
Add Decision D3: HA and failover scenarios must be tested before a workplan
is considered done. Any HA component deployment requires a passing failover
test script in tests/ and complete Helm values before status = completed.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous commit only included the staged portion (k3s tasks).
The working-tree additions (Helm install, kubeconfig fetch, version vars)
were never staged and were left behind.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
State Hub update pending: tunnel was offline during this session.
Run from local machine: cd ~/the-custodian/state-hub && make tunnel HOST=tegwick@92.205.130.254
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>