T02: add note to verify etcd is in use before implementing; flag root requirement
T03: add KUBECONFIG to helm commands; note root access approach
T06: document solution to the sudo problem: run cron under root's crontab,
not a sudoers whitelist. Add restore drill commands. Fix cron to use an
absolute path (~ is unreliable in root's crontab).
T01: add note to remove old railiance-backup script (wrong scope)
Makefile: fix stale backup description, add restore target, fix .PHONY
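
A minimal sketch of the absolute-path rule from T06, assuming a POSIX shell.
The helper name `cron_line`, the schedule, and the script path are
hypothetical and not taken from the workplan; the helper only builds a
crontab line and refuses any command path that does not start with `/`.

```shell
#!/bin/sh
# Build a crontab entry, refusing any command path that is not absolute.
# Both arguments are illustrative; neither comes from the workplan.
#   $1: cron schedule, e.g. '0 3 * * *'
#   $2: path of the script to run
cron_line() {
    schedule=$1
    script=$2
    case $script in
        /*) printf '%s %s\n' "$schedule" "$script" ;;
        *)  echo "refusing non-absolute path: $script" >&2
            return 1 ;;
    esac
}

# Example (hypothetical path):
#   cron_line '0 3 * * *' /usr/local/bin/cluster-backup.sh
```

The printed line would then be added to root's crontab by hand.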
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
WP-0004 rewritten: scope narrowed to S2-owned assets (etcd snapshots,
Helm values, kubeconfig). No external dependencies. age encryption
reuses SOPS key pair. Output to /opt/backup/railiance/cluster/.
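
A rough sketch of the encryption step, assuming the reused SOPS key pair is
an age key pair and that the `age` CLI is installed. The function name
`encrypt_artifact`, the `.age` suffix, and the example file name are
hypothetical; only the output directory comes from this message.

```shell
#!/bin/sh
# Output directory from the commit message; can be overridden via the
# environment.
BACKUP_DIR=${BACKUP_DIR:-/opt/backup/railiance/cluster}

# Encrypt one artifact into BACKUP_DIR with age.
#   $1: path of the plaintext artifact (hypothetical, e.g. an etcd snapshot)
#   $2: age recipient (public key); assumed to be the SOPS age public key
encrypt_artifact() {
    src=$1
    recipient=$2
    dest="$BACKUP_DIR/$(basename "$src").age"
    age -r "$recipient" -o "$dest" "$src"
}

# Example (hypothetical file and recipient):
#   encrypt_artifact /tmp/etcd-snapshot.db "$AGE_RECIPIENT"
```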
DECISIONS.md D4: integrated backup per capability, not centralized.
EP-RAIL-005 registered in state hub: custodian orchestration deferred
until all layers implement the standard interface.
The old monolithic backup (custodian DB + operator config) was not S2's
concern and has been removed from this workplan scope.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The railiance-backup script targets a developer workstation (custodian DB
in Docker + Claude config). It is not applicable to the server.
Server backup (Gitea repos + Zulip data) belongs in railiance-infra as an
Ansible role. T05 now documents this correctly and blocks wiring up a cron
job until the right script exists.
Also removed the incorrectly installed cron job that called the broken script.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
helm upgrade confirmed pgpool starts cleanly with adminPassword in values.
SOPS encryption applied. Smoke test passes. D3 failover test pending.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
T01: helm/gitea-values.yaml with postgresql-ha.pgpool.adminPassword
(fill REPLACE_WITH_PGPOOL_ADMIN_PASSWORD before helm upgrade)
T02: tests/smoke_kube.sh — add pgpool and postgresql-ha pod health checks
T03: tests/test_ha_failover.sh — D3 HA failover test script
T04: docs/incidents/2026-03-10-pgpool-missing-secret.md + README link
Also: make test-ha-failover target, Makefile .PHONY updated.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add RAIL-BS-WP-0003 documenting the 2026-03-10 incident where a PostgreSQL
HA failover caused pgpool to enter CrashLoopBackOff due to a missing
pgpool-password key in the gitea-postgresql-ha-postgresql secret — a bug
present since initial deployment but hidden by the lack of any pod restart.
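
The missing-key condition described above could be checked ahead of a
restart with something like the following sketch. The function name
`secret_has_key` and the `gitea` namespace are assumptions; the secret and
key names are the ones from the incident.

```shell
#!/bin/sh
# Return 0 if the named secret holds a non-empty value for the given key.
#   $1: namespace (assumed; not stated in the incident description)
#   $2: secret name
#   $3: key inside .data (keys containing dots would need escaping)
secret_has_key() {
    namespace=$1
    secret=$2
    key=$3
    value=$(kubectl -n "$namespace" get secret "$secret" \
        -o "jsonpath={.data.$key}") || return 1
    [ -n "$value" ]
}

# Example:
#   secret_has_key gitea gitea-postgresql-ha-postgresql pgpool-password \
#       || echo "pgpool-password key is missing" >&2
```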
Add Decision D3: HA and failover scenarios must be tested before a workplan
is considered done. Any HA component deployment requires a passing failover
test script in tests/ and complete Helm values before status = completed.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
State Hub update pending: tunnel was offline during this session.
Run from local machine: cd ~/the-custodian/state-hub && make tunnel HOST=tegwick@92.205.130.254
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Update all operational references to reflect the new repo name per
ADR-003 (OAS S2 Cluster Runtime). Historical text in docs preserved.
Gitea remote URL updated locally (Gitea repo rename is a manual step).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Per ADR-002 (railiance-hosts/docs/adr/ADR-002-repo-boundary-hosts-vs-bootstrap.md):
- ansible/harden.yml: replaced with tombstone pointing to railiance-hosts
- ansible/bootstrap.yml: removed `import_playbook: harden.yml`; added
  pre-condition comment; OS hardening is no longer this repo's concern
- docs/first_host.md: rewritten to reflect the 3-step flow:
  converge railiance-hosts → railiance-bootstrap k3s install → smoke test
- workplans/RAIL-BS-WP-0002-k3s-baseline.md: new workplan for k3s +
  Helm + Kubernetes platform baseline; linked to repo goal 70ab2379
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
State Hub SBOM assessment identified a gap: no lockfile exists for the
Ansible control-node pip dependencies, making the repo unrepresentable
in the SBOM inventory.
Add a 4-task workplan to reach SBOM Level 3 (Ingested):
- T01: audit control-node pip deps
- T02: create pyproject.toml + uv.lock for ansible (+ transitive tree)
- T03: ingest into State Hub
- T04: create ansible/requirements.yml (even if empty, to be explicit)
State Hub task: 5f8cade5-119c-42e8-ba93-e9d0478650e4
Workstream: phase-0-operational-baseline (59155efb)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>