Captures k3s, helm, kubectl, goss, sops, and age as direct tool
dependencies for railiance-cluster. Versions are unresolved (confidence:
low) — no version pins exist in the repo yet.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- `make backup` now invokes `sudo tools/cmd/railiance-backup-s2` directly
- Move `mkdir -p` in railiance-backup-s2 to after the root check so the
  script emits a clear error instead of a raw permission-denied failure
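
A minimal sketch of the check-then-create ordering described above. The
function names and the error text are hypothetical, not taken from
railiance-backup-s2; the UID is passed as an argument so the ordering can be
exercised without root.

```shell
# Hypothetical helper: succeed only when the given UID is 0 (root).
require_root() {
    uid="$1"
    if [ "$uid" -ne 0 ]; then
        echo "error: this script must be run as root" >&2
        return 1
    fi
    return 0
}

# Hypothetical entry point: check privileges first, create directories second,
# so a non-root caller sees the clear message rather than a mkdir failure.
prepare_backup_dir() {
    uid="$1"
    dir="$2"
    require_root "$uid" || return 1
    mkdir -p "$dir"
}
```

In a real script this would presumably be called as
`prepare_backup_dir "$(id -u)" /opt/backup/railiance/cluster`.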
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
tools/cmd/railiance-backup-s2:
- k3s etcd snapshot (age-encrypted)
- Helm release values for all namespaces (age-encrypted)
- kubeconfig /etc/rancher/k3s/k3s.yaml (age-encrypted)
- output: /opt/backup/railiance/cluster/, keep last 7, .last-backup stamp
- requires root, no network dependency
tools/cmd/railiance-restore-s2:
- lists available backups with sizes
- prints step-by-step restore instructions for each artifact type
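
The "keep last 7" retention could look roughly like the sketch below. It
assumes each backup is an entry whose name starts with a sortable timestamp
and contains no whitespace; the function name `prune_backups` is hypothetical,
and the actual script may prune differently.

```shell
# Hypothetical helper: keep the newest $2 entries in directory $1.
# Assumes names sort chronologically (e.g. a leading date stamp).
# Dotfiles such as .last-backup are not listed by ls and are left alone.
prune_backups() {
    dir="$1"
    keep="$2"
    # Sort names newest-first, skip the first $keep, delete the rest.
    ls -1 "$dir" | sort -r | tail -n +"$((keep + 1))" | while IFS= read -r name; do
        rm -rf -- "${dir:?}/$name"
    done
}
```

Under these assumptions, `prune_backups /opt/backup/railiance/cluster 7` would
leave the seven most recent backups and the stamp file in place.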
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
T02: note to verify etcd is in use before implementing; flag root requirement
T03: add KUBECONFIG to helm commands; note root access approach
T06: document solution to sudo problem — run cron under root's crontab,
not a sudoers whitelist. Add restore drill commands. Fix cron to use
absolute path (~ unreliable in root crontab).
T01: note to remove old railiance-backup script (wrong scope)
Makefile: fix stale backup description, add restore target, fix .PHONY
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
WP-0004 rewritten: scope narrowed to S2-owned assets (etcd snapshots,
Helm values, kubeconfig). No external dependencies. age encryption
reuses SOPS key pair. Output to /opt/backup/railiance/cluster/.
DECISIONS.md D4: integrated backup per capability, not centralized.
EP-RAIL-005 registered in state hub: custodian orchestration deferred
until all layers implement the standard interface.
The old monolithic backup (custodian DB + operator config) was not S2's
concern and has been removed from this workplan scope.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The railiance-backup script targets a developer workstation (custodian DB
in Docker + Claude config). It is not applicable to the server.
Server backup (Gitea repos + Zulip data) belongs in railiance-infra as an
Ansible role. T05 now documents this correctly and blocks wiring up a cron
job until the right script exists.
Also removed the incorrectly installed cron job that called the broken script.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
helm upgrade confirmed pgpool starts cleanly with adminPassword in values.
SOPS encryption applied. Smoke test passes. D3 failover test pending.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes three bugs:
- GITEA_URL defaulted to localhost:3000; the Gitea NodePort is 32166
- Pod label app.kubernetes.io/name=postgresql-ha matched the pgpool pod too;
  added component=postgresql to target only the postgres nodes
- Used bare 'kubectl', which is not on PATH; switched to 'k3s kubectl'
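
A hedged sketch of the last two fixes. The commit switched to `k3s kubectl`
outright; the fallback shown here is a variation, not the committed change.
The full label key `app.kubernetes.io/component` is an assumption (the message
only says `component=postgresql`), the helper names are hypothetical, and the
namespace is omitted because it is not stated.

```shell
# Print the kubectl invocation to use: plain kubectl when it is on PATH,
# otherwise the copy bundled with k3s.
kube_cmd() {
    if command -v kubectl >/dev/null 2>&1; then
        echo "kubectl"
    else
        echo "k3s kubectl"
    fi
}

# Assumed full label keys; narrows the match so the pgpool pod is excluded.
PG_SELECTOR="app.kubernetes.io/name=postgresql-ha,app.kubernetes.io/component=postgresql"

# Print (not run) the command that would list only the PostgreSQL pods.
pg_pods_cmd() {
    echo "$(kube_cmd) get pods -l $PG_SELECTOR"
}
```

Printing the command rather than executing it keeps the sketch safe to try on
a machine without cluster access.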
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
T01: helm/gitea-values.yaml with postgresql-ha.pgpool.adminPassword
(fill REPLACE_WITH_PGPOOL_ADMIN_PASSWORD before helm upgrade)
T02: tests/smoke_kube.sh — add pgpool and postgresql-ha pod health checks
T03: tests/test_ha_failover.sh — D3 HA failover test script
T04: docs/incidents/2026-03-10-pgpool-missing-secret.md + README link
Also: make test-ha-failover target, Makefile .PHONY updated.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add RAIL-BS-WP-0003 documenting the 2026-03-10 incident where a PostgreSQL
HA failover caused pgpool to enter CrashLoopBackOff due to a missing
pgpool-password key in the gitea-postgresql-ha-postgresql secret — a bug
present since initial deployment but hidden by the lack of any pod restart.
Add Decision D3: HA and failover scenarios must be tested before a workplan
is considered done. Any HA component deployment requires a passing failover
test script in tests/ and complete Helm values before status = completed.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous commit only included the staged portion (k3s tasks).
The working-tree additions (Helm install, kubeconfig fetch, version vars)
were never staged and were left behind.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
State Hub update pending: tunnel was offline during this session.
Run from local machine: cd ~/the-custodian/state-hub && make tunnel HOST=tegwick@92.205.130.254
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cloudinit/user-data.yaml and tools/cmd/railiance-plan-host relocated
to railiance-infra per ADR-003. Tombstone stubs left in place.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Update all operational references to reflect the new repo name per
ADR-003 (OAS S2 Cluster Runtime). Historical text in docs preserved.
Gitea remote URL updated locally (Gitea repo rename is a manual step).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Per ADR-002 (railiance-hosts/docs/adr/ADR-002-repo-boundary-hosts-vs-bootstrap.md):
- ansible/harden.yml: replaced with tombstone pointing to railiance-hosts
- ansible/bootstrap.yml: remove `import_playbook: harden.yml`; add
  pre-condition comment; OS hardening is no longer this repo's concern
- docs/first_host.md: rewritten to reflect 3-step flow:
  converge railiance-hosts → railiance-bootstrap k3s install → smoke test
- workplans/RAIL-BS-WP-0002-k3s-baseline.md: new workplan for k3s +
  Helm + Kubernetes platform baseline; linked to repo goal 70ab2379
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Declares ansible>=10 as the only pip dependency for the control node.
Generates uv.lock pinning ansible 12.3.0 / ansible-core 2.19.7 and
the full transitive tree (13 packages). Adds explicit empty
ansible/requirements.yml confirming no Galaxy collections are used.
Closes RAIL-BS-WP-0001 T01–T04. Enables SBOM ingestion.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Glob with pattern 'workplans/*.md' from repo root fails silently
(tool limitation with subdirectory prefixes in patterns). Changed to
Glob(pattern="**/*.md", path="workplans/") which does find files,
with Bash ls as fallback. This fixes step 2 of the session protocol
silently producing no workplan results.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previous CLAUDE.md only had a First Session Protocol. When workstreams already
existed, the session would call get_state_summary() and produce no useful output.
New 3-step protocol:
- Step 1: get_state_summary() + get_next_steps() via state-hub MCP tools
- Step 2: scan workplans/*.md for active tasks
- Step 3: output orientation brief: active workstreams, pending repo tasks
  (from workplans/ + [repo:railiance-bootstrap] state hub tasks), suggested
  next action, SBOM status (currently null; gap noted)
Also adds Known Pending Tasks table for RAIL-BS-WP-0001 (dep management)
and strengthens ADR-001 workplan convention and contribution tracking sections.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
State Hub SBOM assessment identified a gap: no lockfile exists for the
Ansible control-node pip dependencies, making the repo unrepresentable
in the SBOM inventory.
4-task workplan to reach SBOM Level 3 (Ingested):
- T01: audit control-node pip deps
- T02: create pyproject.toml + uv.lock for ansible (+ transitive tree)
- T03: ingest into State Hub
- T04: create ansible/requirements.yml (even if empty, to be explicit)
State Hub task: 5f8cade5-119c-42e8-ba93-e9d0478650e4
Workstream: phase-0-operational-baseline (59155efb)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
RailianceThreePhoenix: 3-node HA Kubernetes cluster with embedded etcd,
Longhorn distributed storage, PostgreSQL HA (repmgr + Pgpool-II), and
Phoenix CronJob for weekly node rotation to prevent configuration drift.
ThreePhoenixWorkplan: 7-phase implementation plan from blank Ubuntu nodes
to self-healing Gitea cluster with monitoring and alert silencing.
Also adds CLAUDE.md with Custodian State Hub session protocol.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>