2055 Commits

Author SHA1 Message Date
2acc06f466 docs: add SCOPE.md for rapid orientation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-17 23:11:38 +01:00
4e1a90032b fix(backup): elevate sudo in Makefile and guard mkdir after root check
- `make backup` now invokes `sudo tools/cmd/railiance-backup-s2` directly
- Move `mkdir -p` in railiance-backup-s2 to after the root check so the
  script emits a clear error instead of a raw permission-denied failure
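A minimal sketch of the ordering described in the second bullet, assuming a POSIX shell. The function name `require_root`, the `BACKUP_DIR` variable, and the error text are illustrative and are not taken from railiance-backup-s2:

```shell
#!/bin/sh
# Hypothetical sketch: check for root first, create directories only afterwards.
BACKUP_DIR="${BACKUP_DIR:-/opt/backup/railiance/cluster}"

# require_root UID: returns 0 if UID is 0, otherwise prints a clear error
# to stderr and returns 1.
require_root() {
    if [ "$1" -ne 0 ]; then
        echo "error: this script must be run as root (try: sudo $0)" >&2
        return 1
    fi
}

# main: the root check runs before mkdir, so a non-root caller sees the
# message above instead of a raw "Permission denied" from mkdir.
main() {
    require_root "$(id -u)" || return 1
    mkdir -p "$BACKUP_DIR"
}
```

A real script would end with `main "$@"`; the call is left out here so the functions can be sourced and exercised in isolation.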

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 22:33:49 +00:00
7e28399f69 feat(backup): implement S2 integrated backup — WP-0004 T01-T04
tools/cmd/railiance-backup-s2:
  - k3s etcd snapshot (age-encrypted)
  - Helm release values for all namespaces (age-encrypted)
  - kubeconfig /etc/rancher/k3s/k3s.yaml (age-encrypted)
  - output: /opt/backup/railiance/cluster/, keep last 7, .last-backup stamp
  - requires root, no network dependency
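The "keep last 7" retention above could be sketched as follows. This is an assumption about the approach, not the script's actual code: the function name `prune_backups` and the `*.age` file suffix are hypothetical, and file names are assumed to contain no newlines:

```shell
#!/bin/sh
# Hypothetical sketch: keep the newest N backup files in a directory and
# delete the rest.
# prune_backups DIR [KEEP]   (KEEP defaults to 7)
prune_backups() {
    dir=$1
    keep=${2:-7}
    # ls -t lists newest first; tail -n +K starts output at line K,
    # so everything after the first $keep entries is removed.
    ls -1t "$dir"/*.age 2>/dev/null | tail -n +"$((keep + 1))" |
    while IFS= read -r f; do
        rm -f -- "$f"
    done
}
```

Because only `*.age` files are matched, a stamp file such as `.last-backup` in the same directory is left untouched.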

tools/cmd/railiance-restore-s2:
  - lists available backups with sizes
  - prints step-by-step restore instructions for each artifact type

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 21:17:54 +01:00
66f8ca4009 docs(wp-0004): add implementation notes for sudo, etcd, helm, cron
T02: note to verify etcd is in use before implementing; flags root requirement
T03: add KUBECONFIG to helm commands; note root access approach
T06: document solution to sudo problem — run cron under root's crontab,
     not a sudoers whitelist. Add restore drill commands. Fix cron to use
     absolute path (~ unreliable in root crontab).
T01: note to remove old railiance-backup script (wrong scope)
Makefile: fix stale backup description, add restore target, fix .PHONY

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 16:52:40 +00:00
5b0cfbf10a feat(backup): revise WP-0004 — integrated backup per capability (D4)
WP-0004 rewritten: scope narrowed to S2-owned assets (etcd snapshots,
Helm values, kubeconfig). No external dependencies. age encryption
reuses SOPS key pair. Output to /opt/backup/railiance/cluster/.

DECISIONS.md D4: integrated backup per capability, not centralized.
EP-RAIL-005 registered in state hub: custodian orchestration deferred
until all layers implement the standard interface.

The old monolithic backup (custodian DB + operator config) was not S2's
concern and has been removed from this workplan scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 17:43:30 +01:00
719e4f40d1 fix(wp-0004): correct T05 scope — server backup is Gitea+Zulip via railiance-infra
The railiance-backup script targets a developer workstation (custodian DB
in Docker + Claude config). It is not applicable to the server.

Server backup (Gitea repos + Zulip data) belongs in railiance-infra as an
Ansible role. T05 now documents this correctly and blocks wiring up a cron
job until the right script exists.

Also removed the incorrectly installed cron job that called the broken script.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 14:51:42 +00:00
a15ceee92b chore(workplan): add state_hub_task_ids to WP-0004
Written by fix-consistency: T01-T06 registered in state hub.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 15:24:28 +01:00
75467673a8 feat(safety-net): create WP-0004, update preflight for OAS 5-repo layout
- workplans/RAIL-BS-WP-0004-safety-net.md: ADR-001 workplan file for
  current-env-safety-net workstream (7e8b0c20), T01-T04 done, T05-T06 todo
- tools/cmd/railiance-preflight: update REPOS to OAS S1-S5 stack
  (railiance-infra/cluster/platform/enablement/apps) + project repos;
  remove stale railiance-bootstrap reference
- docs/backup-restore.md: fix Step 5 clone commands to current repo names
- Makefile: add make backup and make preflight targets

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 15:21:29 +01:00
441a37c5ae chore(workplan): mark WP-0003 completed — pgpool fix deployed and verified
helm upgrade confirmed pgpool starts cleanly with adminPassword in values.
SOPS encryption applied. Smoke test passes. D3 failover test pending.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 14:45:35 +01:00
3297ac1f6c fix(test): correct ha-failover test — wrong URL, wrong pod label, missing kubectl
Three bugs:
- GITEA_URL defaulted to localhost:3000; Gitea NodePort is 32166
- Pod label app.kubernetes.io/name=postgresql-ha matched pgpool pod too;
  added component=postgresql to target only postgres nodes
- Used bare 'kubectl' which is not on PATH; switched to 'k3s kubectl'

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 13:42:54 +00:00
7daef079c2 feat(secrets): encrypt gitea Helm values with SOPS (age)
Add .sops.yaml policy targeting *.sops.yaml files using the shared age
key from railiance-infra. Migrate helm/gitea-values.yaml to encrypted
helm/gitea-values.sops.yaml.

Pins all postgresql-ha passwords (postgresql, postgres, repmgr, pgpool,
pgpool-admin, sr-check) so helm upgrade never regenerates secrets and
breaks the running cluster. Fixes WP-0003 T01.

Usage: helm upgrade gitea gitea/gitea -n default -f <(sops -d helm/gitea-values.sops.yaml)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 13:37:22 +00:00
660a63c674 feat(pgpool): implement WP-0003 T01-T04 — permanent fix for pgpool-password bug
T01: helm/gitea-values.yaml with postgresql-ha.pgpool.adminPassword
     (fill REPLACE_WITH_PGPOOL_ADMIN_PASSWORD before helm upgrade)
T02: tests/smoke_kube.sh — add pgpool and postgresql-ha pod health checks
T03: tests/test_ha_failover.sh — D3 HA failover test script
T04: docs/incidents/2026-03-10-pgpool-missing-secret.md + README link

Also: make test-ha-failover target, Makefile .PHONY updated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 14:16:22 +01:00
42391c3b61 chore(workplan): add state_hub_workstream_id to WP-0003
Registered by fix-consistency: workstream 7ee9ee22, tasks T01-T04.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 14:06:55 +01:00
359d5b8b5b bug(gitea): report pgpool CrashLoopBackOff on HA failover + D3 testing policy
Add RAIL-BS-WP-0003 documenting the 2026-03-10 incident where a PostgreSQL
HA failover caused pgpool to enter CrashLoopBackOff due to a missing
pgpool-password key in the gitea-postgresql-ha-postgresql secret — a bug
present since initial deployment but hidden by the lack of any pod restart.

Add Decision D3: HA and failover scenarios must be tested before a workplan
is considered done. Any HA component deployment requires a passing failover
test script in tests/ and complete Helm values before status = completed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 13:03:36 +00:00
ada406f327 fix(bootstrap): commit full bootstrap.yml — Helm + kubeconfig tasks
The previous commit only included the staged portion (k3s tasks).
The working-tree additions (Helm install, kubeconfig fetch, version vars)
were never staged and were left behind.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 09:52:36 +00:00
871c31a95d chore(workplan): mark WP-0002 completed — all tasks done 2026-03-10
State Hub update pending: tunnel was offline during this session.
Run from local machine: cd ~/the-custodian/state-hub && make tunnel HOST=tegwick@92.205.130.254

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 09:44:39 +00:00
901535ca44 feat(k3s-baseline): complete WP-0002 T01-T05
- bootstrap.yml: install k3s (server+cluster-init, pinned v1.35.1+k3s1)
  and Helm (v3.17.3 with checksum verify); fetch kubeconfig to control node
- tests/smoke_kube.sh: assert node Ready, helm, CoreDNS, Traefik
- docs/kubeconfig.md: usage, merge, context-switch, security note
- Makefile: k3s-install and smoke targets with make help

Closes T01, T02, T03, T04, T05 of RAIL-BS-WP-0002.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 09:43:16 +00:00
fb6618e9ab fix(claude): correct COULOMBCORE IP to 92.205.130.254
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 01:54:20 +01:00
fded740121 docs(claude): update tunnel instructions to use state-hub Makefile
Tunnel is now started from ~/the-custodian/state-hub:
  make tunnel HOST=tegwick@92.205.62.239

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 01:19:45 +01:00
e1c33712c1 docs(claude): add State Hub tunnel setup instructions
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 01:08:57 +01:00
ac13e94324 chore(workplan): pin k3s version to v1.35.1+k3s1
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 00:54:29 +01:00
5891f1a881 chore(workplan): update WP-0002 references after repo rename
- railiance-hosts → railiance-infra throughout
- ADR-002 → ADR-003 boundary reference
- Remove 'railiance-apps decision pending' note (resolved)

Tasks T01–T05 unchanged — all still todo.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 00:52:07 +01:00
4561aa5aec chore(relocate): stub out S1 items moved to railiance-infra
cloudinit/user-data.yaml and tools/cmd/railiance-plan-host relocated
to railiance-infra per ADR-003. Tombstone stubs left in place.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 00:35:04 +01:00
01903a17bb chore(rename): railiance-bootstrap → railiance-cluster
Update all operational references to reflect the new repo name per
ADR-003 (OAS S2 Cluster Runtime). Historical text in docs preserved.
Gitea remote URL updated locally (Gitea repo rename is a manual step).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 00:34:21 +01:00
783c8cebbd feat(boundary): remove OS-hardening overlap; add k3s baseline workplan
Per ADR-002 (railiance-hosts/docs/adr/ADR-002-repo-boundary-hosts-vs-bootstrap.md):
- ansible/harden.yml: replaced with tombstone pointing to railiance-hosts
- ansible/bootstrap.yml: remove `import_playbook: harden.yml`; add
  pre-condition comment; OS hardening is no longer this repo's concern
- docs/first_host.md: rewritten to reflect 3-step flow:
  converge railiance-hosts → railiance-bootstrap k3s install → smoke test
- workplans/RAIL-BS-WP-0002-k3s-baseline.md: new workplan for k3s +
  Helm + Kubernetes platform baseline; linked to repo goal 70ab2379

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 19:53:22 +01:00
1d759508ac Workplan belonged to railiance-hosts
2026-03-08 23:30:12 +01:00
19661ca0c6 feat(bootstrap): add HostEurope hardening playbook and workplan
- workplans/RAIL-BS-WP-0002-hosteurope-bootstrap.md: new workplan for
  Secure Single-Server Bootstrap at HostEurope (repo goal d7092599).
  T01-T03 done; T04+T05 require ansible on a box with network access to
  92.205.62.239 (hosts.ini is gitignored — recreate on new box).

- ansible/harden.yml: new playbook — disables root/password SSH auth,
  enables UFW (allow 22/tcp 6443/tcp 8472/udp, deny-all default),
  installs fail2ban with SSH jail, sets HISTCONTROL=ignorespace.

- ansible/bootstrap.yml: import_playbook harden.yml runs before k3s.

- ansible/hosts.ini.example: add [hosteurope] group template.

- QUICKSTART.md: document two-stage bootstrap (harden → k3s).

- CLAUDE.md: add goal_guidance handling to session protocol
  (needs_workplan + alignment_warnings).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-08 22:50:51 +01:00
d83bc1049f added dependency workplan
2026-03-04 19:40:32 +01:00
c60f678756 docs(workplan): mark RAIL-BS-WP-0001 completed
All four tasks done; SBOM ingested 13 packages.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-01 20:23:05 +01:00
f7b8cdb4c1 feat(deps): add pyproject.toml, uv.lock, and ansible/requirements.yml
Declares ansible>=10 as the only pip dependency for the control node.
Generates uv.lock pinning ansible 12.3.0 / ansible-core 2.19.7 and
the full transitive tree (13 packages). Adds explicit empty
ansible/requirements.yml confirming no Galaxy collections are used.

Closes RAIL-BS-WP-0001 T01–T04. Enables SBOM ingestion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-01 20:22:26 +01:00
9fe5348af3 fix(CLAUDE.md): use reliable workplan discovery in step 2
Glob with pattern 'workplans/*.md' from repo root fails silently
(tool limitation with subdirectory prefixes in patterns). Changed to
Glob(pattern="**/*.md", path="workplans/") which does find files,
with Bash ls as fallback. This fixes step 2 of the session protocol
silently producing no workplan results.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-01 20:13:29 +01:00
1aa5e436ae fix(CLAUDE.md): rewrite session protocol to surface custodian tasks on open
Previous CLAUDE.md only had a First Session Protocol. When workstreams already
existed, the session would call get_state_summary() and produce no useful output.

New 3-step protocol:
- Step 1: get_state_summary() + get_next_steps() via state-hub MCP tools
- Step 2: scan workplans/*.md for active tasks
- Step 3: output orientation brief: active workstreams, pending repo tasks
  (from workplans/ + [repo:railiance-bootstrap] state hub tasks), suggested
  next action, SBOM status (currently null — gap noted)

Also adds Known Pending Tasks table for RAIL-BS-WP-0001 (dep management)
and strengthens ADR-001 workplan convention and contribution tracking sections.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-01 20:05:22 +01:00
44428655d2 feat(sbom): add workplan RAIL-BS-WP-0001 — fix Ansible dep management
State Hub SBOM assessment identified a gap: no lockfile exists for the
Ansible control-node pip dependencies, making the repo unrepresentable
in the SBOM inventory.

4-task workplan to reach SBOM Level 3 (Ingested):
- T01: audit control-node pip deps
- T02: create pyproject.toml + uv.lock for ansible (+ transitive tree)
- T03: ingest into State Hub
- T04: create ansible/requirements.yml (even if empty, to be explicit)

State Hub task: 5f8cade5-119c-42e8-ba93-e9d0478650e4
Workstream: phase-0-operational-baseline (59155efb)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-01 19:29:20 +01:00
76ae1351ce fix: correct pg_stat_user_tables column name in restore drill (relname not tablename)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-26 07:25:59 +01:00
ab5b12334d docs: backup and restore runbook
Covers encryption (age key management), what is protected, backup
command, daily cron, preflight checks, full step-by-step restore
procedure, restore drill instructions, and two extension points
(EP-RAIL-003 git mirrors, EP-RAIL-004 offsite secondary copy).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-26 00:08:14 +01:00
4381a079a2 feat: backup + preflight commands, decisions log, gitignore update
- tools/cmd/railiance-backup: pg_dump + config snapshot, age-encrypted,
  uploaded to Nextcloud file drop via curl PUT. Daily cron target.
- tools/cmd/railiance-preflight: pre-migration safety gate — checks backup
  freshness, all repos clean/pushed, age key present.
- bin/railiance: added backup and preflight subcommands.
- DECISIONS.md: decision log (D1 ingress Nginx+Traefik, D2 Nextcloud backup).
- .gitignore: exclude *backup-dropoff-link* files (contain upload tokens).
- CLAUDE.md: state hub session protocol update.
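The backup-freshness check that railiance-preflight performs could look roughly like this. It is a sketch under assumptions: the function name `backup_is_fresh`, the use of a stamp file's modification time, and the 24-hour default are all hypothetical, and `find -mmin` is a GNU/BSD extension rather than strict POSIX:

```shell
#!/bin/sh
# Hypothetical sketch: succeed only if a stamp file exists and was modified
# within the last MAX_AGE_MIN minutes.
# backup_is_fresh STAMP_FILE [MAX_AGE_MIN]   (default 1440 = 24 hours)
backup_is_fresh() {
    stamp=$1
    max_age_min=${2:-1440}
    [ -f "$stamp" ] || return 1
    # find prints the path only when its mtime is newer than the limit.
    [ -n "$(find "$stamp" -mmin "-$max_age_min" 2>/dev/null)" ]
}
```

A preflight gate could then abort with a non-zero exit status whenever `backup_is_fresh` fails for the chosen stamp path.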

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 23:59:28 +01:00
eb8a6902b6 docs: add ThreePhoenix architecture concept and workplan
RailianceThreePhoenix: 3-node HA Kubernetes cluster with embedded etcd,
Longhorn distributed storage, PostgreSQL HA (repmgr + Pgpool-II), and
Phoenix CronJob for weekly node rotation to prevent configuration drift.

ThreePhoenixWorkplan: 7-phase implementation plan from blank Ubuntu nodes
to self-healing Gitea cluster with monitoring and alert silencing.

Also adds CLAUDE.md with Custodian State Hub session protocol.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 01:13:05 +01:00
b7696e657f chore: another improvement
2025-09-13 01:23:02 +00:00
6662e9a377 chore: improved quickstart
2025-09-13 01:18:14 +00:00
53482f8e65 chore: improve quickstart instructions
2025-09-13 00:57:17 +00:00
b1862d67f0 feat: added plan-host command
2025-09-13 02:46:48 +02:00
7530468d80 refactor: separated command script
2025-09-13 02:39:47 +02:00
0bfdf465c1 chore: improved quickstart for newbies
2025-09-13 02:00:25 +02:00
96eccc6b67 feat: rails style bootkit bin/railiance with quickstart
2025-09-13 01:32:19 +02:00
676ec32379 chore: remove outdated readme
2025-09-13 00:15:44 +02:00
75d11583e4 feat(tools): directed panspermia inspired colonization scripts
2025-09-13 00:11:18 +02:00
be750b005b chore: add MIT License
2025-09-12 02:44:03 +02:00
038adef94a docs: update root README with quick start and structure
2025-09-12 02:36:45 +02:00
15935b78aa docs: add Railiance overview README
2025-09-12 02:31:24 +02:00
3f8dafe5c2 docs: add contributor guidelines
2025-09-12 02:25:53 +02:00