Commit Graph

22 Commits

Author SHA1 Message Date
1eb8559f27 tools and workplans
Some checks failed
railiance-tests / smoke (push) Has been cancelled
2026-05-15 23:03:28 +02:00
9fc5a033d5 feat(s2): add Gitea SSH NodePort service + close WP-0004 (backup tool, scope updates)
Some checks failed
railiance-tests / smoke (push) Has been cancelled
- helm/gitea-ssh-nodeport.yaml: expose Gitea SSH on NodePort 30022 (targetPort 2222)
  for on-node git automation (RAIL-HO-WP-0004-T07)
- tools/cmd/railiance-backup-s2: fix SQLite hot backup (was broken etcd-snapshot)
- tools/cmd/railiance-restore-s2: update restore instructions for SQLite mode
- workplans/RAIL-BS-WP-0004-safety-net.md: mark done
- SCOPE.md: update current state, document boundary violations, fix connectivity docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 01:07:02 +01:00
ee6d7b149e new workplan
Some checks failed
railiance-tests / smoke (push) Has been cancelled
2026-03-20 23:43:17 +01:00
66f8ca4009 docs(wp-0004): add implementation notes for sudo, etcd, helm, cron
Some checks failed
railiance-tests / smoke (push) Has been cancelled
T02: note to verify etcd is in use before implementing; flags root requirement
T03: add KUBECONFIG to helm commands; note root access approach
T06: document solution to sudo problem — run cron under root's crontab,
     not a sudoers whitelist. Add restore drill commands. Fix cron to use
     absolute path (~ unreliable in root crontab).
T01: note to remove old railiance-backup script (wrong scope)
Makefile: fix stale backup description, add restore target, fix .PHONY

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 16:52:40 +00:00
5b0cfbf10a feat(backup): revise WP-0004 — integrated backup per capability (D4)
Some checks failed
railiance-tests / smoke (push) Has been cancelled
WP-0004 rewritten: scope narrowed to S2-owned assets (etcd snapshots,
Helm values, kubeconfig). No external dependencies. age encryption
reuses SOPS key pair. Output to /opt/backup/railiance/cluster/.

DECISIONS.md D4: integrated backup per capability, not centralized.
EP-RAIL-005 registered in state hub: custodian orchestration deferred
until all layers implement the standard interface.

The old monolithic backup (custodian DB + operator config) was not S2's
concern and has been removed from this workplan scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 17:43:30 +01:00
719e4f40d1 fix(wp-0004): correct T05 scope — server backup is Gitea+Zulip via railiance-infra
Some checks failed
railiance-tests / smoke (push) Has been cancelled
The railiance-backup script targets a developer workstation (custodian DB
in Docker + Claude config). It is not applicable to the server.

Server backup (Gitea repos + Zulip data) belongs in railiance-infra as an
Ansible role. T05 now documents this correctly and blocks wiring up a cron
job until the right script exists.

Also removed the incorrectly installed cron job that called the broken script.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 14:51:42 +00:00
a15ceee92b chore(workplan): add state_hub_task_ids to WP-0004
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Written by fix-consistency: T01-T06 registered in state hub.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 15:24:28 +01:00
75467673a8 feat(safety-net): create WP-0004, update preflight for OAS 5-repo layout
- workplans/RAIL-BS-WP-0004-safety-net.md: ADR-001 workplan file for
  current-env-safety-net workstream (7e8b0c20), T01-T04 done, T05-T06 todo
- tools/cmd/railiance-preflight: update REPOS to OAS S1-S5 stack
  (railiance-infra/cluster/platform/enablement/apps) + project repos;
  remove stale railiance-bootstrap reference
- docs/backup-restore.md: fix Step 5 clone commands to current repo names
- Makefile: add make backup and make preflight targets

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 15:21:29 +01:00
441a37c5ae chore(workplan): mark WP-0003 completed — pgpool fix deployed and verified
helm upgrade confirmed pgpool starts cleanly with adminPassword in values.
SOPS encryption applied. Smoke test passes. D3 failover test pending.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 14:45:35 +01:00
660a63c674 feat(pgpool): implement WP-0003 T01-T04 — permanent fix for pgpool-password bug
Some checks failed
railiance-tests / smoke (push) Has been cancelled
T01: helm/gitea-values.yaml with postgresql-ha.pgpool.adminPassword
     (fill REPLACE_WITH_PGPOOL_ADMIN_PASSWORD before helm upgrade)
T02: tests/smoke_kube.sh — add pgpool and postgresql-ha pod health checks
T03: tests/test_ha_failover.sh — D3 HA failover test script
T04: docs/incidents/2026-03-10-pgpool-missing-secret.md + README link

Also: make test-ha-failover target, Makefile .PHONY updated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 14:16:22 +01:00
42391c3b61 chore(workplan): add state_hub_workstream_id to WP-0003
Registered by fix-consistency: workstream 7ee9ee22, tasks T01-T04.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 14:06:55 +01:00
359d5b8b5b bug(gitea): report pgpool CrashLoopBackOff on HA failover + D3 testing policy
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Add RAIL-BS-WP-0003 documenting the 2026-03-10 incident where a PostgreSQL
HA failover caused pgpool to enter CrashLoopBackOff due to a missing
pgpool-password key in the gitea-postgresql-ha-postgresql secret — a bug
present since initial deployment but hidden by the lack of any pod restart.

Add Decision D3: HA and failover scenarios must be tested before a workplan
is considered done. Any HA component deployment requires a passing failover
test script in tests/ and complete Helm values before status = completed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 13:03:36 +00:00
871c31a95d chore(workplan): mark WP-0002 completed — all tasks done 2026-03-10
State Hub update pending: tunnel was offline during this session.
Run from local machine: cd ~/the-custodian/state-hub && make tunnel HOST=tegwick@92.205.130.254

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 09:44:39 +00:00
ac13e94324 chore(workplan): pin k3s version to v1.35.1+k3s1
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 00:54:29 +01:00
5891f1a881 chore(workplan): update WP-0002 references after repo rename
- railiance-hosts → railiance-infra throughout
- ADR-002 → ADR-003 boundary reference
- Remove 'railiance-apps decision pending' note (resolved)

Tasks T01–T05 unchanged — all still todo.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 00:52:07 +01:00
01903a17bb chore(rename): railiance-bootstrap → railiance-cluster
Update all operational references to reflect the new repo name per
ADR-003 (OAS S2 Cluster Runtime). Historical text in docs preserved.
Gitea remote URL updated locally (Gitea repo rename is a manual step).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 00:34:21 +01:00
783c8cebbd feat(boundary): remove OS-hardening overlap; add k3s baseline workplan
Per ADR-002 (railiance-hosts/docs/adr/ADR-002-repo-boundary-hosts-vs-bootstrap.md):
- ansible/harden.yml: replaced with tombstone pointing to railiance-hosts
- ansible/bootstrap.yml: remove `import_playbook: harden.yml`; add
  pre-condition comment; OS hardening is no longer this repo's concern
- docs/first_host.md: rewritten to reflect 3-step flow:
  converge railiance-hosts → railiance-bootstrap k3s install → smoke test
- workplans/RAIL-BS-WP-0002-k3s-baseline.md: new workplan for k3s +
  Helm + Kubernetes platform baseline; linked to repo goal 70ab2379

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 19:53:22 +01:00
1d759508ac Workplan belonged to railiance-hosts
Some checks failed
railiance-tests / smoke (push) Has been cancelled
2026-03-08 23:30:12 +01:00
19661ca0c6 feat(bootstrap): add HostEurope hardening playbook and workplan
Some checks failed
railiance-tests / smoke (push) Has been cancelled
- workplans/RAIL-BS-WP-0002-hosteurope-bootstrap.md: new workplan for
  Secure Single-Server Bootstrap at HostEurope (repo goal d7092599).
  T01-T03 done; T04+T05 require ansible on a box with network access to
  92.205.62.239 (hosts.ini is gitignored — recreate on new box).

- ansible/harden.yml: new playbook — disables root/password SSH auth,
  enables UFW (allow 22/tcp 6443/tcp 8472/udp, deny-all default),
  installs fail2ban with SSH jail, sets HISTCONTROL=ignorespace.

- ansible/bootstrap.yml: import_playbook harden.yml runs before k3s.

- ansible/hosts.ini.example: add [hosteurope] group template.

- QUICKSTART.md: document two-stage bootstrap (harden → k3s).

- CLAUDE.md: add goal_guidance handling to session protocol
  (needs_workplan + alignment_warnings).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-08 22:50:51 +01:00
d83bc1049f added dependency workplan
Some checks failed
railiance-tests / smoke (push) Has been cancelled
2026-03-04 19:40:32 +01:00
c60f678756 docs(workplan): mark RAIL-BS-WP-0001 completed
Some checks failed
railiance-tests / smoke (push) Has been cancelled
All four tasks done; SBOM ingested 13 packages.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-01 20:23:05 +01:00
44428655d2 feat(sbom): add workplan RAIL-BS-WP-0001 — fix Ansible dep management
Some checks failed
railiance-tests / smoke (push) Has been cancelled
State Hub SBOM assessment identified a gap: no lockfile exists for the
Ansible control-node pip dependencies, making the repo unrepresentable
in the SBOM inventory.

4-task workplan to reach SBOM Level 3 (Ingested):
- T01: audit control-node pip deps
- T02: create pyproject.toml + uv.lock for ansible (+ transitive tree)
- T03: ingest into State Hub
- T04: create ansible/requirements.yml (even if empty, to be explicit)

State Hub task: 5f8cade5-119c-42e8-ba93-e9d0478650e4
Workstream: phase-0-operational-baseline (59155efb)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-01 19:29:20 +01:00