Commit Graph

86 Commits

Author SHA1 Message Date
f9098e2dea chore(consistency): sync task status from DB [auto]
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for railiance-cluster
2026-04-21 04:53:49 +02:00
977343f712 chore(consistency): sync task status from DB [auto]
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for railiance-bootstrap
2026-04-21 04:53:41 +02:00
171ece866a chore(consistency): sync task status from DB [auto]
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for railiance-cluster
2026-04-21 04:39:16 +02:00
438fd22704 chore(consistency): sync task status from DB [auto]
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for railiance-bootstrap
2026-04-21 04:39:04 +02:00
baed899a07 chore(consistency): sync task status from DB [auto]
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for railiance-cluster
2026-04-21 04:24:43 +02:00
bc2107a339 chore(consistency): sync task status from DB [auto]
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for railiance-bootstrap
2026-04-21 04:24:31 +02:00
93b4689f20 chore(consistency): sync task status from DB [auto]
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for railiance-cluster
2026-04-21 04:10:06 +02:00
74d187dea0 chore(consistency): sync task status from DB [auto]
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for railiance-bootstrap
2026-04-21 04:09:55 +02:00
08cd460fcf chore(consistency): sync task status from DB [auto]
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for railiance-cluster
2026-04-21 03:55:32 +02:00
77a799c7fd chore(consistency): sync task status from DB [auto]
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for railiance-bootstrap
2026-04-21 03:55:21 +02:00
a4448b8688 chore(consistency): sync task status from DB [auto]
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for railiance-cluster
2026-04-21 03:40:53 +02:00
8af0e197ba chore(consistency): sync task status from DB [auto]
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for railiance-bootstrap
2026-04-21 03:40:41 +02:00
b7b1822f35 chore(consistency): sync task status from DB [auto]
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for railiance-cluster
2026-04-21 03:26:16 +02:00
c028150a1f chore(consistency): sync task status from DB [auto]
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for railiance-bootstrap
2026-04-21 03:26:06 +02:00
ea00ae3453 chore(consistency): sync task status from DB [auto]
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for railiance-cluster
2026-04-21 03:11:36 +02:00
b63689ba33 chore(consistency): sync task status from DB [auto]
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for railiance-bootstrap
2026-04-21 03:11:25 +02:00
d949dbee30 chore(consistency): sync task status from DB [auto]
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for railiance-cluster
2026-04-21 02:56:58 +02:00
9d220b6649 chore(consistency): sync task status from DB [auto]
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for railiance-bootstrap
2026-04-21 02:56:46 +02:00
493356a1a3 chore(consistency): sync task status from DB [auto]
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for railiance-cluster
2026-04-21 02:42:21 +02:00
50b1fb3b08 chore(consistency): sync task status from DB [auto]
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for railiance-bootstrap
2026-04-21 02:42:10 +02:00
58cf10c5a1 chore(consistency): sync task status from DB [auto]
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for railiance-cluster
2026-04-21 02:28:33 +02:00
3202475bed chore(consistency): sync task status from DB [auto]
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for railiance-bootstrap
2026-04-21 02:28:18 +02:00
0e2cde0b82 chore(consistency): sync task status from DB [auto]
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for railiance-cluster
2026-04-21 02:15:02 +02:00
595a043634 feat(boundary): move Gitea Helm values to railiance-apps (T06)
Some checks failed
railiance-tests / smoke (push) Has been cancelled
gitea-values.sops.yaml relocated to railiance-apps/helm/ per
ADR-003 boundary rules — Gitea is S5, values belong in S5 repo.
Tombstone left in helm/MOVED.md. SCOPE.md updated to reflect
resolved violation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 13:23:41 +01:00
9fc5a033d5 feat(s2): add Gitea SSH NodePort service + close WP-0004 (backup tool, scope updates)
Some checks failed
railiance-tests / smoke (push) Has been cancelled
- helm/gitea-ssh-nodeport.yaml: expose Gitea SSH on NodePort 30022 (targetPort 2222)
  for on-node git automation (RAIL-HO-WP-0004-T07)
- tools/cmd/railiance-backup-s2: fix SQLite hot backup (was broken etcd-snapshot)
- tools/cmd/railiance-restore-s2: update restore instructions for SQLite mode
- workplans/RAIL-BS-WP-0004-safety-net.md: mark done
- SCOPE.md: update current state, document boundary violations, fix connectivity docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 01:07:02 +01:00
943d0f3e80 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-03-26:
  - update .custodian-brief.md for railiance-bootstrap
2026-03-27 01:07:01 +01:00
6eaf20d4d0 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-03-26:
  - update .custodian-brief.md for railiance-cluster
2026-03-27 01:07:01 +01:00
2420915d30 fix(backup): SQLite hot backup instead of etcd snapshot
Some checks failed
railiance-tests / smoke (push) Has been cancelled
k3s runs in SQLite mode (no --cluster-init). Replace etcd-snapshot
with sqlite3 .backup for a WAL-aware hot copy of state.db.
Update restore guide to match. Cron installed under root crontab.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 21:56:19 +00:00
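The hot-copy idea in the commit above can be illustrated with SQLite's online backup API. This is a minimal Python sketch, not the repository's script (which, per the message, calls `sqlite3 .backup`). The function name `hot_backup` and any paths passed to it are hypothetical.

```python
import sqlite3


def hot_backup(src_path: str, dest_path: str) -> None:
    """Copy a live SQLite database using the online backup API.

    Connection.backup() drives the same SQLite backup API that the
    sqlite3 shell's .backup command uses, so the copy reflects a
    consistent committed state even when the source is in WAL mode
    and another connection holds it open.
    """
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)  # copies all pages of the main database
    finally:
        dest.close()
        src.close()
```

A plain file copy of a WAL-mode database can miss changes that are still in the `-wal` file, which is presumably why the commit prefers a backup-API copy over copying `state.db` directly.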
ee6d7b149e new workplan
Some checks failed
railiance-tests / smoke (push) Has been cancelled
2026-03-20 23:43:17 +01:00
a747445790 Updated scope
2026-03-20 23:42:59 +01:00
6431bfab79 chore(sbom): add system-level tool dependency manifest
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Captures k3s, helm, kubectl, goss, sops, and age as direct tool
dependencies for railiance-cluster. Versions are unresolved (confidence:
low) — no version pins exist in the repo yet.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 18:31:12 +01:00
2acc06f466 docs: add SCOPE.md for rapid orientation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-17 23:11:38 +01:00
4e1a90032b fix(backup): elevate sudo in Makefile and guard mkdir after root check
Some checks failed
railiance-tests / smoke (push) Has been cancelled
- `make backup` now invokes `sudo tools/cmd/railiance-backup-s2` directly
- Move `mkdir -p` in railiance-backup-s2 to after the root check so the
  script emits a clear error instead of a raw permission-denied failure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 22:33:49 +00:00
7e28399f69 feat(backup): implement S2 integrated backup — WP-0004 T01-T04
Some checks failed
railiance-tests / smoke (push) Has been cancelled
tools/cmd/railiance-backup-s2:
  - k3s etcd snapshot (age-encrypted)
  - Helm release values for all namespaces (age-encrypted)
  - kubeconfig /etc/rancher/k3s/k3s.yaml (age-encrypted)
  - output: /opt/backup/railiance/cluster/, keep last 7, .last-backup stamp
  - requires root, no network dependency

tools/cmd/railiance-restore-s2:
  - lists available backups with sizes
  - prints step-by-step restore instructions for each artifact type

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 21:17:54 +01:00
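The "keep last 7" retention and `.last-backup` stamp mentioned in the commit above could look roughly like the following Python sketch. It is illustrative only: the names `prune_backups` and `write_stamp`, the `*.age` glob, and the assumption that each backup is a single file whose name sorts chronologically are not taken from the repository.

```python
from datetime import datetime
from pathlib import Path


def prune_backups(backup_dir: Path, keep: int = 7, pattern: str = "*.age") -> list[Path]:
    """Delete all but the `keep` newest backups and return the removed paths.

    Assumes file names embed a sortable timestamp, so name order equals
    chronological order (a hypothetical naming scheme).
    """
    backups = sorted(backup_dir.glob(pattern), key=lambda p: p.name)
    stale = backups[:-keep] if keep > 0 else backups
    for path in stale:
        path.unlink()
    return stale


def write_stamp(backup_dir: Path, when: datetime) -> Path:
    """Record the time of the most recent successful backup."""
    stamp = backup_dir / ".last-backup"
    stamp.write_text(when.isoformat() + "\n")
    return stamp
```

Writing the stamp only after a successful run gives a monitoring job a single file to check for backup freshness.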
66f8ca4009 docs(wp-0004): add implementation notes for sudo, etcd, helm, cron
Some checks failed
railiance-tests / smoke (push) Has been cancelled
T02: note to verify etcd is in use before implementing; flags root requirement
T03: add KUBECONFIG to helm commands; note root access approach
T06: document solution to sudo problem — run cron under root's crontab,
     not a sudoers whitelist. Add restore drill commands. Fix cron to use
     absolute path (~ unreliable in root crontab).
T01: note to remove old railiance-backup script (wrong scope)
Makefile: fix stale backup description, add restore target, fix .PHONY

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 16:52:40 +00:00
5b0cfbf10a feat(backup): revise WP-0004 — integrated backup per capability (D4)
Some checks failed
railiance-tests / smoke (push) Has been cancelled
WP-0004 rewritten: scope narrowed to S2-owned assets (etcd snapshots,
Helm values, kubeconfig). No external dependencies. age encryption
reuses SOPS key pair. Output to /opt/backup/railiance/cluster/.

DECISIONS.md D4: integrated backup per capability, not centralized.
EP-RAIL-005 registered in state hub: custodian orchestration deferred
until all layers implement the standard interface.

The old monolithic backup (custodian DB + operator config) was not S2's
concern and has been removed from this workplan scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 17:43:30 +01:00
719e4f40d1 fix(wp-0004): correct T05 scope — server backup is Gitea+Zulip via railiance-infra
Some checks failed
railiance-tests / smoke (push) Has been cancelled
The railiance-backup script targets a developer workstation (custodian DB
in Docker + Claude config). It is not applicable to the server.

Server backup (Gitea repos + Zulip data) belongs in railiance-infra as an
Ansible role. T05 now documents this correctly and blocks wiring up a cron
job until the right script exists.

Also removed the incorrectly installed cron job that called the broken script.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 14:51:42 +00:00
a15ceee92b chore(workplan): add state_hub_task_ids to WP-0004
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Written by fix-consistency: T01-T06 registered in state hub.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 15:24:28 +01:00
75467673a8 feat(safety-net): create WP-0004, update preflight for OAS 5-repo layout
- workplans/RAIL-BS-WP-0004-safety-net.md: ADR-001 workplan file for
  current-env-safety-net workstream (7e8b0c20), T01-T04 done, T05-T06 todo
- tools/cmd/railiance-preflight: update REPOS to OAS S1-S5 stack
  (railiance-infra/cluster/platform/enablement/apps) + project repos;
  remove stale railiance-bootstrap reference
- docs/backup-restore.md: fix Step 5 clone commands to current repo names
- Makefile: add make backup and make preflight targets

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 15:21:29 +01:00
441a37c5ae chore(workplan): mark WP-0003 completed — pgpool fix deployed and verified
helm upgrade confirmed pgpool starts cleanly with adminPassword in values.
SOPS encryption applied. Smoke test passes. D3 failover test pending.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 14:45:35 +01:00
3297ac1f6c fix(test): correct ha-failover test — wrong URL, wrong pod label, missing kubectl
Three bugs:
- GITEA_URL defaulted to localhost:3000; Gitea NodePort is 32166
- Pod label app.kubernetes.io/name=postgresql-ha matched pgpool pod too;
  added component=postgresql to target only postgres nodes
- Used bare 'kubectl' which is not on PATH; switched to 'k3s kubectl'

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 13:42:54 +00:00
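The pod-label bug in the commit above follows from how equality-based label selectors work: every key in the selector must match, so a selector with only the shared `app.kubernetes.io/name` label selects both pod types. The Python sketch below models that rule. The pod label sets are hypothetical (in particular the `component=pgpool` value is an assumption), and `matches` is an illustrative helper, not part of any Kubernetes client.

```python
def matches(selector: dict[str, str], labels: dict[str, str]) -> bool:
    """Return True if every selector key/value pair is present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Hypothetical label sets for the two pod types named in the commit message.
postgres_pod = {"app.kubernetes.io/name": "postgresql-ha", "component": "postgresql"}
pgpool_pod = {"app.kubernetes.io/name": "postgresql-ha", "component": "pgpool"}

broad = {"app.kubernetes.io/name": "postgresql-ha"}
narrow = {"app.kubernetes.io/name": "postgresql-ha", "component": "postgresql"}

# The broad selector picks up the pgpool pod as well; the narrow one does not.
broad_hits = [matches(broad, pod) for pod in (postgres_pod, pgpool_pod)]
narrow_hits = [matches(narrow, pod) for pod in (postgres_pod, pgpool_pod)]
```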
7daef079c2 feat(secrets): encrypt gitea Helm values with SOPS (age)
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Add .sops.yaml policy targeting *.sops.yaml files using the shared age
key from railiance-infra. Migrate helm/gitea-values.yaml to encrypted
helm/gitea-values.sops.yaml.

Pins all postgresql-ha passwords (postgresql, postgres, repmgr, pgpool,
pgpool-admin, sr-check) so helm upgrade never regenerates secrets and
breaks the running cluster. Fixes WP-0003 T01.

Usage: helm upgrade gitea gitea/gitea -n default -f <(sops -d helm/gitea-values.sops.yaml)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 13:37:22 +00:00
660a63c674 feat(pgpool): implement WP-0003 T01-T04 — permanent fix for pgpool-password bug
Some checks failed
railiance-tests / smoke (push) Has been cancelled
T01: helm/gitea-values.yaml with postgresql-ha.pgpool.adminPassword
     (fill REPLACE_WITH_PGPOOL_ADMIN_PASSWORD before helm upgrade)
T02: tests/smoke_kube.sh — add pgpool and postgresql-ha pod health checks
T03: tests/test_ha_failover.sh — D3 HA failover test script
T04: docs/incidents/2026-03-10-pgpool-missing-secret.md + README link

Also: make test-ha-failover target, Makefile .PHONY updated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 14:16:22 +01:00
42391c3b61 chore(workplan): add state_hub_workstream_id to WP-0003
Registered by fix-consistency: workstream 7ee9ee22, tasks T01-T04.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 14:06:55 +01:00
359d5b8b5b bug(gitea): report pgpool CrashLoopBackOff on HA failover + D3 testing policy
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Add RAIL-BS-WP-0003 documenting the 2026-03-10 incident where a PostgreSQL
HA failover caused pgpool to enter CrashLoopBackOff due to a missing
pgpool-password key in the gitea-postgresql-ha-postgresql secret — a bug
present since initial deployment but hidden by the lack of any pod restart.

Add Decision D3: HA and failover scenarios must be tested before a workplan
is considered done. Any HA component deployment requires a passing failover
test script in tests/ and complete Helm values before status = completed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 13:03:36 +00:00
ada406f327 fix(bootstrap): commit full bootstrap.yml — Helm + kubeconfig tasks
The previous commit only included the staged portion (k3s tasks).
The working-tree additions (Helm install, kubeconfig fetch, version vars)
were never staged and were left behind.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 09:52:36 +00:00
871c31a95d chore(workplan): mark WP-0002 completed — all tasks done 2026-03-10
State Hub update pending: tunnel was offline during this session.
Run from local machine: cd ~/the-custodian/state-hub && make tunnel HOST=tegwick@92.205.130.254

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 09:44:39 +00:00
901535ca44 feat(k3s-baseline): complete WP-0002 T01-T05
- bootstrap.yml: install k3s (server+cluster-init, pinned v1.35.1+k3s1)
  and Helm (v3.17.3 with checksum verify); fetch kubeconfig to control node
- tests/smoke_kube.sh: assert node Ready, helm, CoreDNS, Traefik
- docs/kubeconfig.md: usage, merge, context-switch, security note
- Makefile: k3s-install and smoke targets with make help

Closes T01, T02, T03, T04, T05 of RAIL-BS-WP-0002.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 09:43:16 +00:00
fb6618e9ab fix(claude): correct COULOMBCORE IP to 92.205.130.254
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 01:54:20 +01:00
fded740121 docs(claude): update tunnel instructions to use state-hub Makefile
Some checks failed
railiance-tests / smoke (push) Has been cancelled
Tunnel is now started from ~/the-custodian/state-hub:
  make tunnel HOST=tegwick@92.205.62.239

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 01:19:45 +01:00