railiance-cluster

Author	SHA1	Message	Date
tegwick	6431bfab79	chore(sbom): add system-level tool dependency manifest Captures k3s, helm, kubectl, goss, sops, and age as direct tool dependencies for railiance-cluster. Versions are unresolved (confidence: low) — no version pins exist in the repo yet. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-18 18:31:12 +01:00
tegwick	2acc06f466	docs: add SCOPE.md for rapid orientation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-17 23:11:38 +01:00
Bernd Worsch	4e1a90032b	fix(backup): elevate sudo in Makefile and guard mkdir after root check - `make backup` now invokes `sudo tools/cmd/railiance-backup-s2` directly - Move `mkdir -p` in railiance-backup-s2 to after the root check so the script emits a clear error instead of a raw permission-denied failure Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-10 22:33:49 +00:00
tegwick	7e28399f69	feat(backup): implement S2 integrated backup — WP-0004 T01-T04 tools/cmd/railiance-backup-s2: - k3s etcd snapshot (age-encrypted) - Helm release values for all namespaces (age-encrypted) - kubeconfig /etc/rancher/k3s/k3s.yaml (age-encrypted) - output: /opt/backup/railiance/cluster/, keep last 7, .last-backup stamp - requires root, no network dependency tools/cmd/railiance-restore-s2: - lists available backups with sizes - prints step-by-step restore instructions for each artifact type Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-10 21:17:54 +01:00
Bernd Worsch	66f8ca4009	docs(wp-0004): add implementation notes for sudo, etcd, helm, cron T02: note to verify etcd is in use before implementing; flags root requirement T03: add KUBECONFIG to helm commands; note root access approach T06: document solution to sudo problem — run cron under root's crontab, not a sudoers whitelist. Add restore drill commands. Fix cron to use absolute path (~ unreliable in root crontab). T01: note to remove old railiance-backup script (wrong scope) Makefile: fix stale backup description, add restore target, fix .PHONY Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-10 16:52:40 +00:00
tegwick	5b0cfbf10a	feat(backup): revise WP-0004 — integrated backup per capability (D4) WP-0004 rewritten: scope narrowed to S2-owned assets (etcd snapshots, Helm values, kubeconfig). No external dependencies. age encryption reuses SOPS key pair. Output to /opt/backup/railiance/cluster/. DECISIONS.md D4: integrated backup per capability, not centralized. EP-RAIL-005 registered in state hub: custodian orchestration deferred until all layers implement the standard interface. The old monolithic backup (custodian DB + operator config) was not S2's concern and has been removed from this workplan scope. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-10 17:43:30 +01:00
Bernd Worsch	719e4f40d1	fix(wp-0004): correct T05 scope — server backup is Gitea+Zulip via railiance-infra The railiance-backup script targets a developer workstation (custodian DB in Docker + Claude config). It is not applicable to the server. Server backup (Gitea repos + Zulip data) belongs in railiance-infra as an Ansible role. T05 now documents this correctly and blocks wiring up a cron job until the right script exists. Also removed the incorrectly installed cron job that called the broken script. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-10 14:51:42 +00:00
tegwick	a15ceee92b	chore(workplan): add state_hub_task_ids to WP-0004 Written by fix-consistency: T01-T06 registered in state hub. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-10 15:24:28 +01:00
tegwick	75467673a8	feat(safety-net): create WP-0004, update preflight for OAS 5-repo layout - workplans/RAIL-BS-WP-0004-safety-net.md: ADR-001 workplan file for current-env-safety-net workstream (7e8b0c20), T01-T04 done, T05-T06 todo - tools/cmd/railiance-preflight: update REPOS to OAS S1-S5 stack (railiance-infra/cluster/platform/enablement/apps) + project repos; remove stale railiance-bootstrap reference - docs/backup-restore.md: fix Step 5 clone commands to current repo names - Makefile: add make backup and make preflight targets Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-10 15:21:29 +01:00
tegwick	441a37c5ae	chore(workplan): mark WP-0003 completed — pgpool fix deployed and verified helm upgrade confirmed pgpool starts cleanly with adminPassword in values. SOPS encryption applied. Smoke test passes. D3 failover test pending. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-10 14:45:35 +01:00
Bernd Worsch	3297ac1f6c	fix(test): correct ha-failover test — wrong URL, wrong pod label, missing kubectl Three bugs: - GITEA_URL defaulted to localhost:3000; Gitea NodePort is 32166 - Pod label app.kubernetes.io/name=postgresql-ha matched pgpool pod too; added component=postgresql to target only postgres nodes - Used bare 'kubectl' which is not on PATH; switched to 'k3s kubectl' Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-10 13:42:54 +00:00
Bernd Worsch	7daef079c2	feat(secrets): encrypt gitea Helm values with SOPS (age) Add .sops.yaml policy targeting *.sops.yaml files using the shared age key from railiance-infra. Migrate helm/gitea-values.yaml to encrypted helm/gitea-values.sops.yaml. Pins all postgresql-ha passwords (postgresql, postgres, repmgr, pgpool, pgpool-admin, sr-check) so helm upgrade never regenerates secrets and breaks the running cluster. Fixes WP-0003 T01. Usage: helm upgrade gitea gitea/gitea -n default -f <(sops -d helm/gitea-values.sops.yaml) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-10 13:37:22 +00:00
tegwick	660a63c674	feat(pgpool): implement WP-0003 T01-T04 — permanent fix for pgpool-password bug T01: helm/gitea-values.yaml with postgresql-ha.pgpool.adminPassword (fill REPLACE_WITH_PGPOOL_ADMIN_PASSWORD before helm upgrade) T02: tests/smoke_kube.sh — add pgpool and postgresql-ha pod health checks T03: tests/test_ha_failover.sh — D3 HA failover test script T04: docs/incidents/2026-03-10-pgpool-missing-secret.md + README link Also: make test-ha-failover target, Makefile .PHONY updated. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-10 14:16:22 +01:00
tegwick	42391c3b61	chore(workplan): add state_hub_workstream_id to WP-0003 Registered by fix-consistency: workstream 7ee9ee22, tasks T01-T04. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-10 14:06:55 +01:00
Bernd Worsch	359d5b8b5b	bug(gitea): report pgpool CrashLoopBackOff on HA failover + D3 testing policy Add RAIL-BS-WP-0003 documenting the 2026-03-10 incident where a PostgreSQL HA failover caused pgpool to enter CrashLoopBackOff due to a missing pgpool-password key in the gitea-postgresql-ha-postgresql secret — a bug present since initial deployment but hidden by the lack of any pod restart. Add Decision D3: HA and failover scenarios must be tested before a workplan is considered done. Any HA component deployment requires a passing failover test script in tests/ and complete Helm values before status = completed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-10 13:03:36 +00:00
Bernd Worsch	ada406f327	fix(bootstrap): commit full bootstrap.yml — Helm + kubeconfig tasks The previous commit only included the staged portion (k3s tasks). The working-tree additions (Helm install, kubeconfig fetch, version vars) were never staged and were left behind. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-10 09:52:36 +00:00
Bernd Worsch	871c31a95d	chore(workplan): mark WP-0002 completed — all tasks done 2026-03-10 State Hub update pending: tunnel was offline during this session. Run from local machine: cd ~/the-custodian/state-hub && make tunnel HOST=tegwick@92.205.130.254 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-10 09:44:39 +00:00
Bernd Worsch	901535ca44	feat(k3s-baseline): complete WP-0002 T01-T05 - bootstrap.yml: install k3s (server+cluster-init, pinned v1.35.1+k3s1) and Helm (v3.17.3 with checksum verify); fetch kubeconfig to control node - tests/smoke_kube.sh: assert node Ready, helm, CoreDNS, Traefik - docs/kubeconfig.md: usage, merge, context-switch, security note - Makefile: k3s-install and smoke targets with make help Closes T01, T02, T03, T04, T05 of RAIL-BS-WP-0002. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-10 09:43:16 +00:00
tegwick	fb6618e9ab	fix(claude): correct COULOMBCORE IP to 92.205.130.254 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-10 01:54:20 +01:00
tegwick	fded740121	docs(claude): update tunnel instructions to use state-hub Makefile Tunnel is now started from ~/the-custodian/state-hub: make tunnel HOST=tegwick@92.205.62.239 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-10 01:19:45 +01:00
tegwick	e1c33712c1	docs(claude): add State Hub tunnel setup instructions Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-10 01:08:57 +01:00
tegwick	ac13e94324	chore(workplan): pin k3s version to v1.35.1+k3s1 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-10 00:54:29 +01:00
tegwick	5891f1a881	chore(workplan): update WP-0002 references after repo rename - railiance-hosts → railiance-infra throughout - ADR-002 → ADR-003 boundary reference - Remove 'railiance-apps decision pending' note (resolved) Tasks T01–T05 unchanged — all still todo. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-10 00:52:07 +01:00
tegwick	4561aa5aec	chore(relocate): stub out S1 items moved to railiance-infra cloudinit/user-data.yaml and tools/cmd/railiance-plan-host relocated to railiance-infra per ADR-003. Tombstone stubs left in place. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-10 00:35:04 +01:00
tegwick	01903a17bb	chore(rename): railiance-bootstrap → railiance-cluster Update all operational references to reflect the new repo name per ADR-003 (OAS S2 Cluster Runtime). Historical text in docs preserved. Gitea remote URL updated locally (Gitea repo rename is a manual step). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-10 00:34:21 +01:00
tegwick	783c8cebbd	feat(boundary): remove OS-hardening overlap; add k3s baseline workplan Per ADR-002 (railiance-hosts/docs/adr/ADR-002-repo-boundary-hosts-vs-bootstrap.md): - ansible/harden.yml: replaced with tombstone pointing to railiance-hosts - ansible/bootstrap.yml: remove `import_playbook: harden.yml`; add pre-condition comment; OS hardening is no longer this repo's concern - docs/first_host.md: rewritten to reflect 3-step flow: converge railiance-hosts → railiance-bootstrap k3s install → smoke test - workplans/RAIL-BS-WP-0002-k3s-baseline.md: new workplan for k3s + Helm + Kubernetes platform baseline; linked to repo goal 70ab2379 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-09 19:53:22 +01:00
tegwick	1d759508ac	Workplan belonged to railiance-hosts	2026-03-08 23:30:12 +01:00
tegwick	19661ca0c6	feat(bootstrap): add HostEurope hardening playbook and workplan - workplans/RAIL-BS-WP-0002-hosteurope-bootstrap.md: new workplan for Secure Single-Server Bootstrap at HostEurope (repo goal d7092599). T01-T03 done; T04+T05 require ansible on a box with network access to 92.205.62.239 (hosts.ini is gitignored — recreate on new box). - ansible/harden.yml: new playbook — disables root/password SSH auth, enables UFW (allow 22/tcp 6443/tcp 8472/udp, deny-all default), installs fail2ban with SSH jail, sets HISTCONTROL=ignorespace. - ansible/bootstrap.yml: import_playbook harden.yml runs before k3s. - ansible/hosts.ini.example: add [hosteurope] group template. - QUICKSTART.md: document two-stage bootstrap (harden → k3s). - CLAUDE.md: add goal_guidance handling to session protocol (needs_workplan + alignment_warnings). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-08 22:50:51 +01:00
tegwick	d83bc1049f	added dependency workplan	2026-03-04 19:40:32 +01:00
tegwick	c60f678756	docs(workplan): mark RAIL-BS-WP-0001 completed All four tasks done; SBOM ingested 13 packages. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-01 20:23:05 +01:00
tegwick	f7b8cdb4c1	feat(deps): add pyproject.toml, uv.lock, and ansible/requirements.yml Declares ansible>=10 as the only pip dependency for the control node. Generates uv.lock pinning ansible 12.3.0 / ansible-core 2.19.7 and the full transitive tree (13 packages). Adds explicit empty ansible/requirements.yml confirming no Galaxy collections are used. Closes RAIL-BS-WP-0001 T01–T04. Enables SBOM ingestion. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-01 20:22:26 +01:00
tegwick	9fe5348af3	fix(CLAUDE.md): use reliable workplan discovery in step 2 Glob with pattern 'workplans/.md' from repo root fails silently (tool limitation with subdirectory prefixes in patterns). Changed to Glob(pattern="/.md", path="workplans/") which does find files, with Bash ls as fallback. This fixes step 2 of the session protocol silently producing no workplan results. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-01 20:13:29 +01:00
tegwick	1aa5e436ae	fix(CLAUDE.md): rewrite session protocol to surface custodian tasks on open Previous CLAUDE.md only had a First Session Protocol. When workstreams already existed, the session would call get_state_summary() and produce no useful output. New 3-step protocol: - Step 1: get_state_summary() + get_next_steps() via state-hub MCP tools - Step 2: scan workplans/*.md for active tasks - Step 3: output orientation brief: active workstreams, pending repo tasks (from workplans/ + [repo:railiance-bootstrap] state hub tasks), suggested next action, SBOM status (currently null — gap noted) Also adds Known Pending Tasks table for RAIL-BS-WP-0001 (dep management) and strengthens ADR-001 workplan convention and contribution tracking sections. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-01 20:05:22 +01:00
tegwick	44428655d2	feat(sbom): add workplan RAIL-BS-WP-0001 — fix Ansible dep management State Hub SBOM assessment identified a gap: no lockfile exists for the Ansible control-node pip dependencies, making the repo unrepresentable in the SBOM inventory. 4-task workplan to reach SBOM Level 3 (Ingested): - T01: audit control-node pip deps - T02: create pyproject.toml + uv.lock for ansible (+ transitive tree) - T03: ingest into State Hub - T04: create ansible/requirements.yml (even if empty, to be explicit) State Hub task: 5f8cade5-119c-42e8-ba93-e9d0478650e4 Workstream: phase-0-operational-baseline (59155efb) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-01 19:29:20 +01:00
tegwick	76ae1351ce	fix: correct pg_stat_user_tables column name in restore drill (relname not tablename) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-26 07:25:59 +01:00
tegwick	ab5b12334d	docs: backup and restore runbook Covers encryption (age key management), what is protected, backup command, daily cron, preflight checks, full step-by-step restore procedure, restore drill instructions, and two extension points (EP-RAIL-003 git mirrors, EP-RAIL-004 offsite secondary copy). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-26 00:08:14 +01:00
tegwick	4381a079a2	feat: backup + preflight commands, decisions log, gitignore update - tools/cmd/railiance-backup: pg_dump + config snapshot, age-encrypted, uploaded to Nextcloud file drop via curl PUT. Daily cron target. - tools/cmd/railiance-preflight: pre-migration safety gate — checks backup freshness, all repos clean/pushed, age key present. - bin/railiance: added backup and preflight subcommands. - DECISIONS.md: decision log (D1 ingress Nginx+Traefik, D2 Nextcloud backup). - .gitignore: exclude backup-dropoff-link files (contain upload tokens). - CLAUDE.md: state hub session protocol update. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-25 23:59:28 +01:00
tegwick	eb8a6902b6	docs: add ThreePhoenix architecture concept and workplan RailianceThreePhoenix: 3-node HA Kubernetes cluster with embedded etcd, Longhorn distributed storage, PostgreSQL HA (repmgr + Pgpool-II), and Phoenix CronJob for weekly node rotation to prevent configuration drift. ThreePhoenixWorkplan: 7-phase implementation plan from blank Ubuntu nodes to self-healing Gitea cluster with monitoring and alert silencing. Also adds CLAUDE.md with Custodian State Hub session protocol. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-25 01:13:05 +01:00
Bernd Worsch	b7696e657f	chore: another improvement	2025-09-13 01:23:02 +00:00
Bernd Worsch	6662e9a377	chore: improved quickstart	2025-09-13 01:18:14 +00:00
Bernd Worsch	53482f8e65	chore: improve quickstart instructions	2025-09-13 00:57:17 +00:00
Bernd Worsch	b1862d67f0	feat: added plan-host command	2025-09-13 02:46:48 +02:00
Bernd Worsch	7530468d80	refactor: separated command script	2025-09-13 02:39:47 +02:00
Bernd Worsch	0bfdf465c1	chore: improved quickstart for newbies	2025-09-13 02:00:25 +02:00
Bernd Worsch	96eccc6b67	feat: rails style bootkit bin/railiance with quickstart	2025-09-13 01:32:19 +02:00
Bernd Worsch	676ec32379	chore: remove outdated readme	2025-09-13 00:15:44 +02:00
Bernd Worsch	75d11583e4	feat(tools): directed panspermia inspired colonization scripts	2025-09-13 00:11:18 +02:00
Bernd Worsch	be750b005b	chore: add MIT License	2025-09-12 02:44:03 +02:00
Bernd Worsch	038adef94a	docs: update root README with quick start and structure	2025-09-12 02:36:45 +02:00
Bernd Worsch	15935b78aa	docs: add Railiance overview README	2025-09-12 02:31:24 +02:00

56 Commits