diff --git a/DECISIONS.md b/DECISIONS.md
index 49f4a70..31ab8b7 100644
--- a/DECISIONS.md
+++ b/DECISIONS.md
@@ -64,3 +64,53 @@ has been tested before it matters.
 See: `workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md`

 ---
+
+## D4 — Integrated backup per capability, not centralized backup service
+
+**Date:** 2026-03-10
+**Decided by:** Tegwick
+
+**Decision:** Each railiance repo implements its own backup for the
+infrastructure it owns. There is no central backup service.
+
+**Rationale:**
+
+A centralized backup service (e.g., in railiance-enablement) couples every
+stack layer to a shared component. As each layer matures and evolves at its
+own pace, this coupling repeatedly breaks the backup. A service that breaks
+when the thing it is supposed to protect is being changed is not a safety net.
+
+Integrated backup per repo means:
+- The backup for S1 lives in railiance-infra and knows exactly what S1 owns
+- The backup for S2 lives in railiance-cluster and knows what S2 owns
+- Each repo can be backed up independently, without any other repo, service,
+  or network connection being available
+- Each backup implementation matures with its layer
+
+**Standard interface (Q3 Operability & Resilience):**
+
+Every railiance repo that manages persistent state must provide:
+
+1. `make backup` — creates an encrypted backup of what this layer owns,
+   writes to a local directory on the server (`/opt/backup/railiance//`)
+2. `make restore` — restores from the most recent local backup
+3. Encryption: age, reusing the same key pair used for SOPS secrets
+4. No runtime dependencies: must work without custodian, state-hub, network
+   file share, or any other external service being available
+
+**Extension point EP-RAIL-005:** The custodian can provide orchestration
+guidelines. If each repo follows the standard interface, the custodian can
+call `make backup` across the full stack in dependency order (S1 → S5)
+and aggregate results.
+This is deliberately deferred — integrate first,
+orchestrate later.
+
+**What changes from the previous approach (D2):**
+
+D2 established Nextcloud as the backup destination for a single monolithic
+script in railiance-cluster. That script backed up the wrong things (custodian
+DB and operator config — neither of which are S2 concerns). The Nextcloud
+upload becomes an optional extension, not a requirement.
+
+See: `workplans/RAIL-BS-WP-0004-safety-net.md`
+
+---
diff --git a/workplans/RAIL-BS-WP-0004-safety-net.md b/workplans/RAIL-BS-WP-0004-safety-net.md
index 13cad50..c84e4aa 100644
--- a/workplans/RAIL-BS-WP-0004-safety-net.md
+++ b/workplans/RAIL-BS-WP-0004-safety-net.md
@@ -1,7 +1,7 @@
 ---
 id: RAIL-BS-WP-0004
 type: workplan
-title: "Current-Environment Safety Net"
+title: "Integrated Backup — S2 Kubernetes Runtime Layer"
 domain: railiance
 repo: railiance-cluster
 status: active
@@ -12,118 +12,162 @@ created: "2026-02-25"
 updated: "2026-03-10"
 ---

-# Current-Environment Safety Net
+# Integrated Backup — S2 Kubernetes Runtime Layer

 ## Goal

-Ensure backup and disaster recovery for the current single-server environment
-is operational and tested before any ThreePhoenix infrastructure migration
-work begins. Aligned to OAS Stack S2 (railiance-cluster owns backup tooling).
+Implement the Q3 (Operability & Resilience) integrated backup for
+railiance-cluster (S2). Backs up what S2 owns — the Kubernetes runtime state —
+encrypted with age, written to a local directory on the server. No external
+dependencies required.

-## Context
+## Architecture (Decision D4)

-The backup toolchain lives in `tools/cmd/railiance-backup` and
-`tools/cmd/railiance-preflight`, dispatched via `bin/railiance`. It protects:
+Each railiance repo implements its own backup for what it owns. No central
+backup service. See `DECISIONS.md` D4 for full rationale.

-| Asset | Method | Risk without backup |
-|---|---|---|
-| Custodian State Hub DB | pg_dump → age → Nextcloud | Total loss of workstreams, decisions, history |
-| Claude config + memory | tar → age → Nextcloud | Loss of MCP registration, project memory |
-| Git repos | Gitea remotes | SPOF: Gitea runs on the same server being migrated |
+**Standard interface every railiance repo must provide:**

-Decision D2: Nextcloud upload-only file drop as backup destination.
+```bash
+make backup    # encrypt + write to /opt/backup/railiance//
+make restore   # restore from most recent local backup
+```

-## OAS Alignment
+Encryption: age, same key pair as SOPS secrets (`.sops.yaml` public key).
+Output: `/opt/backup/railiance/cluster/` on the server.

-Per ADR-003, backup tooling lives in **S2 (railiance-cluster)**. The preflight
-check covers all five OAS stack repos:
+## What S2 (railiance-cluster) owns and must back up

-| Repo | OAS Layer |
+| Asset | Why it matters |
 |---|---|
-| railiance-infra | S1 — OS & Provisioning |
-| railiance-cluster | S2 — Kubernetes Runtime |
-| railiance-platform | S3 — Platform Services |
-| railiance-enablement | S4 — Developer Tooling |
-| railiance-apps | S5 — Workloads & Endpoints |
+| k3s etcd snapshots | Full cluster state — all workloads, configs, secrets |
+| Helm release values | Runtime values not in git (any manually applied overrides) |
+| kubeconfig | Admin access to the cluster |

-Plus cross-domain repos: the-custodian, markitect_project, activity-core,
-net-kingdom, issue-facade, binect-js, kaizen-agentic.
+**Not S2's responsibility:**
+- Custodian State Hub DB → the-custodian owns this
+- Operator workstation config (`.claude/`, `.gitconfig`) → operator's own concern
+- Application data (Gitea repos, uploads) → S5 (railiance-apps) owns this
+- PostgreSQL data volumes → S3 (railiance-platform) owns this

-## Boundary
+## Encryption

-Backup execution: this repo (`bin/railiance backup`).
-Backup destination: Nextcloud file drop (URL in `~/.config/railiance/nc-upload-url` or hardcoded).
-Restore procedure: `docs/backup-restore.md`.
+Reuse the age public key from `.sops.yaml`:
+
+```bash
+AGE_PUBLIC_KEY=$(grep 'age:' .sops.yaml | awk '{print $2}')
+tar -czf - | age -r "${AGE_PUBLIC_KEY}" -o backup.tar.gz.age
+```
+
+Decryption requires the private key at `~/.config/sops/age/keys.txt`
+(same key used for `sops -d`). No additional key management needed.
+
+## Extension Point EP-RAIL-005
+
+Once all five OAS layers implement this interface, the custodian can
+orchestrate a full-stack backup with:
+
+```bash
+for repo in railiance-infra railiance-cluster railiance-platform \
+    railiance-enablement railiance-apps; do
+  make -C ~/$repo backup
+done
+```
+
+No special protocol needed — just the standard interface.

 ---

 ## Tasks

-### T01 — Update preflight repo list to OAS 5-repo layout
+### T01 — Define backup directory and encryption wrapper

 ```task
 id: T01
-status: done
+status: todo
 priority: high
 state_hub_task_id: "4526a842-ea31-4874-9231-92ab556cfe7b"
 ```

-Update `tools/cmd/railiance-preflight` REPOS array: remove `railiance-bootstrap`,
-add `railiance-infra`, `railiance-cluster`, `railiance-platform`,
-`railiance-enablement`, `railiance-apps`. Add all active project repos.
+Create `tools/cmd/railiance-backup-s2` (replacing the old `railiance-backup`):

-**Done when:** `bin/railiance preflight` checks all current repos.
+- Backup dir: `/opt/backup/railiance/cluster/` (create with `mkdir -p`)
+- Encrypt each artifact with age using public key from `.sops.yaml`
+- Write timestamp-named files: `etcd-.snap.age`, `helm-values-.tar.gz.age`, `kubeconfig-.yaml.age`
+- Keep last 7 of each type
+- Write `.last-backup` stamp
+- Exit 0 on success, non-zero on any failure
+- No network required
+
+**Done when:** `make backup` runs on COULOMBCORE without error and files
+appear in `/opt/backup/railiance/cluster/`.

 ---

-### T02 — Fix stale repo references in backup-restore.md
+### T02 — Back up k3s etcd snapshots

 ```task
 id: T02
-status: done
-priority: medium
+status: todo
+priority: high
 state_hub_task_id: "a6313e06-1976-46a7-8e31-df4eb2eca880"
 ```

-Update restore procedure: `railiance-bootstrap` → `railiance-cluster`,
-`railiance-hosts` → `railiance-infra`, add the three new OAS repos.
+k3s has built-in etcd snapshot support:

-**Done when:** doc accurately reflects the current 5-repo OAS stack.
+```bash
+sudo k3s etcd-snapshot save --name railiance-$(date -u +%Y%m%dT%H%M%SZ)
+# Default location: /var/lib/rancher/k3s/server/db/snapshots/
+```
+
+Add to the backup script: take a fresh snapshot, encrypt with age,
+copy to `/opt/backup/railiance/cluster/`.
+
+**Done when:** backup includes a current etcd snapshot.

 ---

-### T03 — Add make backup and make preflight targets
+### T03 — Back up Helm release values

 ```task
 id: T03
-status: done
+status: todo
 priority: medium
 state_hub_task_id: "05d42a55-921f-4aa7-bb76-e8af9c7e0ac3"
 ```

-Add to root Makefile so the safety net is discoverable from `make help`.
+Capture current runtime Helm values for all releases:

-**Done when:** `make backup` and `make preflight` both work.
+```bash
+helm list -A -o json | jq -r '.[] | .name + " " + .namespace' | \
+  while read -r name ns; do
+    helm get values "$name" -n "$ns" -o yaml
+  done
+```
+
+Tar and age-encrypt into `helm-values-.tar.gz.age`.
+
+**Done when:** backup includes a snapshot of all Helm release values.

 ---

-### T04 — Run current backup and verify upload
+### T04 — Back up kubeconfig

 ```task
 id: T04
-status: done
-priority: high
+status: todo
+priority: medium
 state_hub_task_id: "08233868-d522-4117-bc4e-6c0f52545665"
 ```

-Run `bin/railiance backup` and confirm both DB and config files appear
-in the Nextcloud file drop.
+Age-encrypt `~/.kube/config-hosteurope` (or `/etc/rancher/k3s/k3s.yaml`)
+into `kubeconfig-.yaml.age` in the backup directory.

-**Done when:** backup completes without error and `.last-backup` stamp is fresh.
+**Done when:** backup includes the encrypted kubeconfig.

 ---

-### T05 — Server backup: Gitea data and Zulip chat
+### T05 — make restore target

 ```task
 id: T05
@@ -132,33 +176,20 @@ priority: medium
 state_hub_task_id: "2d5acff7-4a4e-4ddd-ad06-08237ad3dac8"
 ```

-**Scope correction (2026-03-10):** The original task assumed the `railiance-backup`
-script in `tools/cmd/railiance-backup` applied here. It does not — that script
-is for a developer workstation (custodian DB in Docker + Claude config) and is
-unrelated to the server.
+Add `tools/cmd/railiance-restore-s2` that decrypts and lists available
+backups, with guided restore for the etcd snapshot case.

-The server's safety net must protect:
+Restore of etcd from snapshot:
+```bash
+sudo k3s server --cluster-reset \
+  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/
+```

-| Asset | Method |
-|---|---|
-| Gitea repositories + DB | `k3s kubectl exec` into gitea pod → `gitea dump` |
-| Zulip chat data | Zulip's built-in export or volume snapshot |
-
-This work belongs in **railiance-infra** (S1 — OS & Provisioning layer) as an
-Ansible role or playbook, not here. A cron job on the server should call that
-script once it exists.
-
-**Do not** wire up a cron job that calls the existing `bin/railiance backup` —
-that script targets Docker containers that do not exist on this server.
-
-**Done when:**
-1. A backup playbook/role exists in `railiance-infra` covering Gitea + Zulip
-2. It is deployed via Ansible and a cron job on the server calls it daily
-3. At least one successful backup run is verified in the log
+**Done when:** `make restore` prints available backups and a restore guide.

 ---

-### T06 — Run restore drill
+### T06 — Install cron job and run restore drill

 ```task
 id: T06
@@ -167,16 +198,25 @@ priority: medium
 state_hub_task_id: "f8e4a094-c367-40eb-b895-da17bc144b07"
 ```

-Run the minimal restore drill from `docs/backup-restore.md` against the
-current backup. Record completion in `~/.cache/railiance/restore-drill.log`.
+Install the daily cron and verify decrypt works:

-**Done when:** drill exits 0 and log entry is written.
+```bash
+# Install cron on COULOMBCORE
+(crontab -l 2>/dev/null; echo "0 2 * * * make -C ~/railiance-cluster backup >> /opt/backup/railiance/cluster/backup.log 2>&1") | crontab -
+
+# Drill: decrypt etcd snapshot and verify it's readable
+age -d -i ~/.config/sops/age/keys.txt \
+  /opt/backup/railiance/cluster/etcd-.snap.age | file -
+```
+
+**Done when:** cron installed, drill completes without error, log entry written.

 ---

 ## References

-- Decision D2: Nextcloud as backup destination (`DECISIONS.md`)
-- Backup tooling: `tools/cmd/railiance-backup`, `tools/cmd/railiance-preflight`
-- Restore procedure: `docs/backup-restore.md`
-- Extension points: EP-RAIL-003 (git bare mirrors), EP-RAIL-004 (secondary offsite copy)
+- Decision D4: Integrated backup per capability (`DECISIONS.md`)
+- Decision D2: Nextcloud as optional offsite extension (still valid, not a requirement)
+- OAS Q3: Operability & Resilience
+- Extension point EP-RAIL-005: Custodian full-stack backup orchestration
+- k3s etcd snapshots: https://docs.k3s.io/datastore/backup-restore
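
The T01 bullets in the workplan diff above ask for timestamp-named files, a "keep last 7 of each type" rule, and a `.last-backup` stamp, but the diff does not show how the wrapper would do this. The following is a minimal sketch of one possible approach, not the repo's implementation. The helper names `backup_timestamp`, `prune_backups`, and `write_stamp` are hypothetical, and the glob `etcd-*.snap.age` assumes the elided part of each file name is a UTC timestamp in the format T02 uses for snapshot names, so that names sort chronologically.

```shell
#!/usr/bin/env bash
# Sketch only: every function name here is hypothetical, not taken from the repo.

# UTC timestamp in the format the workplan uses for etcd snapshot names.
# Assumption: backup file names embed this value, so they sort by age.
backup_timestamp() {
  date -u +%Y%m%dT%H%M%SZ
}

# Keep the newest $3 (default 7) files matching glob $2 in directory $1
# and delete the older ones. Other artifact types are left untouched.
prune_backups() {
  local dir=$1 pattern=$2 keep=${3:-7}
  # Sort newest first, skip the first $keep entries, remove the remainder.
  find "$dir" -maxdepth 1 -type f -name "$pattern" \
    | sort -r \
    | tail -n "+$((keep + 1))" \
    | while IFS= read -r old; do
        rm -f -- "$old"
      done
}

# Record the time of the last completed run in the .last-backup stamp file.
write_stamp() {
  local dir=$1
  backup_timestamp > "$dir/.last-backup"
}
```

Under these assumptions, a wrapper would call `prune_backups /opt/backup/railiance/cluster 'etcd-*.snap.age' 7` once per artifact type after a successful encryption step, then `write_stamp /opt/backup/railiance/cluster`. Pruning only after success avoids deleting the last good backups when a run fails partway.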