diff --git a/Makefile b/Makefile
index 8fbfa3b..0dad428 100644
--- a/Makefile
+++ b/Makefile
@@ -4,8 +4,11 @@ INVENTORY ?= ansible/hosts.ini
 
 ##@ Safety Net
 
-backup: ## Backup postgres + config to Nextcloud (age-encrypted)
-	bin/railiance backup
+backup: ## Backup k3s etcd + Helm values + kubeconfig (age-encrypted, root required)
+	tools/cmd/railiance-backup-s2
+
+restore: ## List available backups and print restore guide
+	tools/cmd/railiance-restore-s2
 
 preflight: ## Pre-migration safety gate — must pass before cluster work
 	bin/railiance preflight
@@ -28,4 +31,4 @@ help: ## Show this help
 	/^[a-zA-Z_-]+:.*?##/ { printf " \033[36m%-20s\033[0m %s\n", $$1, $$2 } \
 	/^##@/ { printf "\n\033[1m%s\033[0m\n", substr($$0, 5) }' $(MAKEFILE_LIST)
 
-.PHONY: k3s-install smoke test-ha-failover help
+.PHONY: backup restore preflight k3s-install smoke test-ha-failover help
diff --git a/workplans/RAIL-BS-WP-0004-safety-net.md b/workplans/RAIL-BS-WP-0004-safety-net.md
index c84e4aa..228842e 100644
--- a/workplans/RAIL-BS-WP-0004-safety-net.md
+++ b/workplans/RAIL-BS-WP-0004-safety-net.md
@@ -99,6 +99,9 @@ Create `tools/cmd/railiance-backup-s2` (replacing the old `railiance-backup`):
 - Exit 0 on success, non-zero on any failure
 - No network required
 
+Also remove the old `tools/cmd/railiance-backup` (backed up Docker-based
+custodian DB — wrong scope, not applicable to this server).
+
 **Done when:** `make backup` runs on COULOMBCORE without error and files
 appear in `/opt/backup/railiance/cluster/`.
 
@@ -123,6 +126,14 @@ sudo k3s etcd-snapshot save --name railiance-$(date -u +%Y%m%dT%H%M%SZ)
 Add to the backup script: take a fresh snapshot, encrypt with age, copy to
 `/opt/backup/railiance/cluster/`.
 
+> **Note — verify etcd is in use before implementing:**
+> `k3s etcd-snapshot` only works if k3s was started with `--cluster-init`.
+> Without it, k3s uses SQLite and this command will fail.
+> Verify first: `sudo k3s etcd-snapshot ls 2>&1`
+
+> **Note — sudo required:** etcd snapshot requires root. See T06 for how
+> this is resolved (backup runs under root's crontab).
+
 **Done when:** backup includes a current etcd snapshot.
 
 ---
@@ -139,14 +150,20 @@ state_hub_task_id: "05d42a55-921f-4aa7-bb76-e8af9c7e0ac3"
 Capture current runtime Helm values for all releases:
 
 ```bash
-helm list -A -o json | jq -r '.[].name + " " + .namespace' | \
+KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm list -A -o json | \
+  jq -r '.[].name + " " + .namespace' | \
   while read name ns; do
-    helm get values "$name" -n "$ns" -o yaml
+    KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm get values "$name" -n "$ns" -o yaml
   done
 ```
 
 Tar and age-encrypt into `helm-values-.tar.gz.age`.
 
+> **Note — kubeconfig permissions:** `/etc/rancher/k3s/k3s.yaml` is root-readable
+> only by default. The backup script must either run as root (see T06) or k3s
+> must be configured with `--write-kubeconfig-mode=644`. Running as root
+> (via root crontab) is the chosen approach — no config change needed.
+
 **Done when:** backup includes a snapshot of all Helm release values.
 
 ---
@@ -198,18 +215,52 @@ priority: medium
 state_hub_task_id: "f8e4a094-c367-40eb-b895-da17bc144b07"
 ```
 
-Install the daily cron and verify decrypt works:
+#### Solving the sudo problem
+
+The backup script needs root for two reasons:
+- `k3s etcd-snapshot save` requires root
+- `/etc/rancher/k3s/k3s.yaml` (kubeconfig) is root-readable only
+
+**Solution: run the cron under root's crontab.**
+
+This is the correct pattern for system-level backup jobs. It avoids a
+proliferating sudoers whitelist (one entry per command, brittle to maintain)
+and matches how tools like `rsnapshot`, `bacula`, and `borgbackup` work in
+production. The backup writes to `/opt/backup/` which is root-owned anyway.
+
+Install the cron as root:
 
 ```bash
-# Install cron on COULOMBCORE
-(crontab -l 2>/dev/null; echo "0 2 * * * make -C ~/railiance-cluster backup >> /opt/backup/railiance/cluster/backup.log 2>&1") | crontab -
-
-# Drill: decrypt etcd snapshot and verify it's readable
-age -d -i ~/.config/sops/age/keys.txt \
-  /opt/backup/railiance/cluster/etcd-.snap.age | file -
+sudo crontab -e
+# Add:
+0 2 * * * make -C /home/tegwick/railiance-cluster backup >> /opt/backup/railiance/cluster/backup.log 2>&1
 ```
 
-**Done when:** cron installed, drill completes without error, log entry written.
+Note: use the absolute path to the repo. Under root's crontab, `~` resolves to
+root's home directory, not to `/home/tegwick`.
+
+Verify it is installed:
+```bash
+sudo crontab -l | grep railiance
+```
+
+#### Restore drill
+
+Once T01–T04 are done, run a decrypt-and-verify drill:
+
+```bash
+# Decrypt the newest etcd snapshot and inspect the type of the decrypted data
+sudo age -d -i ~/.config/sops/age/keys.txt \
+  "$(ls /opt/backup/railiance/cluster/etcd-*.snap.age | sort -r | head -1)" \
+  | file -
+
+# Record the drill (the directory is root-owned, so append via sudo tee)
+echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) restore drill OK" \
+  | sudo tee -a /opt/backup/railiance/cluster/restore-drill.log >/dev/null
+```
+
+**Done when:** cron installed under root, drill completes without error,
+log entry written.
 
 ---
 
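The restore drill has to pick the newest `etcd-*.snap.age` file from the backup directory. A minimal POSIX shell sketch of that selection step follows. It assumes the elided part of the file name is a UTC timestamp that sorts lexicographically, as the `date -u +%Y%m%dT%H%M%SZ` snapshot name suggests. `newest_snapshot` is a hypothetical helper name, not something defined in this patch.

```shell
# Hypothetical helper (not part of the patch): print the newest
# etcd-*.snap.age file in the given directory, or return 1 if none exist.
# Assumption: names embed a sortable UTC timestamp, so the last glob match
# is the most recent backup.
newest_snapshot() {
  dir=$1
  newest=
  for f in "$dir"/etcd-*.snap.age; do
    [ -e "$f" ] || continue   # the glob matched nothing
    newest=$f                 # globs expand in sorted order; the last one wins
  done
  [ -n "$newest" ] || return 1
  printf '%s\n' "$newest"
}
```

In the drill this could replace the `ls | sort | head` pipeline, for example `sudo age -d -i ~/.config/sops/age/keys.txt "$(newest_snapshot /opt/backup/railiance/cluster)" | file -`, provided the invoking user can list that directory.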