---
id: RAIL-BS-WP-0004
type: workplan
title: "Integrated Backup — S2 Kubernetes Runtime Layer"
domain: railiance
repo: railiance-cluster
status: done
owner: tegwick
topic_slug: railiance
state_hub_workstream_id: "7e8b0c20-51eb-40c9-9e3b-85dd380d7625"
created: "2026-02-25"
updated: "2026-03-26"
---

# Integrated Backup — S2 Kubernetes Runtime Layer

## Goal

Implement the Q3 (Operability & Resilience) integrated backup for railiance-cluster (S2). Backs up what S2 owns — the Kubernetes runtime state — encrypted with age, written to a local directory on the server. No external dependencies required.

## Architecture (Decision D4)

Each railiance repo implements its own backup for what it owns. No central backup service. See `DECISIONS.md` D4 for full rationale.

**Standard interface every railiance repo must provide:**

```bash
make backup    # encrypt + write to /opt/backup/railiance//
make restore   # restore from most recent local backup
```

Encryption: age, same key pair as SOPS secrets (`.sops.yaml` public key).

Output: `/opt/backup/railiance/cluster/` on the server.

## What S2 (railiance-cluster) owns and must back up

| Asset | Why it matters |
|---|---|
| k3s etcd snapshots | Full cluster state — all workloads, configs, secrets |
| Helm release values | Runtime values not in git (any manually applied overrides) |
| kubeconfig | Admin access to the cluster |

**Not S2's responsibility:**

- Custodian State Hub DB → the-custodian owns this
- Operator workstation config (`.claude/`, `.gitconfig`) → operator's own concern
- Application data (Gitea repos, uploads) → S5 (railiance-apps) owns this
- PostgreSQL data volumes → S3 (railiance-platform) owns this

## Encryption

Reuse the age public key from `.sops.yaml`:

```bash
AGE_PUBLIC_KEY=$(grep 'age:' .sops.yaml | awk '{print $2}')
tar -czf - | age -r "${AGE_PUBLIC_KEY}" -o backup.tar.gz.age
```

Decryption requires the private key at `~/.config/sops/age/keys.txt` (same key used for `sops -d`). No additional key management needed.

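
The `grep | awk` extraction above depends on the exact layout of `.sops.yaml`: a quoted value keeps its quotes, and a `- age: ...` list item yields the wrong field. A slightly more tolerant variant could be sketched as follows. It assumes the recipient appears in the file as a literal `age1...` string and that only the first one is wanted; the function name `sops_age_recipient` is hypothetical and not part of the repo.

```shell
#!/usr/bin/env bash
# Sketch: print the first age recipient found in a SOPS config file.
# Assumptions: the recipient is a literal "age1..." string (quoted or not),
# and the first match is the one to encrypt to.
sops_age_recipient() {
  local sops_file="${1:-.sops.yaml}"
  local key
  # -o prints only the matching part, so quotes and YAML syntax are dropped.
  key=$(grep -oE 'age1[0-9a-z]+' "$sops_file" | head -n 1)
  if [ -z "$key" ]; then
    echo "no age recipient found in $sops_file" >&2
    return 1
  fi
  printf '%s\n' "$key"
}
```

The result can stand in for `AGE_PUBLIC_KEY`, for example `age -r "$(sops_age_recipient)"`. Failing loudly when no key is found keeps the backup from running `age` with an empty recipient.
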
## Extension Point EP-RAIL-005

Once all five OAS layers implement this interface, the custodian can orchestrate a full-stack backup with:

```bash
for repo in railiance-infra railiance-cluster railiance-platform \
            railiance-enablement railiance-apps; do
  make -C ~/$repo backup
done
```

No special protocol needed — just the standard interface.

---

## Tasks

### T01 — Define backup directory and encryption wrapper

```task
id: T01
status: done
priority: high
state_hub_task_id: "4526a842-ea31-4874-9231-92ab556cfe7b"
```

Create `tools/cmd/railiance-backup-s2` (replacing the old `railiance-backup`):

- Backup dir: `/opt/backup/railiance/cluster/` (create with `mkdir -p`)
- Encrypt each artifact with age using public key from `.sops.yaml`
- Write timestamp-named files: `etcd-.snap.age`, `helm-values-.tar.gz.age`, `kubeconfig-.yaml.age`
- Keep last 7 of each type
- Write `.last-backup` stamp
- Exit 0 on success, non-zero on any failure
- No network required

Also remove the old `tools/cmd/railiance-backup` (backed up Docker-based custodian DB — wrong scope, not applicable to this server).

**Done when:** `make backup` runs on COULOMBCORE without error and files appear in `/opt/backup/railiance/cluster/`.

---

### T02 — Back up k3s state (SQLite hot backup)

```task
id: T02
status: done
priority: high
state_hub_task_id: "a6313e06-1976-46a7-8e31-df4eb2eca880"
```

k3s has built-in etcd snapshot support:

```bash
sudo k3s etcd-snapshot save --name railiance-$(date -u +%Y%m%dT%H%M%SZ)
# Default location: /var/lib/rancher/k3s/server/db/snapshots/
```

Add to the backup script: take a fresh snapshot, encrypt with age, copy to `/opt/backup/railiance/cluster/`.

> **Note — verify etcd is in use before implementing:**
> `k3s etcd-snapshot` only works if k3s was started with `--cluster-init`.
> Without it, k3s uses SQLite and this command will fail.
> Verify first: `sudo k3s etcd-snapshot ls 2>&1`

> **Note — sudo required:** etcd snapshot requires root.
> See T06 for how this is resolved (backup runs under root's crontab).

**Done when:** backup includes a current etcd snapshot.

---

### T03 — Back up Helm release values

```task
id: T03
status: done
priority: medium
state_hub_task_id: "05d42a55-921f-4aa7-bb76-e8af9c7e0ac3"
```

Capture current runtime Helm values for all releases:

```bash
# Iterate over every release; ".[] |" applies the filter to each array element
KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm list -A -o json | \
  jq -r '.[] | .name + " " + .namespace' | \
  while read -r name ns; do
    KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm get values "$name" -n "$ns" -o yaml
  done
```

Tar and age-encrypt into `helm-values-.tar.gz.age`.

> **Note — kubeconfig permissions:** `/etc/rancher/k3s/k3s.yaml` is root-readable
> only by default. The backup script must either run as root (see T06) or k3s
> must be configured with `--write-kubeconfig-mode=644`. Running as root
> (via root crontab) is the chosen approach — no config change needed.

**Done when:** backup includes a snapshot of all Helm release values.

---

### T04 — Back up kubeconfig

```task
id: T04
status: done
priority: medium
state_hub_task_id: "08233868-d522-4117-bc4e-6c0f52545665"
```

Age-encrypt `~/.kube/config-hosteurope` (or `/etc/rancher/k3s/k3s.yaml`) into `kubeconfig-.yaml.age` in the backup directory.

**Done when:** backup includes the encrypted kubeconfig.

---

### T05 — make restore target

```task
id: T05
status: done
priority: medium
state_hub_task_id: "2d5acff7-4a4e-4ddd-ad06-08237ad3dac8"
```

Add `tools/cmd/railiance-restore-s2` that decrypts and lists available backups, with guided restore for the etcd snapshot case.

Restore of etcd from snapshot:

```bash
sudo k3s server --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/
```

**Done when:** `make restore` prints available backups and a restore guide.

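
The listing half of `make restore` could look roughly like the sketch below. It is not the actual `railiance-restore-s2` script: the function name `list_backups` is hypothetical, and it assumes the backup file names carry a sortable UTC timestamp (such as the `%Y%m%dT%H%M%SZ` format used for the snapshot name in T02), so that a reverse name sort puts the newest file first.

```shell
#!/usr/bin/env bash
# Sketch: print available backups, newest first, grouped by artifact type.
# Assumption: file names sort chronologically because they embed a UTC timestamp.
list_backups() {
  local dir="${1:-/opt/backup/railiance/cluster}"
  local prefix f
  for prefix in etcd helm-values kubeconfig; do
    echo "== ${prefix} =="
    for f in "$dir/$prefix"-*.age; do
      # An unmatched glob stays literal, so skip entries that do not exist.
      if [ -e "$f" ]; then
        basename "$f"
      fi
    done | sort -r
  done
}
```

Calling `list_backups` with no argument reads the default backup directory; passing a path makes it easy to try against a scratch directory first.
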

---

### T06 — Install cron job and run restore drill

```task
id: T06
status: done
priority: medium
state_hub_task_id: "f8e4a094-c367-40eb-b895-da17bc144b07"
```

#### Solving the sudo problem

The backup script needs root for two reasons:

- `k3s etcd-snapshot save` requires root
- `/etc/rancher/k3s/k3s.yaml` (kubeconfig) is root-readable only

**Solution: run the cron under root's crontab.** This is the correct pattern for system-level backup jobs. It avoids a proliferating sudoers whitelist (one entry per command, brittle to maintain) and matches how tools like `rsnapshot`, `bacula`, and `borgbackup` work in production. The backup writes to `/opt/backup/` which is root-owned anyway.

Install the cron as root:

```bash
sudo crontab -e
# Add:
0 2 * * * make -C /home/tegwick/railiance-cluster backup >> /opt/backup/railiance/cluster/backup.log 2>&1
```

Note: use the absolute path to the repo. In root's crontab, `~` resolves to root's home directory (or does not expand at all if HOME is unset), not to `/home/tegwick`.

Verify it is installed:

```bash
sudo crontab -l | grep railiance
```

#### Restore drill

Once T01–T04 are done, run a decrypt-and-verify drill:

```bash
# Decrypt the most recent etcd snapshot and check its file type
latest=$(ls /opt/backup/railiance/cluster/etcd-*.snap.age | sort -r | head -1)
sudo age -d -i ~/.config/sops/age/keys.txt "$latest" | file -

# Record the drill
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) restore drill OK" \
  >> /opt/backup/railiance/cluster/restore-drill.log
```

**Done when:** cron installed under root, drill completes without error, log entry written.

---

## References

- Decision D4: Integrated backup per capability (`DECISIONS.md`)
- Decision D2: Nextcloud as optional offsite extension (still valid, not a requirement)
- OAS Q3: Operability & Resilience
- Extension point EP-RAIL-005: Custodian full-stack backup orchestration
- k3s etcd snapshots: https://docs.k3s.io/datastore/backup-restore