railiance-cluster/workplans/RAIL-BS-WP-0004-safety-net.md
tegwick 9fc5a033d5
feat(s2): add Gitea SSH NodePort service + close WP-0004 (backup tool, scope updates)
- helm/gitea-ssh-nodeport.yaml: expose Gitea SSH on NodePort 30022 (targetPort 2222)
  for on-node git automation (RAIL-HO-WP-0004-T07)
- tools/cmd/railiance-backup-s2: fix SQLite hot backup (was broken etcd-snapshot)
- tools/cmd/railiance-restore-s2: update restore instructions for SQLite mode
- workplans/RAIL-BS-WP-0004-safety-net.md: mark done
- SCOPE.md: update current state, document boundary violations, fix connectivity docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 01:07:02 +01:00

---
id: RAIL-BS-WP-0004
type: workplan
title: "Integrated Backup — S2 Kubernetes Runtime Layer"
domain: railiance
repo: railiance-cluster
status: done
owner: tegwick
topic_slug: railiance
state_hub_workstream_id: "7e8b0c20-51eb-40c9-9e3b-85dd380d7625"
created: "2026-02-25"
updated: "2026-03-26"
---
# Integrated Backup — S2 Kubernetes Runtime Layer
## Goal
Implement the Q3 (Operability & Resilience) integrated backup for
railiance-cluster (S2). The backup covers what S2 owns, the Kubernetes runtime
state, encrypted with age and written to a local directory on the server. No
external dependencies are required.
## Architecture (Decision D4)
Each railiance repo implements its own backup for what it owns. No central
backup service. See `DECISIONS.md` D4 for full rationale.
**Standard interface every railiance repo must provide:**
```bash
make backup # encrypt + write to /opt/backup/railiance/<layer>/
make restore # restore from most recent local backup
```
Encryption: age, same key pair as SOPS secrets (`.sops.yaml` public key).
Output: `/opt/backup/railiance/cluster/` on the server.
## What S2 (railiance-cluster) owns and must back up
| Asset | Why it matters |
|---|---|
| k3s etcd snapshots | Full cluster state — all workloads, configs, secrets |
| Helm release values | Runtime values not in git (any manually applied overrides) |
| kubeconfig | Admin access to the cluster |
**Not S2's responsibility:**
- Custodian State Hub DB → the-custodian owns this
- Operator workstation config (`.claude/`, `.gitconfig`) → operator's own concern
- Application data (Gitea repos, uploads) → S5 (railiance-apps) owns this
- PostgreSQL data volumes → S3 (railiance-platform) owns this
## Encryption
Reuse the age public key from `.sops.yaml`:
```bash
# Take the first age recipient ("age1..." string) from .sops.yaml, whatever the YAML layout
AGE_PUBLIC_KEY=$(grep -oE 'age1[0-9a-z]+' .sops.yaml | head -n 1)
tar -czf - <assets> | age -r "${AGE_PUBLIC_KEY}" -o backup.tar.gz.age
```
Decryption requires the private key at `~/.config/sops/age/keys.txt`
(same key used for `sops -d`). No additional key management needed.
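A slightly more defensive variant of the key lookup is sketched below. It assumes the recipient appears in `.sops.yaml` as a standard `age1...` string; the function name `sops_age_recipient` is hypothetical and not part of the repo. Failing early avoids calling `age -r` with an empty recipient.

```bash
# Hypothetical helper: print the first age recipient found in a SOPS config.
# Returns non-zero (and prints to stderr) when no "age1..." string is present.
sops_age_recipient() {
  local cfg=${1:-.sops.yaml}
  local key
  # grep exits 1 on no match; tolerate that and check the result explicitly.
  key=$(grep -oE 'age1[0-9a-z]+' "$cfg" 2>/dev/null | head -n 1) || true
  if [ -z "$key" ]; then
    echo "no age recipient found in $cfg" >&2
    return 1
  fi
  printf '%s\n' "$key"
}
```

A caller could then write `age -r "$(sops_age_recipient)" -o backup.tar.gz.age` and rely on the non-zero exit to abort the backup when the key is missing.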
## Extension Point EP-RAIL-005
Once all five OAS layers implement this interface, the custodian can
orchestrate a full-stack backup with:
```bash
for repo in railiance-infra railiance-cluster railiance-platform \
            railiance-enablement railiance-apps; do
  make -C "$HOME/$repo" backup
done
```
No special protocol needed — just the standard interface.
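For orchestration it may be useful to keep going when one layer fails and report every failed layer at the end. The sketch below shows one way to do that; `backup_all` and the `MAKE` override are illustrative names, not existing tooling, and the repo layout under a common base directory is an assumption.

```bash
# Hypothetical helper: run `make backup` in each layer repo under $1.
# Continues after a failure and returns non-zero if any layer failed.
# MAKE can be overridden (e.g. with a stub for a dry run).
backup_all() {
  local base=$1; shift
  local failed=()
  local repo
  for repo in "$@"; do
    if ! "${MAKE:-make}" -C "$base/$repo" backup; then
      failed+=("$repo")
    fi
  done
  if [ "${#failed[@]}" -gt 0 ]; then
    echo "backup failed for: ${failed[*]}" >&2
    return 1
  fi
}
```

Example call: `backup_all "$HOME" railiance-infra railiance-cluster railiance-platform railiance-enablement railiance-apps`.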
---
## Tasks
### T01 — Define backup directory and encryption wrapper
```task
id: T01
status: done
priority: high
state_hub_task_id: "4526a842-ea31-4874-9231-92ab556cfe7b"
```
Create `tools/cmd/railiance-backup-s2` (replacing the old `railiance-backup`):
- Backup dir: `/opt/backup/railiance/cluster/` (create with `mkdir -p`)
- Encrypt each artifact with age using public key from `.sops.yaml`
- Write timestamp-named files: `etcd-<ts>.snap.age`, `helm-values-<ts>.tar.gz.age`, `kubeconfig-<ts>.yaml.age`
- Keep last 7 of each type
- Write `.last-backup` stamp
- Exit 0 on success, non-zero on any failure
- No network required
Also remove the old `tools/cmd/railiance-backup` (backed up Docker-based
custodian DB — wrong scope, not applicable to this server).
**Done when:** `make backup` runs on COULOMBCORE without error and files
appear in `/opt/backup/railiance/cluster/`.
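The "keep last 7 of each type" rule from the list above could be sketched as follows. The helper name `prune_old` is hypothetical, and the sketch assumes the `<ts>` part of each file name is a UTC timestamp such as `20260327T010702Z`, so that a plain lexicographic sort orders files by age.

```bash
# Hypothetical helper: keep the newest $3 (default 7) files named
# "<prefix>-<timestamp>...age" in directory $1 and delete the older ones.
# Assumes timestamps in the file names sort lexicographically.
prune_old() {
  local dir=$1 prefix=$2 keep=${3:-7}
  # Newest first, skip the first $keep entries, remove whatever is left.
  find "$dir" -maxdepth 1 -type f -name "${prefix}-*.age" \
    | sort -r \
    | tail -n +"$((keep + 1))" \
    | while IFS= read -r old; do
        rm -f -- "$old"
      done
}
```

The backup script would call it once per artifact type, for example `prune_old /opt/backup/railiance/cluster etcd 7`, after the new files have been written.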
---
### T02 — Back up k3s state (SQLite hot backup)
```task
id: T02
status: done
priority: high
state_hub_task_id: "a6313e06-1976-46a7-8e31-df4eb2eca880"
```
k3s has built-in etcd snapshot support:
```bash
sudo k3s etcd-snapshot save --name railiance-$(date -u +%Y%m%dT%H%M%SZ)
# Default location: /var/lib/rancher/k3s/server/db/snapshots/
```
Add to the backup script: take a fresh snapshot, encrypt with age,
copy to `/opt/backup/railiance/cluster/`.
> **Note — verify etcd is in use before implementing:**
> `k3s etcd-snapshot` only works if k3s was started with `--cluster-init`.
> Without it, k3s uses SQLite and this command will fail.
> Verify first: `sudo k3s etcd-snapshot ls 2>&1`
> **Note — sudo required:** etcd snapshot requires root. See T06 for how
> this is resolved (backup runs under root's crontab).
**Done when:** backup includes a current etcd snapshot.
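Because the right backup command depends on which datastore k3s is using, a small check before snapshotting can help. The sketch below assumes the default k3s data layout (an `etcd/` directory for embedded etcd, a `state.db` file for SQLite); the function name `k3s_datastore` is hypothetical.

```bash
# Hypothetical helper: report which datastore a k3s server directory holds.
# Assumes the default layout under /var/lib/rancher/k3s/server/db:
#   etcd/     -> embedded etcd
#   state.db  -> SQLite
k3s_datastore() {
  local dbdir=${1:-/var/lib/rancher/k3s/server/db}
  if [ -d "$dbdir/etcd" ]; then
    echo etcd
  elif [ -f "$dbdir/state.db" ]; then
    echo sqlite
  else
    echo unknown
    return 1
  fi
}
```

The backup script could branch on the result and call `k3s etcd-snapshot save` only when it prints `etcd`. Reading the directory requires root, like the rest of the backup.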
---
### T03 — Back up Helm release values
```task
id: T03
status: done
priority: medium
state_hub_task_id: "05d42a55-921f-4aa7-bb76-e8af9c7e0ac3"
```
Capture current runtime Helm values for all releases:
```bash
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
# Emit one "name namespace" pair per release, then dump each release's values
helm list -A -o json | \
  jq -r '.[] | "\(.name) \(.namespace)"' | \
  while read -r name ns; do
    helm get values "$name" -n "$ns" -o yaml
  done
```
Tar and age-encrypt into `helm-values-<ts>.tar.gz.age`.
> **Note — kubeconfig permissions:** `/etc/rancher/k3s/k3s.yaml` is root-readable
> only by default. The backup script must either run as root (see T06) or k3s
> must be configured with `--write-kubeconfig-mode=644`. Running as root
> (via root crontab) is the chosen approach — no config change needed.
**Done when:** backup includes a snapshot of all Helm release values.
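To tar the values, each release's output first needs to land in its own file. A minimal sketch, assuming "name namespace" pairs arrive on stdin (as produced by the `helm list` pipeline above); `dump_helm_values`, the `HELM` override, and the `<namespace>__<name>.yaml` naming are illustrative choices, not the repo's actual implementation.

```bash
# Hypothetical helper: read "name namespace" pairs on stdin and write each
# release's values to <outdir>/<namespace>__<name>.yaml.
# HELM can be overridden (e.g. with a stub); it defaults to the helm CLI.
dump_helm_values() {
  local outdir=$1 name ns
  mkdir -p "$outdir"
  while read -r name ns; do
    [ -n "$name" ] || continue
    # </dev/null keeps the helm call from consuming the loop's stdin.
    "${HELM:-helm}" get values "$name" -n "$ns" -o yaml </dev/null \
      > "$outdir/${ns}__${name}.yaml" || return 1
  done
}
```

The resulting directory can then be archived with `tar -czf - -C "$outdir" .` and piped to `age` as described in the Encryption section.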
---
### T04 — Back up kubeconfig
```task
id: T04
status: done
priority: medium
state_hub_task_id: "08233868-d522-4117-bc4e-6c0f52545665"
```
Age-encrypt `~/.kube/config-hosteurope` (or `/etc/rancher/k3s/k3s.yaml`)
into `kubeconfig-<ts>.yaml.age` in the backup directory.
**Done when:** backup includes the encrypted kubeconfig.
---
### T05 — make restore target
```task
id: T05
status: done
priority: medium
state_hub_task_id: "2d5acff7-4a4e-4ddd-ad06-08237ad3dac8"
```
Add `tools/cmd/railiance-restore-s2` that decrypts and lists available
backups, with guided restore for the etcd snapshot case.
Restore of etcd from snapshot:
```bash
sudo k3s server --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/<name>
```
**Done when:** `make restore` prints available backups and a restore guide.
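The listing half of `make restore` could look roughly like the sketch below. It assumes the file-name prefixes defined in T01 and timestamps that sort lexicographically; `list_backups` is a hypothetical name, not the actual contents of `railiance-restore-s2`.

```bash
# Hypothetical helper: print the encrypted backups in $1, grouped by type,
# newest first. Type prefixes follow the file names defined in T01.
list_backups() {
  local dir=${1:-/opt/backup/railiance/cluster} prefix
  for prefix in etcd helm-values kubeconfig; do
    echo "== $prefix =="
    find "$dir" -maxdepth 1 -type f -name "${prefix}-*.age" \
      | sort -r \
      | while IFS= read -r f; do
          # Strip the directory part so only the file name is shown.
          echo "  ${f##*/}"
        done
  done
}
```

The first entry under each heading is the most recent backup of that type and is the natural default for a guided restore.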
---
### T06 — Install cron job and run restore drill
```task
id: T06
status: done
priority: medium
state_hub_task_id: "f8e4a094-c367-40eb-b895-da17bc144b07"
```
#### Solving the sudo problem
The backup script needs root for two reasons:
- `k3s etcd-snapshot save` requires root
- `/etc/rancher/k3s/k3s.yaml` (kubeconfig) is root-readable only
**Solution: run the cron under root's crontab.**
This is the correct pattern for system-level backup jobs. It avoids a
proliferating sudoers whitelist (one entry per command, brittle to maintain)
and matches how tools like `rsnapshot`, `bacula`, and `borgbackup` work in
production. The backup writes to `/opt/backup/` which is root-owned anyway.
Install the cron as root:
```bash
sudo crontab -e
# Add:
0 2 * * * make -C /home/tegwick/railiance-cluster backup >> /opt/backup/railiance/cluster/backup.log 2>&1
```
Note: use the absolute path to the repo. In root's crontab, `~` resolves to
root's home directory (typically `/root`), not the operator's.
Verify it is installed:
```bash
sudo crontab -l | grep railiance
```
#### Restore drill
Once T01 through T04 are done, run a decrypt-and-verify drill:
```bash
# Decrypt the most recent snapshot and let `file` identify what it contains
latest=$(ls -1 /opt/backup/railiance/cluster/etcd-*.snap.age | sort -r | head -n 1)
sudo age -d -i ~/.config/sops/age/keys.txt "$latest" | file -
# Record the drill
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) restore drill OK" \
>> /opt/backup/railiance/cluster/restore-drill.log
```
**Done when:** cron installed under root, drill completes without error,
log entry written.
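To make the log entry reflect the actual outcome of the drill, the verification command and the log line can be tied together. The helper below is a sketch under that assumption; `record_drill` is a hypothetical name and the `FAILED` wording is illustrative.

```bash
# Hypothetical helper: run a verification command and log the outcome.
# Usage: record_drill <logfile> <command> [args...]
record_drill() {
  local log=$1; shift
  local ts
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  # Only log OK when the verification command exits 0.
  if "$@" >/dev/null 2>&1; then
    echo "$ts restore drill OK" >> "$log"
  else
    echo "$ts restore drill FAILED" >> "$log"
    return 1
  fi
}
```

Any command whose exit status reflects a successful decrypt can be passed, for example `age -d -i <keyfile> -o /dev/null <backup file>`.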
---
## References
- Decision D4: Integrated backup per capability (`DECISIONS.md`)
- Decision D2: Nextcloud as optional offsite extension (still valid, not a requirement)
- OAS Q3: Operability & Resilience
- Extension point EP-RAIL-005: Custodian full-stack backup orchestration
- k3s etcd snapshots: https://docs.k3s.io/datastore/backup-restore