docs(wp-0004): add implementation notes for sudo, etcd, helm, cron

T02: note to verify etcd is in use before implementing; flags root requirement
T03: add KUBECONFIG to helm commands; note root access approach
T06: document solution to sudo problem — run cron under root's crontab,
     not a sudoers whitelist. Add restore drill commands. Fix cron to use
     absolute path (~ unreliable in root crontab).
T01: note to remove old railiance-backup script (wrong scope)
Makefile: fix stale backup description, add restore target, fix .PHONY

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 16:52:40 +00:00
parent 5b0cfbf10a
commit 66f8ca4009
2 changed files with 67 additions and 13 deletions

@@ -99,6 +99,9 @@ Create `tools/cmd/railiance-backup-s2` (replacing the old `railiance-backup`):
- Exit 0 on success, non-zero on any failure
- No network required
+Also remove the old `tools/cmd/railiance-backup` (backed up Docker-based
+custodian DB; wrong scope, not applicable to this server).
**Done when:** `make backup` runs on COULOMBCORE without error and files
appear in `/opt/backup/railiance/cluster/`.
@@ -123,6 +126,14 @@ sudo k3s etcd-snapshot save --name railiance-$(date -u +%Y%m%dT%H%M%SZ)
Add to the backup script: take a fresh snapshot, encrypt with age,
copy to `/opt/backup/railiance/cluster/`.
+> **Note: verify etcd is in use before implementing.**
+> `k3s etcd-snapshot` only works if k3s was started with `--cluster-init`.
+> Without it, k3s uses SQLite and this command will fail.
+> Verify first: `sudo k3s etcd-snapshot ls 2>&1`
+> **Note: sudo required.** etcd snapshot requires root. See T06 for how
+> this is resolved (backup runs under root's crontab).
**Done when:** backup includes a current etcd snapshot.
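A guard along the following lines could sit in front of the snapshot step in the backup script. This is a minimal sketch, not the script itself: `etcd_in_use` and `take_snapshot` are hypothetical function names, the script is assumed to already run as root (see T06), and it assumes `k3s etcd-snapshot ls` exits non-zero on a SQLite-backed server, which should be confirmed on COULOMBCORE first.

```bash
# Hypothetical helper: succeeds only if the etcd snapshot subcommand works,
# which the note above takes as the sign that embedded etcd is in use.
etcd_in_use() {
    k3s etcd-snapshot ls >/dev/null 2>&1
}

# Hypothetical helper: take a timestamped snapshot, or fail loudly so the
# backup script exits non-zero instead of silently skipping etcd.
take_snapshot() {
    if ! etcd_in_use; then
        echo "etcd does not appear to be in use; refusing to snapshot" >&2
        return 1
    fi
    k3s etcd-snapshot save --name "railiance-$(date -u +%Y%m%dT%H%M%SZ)"
}
```

Calling `take_snapshot || exit 1` from the backup script keeps the "non-zero on any failure" contract from T01.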
---
@@ -139,14 +150,20 @@ state_hub_task_id: "05d42a55-921f-4aa7-bb76-e8af9c7e0ac3"
Capture current runtime Helm values for all releases:
```bash
-helm list -A -o json | jq -r '.[].name + " " + .namespace' | \
+KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm list -A -o json | \
+jq -r '.[] | .name + " " + .namespace' | \
while read name ns; do
-helm get values "$name" -n "$ns" -o yaml
+KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm get values "$name" -n "$ns" -o yaml
done
```
Tar and age-encrypt into `helm-values-<ts>.tar.gz.age`.
+> **Note: kubeconfig permissions.** `/etc/rancher/k3s/k3s.yaml` is root-readable
+> only by default. The backup script must either run as root (see T06) or k3s
+> must be configured with `--write-kubeconfig-mode=644`. Running as root
+> (via root crontab) is the chosen approach, so no config change is needed.
**Done when:** backup includes a snapshot of all Helm release values.
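The capture-then-archive step might be structured roughly as below. This is a sketch under assumptions: the function names, the `<namespace>--<release>.yaml` file naming, and the age recipient argument are all hypothetical, since the work package does not say which age public key the backups are encrypted to.

```bash
# Path taken from the note above; readable only by root by default.
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml

# Hypothetical helper: write one values file per Helm release into $1.
dump_helm_values() {
    outdir=$1
    mkdir -p "$outdir" || return 1
    helm list -A -o json | jq -r '.[] | "\(.name) \(.namespace)"' |
    while read -r name ns; do
        # "exit 1" ends the pipeline subshell in sh/bash, so the function fails
        helm get values "$name" -n "$ns" -o yaml > "$outdir/$ns--$name.yaml" || exit 1
    done
}

# Hypothetical helper: tar directory $1 and age-encrypt it to file $2
# for recipient $3 (an age public key, assumed to be supplied by the caller).
archive_helm_values() {
    srcdir=$1
    dest=$2
    recipient=$3
    tar -C "$srcdir" -czf - . | age -r "$recipient" -o "$dest"
}
```

A caller would pass a destination such as `helm-values-<ts>.tar.gz.age` under `/opt/backup/railiance/cluster/`. Writing one file per release keeps the archive easy to inspect after decryption.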
---
@@ -198,18 +215,52 @@ priority: medium
state_hub_task_id: "f8e4a094-c367-40eb-b895-da17bc144b07"
```
-Install the daily cron and verify decrypt works:
+#### Solving the sudo problem
+The backup script needs root for two reasons:
+- `k3s etcd-snapshot save` requires root
+- `/etc/rancher/k3s/k3s.yaml` (kubeconfig) is root-readable only
+**Solution: run the cron under root's crontab.**
+This is the correct pattern for system-level backup jobs. It avoids a
+proliferating sudoers whitelist (one entry per command, brittle to maintain)
+and matches how tools like `rsnapshot`, `bacula`, and `borgbackup` work in
+production. The backup writes to `/opt/backup/` which is root-owned anyway.
+Install the cron as root:
```bash
-# Install cron on COULOMBCORE
-(crontab -l 2>/dev/null; echo "0 2 * * * make -C ~/railiance-cluster backup >> /opt/backup/railiance/cluster/backup.log 2>&1") | crontab -
-# Drill: decrypt etcd snapshot and verify it's readable
-age -d -i ~/.config/sops/age/keys.txt \
-/opt/backup/railiance/cluster/etcd-<latest>.snap.age | file -
+sudo crontab -e
+# Add:
+0 2 * * * make -C /home/tegwick/railiance-cluster backup >> /opt/backup/railiance/cluster/backup.log 2>&1
```
-**Done when:** cron installed, drill completes without error, log entry written.
+Note: use the absolute path to the repo. In root's crontab, `~` expands to
+root's home directory (typically `/root`), not `/home/tegwick`.
+Verify it is installed:
+```bash
+sudo crontab -l | grep railiance
+```
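If a non-interactive install is preferred over `sudo crontab -e`, something like the following could work. It is a sketch only: `install_cron` is a hypothetical helper, it must itself run as root so that `crontab` addresses root's table, and it skips the install when the exact line is already present.

```bash
# The cron entry from above, with the absolute repo path.
CRON_LINE='0 2 * * * make -C /home/tegwick/railiance-cluster backup >> /opt/backup/railiance/cluster/backup.log 2>&1'

# Hypothetical helper: append CRON_LINE to the current user's crontab
# unless an identical line already exists (safe to run repeatedly).
install_cron() {
    current=$(crontab -l 2>/dev/null)
    if printf '%s\n' "$current" | grep -qF -- "$CRON_LINE"; then
        return 0
    fi
    {
        [ -n "$current" ] && printf '%s\n' "$current"
        printf '%s\n' "$CRON_LINE"
    } | crontab -
}
```

Running it twice leaves a single entry, which avoids the duplicate lines that a plain append to the crontab produces on re-runs.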
+#### Restore drill
+Once T01–T04 are done, run a decrypt-and-verify drill:
+```bash
+# Decrypt the latest etcd snapshot and check that the output is readable
+latest=$(ls /opt/backup/railiance/cluster/etcd-*.snap.age | sort -r | head -1)
+sudo age -d -i ~/.config/sops/age/keys.txt "$latest" | file -
+# Record the drill (the backup directory is root-owned, so append via sudo tee)
+echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) restore drill OK" \
+| sudo tee -a /opt/backup/railiance/cluster/restore-drill.log >/dev/null
+```
+**Done when:** cron installed under root, drill completes without error,
+log entry written.
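The drill could also be wrapped so that the log records a failure as well as a success. The sketch below assumes snapshot names sort chronologically (true for the `%Y%m%dT%H%M%SZ` timestamps used in T02). `latest_snapshot` and `drill` are hypothetical names, and the check only confirms that decryption yields non-empty output, not that etcd can actually restore from it.

```bash
# Hypothetical helper: print the newest etcd-*.snap.age file in directory $1.
latest_snapshot() {
    ls -1 "$1"/etcd-*.snap.age 2>/dev/null | sort | tail -n 1
}

# Hypothetical helper: decrypt the newest snapshot in $1 with age identity $2,
# log the outcome to $1/restore-drill.log, and return non-zero on failure.
drill() {
    dir=$1
    key=$2
    snap=$(latest_snapshot "$dir")
    if [ -z "$snap" ]; then
        echo "no snapshot found in $dir" >&2
        return 1
    fi
    tmp=$(mktemp) || return 1
    if age -d -i "$key" -o "$tmp" "$snap" && [ -s "$tmp" ]; then
        status=OK
    else
        status=FAILED
    fi
    rm -f "$tmp"   # do not leave the decrypted snapshot on disk
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) restore drill $status ($snap)" \
        >> "$dir/restore-drill.log"
    [ "$status" = OK ]
}
```

Run as root, for example `drill /opt/backup/railiance/cluster /path/to/keys.txt`, where the key path is whichever age identity file the backups were encrypted for.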
---