docs(wp-0004): add implementation notes for sudo, etcd, helm, cron
Some checks failed
railiance-tests / smoke (push) Has been cancelled
T02: note to verify etcd is in use before implementing; flags root requirement
T03: add KUBECONFIG to helm commands; note root access approach
T06: document solution to sudo problem — run cron under root's crontab,
not a sudoers whitelist. Add restore drill commands. Fix cron to use
absolute path (~ unreliable in root crontab).
T01: note to remove old railiance-backup script (wrong scope)
Makefile: fix stale backup description, add restore target, fix .PHONY
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -99,6 +99,9 @@ Create `tools/cmd/railiance-backup-s2` (replacing the old `railiance-backup`):
 - Exit 0 on success, non-zero on any failure
 - No network required
 
+Also remove the old `tools/cmd/railiance-backup` (backed up Docker-based
+custodian DB — wrong scope, not applicable to this server).
+
 **Done when:** `make backup` runs on COULOMBCORE without error and files
 appear in `/opt/backup/railiance/cluster/`.
 
@@ -123,6 +126,14 @@ sudo k3s etcd-snapshot save --name railiance-$(date -u +%Y%m%dT%H%M%SZ)
 Add to the backup script: take a fresh snapshot, encrypt with age,
 copy to `/opt/backup/railiance/cluster/`.
 
+> **Note — verify etcd is in use before implementing:**
+> `k3s etcd-snapshot` only works if k3s was started with `--cluster-init`.
+> Without it, k3s uses SQLite and this command will fail.
+> Verify first: `sudo k3s etcd-snapshot ls 2>&1`
+
+> **Note — sudo required:** etcd snapshot requires root. See T06 for how
+> this is resolved (backup runs under root's crontab).
+
 **Done when:** backup includes a current etcd snapshot.
 
 ---
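The etcd check described in the note could also run as a guard at the top of the backup script. The following is a minimal sketch, not the script's actual code: `k3s_datastore` is a hypothetical helper, and the paths assume the default k3s data directory (`/var/lib/rancher/k3s/server/db`), where embedded etcd keeps an `etcd/` directory and the SQLite backend keeps `state.db`.

```bash
#!/usr/bin/env bash
# Sketch only: k3s_datastore is a hypothetical helper, not part of k3s.
# Assumed layout (default k3s data dir): <db_dir>/etcd for embedded etcd,
# <db_dir>/state.db for the SQLite backend.
k3s_datastore() {
  local db_dir="${1:-/var/lib/rancher/k3s/server/db}"
  if [ -d "$db_dir/etcd" ]; then
    echo etcd
  elif [ -f "$db_dir/state.db" ]; then
    echo sqlite
  else
    echo unknown
    return 1
  fi
}
```

A backup script might test `[ "$(k3s_datastore)" = etcd ]` and skip the snapshot step otherwise. Reading that directory probably needs root, like the rest of the backup.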
@@ -139,14 +150,20 @@ state_hub_task_id: "05d42a55-921f-4aa7-bb76-e8af9c7e0ac3"
 Capture current runtime Helm values for all releases:
 
 ```bash
-helm list -A -o json | jq -r '.[].name + " " + .namespace' | \
+KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm list -A -o json | \
+  jq -r '.[].name + " " + .namespace' | \
   while read name ns; do
-    helm get values "$name" -n "$ns" -o yaml
+    KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm get values "$name" -n "$ns" -o yaml
   done
 ```
 
 Tar and age-encrypt into `helm-values-<ts>.tar.gz.age`.
 
+> **Note — kubeconfig permissions:** `/etc/rancher/k3s/k3s.yaml` is root-readable
+> only by default. The backup script must either run as root (see T06) or k3s
+> must be configured with `--write-kubeconfig-mode=644`. Running as root
+> (via root crontab) is the chosen approach — no config change needed.
+
 **Done when:** backup includes a snapshot of all Helm release values.
 
 ---
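One way the capture step could feed the tar-and-encrypt step is to write each release's values to its own file. This is a minimal sketch under stated assumptions: `dump_helm_values` and the `<namespace>__<release>.yaml` naming are hypothetical, and `helm` and `jq` are assumed to be on root's PATH. The jq filter is written as `.[] | ...` so that `.name` and `.namespace` are both read from the same release object. In the form `.[].name + " " + .namespace`, jq evaluates `.namespace` against the top-level array, which is an error.

```bash
#!/usr/bin/env bash
# Sketch only: dump_helm_values and the file naming are illustrative assumptions.
export KUBECONFIG="${KUBECONFIG:-/etc/rancher/k3s/k3s.yaml}"

dump_helm_values() {
  local outdir="$1" name ns
  mkdir -p "$outdir"
  # Emit one "name namespace" pair per line, both taken from the same release object.
  helm list -A -o json |
    jq -r '.[] | "\(.name) \(.namespace)"' |
    while read -r name ns; do
      # </dev/null keeps helm from consuming the loop's input.
      helm get values "$name" -n "$ns" -o yaml </dev/null \
        > "$outdir/${ns}__${name}.yaml"
    done
}
```

The resulting directory can then be archived with `tar -czf` and encrypted with `age` to produce `helm-values-<ts>.tar.gz.age`.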
@@ -198,18 +215,52 @@ priority: medium
 state_hub_task_id: "f8e4a094-c367-40eb-b895-da17bc144b07"
 ```
 
-Install the daily cron and verify decrypt works:
+#### Solving the sudo problem
+
+The backup script needs root for two reasons:
+- `k3s etcd-snapshot save` requires root
+- `/etc/rancher/k3s/k3s.yaml` (kubeconfig) is root-readable only
+
+**Solution: run the cron under root's crontab.**
+
+This is the correct pattern for system-level backup jobs. It avoids a
+proliferating sudoers whitelist (one entry per command, brittle to maintain)
+and matches how tools like `rsnapshot`, `bacula`, and `borgbackup` work in
+production. The backup writes to `/opt/backup/` which is root-owned anyway.
+
+Install the cron as root:
 
 ```bash
-# Install cron on COULOMBCORE
-(crontab -l 2>/dev/null; echo "0 2 * * * make -C ~/railiance-cluster backup >> /opt/backup/railiance/cluster/backup.log 2>&1") | crontab -
-
-# Drill: decrypt etcd snapshot and verify it's readable
-age -d -i ~/.config/sops/age/keys.txt \
-  /opt/backup/railiance/cluster/etcd-<latest>.snap.age | file -
+sudo crontab -e
+# Add:
+0 2 * * * make -C /home/tegwick/railiance-cluster backup >> /opt/backup/railiance/cluster/backup.log 2>&1
 ```
 
-**Done when:** cron installed, drill completes without error, log entry written.
+Note: use the absolute path to the repo — `~` does not expand reliably in
+root's crontab unless HOME is set.
+
+Verify it is installed:
+```bash
+sudo crontab -l | grep railiance
+```
+
+#### Restore drill
+
+Once T01–T04 are done, run a decrypt-and-verify drill:
+
+```bash
+# Decrypt the etcd snapshot and verify it is a valid snapshot file
+sudo age -d -i ~/.config/sops/age/keys.txt \
+  /opt/backup/railiance/cluster/etcd-$(ls /opt/backup/railiance/cluster/etcd-*.snap.age | sort -r | head -1 | xargs basename | sed 's/etcd-//;s/.snap.age//').snap.age \
+  | file -
+
+# Record the drill
+echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) restore drill OK" \
+  >> /opt/backup/railiance/cluster/restore-drill.log
+```
+
+**Done when:** cron installed under root, drill completes without error,
+log entry written.
 
 ---
 
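For a non-interactive install, the crontab entry could also be merged in by a script instead of edited by hand. This is a sketch, assuming it runs as root: `merge_cron_entry` is a hypothetical helper that drops any existing line containing a marker string and appends the new entry, so re-running it does not duplicate the job.

```bash
#!/usr/bin/env bash
# Sketch only: merge_cron_entry is a hypothetical helper.
# stdin: current crontab; $1: marker identifying our job; $2: full cron line.
merge_cron_entry() {
  local marker="$1" entry="$2"
  grep -vF -- "$marker" || true   # grep exits 1 when no lines remain; not an error here
  printf '%s\n' "$entry"
}

# Intended use as root (commented out so sourcing this file changes nothing):
# entry='0 2 * * * make -C /home/tegwick/railiance-cluster backup >> /opt/backup/railiance/cluster/backup.log 2>&1'
# crontab -l 2>/dev/null | merge_cron_entry 'railiance-cluster backup' "$entry" | crontab -
```

The marker is matched as a fixed string, so it should be specific enough not to remove unrelated jobs.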
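The snapshot selection in the restore drill can be expressed more directly than stripping and re-adding the file name's prefix and suffix. This is a minimal sketch, assuming encrypted snapshots are named `etcd-<ts>.snap.age` with a sortable UTC timestamp, so that the last name in sorted order is the newest. `latest_snapshot` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch only: latest_snapshot is a hypothetical helper.
# Prints the path of the newest etcd-*.snap.age file in a directory,
# relying on the shell expanding globs in sorted order.
latest_snapshot() {
  local dir="$1" f newest=""
  for f in "$dir"/etcd-*.snap.age; do
    if [ -e "$f" ]; then   # skips the literal pattern when nothing matches
      newest="$f"
    fi
  done
  [ -n "$newest" ] || return 1
  printf '%s\n' "$newest"
}

# Intended use in the drill (commented out; needs age and the real backups):
# snap="$(latest_snapshot /opt/backup/railiance/cluster)" || exit 1
# age -d -i ~/.config/sops/age/keys.txt "$snap" | file -
```

Returning non-zero when no snapshot exists lets the drill fail loudly instead of passing an unexpanded pattern to `age`.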