---
id: RAIL-BS-WP-0004
type: workplan
title: "Integrated Backup — S2 Kubernetes Runtime Layer"
domain: railiance
repo: railiance-cluster
status: done
owner: tegwick
topic_slug: railiance
state_hub_workstream_id: "7e8b0c20-51eb-40c9-9e3b-85dd380d7625"
created: "2026-02-25"
updated: "2026-03-26"
---

# Integrated Backup — S2 Kubernetes Runtime Layer

## Goal

Implement the Q3 (Operability & Resilience) integrated backup for
railiance-cluster (S2). The backup covers what S2 owns, the Kubernetes runtime
state, encrypted with age and written to a local directory on the server. No
external dependencies are required.

## Architecture (Decision D4)

Each railiance repo implements its own backup for what it owns. No central
backup service. See `DECISIONS.md` D4 for full rationale.

**Standard interface every railiance repo must provide:**

```bash
make backup    # encrypt + write to /opt/backup/railiance/<layer>/
make restore   # restore from most recent local backup
```

- Encryption: age, same key pair as SOPS secrets (`.sops.yaml` public key).
- Output: `/opt/backup/railiance/cluster/` on the server.

## What S2 (railiance-cluster) owns and must back up

| Asset | Why it matters |
|---|---|
| k3s etcd snapshots | Full cluster state: all workloads, configs, secrets |
| Helm release values | Runtime values not in git (any manually applied overrides) |
| kubeconfig | Admin access to the cluster |

**Not S2's responsibility:**

- Custodian State Hub DB → the-custodian owns this
- Operator workstation config (`.claude/`, `.gitconfig`) → operator's own concern
- Application data (Gitea repos, uploads) → S5 (railiance-apps) owns this
- PostgreSQL data volumes → S3 (railiance-platform) owns this

## Encryption

Reuse the age public key from `.sops.yaml`:

```bash
# Extract the first age recipient (age public keys start with "age1").
AGE_PUBLIC_KEY=$(grep -oE 'age1[0-9a-z]+' .sops.yaml | head -n 1)
# <assets> is a placeholder for the files or directories to archive.
tar -czf - <assets> | age -r "${AGE_PUBLIC_KEY}" -o backup.tar.gz.age
```

Decryption requires the private key at `~/.config/sops/age/keys.txt`
(same key used for `sops -d`). No additional key management needed.

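As a rough illustration of how the encryption step could be wrapped for reuse across artifacts, the sketch below pulls the recipient out of `.sops.yaml` and refuses to encrypt when none is found. The names `sops_age_recipient`, `encrypt_artifact`, and the `SOPS_FILE` variable are assumptions made for this example; they are not taken from the actual `railiance-backup-s2` script.

```bash
# Sketch only: helper names and SOPS_FILE are hypothetical.

# Print the first age recipient found in a SOPS config file.
# Matching on the "age1" prefix also copes with list-style YAML entries.
sops_age_recipient() {
  local sops_file="${1:-.sops.yaml}"
  grep -oE 'age1[0-9a-z]+' "$sops_file" | head -n 1 || true
}

# Encrypt one file with age; fail loudly if no recipient is configured.
encrypt_artifact() {
  local src="$1" dest="$2" recipient
  recipient="$(sops_age_recipient "${SOPS_FILE:-.sops.yaml}")"
  if [ -z "$recipient" ]; then
    echo "no age recipient found in ${SOPS_FILE:-.sops.yaml}" >&2
    return 1
  fi
  age -r "$recipient" -o "$dest" "$src"
}
```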
## Extension Point EP-RAIL-005

Once all five OAS layers implement this interface, the custodian can
orchestrate a full-stack backup with:

```bash
for repo in railiance-infra railiance-cluster railiance-platform \
            railiance-enablement railiance-apps; do
  make -C ~/$repo backup
done
```

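The plain loop reports only the exit status of the last repo. One possible variant, sketched here under assumptions, keeps going after a failure and prints a per-layer result so the caller can see which layer broke. `backup_all_layers`, `REPO_ROOT`, and `MAKE_CMD` are hypothetical names introduced for the example; the sketch still relies on nothing beyond `make backup`.

```bash
# Sketch only: REPO_ROOT and MAKE_CMD are hypothetical knobs.

# Run "make backup" in every layer repo, continue on failure,
# and return non-zero if any layer failed.
backup_all_layers() {
  local root="${REPO_ROOT:-$HOME}" make_cmd="${MAKE_CMD:-make}" failed="" repo
  for repo in railiance-infra railiance-cluster railiance-platform \
              railiance-enablement railiance-apps; do
    if "$make_cmd" -C "$root/$repo" backup; then
      echo "OK   $repo"
    else
      echo "FAIL $repo"
      failed="$failed $repo"
    fi
  done
  [ -z "$failed" ]
}
```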
No special protocol needed, just the standard interface.

---

## Tasks

### T01 — Define backup directory and encryption wrapper

```task
id: T01
status: done
priority: high
state_hub_task_id: "4526a842-ea31-4874-9231-92ab556cfe7b"
```

Create `tools/cmd/railiance-backup-s2` (replacing the old `railiance-backup`):

- Backup dir: `/opt/backup/railiance/cluster/` (create with `mkdir -p`)
- Encrypt each artifact with age using public key from `.sops.yaml`
- Write timestamp-named files: `etcd-<ts>.snap.age`, `helm-values-<ts>.tar.gz.age`, `kubeconfig-<ts>.yaml.age`
- Keep last 7 of each type
- Write `.last-backup` stamp
- Exit 0 on success, non-zero on any failure
- No network required

Also remove the old `tools/cmd/railiance-backup`, which backed up the Docker-based
custodian DB (wrong scope, not applicable to this server).

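The retention and stamp requirements in the list above could be met with something like the following sketch. It assumes `<ts>` is a fixed-width UTC timestamp (for example `20260326T020000Z`), so that file names sort chronologically; the function names `prune_backups` and `write_stamp` are hypothetical.

```bash
# Sketch only: function names are hypothetical.

# Keep the newest $3 (default 7) files matching "$1/$2-*.age", delete the rest.
prune_backups() {
  local dir="$1" prefix="$2" keep="${3:-7}"
  set -- "$dir/$prefix"-*.age   # glob expands in sorted order, oldest first
  [ -e "$1" ] || return 0       # no match: the pattern stays literal
  while [ "$#" -gt "$keep" ]; do
    rm -f -- "$1"
    shift
  done
}

# Record the completion time of the latest successful run.
write_stamp() {
  date -u +%Y-%m-%dT%H:%M:%SZ > "$1/.last-backup"
}
```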
**Done when:** `make backup` runs on COULOMBCORE without error and files
appear in `/opt/backup/railiance/cluster/`.

---

### T02 — Back up k3s state (SQLite hot backup)

```task
id: T02
status: done
priority: high
state_hub_task_id: "a6313e06-1976-46a7-8e31-df4eb2eca880"
```

When k3s runs with embedded etcd, it has built-in snapshot support:

```bash
sudo k3s etcd-snapshot save --name "railiance-$(date -u +%Y%m%dT%H%M%SZ)"
# Default location: /var/lib/rancher/k3s/server/db/snapshots/
```

Add to the backup script: take a fresh snapshot, encrypt with age,
copy to `/opt/backup/railiance/cluster/`.

> **Note: verify etcd is in use before implementing.**
> `k3s etcd-snapshot` only works with the embedded etcd datastore, that is,
> when k3s was started with `--cluster-init`. Without it, k3s uses SQLite and
> this command will fail.
> Verify first: `sudo k3s etcd-snapshot ls 2>&1`

> **Note: sudo required.** etcd snapshot requires root. See T06 for how
> this is resolved (backup runs under root's crontab).

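The task title refers to a SQLite hot backup, while the steps above describe the etcd path. For a k3s server on the default SQLite datastore, one possible approach is sketched below. It assumes the `sqlite3` CLI is installed on the server and that the database sits at the k3s default location, `/var/lib/rancher/k3s/server/db/state.db`. The function names and the `K3S_DB_DIR` variable are hypothetical, and this is not necessarily what `railiance-backup-s2` does.

```bash
# Sketch only: function names are hypothetical; paths are the k3s defaults.
K3S_DB_DIR="${K3S_DB_DIR:-/var/lib/rancher/k3s/server/db}"

# Print "etcd" or "sqlite" depending on what exists in the k3s db directory.
# etcd is checked first in case an old state.db was left behind.
detect_datastore() {
  if [ -d "$K3S_DB_DIR/etcd" ]; then
    echo etcd
  elif [ -f "$K3S_DB_DIR/state.db" ]; then
    echo sqlite
  else
    echo unknown
    return 1
  fi
}

# Copy the live database with sqlite3's online ".backup" command,
# which yields a consistent copy while k3s keeps writing.
sqlite_hot_backup() {
  local dest="$1"
  sqlite3 "$K3S_DB_DIR/state.db" ".backup '$dest'"
}
```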
**Done when:** backup includes a current etcd snapshot.

---

### T03 — Back up Helm release values

```task
id: T03
status: done
priority: medium
state_hub_task_id: "05d42a55-921f-4aa7-bb76-e8af9c7e0ac3"
```

Capture current runtime Helm values for all releases:

```bash
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
out_dir=$(mktemp -d)   # staging directory: one values file per release
# ".[] |" is needed so .name and .namespace are read from each release object
helm list -A -o json | \
  jq -r '.[] | "\(.name) \(.namespace)"' | \
  while read -r name ns; do
    helm get values "$name" -n "$ns" -o yaml > "$out_dir/$ns-$name.yaml"
  done
```

Tar and age-encrypt into `helm-values-<ts>.tar.gz.age`.

> **Note: kubeconfig permissions.** `/etc/rancher/k3s/k3s.yaml` is root-readable
> only by default. The backup script must either run as root (see T06) or k3s
> must be configured with `--write-kubeconfig-mode=644`. Running as root
> (via root crontab) is the chosen approach, so no config change is needed.

**Done when:** backup includes a snapshot of all Helm release values.

---

### T04 — Back up kubeconfig

```task
id: T04
status: done
priority: medium
state_hub_task_id: "08233868-d522-4117-bc4e-6c0f52545665"
```

Age-encrypt `~/.kube/config-hosteurope` (or `/etc/rancher/k3s/k3s.yaml`)
into `kubeconfig-<ts>.yaml.age` in the backup directory.

**Done when:** backup includes the encrypted kubeconfig.

---

### T05 — make restore target

```task
id: T05
status: done
priority: medium
state_hub_task_id: "2d5acff7-4a4e-4ddd-ad06-08237ad3dac8"
```

Add `tools/cmd/railiance-restore-s2` that decrypts and lists available
backups, with guided restore for the etcd snapshot case.

Restore of etcd from snapshot:

```bash
# Stop the k3s service first (sudo systemctl stop k3s), then:
sudo k3s server --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/<name>
# When the reset has finished, start k3s again (sudo systemctl start k3s).
```

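A minimal sketch of the listing half of such a restore tool, assuming the file naming from T01 and fixed-width UTC timestamps that sort chronologically. `latest_backup`, `list_backups`, and the `BACKUP_DIR` variable are hypothetical names, not the actual contents of `railiance-restore-s2`.

```bash
# Sketch only: helper names are hypothetical.
BACKUP_DIR="${BACKUP_DIR:-/opt/backup/railiance/cluster}"

# Print the newest backup file for a given prefix, or fail if none exist.
latest_backup() {
  local prefix="$1"
  set -- "$BACKUP_DIR/$prefix"-*.age   # sorted oldest first
  [ -e "$1" ] || return 1
  shift $(( $# - 1 ))
  echo "$1"
}

# Print every backup grouped by type; the newest of each type is marked "*".
list_backups() {
  local prefix newest f
  for prefix in etcd helm-values kubeconfig; do
    newest="$(latest_backup "$prefix")" || { echo "$prefix: none"; continue; }
    echo "$prefix:"
    for f in "$BACKUP_DIR/$prefix"-*.age; do
      if [ "$f" = "$newest" ]; then echo "  * ${f##*/}"; else echo "    ${f##*/}"; fi
    done
  done
}
```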
**Done when:** `make restore` prints available backups and a restore guide.

---

### T06 — Install cron job and run restore drill

```task
id: T06
status: done
priority: medium
state_hub_task_id: "f8e4a094-c367-40eb-b895-da17bc144b07"
```

#### Solving the sudo problem

The backup script needs root for two reasons:

- `k3s etcd-snapshot save` requires root
- `/etc/rancher/k3s/k3s.yaml` (kubeconfig) is root-readable only

**Solution: run the cron under root's crontab.**

This is a common pattern for system-level backup jobs. It avoids a
proliferating sudoers whitelist (one entry per command, brittle to maintain)
and matches how tools like `rsnapshot`, `bacula`, and `borgbackup` are typically
run in production. The backup writes to `/opt/backup/`, which is root-owned anyway.

Install the cron as root:

```bash
sudo crontab -e
# Add:
0 2 * * * make -C /home/tegwick/railiance-cluster backup >> /opt/backup/railiance/cluster/backup.log 2>&1
```

Note: use the absolute path to the repo. In root's crontab, `~` expands to
root's home directory (typically `/root`), not to `/home/tegwick`.

Verify it is installed:

```bash
sudo crontab -l | grep railiance
```

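For unattended setup, the interactive `crontab -e` step could be replaced by an idempotent install. The sketch below filters an existing crontab so the backup entry appears exactly once, which makes it safe to re-run. `merge_cron_line` and `CRON_LINE` are hypothetical names; the entry itself is the one shown above.

```bash
# Sketch only: merge_cron_line and CRON_LINE are hypothetical names.
CRON_LINE='0 2 * * * make -C /home/tegwick/railiance-cluster backup >> /opt/backup/railiance/cluster/backup.log 2>&1'

# Read an existing crontab on stdin and print it with CRON_LINE present exactly once.
merge_cron_line() {
  grep -vxF -- "$CRON_LINE" || true   # drop exact duplicates; tolerate empty input
  printf '%s\n' "$CRON_LINE"
}

# Usage (as root). "crontab -l" fails when no crontab exists yet, hence the fallback:
#   { crontab -l 2>/dev/null || true; } | merge_cron_line | crontab -
```

Filtering on an exact whole-line match keeps unrelated entries untouched and avoids accumulating duplicates when provisioning is repeated.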
#### Restore drill

Once T01–T04 are done, run a decrypt-and-verify drill:

```bash
set -o pipefail   # make the pipeline fail if age fails, not only if file fails

# Pick the newest encrypted etcd snapshot (timestamped names sort chronologically)
latest=$(ls /opt/backup/railiance/cluster/etcd-*.snap.age | sort | tail -n 1)

# Decrypt it and inspect the result with file; record the drill only on success.
# The log directory is root-owned, so append via sudo tee instead of a plain redirect.
sudo age -d -i ~/.config/sops/age/keys.txt "$latest" | file - \
  && echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) restore drill OK" \
     | sudo tee -a /opt/backup/railiance/cluster/restore-drill.log
```

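The drill could also be wrapped so that the log records failures as well as successes, which makes a missed or broken drill visible later. This is a sketch under assumptions: `run_restore_drill`, `BACKUP_DIR`, and `AGE_KEY_FILE` are hypothetical names, and it only checks that the newest backup decrypts, not that k3s can restore from it.

```bash
# Sketch only: names are hypothetical; paths follow the drill above.
BACKUP_DIR="${BACKUP_DIR:-/opt/backup/railiance/cluster}"
AGE_KEY_FILE="${AGE_KEY_FILE:-$HOME/.config/sops/age/keys.txt}"

# Decrypt the newest etcd backup to /dev/null and log OK or FAILED accordingly.
run_restore_drill() {
  local newest result=FAILED
  set -- "$BACKUP_DIR"/etcd-*.snap.age   # sorted oldest first
  shift $(( $# - 1 ))
  newest="$1"
  if [ -e "$newest" ] && age -d -i "$AGE_KEY_FILE" "$newest" > /dev/null; then
    result=OK
  fi
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) restore drill $result ${newest##*/}" \
    >> "$BACKUP_DIR/restore-drill.log"
  [ "$result" = OK ]
}
```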
**Done when:** cron installed under root, drill completes without error,
log entry written.

---

## References

- Decision D4: Integrated backup per capability (`DECISIONS.md`)
- Decision D2: Nextcloud as optional offsite extension (still valid, not a requirement)
- OAS Q3: Operability & Resilience
- Extension point EP-RAIL-005: Custodian full-stack backup orchestration
- k3s etcd snapshots: https://docs.k3s.io/datastore/backup-restore