feat(backup): revise WP-0004 — integrated backup per capability (D4)

WP-0004 rewritten: scope narrowed to S2-owned assets (etcd snapshots,
Helm values, kubeconfig). No external dependencies. age encryption
reuses SOPS key pair. Output to /opt/backup/railiance/cluster/.

DECISIONS.md D4: integrated backup per capability, not centralized.
EP-RAIL-005 registered in state hub: custodian orchestration deferred
until all layers implement the standard interface.

The old monolithic backup (custodian DB + operator config) was not S2's
concern and has been removed from this workplan scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-10 17:43:30 +01:00

id	type	title	domain	repo	status	owner	topic_slug	state_hub_workstream_id	created	updated
RAIL-BS-WP-0004	workplan	Integrated Backup — S2 Kubernetes Runtime Layer	railiance	railiance-cluster	active	tegwick	railiance	7e8b0c20-51eb-40c9-9e3b-85dd380d7625	2026-02-25	2026-03-10

Integrated Backup — S2 Kubernetes Runtime Layer

Goal

Implement the Q3 (Operability & Resilience) integrated backup for railiance-cluster (S2). The backup covers what S2 owns (the Kubernetes runtime state), is encrypted with age, and is written to a local directory on the server. It requires no external dependencies.

Architecture (Decision D4)

Each railiance repo implements its own backup for what it owns. No central backup service. See DECISIONS.md D4 for full rationale.

Standard interface every railiance repo must provide:

make backup   # encrypt + write to /opt/backup/railiance/<layer>/
make restore  # restore from most recent local backup

Encryption: age, same key pair as SOPS secrets (.sops.yaml public key). Output: /opt/backup/railiance/cluster/ on the server.

What S2 (railiance-cluster) owns and must back up

Asset	Why it matters
k3s etcd snapshots	Full cluster state — all workloads, configs, secrets
Helm release values	Runtime values not in git (any manually applied overrides)
kubeconfig	Admin access to the cluster

Not S2's responsibility:

Custodian State Hub DB → the-custodian owns this
Operator workstation config (.claude/, .gitconfig) → operator's own concern
Application data (Gitea repos, uploads) → S5 (railiance-apps) owns this
PostgreSQL data volumes → S3 (railiance-platform) owns this

Encryption

Reuse the age public key from .sops.yaml:

AGE_PUBLIC_KEY=$(grep 'age:' .sops.yaml | awk '{print $2}')
tar -czf - <assets> | age -r "${AGE_PUBLIC_KEY}" -o backup.tar.gz.age

Decryption requires the private key at ~/.config/sops/age/keys.txt (same key used for sops -d). No additional key management needed.
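The key lookup can be made a little more defensive. The sketch below is one possible approach, not the project's actual script: `sops_age_recipient` is a hypothetical helper name, and it assumes recipients in .sops.yaml are standard age public keys (the `age1` prefix followed by lowercase letters and digits). It matches the key itself rather than a YAML field position, so it also copes with folded or list-style `age:` entries, and it fails loudly when no key is found.

```shell
# Sketch: resolve the first age recipient listed in a SOPS config.
# Assumption: recipients are standard public keys of the form age1<lowercase alnum>.
# sops_age_recipient is a hypothetical helper name, not an existing tool.
sops_age_recipient() {
  config=${1:-.sops.yaml}
  # -o prints only the match, -w requires whole-word matches, -E enables "+"
  key=$(grep -owE 'age1[a-z0-9]+' "$config" 2>/dev/null | head -n 1)
  if [ -z "$key" ]; then
    echo "no age recipient found in $config" >&2
    return 1
  fi
  printf '%s\n' "$key"
}
```

A backup script could then start with `AGE_PUBLIC_KEY=$(sops_age_recipient .sops.yaml) || exit 1`, so that a missing key aborts the run instead of calling age with an empty recipient.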

Extension Point EP-RAIL-005

Once all five OAS layers implement this interface, the custodian can orchestrate a full-stack backup with:

for repo in railiance-infra railiance-cluster railiance-platform \
            railiance-enablement railiance-apps; do
  make -C ~/$repo backup
done

No special protocol needed — just the standard interface.
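Because the orchestration is only a loop over the standard interface, the one thing worth adding is failure reporting. The following is a minimal sketch under stated assumptions: `backup_all_layers` is a hypothetical helper name, the repo list is the one from the loop above, and the checkouts are assumed to live under a single base directory (defaulting to `$HOME`).

```shell
# Sketch: run `make backup` in every layer repo and report which ones failed.
# backup_all_layers is a hypothetical helper; $1 is the directory holding the
# repo checkouts (assumed to be $HOME when omitted).
backup_all_layers() {
  base=${1:-$HOME}
  failed=""
  for repo in railiance-infra railiance-cluster railiance-platform \
              railiance-enablement railiance-apps; do
    # Keep going after a failure so one broken layer does not block the rest
    if ! make -C "$base/$repo" backup; then
      failed="$failed $repo"
    fi
  done
  if [ -n "$failed" ]; then
    echo "backup failed for:$failed" >&2
    return 1
  fi
}
```

Continuing after a failed layer is a design choice for this sketch: each layer's backup is independent, so a partial full-stack backup is still more useful than none.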

Tasks

T01 — Define backup directory and encryption wrapper

id: T01
status: todo
priority: high
state_hub_task_id: "4526a842-ea31-4874-9231-92ab556cfe7b"

Create tools/cmd/railiance-backup-s2 (replacing the old railiance-backup):

Backup dir: /opt/backup/railiance/cluster/ (create with mkdir -p)
Encrypt each artifact with age using public key from .sops.yaml
Write timestamp-named files: etcd-<ts>.snap.age, helm-values-<ts>.tar.gz.age, kubeconfig-<ts>.yaml.age
Keep last 7 of each type
Write .last-backup stamp
Exit 0 on success, non-zero on any failure
No network required

Done when: make backup runs on COULOMBCORE without error and files appear in /opt/backup/railiance/cluster/.
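The naming and retention rules above can be sketched as two small shell functions. This is an illustration, not the final railiance-backup-s2: `backup_name` and `prune_backups` are hypothetical helper names, and the sketch assumes timestamps use the sortable UTC form `%Y%m%dT%H%M%SZ`, so that sorting file names lexically also sorts them chronologically.

```shell
# Sketch of the naming and retention rules.
# Assumptions: timestamps use %Y%m%dT%H%M%SZ (UTC), file names contain no
# newlines, and backup_name / prune_backups are hypothetical helper names.
backup_name() {
  # $1 = prefix (etcd, helm-values, kubeconfig), $2 = suffix (e.g. snap.age)
  printf '%s-%s.%s\n' "$1" "$(date -u +%Y%m%dT%H%M%SZ)" "$2"
}

prune_backups() {
  # $1 = backup dir, $2 = prefix, $3 = number of newest files to keep
  dir=$1
  prefix=$2
  keep=${3:-7}
  # Newest first; everything after the first $keep entries is deleted
  ls "$dir" | grep "^$prefix-" | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
      rm -f -- "$dir/$old"
    done
}
```

After a successful run, the script would call `prune_backups /opt/backup/railiance/cluster etcd 7` (and likewise for `helm-values` and `kubeconfig`) and then write the stamp, for example with `date -u +%Y%m%dT%H%M%SZ > /opt/backup/railiance/cluster/.last-backup`. The stamp's content is not specified in this workplan, so a timestamp is only one option.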

T02 — Back up k3s etcd snapshots

id: T02
status: todo
priority: high
state_hub_task_id: "a6313e06-1976-46a7-8e31-df4eb2eca880"

k3s has built-in etcd snapshot support:

sudo k3s etcd-snapshot save --name railiance-$(date -u +%Y%m%dT%H%M%SZ)
# Default location: /var/lib/rancher/k3s/server/db/snapshots/

Add to the backup script: take a fresh snapshot, encrypt with age, copy to /opt/backup/railiance/cluster/.

Done when: backup includes a current etcd snapshot.
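One practical detail for the script: k3s typically appends the node name and a timestamp to the `--name` value, so the exact snapshot file name is not known in advance. A minimal sketch for locating the fresh snapshot follows. `newest_snapshot` is a hypothetical helper, and it assumes the `railiance-<UTC timestamp>` naming from the command above (lexical order then matches creation order) and a caller that can read the snapshot directory, which usually means root.

```shell
# Sketch: find the most recent railiance-* snapshot written by k3s.
# Assumptions: snapshots are named railiance-<%Y%m%dT%H%M%SZ>..., the caller
# may read the directory, and newest_snapshot is a hypothetical helper name.
newest_snapshot() {
  dir=${1:-/var/lib/rancher/k3s/server/db/snapshots}
  newest=$(ls "$dir" 2>/dev/null | grep '^railiance-' | sort | tail -n 1)
  if [ -z "$newest" ]; then
    echo "no railiance snapshot in $dir" >&2
    return 1
  fi
  printf '%s/%s\n' "$dir" "$newest"
}

# Example wiring (needs k3s and age, so it is left commented out here):
#   sudo k3s etcd-snapshot save --name "railiance-$(date -u +%Y%m%dT%H%M%SZ)"
#   snap=$(newest_snapshot) || exit 1
#   age -r "$AGE_PUBLIC_KEY" -o /opt/backup/railiance/cluster/etcd-<ts>.snap.age "$snap"
```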

T03 — Back up Helm release values

id: T03
status: todo
priority: medium
state_hub_task_id: "05d42a55-921f-4aa7-bb76-e8af9c7e0ac3"

Capture current runtime Helm values for all releases:

# List every release as "name namespace", then print its user-supplied values
helm list -A -o json | jq -r '.[] | "\(.name) \(.namespace)"' | \
  while read -r name ns; do
    helm get values "$name" -n "$ns" -o yaml
  done

Tar and age-encrypt into helm-values-<ts>.tar.gz.age.

Done when: backup includes a snapshot of all Helm release values.
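To end up with a tarball, each release's values need to land in a file of their own first. The sketch below shows one way to do that. `dump_helm_values` is a hypothetical helper, the `<namespace>__<release>.yaml` file naming is purely illustrative, and stdin is assumed to carry one "name namespace" pair per line.

```shell
# Sketch: write one YAML file per Helm release into a staging directory.
# Assumptions: dump_helm_values is a hypothetical helper, the file naming is
# illustrative, and stdin carries one "name namespace" pair per line.
dump_helm_values() {
  out_dir=$1
  mkdir -p "$out_dir" || return 1
  while read -r name ns; do
    # </dev/null keeps helm from consuming the release list on stdin
    helm get values "$name" -n "$ns" -o yaml \
      > "$out_dir/${ns}__${name}.yaml" </dev/null || return 1
  done
}
```

The staging directory can then be archived and encrypted in one pipeline, for example `tar -czf - -C "$staging" . | age -r "$AGE_PUBLIC_KEY" -o helm-values-<ts>.tar.gz.age`, and removed afterwards so no plaintext values remain on disk.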

T04 — Back up kubeconfig

id: T04
status: todo
priority: medium
state_hub_task_id: "08233868-d522-4117-bc4e-6c0f52545665"

Age-encrypt ~/.kube/config-hosteurope (or /etc/rancher/k3s/k3s.yaml) into kubeconfig-<ts>.yaml.age in the backup directory.

Done when: backup includes the encrypted kubeconfig.
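Since two source paths are possible, the script has to decide which one to encrypt. A small sketch of that choice, assuming the first readable candidate should win (`pick_kubeconfig` is a hypothetical helper name):

```shell
# Sketch: print the first readable kubeconfig among the given candidates.
# pick_kubeconfig is a hypothetical helper name.
pick_kubeconfig() {
  for candidate in "$@"; do
    if [ -r "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "no readable kubeconfig among: $*" >&2
  return 1
}
```

Usage would look like `src=$(pick_kubeconfig "$HOME/.kube/config-hosteurope" /etc/rancher/k3s/k3s.yaml) || exit 1`, followed by `age -r "$AGE_PUBLIC_KEY" -o kubeconfig-<ts>.yaml.age "$src"`.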

T05 — make restore target

id: T05
status: todo
priority: medium
state_hub_task_id: "2d5acff7-4a4e-4ddd-ad06-08237ad3dac8"

Add tools/cmd/railiance-restore-s2, which lists the available backups, decrypts the selected one, and provides a guided restore for the etcd snapshot case.

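Restore of etcd from snapshot (decrypt the .age file to a plain snapshot first, and stop k3s before the reset):

sudo systemctl stop k3s   # assumes k3s runs as a systemd service
sudo k3s server --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/<name>
sudo systemctl start k3s  # restart once the reset has completed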

Done when: make restore prints available backups and a restore guide.
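The listing part of the restore tool could look like the sketch below. It is an illustration only: `latest_backup` and `list_backups` are hypothetical helper names, the three prefixes are the file types named in T01, and the sketch assumes sortable UTC timestamps in the file names so that the lexically last file is the most recent one.

```shell
# Sketch: list available backups per type and show the most recent one.
# Assumptions: files follow <prefix>-<sortable UTC timestamp>.<suffix>.age;
# latest_backup and list_backups are hypothetical helper names.
latest_backup() {
  # $1 = backup dir, $2 = prefix (etcd, helm-values, kubeconfig)
  ls "$1" 2>/dev/null | grep "^$2-.*\.age\$" | sort | tail -n 1
}

list_backups() {
  dir=${1:-/opt/backup/railiance/cluster}
  for prefix in etcd helm-values kubeconfig; do
    latest=$(latest_backup "$dir" "$prefix")
    # grep -c prints 0 but exits non-zero when nothing matches, hence || true
    count=$(ls "$dir" 2>/dev/null | grep -c "^$prefix-.*\.age\$" || true)
    printf '%-12s %2d backup(s), latest: %s\n' "$prefix" "$count" "${latest:-none}"
  done
}
```

The guided restore can then take the file reported by `latest_backup "$dir" etcd`, decrypt it with `age -d -i ~/.config/sops/age/keys.txt`, and print the cluster-reset instructions for the operator.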

T06 — Install cron job and run restore drill

id: T06
status: todo
priority: medium
state_hub_task_id: "f8e4a094-c367-40eb-b895-da17bc144b07"

Install the daily cron job and verify that decryption works:

# Install cron on COULOMBCORE
(crontab -l 2>/dev/null; echo "0 2 * * * make -C ~/railiance-cluster backup >> /opt/backup/railiance/cluster/backup.log 2>&1") | crontab -

# Drill: decrypt etcd snapshot and verify it's readable
age -d -i ~/.config/sops/age/keys.txt \
  /opt/backup/railiance/cluster/etcd-<latest>.snap.age | file -

Done when: cron installed, drill completes without error, log entry written.
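Appending to the crontab with `echo` adds a duplicate entry every time the install step is repeated. One way to make the install idempotent is sketched here. `merge_cron_line` is a hypothetical helper that reads the existing crontab on stdin and prints it with exactly one copy of the entry; the cron line itself is the one from this task.

```shell
# Sketch: idempotent crontab install.
# merge_cron_line is a hypothetical helper: it drops any existing exact copies
# of the entry from stdin, then appends the entry once.
merge_cron_line() {
  entry=$1
  # -v inverts, -x matches whole lines, -F treats the entry as a fixed string;
  # grep exits non-zero when it prints nothing, which is fine here
  grep -vxF -- "$entry" || true
  printf '%s\n' "$entry"
}

CRON_LINE='0 2 * * * make -C ~/railiance-cluster backup >> /opt/backup/railiance/cluster/backup.log 2>&1'

# Install on COULOMBCORE (left commented out so the sketch has no side effects):
#   crontab -l 2>/dev/null | merge_cron_line "$CRON_LINE" | crontab -
```

Running the install line twice leaves a single entry in place, and unrelated crontab lines are preserved.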

References

Decision D4: Integrated backup per capability (DECISIONS.md)
Decision D2: Nextcloud as optional offsite extension (still valid, not a requirement)
OAS Q3: Operability & Resilience
Extension point EP-RAIL-005: Custodian full-stack backup orchestration
k3s etcd snapshots: https://docs.k3s.io/datastore/backup-restore
