feat(backup): revise WP-0004 — integrated backup per capability (D4)
Some checks failed
railiance-tests / smoke (push) Has been cancelled
WP-0004 rewritten: scope narrowed to S2-owned assets (etcd snapshots, Helm values, kubeconfig). No external dependencies. age encryption reuses SOPS key pair. Output to /opt/backup/railiance/cluster/.

DECISIONS.md D4: integrated backup per capability, not centralized.

EP-RAIL-005 registered in state hub: custodian orchestration deferred until all layers implement the standard interface.

The old monolithic backup (custodian DB + operator config) was not S2's concern and has been removed from this workplan scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
50
DECISIONS.md
@@ -64,3 +64,53 @@ has been tested before it matters.
 See: `workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md`

 ---
+
+## D4 — Integrated backup per capability, not centralized backup service
+
+**Date:** 2026-03-10
+**Decided by:** Tegwick
+
+**Decision:** Each railiance repo implements its own backup for the
+infrastructure it owns. There is no central backup service.
+
+**Rationale:**
+
+A centralized backup service (e.g., in railiance-enablement) couples every
+stack layer to a shared component. As each layer matures and evolves at its
+own pace, this coupling repeatedly breaks the backup. A service that breaks
+when the thing it is supposed to protect is being changed is not a safety net.
+
+Integrated backup per repo means:
+- The backup for S1 lives in railiance-infra and knows exactly what S1 owns
+- The backup for S2 lives in railiance-cluster and knows what S2 owns
+- Each repo can be backed up independently, without any other repo, service,
+  or network connection being available
+- Each backup implementation matures with its layer
+
+**Standard interface (Q3 Operability & Resilience):**
+
+Every railiance repo that manages persistent state must provide:
+
+1. `make backup` — creates an encrypted backup of what this layer owns,
+   writes to a local directory on the server (`/opt/backup/railiance/<layer>/`)
+2. `make restore` — restores from the most recent local backup
+3. Encryption: age, reusing the same key pair used for SOPS secrets
+4. No runtime dependencies: must work without custodian, state-hub, network
+   file share, or any other external service being available
+
+**Extension point EP-RAIL-005:** The custodian can provide orchestration
+guidelines. If each repo follows the standard interface, the custodian can
+call `make backup` across the full stack in dependency order (S1 → S5)
+and aggregate results. This is deliberately deferred — integrate first,
+orchestrate later.
+
+**What changes from the previous approach (D2):**
+
+D2 established Nextcloud as the backup destination for a single monolithic
+script in railiance-cluster. That script backed up the wrong things (custodian
+DB and operator config — neither of which are S2 concerns). The Nextcloud
+upload becomes an optional extension, not a requirement.
+
+See: `workplans/RAIL-BS-WP-0004-safety-net.md`
+
+---
@@ -1,7 +1,7 @@
 ---
 id: RAIL-BS-WP-0004
 type: workplan
-title: "Current-Environment Safety Net"
+title: "Integrated Backup — S2 Kubernetes Runtime Layer"
 domain: railiance
 repo: railiance-cluster
 status: active
@@ -12,118 +12,162 @@ created: "2026-02-25"
 updated: "2026-03-10"
 ---

-# Current-Environment Safety Net
+# Integrated Backup — S2 Kubernetes Runtime Layer

 ## Goal

-Ensure backup and disaster recovery for the current single-server environment
-is operational and tested before any ThreePhoenix infrastructure migration
-work begins. Aligned to OAS Stack S2 (railiance-cluster owns backup tooling).
+Implement the Q3 (Operability & Resilience) integrated backup for
+railiance-cluster (S2). Backs up what S2 owns — the Kubernetes runtime state —
+encrypted with age, written to a local directory on the server. No external
+dependencies required.

-## Context
+## Architecture (Decision D4)

-The backup toolchain lives in `tools/cmd/railiance-backup` and
-`tools/cmd/railiance-preflight`, dispatched via `bin/railiance`. It protects:
+Each railiance repo implements its own backup for what it owns. No central
+backup service. See `DECISIONS.md` D4 for full rationale.

-| Asset | Method | Risk without backup |
-|---|---|---|
-| Custodian State Hub DB | pg_dump → age → Nextcloud | Total loss of workstreams, decisions, history |
-| Claude config + memory | tar → age → Nextcloud | Loss of MCP registration, project memory |
-| Git repos | Gitea remotes | SPOF: Gitea runs on the same server being migrated |
+**Standard interface every railiance repo must provide:**

-Decision D2: Nextcloud upload-only file drop as backup destination.
+```bash
+make backup   # encrypt + write to /opt/backup/railiance/<layer>/
+make restore  # restore from most recent local backup
+```

-## OAS Alignment
+Encryption: age, same key pair as SOPS secrets (`.sops.yaml` public key).
+Output: `/opt/backup/railiance/cluster/` on the server.

-Per ADR-003, backup tooling lives in **S2 (railiance-cluster)**. The preflight
-check covers all five OAS stack repos:
+## What S2 (railiance-cluster) owns and must back up

-| Repo | OAS Layer |
+| Asset | Why it matters |
 |---|---|
-| railiance-infra | S1 — OS & Provisioning |
-| railiance-cluster | S2 — Kubernetes Runtime |
-| railiance-platform | S3 — Platform Services |
-| railiance-enablement | S4 — Developer Tooling |
-| railiance-apps | S5 — Workloads & Endpoints |
+| k3s etcd snapshots | Full cluster state — all workloads, configs, secrets |
+| Helm release values | Runtime values not in git (any manually applied overrides) |
+| kubeconfig | Admin access to the cluster |

-Plus cross-domain repos: the-custodian, markitect_project, activity-core,
-net-kingdom, issue-facade, binect-js, kaizen-agentic.
+**Not S2's responsibility:**
+- Custodian State Hub DB → the-custodian owns this
+- Operator workstation config (`.claude/`, `.gitconfig`) → operator's own concern
+- Application data (Gitea repos, uploads) → S5 (railiance-apps) owns this
+- PostgreSQL data volumes → S3 (railiance-platform) owns this

-## Boundary
+## Encryption

-Backup execution: this repo (`bin/railiance backup`).
-Backup destination: Nextcloud file drop (URL in `~/.config/railiance/nc-upload-url` or hardcoded).
-Restore procedure: `docs/backup-restore.md`.
+Reuse the age public key from `.sops.yaml`:
+
+```bash
+AGE_PUBLIC_KEY=$(grep 'age:' .sops.yaml | awk '{print $2}')
+tar -czf - <assets> | age -r "${AGE_PUBLIC_KEY}" -o backup.tar.gz.age
+```
+
+Decryption requires the private key at `~/.config/sops/age/keys.txt`
+(same key used for `sops -d`). No additional key management needed.
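The `grep 'age:' | awk '{print $2}'` extraction in the Encryption section assumes the recipient is the second whitespace-separated field of a single `age:` line. A minimal sketch of a more layout-tolerant lookup follows, assuming the recipients in `.sops.yaml` are age X25519 public keys (which begin with `age1`). The function name `sops_age_recipient` is hypothetical and not part of the repo.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the repo): print the first age recipient
# found in a SOPS config file. Assumes recipients start with "age1".
sops_age_recipient() {
  local sops_file="${1:-.sops.yaml}"
  local key
  # grep -o prints only the matched key, so the YAML layout (inline value,
  # list item, folded scalar) does not matter; head keeps the first match.
  key=$(grep -oE 'age1[0-9a-z]+' "$sops_file" | head -n 1)
  if [ -z "$key" ]; then
    echo "no age recipient found in $sops_file" >&2
    return 1
  fi
  printf '%s\n' "$key"
}
```

It could then feed the encryption step, for example `age -r "$(sops_age_recipient)" -o backup.tar.gz.age`. If the file lists several recipients, only the first is used, which may or may not be the intended one.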
+
+## Extension Point EP-RAIL-005
+
+Once all five OAS layers implement this interface, the custodian can
+orchestrate a full-stack backup with:
+
+```bash
+for repo in railiance-infra railiance-cluster railiance-platform \
+            railiance-enablement railiance-apps; do
+  make -C ~/$repo backup
+done
+```
+
+No special protocol needed — just the standard interface.
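D4 describes the custodian calling `make backup` in dependency order and aggregating results. The plain `for` loop above does not record which layer failed. A minimal sketch of one way to aggregate follows, assuming that continuing after a failed layer is acceptable. The function name `run_stack_backup` and the `MAKE` override are hypothetical, introduced only for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical orchestration helper for EP-RAIL-005 (not from the repo).
# Runs `make backup` in each repo in the order given, keeps going after a
# failure, and reports the failed layers. MAKE may be overridden for testing.
run_stack_backup() {
  local base="$1"; shift
  local repo
  local -a failed=()
  for repo in "$@"; do
    if "${MAKE:-make}" -C "$base/$repo" backup; then
      echo "ok: $repo"
    else
      echo "FAILED: $repo" >&2
      failed+=("$repo")
    fi
  done
  if [ "${#failed[@]}" -gt 0 ]; then
    echo "backup failed for: ${failed[*]}" >&2
    return 1
  fi
  return 0
}
```

A call such as `run_stack_backup "$HOME" railiance-infra railiance-cluster railiance-platform railiance-enablement railiance-apps` would preserve the S1 to S5 order. Whether a failed lower layer should instead abort the run is a design choice this sketch leaves open.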

 ---

 ## Tasks

-### T01 — Update preflight repo list to OAS 5-repo layout
+### T01 — Define backup directory and encryption wrapper

 ```task
 id: T01
-status: done
+status: todo
 priority: high
 state_hub_task_id: "4526a842-ea31-4874-9231-92ab556cfe7b"
 ```

-Update `tools/cmd/railiance-preflight` REPOS array: remove `railiance-bootstrap`,
-add `railiance-infra`, `railiance-cluster`, `railiance-platform`,
-`railiance-enablement`, `railiance-apps`. Add all active project repos.
+Create `tools/cmd/railiance-backup-s2` (replacing the old `railiance-backup`):

-**Done when:** `bin/railiance preflight` checks all current repos.
+- Backup dir: `/opt/backup/railiance/cluster/` (create with `mkdir -p`)
+- Encrypt each artifact with age using public key from `.sops.yaml`
+- Write timestamp-named files: `etcd-<ts>.snap.age`, `helm-values-<ts>.tar.gz.age`, `kubeconfig-<ts>.yaml.age`
+- Keep last 7 of each type
+- Write `.last-backup` stamp
+- Exit 0 on success, non-zero on any failure
+- No network required
+
+**Done when:** `make backup` runs on COULOMBCORE without error and files
+appear in `/opt/backup/railiance/cluster/`.
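The "Keep last 7 of each type" rule in T01 can be sketched as a small pruning function. This is an illustration only. It assumes `<ts>` uses the same sortable UTC format as the T02 snapshot name (`%Y%m%dT%H%M%SZ`), so that sorting by file name is also sorting by age. The name `prune_backups` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical retention helper (not from the repo): keep the newest N files
# matching a glob pattern in a directory and delete the rest.
prune_backups() {
  local dir="$1" pattern="$2" keep="${3:-7}"
  local -a files=()
  local f i
  # Glob results come back sorted by name, which is chronological when the
  # names embed a YYYYmmddTHHMMSSZ timestamp.
  for f in "$dir"/$pattern; do
    if [ -e "$f" ]; then
      files+=("$f")
    fi
  done
  local excess=$(( ${#files[@]} - keep ))
  # Delete the oldest entries until only `keep` files remain.
  for (( i = 0; i < excess; i++ )); do
    rm -f -- "${files[i]}"
  done
  return 0
}
```

Called once per artifact type, for example `prune_backups /opt/backup/railiance/cluster 'etcd-*.snap.age' 7`, it leaves the other types untouched.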

 ---

-### T02 — Fix stale repo references in backup-restore.md
+### T02 — Back up k3s etcd snapshots

 ```task
 id: T02
-status: done
-priority: medium
+status: todo
+priority: high
 state_hub_task_id: "a6313e06-1976-46a7-8e31-df4eb2eca880"
 ```

-Update restore procedure: `railiance-bootstrap` → `railiance-cluster`,
-`railiance-hosts` → `railiance-infra`, add the three new OAS repos.
+k3s has built-in etcd snapshot support:

-**Done when:** doc accurately reflects the current 5-repo OAS stack.
+```bash
+sudo k3s etcd-snapshot save --name railiance-$(date -u +%Y%m%dT%H%M%SZ)
+# Default location: /var/lib/rancher/k3s/server/db/snapshots/
+```
+
+Add to the backup script: take a fresh snapshot, encrypt with age,
+copy to `/opt/backup/railiance/cluster/`.
+
+**Done when:** backup includes a current etcd snapshot.

 ---

-### T03 — Add make backup and make preflight targets
+### T03 — Back up Helm release values

 ```task
 id: T03
-status: done
+status: todo
 priority: medium
 state_hub_task_id: "05d42a55-921f-4aa7-bb76-e8af9c7e0ac3"
 ```

-Add to root Makefile so the safety net is discoverable from `make help`.
+Capture current runtime Helm values for all releases:

-**Done when:** `make backup` and `make preflight` both work.
+```bash
+# Iterate over the release array first; `.[].name + " " + .namespace`
+# would apply `.namespace` to the array itself and fail.
+helm list -A -o json | jq -r '.[] | "\(.name) \(.namespace)"' | \
+  while read -r name ns; do
+    helm get values "$name" -n "$ns" -o yaml
+  done
+```
+
+Tar and age-encrypt into `helm-values-<ts>.tar.gz.age`.
+
+**Done when:** backup includes a snapshot of all Helm release values.

 ---

-### T04 — Run current backup and verify upload
+### T04 — Back up kubeconfig

 ```task
 id: T04
-status: done
-priority: high
+status: todo
+priority: medium
 state_hub_task_id: "08233868-d522-4117-bc4e-6c0f52545665"
 ```

-Run `bin/railiance backup` and confirm both DB and config files appear
-in the Nextcloud file drop.
+Age-encrypt `~/.kube/config-hosteurope` (or `/etc/rancher/k3s/k3s.yaml`)
+into `kubeconfig-<ts>.yaml.age` in the backup directory.

-**Done when:** backup completes without error and `.last-backup` stamp is fresh.
+**Done when:** backup includes the encrypted kubeconfig.

 ---

-### T05 — Server backup: Gitea data and Zulip chat
+### T05 — make restore target

 ```task
 id: T05
@@ -132,33 +176,20 @@ priority: medium
 state_hub_task_id: "2d5acff7-4a4e-4ddd-ad06-08237ad3dac8"
 ```

-**Scope correction (2026-03-10):** The original task assumed the `railiance-backup`
-script in `tools/cmd/railiance-backup` applied here. It does not — that script
-is for a developer workstation (custodian DB in Docker + Claude config) and is
-unrelated to the server.
+Add `tools/cmd/railiance-restore-s2` that decrypts and lists available
+backups, with guided restore for the etcd snapshot case.

-The server's safety net must protect:
+Restore of etcd from snapshot:
+```bash
+sudo k3s server --cluster-reset \
+  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/<name>
+```

-| Asset | Method |
-|---|---|
-| Gitea repositories + DB | `k3s kubectl exec` into gitea pod → `gitea dump` |
-| Zulip chat data | Zulip's built-in export or volume snapshot |
-
-This work belongs in **railiance-infra** (S1 — OS & Provisioning layer) as an
-Ansible role or playbook, not here. A cron job on the server should call that
-script once it exists.
-
-**Do not** wire up a cron job that calls the existing `bin/railiance backup` —
-that script targets Docker containers that do not exist on this server.
-
-**Done when:**
-1. A backup playbook/role exists in `railiance-infra` covering Gitea + Zulip
-2. It is deployed via Ansible and a cron job on the server calls it daily
-3. At least one successful backup run is verified in the log
+**Done when:** `make restore` prints available backups and a restore guide.

 ---

-### T06 — Run restore drill
+### T06 — Install cron job and run restore drill

 ```task
 id: T06
@@ -167,16 +198,25 @@ priority: medium
 state_hub_task_id: "f8e4a094-c367-40eb-b895-da17bc144b07"
 ```

-Run the minimal restore drill from `docs/backup-restore.md` against the
-current backup. Record completion in `~/.cache/railiance/restore-drill.log`.
+Install the daily cron and verify decrypt works:

-**Done when:** drill exits 0 and log entry is written.
+```bash
+# Install cron on COULOMBCORE
+(crontab -l 2>/dev/null; echo "0 2 * * * make -C ~/railiance-cluster backup >> /opt/backup/railiance/cluster/backup.log 2>&1") | crontab -
+
+# Drill: decrypt etcd snapshot and verify it's readable
+age -d -i ~/.config/sops/age/keys.txt \
+  /opt/backup/railiance/cluster/etcd-<latest>.snap.age | file -
+```
+
+**Done when:** cron installed, drill completes without error, log entry written.
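The drill command in T06 leaves `etcd-<latest>.snap.age` as a placeholder. A minimal sketch of resolving it follows, under the same assumption as the T01 naming scheme: file names embed a sortable UTC timestamp, so the last name in sorted order is the newest backup. The function name `latest_backup` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the repo): print the newest file matching a
# glob pattern, judged by file name rather than mtime.
latest_backup() {
  local dir="$1" pattern="$2"
  local f latest=""
  # Glob results are sorted by name; the last existing match wins.
  for f in "$dir"/$pattern; do
    if [ -e "$f" ]; then
      latest="$f"
    fi
  done
  if [ -z "$latest" ]; then
    echo "no backup matching $pattern in $dir" >&2
    return 1
  fi
  printf '%s\n' "$latest"
}
```

The drill could then read `age -d -i ~/.config/sops/age/keys.txt "$(latest_backup /opt/backup/railiance/cluster 'etcd-*.snap.age')" | file -`. Selecting by name instead of modification time keeps the result stable if files are copied or touched later.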

 ---

 ## References

-- Decision D2: Nextcloud as backup destination (`DECISIONS.md`)
-- Backup tooling: `tools/cmd/railiance-backup`, `tools/cmd/railiance-preflight`
-- Restore procedure: `docs/backup-restore.md`
-- Extension points: EP-RAIL-003 (git bare mirrors), EP-RAIL-004 (secondary offsite copy)
+- Decision D4: Integrated backup per capability (`DECISIONS.md`)
+- Decision D2: Nextcloud as optional offsite extension (still valid, not a requirement)
+- OAS Q3: Operability & Resilience
+- Extension point EP-RAIL-005: Custodian full-stack backup orchestration
+- k3s etcd snapshots: https://docs.k3s.io/datastore/backup-restore