From ab5b12334dd22f0cbfd3f398b01e9babca69cd8d Mon Sep 17 00:00:00 2001
From: tegwick
Date: Thu, 26 Feb 2026 00:08:14 +0100
Subject: [PATCH] docs: backup and restore runbook

Covers encryption (age key management), what is protected, backup command,
daily cron, preflight checks, full step-by-step restore procedure, restore
drill instructions, and two extension points (EP-RAIL-003 git mirrors,
EP-RAIL-004 offsite secondary copy).

Co-Authored-By: Claude Sonnet 4.6
---
 docs/backup-restore.md | 293 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 293 insertions(+)
 create mode 100644 docs/backup-restore.md

diff --git a/docs/backup-restore.md b/docs/backup-restore.md
new file mode 100644
index 0000000..440c4b3
--- /dev/null
+++ b/docs/backup-restore.md
@@ -0,0 +1,293 @@
+# Backup & Restore
+
+Covers the current single-server development environment. This is the safety
+net that must be operational before any infrastructure migration work begins.
+
+---
+
+## What is protected
+
+| Asset | Location | Risk without backup |
+|---|---|---|
+| Custodian State Hub database | Docker volume `infra_pg_data` | Total loss of all workstreams, tasks, decisions, progress history |
+| Claude config & memory | `~/.claude/`, `~/.claude.json` | Loss of project memory, MCP registration, settings |
+| Git config | `~/.gitconfig` | Minor friction, recoverable |
+| age private key | `~/.config/age/railiance-backup.key` | Cannot decrypt any existing backup |
+
+Git repositories are **not** included in the backup; they are protected by
+being pushed to Gitea remotes. The preflight check verifies this.
+
+---
+
+## Encryption
+
+All backups are encrypted with [age](https://age-encryption.org/) before
+leaving the machine.
+
+**Key locations:**
+
+| Copy | Location | Purpose |
+|---|---|---|
+| Operational | `~/.config/age/railiance-backup.key` | Used locally for restore drills |
+| Recovery | Password manager | Used when the machine is gone |
+
+Permissions: `chmod 700 ~/.config/age && chmod 600 ~/.config/age/railiance-backup.key`
+
+The public key is hardcoded in `tools/cmd/railiance-backup`. To retrieve it:
+
+```bash
+grep "public key" ~/.config/age/railiance-backup.key
+```
+
+> **The password manager copy is the only key that survives hardware failure.**
+> Verify it is there before doing any infrastructure work.
+
+---
+
+## Destination
+
+Backups are uploaded to a Nextcloud file drop (upload-only, no credentials
+required to write, cannot be read back without Nextcloud admin access). The
+endpoint URL is stored locally in `wiki/260225-backup-dropoff-link.txt`
+(gitignored).
+
+Uploads use a direct HTTP PUT via curl; rclone is not used because Nextcloud
+file drop links only permit PUT requests.
+
+A local cache of the last 7 backups of each type is kept in
+`~/.cache/railiance/backups/`.
+
+---
+
+## Running a backup
+
+```bash
+bin/railiance backup
+```
+
+This runs two steps:
+
+1. **PostgreSQL dump**: `pg_dump` from the running `infra-postgres-1`
+   container, piped through `age`, uploaded as `db-.sql.age`.
+
+2. **Config snapshot**: tar of `~/.claude/`, `~/.claude.json`, `~/.gitconfig`,
+   encrypted with `age`, uploaded as `config-.tar.gz.age`.
+
+A `.last-backup` stamp is written to the local cache; the preflight check
+reads this to verify freshness.
+
+**Automated:** a cron job runs the backup daily at 02:00:
+
+```
+0 2 * * * /home/worsch/railiance-bootstrap/bin/railiance backup >> ~/.cache/railiance/backup.log 2>&1
+```
+
+---
+
+## Pre-migration preflight
+
+Before touching any infrastructure, run:
+
+```bash
+bin/railiance preflight
+```
+
+Checks performed:
+
+| Check | Pass condition |
+|---|---|
+| DB backup freshness | Latest `db-*.sql.age` is less than 24 hours old |
+| Config backup freshness | Latest `config-*.tar.gz.age` is less than 24 hours old |
+| Git repos clean | No uncommitted changes in any tracked repo |
+| Git repos pushed | No unpushed commits in any tracked repo |
+| age key present | `~/.config/age/railiance-backup.key` exists |
+
+Exit 0 = safe to proceed. Exit 1 = do not proceed.
+
+---
+
+## Restore procedure
+
+Use this when recovering from hardware failure, WSL2 corruption, or
+accidental data loss. Work through it in order; each step depends on
+the previous one.
+
+### Step 0: Prerequisites
+
+On a fresh Ubuntu / WSL2 instance, install the required tools:
+
+```bash
+sudo apt-get update && sudo apt-get install -y \
+  git curl docker.io age postgresql-client
+```
+
+Start Docker:
+
+```bash
+sudo service docker start
+```
+
+### Step 1: Retrieve the age private key
+
+Copy the private key from your password manager onto the machine:
+
+```bash
+mkdir -p ~/.config/age
+# paste the key content:
+cat > ~/.config/age/railiance-backup.key
+# (paste, then Ctrl-D)
+chmod 700 ~/.config/age && chmod 600 ~/.config/age/railiance-backup.key
+```
+
+### Step 2: Get the backup files
+
+Download the most recent backup files from Nextcloud (ask the Nextcloud admin
+for read access), or copy them from `~/.cache/railiance/backups/` if a copy of
+the local cache survived, for example on a secondary machine.
+
+Files needed:
+- `db-.sql.age`
+- `config-.tar.gz.age`
+
+### Step 3: Restore PostgreSQL
+
+Start a fresh postgres container (this needs the `the-custodian` checkout, so run its `git clone` from Step 5 first if `~/the-custodian` does not exist yet):
+
+```bash
+cd ~/the-custodian/state-hub
+cp infra/.env.example infra/.env  # fill in POSTGRES_PASSWORD
+make db
+```
+
+Decrypt and restore the database dump:
+
+```bash
+age --decrypt \
+  -i ~/.config/age/railiance-backup.key \
+  db-.sql.age \
+  | docker exec -i infra-postgres-1 psql -U custodian custodian
+```
+
+Verify row counts look sane:
+
+```bash
+docker exec infra-postgres-1 psql -U custodian custodian \
+  -c "SELECT schemaname, relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC;"
+```
+
+### Step 4: Restore config files
+
+```bash
+age --decrypt \
+  -i ~/.config/age/railiance-backup.key \
+  config-.tar.gz.age \
+  | tar -xz -C ~
+```
+
+This restores `~/.claude/`, `~/.claude.json`, and `~/.gitconfig`.
+
+### Step 5: Clone repositories
+
+```bash
+git clone /coulomb/railiance-bootstrap.git ~/railiance-bootstrap
+git clone /tegwick/the-custodian.git ~/the-custodian
+git clone /coulomb/markitect_project.git ~/markitect_project
+# ... remaining repos as needed
+```
+
+If Gitea is offline, clone from the local bare mirrors in
+`~/.cache/railiance/git-mirrors/` if they were set up (see T3).
+
+### Step 6: Register the MCP server
+
+```bash
+cd ~/the-custodian/state-hub
+python3 scripts/patch_mcp_cwd.py
+```
+
+### Step 7: Start the state hub and verify
+
+```bash
+cd ~/the-custodian/state-hub
+make api &   # in background or a separate terminal
+```
+
+Smoke test to confirm the state hub is responding:
+
+```bash
+curl -sf http://127.0.0.1:8000/state/summary | python3 -m json.tool | head -20
+```
+
+And from Claude Code, confirm MCP tools are available:
+
+```
+bin/railiance preflight
+```
+
+---
+
+## Restore drill (validation)
+
+Run a restore drill before doing any major infrastructure work. The drill
+validates that the procedure above actually works without waiting for a real
+disaster.
+
+A minimal drill that does not require a second machine:
+
+```bash
+# 1. Start a second postgres container on a different port
+docker run -d --name restore-test \
+  -e POSTGRES_DB=custodian \
+  -e POSTGRES_USER=custodian \
+  -e POSTGRES_PASSWORD=testpass \
+  -p 5433:5432 \
+  postgres:16-alpine
+
+# 2. Decrypt the newest cached dump and restore it (give postgres a few seconds to accept connections first)
+age --decrypt \
+  -i ~/.config/age/railiance-backup.key \
+  "$(ls ~/.cache/railiance/backups/db-*.sql.age | sort -r | head -1)" \
+  | docker exec -i restore-test psql -U custodian custodian
+
+# 3. Check row counts
+docker exec restore-test psql -U custodian custodian \
+  -c "SELECT relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC;"
+
+# 4. Clean up
+docker rm -f restore-test
+```
+
+Record the drill completion with a dated log entry (preflight checks for this in T5):
+
+```bash
+echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) restore drill OK" \
+  >> ~/.cache/railiance/restore-drill.log
+```
+
+---
+
+## Extension Points
+
+### EP-RAIL-003: Git bare-repo mirrors as secondary restore source
+
+The current design relies on Gitea remotes for git repo recovery. If Gitea
+is offline during a migration, repos can only be recovered from local working
+copies (if they survive). A secondary bare-repo mirror (e.g., in a local
+directory or on a NAS) would make git recovery independent of Gitea availability.
+
+**Trigger:** when the Gitea server becomes a SPOF for restore operations (e.g.,
+during ThreePhoenix migration work on the server that runs Gitea).
+**Constraint:** mirrors must be updated on the same schedule as the DB backup;
+stale mirrors provide false confidence.
+
+### EP-RAIL-004: Offsite secondary copy of encrypted backups
+
+The current Nextcloud file drop is the only offsite copy. A second destination
+(rclone to an S3-compatible store, or rsync to a NAS) would protect against
+Nextcloud unavailability.
+
+**Trigger:** when Nextcloud is not available or is itself hosted on the same
+infrastructure being migrated.
+**Constraint:** the second destination must also be write-only or similarly
+access-controlled; duplicating to a readable location without additional
+access controls widens the blast radius of a credential leak.
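
Review note on the preflight freshness rule: the runbook states that the latest `db-*.sql.age` and `config-*.tar.gz.age` must be less than 24 hours old. That rule could be checked roughly as follows. This is a minimal sketch under stated assumptions, not the code in `bin/railiance`: the function name `backup_is_fresh` is hypothetical, it assumes GNU `find`, and it judges freshness by file modification time in the local cache rather than by the contents of the `.last-backup` stamp, whose format the patch does not describe.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not taken from tools/cmd/railiance-backup).
# Succeeds when at least one file matching PATTERN directly inside DIR was
# modified less than MAX_AGE_MIN minutes ago (default 1440 = 24 hours).
# If any match is that recent, the newest match is too.
backup_is_fresh() {
  local dir="$1" pattern="$2" max_age_min="${3:-1440}"
  # `-mmin -N` matches files modified less than N minutes ago;
  # `-print -quit` stops at the first hit. A missing DIR yields no output.
  [ -n "$(find "$dir" -maxdepth 1 -type f -name "$pattern" \
            -mmin "-$max_age_min" -print -quit 2>/dev/null)" ]
}

# Usage: apply the rule to both backup types in the local cache.
cache="$HOME/.cache/railiance/backups"
if backup_is_fresh "$cache" 'db-*.sql.age' &&
   backup_is_fresh "$cache" 'config-*.tar.gz.age'; then
  echo "backups fresh"
else
  echo "backups stale or missing"
fi
```

A check based on modification time is only as trustworthy as the cache itself; copying or touching the files resets it, which is one reason a separate stamp file can be the more robust signal.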