docs: backup and restore runbook

Covers encryption (age key management), what is protected, backup
command, daily cron, preflight checks, full step-by-step restore
procedure, restore drill instructions, and two extension points
(EP-RAIL-003 git mirrors, EP-RAIL-004 offsite secondary copy).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-26 00:08:14 +01:00
parent 4381a079a2
commit ab5b12334d

docs/backup-restore.md Normal file

@@ -0,0 +1,293 @@
# Backup & Restore
Covers the current single-server development environment. This is the safety
net that must be operational before any infrastructure migration work begins.
---
## What is protected
| Asset | Location | Risk without backup |
|---|---|---|
| Custodian State Hub database | Docker volume `infra_pg_data` | Total loss of all workstreams, tasks, decisions, progress history |
| Claude config & memory | `~/.claude/`, `~/.claude.json` | Loss of project memory, MCP registration, settings |
| Git config | `~/.gitconfig` | Minor friction, recoverable |
| age private key | `~/.config/age/railiance-backup.key` | Cannot decrypt any existing backup |
Git repositories are **not** included in the backup — they are protected by
being pushed to Gitea remotes. The preflight check verifies this.
---
## Encryption
All backups are encrypted with [age](https://age-encryption.org/) before
leaving the machine.
**Key locations:**
| Copy | Location | Purpose |
|---|---|---|
| Operational | `~/.config/age/railiance-backup.key` | Used locally for restore drills |
| Recovery | Password manager | Used when the machine is gone |
Permissions: `chmod 700 ~/.config/age && chmod 600 ~/.config/age/railiance-backup.key`
The public key is hardcoded in `tools/cmd/railiance-backup`. To retrieve it:
```bash
grep "public key" ~/.config/age/railiance-backup.key
```
> **The password manager copy is the only key that survives hardware failure.**
> Verify it is there before doing any infrastructure work.
---
## Destination
Backups are uploaded to a Nextcloud file drop (upload-only, no credentials
required to write, cannot be read back without Nextcloud admin access). The
endpoint URL is stored locally in `wiki/260225-backup-dropoff-link.txt`
(gitignored).
Uploads use a direct HTTP PUT via curl — rclone is not used because Nextcloud
file drop links only permit PUT requests.
A local cache of the last 7 backups of each type is kept in
`~/.cache/railiance/backups/`.
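
The upload and cache rotation described above could look roughly like the sketch below. The function names `upload_backup` and `prune_cache`, and the way the file name is appended to the drop URL, are assumptions for illustration and are not taken from `tools/cmd/railiance-backup`; the real share link may need a different URL layout or an auth header.

```bash
#!/usr/bin/env bash
# Sketch only: names and URL layout are hypothetical.

# Upload one file with an HTTP PUT (curl -T issues a PUT for http/https URLs).
# $1 = file to upload, $2 = drop URL read from the gitignored link file.
upload_backup() {
  local file=$1 url=$2
  curl -fsS -T "$file" "${url%/}/$(basename "$file")"
}

# Keep only the newest $3 files named "$2-*" in directory $1.
# Assumes the timestamps in the file names sort lexicographically.
prune_cache() {
  local dir=$1 prefix=$2 keep=$3
  ls -1 "$dir"/"$prefix"-* 2>/dev/null | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r old; do rm -f -- "$old"; done
}
```

Calling `prune_cache ~/.cache/railiance/backups db 7` after each run would leave the seven newest `db-*` files and not touch the `config-*` files, which get their own call.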
---
## Running a backup
```bash
bin/railiance backup
```
This runs two steps:
1. **PostgreSQL dump:** `pg_dump` from the running `infra-postgres-1`
container, piped through `age`, uploaded as `db-<timestamp>.sql.age`.
2. **Config snapshot:** tar of `~/.claude/`, `~/.claude.json`, `~/.gitconfig`,
encrypted with `age`, uploaded as `config-<timestamp>.tar.gz.age`.
A `.last-backup` stamp is written to the local cache; the preflight check
reads this to verify freshness.
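
As a rough illustration of the two steps and the stamp, the sketch below shows one way the pipelines could be wired. It is not the actual `railiance backup` implementation: the `AGE_RECIPIENT` variable, the function names, the timestamp format, and the stamp file content are assumptions, and the database and role name `custodian` are borrowed from the restore commands later in this runbook.

```bash
#!/usr/bin/env bash
set -o pipefail  # a failing pg_dump or tar must fail the whole pipeline

# Assumption: AGE_RECIPIENT holds the age public key (age1...).
CACHE_DIR="${CACHE_DIR:-$HOME/.cache/railiance/backups}"

# $1 = type (db|config), $2 = extension (sql|tar.gz).
# The timestamp format is a guess at what <timestamp> looks like.
backup_name() {
  printf '%s-%s.%s.age\n' "$1" "$(date -u +%Y%m%dT%H%M%SZ)" "$2"
}

# Dump the database from the running container and encrypt it. $1 = output file.
dump_db() {
  docker exec infra-postgres-1 pg_dump -U custodian custodian |
    age -r "$AGE_RECIPIENT" -o "$1"
}

# Tar the config files relative to $HOME and encrypt them. $1 = output file.
snapshot_config() {
  tar -czf - -C "$HOME" .claude .claude.json .gitconfig |
    age -r "$AGE_RECIPIENT" -o "$1"
}

# Record the time of the last successful backup in the local cache.
write_stamp() {
  mkdir -p "$CACHE_DIR" && date -u +%Y-%m-%dT%H:%M:%SZ > "$CACHE_DIR/.last-backup"
}
```

Encrypting inside the pipeline means no plaintext dump is ever written to disk, which matches the statement that backups are encrypted before leaving the machine.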
**Automated:** a cron job runs the backup daily at 02:00:
```
0 2 * * * /home/worsch/railiance-bootstrap/bin/railiance backup >> ~/.cache/railiance/backup.log 2>&1
```
---
## Pre-migration preflight
Before touching any infrastructure, run:
```bash
bin/railiance preflight
```
Checks performed:
| Check | Pass condition |
|---|---|
| DB backup freshness | Latest `db-*.sql.age` is less than 24 hours old |
| Config backup freshness | Latest `config-*.tar.gz.age` is less than 24 hours old |
| Git repos clean | No uncommitted changes in any tracked repo |
| Git repos pushed | No unpushed commits in any tracked repo |
| age key present | `~/.config/age/railiance-backup.key` exists |
Exit 0 = safe to proceed. Exit 1 = do not proceed.
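
The checks in the table can be approximated with standard tools. The sketch below is an assumption about how such checks might be written, not the code behind `bin/railiance preflight`; the function names are hypothetical, and 1440 minutes stands in for the 24-hour limit.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helpers mirroring the preflight table.

# Succeeds if a file matching pattern $2 in directory $1 was modified
# less than $3 minutes ago (GNU find).
has_fresh_backup() {
  local dir=$1 pattern=$2 max_min=$3
  [ -n "$(find "$dir" -maxdepth 1 -name "$pattern" -mmin "-$max_min" -print -quit)" ]
}

# Succeeds if the working tree at $1 has no uncommitted or untracked changes.
repo_is_clean() {
  [ -z "$(git -C "$1" status --porcelain)" ]
}

# Succeeds if no local branch at $1 has commits missing from every remote.
repo_is_pushed() {
  [ -z "$(git -C "$1" log --branches --not --remotes --oneline)" ]
}

# $1 = backup directory, remaining arguments = repositories to check.
# Returns 0 only if every check passes.
preflight_sketch() {
  local dir=$1 rc=0 repo; shift
  has_fresh_backup "$dir" 'db-*.sql.age' 1440 || { echo "FAIL: db backup stale"; rc=1; }
  has_fresh_backup "$dir" 'config-*.tar.gz.age' 1440 || { echo "FAIL: config backup stale"; rc=1; }
  [ -f "$HOME/.config/age/railiance-backup.key" ] || { echo "FAIL: age key missing"; rc=1; }
  for repo in "$@"; do
    repo_is_clean "$repo" || { echo "FAIL: $repo has uncommitted changes"; rc=1; }
    repo_is_pushed "$repo" || { echo "FAIL: $repo has unpushed commits"; rc=1; }
  done
  return "$rc"
}
```

Running every check before returning, instead of stopping at the first failure, gives the operator the full list of problems in one pass.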
---
## Restore procedure
Use this when recovering from hardware failure, WSL2 corruption, or
accidental data loss. Work through it in order — each step depends on
the previous one.
### Step 0 — Prerequisites
On a fresh Ubuntu / WSL2 instance, install the required tools:
```bash
sudo apt-get update && sudo apt-get install -y \
  git curl make docker.io age postgresql-client
```
Start Docker:
```bash
sudo service docker start
```
### Step 1 — Retrieve the age private key
Copy the private key from your password manager onto the new machine:
```bash
mkdir -p ~/.config/age
# paste the key content:
cat > ~/.config/age/railiance-backup.key
# (paste, then Ctrl-D)
chmod 700 ~/.config/age && chmod 600 ~/.config/age/railiance-backup.key
```
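
A pasted key is easy to truncate, so a quick sanity check before continuing may save time. The helper below is a hypothetical sketch; it only checks the file mode and that a line starting with `AGE-SECRET-KEY-1` (the prefix of age identity lines) is present.

```bash
#!/usr/bin/env bash
# Sketch only: check_age_key is a hypothetical helper, not part of railiance.

# $1 = path to the key file. Uses GNU stat to read the octal mode.
check_age_key() {
  local key=$1
  [ -f "$key" ] || { echo "missing: $key" >&2; return 1; }
  [ "$(stat -c %a "$key")" = "600" ] || { echo "mode is not 600: $key" >&2; return 1; }
  grep -q '^AGE-SECRET-KEY-1' "$key" || { echo "no age secret key in: $key" >&2; return 1; }
}
```

After `check_age_key ~/.config/age/railiance-backup.key` succeeds, `age-keygen -y ~/.config/age/railiance-backup.key` prints the matching public key, which can be compared with the one hardcoded in `tools/cmd/railiance-backup`.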
### Step 2 — Get the backup files
Download the most recent backup files from Nextcloud. The file drop is
upload-only, so ask the Nextcloud admin for read access. If a copy of the local
cache `~/.cache/railiance/backups/` survived, for example on a secondary
machine, take the files from there instead.
Files needed:
- `db-<timestamp>.sql.age`
- `config-<timestamp>.tar.gz.age`
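
When several backups are available, the newest file of each type can be picked by name. This is a minimal sketch assuming the timestamps in the file names sort lexicographically; `latest_backup` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch only: prints the newest file in directory $1 matching glob $2,
# or nothing if no file matches.
latest_backup() {
  local dir=$1 pattern=$2
  find "$dir" -maxdepth 1 -type f -name "$pattern" | sort -r | head -n 1
}
```

For example, `latest_backup ~/.cache/railiance/backups 'db-*.sql.age'` would print the path to use in Step 3.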
### Step 3 — Restore PostgreSQL
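Start a fresh postgres container. This needs a checkout of `the-custodian`, so
clone it first (see Step 5) if `~/the-custodian` does not exist yet: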
```bash
cd ~/the-custodian/state-hub
cp infra/.env.example infra/.env # fill in POSTGRES_PASSWORD
make db
```
Decrypt and restore the database dump:
```bash
age --decrypt \
-i ~/.config/age/railiance-backup.key \
db-<timestamp>.sql.age \
| docker exec -i infra-postgres-1 psql -U custodian custodian
```
Verify row counts look sane:
```bash
docker exec infra-postgres-1 psql -U custodian custodian \
-c "SELECT schemaname, tablename, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC;"
```
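
Before piping a decrypted dump into `psql`, it can help to confirm that decryption produced a plain-text dump at all. The sketch below relies on the header comment that `pg_dump` writes at the top of plain-format output; `looks_like_pg_dump` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch only: reads a stream on stdin and succeeds if one of the first
# five lines contains the pg_dump plain-format header.
looks_like_pg_dump() {
  head -n 5 | grep -q 'PostgreSQL database dump'
}

# Example (same key and file name placeholders as in the step above):
#   age --decrypt -i ~/.config/age/railiance-backup.key db-<timestamp>.sql.age \
#     | looks_like_pg_dump && echo "dump header found"
```

This only inspects the first lines, so it catches a wrong key or a wrong file but says nothing about whether the dump is complete.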
### Step 4 — Restore config files
```bash
age --decrypt \
-i ~/.config/age/railiance-backup.key \
config-<timestamp>.tar.gz.age \
| tar -xz -C ~
```
This restores `~/.claude/`, `~/.claude.json`, and `~/.gitconfig`.
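
Since extraction writes straight into the home directory, listing the archive first is a cheap safeguard. The sketch below assumes the archive stores paths relative to the home directory (an assumption, the backup tool may store them differently); `archive_lists` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch only: $1 = decrypted .tar.gz file, remaining arguments = names that
# must appear at the top level of the archive (as a file or a directory).
archive_lists() {
  local archive=$1 listing name entry found; shift
  listing=$(tar -tzf "$archive") || return 1
  for name in "$@"; do
    found=0
    while IFS= read -r entry; do
      entry=${entry#./}   # tolerate a leading "./" in stored paths
      case $entry in
        "$name"|"$name"/*) found=1; break ;;
      esac
    done <<<"$listing"
    [ "$found" -eq 1 ] || { echo "missing from archive: $name" >&2; return 1; }
  done
}
```

One way to use it is to decrypt to a temporary file with `age --decrypt -i ~/.config/age/railiance-backup.key -o /tmp/config.tar.gz config-<timestamp>.tar.gz.age`, run `archive_lists /tmp/config.tar.gz .claude .claude.json .gitconfig`, and delete the temporary file afterwards.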
### Step 5 — Clone repositories
```bash
git clone <gitea-url>/coulomb/railiance-bootstrap.git ~/railiance-bootstrap
git clone <gitea-url>/tegwick/the-custodian.git ~/the-custodian
git clone <gitea-url>/coulomb/markitect_project.git ~/markitect_project
# ... remaining repos as needed
```
If Gitea is offline, clone from the local bare mirrors in
`~/.cache/railiance/git-mirrors/`, provided they were set up (see T3 and
EP-RAIL-003 below).
### Step 6 — Register the MCP server
```bash
cd ~/the-custodian/state-hub
python3 scripts/patch_mcp_cwd.py
```
### Step 7 — Start the state hub and verify
```bash
cd ~/the-custodian/state-hub
make api & # in background or a separate terminal
```
Smoke test — confirm state hub is responding:
```bash
curl -sf http://127.0.0.1:8000/state/summary | python3 -m json.tool | head -20
```
And from Claude Code, confirm MCP tools are available:
```
bin/railiance preflight
```
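
Because `make api &` returns before the API is listening, the smoke test can fail if it runs immediately. A small retry wrapper is one way around that; `retry` below is a hypothetical helper, and the attempt count and delay are arbitrary.

```bash
#!/usr/bin/env bash
# Sketch only: run a command until it succeeds or the attempts run out.
# $1 = maximum attempts, $2 = delay between attempts in seconds,
# remaining arguments = the command to run.
retry() {
  local attempts=$1 delay=$2 i; shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    [ "$i" -lt "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Example: wait up to about 30 seconds for the state hub to answer.
#   retry 30 1 curl -sf -o /dev/null http://127.0.0.1:8000/state/summary
```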
---
## Restore drill (validation)
Run a restore drill before doing any major infrastructure work. The drill
validates that the procedure above actually works without waiting for a real
disaster.
A minimal drill that does not require a second machine:
```bash
# 1. Start a second postgres container on a different port
docker run -d --name restore-test \
-e POSTGRES_DB=custodian \
-e POSTGRES_USER=custodian \
-e POSTGRES_PASSWORD=testpass \
-p 5433:5432 \
postgres:16-alpine
# 2. Wait until the server accepts TCP connections, then decrypt and restore
until docker exec restore-test pg_isready -q -h 127.0.0.1 -U custodian; do sleep 1; done
latest_db=$(ls ~/.cache/railiance/backups/db-*.sql.age | sort -r | head -1)
age --decrypt \
  -i ~/.config/age/railiance-backup.key \
  "$latest_db" \
  | docker exec -i restore-test psql -U custodian custodian
# 3. Check row counts
docker exec restore-test psql -U custodian custodian \
-c "SELECT tablename, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC;"
# 4. Clean up
docker rm -f restore-test
```
Record the drill completion by appending a dated line to the drill log (preflight checks for this in T5):
```bash
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) restore drill OK" \
>> ~/.cache/railiance/restore-drill.log
```
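
A check on the age of the last drill could read the log written above. The sketch below is an assumption about how such a check might look, not the T5 implementation; `last_drill_age_days` is a hypothetical helper and it relies on GNU `date -d` to parse the timestamp.

```bash
#!/usr/bin/env bash
# Sketch only: prints the age in whole days of the most recent
# "restore drill OK" entry in log file $1; fails if there is none.
last_drill_age_days() {
  local log=$1 stamp drill_ts now_ts
  stamp=$(grep 'restore drill OK' "$log" 2>/dev/null | tail -n 1 | cut -d' ' -f1)
  [ -n "$stamp" ] || return 1
  drill_ts=$(date -u -d "$stamp" +%s) || return 1
  now_ts=$(date -u +%s)
  echo $(( (now_ts - drill_ts) / 86400 ))
}
```

A caller could then compare the result with a threshold of its choosing, for example failing when the last drill is more than 30 days old.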
---
## Extension Points
### EP-RAIL-003 — Git bare-repo mirrors as secondary restore source
The current design relies on Gitea remotes for git repo recovery. If Gitea
is offline during a migration, repos can only be recovered from local working
copies (if they survive). A secondary bare-repo mirror (e.g., in a local
directory or on a NAS) would make git recovery independent of Gitea availability.
**Trigger:** when the Gitea server becomes a SPOF for restore operations (e.g.,
during ThreePhoenix migration work on the server that runs Gitea).
**Constraint:** mirrors must be updated on the same schedule as the DB backup;
stale mirrors provide false confidence.
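
A mirror of this kind could be kept with plain git commands. The sketch below is one possible shape for this extension point, not an existing tool; `sync_mirror` is a hypothetical helper, and the mirror location is whatever the caller passes in.

```bash
#!/usr/bin/env bash
# Sketch only: create a bare mirror on first use, refresh it afterwards.
# $1 = remote URL or local path, $2 = mirror directory.
sync_mirror() {
  local url=$1 dir=$2
  if [ -d "$dir" ]; then
    # Fetch all refs and drop refs that were deleted on the remote.
    git -C "$dir" remote update --prune
  else
    git clone --mirror "$url" "$dir"
  fi
}
```

Calling it from the same cron entry as the backup, for instance `sync_mirror <gitea-url>/coulomb/railiance-bootstrap.git ~/.cache/railiance/git-mirrors/railiance-bootstrap.git`, would keep the mirrors on the same schedule as the DB backup, which is what the constraint above asks for.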
### EP-RAIL-004 — Offsite secondary copy of encrypted backups
The current Nextcloud file drop is the only offsite copy. A second destination
(rclone to an S3-compatible store, or rsync to a NAS) would protect against
Nextcloud unavailability.
**Trigger:** when Nextcloud is not available or is itself hosted on the same
infrastructure being migrated.
**Constraint:** the second destination must also be write-only or similarly
access-controlled; duplicating to a readable location without additional
access controls widens the blast radius of a credential leak.