docs: backup and restore runbook
Covers encryption (age key management), what is protected, backup command, daily cron, preflight checks, full step-by-step restore procedure, restore drill instructions, and two extension points (EP-RAIL-003 git mirrors, EP-RAIL-004 offsite secondary copy). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
293
docs/backup-restore.md
Normal file
@@ -0,0 +1,293 @@
# Backup & Restore

Covers the current single-server development environment. This is the safety
net that must be operational before any infrastructure migration work begins.

---

## What is protected

| Asset | Location | Risk without backup |
|---|---|---|
| Custodian State Hub database | Docker volume `infra_pg_data` | Total loss of all workstreams, tasks, decisions, progress history |
| Claude config & memory | `~/.claude/`, `~/.claude.json` | Loss of project memory, MCP registration, settings |
| Git config | `~/.gitconfig` | Minor friction, recoverable |
| age private key | `~/.config/age/railiance-backup.key` | Cannot decrypt any existing backup |

Git repositories are **not** included in the backup; they are protected by
being pushed to Gitea remotes. The preflight check verifies this.

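
As an illustration of what "protected by being pushed" means in practice, the sketch below checks one repository for uncommitted changes and for commits that exist on no remote. The function name `repo_is_safe` is hypothetical, and the actual preflight implementation may perform this check differently.

```bash
# Return 0 only if the repo at $1 is clean and every local commit is on a remote.
repo_is_safe() {
    repo="$1"
    # Bail out early if this is not a git repository at all.
    git -C "$repo" rev-parse --git-dir >/dev/null 2>&1 || { echo "not a repo: $repo" >&2; return 1; }
    # Any --porcelain output means modified, staged, or untracked files.
    [ -z "$(git -C "$repo" status --porcelain)" ] || { echo "dirty: $repo" >&2; return 1; }
    # Commits reachable from a local branch but from no remote-tracking branch.
    [ -z "$(git -C "$repo" log --branches --not --remotes --oneline)" ] \
        || { echo "unpushed: $repo" >&2; return 1; }
}

# Example: repo_is_safe ~/the-custodian
```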

---

## Encryption

All backups are encrypted with [age](https://age-encryption.org/) before
leaving the machine.

**Key locations:**

| Copy | Location | Purpose |
|---|---|---|
| Operational | `~/.config/age/railiance-backup.key` | Used locally for restore drills |
| Recovery | Password manager | Used when the machine is gone |

Permissions: `chmod 700 ~/.config/age && chmod 600 ~/.config/age/railiance-backup.key`

The public key is hardcoded in `tools/cmd/railiance-backup`. It is also recorded
as a comment line in the key file, so it can be read back with:

```bash
grep "public key" ~/.config/age/railiance-backup.key
```
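
The permissions stated in this section can be verified rather than assumed. The following is a minimal sketch and not part of the `railiance` tool; the function name `check_key_perms` is hypothetical, and `stat -c` assumes GNU coreutils as found on Ubuntu and WSL2.

```bash
# Return 0 when the key directory is mode 700 and the key file is mode 600.
check_key_perms() {
    key="${1:-$HOME/.config/age/railiance-backup.key}"
    dir="$(dirname "$key")"
    # stat -c %a prints the octal mode (GNU stat; BSD/macOS stat uses -f instead).
    [ "$(stat -c %a "$dir")" = "700" ] || { echo "fix: chmod 700 $dir" >&2; return 1; }
    [ "$(stat -c %a "$key")" = "600" ] || { echo "fix: chmod 600 $key" >&2; return 1; }
}

# Example: check_key_perms && echo "key permissions OK"
```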

> **The password manager copy is the only key that survives hardware failure.**
> Verify it is there before doing any infrastructure work.

---

## Destination

Backups are uploaded to a Nextcloud file drop (upload-only, no credentials
required to write, cannot be read back without Nextcloud admin access). The
endpoint URL is stored locally in `wiki/260225-backup-dropoff-link.txt`
(gitignored).

Uploads use a direct HTTP PUT via curl; rclone is not used because Nextcloud
file drop links only permit PUT requests.

A local cache of the last 7 backups of each type is kept in
`~/.cache/railiance/backups/`.

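
To make the upload and retention behaviour concrete, here is a hedged sketch of a PUT upload and of pruning the cache to the newest 7 files per type. `upload_backup`, `prune_cache`, and the `DROP_URL` variable are illustrative names, and the URL layout of the file drop is an assumption; the real logic lives in `tools/cmd/railiance-backup` and may differ.

```bash
CACHE_DIR="${CACHE_DIR:-$HOME/.cache/railiance/backups}"

# Upload one file with HTTP PUT (curl --upload-file issues a PUT).
# DROP_URL is assumed to hold the endpoint read from the gitignored link file.
upload_backup() {
    file="$1"
    curl --fail --silent --show-error --upload-file "$file" \
        "${DROP_URL%/}/$(basename "$file")"
}

# Keep the newest $3 files named "$1-*$2" in CACHE_DIR and delete the rest.
# Relies on timestamps in the file names sorting chronologically.
prune_cache() {
    prefix="$1"; suffix="$2"; keep="${3:-7}"
    ls -1 "$CACHE_DIR"/"$prefix"-*"$suffix" 2>/dev/null | sort -r \
        | tail -n +"$((keep + 1))" | while IFS= read -r old; do
            rm -f -- "$old"
        done
}

# Example: prune_cache db .sql.age 7 && prune_cache config .tar.gz.age 7
```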

---

## Running a backup

```bash
bin/railiance backup
```

This runs two steps:

1. **PostgreSQL dump**: `pg_dump` from the running `infra-postgres-1`
   container, piped through `age`, uploaded as `db-<timestamp>.sql.age`.

2. **Config snapshot**: tar of `~/.claude/`, `~/.claude.json`, `~/.gitconfig`,
   encrypted with `age`, uploaded as `config-<timestamp>.tar.gz.age`.

A `.last-backup` stamp is written to the local cache; the preflight check
reads this to verify freshness.

**Automated:** a cron job runs the backup daily at 02:00:

```
0 2 * * * /home/worsch/railiance-bootstrap/bin/railiance backup >> ~/.cache/railiance/backup.log 2>&1
```
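
The two backup steps described in this section can be sketched roughly as follows. This is an illustration under assumptions, not the contents of `tools/cmd/railiance-backup`: the function names, the `AGE_RECIPIENT` variable, the timestamp format, and the `custodian` user and database names (borrowed from the restore commands) are all assumed.

```bash
CACHE_DIR="${CACHE_DIR:-$HOME/.cache/railiance/backups}"
AGE_RECIPIENT="${AGE_RECIPIENT:-}"   # assumed: the age public key (age1...)

# Assumed timestamp format: UTC and fixed-width, so names sort chronologically.
backup_stamp() { date -u +%Y%m%dT%H%M%SZ; }

# Step 1: dump the database from the running container and encrypt the stream.
# Under bash, add `set -o pipefail` so a failed pg_dump also fails this function.
dump_db() {
    mkdir -p "$CACHE_DIR"
    out="$CACHE_DIR/db-$(backup_stamp).sql.age"
    docker exec infra-postgres-1 pg_dump -U custodian custodian \
        | age -r "$AGE_RECIPIENT" -o "$out" && printf '%s\n' "$out"
}

# Step 2: tar the config paths relative to $HOME and encrypt the archive.
snapshot_config() {
    mkdir -p "$CACHE_DIR"
    out="$CACHE_DIR/config-$(backup_stamp).tar.gz.age"
    tar -czf - -C "$HOME" .claude .claude.json .gitconfig \
        | age -r "$AGE_RECIPIENT" -o "$out" && printf '%s\n' "$out"
}
```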

---

## Pre-migration preflight

Before touching any infrastructure, run:

```bash
bin/railiance preflight
```

Checks performed:

| Check | Pass condition |
|---|---|
| DB backup freshness | Latest `db-*.sql.age` is less than 24 hours old |
| Config backup freshness | Latest `config-*.tar.gz.age` is less than 24 hours old |
| Git repos clean | No uncommitted changes in any tracked repo |
| Git repos pushed | No unpushed commits in any tracked repo |
| age key present | `~/.config/age/railiance-backup.key` exists |

Exit 0 = safe to proceed. Exit 1 = do not proceed.

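
For readers who want to see what the freshness rows in the table amount to, here is a minimal sketch. It judges age by file modification time, which is an assumption (the real check reads the `.last-backup` stamp), and `backup_is_fresh` and `key_present` are hypothetical names.

```bash
CACHE_DIR="${CACHE_DIR:-$HOME/.cache/railiance/backups}"

# Succeed if at least one file matching the glob was modified in the last 24 hours.
backup_is_fresh() {
    pattern="$1"   # e.g. 'db-*.sql.age'
    # -mmin -1440 selects files modified less than 1440 minutes (24 h) ago.
    [ -n "$(find "$CACHE_DIR" -maxdepth 1 -name "$pattern" -mmin -1440 2>/dev/null)" ]
}

# Succeed if the age key file exists.
key_present() {
    [ -f "${1:-$HOME/.config/age/railiance-backup.key}" ]
}

# Example:
#   backup_is_fresh 'db-*.sql.age' && backup_is_fresh 'config-*.tar.gz.age' \
#       && key_present || exit 1
```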

---

## Restore procedure

Use this when recovering from hardware failure, WSL2 corruption, or
accidental data loss. Work through it in order; each step depends on
the previous one.

### Step 0: Prerequisites

On a fresh Ubuntu / WSL2 instance, install the required tools:

```bash
sudo apt-get update && sudo apt-get install -y \
  git curl docker.io age postgresql-client
```

Start Docker:

```bash
sudo service docker start
```
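
Before moving on, it can help to confirm that the tools installed above are actually on the `PATH`. The helper below is a small sketch; `require_tools` is a hypothetical name and not part of the repository.

```bash
# Report every missing command on stderr; return 1 if any is absent.
require_tools() {
    missing=0
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool" >&2; missing=1; }
    done
    return "$missing"
}

# Example (psql is provided by the postgresql-client package):
#   require_tools git curl docker age psql
```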

### Step 1: Retrieve the age private key

Copy the private key from your password manager onto the machine:

```bash
mkdir -p ~/.config/age && chmod 700 ~/.config/age
# Paste the key content, then press Ctrl-D on an empty line.
# The subshell's umask keeps the file private from the moment it is created.
( umask 077 && cat > ~/.config/age/railiance-backup.key )
chmod 600 ~/.config/age/railiance-backup.key
```
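
A pasted key is easy to truncate or duplicate, so a quick sanity check is worthwhile before relying on it. This sketch assumes the file was originally produced by `age-keygen`, whose secret key line starts with `AGE-SECRET-KEY-1`; the function name is hypothetical.

```bash
# Return 0 if the file contains exactly one age secret key line.
looks_like_age_identity() {
    key="${1:-$HOME/.config/age/railiance-backup.key}"
    [ "$(grep -c '^AGE-SECRET-KEY-1' "$key" 2>/dev/null)" = "1" ]
}

# Example:
#   looks_like_age_identity || echo "key file looks wrong, paste it again" >&2
# `age-keygen -y <keyfile>` prints the matching public key, which can be
# compared with the one hardcoded in tools/cmd/railiance-backup.
```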

### Step 2: Get the backup files

Download the most recent backup files from Nextcloud (ask the Nextcloud admin
for read access). Alternatively, if a copy of the local cache survived (for
example on a secondary machine), take them from `~/.cache/railiance/backups/`
there.

Files needed:

- `db-<timestamp>.sql.age`
- `config-<timestamp>.tar.gz.age`

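
When several backups are available, the newest of each type is the one to use. The helper below is a sketch that picks it by file name; it assumes the timestamps embedded in the names sort chronologically, and `latest_backup` is a hypothetical name.

```bash
# Print the newest file in directory $1 whose name matches "$2-*$3".
latest_backup() {
    ls -1 "$1"/"$2"-*"$3" 2>/dev/null | sort -r | head -n 1
}

# Example:
#   latest_backup ~/.cache/railiance/backups db .sql.age
#   latest_backup ~/.cache/railiance/backups config .tar.gz.age
```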
### Step 3: Restore PostgreSQL

Start a fresh postgres container. This needs the `the-custodian` checkout; if
it is not on the machine yet, clone it first (Step 5).

```bash
cd ~/the-custodian/state-hub
cp infra/.env.example infra/.env   # fill in POSTGRES_PASSWORD
make db
```

Decrypt and restore the database dump:

```bash
age --decrypt \
  -i ~/.config/age/railiance-backup.key \
  db-<timestamp>.sql.age \
  | docker exec -i infra-postgres-1 psql -U custodian custodian
```

Verify row counts look sane:

```bash
docker exec infra-postgres-1 psql -U custodian custodian \
  -c "SELECT schemaname, tablename, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC;"
```
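
The one-line pipeline above keeps going if decryption fails part-way or if a statement errors during the load. A more defensive variant, offered here only as a sketch, decrypts to a temporary file first and lets `psql` stop on the first error. The function name `restore_db_staged` is hypothetical, and the trade-off is that the plaintext dump briefly exists on disk.

```bash
# Decrypt $2 with identity $1, then load it into the container $3.
restore_db_staged() {
    key="$1"; enc="$2"; container="${3:-infra-postgres-1}"
    tmp="$(mktemp)" || return 1
    # Decrypt fully first, so a wrong key or truncated file stops before psql runs.
    if age --decrypt -i "$key" "$enc" > "$tmp" && [ -s "$tmp" ]; then
        # ON_ERROR_STOP aborts on the first SQL error;
        # --single-transaction rolls back a partial load.
        docker exec -i "$container" psql -v ON_ERROR_STOP=1 --single-transaction \
            -U custodian custodian < "$tmp"
        rc=$?
    else
        echo "decrypt failed or produced an empty dump: $enc" >&2
        rc=1
    fi
    rm -f -- "$tmp"   # remove the plaintext dump in every case
    return "$rc"
}

# Example: restore_db_staged ~/.config/age/railiance-backup.key db-<timestamp>.sql.age
```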

### Step 4: Restore config files

```bash
age --decrypt \
  -i ~/.config/age/railiance-backup.key \
  config-<timestamp>.tar.gz.age \
  | tar -xz -C ~
```

This restores `~/.claude/`, `~/.claude.json`, and `~/.gitconfig`.

### Step 5: Clone repositories

```bash
git clone <gitea-url>/coulomb/railiance-bootstrap.git ~/railiance-bootstrap
git clone <gitea-url>/tegwick/the-custodian.git ~/the-custodian
git clone <gitea-url>/coulomb/markitect_project.git ~/markitect_project
# ... remaining repos as needed
```

If Gitea is offline, clone from the local bare mirrors in
`~/.cache/railiance/git-mirrors/` if they were set up (see T3).

### Step 6: Register the MCP server

```bash
cd ~/the-custodian/state-hub
python3 scripts/patch_mcp_cwd.py
```

### Step 7: Start the state hub and verify

```bash
cd ~/the-custodian/state-hub
make api &   # in background or a separate terminal
```

Smoke test to confirm the state hub is responding:

```bash
curl -sf http://127.0.0.1:8000/state/summary | python3 -m json.tool | head -20
```
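
Because `make api &` returns before the server is listening, the first `curl` can fail even on a healthy restore. A small retry loop avoids that false alarm. This is a sketch; `wait_for_http` is a hypothetical helper, not part of the repository.

```bash
# Poll $1 once per second until it answers with a success status or $2 tries run out.
wait_for_http() {
    url="$1"; tries="${2:-30}"
    while [ "$tries" -gt 0 ]; do
        # -f makes curl return non-zero on HTTP errors; the body is discarded.
        curl -sf -o /dev/null "$url" && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Example: wait_for_http http://127.0.0.1:8000/state/summary 30
```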

Finally, run the preflight check from `~/railiance-bootstrap`, then confirm
from Claude Code that the MCP tools are available:

```
bin/railiance preflight
```

---

## Restore drill (validation)

Run a restore drill before doing any major infrastructure work. The drill
validates that the procedure above actually works without waiting for a real
disaster.

A minimal drill that does not require a second machine:

```bash
# 1. Start a second postgres container on a different port
docker run -d --name restore-test \
  -e POSTGRES_DB=custodian \
  -e POSTGRES_USER=custodian \
  -e POSTGRES_PASSWORD=testpass \
  -p 5433:5432 \
  postgres:16-alpine

# Wait until the server accepts TCP connections. The image's first-boot init
# phase listens only on the Unix socket, so probing 127.0.0.1 skips past it.
until docker exec restore-test pg_isready -h 127.0.0.1 -U custodian -d custodian >/dev/null 2>&1; do
  sleep 1
done

# 2. Decrypt the newest cached dump and restore it
latest="$(ls -1 ~/.cache/railiance/backups/db-*.sql.age | sort -r | head -n 1)"
age --decrypt \
  -i ~/.config/age/railiance-backup.key \
  "$latest" \
  | docker exec -i restore-test psql -U custodian custodian

# 3. Check row counts
docker exec restore-test psql -U custodian custodian \
  -c "SELECT tablename, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC;"

# 4. Clean up
docker rm -f restore-test
```

Record the drill completion with a dated log entry (preflight checks for this in T5):

```bash
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) restore drill OK" \
  >> ~/.cache/railiance/restore-drill.log
```
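
A check for a recent drill could look roughly like the sketch below. The 30-day threshold and the function name `drill_is_recent` are assumptions; the runbook does not state how the preflight check evaluates the log.

```bash
# Succeed if the log's last entry is a success line and the log was
# modified within the last $2 days.
drill_is_recent() {
    log="${1:-$HOME/.cache/railiance/restore-drill.log}"
    days="${2:-30}"   # assumed threshold
    tail -n 1 "$log" 2>/dev/null | grep -q 'restore drill OK$' || return 1
    # -mtime -N matches files modified less than N*24 hours ago.
    [ -n "$(find "$log" -mtime -"$days" 2>/dev/null)" ]
}

# Example: drill_is_recent || echo "no recent restore drill on record" >&2
```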

---

## Extension Points

### EP-RAIL-003: Git bare-repo mirrors as secondary restore source

The current design relies on Gitea remotes for git repo recovery. If Gitea
is offline during a migration, repos can only be recovered from local working
copies (if they survive). A secondary bare-repo mirror (e.g., in a local
directory or on a NAS) would make git recovery independent of Gitea availability.

**Trigger:** when the Gitea server becomes a SPOF for restore operations (e.g.,
during ThreePhoenix migration work on the server that runs Gitea).

**Constraint:** mirrors must be updated on the same schedule as the DB backup;
stale mirrors provide false confidence.

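
One possible shape for such a mirror job is sketched below. It is not an existing part of the tooling: `sync_mirror` and the `MIRROR_DIR` variable are hypothetical names, and the default directory simply reuses the mirror path mentioned in the restore procedure.

```bash
MIRROR_DIR="${MIRROR_DIR:-$HOME/.cache/railiance/git-mirrors}"

# Create a bare mirror of $1 on the first run, refresh it on later runs.
sync_mirror() {
    url="$1"
    dest="$MIRROR_DIR/$(basename "$url" .git).git"
    if [ -d "$dest" ]; then
        # Fetch all refs and drop refs deleted upstream.
        git -C "$dest" remote update --prune >/dev/null
    else
        mkdir -p "$MIRROR_DIR" && git clone --quiet --mirror "$url" "$dest"
    fi
}

# Example, run from the same cron schedule as the DB backup:
#   sync_mirror <gitea-url>/tegwick/the-custodian.git
# Restoring from a mirror is then an ordinary clone of the local path:
#   git clone ~/.cache/railiance/git-mirrors/the-custodian.git ~/the-custodian
```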
### EP-RAIL-004: Offsite secondary copy of encrypted backups

The current Nextcloud file drop is the only offsite copy. A second destination
(rclone to an S3-compatible store, or rsync to a NAS) would protect against
Nextcloud unavailability.

**Trigger:** when Nextcloud is not available or is itself hosted on the same
infrastructure being migrated.

**Constraint:** the second destination must also be write-only or similarly
access-controlled; duplicating to a readable location without additional
access controls widens the blast radius of a credential leak.