Backup & Restore
Covers the current single-server development environment. This is the safety net that must be operational before any infrastructure migration work begins.
What is protected
| Asset | Location | Risk without backup |
|---|---|---|
| Custodian State Hub database | Docker volume infra_pg_data | Total loss of all workstreams, tasks, decisions, progress history |
| Claude config & memory | ~/.claude/, ~/.claude.json | Loss of project memory, MCP registration, settings |
| Git config | ~/.gitconfig | Minor friction, recoverable |
| age private key | ~/.config/age/railiance-backup.key | Cannot decrypt any existing backup |
Git repositories are not included in the backup — they are protected by being pushed to Gitea remotes. The preflight check verifies this.
Encryption
All backups are encrypted with age before leaving the machine.
Key locations:
| Copy | Location | Purpose |
|---|---|---|
| Operational | ~/.config/age/railiance-backup.key | Used locally for restore drills |
| Recovery | Password manager | Used when the machine is gone |
Permissions: chmod 700 ~/.config/age && chmod 600 ~/.config/age/railiance-backup.key
The public key is hardcoded in tools/cmd/railiance-backup. To retrieve it:
grep "public key" ~/.config/age/railiance-backup.key
The password manager copy is the only key that survives hardware failure. Verify it is there before doing any infrastructure work.
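The permission requirement above can be checked mechanically. The following is a minimal sketch, assuming GNU `stat` (as shipped with Ubuntu); `check_age_key_perms` is a hypothetical helper name and is not part of `bin/railiance`.

```shell
# Hypothetical helper: succeed only if the key directory is mode 700
# and the key file is mode 600. Assumes GNU stat (stat -c).
check_age_key_perms() (
  key=${1:-"$HOME/.config/age/railiance-backup.key"}
  dir=$(dirname "$key")
  [ -f "$key" ] || { echo "missing key: $key" >&2; exit 1; }
  dir_mode=$(stat -c %a "$dir")
  key_mode=$(stat -c %a "$key")
  [ "$dir_mode" = "700" ] || { echo "$dir has mode $dir_mode, expected 700" >&2; exit 1; }
  [ "$key_mode" = "600" ] || { echo "$key has mode $key_mode, expected 600" >&2; exit 1; }
)
```

Called without arguments it inspects the operational key path used throughout this document; a non-zero exit status means the permissions need fixing with the `chmod` commands above.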
Destination
Backups are uploaded to a Nextcloud file drop (upload-only, no credentials
required to write, cannot be read back without Nextcloud admin access). The
endpoint URL is stored locally in wiki/260225-backup-dropoff-link.txt
(gitignored).
Uploads use a direct HTTP PUT via curl — rclone is not used because Nextcloud file drop links only permit PUT requests.
A local cache of the last 7 backups of each type is kept in
~/.cache/railiance/backups/.
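The retention rule for the local cache could look roughly like the sketch below. It assumes that the timestamp embedded in each file name sorts lexicographically (newest last) and that file names contain no newlines. `prune_backups` is a hypothetical helper; the actual logic in `bin/railiance` may differ.

```shell
# Hypothetical helper: keep the newest $3 files (default 7) matching
# glob $2 in directory $1 and delete the older ones.
# Relies on the timestamp in the file name sorting lexicographically.
prune_backups() (
  dir=$1 pattern=$2 keep=${3:-7}
  # Sort newest first, skip the first $keep entries, delete the rest.
  find "$dir" -maxdepth 1 -type f -name "$pattern" \
    | sort -r \
    | tail -n +"$((keep + 1))" \
    | while IFS= read -r f; do rm -f -- "$f"; done
)
```

Running it once per backup type, for example `prune_backups ~/.cache/railiance/backups 'db-*.sql.age'` and again with `'config-*.tar.gz.age'`, keeps the two types from crowding each other out.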
Running a backup
bin/railiance backup
This runs two steps:
- PostgreSQL dump: pg_dump from the running infra-postgres-1 container, piped through age, uploaded as db-<timestamp>.sql.age.
- Config snapshot: tar of ~/.claude/, ~/.claude.json, ~/.gitconfig, encrypted with age, uploaded as config-<timestamp>.tar.gz.age.
A .last-backup stamp is written to the local cache; the preflight check
reads this to verify freshness.
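The two steps can be pictured as the pipelines below. This is a sketch under assumptions, not the contents of `bin/railiance`: the function names, the `AGE_RECIPIENT` and `CACHE_DIR` variables and the timestamp format are illustrative, and the `custodian` user and database names are borrowed from the restore commands in this document. The upload and the `.last-backup` stamp are left out.

```shell
# Illustrative only: these functions are not the real bin/railiance code.
# AGE_RECIPIENT must hold the age public key (age1...).
CACHE_DIR=${CACHE_DIR:-"$HOME/.cache/railiance/backups"}

# Build a name such as db-<timestamp>.sql.age (the timestamp format is an assumption).
backup_name() {
  printf '%s-%s.%s\n' "$1" "$(date -u +%Y%m%dT%H%M%SZ)" "$2"
}

# Step 1: dump the database inside the container and encrypt the stream.
dump_db() {
  out="$CACHE_DIR/$(backup_name db sql.age)"
  docker exec infra-postgres-1 pg_dump -U custodian custodian \
    | age -r "$AGE_RECIPIENT" -o "$out"
}

# Step 2: archive the config paths relative to $HOME and encrypt the stream.
snapshot_config() {
  out="$CACHE_DIR/$(backup_name config tar.gz.age)"
  tar -czf - -C "$HOME" .claude .claude.json .gitconfig \
    | age -r "$AGE_RECIPIENT" -o "$out"
}
```

Because both steps are pipelines, a failing `pg_dump` or `tar` is masked by a succeeding `age` unless the script runs under bash with `set -o pipefail`.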
Automated: a cron job runs the backup daily at 02:00:
0 2 * * * /home/worsch/railiance-bootstrap/bin/railiance backup >> ~/.cache/railiance/backup.log 2>&1
Pre-migration preflight
Before touching any infrastructure, run:
bin/railiance preflight
Checks performed:
| Check | Pass condition |
|---|---|
| DB backup freshness | Latest db-*.sql.age is less than 24 hours old |
| Config backup freshness | Latest config-*.tar.gz.age is less than 24 hours old |
| Git repos clean | No uncommitted changes in any tracked repo |
| Git repos pushed | No unpushed commits in any tracked repo |
| age key present | ~/.config/age/railiance-backup.key exists |
Exit 0 = safe to proceed. Exit 1 = do not proceed.
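The freshness and key-presence rows of the table can be approximated as follows. This is a sketch, assuming that file modification time is an acceptable stand-in for backup age and that GNU `find` is available; the helper names are hypothetical and the Git checks are omitted.

```shell
# Hypothetical helper: succeed if at least one file matching glob $2 in
# directory $1 was modified within the last $3 hours (default 24).
has_fresh_backup() {
  find "$1" -maxdepth 1 -type f -name "$2" -mmin "-$(( ${3:-24} * 60 ))" \
    | grep -q .
}

# Combine the freshness and key-presence checks into one exit code.
# $1 is the backup cache directory, $2 the path of the age private key.
preflight_backups() {
  has_fresh_backup "$1" 'db-*.sql.age' \
    || { echo "DB backup stale or missing" >&2; return 1; }
  has_fresh_backup "$1" 'config-*.tar.gz.age' \
    || { echo "config backup stale or missing" >&2; return 1; }
  [ -f "$2" ] || { echo "age key missing: $2" >&2; return 1; }
}
```

A call such as `preflight_backups ~/.cache/railiance/backups ~/.config/age/railiance-backup.key` follows the same convention as the real command: exit 0 means the checks passed, anything else means stop.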
Restore procedure
Use this when recovering from hardware failure, WSL2 corruption, or accidental data loss. Work through it in order — each step depends on the previous one.
Step 0 — Prerequisites
On a fresh Ubuntu / WSL2 instance, install the required tools:
sudo apt-get update && sudo apt-get install -y \
git curl docker.io age postgresql-client
Start Docker:
sudo service docker start
Step 1 — Retrieve the age private key
Copy the private key from your password manager into the machine:
mkdir -p ~/.config/age
# paste the key content:
cat > ~/.config/age/railiance-backup.key
# (paste, then Ctrl-D)
chmod 700 ~/.config/age && chmod 600 ~/.config/age/railiance-backup.key
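A pasted key is easy to get wrong (an empty paste, a stray second copy). The sketch below is a quick sanity check, relying on the fact that age identity files carry the secret on a line starting with `AGE-SECRET-KEY-1`. `looks_like_age_key` is a hypothetical helper; it does not detect a key that was truncated at the end of the line, and it does not prove the key matches the backups.

```shell
# Hypothetical helper: succeed if the file contains exactly one
# age secret key line. Comment lines (# created, # public key) are ignored.
looks_like_age_key() {
  count=$(grep -c '^AGE-SECRET-KEY-1' "$1" 2>/dev/null)
  [ "$count" = "1" ]
}
```

The definitive test is still the first successful `age --decrypt` in Step 3.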
Step 2 — Get the backup files
Download the most recent backup files from Nextcloud (ask the Nextcloud admin
for read access). If a copy of the local cache survived, for example on a
secondary machine, take the files from ~/.cache/railiance/backups/ there instead.
Files needed:
- db-<timestamp>.sql.age
- config-<timestamp>.tar.gz.age
Step 3 — Restore PostgreSQL
Start a fresh postgres container:
cd ~/the-custodian/state-hub
cp infra/.env.example infra/.env # fill in POSTGRES_PASSWORD
make db
Decrypt and restore the database dump:
age --decrypt \
-i ~/.config/age/railiance-backup.key \
db-<timestamp>.sql.age \
| docker exec -i infra-postgres-1 psql -U custodian custodian
Verify row counts look sane:
docker exec infra-postgres-1 psql -U custodian custodian \
-c "SELECT relname, n_live_tup FROM pg_stat_user_tables WHERE n_live_tup > 0 ORDER BY n_live_tup DESC;"
Step 4 — Restore config files
age --decrypt \
-i ~/.config/age/railiance-backup.key \
config-<timestamp>.tar.gz.age \
| tar -xz -C ~
This restores ~/.claude/, ~/.claude.json, and ~/.gitconfig.
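Since the archive is unpacked directly into the home directory, it can be worth listing its members before extracting. The sketch below assumes the archive stores paths relative to the home directory, with or without a leading `./`; how `bin/railiance` actually builds the archive is not confirmed here, and `only_expected_members` is a hypothetical helper.

```shell
# Hypothetical helper: read a tar listing on stdin and fail if any entry
# falls outside .claude/, .claude.json or .gitconfig.
only_expected_members() {
  # grep -Ev prints entries that do NOT match the allowed paths;
  # the function succeeds only if there are none.
  ! grep -Ev '^(\./)?(\.claude(/.*)?|\.claude\.json|\.gitconfig)$' | grep -q .
}

# Usage (not executed here):
#   age --decrypt -i ~/.config/age/railiance-backup.key config-<timestamp>.tar.gz.age \
#     | tar -tz | only_expected_members
```

If the check fails, inspect the output of `tar -tz` by hand before extracting anything.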
Step 5 — Clone repositories
git clone <gitea-url>/coulomb/railiance-bootstrap.git ~/railiance-bootstrap
git clone <gitea-url>/tegwick/the-custodian.git ~/the-custodian
git clone <gitea-url>/coulomb/markitect_project.git ~/markitect_project
# ... remaining repos as needed
If Gitea is offline, clone from the local bare mirrors in
~/.cache/railiance/git-mirrors/ if they were set up (see T3).
Step 6 — Register the MCP server
cd ~/the-custodian/state-hub
python3 scripts/patch_mcp_cwd.py
Step 7 — Start the state hub and verify
cd ~/the-custodian/state-hub
make api & # in background or a separate terminal
Smoke test — confirm state hub is responding:
curl -sf http://127.0.0.1:8000/state/summary | python3 -m json.tool | head -20
And from Claude Code, confirm MCP tools are available:
bin/railiance preflight
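Because `make api &` returns before the API is listening, the smoke test in this step can fail if it runs immediately. A small retry loop avoids guessing a sleep duration. `wait_until` is a hypothetical helper, and the 30 attempts with a one-second pause in the usage comment are arbitrary values.

```shell
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts; succeed as soon as the command succeeds.
wait_until() {
  tries=$1 delay=$2
  shift 2
  while [ "$tries" -gt 0 ]; do
    "$@" && return 0
    tries=$((tries - 1))
    # Only sleep if another attempt follows.
    if [ "$tries" -gt 0 ]; then sleep "$delay"; fi
  done
  return 1
}

# Usage (not executed here):
#   wait_until 30 1 curl -sf -o /dev/null http://127.0.0.1:8000/state/summary
```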
Restore drill (validation)
Run a restore drill before doing any major infrastructure work. The drill validates that the procedure above actually works without waiting for a real disaster.
A minimal drill that does not require a second machine:
# 1. Start a second postgres container on a different port
docker run -d --name restore-test \
-e POSTGRES_DB=custodian \
-e POSTGRES_USER=custodian \
-e POSTGRES_PASSWORD=testpass \
-p 5433:5432 \
postgres:16-alpine
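# 2. Wait until postgres accepts TCP connections, then decrypt and restore the newest dump
#    (during first-time init the image's temporary server listens on the Unix socket only)
until docker exec restore-test pg_isready -q -h 127.0.0.1 -U custodian; do sleep 1; done
latest=$(ls ~/.cache/railiance/backups/db-*.sql.age | sort -r | head -1)
age --decrypt \
-i ~/.config/age/railiance-backup.key \
"$latest" \
| docker exec -i restore-test psql -U custodian custodian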
# 3. Check row counts
docker exec restore-test psql -U custodian custodian \
-c "SELECT relname, n_live_tup FROM pg_stat_user_tables WHERE n_live_tup > 0 ORDER BY n_live_tup DESC;"
# 4. Clean up
docker rm -f restore-test
Record the drill completion with a dated file (preflight checks for this in T5):
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) restore drill OK" \
>> ~/.cache/railiance/restore-drill.log
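A check on the drill log could be as small as the sketch below, which reports how many whole days have passed since the last recorded drill. It assumes GNU `date` and the log line format written by the command above; `last_drill_age_days` is a hypothetical helper, not the check planned for preflight.

```shell
# Hypothetical helper: print the age in whole days of the last entry in
# the drill log. Assumes GNU date and an ISO 8601 UTC timestamp as the
# first space-separated field of each line.
last_drill_age_days() (
  log=${1:-"$HOME/.cache/railiance/restore-drill.log"}
  ts=$(tail -n 1 "$log" | cut -d' ' -f1)
  [ -n "$ts" ] || exit 1
  then_s=$(date -u -d "$ts" +%s) || exit 1
  now_s=$(date -u +%s)
  echo $(( (now_s - then_s) / 86400 ))
)
```

A caller can then compare the result against whatever threshold the team agrees on; the function fails if the log is missing or empty.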
Extension Points
EP-RAIL-003 — Git bare-repo mirrors as secondary restore source
The current design relies on Gitea remotes for git repo recovery. If Gitea is offline during a migration, repos can only be recovered from local working copies (if they survive). A secondary bare-repo mirror (e.g., in a local directory or on a NAS) would make git recovery independent of Gitea availability.
Trigger: when the Gitea server becomes a SPOF for restore operations (e.g., during ThreePhoenix migration work on the server that runs Gitea).
Constraint: mirrors must be updated on the same schedule as the DB backup; stale mirrors provide false confidence.
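One way such a mirror could be created and refreshed is sketched below, using `git clone --mirror` for the first run and `git remote update --prune` afterwards. `sync_mirror` is a hypothetical helper, and the target directory layout is an assumption rather than an implemented design.

```shell
# Hypothetical helper: create or refresh a bare mirror of the repository
# at URL or path $1 in directory $2.
sync_mirror() {
  if [ -d "$2" ]; then
    # Existing mirror: fetch all refs and drop ones deleted upstream.
    git -C "$2" remote update --prune
  else
    # First run: a mirror clone is bare and tracks every ref.
    git clone --mirror "$1" "$2"
  fi
}

# Usage (not executed here):
#   sync_mirror <gitea-url>/tegwick/the-custodian.git \
#     ~/.cache/railiance/git-mirrors/the-custodian.git
```

Running it from the same cron entry as the backup would satisfy the scheduling constraint above.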
EP-RAIL-004 — Offsite secondary copy of encrypted backups
The current Nextcloud file drop is the only offsite copy. A second destination (rclone to an S3-compatible store, or rsync to a NAS) would protect against Nextcloud unavailability.
Trigger: when Nextcloud is not available or is itself hosted on the same infrastructure being migrated.
Constraint: the second destination must also be write-only or similarly access-controlled; duplicating to a readable location without additional access controls widens the blast radius of a credential leak.