railiance-cluster/docs/backup-restore.md
tegwick 75467673a8 feat(safety-net): create WP-0004, update preflight for OAS 5-repo layout
- workplans/RAIL-BS-WP-0004-safety-net.md: ADR-001 workplan file for
  current-env-safety-net workstream (7e8b0c20), T01-T04 done, T05-T06 todo
- tools/cmd/railiance-preflight: update REPOS to OAS S1-S5 stack
  (railiance-infra/cluster/platform/enablement/apps) + project repos;
  remove stale railiance-bootstrap reference
- docs/backup-restore.md: fix Step 5 clone commands to current repo names
- Makefile: add make backup and make preflight targets

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 15:21:29 +01:00


Backup & Restore

Covers the current single-server development environment. This is the safety net that must be operational before any infrastructure migration work begins.


What is protected

| Asset | Location | Risk without backup |
| --- | --- | --- |
| Custodian State Hub database | Docker volume infra_pg_data | Total loss of all workstreams, tasks, decisions, progress history |
| Claude config & memory | ~/.claude/, ~/.claude.json | Loss of project memory, MCP registration, settings |
| Git config | ~/.gitconfig | Minor friction, recoverable |
| age private key | ~/.config/age/railiance-backup.key | Cannot decrypt any existing backup |

Git repositories are not included in the backup — they are protected by being pushed to Gitea remotes. The preflight check verifies this.
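A minimal sketch of how a "clean and pushed" check could look for a single repository. The helper name `repo_is_safe` is hypothetical and not taken from railiance-preflight. The comparison runs against the local remote-tracking branch, so it assumes a recent `git fetch` or `git push`.

```shell
# Hypothetical helper: succeeds only if the repo at $1 is clean and fully pushed.
repo_is_safe() {
  repo="$1"
  # Any output from --porcelain means uncommitted or untracked changes.
  [ -z "$(git -C "$repo" status --porcelain)" ] || return 1
  # Without an upstream branch there is no proof the commits exist remotely.
  git -C "$repo" rev-parse --abbrev-ref '@{upstream}' >/dev/null 2>&1 || return 1
  # Count commits reachable from HEAD but not from the upstream branch.
  [ "$(git -C "$repo" rev-list --count '@{upstream}..HEAD')" -eq 0 ]
}
```

For example, `repo_is_safe ~/railiance-cluster || echo "do not proceed"` would flag a working copy with local-only work.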


Encryption

All backups are encrypted with age before leaving the machine.

Key locations:

| Copy | Location | Purpose |
| --- | --- | --- |
| Operational | ~/.config/age/railiance-backup.key | Used locally for restore drills |
| Recovery | Password manager | Used when the machine is gone |

Permissions: chmod 700 ~/.config/age && chmod 600 ~/.config/age/railiance-backup.key
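The permissions above can be verified rather than only set. The following is a sketch under assumptions: `check_key_perms` is a hypothetical helper name, and `stat -c` is the GNU coreutils form found on Ubuntu / WSL2.

```shell
# Hypothetical helper: verify that the age key directory and key file are private.
# Assumes GNU stat; BSD/macOS stat uses different flags.
check_key_perms() {
  key="${1:-$HOME/.config/age/railiance-backup.key}"
  dir="$(dirname "$key")"
  [ -f "$key" ] || { echo "missing key: $key" >&2; return 1; }
  [ "$(stat -c '%a' "$dir")" = "700" ] || { echo "expected mode 700 on $dir" >&2; return 1; }
  [ "$(stat -c '%a' "$key")" = "600" ] || { echo "expected mode 600 on $key" >&2; return 1; }
}
```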

The public key is hardcoded in tools/cmd/railiance-backup. To retrieve it:

grep "public key" ~/.config/age/railiance-backup.key

The password manager copy is the only key that survives hardware failure. Verify it is there before doing any infrastructure work.


Destination

Backups are uploaded to a Nextcloud file drop (upload-only, no credentials required to write, cannot be read back without Nextcloud admin access). The endpoint URL is stored locally in wiki/260225-backup-dropoff-link.txt (gitignored).

Uploads use a direct HTTP PUT via curl — rclone is not used because Nextcloud file drop links only permit PUT requests.
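A minimal sketch of such a PUT upload. The function name `upload_backup` and the `CURL` override (a seam for dry runs) are hypothetical. How the real drop URL combines with the file name is an assumption: curl appends the local file name when the URL ends in a slash, and the sketch relies on that.

```shell
# Hypothetical helper: upload one encrypted file with HTTP PUT.
# $2 is the drop URL read from the gitignored link file; its exact shape is an assumption.
upload_backup() {
  file="$1"; drop_url="$2"
  # --upload-file makes curl issue a PUT; with a trailing slash curl appends the local file name.
  # --fail turns HTTP error responses into a non-zero exit status.
  "${CURL:-curl}" --fail --silent --show-error \
    --upload-file "$file" "${drop_url%/}/"
}
```

Because the drop is upload-only, a zero exit status from curl is the only confirmation available on the client side.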

A local cache of the last 7 backups of each type is kept in ~/.cache/railiance/backups/.
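The retention rule could be implemented roughly as follows. This is a sketch, not the actual railiance-backup logic: `prune_cache` is a hypothetical name, and it assumes that timestamped file names sort chronologically and contain no whitespace.

```shell
# Hypothetical helper: keep only the newest $3 files matching glob $2 in directory $1.
prune_cache() {
  dir="$1"; pattern="$2"; keep="${3:-7}"
  # Reverse name order puts the newest timestamp first; everything after line $keep is deleted.
  ls -1 "$dir"/$pattern 2>/dev/null | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r f; do
      rm -f -- "$f"
    done
}
```

It would be called once per backup type, for example `prune_cache ~/.cache/railiance/backups 'db-*.sql.age' 7`.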


Running a backup

bin/railiance backup

This runs two steps:

  1. PostgreSQL dump: pg_dump from the running infra-postgres-1 container, piped through age, uploaded as db-<timestamp>.sql.age.

  2. Config snapshot: tar of ~/.claude/, ~/.claude.json, ~/.gitconfig, encrypted with age, uploaded as config-<timestamp>.tar.gz.age.

A .last-backup stamp is written to the local cache; the preflight check reads this to verify freshness.
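The two steps can be sketched as shell pipelines. This is an illustration under assumptions, not the contents of tools/cmd/railiance-backup: the function names, the timestamp format, and passing the age recipient as an argument are all hypothetical. The database user and name are taken from the restore commands in this document.

```shell
# Hypothetical sketch of the two backup steps; the timestamp format is an assumption.
backup_name() {  # usage: backup_name db sql  ->  db-<timestamp>.sql.age
  printf '%s-%s.%s.age\n' "$1" "$(date -u +%Y%m%dT%H%M%SZ)" "$2"
}

dump_db() {  # usage: dump_db <age-recipient> <out-dir>
  out="$2/$(backup_name db sql)"
  # pg_dump runs inside the container; only ciphertext is written to disk.
  docker exec infra-postgres-1 pg_dump -U custodian custodian \
    | age -r "$1" -o "$out" && echo "$out"
}

snapshot_config() {  # usage: snapshot_config <age-recipient> <out-dir>
  out="$2/$(backup_name config tar.gz)"
  # Paths are relative to $HOME so that `tar -xz -C ~` restores them in place.
  tar -czf - -C "$HOME" .claude .claude.json .gitconfig \
    | age -r "$1" -o "$out" && echo "$out"
}
```

A pipeline reports only the exit status of its last command, so a failed pg_dump could still leave a valid but empty encrypted file. In bash, `set -o pipefail` would make that case fail loudly.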

Automated: a cron job runs the backup daily at 02:00:

0 2 * * * /home/worsch/railiance-cluster/bin/railiance backup >> ~/.cache/railiance/backup.log 2>&1

Pre-migration preflight

Before touching any infrastructure, run:

bin/railiance preflight

Checks performed:

| Check | Pass condition |
| --- | --- |
| DB backup freshness | Latest db-*.sql.age is less than 24 hours old |
| Config backup freshness | Latest config-*.tar.gz.age is less than 24 hours old |
| Git repos clean | No uncommitted changes in any tracked repo |
| Git repos pushed | No unpushed commits in any tracked repo |
| age key present | ~/.config/age/railiance-backup.key exists |

Exit 0 = safe to proceed. Exit 1 = do not proceed.
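The freshness conditions can be illustrated with a small helper. This is a simplified sketch: the real preflight reads the .last-backup stamp, whereas this version uses file modification times. `backup_is_fresh` is a hypothetical name, and `stat -c` assumes GNU coreutils.

```shell
# Hypothetical helper: succeed if the newest file matching glob $2 in directory $1
# was modified less than $3 seconds ago (default: 24 hours).
backup_is_fresh() {
  dir="$1"; pattern="$2"; max_age="${3:-86400}"
  # ls -t sorts by modification time, newest first.
  newest="$(ls -1t "$dir"/$pattern 2>/dev/null | head -n 1)"
  [ -n "$newest" ] || return 1
  age_s=$(( $(date +%s) - $(stat -c %Y "$newest") ))
  [ "$age_s" -lt "$max_age" ]
}
```

The exit-code convention matches the one above: `backup_is_fresh ~/.cache/railiance/backups 'db-*.sql.age' || exit 1`.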


Restore procedure

Use this when recovering from hardware failure, WSL2 corruption, or accidental data loss. Work through it in order — each step depends on the previous one.

Step 0 — Prerequisites

On a fresh Ubuntu / WSL2 instance, install the required tools:

sudo apt-get update && sudo apt-get install -y \
  git curl docker.io age postgresql-client

Start Docker:

sudo service docker start

Step 1 — Retrieve the age private key

Copy the private key from your password manager into the machine:

mkdir -p ~/.config/age
# paste the key content:
cat > ~/.config/age/railiance-backup.key
# (paste, then Ctrl-D)
chmod 700 ~/.config/age && chmod 600 ~/.config/age/railiance-backup.key

Step 2 — Get the backup files

Download the most recent backup files from Nextcloud (ask the Nextcloud admin for read access, or retrieve from ~/.cache/railiance/backups/ on a secondary machine if the local cache survived).

Files needed:

  • db-<timestamp>.sql.age
  • config-<timestamp>.tar.gz.age

Step 3 — Restore PostgreSQL

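Start a fresh postgres container. This needs a checkout of the-custodian; on a fresh machine, clone it first using the git clone command for the-custodian from Step 5: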

cd ~/the-custodian/state-hub
cp infra/.env.example infra/.env   # fill in POSTGRES_PASSWORD
make db

Decrypt and restore the database dump:

age --decrypt \
  -i ~/.config/age/railiance-backup.key \
  db-<timestamp>.sql.age \
  | docker exec -i infra-postgres-1 psql -U custodian custodian

Verify row counts look sane:

docker exec infra-postgres-1 psql -U custodian custodian \
  -c "SELECT relname, n_live_tup FROM pg_stat_user_tables WHERE n_live_tup > 0 ORDER BY n_live_tup DESC;"

Step 4 — Restore config files

age --decrypt \
  -i ~/.config/age/railiance-backup.key \
  config-<timestamp>.tar.gz.age \
  | tar -xz -C ~

This restores ~/.claude/, ~/.claude.json, and ~/.gitconfig.
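Because the archive is unpacked directly into the home directory, it can be worth listing its contents before extracting. The guard below is a sketch; `only_expected_paths` is a hypothetical helper that reads a `tar -t` listing on stdin and rejects anything outside the three expected entries.

```shell
# Hypothetical guard: fail if a tar listing contains paths other than the expected ones.
only_expected_paths() {
  while IFS= read -r path; do
    case "$path" in
      # Reject parent-directory components before anything else.
      ../*|*/../*|*/..) echo "path traversal in archive: $path" >&2; return 1 ;;
      .claude|.claude/*|.claude.json|.gitconfig) ;;
      ./.claude|./.claude/*|./.claude.json|./.gitconfig) ;;
      *) echo "unexpected path in archive: $path" >&2; return 1 ;;
    esac
  done
}
```

Usage would mirror the restore command, with `tar -tz` in place of `tar -xz -C ~`: decrypt with age, pipe into `tar -tz | only_expected_paths`, and extract only if that succeeds.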

Step 5 — Clone repositories

OAS Stack repos (S1-S5, per ADR-003):

git clone <gitea-url>/coulomb/railiance-infra.git       ~/railiance-infra
git clone <gitea-url>/coulomb/railiance-cluster.git     ~/railiance-cluster
git clone <gitea-url>/coulomb/railiance-platform.git    ~/railiance-platform
git clone <gitea-url>/coulomb/railiance-enablement.git  ~/railiance-enablement
git clone <gitea-url>/coulomb/railiance-apps.git        ~/railiance-apps

Core and project repos:

git clone <gitea-url>/tegwick/the-custodian.git     ~/the-custodian
git clone <gitea-url>/coulomb/markitect_project.git ~/markitect_project
git clone <gitea-url>/coulomb/activity-core.git     ~/activity-core
git clone <gitea-url>/coulomb/net-kingdom.git       ~/net-kingdom
# ... remaining repos as needed

If Gitea is offline and local bare mirrors were set up (see T3), clone from ~/.cache/railiance/git-mirrors/ instead.

Step 6 — Register the MCP server

cd ~/the-custodian/state-hub
python3 scripts/patch_mcp_cwd.py

Step 7 — Start the state hub and verify

cd ~/the-custodian/state-hub
make api &   # in background or a separate terminal

Smoke test — confirm state hub is responding:

curl -sf http://127.0.0.1:8000/state/summary | python3 -m json.tool | head -20
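Since the API is started in the background, the smoke test can race against startup. A small retry loop avoids that. This is a sketch: `wait_for_http` is a hypothetical helper, and the default of 30 attempts one second apart is an arbitrary assumption.

```shell
# Hypothetical helper: poll a URL until it answers with a success status.
wait_for_http() {
  url="$1"; tries="${2:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl fail on HTTP errors; -s keeps it quiet; the body is discarded.
    curl -sf -o /dev/null "$url" && return 0
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

For example, `wait_for_http http://127.0.0.1:8000/state/summary` could run before the smoke test command.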

And from Claude Code, confirm MCP tools are available:

(cd ~/railiance-cluster && bin/railiance preflight)

Restore drill (validation)

Run a restore drill before doing any major infrastructure work. The drill validates that the procedure above actually works without waiting for a real disaster.

A minimal drill that does not require a second machine:

# 1. Start a second postgres container on a different port
docker run -d --name restore-test \
  -e POSTGRES_DB=custodian \
  -e POSTGRES_USER=custodian \
  -e POSTGRES_PASSWORD=testpass \
  -p 5433:5432 \
  postgres:16-alpine

# 2. Decrypt and restore to it
age --decrypt \
  -i ~/.config/age/railiance-backup.key \
  "$(ls ~/.cache/railiance/backups/db-*.sql.age | sort -r | head -1)" \
  | docker exec -i restore-test psql -U custodian custodian

# 3. Check row counts
docker exec restore-test psql -U custodian custodian \
  -c "SELECT relname, n_live_tup FROM pg_stat_user_tables WHERE n_live_tup > 0 ORDER BY n_live_tup DESC;"

# 4. Clean up
docker rm -f restore-test

Record the drill completion with a dated file (preflight checks for this in T5):

echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) restore drill OK" \
  >> ~/.cache/railiance/restore-drill.log
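A check against this log could look like the sketch below. The helper name `drill_is_recent` and the 30-day default are assumptions, since the document does not state what the T5 check requires. It relies on GNU date and on the line format written by the command above.

```shell
# Hypothetical check: succeed if the last logged drill is at most $2 days old.
# Assumes lines of the form "<ISO-8601 UTC timestamp> restore drill OK" and GNU date.
drill_is_recent() {
  log="$1"; max_days="${2:-30}"
  [ -s "$log" ] || return 1
  # The timestamp is the first space-separated field of the last line.
  last_ts="$(tail -n 1 "$log" | cut -d' ' -f1)"
  last_epoch="$(date -u -d "$last_ts" +%s)" || return 1
  [ $(( ( $(date -u +%s) - last_epoch ) / 86400 )) -le "$max_days" ]
}
```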

Extension Points

EP-RAIL-003 — Git bare-repo mirrors as secondary restore source

The current design relies on Gitea remotes for git repo recovery. If Gitea is offline during a migration, repos can only be recovered from local working copies (if they survive). A secondary bare-repo mirror (e.g., in a local directory or on a NAS) would make git recovery independent of Gitea availability.

Trigger: when the Gitea server becomes a SPOF for restore operations (e.g., during ThreePhoenix migration work on the server that runs Gitea). Constraint: mirrors must be updated on the same schedule as the DB backup; stale mirrors provide false confidence.
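One possible shape for this extension point, as a sketch only: `update_mirror` is a hypothetical helper that creates a bare mirror on first run and refreshes it afterwards. Running it for each repo from the same cron job as the backup would satisfy the scheduling constraint.

```shell
# Hypothetical helper for EP-RAIL-003: create or refresh a bare mirror of one repository.
update_mirror() {
  url="$1"; mirror_dir="$2"
  if [ -d "$mirror_dir" ]; then
    # A mirror clone fetches all refs; --prune drops refs deleted upstream.
    git -C "$mirror_dir" remote update --prune >/dev/null
  else
    git clone --quiet --mirror "$url" "$mirror_dir"
  fi
}
```

For example: `update_mirror <gitea-url>/coulomb/railiance-cluster.git ~/.cache/railiance/git-mirrors/railiance-cluster.git`.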

EP-RAIL-004 — Offsite secondary copy of encrypted backups

The current Nextcloud file drop is the only offsite copy. A second destination (rclone to an S3-compatible store, or rsync to a NAS) would protect against Nextcloud unavailability.

Trigger: when Nextcloud is not available or is itself hosted on the same infrastructure being migrated. Constraint: the second destination must also be write-only or similarly access-controlled; duplicating to a readable location without additional access controls widens the blast radius of a credential leak.