# Backup & Restore

Covers the current single-server development environment. This is the safety
net that must be operational before any infrastructure migration work begins.

---

## What is protected

| Asset | Location | Risk without backup |
|---|---|---|
| Custodian State Hub database | Docker volume `infra_pg_data` | Total loss of all workstreams, tasks, decisions, progress history |
| Claude config & memory | `~/.claude/`, `~/.claude.json` | Loss of project memory, MCP registration, settings |
| Git config | `~/.gitconfig` | Minor friction, recoverable |
| age private key | `~/.config/age/railiance-backup.key` | Cannot decrypt any existing backup |

Git repositories are **not** included in the backup; they are protected by
being pushed to Gitea remotes. The preflight check verifies this.

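The "clean" and "pushed" conditions for a repository can be checked with plain git commands. The following is a minimal sketch, not the actual code in `tools/cmd/railiance-preflight`: the function names `repo_is_clean` and `repo_is_pushed` are hypothetical, only the current branch is inspected, and the comparison uses the locally known state of the upstream branch (no fetch is performed).

```bash
# Hypothetical helpers; a sketch of the two git checks, not the real preflight code.

repo_is_clean() {
  # Clean means: no staged, unstaged, or untracked changes in the working copy.
  [ -z "$(git -C "$1" status --porcelain)" ]
}

repo_is_pushed() {
  # Pushed means: an upstream is configured and HEAD has no commits it lacks.
  # A branch without an upstream makes rev-list fail, which counts as "not pushed".
  local ahead
  ahead=$(git -C "$1" rev-list --count '@{u}..HEAD' 2>/dev/null) || return 1
  [ "$ahead" -eq 0 ]
}
```

For example, `repo_is_clean ~/railiance-infra && repo_is_pushed ~/railiance-infra` succeeds only when that working copy has nothing left to commit or push on its current branch.
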
---

## Encryption

All backups are encrypted with [age](https://age-encryption.org/) before
leaving the machine.

**Key locations:**

| Copy | Location | Purpose |
|---|---|---|
| Operational | `~/.config/age/railiance-backup.key` | Used locally for restore drills |
| Recovery | Password manager | Used when the machine is gone |

Permissions: `chmod 700 ~/.config/age && chmod 600 ~/.config/age/railiance-backup.key`

The public key is hardcoded in `tools/cmd/railiance-backup`. To retrieve it:

```bash
grep "public key" ~/.config/age/railiance-backup.key
```

> **The password manager copy is the only key that survives hardware failure.**
> Verify it is there before doing any infrastructure work.

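The permissions stated above can be verified with a small check before relying on the key. This is a minimal sketch that assumes GNU `stat` (as shipped with Ubuntu); the function name `check_key_perms` is hypothetical and not part of the railiance tooling.

```bash
# Hypothetical helper: verify that the key exists and that the key file and its
# directory carry the modes recommended in this document (600 and 700).
check_key_perms() {
  local key=$1 dir
  dir=$(dirname "$key")
  [ -f "$key" ] || { echo "missing: $key" >&2; return 1; }
  # stat -c %a prints the octal mode; this flag is GNU coreutils specific.
  [ "$(stat -c %a "$dir")" = "700" ] || { echo "expected mode 700 on $dir" >&2; return 1; }
  [ "$(stat -c %a "$key")" = "600" ] || { echo "expected mode 600 on $key" >&2; return 1; }
}
```

Calling `check_key_perms ~/.config/age/railiance-backup.key` returns non-zero and prints the first mismatch if anything is off.
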
---

## Destination

Backups are uploaded to a Nextcloud file drop (upload-only, no credentials
required to write, cannot be read back without Nextcloud admin access). The
endpoint URL is stored locally in `wiki/260225-backup-dropoff-link.txt`
(gitignored).

Uploads use a direct HTTP PUT via curl; rclone is not used because Nextcloud
file drop links only permit PUT requests.

A local cache of the last 7 backups of each type is kept in
`~/.cache/railiance/backups/`.

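The "last 7 of each type" retention could be implemented along the following lines. This is a sketch, not the code in `tools/cmd/railiance-backup`: the function name `prune_cache` is hypothetical, and it assumes that the `<timestamp>` in the file names sorts chronologically when sorted as text.

```bash
# Hypothetical helper: keep only the newest N files matching a glob in a directory.
# Assumption: file names embed a timestamp that sorts chronologically as text.
prune_cache() {
  local dir=$1 glob=$2 keep=${3:-7} files count i
  files=( "$dir"/$glob )               # glob expansion is sorted, oldest name first
  [ -e "${files[0]}" ] || return 0     # nothing matched, nothing to prune
  count=${#files[@]}
  for (( i = 0; i < count - keep; i++ )); do
    rm -f -- "${files[i]}"             # delete the oldest entries beyond the limit
  done
  return 0
}
```

Each backup type is pruned separately, for example `prune_cache ~/.cache/railiance/backups 'db-*.sql.age' 7` followed by the same call with `'config-*.tar.gz.age'`.
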
---

## Running a backup

```bash
bin/railiance backup
```

This runs two steps:

1. **PostgreSQL dump**: `pg_dump` from the running `infra-postgres-1`
   container, piped through `age`, uploaded as `db-<timestamp>.sql.age`.

2. **Config snapshot**: tar of `~/.claude/`, `~/.claude.json`, `~/.gitconfig`,
   encrypted with `age`, uploaded as `config-<timestamp>.tar.gz.age`.

A `.last-backup` stamp is written to the local cache; the preflight check
reads this to verify freshness.

**Automated:** a cron job runs the backup daily at 02:00:

```
0 2 * * * /home/worsch/railiance-cluster/bin/railiance backup >> ~/.cache/railiance/backup.log 2>&1
```

---

## Pre-migration preflight

Before touching any infrastructure, run:

```bash
bin/railiance preflight
```

Checks performed:

| Check | Pass condition |
|---|---|
| DB backup freshness | Latest `db-*.sql.age` is less than 24 hours old |
| Config backup freshness | Latest `config-*.tar.gz.age` is less than 24 hours old |
| Git repos clean | No uncommitted changes in any tracked repo |
| Git repos pushed | No unpushed commits in any tracked repo |
| age key present | `~/.config/age/railiance-backup.key` exists |

Exit 0 = safe to proceed. Exit 1 = do not proceed.

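The two freshness conditions in the table can be sketched as a single helper. This is an illustration under stated assumptions, not the real preflight logic: the function name `latest_is_fresh` is hypothetical, it judges age by the modification time of the newest cached file rather than by the `.last-backup` stamp, it assumes timestamped names sort chronologically, and it relies on GNU `stat`.

```bash
# Hypothetical helper: succeed if the newest file matching a glob is younger
# than max_age seconds (default 86400 = 24 hours).
latest_is_fresh() {
  local dir=$1 glob=$2 max_age=${3:-86400} files newest mtime now
  files=( "$dir"/$glob )                   # sorted by name, newest name last
  [ -e "${files[0]}" ] || return 1         # no backup at all counts as stale
  newest=${files[${#files[@]}-1]}
  mtime=$(stat -c %Y "$newest")            # modification time, epoch seconds (GNU stat)
  now=$(date +%s)
  [ $(( now - mtime )) -lt "$max_age" ]
}
```

A call such as `latest_is_fresh ~/.cache/railiance/backups 'db-*.sql.age'` maps directly onto the "DB backup freshness" row; the config row uses the `config-*.tar.gz.age` pattern.
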
---

## Restore procedure

Use this when recovering from hardware failure, WSL2 corruption, or
accidental data loss. Work through it in order: each step depends on
the previous one.

### Step 0 — Prerequisites

On a fresh Ubuntu / WSL2 instance, install the required tools:

```bash
sudo apt-get update && sudo apt-get install -y \
    git curl docker.io age postgresql-client
```

Start Docker:

```bash
sudo service docker start
```

### Step 1 — Retrieve the age private key

Copy the private key from your password manager into the machine:

```bash
mkdir -p ~/.config/age
# paste the key content:
cat > ~/.config/age/railiance-backup.key
# (paste, then Ctrl-D)
chmod 700 ~/.config/age && chmod 600 ~/.config/age/railiance-backup.key
```

### Step 2 — Get the backup files

Download the most recent backup files from Nextcloud. The file drop is
upload-only, so ask the Nextcloud admin for read access. Alternatively, copy
them from `~/.cache/railiance/backups/` if a local cache survived, for example
on a secondary machine.

Files needed:
- `db-<timestamp>.sql.age`
- `config-<timestamp>.tar.gz.age`

### Step 3 — Restore PostgreSQL

Start a fresh postgres container. This needs the `the-custodian` repository;
if it is not on the machine yet, clone it first (see Step 5).

```bash
cd ~/the-custodian/state-hub
cp infra/.env.example infra/.env   # fill in POSTGRES_PASSWORD
make db
```

Decrypt and restore the database dump:

```bash
age --decrypt \
    -i ~/.config/age/railiance-backup.key \
    db-<timestamp>.sql.age \
  | docker exec -i infra-postgres-1 psql -U custodian custodian
```

Verify row counts look sane:

```bash
docker exec infra-postgres-1 psql -U custodian custodian \
    -c "SELECT relname, n_live_tup FROM pg_stat_user_tables WHERE n_live_tup > 0 ORDER BY n_live_tup DESC;"
```

### Step 4 — Restore config files

```bash
age --decrypt \
    -i ~/.config/age/railiance-backup.key \
    config-<timestamp>.tar.gz.age \
  | tar -xz -C ~
```

This restores `~/.claude/`, `~/.claude.json`, and `~/.gitconfig`.

### Step 5 — Clone repositories

OAS Stack repos (S1–S5, per ADR-003):

```bash
git clone <gitea-url>/coulomb/railiance-infra.git ~/railiance-infra
git clone <gitea-url>/coulomb/railiance-cluster.git ~/railiance-cluster
git clone <gitea-url>/coulomb/railiance-platform.git ~/railiance-platform
git clone <gitea-url>/coulomb/railiance-enablement.git ~/railiance-enablement
git clone <gitea-url>/coulomb/railiance-apps.git ~/railiance-apps
```

Core and project repos:

```bash
git clone <gitea-url>/tegwick/the-custodian.git ~/the-custodian
git clone <gitea-url>/coulomb/markitect_project.git ~/markitect_project
git clone <gitea-url>/coulomb/activity-core.git ~/activity-core
git clone <gitea-url>/coulomb/net-kingdom.git ~/net-kingdom
# ... remaining repos as needed
```

If Gitea is offline and the local bare mirrors were set up (see T3), clone
from `~/.cache/railiance/git-mirrors/` instead.

### Step 6 — Register the MCP server

```bash
cd ~/the-custodian/state-hub
python3 scripts/patch_mcp_cwd.py
```

### Step 7 — Start the state hub and verify

```bash
cd ~/the-custodian/state-hub
make api &   # in background or a separate terminal
```

Smoke test to confirm the state hub is responding:

```bash
curl -sf http://127.0.0.1:8000/state/summary | python3 -m json.tool | head -20
```

And from Claude Code, confirm MCP tools are available:

```
bin/railiance preflight
```

---

## Restore drill (validation)

Run a restore drill before doing any major infrastructure work. The drill
validates that the procedure above actually works without waiting for a real
disaster.

A minimal drill that does not require a second machine:

```bash
# 1. Start a second postgres container on a different port
docker run -d --name restore-test \
  -e POSTGRES_DB=custodian \
  -e POSTGRES_USER=custodian \
  -e POSTGRES_PASSWORD=testpass \
  -p 5433:5432 \
  postgres:16-alpine

# Wait until the server accepts TCP connections (initialisation takes a few seconds)
until docker exec restore-test pg_isready -q -h 127.0.0.1 -U custodian; do sleep 1; done

# 2. Decrypt the newest cached dump and restore it
latest=$(ls ~/.cache/railiance/backups/db-*.sql.age | sort -r | head -1)
age --decrypt \
  -i ~/.config/age/railiance-backup.key \
  "$latest" \
  | docker exec -i restore-test psql -U custodian custodian

# 3. Check row counts
docker exec restore-test psql -U custodian custodian \
  -c "SELECT relname, n_live_tup FROM pg_stat_user_tables WHERE n_live_tup > 0 ORDER BY n_live_tup DESC;"

# 4. Clean up
docker rm -f restore-test
```

Record the drill completion as a dated log entry (preflight checks for this in T5):

```bash
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) restore drill OK" \
  >> ~/.cache/railiance/restore-drill.log
```

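A check against this log could look like the sketch below. It is only an illustration of what a preflight check might do: the function name `drill_is_recent` and the 30-day default threshold are assumptions, not something defined by this document or the railiance tooling, and the timestamp parsing relies on GNU `date`.

```bash
# Hypothetical helper: succeed if the newest "restore drill OK" entry in the log
# is younger than max_days (default 30, an assumed threshold).
drill_is_recent() {
  local log=$1 max_days=${2:-30} last ts now
  last=$(grep 'restore drill OK' "$log" 2>/dev/null | tail -n 1)
  [ -n "$last" ] || return 1                      # no log or no successful drill yet
  # The first field is the UTC timestamp written by the echo command above.
  ts=$(date -u -d "${last%% *}" +%s) || return 1  # GNU date parses ISO 8601 here
  now=$(date -u +%s)
  [ $(( (now - ts) / 86400 )) -lt "$max_days" ]
}
```

For example, `drill_is_recent ~/.cache/railiance/restore-drill.log 14` would require a successful drill within the last two weeks.
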
---

## Extension Points

### EP-RAIL-003 — Git bare-repo mirrors as secondary restore source

The current design relies on Gitea remotes for git repo recovery. If Gitea
is offline during a migration, repos can only be recovered from local working
copies (if they survive). A secondary bare-repo mirror (e.g., in a local
directory or on a NAS) would make git recovery independent of Gitea availability.

**Trigger:** when the Gitea server becomes a SPOF for restore operations (e.g.,
during ThreePhoenix migration work on the server that runs Gitea).

**Constraint:** mirrors must be updated on the same schedule as the DB backup;
stale mirrors provide false confidence.

### EP-RAIL-004 — Offsite secondary copy of encrypted backups

The current Nextcloud file drop is the only offsite copy. A second destination
(rclone to an S3-compatible store, or rsync to a NAS) would protect against
Nextcloud unavailability.

**Trigger:** when Nextcloud is not available or is itself hosted on the same
infrastructure being migrated.

**Constraint:** the second destination must also be write-only or similarly
access-controlled; duplicating to a readable location without additional
access controls widens the blast radius of a credential leak.