# Backup & Restore

Covers the current single-server development environment. This is the safety
net that must be operational before any infrastructure migration work begins.

---

## What is protected

| Asset | Location | Risk without backup |
|---|---|---|
| Custodian State Hub database | Docker volume `infra_pg_data` | Total loss of all workstreams, tasks, decisions, progress history |
| Claude config & memory | `~/.claude/`, `~/.claude.json` | Loss of project memory, MCP registration, settings |
| Git config | `~/.gitconfig` | Minor friction, recoverable |
| age private key | `~/.config/age/railiance-backup.key` | Cannot decrypt any existing backup |

Git repositories are **not** included in the backup — they are protected by
being pushed to Gitea remotes. The preflight check verifies this.

---

## Encryption

All backups are encrypted with [age](https://age-encryption.org/) before
leaving the machine.

**Key locations:**

| Copy | Location | Purpose |
|---|---|---|
| Operational | `~/.config/age/railiance-backup.key` | Used locally for restore drills |
| Recovery | Password manager | Used when the machine is gone |

Permissions: `chmod 700 ~/.config/age && chmod 600 ~/.config/age/railiance-backup.key`

The public key is hardcoded in `tools/cmd/railiance-backup`. To retrieve it:

```bash
grep "public key" ~/.config/age/railiance-backup.key
```

> **The password manager copy is the only key that survives hardware failure.**
> Verify it is there before doing any infrastructure work.

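The permissions listed above can be checked rather than assumed. The following is a minimal sketch and not part of `bin/railiance`: the function name `check_key_perms` is hypothetical, and `stat -c` assumes GNU coreutils (as on Ubuntu / WSL2).

```bash
# Hypothetical helper: succeed only if the age key directory is mode 700
# and the key file is mode 600.
# Example: check_key_perms ~/.config/age/railiance-backup.key
check_key_perms() {
  local key_file="${1:-$HOME/.config/age/railiance-backup.key}"
  local key_dir dir_mode file_mode
  key_dir="$(dirname "$key_file")"

  # A missing key is reported separately from wrong permissions.
  [ -f "$key_file" ] || { echo "missing key: $key_file" >&2; return 1; }

  dir_mode="$(stat -c '%a' "$key_dir")"
  file_mode="$(stat -c '%a' "$key_file")"

  if [ "$dir_mode" != "700" ] || [ "$file_mode" != "600" ]; then
    echo "unexpected permissions: dir=$dir_mode file=$file_mode" >&2
    return 1
  fi
}
```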

---

## Destination

Backups are uploaded to a Nextcloud file drop (upload-only, no credentials
required to write, cannot be read back without Nextcloud admin access). The
endpoint URL is stored locally in `wiki/260225-backup-dropoff-link.txt`
(gitignored).

Uploads use a direct HTTP PUT via curl — rclone is not used because Nextcloud
file drop links only permit PUT requests.

A local cache of the last 7 backups of each type is kept in
`~/.cache/railiance/backups/`.

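The "last 7 of each type" retention rule can be sketched as a small pruning function. This is an illustration only, not the code in `bin/railiance`: the name `prune_backups` is hypothetical, and it assumes the `<timestamp>` in each file name sorts chronologically as plain text and that file names contain no newlines.

```bash
# Hypothetical helper: keep only the newest KEEP files named PREFIX-*.age in DIR.
# Example, once per backup type:
#   prune_backups ~/.cache/railiance/backups db
#   prune_backups ~/.cache/railiance/backups config
prune_backups() {
  local dir="$1" prefix="$2" keep="${3:-7}"
  # Newest first; everything after the first KEEP entries is deleted.
  ls -1 "$dir"/"$prefix"-*.age 2>/dev/null | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
      rm -f -- "$old"
    done
}
```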

---

## Running a backup

```bash
bin/railiance backup
```

This runs two steps:

1. **PostgreSQL dump** — `pg_dump` from the running `infra-postgres-1`
   container, piped through `age`, uploaded as `db-<timestamp>.sql.age`.

2. **Config snapshot** — tar of `~/.claude/`, `~/.claude.json`, `~/.gitconfig`,
   encrypted with `age`, uploaded as `config-<timestamp>.tar.gz.age`.

A `.last-backup` stamp is written to the local cache; the preflight check
reads this to verify freshness.

**Automated:** a cron job runs the backup daily at 02:00:

```
0 2 * * * /home/worsch/railiance-cluster/bin/railiance backup >> ~/.cache/railiance/backup.log 2>&1
```

---

## Pre-migration preflight

Before touching any infrastructure, run:

```bash
bin/railiance preflight
```

Checks performed:

| Check | Pass condition |
|---|---|
| DB backup freshness | Latest `db-*.sql.age` is less than 24 hours old |
| Config backup freshness | Latest `config-*.tar.gz.age` is less than 24 hours old |
| Git repos clean | No uncommitted changes in any tracked repo |
| Git repos pushed | No unpushed commits in any tracked repo |
| age key present | `~/.config/age/railiance-backup.key` exists |

Exit 0 = safe to proceed. Exit 1 = do not proceed.

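The two freshness rows in the table can be approximated with a file-age test. This is a sketch under stated assumptions: the real preflight reads the `.last-backup` stamp, whereas this version looks at file modification times in the local cache; the name `has_fresh_backup` is hypothetical, and `-quit` assumes GNU find.

```bash
# Hypothetical helper: succeed if DIR holds a regular file matching PATTERN
# that was modified less than MAX_HOURS ago.
# Example: has_fresh_backup ~/.cache/railiance/backups 'db-*.sql.age' 24
has_fresh_backup() {
  local dir="$1" pattern="$2" max_hours="${3:-24}"
  local hit
  # -mmin -N matches files modified less than N minutes ago;
  # -quit stops at the first match.
  hit="$(find "$dir" -maxdepth 1 -type f -name "$pattern" \
           -mmin "-$((max_hours * 60))" -print -quit)"
  [ -n "$hit" ]
}
```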

---

## Restore procedure

Use this when recovering from hardware failure, WSL2 corruption, or
accidental data loss. Work through it in order — each step depends on
the previous one.

### Step 0 — Prerequisites

On a fresh Ubuntu / WSL2 instance, install the required tools:

```bash
sudo apt-get update && sudo apt-get install -y \
  git curl docker.io age postgresql-client
```

Start Docker:

```bash
sudo service docker start
```

### Step 1 — Retrieve the age private key

Copy the private key from your password manager into the machine:

```bash
mkdir -p ~/.config/age
# paste the key content:
cat > ~/.config/age/railiance-backup.key
# (paste, then Ctrl-D)
chmod 700 ~/.config/age && chmod 600 ~/.config/age/railiance-backup.key
```

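A botched paste is easier to catch now than during decryption. The sketch below only checks that the file contains a line in the form age uses for secret keys (`AGE-SECRET-KEY-1…`); it does not prove the key matches the backups. The function name `looks_like_age_key` is hypothetical.

```bash
# Hypothetical helper: succeed if FILE contains at least one age identity line.
looks_like_age_key() {
  local key_file="$1"
  grep -q '^AGE-SECRET-KEY-1' "$key_file" 2>/dev/null
}
```

For a stronger check, `age-keygen -y ~/.config/age/railiance-backup.key` prints the public key derived from the identity, which can be compared with the one hardcoded in `tools/cmd/railiance-backup`.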
### Step 2 — Get the backup files

Download the most recent backup files from Nextcloud (ask the Nextcloud admin
for read access, or retrieve from `~/.cache/railiance/backups/` on a
secondary machine if the local cache survived).

Files needed:
- `db-<timestamp>.sql.age`
- `config-<timestamp>.tar.gz.age`

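When several backups are available in one directory, picking the most recent file of each type can be scripted. This is a minimal sketch: `latest_backup` is a hypothetical name, and it assumes the `<timestamp>` part of the file name sorts chronologically as plain text.

```bash
# Hypothetical helper: print the newest file in DIR named PREFIX-<timestamp>.*.age.
# Returns non-zero and prints nothing if no such file exists.
latest_backup() {
  local dir="$1" prefix="$2" newest
  newest="$(ls -1 "$dir"/"$prefix"-*.age 2>/dev/null | sort -r | head -n 1)"
  [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

For example, `db_file="$(latest_backup ~/.cache/railiance/backups db)"` stores the path of the newest database dump for use in the next step.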
### Step 3 — Restore PostgreSQL

Start a fresh postgres container:

```bash
cd ~/the-custodian/state-hub
cp infra/.env.example infra/.env   # fill in POSTGRES_PASSWORD
make db
```

Decrypt and restore the database dump:

```bash
age --decrypt \
  -i ~/.config/age/railiance-backup.key \
  db-<timestamp>.sql.age \
  | docker exec -i infra-postgres-1 psql -U custodian custodian
```

Verify row counts look sane:

```bash
docker exec infra-postgres-1 psql -U custodian custodian \
  -c "SELECT relname, n_live_tup FROM pg_stat_user_tables WHERE n_live_tup > 0 ORDER BY n_live_tup DESC;"
```

### Step 4 — Restore config files

```bash
age --decrypt \
  -i ~/.config/age/railiance-backup.key \
  config-<timestamp>.tar.gz.age \
  | tar -xz -C ~
```

This restores `~/.claude/`, `~/.claude.json`, and `~/.gitconfig`.

### Step 5 — Clone repositories

```bash
git clone <gitea-url>/coulomb/railiance-bootstrap.git ~/railiance-bootstrap
git clone <gitea-url>/tegwick/the-custodian.git ~/the-custodian
git clone <gitea-url>/coulomb/markitect_project.git ~/markitect_project
# ... remaining repos as needed
```

If Gitea is offline, clone from the local bare mirrors in
`~/.cache/railiance/git-mirrors/` if they were set up (see T3).

### Step 6 — Register the MCP server

```bash
cd ~/the-custodian/state-hub
python3 scripts/patch_mcp_cwd.py
```

### Step 7 — Start the state hub and verify

```bash
cd ~/the-custodian/state-hub
make api &   # in background or a separate terminal
```

Smoke test — confirm state hub is responding:

```bash
curl -sf http://127.0.0.1:8000/state/summary | python3 -m json.tool | head -20
```

And from Claude Code, confirm MCP tools are available:

```
bin/railiance preflight
```

---

## Restore drill (validation)

Run a restore drill before doing any major infrastructure work. The drill
validates that the procedure above actually works without waiting for a real
disaster.

A minimal drill that does not require a second machine:

```bash
# 1. Start a second postgres container on a different port
docker run -d --name restore-test \
  -e POSTGRES_DB=custodian \
  -e POSTGRES_USER=custodian \
  -e POSTGRES_PASSWORD=testpass \
  -p 5433:5432 \
  postgres:16-alpine

# Wait until the new server accepts TCP connections before restoring
until docker exec restore-test pg_isready -q -h 127.0.0.1 -U custodian -d custodian; do
  sleep 1
done

# 2. Decrypt the newest cached dump and restore it
latest="$(ls ~/.cache/railiance/backups/db-*.sql.age | sort -r | head -1)"
age --decrypt \
  -i ~/.config/age/railiance-backup.key \
  "$latest" \
  | docker exec -i restore-test psql -U custodian custodian

# 3. Check row counts
docker exec restore-test psql -U custodian custodian \
  -c "SELECT relname, n_live_tup FROM pg_stat_user_tables WHERE n_live_tup > 0 ORDER BY n_live_tup DESC;"

# 4. Clean up
docker rm -f restore-test
```

Record the drill completion with a dated log entry (preflight checks for this in T5):

```bash
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) restore drill OK" \
  >> ~/.cache/railiance/restore-drill.log
```

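A check against this log could look like the sketch below. It is an assumption about what such a check might do, not the T5 implementation: the name `drill_is_recent` and the 30-day default are hypothetical, and `date -d` assumes GNU date.

```bash
# Hypothetical helper: succeed if the last entry in LOG is less than MAX_DAYS old.
# Entries are expected in the form "2026-01-01T02:00:00Z restore drill OK".
# Example: drill_is_recent ~/.cache/railiance/restore-drill.log 30
drill_is_recent() {
  local log="$1" max_days="${2:-30}"
  local last_ts last_epoch now_epoch

  # No log, or an empty log, means no drill has been recorded.
  [ -s "$log" ] || return 1

  last_ts="$(tail -n 1 "$log" | cut -d' ' -f1)"
  last_epoch="$(date -u -d "$last_ts" +%s 2>/dev/null)" || return 1
  now_epoch="$(date -u +%s)"

  [ "$(( (now_epoch - last_epoch) / 86400 ))" -lt "$max_days" ]
}
```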

---

## Extension Points

### EP-RAIL-003 — Git bare-repo mirrors as secondary restore source

The current design relies on Gitea remotes for git repo recovery. If Gitea
is offline during a migration, repos can only be recovered from local working
copies (if they survive). A secondary bare-repo mirror (e.g., in a local
directory or on a NAS) would make git recovery independent of Gitea availability.

**Trigger:** when the Gitea server becomes a SPOF for restore operations (e.g.,
during ThreePhoenix migration work on the server that runs Gitea).

**Constraint:** mirrors must be updated on the same schedule as the DB backup;
stale mirrors provide false confidence.

### EP-RAIL-004 — Offsite secondary copy of encrypted backups

The current Nextcloud file drop is the only offsite copy. A second destination
(rclone to an S3-compatible store, or rsync to a NAS) would protect against
Nextcloud unavailability.

**Trigger:** when Nextcloud is not available or is itself hosted on the same
infrastructure being migrated.

**Constraint:** the second destination must also be write-only or similarly
access-controlled; duplicating to a readable location without additional
access controls widens the blast radius of a credential leak.