railiance-cluster/docs/backup-restore.md
tegwick 75467673a8 feat(safety-net): create WP-0004, update preflight for OAS 5-repo layout
- workplans/RAIL-BS-WP-0004-safety-net.md: ADR-001 workplan file for
  current-env-safety-net workstream (7e8b0c20), T01-T04 done, T05-T06 todo
- tools/cmd/railiance-preflight: update REPOS to OAS S1-S5 stack
  (railiance-infra/cluster/platform/enablement/apps) + project repos;
  remove stale railiance-bootstrap reference
- docs/backup-restore.md: fix Step 5 clone commands to current repo names
- Makefile: add make backup and make preflight targets

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 15:21:29 +01:00

# Backup & Restore
Covers the current single-server development environment. This is the safety
net that must be operational before any infrastructure migration work begins.

---
## What is protected
| Asset | Location | Risk without backup |
|---|---|---|
| Custodian State Hub database | Docker volume `infra_pg_data` | Total loss of all workstreams, tasks, decisions, progress history |
| Claude config & memory | `~/.claude/`, `~/.claude.json` | Loss of project memory, MCP registration, settings |
| Git config | `~/.gitconfig` | Minor friction, recoverable |
| age private key | `~/.config/age/railiance-backup.key` | Cannot decrypt any existing backup |

Git repositories are **not** included in the backup; they are protected by
being pushed to Gitea remotes. The preflight check verifies this.

---
## Encryption
All backups are encrypted with [age](https://age-encryption.org/) before
leaving the machine.
**Key locations:**
| Copy | Location | Purpose |
|---|---|---|
| Operational | `~/.config/age/railiance-backup.key` | Used locally for restore drills |
| Recovery | Password manager | Used when the machine is gone |

Permissions: `chmod 700 ~/.config/age && chmod 600 ~/.config/age/railiance-backup.key`
The public key is hardcoded in `tools/cmd/railiance-backup`. To retrieve it:
```bash
grep "public key" ~/.config/age/railiance-backup.key
```
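The `grep` above prints the whole comment line. When only the recipient string is needed (for example, to compare it with the value hardcoded in the backup tool), it can be cut out of that line. A minimal sketch, assuming the key file still carries the `# public key: ...` comment that `age-keygen` writes; the function name `age_public_key` is hypothetical. If the comment has been stripped, `age-keygen -y <keyfile>` derives the recipient from the private key instead.

```bash
# Print only the recipient (public key) recorded in an age identity file.
# Relies on the "# public key: ..." comment line written by age-keygen.
age_public_key() {
  sed -n 's/^# public key: //p' "$1"
}

# Usage:
#   age_public_key ~/.config/age/railiance-backup.key
```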
> **The password manager copy is the only key that survives hardware failure.**
> Verify it is there before doing any infrastructure work.
---
## Destination
Backups are uploaded to a Nextcloud file drop (upload-only, no credentials
required to write, cannot be read back without Nextcloud admin access). The
endpoint URL is stored locally in `wiki/260225-backup-dropoff-link.txt`
(gitignored).
Uploads use a direct HTTP PUT via curl — rclone is not used because Nextcloud
file drop links only permit PUT requests.
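A PUT upload of this kind might look like the sketch below. It is an illustration, not the code in `tools/cmd/railiance-backup`: the function names are hypothetical, and the assumption that the drop accepts a PUT to `<stored-url>/<filename>` without extra headers should be checked against the actual endpoint.

```bash
# Build the PUT target for one file: <drop-url>/<basename of file>.
# Assumption: the file drop accepts PUT to a path directly under the stored URL.
drop_target() {
  printf '%s/%s\n' "${1%/}" "$(basename "$2")"
}

# Upload one encrypted file with HTTP PUT (curl -T). --fail turns HTTP
# errors into a non-zero exit status so a rejected upload is noticed.
# $1 = file containing the drop URL, $2 = file to upload.
upload_backup() {
  local url file
  url=$(cat "$1") || return 1
  file=$2
  curl --fail --silent --show-error -T "$file" "$(drop_target "$url" "$file")"
}

# Usage:
#   upload_backup wiki/260225-backup-dropoff-link.txt db-<timestamp>.sql.age
```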
A local cache of the last 7 backups of each type is kept in
`~/.cache/railiance/backups/`.
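The retention rule can be sketched as a small pruning helper. This is an assumption about how the cache could be trimmed, not the tool's actual code; the name `prune_backups` is hypothetical, and it relies on the timestamps in the file names sorting lexicographically from oldest to newest.

```bash
# Keep the newest $3 files (default 7) matching <dir>/<prefix>-* and
# delete the rest. Sorting is by file name, newest name first.
prune_backups() {
  local dir prefix keep
  dir=$1; prefix=$2; keep=${3:-7}
  ls -1 "$dir"/"$prefix"-* 2>/dev/null | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
      rm -f -- "$old"
    done
}

# Usage:
#   prune_backups ~/.cache/railiance/backups db
#   prune_backups ~/.cache/railiance/backups config
```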
---
## Running a backup
```bash
bin/railiance backup
```
This runs two steps:
1. **PostgreSQL dump**: `pg_dump` from the running `infra-postgres-1`
   container, piped through `age`, uploaded as `db-<timestamp>.sql.age`.
2. **Config snapshot**: tar of `~/.claude/`, `~/.claude.json`, `~/.gitconfig`,
   encrypted with `age`, uploaded as `config-<timestamp>.tar.gz.age`.
A `.last-backup` stamp is written to the local cache; the preflight check
reads this to verify freshness.
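For orientation, the database step and the stamp could be sketched as follows. This is not the implementation in `tools/cmd/railiance-backup`: the function names, the timestamp format, and the content of the `.last-backup` stamp are assumptions, and the upload is left out.

```bash
# Name for a new artifact: <kind>-<UTC timestamp>.<ext>.age
# Assumption: the timestamp format; the real tool may use another.
backup_name() {
  printf '%s-%s.%s.age\n' "$1" "$(date -u +%Y%m%dT%H%M%SZ)" "$2"
}

# Dump, encrypt, and stamp. The body runs in a subshell so that
# pipefail (a failed pg_dump fails the whole pipeline) stays local.
# $1 = age recipient (public key), $2 = cache directory.
backup_db() (
  set -o pipefail
  out="$2/$(backup_name db sql)"
  docker exec infra-postgres-1 pg_dump -U custodian custodian |
    age -r "$1" > "$out" &&
    date -u +%Y-%m-%dT%H:%M:%SZ > "$2/.last-backup"
)

# Usage:
#   backup_db "<age-public-key>" ~/.cache/railiance/backups
```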
**Automated:** a cron job runs the backup daily at 02:00:
```
0 2 * * * /home/worsch/railiance-cluster/bin/railiance backup >> ~/.cache/railiance/backup.log 2>&1
```
---
## Pre-migration preflight
Before touching any infrastructure, run:
```bash
bin/railiance preflight
```
Checks performed:
| Check | Pass condition |
|---|---|
| DB backup freshness | Latest `db-*.sql.age` is less than 24 hours old |
| Config backup freshness | Latest `config-*.tar.gz.age` is less than 24 hours old |
| Git repos clean | No uncommitted changes in any tracked repo |
| Git repos pushed | No unpushed commits in any tracked repo |
| age key present | `~/.config/age/railiance-backup.key` exists |

Exit 0 = safe to proceed. Exit 1 = do not proceed.
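The freshness and git checks can be approximated by hand when the preflight tool itself is unavailable. A minimal sketch with hypothetical function names; note that it judges freshness by file modification time, whereas the real check reads the `.last-backup` stamp.

```bash
# Succeed if at least one file matching <dir>/<pattern> was modified
# within the last $3 minutes (default 1440 = 24 hours).
backup_is_fresh() {
  find "$1" -maxdepth 1 -name "$2" -mmin -"${3:-1440}" | grep -q .
}

# Succeed if the repo has no uncommitted changes and no local commits
# that are missing from every remote.
repo_is_safe() {
  [ -z "$(git -C "$1" status --porcelain)" ] &&
    [ -z "$(git -C "$1" log --branches --not --remotes --oneline)" ]
}

# Usage:
#   backup_is_fresh ~/.cache/railiance/backups 'db-*.sql.age' || exit 1
#   repo_is_safe ~/railiance-cluster || exit 1
```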
---
## Restore procedure
Use this when recovering from hardware failure, WSL2 corruption, or
accidental data loss. Work through it in order — each step depends on
the previous one.
### Step 0 — Prerequisites
On a fresh Ubuntu / WSL2 instance, install the required tools:
```bash
sudo apt-get update && sudo apt-get install -y \
git curl docker.io age postgresql-client
```
Start Docker:
```bash
sudo service docker start
```
### Step 1 — Retrieve the age private key
Copy the private key from your password manager into the machine:
```bash
mkdir -p ~/.config/age
# paste the key content:
cat > ~/.config/age/railiance-backup.key
# (paste, then Ctrl-D)
chmod 700 ~/.config/age && chmod 600 ~/.config/age/railiance-backup.key
```
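Before moving on, it is worth confirming that the paste produced a non-empty file with the intended mode. A minimal sketch; the function name `key_is_private` is hypothetical, and `stat -c` is the GNU form found on Ubuntu.

```bash
# Succeed only if the key file exists, is non-empty, and has mode 600.
key_is_private() {
  [ -s "$1" ] && [ "$(stat -c %a "$1")" = "600" ]
}

# Usage:
#   key_is_private ~/.config/age/railiance-backup.key || echo "check the key file"
```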
### Step 2 — Get the backup files
Download the most recent backup files from Nextcloud (ask the Nextcloud admin
for read access, or retrieve from `~/.cache/railiance/backups/` on a
secondary machine if the local cache survived).
Files needed:
- `db-<timestamp>.sql.age`
- `config-<timestamp>.tar.gz.age`
### Step 3 — Restore PostgreSQL
Start a fresh postgres container:
```bash
cd ~/the-custodian/state-hub
cp infra/.env.example infra/.env # fill in POSTGRES_PASSWORD
make db
```
Decrypt and restore the database dump:
```bash
age --decrypt \
-i ~/.config/age/railiance-backup.key \
db-<timestamp>.sql.age \
| docker exec -i infra-postgres-1 psql -U custodian custodian
```
Verify row counts look sane:
```bash
docker exec infra-postgres-1 psql -U custodian custodian \
-c "SELECT relname, n_live_tup FROM pg_stat_user_tables WHERE n_live_tup > 0 ORDER BY n_live_tup DESC;"
```
### Step 4 — Restore config files
```bash
age --decrypt \
-i ~/.config/age/railiance-backup.key \
config-<timestamp>.tar.gz.age \
| tar -xz -C ~
```
This restores `~/.claude/`, `~/.claude.json`, and `~/.gitconfig`.
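A quick way to confirm the extraction is to check that the three expected paths now exist. A minimal sketch; the function name `check_restored_config` is hypothetical.

```bash
# Report which of the expected paths are missing under a home directory.
# Prints one line per missing path; succeeds only if nothing is missing.
check_restored_config() {
  local home missing p
  home=$1; missing=0
  for p in .claude .claude.json .gitconfig; do
    if [ ! -e "$home/$p" ]; then
      echo "missing: $home/$p"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
}

# Usage:
#   check_restored_config ~
```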
### Step 5 — Clone repositories
OAS Stack repos (S1-S5, per ADR-003):
```bash
git clone <gitea-url>/coulomb/railiance-infra.git ~/railiance-infra
git clone <gitea-url>/coulomb/railiance-cluster.git ~/railiance-cluster
git clone <gitea-url>/coulomb/railiance-platform.git ~/railiance-platform
git clone <gitea-url>/coulomb/railiance-enablement.git ~/railiance-enablement
git clone <gitea-url>/coulomb/railiance-apps.git ~/railiance-apps
```
Core and project repos:
```bash
git clone <gitea-url>/tegwick/the-custodian.git ~/the-custodian
git clone <gitea-url>/coulomb/markitect_project.git ~/markitect_project
git clone <gitea-url>/coulomb/activity-core.git ~/activity-core
git clone <gitea-url>/coulomb/net-kingdom.git ~/net-kingdom
# ... remaining repos as needed
```
If Gitea is offline, clone from the local bare mirrors in
`~/.cache/railiance/git-mirrors/` if they were set up (see T3).
### Step 6 — Register the MCP server
```bash
cd ~/the-custodian/state-hub
python3 scripts/patch_mcp_cwd.py
```
### Step 7 — Start the state hub and verify
```bash
cd ~/the-custodian/state-hub
make api & # in background or a separate terminal
```
Smoke test — confirm state hub is responding:
```bash
curl -sf http://127.0.0.1:8000/state/summary | python3 -m json.tool | head -20
```
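Because `make api` is started in the background, the first request can arrive before the server is listening. A small retry wrapper avoids a false failure; this is a generic sketch, and the name `retry` and the 15 x 2 s budget are assumptions.

```bash
# Run a command up to $1 times, sleeping $2 seconds between attempts.
# Succeeds as soon as the command succeeds; fails after the last attempt.
retry() {
  local attempts delay i
  attempts=$1; delay=$2; shift 2
  i=1
  while ! "$@"; do
    [ "$i" -ge "$attempts" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
}

# Usage: wait up to about 30 s for the state hub before giving up.
#   retry 15 2 curl -sf -o /dev/null http://127.0.0.1:8000/state/summary
```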
And from Claude Code, confirm MCP tools are available:
```
bin/railiance preflight
```
---
## Restore drill (validation)
Run a restore drill before doing any major infrastructure work. The drill
validates that the procedure above actually works without waiting for a real
disaster.
A minimal drill that does not require a second machine:
```bash
# 1. Start a second postgres container on a different port
docker run -d --name restore-test \
-e POSTGRES_DB=custodian \
-e POSTGRES_USER=custodian \
-e POSTGRES_PASSWORD=testpass \
-p 5433:5432 \
postgres:16-alpine
# 2. Wait until postgres accepts TCP connections, then decrypt and restore
#    the newest local dump
until docker exec restore-test pg_isready -h 127.0.0.1 -U custodian -d custodian; do sleep 1; done
age --decrypt \
  -i ~/.config/age/railiance-backup.key \
  "$(ls ~/.cache/railiance/backups/db-*.sql.age | sort -r | head -1)" \
  | docker exec -i restore-test psql -U custodian custodian
# 3. Check row counts
docker exec restore-test psql -U custodian custodian \
-c "SELECT relname, n_live_tup FROM pg_stat_user_tables WHERE n_live_tup > 0 ORDER BY n_live_tup DESC;"
# 4. Clean up
docker rm -f restore-test
```
Record the drill completion with a dated log entry (preflight checks for this in T5):
```bash
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) restore drill OK" \
>> ~/.cache/railiance/restore-drill.log
```
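A check on the age of the last drill could read the timestamp back from that log. A minimal sketch, not the T5 implementation: the function name `drill_is_recent` and the 30-day default are assumptions, and `date -d` is the GNU form.

```bash
# Succeed if the last line of the drill log is newer than $2 days
# (default 30). Expects lines of the form "<ISO-8601 UTC> restore drill OK".
drill_is_recent() {
  local stamp last now
  stamp=$(tail -n 1 "$1" 2>/dev/null | cut -d' ' -f1)
  [ -n "$stamp" ] || return 1
  last=$(date -u -d "$stamp" +%s) || return 1
  now=$(date -u +%s)
  [ $(( (now - last) / 86400 )) -lt "${2:-30}" ]
}

# Usage:
#   drill_is_recent ~/.cache/railiance/restore-drill.log || echo "drill overdue"
```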
---
## Extension Points
### EP-RAIL-003 — Git bare-repo mirrors as secondary restore source
The current design relies on Gitea remotes for git repo recovery. If Gitea
is offline during a migration, repos can only be recovered from local working
copies (if they survive). A secondary bare-repo mirror (e.g., in a local
directory or on a NAS) would make git recovery independent of Gitea availability.
**Trigger:** when the Gitea server becomes a SPOF for restore operations (e.g.,
during ThreePhoenix migration work on the server that runs Gitea).
**Constraint:** mirrors must be updated on the same schedule as the DB backup;
stale mirrors provide false confidence.
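One possible shape for such a mirror job is sketched below. Nothing here exists in the repository yet: the function names are hypothetical, and the mirror root is taken from the path mentioned in Step 5.

```bash
# Directory for a mirror: <mirror-root>/<repo>.git
mirror_path() {
  printf '%s/%s.git\n' "${1%/}" "$(basename "$2" .git)"
}

# Create the bare mirror on first run, refresh it on later runs.
# $1 = mirror root directory, $2 = remote URL of the repository.
sync_mirror() {
  local dest
  dest=$(mirror_path "$1" "$2")
  if [ -d "$dest" ]; then
    git -C "$dest" remote update --prune
  else
    git clone --mirror "$2" "$dest"
  fi
}

# Usage, e.g. scheduled alongside the daily backup:
#   sync_mirror ~/.cache/railiance/git-mirrors <gitea-url>/coulomb/railiance-cluster.git
```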
### EP-RAIL-004 — Offsite secondary copy of encrypted backups
The current Nextcloud file drop is the only offsite copy. A second destination
(rclone to an S3-compatible store, or rsync to a NAS) would protect against
Nextcloud unavailability.
**Trigger:** when Nextcloud is not available or is itself hosted on the same
infrastructure being migrated.
**Constraint:** the second destination must also be write-only or similarly
access-controlled; duplicating to a readable location without additional
access controls widens the blast radius of a credential leak.