feat(ops+workplans): fix tunnel targets, plan custodian migration, close legacy ADR-001 gaps

Tunnel (state-hub/Makefile): - Replace interactive `make tunnel` (now non-blocking with -N flag) - Add tunnel-daemon (autossh background), tunnel-loop (reconnect fallback), tunnel-status, tunnel-stop - Default COULOMBCORE=tegwick@92.205.130.254; TUNNEL_PORT configurable - Clarified server topology: COULOMBCORE=92.205.130.254 (old), Railiance01=92.205.62.239 (ThreePhoenix node 1) Workplans: - CUST-WP-0011: Migrate Custodian State Hub to ThreePhoenix cluster — 9-task plan with hard pre-condition gates (3-node cluster, Longhorn HA, backup drill), data migration, 2-week stabilisation, WSL2 retirement - CUST-WP-0000: Retroactive record for state-hub v0.1 (pre-ADR-001) - CUST-WP-0000b: Retroactive record for state-hub v0.2 (pre-ADR-001) Consistency: repo now ✓ PASS (0 fail, 18 warn — all pre-ADR-001 C-12 history) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-11 01:09:07 +01:00
parent 678512c6e4
commit 890b2f9fc7
4 changed files with 487 additions and 6 deletions
--- a/state-hub/Makefile
+++ b/state-hub/Makefile
@@ -1,4 +1,4 @@
-.PHONY: install install-cli db db-tools migrate seed api dashboard check start clean register-project validate-adr add-domain rename-domain add-repo list-repos cleanup-stale
+.PHONY: install install-cli db db-tools migrate seed api dashboard check start clean register-project validate-adr add-domain rename-domain add-repo list-repos cleanup-stale tunnel tunnel-daemon tunnel-loop tunnel-status tunnel-stop

 COMPOSE = docker compose -f infra/docker-compose.yml --env-file .env

@@ -34,14 +34,65 @@ dashboard:
 check:
 	curl -sf http://127.0.0.1:8000/state/health | python3 -m json.tool

-## Open a reverse SSH tunnel so a remote host can reach the local State Hub.
-## Usage: make tunnel HOST=user@hostname
-## The remote host will then reach the hub at http://127.0.0.1:8000
+## COULOMBCORE host (default target for tunnel targets)
+COULOMBCORE ?= tegwick@92.205.130.254
+TUNNEL_PORT ?= 8000
+
+## Foreground reverse tunnel — good for debugging. Ctrl-C to stop.
+## Usage: make tunnel HOST=tegwick@92.205.130.254
 tunnel:
 	@test -n "$(HOST)" || (echo "ERROR: HOST is required. Usage: make tunnel HOST=user@hostname"; exit 1)
-	@echo "Opening reverse tunnel → $(HOST) (remote :8000 → local :8000)"
+	@echo "Opening reverse tunnel → $(HOST) (remote :$(TUNNEL_PORT) → local :$(TUNNEL_PORT))"
 	@echo "Keep this terminal open. Ctrl-C to close the tunnel."
-	ssh -R 8000:127.0.0.1:8000 $(HOST)
+	ssh -N -o "ServerAliveInterval=30" -o "ServerAliveCountMax=3" \
+	  -R $(TUNNEL_PORT):127.0.0.1:$(TUNNEL_PORT) $(HOST)
+
+## Background tunnel to COULOMBCORE with auto-reconnect.
+## Uses autossh if available; prints install hint and exits if not.
+## After running, COULOMBCORE can reach the State Hub at http://127.0.0.1:8000
+tunnel-daemon:
+	@if command -v autossh >/dev/null 2>&1; then \
+	  echo "Starting autossh tunnel → $(COULOMBCORE)"; \
+	  autossh -f -N -M 0 \
+	    -o "ServerAliveInterval=30" \
+	    -o "ServerAliveCountMax=3" \
+	    -o "ExitOnForwardFailure=yes" \
+	    -R $(TUNNEL_PORT):127.0.0.1:$(TUNNEL_PORT) $(COULOMBCORE); \
+	  echo "Tunnel running in background. Use 'make tunnel-status' to check."; \
+	else \
+	  echo "autossh not found — install it: sudo apt-get install autossh"; \
+	  echo "Fallback: run 'make tunnel-loop HOST=$(COULOMBCORE)' in a dedicated terminal."; \
+	  exit 1; \
+	fi
+
+## Reconnect loop — works without autossh. Run in a terminal you can leave open.
+## Usage: make tunnel-loop HOST=tegwick@92.205.130.254
+tunnel-loop:
+	@test -n "$(HOST)" || (echo "ERROR: HOST is required. Usage: make tunnel-loop HOST=user@hostname"; exit 1)
+	@echo "Reconnect loop → $(HOST). Ctrl-C to stop."
+	@while true; do \
+	  echo "[$(shell date -u +%Y-%m-%dT%H:%M:%SZ)] Connecting..."; \
+	  ssh -N -o "ServerAliveInterval=30" -o "ServerAliveCountMax=3" \
+	    -o "ExitOnForwardFailure=yes" \
+	    -R $(TUNNEL_PORT):127.0.0.1:$(TUNNEL_PORT) $(HOST) || true; \
+	  echo "[$(shell date -u +%Y-%m-%dT%H:%M:%SZ)] Connection lost — retrying in 5s..."; \
+	  sleep 5; \
+	done
+
+## Check whether a tunnel is currently active
+tunnel-status:
+	@if command -v autossh >/dev/null 2>&1 && pgrep -f "autossh.*$(TUNNEL_PORT)" > /dev/null 2>&1; then \
+	  echo "autossh tunnel: RUNNING (PIDs: $$(pgrep -f 'autossh.*$(TUNNEL_PORT)' | tr '\n' ' '))"; \
+	elif pgrep -f "ssh.*-R $(TUNNEL_PORT)" > /dev/null 2>&1; then \
+	  echo "ssh tunnel: RUNNING (PIDs: $$(pgrep -f 'ssh.*-R $(TUNNEL_PORT)' | tr '\n' ' '))"; \
+	else \
+	  echo "Tunnel: NOT running"; \
+	fi
+
+## Stop any active tunnel (autossh or plain ssh)
+tunnel-stop:
+	@pkill -f "autossh.*$(TUNNEL_PORT)" 2>/dev/null && echo "autossh stopped" || true
+	@pkill -f "ssh.*-R $(TUNNEL_PORT)" 2>/dev/null && echo "ssh loop stopped" || true

 start: db
 	sleep 3
--- a/workplans/CUST-WP-0000-state-hub-v0.1-build-deploy.md
+++ b/workplans/CUST-WP-0000-state-hub-v0.1-build-deploy.md
@@ -0,0 +1,42 @@
+---
+id: CUST-WP-0000
+type: workplan
+title: "State Hub v0.1 — Build & Deploy"
+domain: custodian
+repo: the-custodian
+status: completed
+owner: custodian
+topic_slug: custodian
+created: "2026-02-24"
+updated: "2026-02-24"
+completed: "2026-02-24"
+state_hub_workstream_id: "2b0efa54-0209-4ca9-8ab3-30dfbdb991b0"
+note: >
+  Pre-ADR-001 record. This workstream was created DB-first during the first
+  Custodian session (2026-02-24) before the workplan-as-repository-artefact
+  convention was established. This file is a retroactive record written on
+  2026-03-11 to satisfy the ADR-001 consistency checker (C-08).
+---
+
+# State Hub v0.1 — Build & Deploy
+
+## What was built
+
+The first live implementation layer of the Custodian system, delivered in the
+initial session on 2026-02-24:
+
+- PostgreSQL schema (topics, workstreams, tasks, decisions, progress_events)
+- FastAPI app with routers for all entities + `/state/summary`
+- FastMCP stdio server (11 tools, 5 resources/templates)
+- Observable Framework dashboard (4 pages: index, workstreams, decisions, progress)
+- Docker Compose for local PostgreSQL
+- Alembic migration `0001_initial_schema`
+- Seed script inserting 6 canonical topics
+- `.mcp.json` at repo root for Claude Code discovery
+- `make register-project` automation for onboarding domain repos
+
+## References
+
+- Commit range: initial state-hub implementation (2026-02-24)
+- Superseded by: CUST-WP-0000 (this file) covers only v0.1 baseline;
+  subsequent features tracked in CUST-WP-0001 onward
--- a/workplans/CUST-WP-0000b-state-hub-v0.2-decisions-dependencies.md
+++ b/workplans/CUST-WP-0000b-state-hub-v0.2-decisions-dependencies.md
@@ -0,0 +1,42 @@
+---
+id: CUST-WP-0000b
+type: workplan
+title: "State Hub v0.2 — Decisions, Suggestions & Dependencies"
+domain: custodian
+repo: the-custodian
+status: completed
+owner: custodian
+topic_slug: custodian
+created: "2026-02-25"
+updated: "2026-02-25"
+completed: "2026-02-25"
+state_hub_workstream_id: "6585ee66-aa4e-436e-bbec-d83293c33e8f"
+note: >
+  Pre-ADR-001 record. This workstream was created DB-first before the
+  workplan-as-repository-artefact convention was established. Retroactive
+  file written on 2026-03-11 to satisfy the ADR-001 consistency checker (C-08).
+---
+
+# State Hub v0.2 — Decisions, Suggestions & Dependencies
+
+## What was built
+
+Delivered 2026-02-25, evolving the hub from a state tracker to an active
+coordination layer:
+
+- `WorkstreamDependency` model + migration `0b547c153153` — directed
+  dependency graph between workstreams
+- API: `POST/GET /workstreams/{id}/dependencies/`,
+  `DELETE /workstreams/{id}/dependencies/{dep_id}`
+- API: `GET /state/next_steps` — derived next-action suggestions (never persisted)
+- `StateSummary` extended with `next_steps` and `depends_on`/`blocks` on workstreams
+- Design boundary formalised: hub is a read model with exactly two write use
+  cases — resolving decisions and suggesting next steps
+- MCP: `get_next_steps()` tool added
+- `scripts/script.py.mako` added (required for Alembic autogenerate)
+
+## References
+
+- Alembic migration: `0b547c153153`
+- Design boundary document: `canon/architecture/` (hub as read model)
+- CLAUDE.md global + railiance updated with `get_next_steps()` in session start
--- a/workplans/CUST-WP-0011-state-hub-threephoenix-migration.md
+++ b/workplans/CUST-WP-0011-state-hub-threephoenix-migration.md
@@ -0,0 +1,346 @@
+---
+id: CUST-WP-0011
+type: workplan
+title: "Migrate Custodian State Hub to ThreePhoenix Cluster"
+domain: custodian
+repo: the-custodian
+status: active
+owner: custodian
+topic_slug: custodian
+created: "2026-03-11"
+updated: "2026-03-11"
+state_hub_workstream_id: "967baafb-d92d-405a-ba0b-0d00d37c4940"
+---
+
+# Migrate Custodian State Hub to ThreePhoenix Cluster
+
+## Goal
+
+Move the Custodian State Hub (FastAPI + PostgreSQL) from its current home on
+the WSL2 operator workstation to the ThreePhoenix Kubernetes cluster
+(Railiance01/02/03), making it available to Claude Code sessions running on
+any machine with cluster access — without public internet exposure.
+
+The State Hub is **irreplaceable episodic memory**. This migration must be
+executed with zero tolerance for data loss and a tested rollback path at
+every stage.
+
+## Pre-conditions (gate — do not start until all satisfied)
+
+- [ ] ThreePhoenix cluster has three healthy nodes (Railiance01 confirmed, Railiance02 + Railiance03 joined)
+- [ ] Longhorn distributed storage installed and verified (replication factor ≥ 2)
+- [ ] HA failover test passes (`tests/test_ha_failover.sh` exits 0 on the cluster)
+- [ ] S2 integrated backup operational and tested on the cluster
+- [ ] A full WSL2 State Hub backup has been taken and restore-drilled **within 24h of starting this workplan**
+
+These gates are mandatory. A single-node cluster or unverified storage is not
+an acceptable migration target for the Custodian.
+
+## Architecture after migration
+
+```
+COULOMBCORE / operator workstation (WSL2)
+  └─ Claude Code
+       └─ MCP server subprocess (Python, local clone of the-custodian)
+            └─ HTTP → ssh -L 8000:state-hub-svc:8000 tegwick@92.205.62.239
+                          └─ Railiance01 k3s
+                               └─ state-hub ClusterIP service
+                                    ├─ FastAPI pod (1–2 replicas)
+                                    └─ PostgreSQL PVC (Longhorn, 2-way replicated)
+```
+
+Key properties:
+- **Not publicly exposed** — ClusterIP only; access via SSH port-forward
+- **Replicated storage** — Longhorn replicates the PG data volume across nodes
+- **WSL2 instance retained as DR fallback** during the stabilisation period
+- **MCP config unchanged** — subprocess still calls `http://127.0.0.1:8000`;
+  the SSH port-forward provides the binding
+
+## Backup and disaster recovery contract
+
+Before and during migration, the following must hold at all times:
+
+| Asset | Backup mechanism | RPO | Tested? |
+|---|---|---|---|
+| State Hub PostgreSQL DB | `make backup` (pg_dump → age-encrypted, Nextcloud offsite) | Daily | Must be drilled before T03 |
+| State Hub DB on cluster | Longhorn snapshot + age-encrypted copy to `/opt/backup/` | Daily | Must be drilled before T06 |
+| WSL2 instance | Remains live during stabilisation period | — | Running |
+
+**Rollback rule:** at any task boundary, if something is wrong, revert to
+WSL2. No task should leave the system in a state where both WSL2 and cluster
+are broken.
+
+---
+
+## Tasks
+
+### T01 — Drill WSL2 backup restore end-to-end
+
+```task
+id: T01
+status: todo
+priority: high
+state_hub_task_id: "b0caf112-dc1d-43a8-9f27-d627dd4aa2bf"
+```
+
+Before touching anything, prove the current backup can actually be restored:
+
+```bash
+# In the-custodian/state-hub/
+make backup                         # take fresh backup
+# Spin up a test postgres container
+docker run -d --name pg-restore-test -e POSTGRES_PASSWORD=test \
+  -p 5433:5432 postgres:16
+# Decrypt and restore
+age -d -i ~/.config/sops/age/keys.txt \
+  /opt/backup/custodian/state-hub-latest.sql.gz.age | \
+  gunzip | psql -h 127.0.0.1 -p 5433 -U postgres state_hub
+# Spot-check: count topics
+psql -h 127.0.0.1 -p 5433 -U postgres -c "SELECT COUNT(*) FROM topics;" state_hub
+docker rm -f pg-restore-test
+```
+
+**Done when:** restore completes, topic count matches production, drill logged
+in `memory/episodic/`.
+
+---
+
+### T02 — Helm chart for State Hub (new: railiance-platform)
+
+```task
+id: T02
+status: todo
+priority: high
+state_hub_task_id: "24887dd9-7d50-4cc4-add7-bffa1454b80c"
+```
+
+Create `helm/state-hub/` in `railiance-platform` (S3 layer owns platform
+services). The chart must deploy:
+
+- **FastAPI deployment** — image built from `the-custodian/state-hub/`,
+  1 replica initially (scale to 2 after T06)
+- **PostgreSQL StatefulSet** — single instance backed by a Longhorn PVC
+  (minimum 5 Gi); HA not required here — Longhorn replication IS the HA
+- **ClusterIP service** `state-hub` on port 8000
+- **ConfigMap** for non-secret config (DB URL template, log level)
+- **Secret** for DB credentials (SOPS-encrypted values file)
+- **Liveness/readiness probe** — `GET /state/health`
+
+Values:
+```yaml
+image:
+  repository: gitea.local/custodian/state-hub
+  tag: latest
+postgres:
+  storageClass: longhorn
+  size: 5Gi
+replicaCount: 1
+```
+
+**Done when:** `helm lint` passes; chart committed in railiance-platform.
+
+---
+
+### T03 — Build and push State Hub container image
+
+```task
+id: T03
+status: todo
+priority: high
+state_hub_task_id: "79908ade-3e38-451b-a403-2361a16a3f3a"
+```
+
+Add `state-hub/Dockerfile` to the-custodian:
+
+```dockerfile
+FROM python:3.12-slim
+WORKDIR /app
+COPY pyproject.toml uv.lock ./
+RUN pip install uv && uv sync --frozen --no-dev
+COPY api/ ./api/
+COPY mcp_server/ ./mcp_server/
+CMD ["uv", "run", "uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]
+```
+
+Build and push to the cluster-local Gitea registry:
+
+```bash
+docker build -t gitea.local/custodian/state-hub:latest .
+docker push gitea.local/custodian/state-hub:latest
+```
+
+**Done when:** image available in Gitea registry; `helm install --dry-run`
+resolves the image.
+
+---
+
+### T04 — Deploy to cluster and run Alembic migrations
+
+```task
+id: T04
+status: todo
+priority: high
+state_hub_task_id: "a7baf2eb-abd7-4aa3-b2cb-a5370ac09844"
+```
+
+```bash
+# From operator workstation via SSH port-forward to k3s API
+helm install state-hub ./helm/state-hub/ \
+  -n custodian --create-namespace \
+  -f helm/state-hub/values-production.yaml
+
+# Wait for pods
+kubectl -n custodian rollout status deployment/state-hub
+
+# Run migrations inside the pod
+kubectl -n custodian exec -it deploy/state-hub -- \
+  uv run alembic upgrade head
+```
+
+**Done when:** pod Running, `/state/health` returns 200, Alembic reports
+"head" from inside the pod.
+
+---
+
+### T05 — Migrate data from WSL2 to cluster
+
+```task
+id: T05
+status: todo
+priority: high
+state_hub_task_id: "a307dd46-a8e2-49df-b016-c187759ebcf1"
+```
+
+This is the point of no return for the DB — execute with care:
+
+```bash
+# 1. Take final WSL2 backup
+make -C ~/the-custodian/state-hub backup
+
+# 2. Copy dump into the cluster postgres pod
+kubectl -n custodian cp /tmp/state-hub-migration.sql \
+  $(kubectl -n custodian get pod -l app=state-hub-postgres -o name):/tmp/
+
+# 3. Restore
+kubectl -n custodian exec -it deploy/state-hub-postgres -- \
+  psql -U postgres -d state_hub -f /tmp/state-hub-migration.sql
+
+# 4. Spot-check counts match WSL2
+kubectl -n custodian exec -it deploy/state-hub -- \
+  psql -c "SELECT relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC;"
+```
+
+**Rollback:** if counts differ, delete cluster DB data, re-run from T04.
+WSL2 is still live and unchanged.
+
+**Done when:** all table row counts match the WSL2 instance.
+
+---
+
+### T06 — Drill cluster backup restore
+
+```task
+id: T06
+status: todo
+priority: high
+state_hub_task_id: "03753b88-824c-4448-97b2-f7315d145060"
+```
+
+Before cutting over, prove the cluster backup can be restored:
+
+```bash
+# Trigger a backup via the cluster cron (or manually)
+kubectl -n custodian create job --from=cronjob/state-hub-backup backup-drill-01
+
+# Verify output in /opt/backup/ on the node holding the PVC
+# Decrypt and restore to a test namespace
+kubectl create ns restore-test
+# ... restore steps similar to T01 but against cluster postgres
+```
+
+**Done when:** restore drill passes; drill logged.
+
+---
+
+### T07 — Cutover: redirect MCP config to cluster
+
+```task
+id: T07
+status: todo
+priority: medium
+state_hub_task_id: "ff1de25e-c301-4b86-9420-84dfe72e565e"
+```
+
+Update the MCP config on every operator workstation (WSL2, COULOMBCORE) to
+reach the cluster state hub via SSH port-forward instead of the local process.
+
+The MCP server subprocess still runs locally (Python, same `server.py`).
+Only the API endpoint it calls changes — via a persistent port-forward:
+
+```bash
+# On operator workstation — keep this running (add to tunnel-daemon or tunnel-loop)
+ssh -L 8000:state-hub.custodian.svc.cluster.local:8000 tegwick@92.205.62.239
+```
+
+No change to `.mcp.json` needed — subprocess still calls `http://127.0.0.1:8000`.
+
+Alternatively: update the MCP server's `API_BASE` env var to point directly
+to the port-forward. Either approach is valid; document the chosen one.
+
+**Done when:** `claude /mcp` shows `state-hub` connected; `get_state_summary()`
+returns live cluster data.
+
+---
+
+### T08 — Stabilisation period (2 weeks minimum)
+
+```task
+id: T08
+status: todo
+priority: medium
+state_hub_task_id: "e06a59a0-5310-4c1c-9ba5-7cfaadda62e2"
+```
+
+Run the cluster state hub as the primary for two weeks before retiring WSL2:
+
+- Keep WSL2 state hub running (but frozen — no writes) as DR fallback
+- Monitor cluster pod restarts, storage health, backup cron
+- Run `get_state_summary()` at the start of each session; confirm data is live
+- Test failover: kill the FastAPI pod; verify it restarts and responds within 60s
+
+**Done when:** two weeks elapsed with no data loss events; all backup drills
+passed.
+
+---
+
+### T09 — Retire WSL2 instance
+
+```task
+id: T09
+status: todo
+priority: low
+state_hub_task_id: "d75a2d49-f3b1-4bdd-b9e1-a1c6a9744681"
+```
+
+Once T08 stabilisation passes:
+
+1. Take a final WSL2 backup (archive, keep indefinitely)
+2. Stop the WSL2 Docker container: `make -C ~/the-custodian/state-hub clean`
+3. Update `CLAUDE.md` global and project to remove WSL2 state hub start instructions
+4. Update MEMORY.md — state hub is now cluster-hosted
+5. Record a decision in the state hub: "State Hub WSL2 instance retired"
+
+**Done when:** WSL2 state hub no longer running; documentation updated.
+
+---
+
+## References
+
+- Constitution constraint: irreversible actions require human approval — T05
+  (data migration) and T09 (WSL2 retirement) require explicit sign-off
+- OAS layer: S3 Platform Services (railiance-platform)
+- DR dependency: Longhorn storage (railiance-cluster WP to be linked)
+- Extension point: EP-RAIL-005 (full-stack backup) — state hub must implement
+  `make backup` / `make restore` standard interface before T06
+- Domain goal: `6f96c712-60e6-4ea9-ab06-168878eafbce` (Three-Phoenix Secure
+  Kubernetes Infrastructure)