Repo hygiene + new workplans (RAIL-BS-WP-0008/0009)

- Add RAIL-BS-WP-0008 (activity-core WP-0016 deploy) and RAIL-BS-WP-0009
  (admin-sync smoke) from inbox asks 87952ff1 / aa8b7986
- Archive finished workplans to workplans/archived/ per ADR-001 convention;
  normalize frontmatter statuses (completed/done -> finished)
- Fill stack-and-commands.md, complete repo-boundary.md, refresh SCOPE
  Current State, add docs/operator-runbook.md for production-touching targets

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 00:02:36 +02:00
parent eefa6c1b2a
commit b3b0c3e3ff
15 changed files with 206 additions and 24 deletions
--- a/workplans/archived/260622-RAIL-BS-WP-0001-dependency-management.md
+++ b/workplans/archived/260622-RAIL-BS-WP-0001-dependency-management.md
@@ -0,0 +1,135 @@
+---
+id: RAIL-BS-WP-0001
+type: workplan
+title: "Dependency Management — Add lockfile for Ansible control-node deps"
+domain: financials
+repo: railiance-cluster
+status: finished
+owner: railiance
+topic_slug: railiance
+state_hub_workstream_id: 59155efb-b461-4caa-ad7b-b3fce348db84
+state_hub_task_id: 5f8cade5-119c-42e8-ba93-e9d0478650e4
+created: "2026-03-01"
+updated: "2026-03-01"
+completed: "2026-03-01"
+---
+
+# Dependency Management — Add Ansible control-node lockfile
+
+## Problem
+
+This repo drives all Ansible automation but carries no pinned, machine-readable
+inventory of its own runtime dependencies.
+
+The Ansible version (and that of every pip package it depends on) is whatever
+happens to be installed on the control node at any given time. This means:
+
+- Behaviour is not reproducible across machines or over time
+- The Custodian State Hub SBOM scanner finds nothing to ingest (`last_sbom_at = null`)
+- Licence and vulnerability auditing of the actual dependencies in use is impossible
+- The `railiance-cluster` repo appears as a gap in the SBOM coverage map
+
+## Root cause
+
+No `pyproject.toml` (or `requirements.txt`) declares the control-node pip
+dependencies. No `ansible/requirements.yml` exists for Galaxy collections
+(acceptable if none are used, but the absence should be stated explicitly).
+
+## Expected state after this task
+
+- `pyproject.toml` at repo root declares `ansible` as a dependency (and any
+  other pip packages used by playbooks or the `bin/` commands)
+- `uv.lock` is generated and committed — pins Ansible + full transitive pip tree
+- If Galaxy collections are used: `ansible/requirements.yml` lists them
+- SBOM is ingested: `last_sbom_at` is not null in the State Hub
+- The SBOM dashboard shows `railiance-cluster` in the railiance domain row
+  with a package count
+
+## Tasks
+
+### T1 — Audit control-node pip dependencies
+
+```task
+id: RAIL-BS-WP-0001-T01
+state_hub_task_id: 5f8cade5-119c-42e8-ba93-e9d0478650e4
+status: done
+priority: medium
+completed: "2026-03-01"
+```
+
+Review `bin/` commands, Ansible playbooks, and any Python scripts in the repo.
+List all pip packages that must be present on the control node:
+- `ansible` (minimum version)
+- Any collections-related tools (ansible-core, ansible-lint, etc.)
+- Any other pip deps called from scripts (e.g. `paramiko`, `netaddr`, `jinja2`)
+
+### T2 — Create pyproject.toml and generate uv.lock
+
+```task
+id: RAIL-BS-WP-0001-T02
+status: done
+priority: medium
+completed: "2026-03-01"
+state_hub_task_id: "8aa8a9d3-6560-4176-b933-72a21e6d43d4"
+```
+
+1. Create `pyproject.toml`:
+   ```toml
+   [project]
+   name = "railiance-cluster"
+   version = "0.1.0"
+   requires-python = ">=3.11"
+   dependencies = [
+     "ansible>=10",  # adjust version as appropriate
+     # add other deps found in T1
+   ]
+   ```
+2. Run `uv lock` to generate `uv.lock`
+3. Commit both files
+
+### T3 — Ingest SBOM into State Hub
+
+```task
+id: RAIL-BS-WP-0001-T03
+status: done
+priority: medium
+completed: "2026-03-01"
+state_hub_task_id: "4fb477e9-dbac-4e43-84d0-5202c68f4705"
+```
+
+From `~/the-custodian/state-hub/`:
+
+```bash
+make ingest-sbom REPO=railiance-cluster SCAN=1 REPO_PATH=/home/worsch/railiance-cluster
+```
+
+Verify in the SBOM dashboard: railiance domain should show `railiance-cluster`
+with a package count and no gap warning.
+
+### T4 — Create ansible/requirements.yml (even if empty)
+
+```task
+id: RAIL-BS-WP-0001-T04
+status: done
+priority: low
+completed: "2026-03-01"
+state_hub_task_id: "d0eb1c96-e7c2-4f6b-b934-a3f295e4db72"
+```
+
+Create `ansible/requirements.yml`. If no Galaxy roles or collections are used,
+create it empty with a comment. This makes the absence of collections explicit:
+
+```yaml
+---
+# No external Ansible Galaxy roles or collections required.
+# Add roles/collections here as needed:
+# roles: []
+# collections: []
+```
+
+## References
+
+- Custodian SBOM Convention: `canon/standards/sbom-convention_v0.1.md`
+- SBOM dashboard: http://127.0.0.1:3000/sbom
+- Repos coverage page: http://127.0.0.1:3000/repos
+- State Hub task: `5f8cade5-119c-42e8-ba93-e9d0478650e4`
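
Note on RAIL-BS-WP-0001 (commentary, not part of the diff): the "Expected state" list above can be checked locally before running the SBOM ingest. The following is a minimal sketch under the assumption that the files live exactly where the workplan names them; the function name `check_lock_hygiene` is hypothetical and does not exist in the repo.

```bash
# Hypothetical helper: verifies the files the workplan expects are present.
check_lock_hygiene() {
  # $1 = repo root (defaults to the current directory)
  local repo="${1:-.}"
  local rc=0
  [ -f "$repo/pyproject.toml" ] || { echo "missing pyproject.toml"; rc=1; }
  [ -f "$repo/uv.lock" ] || { echo "missing uv.lock"; rc=1; }
  [ -f "$repo/ansible/requirements.yml" ] || { echo "missing ansible/requirements.yml"; rc=1; }
  # Only inspect pyproject.toml when it exists; look for a quoted ansible requirement.
  if [ -f "$repo/pyproject.toml" ] && ! grep -q '"ansible' "$repo/pyproject.toml"; then
    echo "pyproject.toml does not declare ansible"
    rc=1
  fi
  return "$rc"
}
```

Calling `check_lock_hygiene /path/to/railiance-cluster` returns non-zero and prints one line per missing item. It checks only presence, not that `uv.lock` is up to date with `pyproject.toml`.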
--- a/workplans/archived/260622-RAIL-BS-WP-0002-k3s-baseline.md
+++ b/workplans/archived/260622-RAIL-BS-WP-0002-k3s-baseline.md
@@ -0,0 +1,175 @@
+---
+id: RAIL-BS-WP-0002
+type: workplan
+title: "k3s and Kubernetes Platform Baseline"
+domain: financials
+repo: railiance-cluster
+status: finished
+owner: railiance
+topic_slug: railiance
+repo_goal_id: "70ab2379-fb9d-4fec-a09d-b2a717e4ace8"
+state_hub_workstream_id: "4c63dfc6-9eac-4e79-9f77-8f644ad7147d"
+created: "2026-03-09"
+updated: "2026-03-10"
+completed: "2026-03-10"
+---
+
+# k3s and Kubernetes Platform Baseline
+
+## Goal
+
+Install k3s, Helm, and the baseline Kubernetes services on the converged
+HostEurope node. This workplan picks up exactly where `railiance-hosts`
+leaves off: a hardened, verified OS node that is ready for Kubernetes.
+
+## Pre-condition
+
+`railiance-infra` converge + Goss verify must pass before any task here
+is executed:
+
+```bash
+# In railiance-infra/
+make converge
+make verify    # must exit 0
+```
+
+## Boundary
+
+This repo owns everything from k3s upward. It must not re-configure items
+defined in `railiance-infra/spec/server-baseline.yaml`. See ADR-003:
+`railiance-infra/docs/adr/ADR-003-railiance-5repo-stack-architecture.md`.
+
+**Out of scope here:** platform services (PostgreSQL, storage, identity)
+→ `railiance-platform`. Application deployments (Gitea, coulomb services)
+→ `railiance-apps`.
+
+---
+
+## Tasks
+
+### T01 — Ansible playbook: install k3s (server mode)
+
+```task
+id: T01
+status: done
+priority: high
+state_hub_task_id: "3f042630-eab0-4c6a-9167-e2b28ff20e40"
+completed: "2026-03-10"
+```
+
+Harden `ansible/bootstrap.yml` to a production-ready k3s install:
+
+- Use the official k3s install script pinned to a specific version
+  (`INSTALL_K3S_VERSION=v1.35.1+k3s1`)
+- `INSTALL_K3S_EXEC="server --cluster-init --write-kubeconfig-mode=644"`
+  (cluster-init enables embedded etcd for future HA expansion)
+- Wait for node `Ready` before proceeding:
+  ```bash
+  k3s kubectl wait node --all --for=condition=Ready --timeout=120s
+  ```
+- Fetch kubeconfig to the control node as `~/.kube/config-hosteurope`
+
+**Done when:** `k3s kubectl get nodes` returns `Ready` from both the server
+and the control node (via kubeconfig).
+
+---
+
+### T02 — Helm installation
+
+```task
+id: T02
+status: done
+priority: high
+state_hub_task_id: "e8510646-46ed-4697-a345-f3d3009eea78"
+completed: "2026-03-10"
+```
+
+Add a task (or a role `roles/helm/`) that:
+
+1. Downloads the Helm binary (pinned version) to `/usr/local/bin/helm`
+2. Verifies the checksum
+3. Confirms `helm version` succeeds
+
+**Done when:** `helm version` succeeds on the HostEurope node.
+
+---
+
+### T03 — Smoke test: k3s + Helm
+
+```task
+id: T03
+status: done
+priority: high
+state_hub_task_id: "dab2c07f-8aa0-4635-8df6-857e87e93fc5"
+completed: "2026-03-10"
+```
+
+Extend `tests/smoke_kube.sh` to assert:
+
+- `k3s kubectl get nodes` → node in Ready state
+- `helm version` exits 0
+- CoreDNS pod running in `kube-system`
+- Traefik ingress controller pod running (default in k3s)
+
+Run via:
+```bash
+ansible-playbook -i ansible/hosts.ini ansible/smoke.yml
+```
+or directly over SSH if the kubeconfig is available locally.
+
+**Done when:** all assertions pass and the script exits 0.
+
+---
+
+### T04 — Commit kubeconfig management notes
+
+```task
+id: T04
+status: done
+priority: medium
+state_hub_task_id: "5c3d40e4-239b-488e-9519-6f7a38d2325f"
+completed: "2026-03-10"
+```
+
+Document in `docs/kubeconfig.md`:
+
+- Where the kubeconfig is fetched to (`~/.kube/config-hosteurope`)
+- How to merge it into `~/.kube/config`
+- How to switch context: `kubectl config use-context default`
+- Security note: kubeconfig is gitignored (contains cluster CA + client cert)
+
+**Done when:** doc written and committed.
+
+---
+
+### T05 — Add `make k3s-install` and `make smoke` targets
+
+```task
+id: T05
+status: done
+priority: medium
+state_hub_task_id: "7f9e0e58-a130-467a-a2d0-b3f2564e496f"
+completed: "2026-03-10"
+```
+
+Add to Makefile (create one if none exists):
+
+```makefile
+k3s-install: ## Install k3s and Helm on all inventory hosts
+	ansible-playbook -i ansible/hosts.ini ansible/bootstrap.yml
+
+smoke: ## Run Kubernetes smoke tests
+	bash tests/smoke_kube.sh
+```
+
+**Done when:** both targets work and are listed in `make help`.
+
+---
+
+## References
+
+- Repo goal: `70ab2379-fb9d-4fec-a09d-b2a717e4ace8` (Install k3s and Kubernetes Baseline)
+- Domain goal: `6f96c712-60e6-4ea9-ab06-168878eafbce` (Three-Phoenix Secure Kubernetes Infrastructure)
+- Pre-condition: railiance-infra WP-0001 (Secure Single-Server Bootstrap) — completed 2026-03-09
+- Boundary ADR: `railiance-infra/docs/adr/ADR-003-railiance-5repo-stack-architecture.md`
+- k3s releases: https://github.com/k3s-io/k3s/releases
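
Note on RAIL-BS-WP-0002 T03 (commentary, not part of the diff): the four smoke assertions can be expressed with a small counting helper so that one failure does not hide the others. This is a sketch, not the contents of `tests/smoke_kube.sh`; the helper names `check`, `pod_running` and `run_smoke` are hypothetical, and matching pods by the name prefixes `coredns` and `traefik` is an assumption about how k3s names its bundled pods.

```bash
FAILURES=0

check() {
  # Usage: check "description" command [args...]
  local desc="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "ok   - $desc"
  else
    echo "FAIL - $desc"
    FAILURES=$((FAILURES + 1))
  fi
}

pod_running() {
  # True if a pod whose name starts with $2 in namespace $1 reports Running.
  k3s kubectl get pods -n "$1" --no-headers 2>/dev/null | grep "^$2" | grep -q Running
}

run_smoke() {
  # Requires k3s and helm on the host; not invoked when this file is only sourced.
  check "node Ready"      k3s kubectl wait node --all --for=condition=Ready --timeout=120s
  check "helm version"    helm version
  check "CoreDNS running" pod_running kube-system coredns
  check "Traefik running" pod_running kube-system traefik
  [ "$FAILURES" -eq 0 ]
}
```

`run_smoke` returns non-zero if any assertion failed, which matches the "script exits 0" done-criterion when it is the last command of the script.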
--- a/workplans/archived/260622-RAIL-BS-WP-0003-pgpool-ha-failover-fix.md
+++ b/workplans/archived/260622-RAIL-BS-WP-0003-pgpool-ha-failover-fix.md
@@ -0,0 +1,194 @@
+---
+id: RAIL-BS-WP-0003
+type: bug-report
+title: "pgpool CrashLoopBackOff on PostgreSQL HA failover — missing secret key"
+domain: financials
+repo: railiance-cluster
+status: finished
+owner: tegwick
+created: "2026-03-10"
+updated: "2026-03-10"
+state_hub_workstream_id: "7ee9ee22-1fae-4567-9194-8d70a9e0f45b"
+---
+
+# Bug Report: pgpool CrashLoopBackOff on PostgreSQL HA failover
+
+## Summary
+
+On 2026-03-10 a PostgreSQL HA failover caused all three postgresql pods to
+restart. pgpool — the connection pooler between Gitea and PostgreSQL — then
+entered CrashLoopBackOff and produced no logs. As a result Gitea's login
+and all write operations hung indefinitely. The root page was still served
+(from Valkey cache) which masked the failure.
+
+The fix was to patch a missing key in a Kubernetes secret. The root cause is
+that the `gitea-12.2.0` Helm chart (postgresql-ha subchart v16.2.2) does not
+populate the `pgpool-password` key in the `gitea-postgresql-ha-postgresql`
+secret, even though the pgpool pod requires it at startup.
+
+---
+
+## Timeline
+
+| Time (UTC) | Event |
+|---|---|
+| ~09:45 | postgresql-0, postgresql-2 pods restarted (repmgr failover) |
+| ~09:45 | pgpool pod restarted and entered CrashLoopBackOff |
+| ~11:00 | User noticed Gitea login hanging; home page still loading |
+| ~13:00 | Root cause identified: missing `pgpool-password` secret key |
+| ~13:10 | Secret patched; pgpool pod deleted and restarted cleanly |
+| ~13:15 | Gitea fully operational |
+
+---
+
+## Root Cause
+
+The Bitnami `pgpool` container startup script reads the file
+`/opt/bitnami/pgpool/secrets/pgpool-password`, which is mounted from the
+`gitea-postgresql-ha-postgresql` Kubernetes Secret via a `subPath` volume
+mount. That secret key was never created by the Helm chart, so the file did
+not exist. The container exited immediately with no logs.
+
+The pod had been running for 20 days without a restart, so this gap was
+never discovered during initial deployment.
+
+---
+
+## Evidence
+
+```bash
+# Secret was missing the pgpool-password key
+sudo k3s kubectl get secret -n default gitea-postgresql-ha-postgresql -o yaml
+# data: keys were password, postgres-password, repmgr-password only
+# pgpool-password was absent
+
+# pgpool pod describe showed 824 back-off restarts over 173 minutes
+# No logs in either current or --previous output
+sudo k3s kubectl logs -n default <pgpool-pod> --previous
+# (empty)
+
+# Gitea process had zero TCP connections to PostgreSQL port 5432
+# but many connections to Valkey port 6379
+cat /proc/<gitea-pid>/net/tcp | grep 1538  # 1538 = 5432 hex — no results
+```
+
+---
+
+## Immediate Fix Applied
+
+```bash
+# Add the missing key (value = sr-check-password = changeme4 = base64: Y2hhbmdlbWU0)
+sudo k3s kubectl patch secret -n default gitea-postgresql-ha-postgresql \
+  --type='json' \
+  -p='[{"op":"add","path":"/data/pgpool-password","value":"Y2hhbmdlbWU0"}]'
+
+# Restart pgpool
+sudo k3s kubectl delete pod -n default <pgpool-pod-name>
+```
+
+---
+
+## Risk: Fix Will Be Lost on helm upgrade
+
+The patched secret is managed by Helm (annotation:
+`meta.helm.sh/release-name: gitea`). A `helm upgrade` will regenerate the
+secret from the chart template, which does not include `pgpool-password`,
+and the bug will recur.
+
+---
+
+## Tasks
+
+### T01 — Add pgpool-password to Helm values
+
+```task
+id: T01
+status: done
+priority: high
+state_hub_task_id: "6841c93a-f146-47eb-9f7c-8fa0e02c1bbc"
+```
+
+Create or update `helm/gitea-values.yaml` (or equivalent) to permanently
+include the pgpool-password so it survives `helm upgrade`:
+
+```yaml
+postgresql-ha:
+  postgresql:
+    pgpoolPassword: <value matching sr-check-password>
+```
+
+**Done when:** `helm upgrade gitea` completes and pgpool starts cleanly
+without manual secret patching.
+
+---
+
+### T02 — Add pgpool health check to smoke test
+
+```task
+id: T02
+status: done
+priority: high
+state_hub_task_id: "ab166073-30a7-4702-a037-4091e8706e20"
+```
+
+Extend `tests/smoke_kube.sh` to assert:
+
+```bash
+# All postgresql-ha pods Running
+kubectl get pods -n default | grep gitea-postgresql-ha | grep -v Running && exit 1
+
+# pgpool specifically not in CrashLoopBackOff (match the reason string exactly; it is case-sensitive)
+if kubectl get pod -n default -l app.kubernetes.io/component=pgpool \
+  -o jsonpath='{.items[0].status.containerStatuses[0].state}' | grep -q CrashLoopBackOff; then exit 1; fi
+```
+
+**Done when:** the smoke test catches a pgpool failure within 5 minutes.
+
+---
+
+### T03 — Add HA failover test
+
+```task
+id: T03
+status: done
+priority: high
+state_hub_task_id: "140da396-8e30-4f4d-b88c-c42c0cd46c01"
+```
+
+Create `tests/test_ha_failover.sh` that:
+
+1. Records Gitea login response time (baseline)
+2. Kills the primary PostgreSQL pod: `kubectl delete pod gitea-postgresql-ha-postgresql-0 -n default`
+3. Waits for repmgr to promote a replica (max 60s)
+4. Asserts Gitea login POST still succeeds within 10s
+5. Asserts pgpool pod is Running (not CrashLoopBackOff)
+6. Asserts all postgresql pods return to Running
+
+This test must pass before any PostgreSQL HA deployment is considered done.
+
+**Done when:** script exits 0 against a live cluster.
+
+---
+
+### T04 — Document the incident in docs/
+
+```task
+id: T04
+status: done
+priority: medium
+state_hub_task_id: "d8a3ba40-fda0-4c1f-a9f1-ffcd621a5b3d"
+```
+
+Add `docs/incidents/2026-03-10-pgpool-missing-secret.md` with the full
+timeline, root cause, and fix, so future operators understand what happened
+and how to recover.
+
+**Done when:** doc committed and linked from `docs/README.md`.
+
+---
+
+## References
+
+- Bitnami postgresql-ha chart v16.2.2
+- Gitea Helm chart v12.2.0
+- Related decision: D3 (HA testing policy) in `DECISIONS.md`
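
Note on RAIL-BS-WP-0003 (commentary, not part of the diff): the immediate fix hard-codes a base64 string in the JSON patch. A minimal sketch of deriving that patch from a plaintext value follows; the function name `secret_add_patch` is hypothetical, and the sketch assumes the key contains no `/` or `~` (which would need JSON Pointer escaping).

```bash
# Hypothetical helper: prints a JSON Patch that adds one key to a Secret's data map.
secret_add_patch() {
  # $1 = secret data key, $2 = plaintext value
  local key="$1"
  local value="$2"
  local encoded
  # Secret data values are base64-encoded; strip the trailing newline base64 emits.
  encoded=$(printf '%s' "$value" | base64 | tr -d '\n')
  printf '[{"op":"add","path":"/data/%s","value":"%s"}]\n' "$key" "$encoded"
}
```

With the value from the incident, `secret_add_patch pgpool-password changeme4` reproduces the patch body used above, so it could be passed to `kubectl patch secret ... --type='json' -p "$(secret_add_patch ...)"`. Reading the plaintext from an environment variable rather than the command line would keep it out of shell history.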
--- a/workplans/archived/260622-RAIL-BS-WP-0004-safety-net.md
+++ b/workplans/archived/260622-RAIL-BS-WP-0004-safety-net.md
@@ -0,0 +1,273 @@
+---
+id: RAIL-BS-WP-0004
+type: workplan
+title: "Integrated Backup — S2 Kubernetes Runtime Layer"
+domain: financials
+repo: railiance-cluster
+status: finished
+owner: tegwick
+topic_slug: railiance
+state_hub_workstream_id: "7e8b0c20-51eb-40c9-9e3b-85dd380d7625"
+created: "2026-02-25"
+updated: "2026-03-26"
+---
+
+# Integrated Backup — S2 Kubernetes Runtime Layer
+
+## Goal
+
+Implement the Q3 (Operability & Resilience) integrated backup for
+railiance-cluster (S2). Backs up what S2 owns — the Kubernetes runtime state —
+encrypted with age, written to a local directory on the server. No external
+dependencies required.
+
+## Architecture (Decision D4)
+
+Each railiance repo implements its own backup for what it owns. No central
+backup service. See `DECISIONS.md` D4 for full rationale.
+
+**Standard interface every railiance repo must provide:**
+
+```bash
+make backup   # encrypt + write to /opt/backup/railiance/<layer>/
+make restore  # restore from most recent local backup
+```
+
+Encryption: age, same key pair as SOPS secrets (`.sops.yaml` public key).
+Output: `/opt/backup/railiance/cluster/` on the server.
+
+## What S2 (railiance-cluster) owns and must back up
+
+| Asset | Why it matters |
+|---|---|
+| k3s etcd snapshots | Full cluster state — all workloads, configs, secrets |
+| Helm release values | Runtime values not in git (any manually applied overrides) |
+| kubeconfig | Admin access to the cluster |
+
+**Not S2's responsibility:**
+- Custodian State Hub DB → the-custodian owns this
+- Operator workstation config (`.claude/`, `.gitconfig`) → operator's own concern
+- Application data (Gitea repos, uploads) → S5 (railiance-apps) owns this
+- PostgreSQL data volumes → S3 (railiance-platform) owns this
+
+## Encryption
+
+Reuse the age public key from `.sops.yaml`:
+
+```bash
+AGE_PUBLIC_KEY=$(grep 'age:' .sops.yaml | awk '{print $2}')
+tar -czf - <assets> | age -r "${AGE_PUBLIC_KEY}" -o backup.tar.gz.age
+```
+
+Decryption requires the private key at `~/.config/sops/age/keys.txt`
+(same key used for `sops -d`). No additional key management needed.
+
+## Extension Point EP-RAIL-005
+
+Once all five OAS layers implement this interface, the custodian can
+orchestrate a full-stack backup with:
+
+```bash
+for repo in railiance-infra railiance-cluster railiance-platform \
+            railiance-enablement railiance-apps; do
+  make -C ~/$repo backup
+done
+```
+
+No special protocol needed — just the standard interface.
+
+---
+
+## Tasks
+
+### T01 — Define backup directory and encryption wrapper
+
+```task
+id: T01
+status: done
+priority: high
+state_hub_task_id: "4526a842-ea31-4874-9231-92ab556cfe7b"
+```
+
+Create `tools/cmd/railiance-backup-s2` (replacing the old `railiance-backup`):
+
+- Backup dir: `/opt/backup/railiance/cluster/` (create with `mkdir -p`)
+- Encrypt each artifact with age using public key from `.sops.yaml`
+- Write timestamp-named files: `etcd-<ts>.snap.age`, `helm-values-<ts>.tar.gz.age`, `kubeconfig-<ts>.yaml.age`
+- Keep last 7 of each type
+- Write `.last-backup` stamp
+- Exit 0 on success, non-zero on any failure
+- No network required
+
+Also remove the old `tools/cmd/railiance-backup`, which backed up the Docker-based
+custodian DB: that is the wrong scope and does not apply to this server.
+
+**Done when:** `make backup` runs on COULOMBCORE without error and files
+appear in `/opt/backup/railiance/cluster/`.
+
+---
+
+### T02 — Back up k3s state (etcd snapshot)
+
+```task
+id: T02
+status: done
+priority: high
+state_hub_task_id: "a6313e06-1976-46a7-8e31-df4eb2eca880"
+```
+
+k3s has built-in etcd snapshot support:
+
+```bash
+sudo k3s etcd-snapshot save --name railiance-$(date -u +%Y%m%dT%H%M%SZ)
+# Default location: /var/lib/rancher/k3s/server/db/snapshots/
+```
+
+Add to the backup script: take a fresh snapshot, encrypt with age,
+copy to `/opt/backup/railiance/cluster/`.
+
+> **Note — verify etcd is in use before implementing:**
+> `k3s etcd-snapshot` only works if k3s was started with `--cluster-init`.
+> Without it, k3s uses SQLite and this command will fail.
+> Verify first: `sudo k3s etcd-snapshot ls 2>&1`
+
+> **Note — sudo required:** etcd snapshot requires root. See T06 for how
+> this is resolved (backup runs under root's crontab).
+
+**Done when:** backup includes a current etcd snapshot.
+
+---
+
+### T03 — Back up Helm release values
+
+```task
+id: T03
+status: done
+priority: medium
+state_hub_task_id: "05d42a55-921f-4aa7-bb76-e8af9c7e0ac3"
+```
+
+Capture current runtime Helm values for all releases:
+
+```bash
+KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm list -A -o json | \
+  jq -r '.[] | .name + " " + .namespace' | \
+  while read -r name ns; do
+    KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm get values "$name" -n "$ns" -o yaml
+  done
+```
+
+Tar and age-encrypt into `helm-values-<ts>.tar.gz.age`.
+
+> **Note — kubeconfig permissions:** `/etc/rancher/k3s/k3s.yaml` is root-readable
+> only by default. The backup script must either run as root (see T06) or k3s
+> must be configured with `--write-kubeconfig-mode=644`. Running as root
+> (via root crontab) is the chosen approach — no config change needed.
+
+**Done when:** backup includes a snapshot of all Helm release values.
+
+---
+
+### T04 — Back up kubeconfig
+
+```task
+id: T04
+status: done
+priority: medium
+state_hub_task_id: "08233868-d522-4117-bc4e-6c0f52545665"
+```
+
+Age-encrypt `~/.kube/config-hosteurope` (or `/etc/rancher/k3s/k3s.yaml`)
+into `kubeconfig-<ts>.yaml.age` in the backup directory.
+
+**Done when:** backup includes the encrypted kubeconfig.
+
+---
+
+### T05 — make restore target
+
+```task
+id: T05
+status: done
+priority: medium
+state_hub_task_id: "2d5acff7-4a4e-4ddd-ad06-08237ad3dac8"
+```
+
+Add `tools/cmd/railiance-restore-s2` that decrypts and lists available
+backups, with guided restore for the etcd snapshot case.
+
+Restore of etcd from snapshot:
+```bash
+sudo k3s server --cluster-reset \
+  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/<name>
+```
+
+**Done when:** `make restore` prints available backups and a restore guide.
+
+---
+
+### T06 — Install cron job and run restore drill
+
+```task
+id: T06
+status: done
+priority: medium
+state_hub_task_id: "f8e4a094-c367-40eb-b895-da17bc144b07"
+```
+
+#### Solving the sudo problem
+
+The backup script needs root for two reasons:
+- `k3s etcd-snapshot save` requires root
+- `/etc/rancher/k3s/k3s.yaml` (kubeconfig) is root-readable only
+
+**Solution: run the cron under root's crontab.**
+
+This is the correct pattern for system-level backup jobs. It avoids a
+proliferating sudoers whitelist (one entry per command, brittle to maintain)
+and matches how tools like `rsnapshot`, `bacula`, and `borgbackup` work in
+production. The backup writes to `/opt/backup/` which is root-owned anyway.
+
+Install the cron as root:
+
+```bash
+sudo crontab -e
+# Add:
+0 2 * * * make -C /home/tegwick/railiance-cluster backup >> /opt/backup/railiance/cluster/backup.log 2>&1
+```
+
+Note: use the absolute path to the repo. In root's crontab `~` resolves to
+root's home directory, not the operator's, so `~/railiance-cluster` would not be found.
+
+Verify it is installed:
+```bash
+sudo crontab -l | grep railiance
+```
+
+#### Restore drill
+
+Once T01–T04 are done, run a decrypt-and-verify drill:
+
+```bash
+# Decrypt the newest etcd snapshot and verify it is a valid snapshot file
+sudo age -d -i ~/.config/sops/age/keys.txt \
+  "$(ls /opt/backup/railiance/cluster/etcd-*.snap.age | sort -r | head -1)" \
+  | file -
+
+# Record the drill
+echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) restore drill OK" \
+  >> /opt/backup/railiance/cluster/restore-drill.log
+```
+
+**Done when:** cron installed under root, drill completes without error,
+log entry written.
+
+---
+
+## References
+
+- Decision D4: Integrated backup per capability (`DECISIONS.md`)
+- Decision D2: Nextcloud as optional offsite extension (still valid, not a requirement)
+- OAS Q3: Operability & Resilience
+- Extension point EP-RAIL-005: Custodian full-stack backup orchestration
+- k3s etcd snapshots: https://docs.k3s.io/datastore/backup-restore
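
Note on RAIL-BS-WP-0004 T01 (commentary, not part of the diff): the "keep last 7 of each type" rule can be sketched as a small pruning function. This is an illustration, not the code in `tools/cmd/railiance-backup-s2`; the name `prune_backups` is hypothetical, and it assumes the `<ts>` part of each file name sorts lexicographically in time order (as `date -u +%Y%m%dT%H%M%SZ` output does) and that file names contain no newlines.

```bash
# Hypothetical helper: keep the newest N encrypted artifacts of one type.
prune_backups() {
  # $1 = backup dir, $2 = artifact prefix (etcd, helm-values, kubeconfig), $3 = count to keep
  local dir="$1"
  local prefix="$2"
  local keep="${3:-7}"
  local f
  # Newest first, then skip the first $keep entries; whatever remains is deleted.
  printf '%s\n' "$dir/$prefix"-*.age | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r f; do
      # Guard against the unexpanded glob pattern when nothing matches.
      if [ -e "$f" ]; then rm -f -- "$f"; fi
    done
}
```

A backup script would call it once per artifact type, for example `prune_backups /opt/backup/railiance/cluster etcd 7`, after the new files have been written successfully, so that a failed run never reduces the number of good backups.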
--- a/workplans/archived/260622-RAIL-BS-WP-0005-kubeconfig-delivery.md
+++ b/workplans/archived/260622-RAIL-BS-WP-0005-kubeconfig-delivery.md
@@ -0,0 +1,143 @@
+---
+id: RAIL-BS-WP-0005
+type: workplan
+title: "Kubeconfig delivery for netkingdom SSO/MFA stack apply"
+domain: financials
+repo: railiance-cluster
+status: finished
+owner: railiance-worker
+topic_slug: railiance
+capability_request_id: "34b97d89-e80a-42ae-a623-a9185e5b17f5"
+created: "2026-03-20"
+updated: "2026-03-20"
+state_hub_workstream_id: "b236de41-2f33-4ebc-bb84-5fcedb2982f8"
+---
+
+# RAIL-BS-WP-0005 — Kubeconfig delivery for netkingdom SSO/MFA stack apply
+
+**Scope:** Fulfil capability request 34b97d89 — deliver a working local kubeconfig so
+the netkingdom SSO/MFA workstream (NK-WP-0001) can apply manifests (T02–T08) against
+the existing K3s cluster on HostEurope (92.205.130.254).
+
+**Context:**
+- Cluster is healthy: one node `Ready`, k3s v1.30.3, 200 days uptime.
+- K3s API listens on `*:6443` (all interfaces); UFW is inactive — direct public access works.
+- The in-cluster kubeconfig uses `server: https://127.0.0.1:6443`; must be rewritten
+  to `https://92.205.130.254:6443` for off-server use.
+- No ops-bridge tunnel needed for kubectl (API is directly reachable).
+- The capability request was filed against the wrong catalog entry (PostgreSQL HA instead of
+  k3s provisioning). No API endpoint exists to correct it retroactively, so it is recorded here.
+
+**Depends on:** RAIL-BS-WP-0002 (k3s-kubernetes-baseline) ✓ completed
+**Unblocks:** NK-WP-0001 T02–T08 (SSO/MFA stack apply)
+
+---
+
+## Task: Extract kubeconfig from HostEurope server
+
+```task
+id: RAIL-BS-WP-0005-T01
+status: done
+priority: high
+state_hub_task_id: "c59a8e0c-e1fd-4cfd-aa5e-7cbb895609f0"
+```
+
+```bash
+ssh -i ~/.ssh/id_ops tegwick@92.205.130.254 \
+  "sudo cat /etc/rancher/k3s/k3s.yaml" > /tmp/k3s-raw.yaml
+```
+
+Verify file is non-empty and contains a valid YAML kubeconfig.
+
+---
+
+## Task: Rewrite server address and install kubeconfig
+
+```task
+id: RAIL-BS-WP-0005-T02
+status: done
+priority: high
+state_hub_task_id: "93d61bc6-47e7-442f-8611-97f5f2f208c4"
+```
+
+Replace `127.0.0.1` with `92.205.130.254` in the kubeconfig; place at
+`~/.kube/config` (create `~/.kube/` if absent). Back up any existing config first.
+
+```bash
+mkdir -p ~/.kube
+# back up existing if present
+[ -f ~/.kube/config ] && cp ~/.kube/config ~/.kube/config.bak.$(date +%Y%m%d)
+# rewrite server and install
+sed 's|https://127.0.0.1:6443|https://92.205.130.254:6443|g' /tmp/k3s-raw.yaml \
+  > ~/.kube/config
+chmod 600 ~/.kube/config
+```
+
+---
+
+## Task: Smoke-test kubectl from local machine
+
+```task
+id: RAIL-BS-WP-0005-T03
+status: done
+priority: high
+state_hub_task_id: "f15626c2-73a0-443f-8aae-5515806ae0fa"
+```
+
+```bash
+kubectl get nodes
+kubectl get pods -A
+```
+
+Expected: node `254.130.205.92.host.secureserver.net` in `Ready` state.
+If unreachable, check firewall on server: `ssh -i ~/.ssh/id_ops tegwick@92.205.130.254 "sudo ufw status"`.
+
+---
+
+## Task: Resolve capability request
+
+```task
+id: RAIL-BS-WP-0005-T04
+status: done
+priority: high
+state_hub_task_id: "8109450c-95df-4d01-96fd-8847c88beb34"
+```
+
+Patch capability request 34b97d89 to `completed` with a resolution note:
+
+```bash
+curl -s -X PATCH "http://127.0.0.1:8000/capability-requests/34b97d89-e80a-42ae-a623-a9185e5b17f5/status" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "status": "completed",
+    "note": "Kubeconfig delivered to ~/.kube/config (server: 92.205.130.254:6443). kubectl smoke-test passed. NK-WP-0001 T02-T08 can proceed. Note: wrong catalog_entry_id filed (PostgreSQL HA eca6e5cc instead of k3s provisioning 9520cc98) — no retroactive API to correct."
+  }'
+```
+
+---
+
+## Task: Register UFW-inactive finding as technical debt
+
+```task
+id: RAIL-BS-WP-0005-T05
+status: done
+priority: medium
+state_hub_task_id: "ea120464-fdeb-4259-99e1-e6743cd86797"
+```
+
+UFW is inactive on 92.205.130.254 — K3s API port 6443 is exposed to the internet,
+protected only by TLS mutual auth. Register as TD item in state-hub so it gets
+addressed in a future railiance-cluster security hardening workplan.
+
+```bash
+curl -s -X POST "http://127.0.0.1:8000/technical-debt/" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "domain": "railiance",
+    "debt_type": "security",
+    "severity": "medium",
+    "title": "UFW inactive on HostEurope K3s node — API port 6443 exposed to internet",
+    "description": "UFW is inactive on 92.205.130.254. K3s API (port 6443) is reachable from anywhere, protected only by TLS client certificates. Should be restricted to known IPs or tunnelled. Discovered 2026-03-20 during kubeconfig delivery workplan.",
+    "status": "open"
+  }'
+```
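
Note on RAIL-BS-WP-0005 T02 (commentary, not part of the diff): the backup, rewrite and permission steps can be wrapped in one function so the kubeconfig is never world-readable, even briefly. A minimal sketch; the function name `rewrite_kubeconfig_server` is hypothetical, and it assumes the raw file uses `https://127.0.0.1:6443` as stated in the workplan context.

```bash
# Hypothetical helper: install a k3s kubeconfig rewritten for off-server use.
rewrite_kubeconfig_server() {
  # $1 = raw k3s.yaml, $2 = destination path, $3 = public host or IP
  local src="$1"
  local dest="$2"
  local host="$3"
  [ -s "$src" ] || { echo "empty or missing: $src" >&2; return 1; }
  mkdir -p "$(dirname "$dest")"
  # Keep a dated copy of any existing config before overwriting it.
  if [ -f "$dest" ]; then cp "$dest" "$dest.bak.$(date +%Y%m%d)"; fi
  # Restrictive umask in a subshell so a newly created file starts out private.
  ( umask 077
    sed "s|https://127\.0\.0\.1:6443|https://${host}:6443|g" "$src" > "$dest" )
  chmod 600 "$dest"
}
```

For the setup described above this would be called as `rewrite_kubeconfig_server /tmp/k3s-raw.yaml ~/.kube/config 92.205.130.254`. Note that it overwrites rather than merges, the same as the original `sed` redirect.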
--- a/workplans/archived/260622-RAILIANCE-WP-0012-activity-core-cluster-owned-deploy-verify.md
+++ b/workplans/archived/260622-RAILIANCE-WP-0012-activity-core-cluster-owned-deploy-verify.md
@@ -0,0 +1,110 @@
+---
+id: RAILIANCE-WP-0012
+type: workplan
+title: "activity-core cluster-owned deploy/verify"
+domain: financials
+repo: railiance-cluster
+status: finished
+owner: codex
+topic_slug: railiance
+created: "2026-06-15"
+updated: "2026-06-16"
+state_hub_workstream_id: "6434f7cb-e13c-4c05-839b-197bb239d5cd"
+---
+
+# activity-core cluster-owned deploy/verify
+
+## Context
+
+activity-core `ACTIVITY-WP-0007-T06` needs live Railiance cluster evidence for
+the disabled ops inventory probe. That live verification should be owned by the
+cluster/operator layer, not by arbitrary activity-core sessions with local
+`kubectl` assumptions.
+
+This workplan creates a cluster-owned path that keeps credentials in
+operator-owned locations while returning only non-secret evidence to State Hub.
+
+## Implement cluster-owned verifier
+
+```task
+id: RAILIANCE-WP-0012-T01
+status: done
+priority: high
+state_hub_task_id: "3769fdfb-b4f1-431b-a55a-672d93b3ea55"
+```
+
+Add a repeatable command that:
+
+- reconciles the activity-core Railiance runtime bundle;
+- reruns `actcore-sync`;
+- checks the `ops-service-inventory-probes` ActivityDefinition exists and is
+  still disabled;
+- triggers the disabled definition manually through the in-cluster API path;
+- verifies a fresh `ops_inventory_probe` progress event exists in State Hub;
+- posts a non-secret State Hub evidence note for activity-core to cite.
+
+Implemented as `tools/cmd/railiance-verify-activity-core` with Makefile target
+`verify-activity-core`. The script defaults to the `railiance01` SSH executor;
+use `ACTIVITY_CORE_CLUSTER_HOST=local` only for an explicitly selected local
+`kubectl` context.
+
+## Run live verification and publish evidence
+
+```task
+id: RAILIANCE-WP-0012-T02
+status: done
+priority: high
+state_hub_task_id: "6d7f87c3-a533-4de1-84de-9ca65f2e2779"
+```
+
+Run `make verify-activity-core` against the Railiance cluster. On success, cite
+the State Hub evidence note id in this task and in activity-core
+`ACTIVITY-WP-0007-T06`.
+
+If a gate fails, the verifier must still post a non-secret State Hub note with
+the failing gate and last completed evidence fields.
+
+2026-06-15: Completed against Railiance01 after refreshing the same-tag
+`activity-core:railiance01-prod` image from activity-core commit `ab17378`,
+importing digest `sha256:cff43c72455b9fc4fc11a0a997b4671a38987bb4583a600245dd961965af0e40`
+into k3s containerd, syncing the current runtime bundle to
+`/home/tegwick/activity-core/k8s/railiance`, and restarting the activity-core
+runtime deployments. The verifier reconciled the runtime bundle, completed
+`actcore-sync`, confirmed `ops-service-inventory-probes` exists and remains
+disabled, triggered it manually, verified State Hub progress
+`4c82360d-33e7-455b-8ab4-33facd4a3f8e`, and posted evidence note
+`baeeaeac-aa6d-4406-ae64-e54577f21386`.
+
+An intermediate verifier invocation accidentally targeted the local
+CoulombCore `kubectl` context. It created only `actcore-*` runtime resources in
+the existing `activity-core` namespace; those resources were removed with the
+runtime manifest cleanup, and the pre-existing `llm-connect` deployment remains
+running.
+
+Operational cleanup note: the successful Railiance01 verifier run used
+`ACTIVITY_CORE_RESTART_DEPLOYMENTS=1` after importing the same-tag image. The
+script was corrected afterward to restart only `actcore-api`,
+`actcore-worker`, and `actcore-event-router`, because
+`actcore-state-hub-bridge` uses host networking and a rolling restart leaves a
+new bridge pod pending behind the host-bound running pod. A 2026-06-16 cleanup
+check showed the bridge rollout had settled on Railiance01: the host-bound
+bridge pod was running and the replacement ReplicaSet was scaled to zero, so no
+manual live cleanup was needed.
+
+## Handoff closure to activity-core
+
+```task
+id: RAILIANCE-WP-0012-T03
+status: done
+priority: medium
+state_hub_task_id: "43f652c6-fcc4-49fa-90cc-4122eb6d5321"
+```
+
+After live evidence exists, update activity-core `ACTIVITY-WP-0007-T06` to cite
+the Railiance evidence and close it if Inter-Hub submission is active or
+explicitly deferred with the clean State Hub fallback result.
+
+2026-06-15: Updated activity-core `ACTIVITY-WP-0007-T06` to cite Railiance
+evidence note `baeeaeac-aa6d-4406-ae64-e54577f21386` and close the task with
+Inter-Hub submission explicitly deferred while the State Hub fallback evidence
+path is verified.
--- a/workplans/archived/260622-RAILIANCE-WP-0013-activity-core-verifier-evidence-hardening.md
+++ b/workplans/archived/260622-RAILIANCE-WP-0013-activity-core-verifier-evidence-hardening.md
@@ -0,0 +1,120 @@
+---
+id: RAILIANCE-WP-0013
+type: workplan
+title: "activity-core verifier evidence hardening"
+domain: financials
+repo: railiance-cluster
+status: finished
+owner: codex
+topic_slug: railiance
+created: "2026-06-16"
+updated: "2026-06-16"
+state_hub_workstream_id: "a3abb83a-2d42-40f9-a5f6-1dbc36903436"
+---
+
+# activity-core verifier evidence hardening
+
+## Context
+
+`RAILIANCE-WP-0012` moved activity-core live deploy/verify ownership into
+`railiance-cluster` and produced State Hub evidence
+`baeeaeac-aa6d-4406-ae64-e54577f21386`, with `ops_inventory_probe` progress
+`4c82360d-33e7-455b-8ab4-33facd4a3f8e`.
+
+A follow-up review found hardening work that matters for routine verifier use:
+the verifier should prove the State Hub progress event belongs to the specific
+manual trigger it launched, evidence should include an immutable runtime
+identity, and local `kubectl` mode should require an explicit double opt-in.
+
+This is a hardening follow-up only; it does not reopen activity-core
+`ACTIVITY-WP-0007-T06`.
+
+## Correlate State Hub progress to the manual trigger
+
+```task
+id: RAILIANCE-WP-0013-T01
+status: done
+priority: high
+state_hub_task_id: "d013a4a9-77fc-4cf0-babf-528d71acc0a1"
+```
+
+Update `tools/cmd/railiance-verify-activity-core` so after
+`POST /activity-definitions/<id>/trigger` it parses `trigger_key`, derives the
+expected activity-core manual `run_id`, and polls State Hub until it finds
+`ops_inventory_probe` where:
+
+- `detail.activity_id == DEFINITION_ID`;
+- `detail.activity_core_run_id == expected_run_id`.
+
+The verifier must not pass on merely any event created after `STARTED_AT`.
+Include the expected run id and matched progress id in the evidence note.
+
+2026-06-16: Implemented exact correlation. The verifier now derives the
+expected UUIDv5 `activity_core_run_id` from `<DEFINITION_ID>:<trigger_key>` and
+requires State Hub `ops_inventory_probe` detail to match both `activity_id` and
+`activity_core_run_id`.
+
+## Record immutable runtime evidence
+
+```task
+id: RAILIANCE-WP-0013-T02
+status: done
+priority: medium
+state_hub_task_id: "c5780ec1-9a74-401e-b60e-a0fdf2b7e5d2"
+```
+
+Ensure successful evidence includes either `activity_core_revision` or an
+immutable Kubernetes image ID/digest. When the remote repo revision is
+unavailable, fall back to the live `actcore-api` pod container `imageID`.
+
+2026-06-16: Implemented `api_image_id` capture from the live `actcore-api` pod
+container status and added a guard so passed evidence must include either the
+remote repo revision or the immutable image ID.
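+
+A minimal sketch of that guard, assuming the two values have already been
+collected into shell variables. The function name `has_immutable_identity` and
+its argument order are hypothetical and are not taken from the verifier:
+
+```bash
+# Hypothetical helper: succeed only when at least one immutable runtime
+# identity (remote repo revision or container imageID) is non-empty.
+has_immutable_identity() {
+  local revision="$1" image_id="$2"
+  [ -n "$revision" ] || [ -n "$image_id" ]
+}
+```
+
+A caller could refuse to post passed evidence whenever
+`has_immutable_identity "$revision" "$api_image_id"` returns non-zero.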
+
+## Guard explicit local kubectl override
+
+```task
+id: RAILIANCE-WP-0013-T03
+status: done
+priority: medium
+state_hub_task_id: "0d60809f-3f1d-4ea9-a96f-af074911acc0"
+```
+
+Keep `railiance01`/SSH as the default executor. If
+`ACTIVITY_CORE_CLUSTER_HOST=local` is selected, require an additional explicit
+opt-in such as `ACTIVITY_CORE_ALLOW_LOCAL_KUBECTL=1` and print the current
+`kubectl` context before continuing.
+
+2026-06-16: Implemented the double opt-in. `ACTIVITY_CORE_CLUSTER_HOST=local`
+now exits before cluster access unless `ACTIVITY_CORE_ALLOW_LOCAL_KUBECTL=1` is
+also set, and accepted local mode prints the current `kubectl` context.
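+
+The double opt-in can be sketched as follows. Only the two environment
+variable names come from this task; the function name
+`require_executor_opt_in`, the messages, and the `unknown` fallback are
+assumptions for illustration:
+
+```bash
+# Hypothetical guard: local kubectl needs both variables set explicitly.
+require_executor_opt_in() {
+  local host="${ACTIVITY_CORE_CLUSTER_HOST:-railiance01}"
+  if [ "$host" = "local" ]; then
+    if [ "${ACTIVITY_CORE_ALLOW_LOCAL_KUBECTL:-}" != "1" ]; then
+      echo "refusing local kubectl: set ACTIVITY_CORE_ALLOW_LOCAL_KUBECTL=1" >&2
+      return 1
+    fi
+    # Print the context so the operator can see which cluster is targeted.
+    local context
+    context="$(kubectl config current-context 2>/dev/null || echo unknown)"
+    echo "local kubectl context: ${context}" >&2
+  fi
+}
+```
+
+Calling it before any cluster access keeps the SSH executor as the default and
+makes an accidental local run fail early.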
+
+## Verify and publish hardening evidence
+
+```task
+id: RAILIANCE-WP-0013-T04
+status: done
+priority: medium
+state_hub_task_id: "150e4fa3-800c-4997-baaa-da696f5a0fc0"
+```
+
+Run `bash -n tools/cmd/railiance-verify-activity-core`, run
+`make verify-activity-core` against Railiance01, confirm the evidence note
+matched the manual trigger run id, and post a non-secret State Hub note citing
+the new evidence.
+
+2026-06-16: Verified with `bash -n tools/cmd/railiance-verify-activity-core`
+and a live Railiance01 `make verify-activity-core` run. The verifier posted
+State Hub evidence note `60256e9a-9d1b-44db-8999-738cf03bca2e`, matched manual
+run id `90e3b112-d1e3-51af-8fb2-cb61f26add17`, matched
+`ops_inventory_probe` progress `db408146-0310-4ac3-ac77-f73c5a41e070`, and
+included `api_image_id`
+`sha256:5ff92a8217c450ae06075d00862b6e2a92a83ca09eea18b5a5e96b5d2d728b35`.
+
+Done when:
+
+- the verifier rejects unrelated fresh `ops_inventory_probe` events;
+- evidence includes a non-null revision or image digest;
+- local `kubectl` mode requires explicit double opt-in;
+- the Railiance01 verifier run posts a passed evidence note with matched run id;
+- `make fix-consistency REPO=railiance-cluster` has synced the workplan.
--- a/workplans/archived/260627-RAIL-BS-WP-0006-staged-promotion-lifecycle.md
+++ b/workplans/archived/260627-RAIL-BS-WP-0006-staged-promotion-lifecycle.md
@@ -0,0 +1,258 @@
+---
+id: RAIL-BS-WP-0006
+type: workplan
+title: "Staged Promotion Lifecycle"
+domain: financials
+repo: railiance-cluster
+status: finished
+owner: railiance
+topic_slug: railiance
+repo_goal_id: "6ea441f7-7fe3-4598-922b-38baf20c0580"
+state_hub_workstream_id: "cb72d3ba-1863-43c2-a2a5-49ac75fc2603"
+created: "2026-02-24"
+updated: "2026-06-27"
+---
+
+# Staged Promotion Lifecycle
+
+## Goal
+
+Design and implement the three-stage deployment lifecycle as the core
+Railiance application promotion pattern:
+
+1. Stage 1: local development and validation.
+2. Stage 2: canary on production infrastructure.
+3. Stage 3: full production promotion with rollback.
+
+This lifecycle should become the repeatable path for native Railiance apps and
+third-party upstream applications wrapped by a Railiance overlay repo.
+
+## Why This Belongs Before Forgejo
+
+Forgejo will become critical production infrastructure. Before moving the
+source forge itself, Railiance needs a well-defined promotion lifecycle so the
+Forgejo deployment, Actions runners, package registry, and future upgrades can
+move through the same staged gates as every other important workload.
+
+## Boundary
+
+This workplan lives in `railiance-cluster` because it defines cluster runtime
+promotion mechanics and the canonical handoff between local validation,
+canary deployment, and production routing.
+
+Expected cross-repo handoffs:
+
+- `railiance-enablement`: developer-facing CLI templates and CI workflow
+  conventions.
+- `railiance-platform`: shared platform dependencies used by canaries.
+- `railiance-apps`: application Helm values and workload-specific promotion
+  definitions.
+
+## Tasks
+
+### T01 - Write deployment lifecycle specification
+
+```task
+id: RAIL-BS-WP-0006-T01
+status: done
+priority: high
+state_hub_task_id: "fbfc341f-8ccb-4950-a85d-3e59c4f5b87f"
+```
+
+Write `docs/deployment-lifecycle.md`.
+
+The spec should define:
+
+- Stage 1, Stage 2, and Stage 3 semantics.
+- Required checks before each stage.
+- Canary acceptance gates.
+- Rollback expectations.
+- Human approval gates for production-critical workloads.
+
+**Done when:** the lifecycle is clear enough to apply to Forgejo as a later
+production workload.
+
+2026-06-16: Added `docs/deployment-lifecycle.md` and linked it from
+`docs/README.md`. The specification defines Stage 1 local validation, Stage 2
+production canary, Stage 3 production promotion, required checks and evidence,
+canary acceptance gates, rollback expectations, human approval gates for
+production-critical workloads, and the Forgejo readiness questions that must be
+answered before cutover.
+
+---
+
+### T02 - Define railiance directory schema and app.toml contract
+
+```task
+id: RAIL-BS-WP-0006-T02
+status: done
+priority: high
+state_hub_task_id: "523cf928-bb0e-4109-a172-abf029c62885"
+```
+
+Define the repository-local `railiance/` directory schema and `app.toml`
+contract for native and third-party applications.
+
+Minimum contract:
+
+- App identity and ownership.
+- Stage definitions.
+- Required platform dependencies.
+- Health checks and observability endpoints.
+- Promotion and rollback commands.
+- Secret references without plaintext secret values.
+
+**Done when:** a repo can declare how it moves through the Railiance promotion
+lifecycle without bespoke instructions.
+
+2026-06-27: Added `docs/app-toml-contract.md`,
+`schemas/railiance-app.schema.json`, and `examples/railiance/app.toml`. The v1
+contract covers app identity, ownership, source/artifact policy, platform
+dependencies, secret references without plaintext values, health and
+observability endpoints, stage commands/checks/evidence, canary and promotion
+modes, rollback strategy, and human approval gates.
+
+---
+
+### T03 - Overlay repo pattern and creation script
+
+```task
+id: RAIL-BS-WP-0006-T03
+status: done
+priority: medium
+state_hub_task_id: "7cd378f2-0319-407a-9ce7-2c6d1a6d6d24"
+```
+
+Design the overlay repo pattern for third-party upstream applications and add
+`create_railiance_overlay_repo.sh` or equivalent tooling.
+
+The pattern should keep upstream code and Railiance deployment concerns cleanly
+separated while still allowing reproducible promotion.
+
+**Done when:** a third-party app can be wrapped without forking deployment
+logic into the upstream repository.
+
+2026-06-27: Added `docs/overlay-repo-pattern.md` and
+`tools/create_railiance_overlay_repo.sh`, plus the
+`bin/railiance create-overlay` dispatcher entry. The scaffold records upstream
+identity in `railiance/upstream.toml`, generates a schema-valid
+`railiance/app.toml`, stage values, a thin Helm chart, Stage 1 test script,
+rollback runbook, and promotion notes without vendoring upstream code or
+touching secrets.
+
+---
+
+### T04 - railiance run command
+
+```task
+id: RAIL-BS-WP-0006-T04
+status: done
+priority: high
+state_hub_task_id: "95c3311b-04bb-4c83-bda3-47958217b665"
+```
+
+Implement the Stage 1 `railiance run` command for local development and
+validation.
+
+Expected behavior:
+
+- Read `railiance/app.toml`.
+- Start or validate the local development target.
+- Run defined local health checks.
+- Emit a machine-readable result suitable for later promotion gates.
+
+**Done when:** at least one representative app can complete Stage 1 locally.
+
+2026-06-27: Added `tools/cmd/railiance-run`, the `bin/railiance run`
+dispatcher entry, and `docs/railiance-run-command.md`. The command reads
+`railiance/app.toml`, runs Stage 1 commands and local checks, and emits
+`railiance.run-result.v1` JSON without command logs or secret values. Updated
+the overlay generator so a generated Forgejo overlay completes Stage 1 locally
+in this environment; Helm rendering is optional when Helm is unavailable.
+
+---
+
+### T05 - Canary Helm chart template
+
+```task
+id: RAIL-BS-WP-0006-T05
+status: done
+priority: high
+state_hub_task_id: "47b8cd47-99c7-4f31-a147-ea16afde7217"
+```
+
+Create the Stage 2 canary Helm chart template.
+
+Minimum requirements:
+
+- Stable and canary release identities.
+- Weighted routing or equivalent traffic split through the chosen ingress
+  path.
+- Prometheus-compatible annotations.
+- Resource limits appropriate for single-node and future ThreePhoenix use.
+- Rollback-safe values layout.
+
+**Done when:** a canary deployment can be created without hand-editing cluster
+resources.
+
+2026-06-27: Updated generated overlay charts for Stage 2 canaries. The
+scaffold now emits stable/canary release identities, isolated canary ingress by
+default, optional Traefik weighted routing, Prometheus-compatible annotations,
+HTTP probes, conservative single-node resource limits, rollback labels,
+separate Stage 2/Stage 3 values, and `tests/stage2-template.sh`. Verified a
+fresh Forgejo overlay with schema validation, Stage 1 run, and Stage 2 scaffold
+checks; Helm rendering was skipped because Helm is unavailable in this
+environment.
+
+---
+
+### T06 - railiance deploy --stage 2 and observation tooling
+
+```task
+id: RAIL-BS-WP-0006-T06
+status: done
+priority: medium
+state_hub_task_id: "6a5c7422-fcb1-49d1-8153-e891bd1c27fa"
+```
+
+Implement Stage 2 deployment and observation commands.
+
+Expected behavior:
+
+- Deploy the canary from declared app metadata.
+- Show rollout state, pod health, ingress/routing state, and key metrics.
+- Fail closed when prerequisites or health gates are missing.
+
+**Done when:** Stage 2 can be run and observed from a repeatable command path.
+
+2026-06-27: Added `tools/cmd/railiance-stage2` and dispatcher entries for
+`bin/railiance deploy` and `bin/railiance observe`. Deploy emits a
+`railiance.stage2-deploy-result.v1` plan by default, can run Helm server dry-run
+or apply when tools and cluster access are present, and fails closed when
+required paths, Helm, or approval evidence are missing. Observe emits a
+`railiance.stage2-observe-result.v1` target plan by default and runs live
+kubectl rollout, pod, ingress, and metrics checks only with `--live`. Updated
+generated overlays to declare the repeatable Stage 2 plan commands.
+
+---
+
+### T07 - railiance promote, rollback, and onboarding guide
+
+```task
+id: RAIL-BS-WP-0006-T07
+status: done
+priority: medium
+state_hub_task_id: "476198f6-0049-4ac4-9593-6723c86c9602"
+```
+
+Implement Stage 3 promotion and rollback commands, then write the reference
+onboarding guide.
+
+Expected output:
+
+- `railiance promote` for controlled production promotion.
+- `railiance rollback` for reverting to the previous stable version.
+- A guide showing how a representative app adopts the lifecycle.
+- Explicit human approval points for critical infrastructure workloads.
+
+**Done when:** a representative app can move Stage 1 -> Stage 2 -> Stage 3 and
+back through rollback using documented commands.
+
+2026-06-27: Added `tools/cmd/railiance-stage3` and dispatcher entries for
+`bin/railiance promote` and `bin/railiance rollback`. Both commands default to
+non-mutating JSON plans; apply modes require approval evidence and Helm, and
+rollback apply also requires a Helm revision for the `helm-revision` strategy.
+Added `docs/promote-rollback-onboarding.md` with the representative Stage 1 ->
+Stage 2 -> Stage 3 -> rollback path and explicit human approval points for
+critical workloads. Updated generated overlays to declare promote/rollback plan
+commands.
+
+## Dependencies
+
+This workplan should be done before the Forgejo production cutover. It can run
+in parallel with preparatory ThreePhoenix design, but its Stage 2/3 behavior
+should be validated against the intended ThreePhoenix cluster model.
--- a/workplans/archived/260701-RAILIANCE-WP-0014-activity-core-llm-connect-live-reconcile.md
+++ b/workplans/archived/260701-RAILIANCE-WP-0014-activity-core-llm-connect-live-reconcile.md
@@ -0,0 +1,106 @@
+---
+id: RAILIANCE-WP-0014
+type: workplan
+title: "activity-core llm-connect live reconcile"
+domain: financials
+repo: railiance-cluster
+status: finished
+owner: codex
+topic_slug: railiance
+created: "2026-06-18"
+updated: "2026-07-01"
+state_hub_workstream_id: "a152ddda-d60a-4a65-9b9c-59e2db9ff2b7"
+---
+
+# activity-core llm-connect live reconcile
+
+## Context
+
+activity-core has updated its Railiance runtime manifest so
+`actcore-runtime-config` points at the verified in-cluster llm-connect URL:
+
+```text
+LLM_CONNECT_URL=http://llm-connect.activity-core.svc.cluster.local:8080
+LLM_CONNECT_TIMEOUT_SECONDS=300
+```
+
+The remaining live gate belongs at the cluster/operator layer. Provider
+credentials must stay outside Git and State Hub, and the fixture smoke should
+record only non-secret evidence.
+
+## Add cluster-owned reconcile/check command
+
+```task
+id: RAILIANCE-WP-0014-T01
+status: done
+priority: high
+state_hub_task_id: "49288db7-8102-4ad5-af08-1fe6ab3f1d37"
+```
+
+Add a repeatable Railiance command that:
+
+- reconciles the non-secret activity-core runtime config keys;
+- checks the provider Secret by key count only;
+- applies the llm-connect overlay only after the provider Secret exists;
+- runs the in-namespace fixture smoke only after deployment readiness;
+- posts a non-secret State Hub evidence note.
+
+2026-06-18: Added `tools/cmd/railiance-reconcile-activity-core-llm-connect`
+and Makefile target `reconcile-activity-core-llm-connect`.
+
+## Reconcile live non-secret runtime config
+
+```task
+id: RAILIANCE-WP-0014-T02
+status: done
+priority: high
+state_hub_task_id: "61df5bad-535f-4ad1-ac7a-f46ff278c388"
+```
+
+Patch the live `activity-core/actcore-runtime-config` ConfigMap so it consumes
+the verified llm-connect service URL and timeout. Do not touch Secret values.
+
+2026-06-18: The reconcile command patches only `LLM_CONNECT_URL` and
+`LLM_CONNECT_TIMEOUT_SECONDS`, then re-reads the live ConfigMap to verify the
+values. Live evidence note `c72c514a-399e-4c54-8d5b-d36405932360` confirms
+`LLM_CONNECT_URL=http://llm-connect.activity-core.svc.cluster.local:8080` and
+`LLM_CONNECT_TIMEOUT_SECONDS=300`.
+
+## Complete provider Secret, deployment, and smoke gate
+
+```task
+id: RAILIANCE-WP-0014-T03
+status: done
+priority: high
+state_hub_task_id: "ae8af00a-c14f-4b76-933c-46d06cd360ae"
+```
+
+After an operator stores provider credentials in
+`activity-core/llm-connect-provider-secrets`, rerun:
+
+```bash
+make reconcile-activity-core-llm-connect
+```
+
+The command will apply the llm-connect overlay, wait for deployment readiness,
+run the in-namespace fixture smoke with `imagePullPolicy=Never`, and post
+non-secret evidence: provider Secret key count, deployment readiness,
+pass/fail, latency/recommendation summary or sanitized failure.
+
+2026-07-01: Gate closed. Provider Secret `activity-core/llm-connect-provider-secrets`
+present (key count 1, no values inspected), overlay applied (no drift),
+deployment `llm-connect` ready 1/1, in-namespace fixture smoke passed
+(`health=ok latency_seconds=2.084 recommendations=1`). Evidence note
+`bddbf5d2-6cbe-4d97-9de6-689147d61be1`. The first rerun failed with
+`Connection refused` because the allowlist of the
+`llm-connect-activity-core-only` NetworkPolicy (added 2026-06-19) had not yet
+propagated to cover the fresh smoke-pod IP; the reconcile tool now retries the
+smoke up to 6× with a 5s warm-up inside the pod.
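+
+A minimal sketch of such a retry loop, assuming the warm-up pause precedes each
+attempt (the note above does not say whether it is per attempt or one-off).
+The function name `retry_with_warmup` and its arguments are hypothetical:
+
+```bash
+# Hypothetical helper: retry_with_warmup ATTEMPTS WARMUP_SECONDS COMMAND...
+retry_with_warmup() {
+  local attempts="$1" warmup="$2" i=1
+  shift 2
+  while [ "$i" -le "$attempts" ]; do
+    sleep "$warmup"          # give the NetworkPolicy allowlist time to settle
+    if "$@"; then
+      return 0               # smoke passed; stop retrying
+    fi
+    i=$((i + 1))
+  done
+  return 1                   # every attempt failed
+}
+```
+
+For the values recorded above, the call would look like
+`retry_with_warmup 6 5 <smoke command>`.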
+
+Historical live gate on 2026-06-18: provider Secret
+`activity-core/llm-connect-provider-secrets` was missing, so deployment and
+smoke were intentionally blocked until operator/OpenBao-to-Kubernetes Secret
+custody was complete. Evidence note
+`c72c514a-399e-4c54-8d5b-d36405932360` records provider Secret status
+`missing`, key count `0`, deployment status `not checked; provider Secret gate
+not satisfied`, and smoke status `blocked`.