feat(workplan): add RAIL-HO-WP-0004 production-readiness workplan

10-task cross-layer workplan covering: Ansible hardening codification (T01-T02), cnpg platform baseline superseding stale WP-0001 (T03-T05), S2→S5 Gitea boundary fix (T06), SSH git automation on CoulombCore (T07, done), deploy-stack docs (T08), state-hub + activity-core migration to cluster (T09-T10).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 01:01:47 +01:00
parent 9d59b5c667
commit caa6ae36da
1 changed files with 516 additions and 0 deletions
--- a/workplans/RAIL-HO-WP-0004-production-readiness.md
+++ b/workplans/RAIL-HO-WP-0004-production-readiness.md
@@ -0,0 +1,516 @@
+---
+id: RAIL-HO-WP-0004
+type: workplan
+title: "Railiance Production Readiness — Automated, Reproducible Stack"
+domain: railiance
+repo: railiance-infra
+status: active
+owner: worsch
+topic_slug: railiance
+created: "2026-03-26"
+updated: "2026-03-27"
+supersedes: RAIL-PL-WP-0001
+state_hub_workstream_id: "cee078e9-b18c-4f84-8a8a-6f27c2f9f407"
+---
+
+# Railiance Production Readiness — Automated, Reproducible Stack
+
+## Goal
+
+Make the Railiance cluster fully reproducible from a clean server — zero manual
+interventions required. Then migrate operational workloads (state-hub,
+activity-core) from the workstation and ad-hoc CoulombCore setup onto the
+cluster, with confidence that a rebuild is always a `make deploy-stack` away.
+
+This workplan supersedes `RAIL-PL-WP-0001` (which targeted Bitnami
+postgresql-ha; cnpg is now the deployed and active operator).
+
+## Why now
+
+Three forcing functions are converging:
+1. **CoulombCore hardening applied manually** (swapfile, nproc limits, systemd
+   slice) after INC-002. These must be in Ansible before the next node
+   rebuild — or the next operator overstep will repeat the incident.
+2. **cnpg is deployed** (cnpg-system namespace, databases namespace active).
+   WP-0001 targeted Bitnami postgresql-ha which is now stale. A clean platform
+   baseline must match reality.
+3. **State-hub and activity-core live on the workstation** — fragile,
+   non-redundant, not self-documenting. Moving them to the cluster is the last
+   step to making Railiance the durable operational home it was designed to be.
+
+## Current deployed state (reference snapshot 2026-03-26)
+
+| Component | Namespace | Manager | Boundary status |
+|-----------|-----------|---------|-----------------|
+| cert-manager | cert-manager | Helm S2 | ✓ correct |
+| CloudNative PG operator | cnpg-system | Helm S2 | boundary violation: operator is S3 concern |
+| nginx ingress | ingress-nginx | Helm S2 | ✓ correct |
+| Gitea 12.5.0 | gitea | Helm S2 | boundary violation: should be S5 |
+| ArgoCD | argocd | kubectl S2 | boundary violation: S4 concern |
+| SSO/MFA stack | mfa + sso | ? | per net-kingdom |
+| cnpg databases | databases | kubectl S3 | ✓ correct layer, no cluster defined yet |
+
+## Scope
+
+### Phase 1 — Ansible-codify server hardening (S1)
+
+All manual CoulombCore interventions from INC-002 must become Ansible roles
+so they survive node rebuild. No more drift between code and reality.
+
+### Phase 2 — S3 platform baseline with cnpg (supersedes WP-0001)
+
+Define a cnpg `Cluster` resource for the Gitea database in `railiance-platform`.
+Migrate Gitea from its built-in postgresql-ha subchart to this cluster.
+Codify Valkey as a standalone S3 Helm release.
+
+### Phase 3 — S2 boundary cleanup
+
+Move `gitea-values.sops.yaml` from `railiance-cluster` to `railiance-apps`.
+Document remaining boundary violations (cnpg operator in S2, ArgoCD in S2)
+and create forward-dated migration stubs.
+
+### Phase 4 — Git operations from CoulombCore
+
+CoulombCore cannot push to Gitea via HTTP (NodePort hairpin). Configure SSH
+remote so all on-cluster git operations use SSH.
+
+### Phase 5 — Automated stack deploy
+
+Write a `deploy-stack` target (or script) that converges S1→S5 in dependency
+order. The goal: a new operator can onboard a server and reach a working
+cluster with one command sequence.
+
+### Phase 6 — Migrate operational workloads (S5)
+
+Deploy state-hub and activity-core to the cluster. This is the payoff phase —
+the cluster becomes the operational home, not the workstation.
+
+## Pre-conditions
+
+- railiance-cluster converged and all S2 workplans done (they are: ✓)
+- Gitea operational (it is: ✓, gitea namespace running)
+- ops-bridge state-hub tunnel active (`bridge up state-hub-coulombcore`)
+- Active backup before any phase touching live data (`make backup` in railiance-cluster)
+
+---
+
+## Tasks
+
+### T01 — Ansible: swapfile role for CoulombCore
+
+```task
+id: RAIL-HO-WP-0004-T01
+status: todo
+priority: high
+state_hub_task_id: "7c586940-f7b8-4e55-b1d6-72eba6a675b7"
+```
+
+Create an Ansible role `swapfile` (or extend `roles/base`) that provisions the
+4 GB swapfile applied manually after INC-002.
+
+Desired state:
+```yaml
+# inventory/host_vars/coulombcore.yml (or group_vars)
+swap_size_gb: 4
+swap_swappiness: 10
+```
+
+Role tasks:
+1. Check that `/swapfile` exists with the correct size; run `fallocate` only if it does not, so the step stays idempotent
+2. `chmod 600 /swapfile`, `mkswap`, `swapon` if not already active
+3. Ensure `/etc/fstab` entry present
+4. Set `vm.swappiness=10` via `sysctl` module (persist in `/etc/sysctl.d/`)
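+
+The role tasks above could be sketched roughly as follows. This is a minimal
+sketch, not the final role: the file path `roles/swapfile/tasks/main.yml` and
+the drop-in name `99-swappiness.conf` are assumptions, and the wrong-size case
+from step 1 is not handled (the file is only created when absent).
+
+```yaml
+# roles/swapfile/tasks/main.yml (hypothetical path)
+- name: Allocate the swapfile only if it does not exist yet
+  ansible.builtin.command:
+    cmd: "fallocate -l {{ swap_size_gb }}G /swapfile"
+    creates: /swapfile
+  register: swap_alloc
+
+- name: Restrict swapfile permissions before formatting
+  ansible.builtin.file:
+    path: /swapfile
+    owner: root
+    group: root
+    mode: "0600"
+
+- name: Format the swapfile right after it was created
+  ansible.builtin.command: mkswap /swapfile
+  when: swap_alloc is changed
+
+- name: List active swap devices (read-only check)
+  ansible.builtin.command: swapon --show=NAME --noheadings
+  register: active_swap
+  changed_when: false
+
+- name: Activate the swapfile if it is not active yet
+  ansible.builtin.command: swapon /swapfile
+  when: "'/swapfile' not in active_swap.stdout"
+
+- name: Persist the swap entry in /etc/fstab
+  ansible.posix.mount:
+    path: none
+    src: /swapfile
+    fstype: swap
+    opts: sw
+    state: present
+
+- name: Set vm.swappiness and persist it under /etc/sysctl.d/
+  ansible.posix.sysctl:
+    name: vm.swappiness
+    value: "{{ swap_swappiness }}"
+    sysctl_file: /etc/sysctl.d/99-swappiness.conf   # assumed file name
+    state: present
+    reload: true
+```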
+
+**Done when:** `make converge` is idempotent; `free -h` on CoulombCore shows
+4 GB swap; `make verify` passes.
+
+---
+
+### T02 — Ansible: nproc limits and systemd user slice
+
+```task
+id: RAIL-HO-WP-0004-T02
+status: todo
+priority: high
+state_hub_task_id: "42f1f02b-0d8b-432c-8bc8-4930417e15dd"
+```
+
+Codify the PAM nproc limits and systemd user slice hardening applied after
+INC-002 into Ansible (role `security` or a new `resource-limits` role).
+
+Desired state:
+```yaml
+nproc_soft: 512
+nproc_hard: 1024
+user_memory_max: "1500M"
+user_memory_swap_max: "512M"
+```
+
+Tasks:
+1. Template `/etc/security/limits.conf` entry for tegwick (nproc soft/hard)
+2. Create `/etc/systemd/system/user-1000.slice.d/limits.conf` via template
+3. `systemctl daemon-reload` handler
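+
+One possible shape for these tasks is sketched below. It makes several
+assumptions: the role is named `resource-limits`, UID 1000 belongs to tegwick,
+and the two memory variables map to the systemd `MemoryMax` and
+`MemorySwapMax` slice properties. It also uses the
+`community.general.pam_limits` module instead of a hand-written template for
+the `limits.conf` entry, which is a substitution for illustration only.
+
+```yaml
+# roles/resource-limits/tasks/main.yml (hypothetical path)
+- name: Set nproc soft and hard limits for tegwick
+  community.general.pam_limits:
+    domain: tegwick
+    limit_type: "{{ item.type }}"
+    limit_item: nproc
+    value: "{{ item.value }}"
+  loop:
+    - { type: soft, value: "{{ nproc_soft }}" }
+    - { type: hard, value: "{{ nproc_hard }}" }
+
+- name: Ensure the user slice drop-in directory exists
+  ansible.builtin.file:
+    path: /etc/systemd/system/user-1000.slice.d
+    state: directory
+    mode: "0755"
+
+- name: Write memory caps for the user slice
+  ansible.builtin.copy:
+    dest: /etc/systemd/system/user-1000.slice.d/limits.conf
+    mode: "0644"
+    content: |
+      [Slice]
+      MemoryMax={{ user_memory_max }}
+      MemorySwapMax={{ user_memory_swap_max }}
+  notify: Reload systemd
+
+---
+# roles/resource-limits/handlers/main.yml (hypothetical path)
+- name: Reload systemd
+  ansible.builtin.systemd:
+    daemon_reload: true
+```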
+
+**Done when:** `make converge` idempotent; `cat /proc/<tegwick-pid>/limits`
+reflects caps; `make verify` passes; Goss test for nproc limit added.
+
+---
+
+### T03 — Define Gitea cnpg database cluster in railiance-platform
+
+```task
+id: RAIL-HO-WP-0004-T03
+status: todo
+priority: high
+state_hub_task_id: "8e8cff04-96c6-4386-8caa-b0586114a49d"
+```
+
+Mark `RAIL-PL-WP-0001` as superseded (update its status field). Then define
+the Gitea database cluster using CloudNative PG in `railiance-platform`.
+
+Files to create in `railiance-platform/`:
+```
+helm/gitea-db-cluster.yaml       # cnpg Cluster manifest (SOPS-encrypted secrets inline or ref)
+Makefile targets: db-deploy, db-status, db-shell
+```
+
+Cluster manifest skeleton:
+```yaml
+apiVersion: postgresql.cnpg.io/v1
+kind: Cluster
+metadata:
+  name: gitea-db
+  namespace: databases
+spec:
+  instances: 1           # single-node to start; bump to 3 when RAM allows
+  imageName: ghcr.io/cloudnative-pg/postgresql:16   # cnpg selects the PostgreSQL major version via the image tag
+  storage:
+    size: 10Gi
+  bootstrap:
+    initdb:
+      database: gitea
+      owner: gitea
+      secret:
+        name: gitea-db-credentials   # k8s Secret (SOPS-managed)
+```
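+
+For reference, CloudNative PG expects the secret named under
+`bootstrap.initdb.secret` to be of type `kubernetes.io/basic-auth` with
+`username` and `password` keys. A sketch of what `gitea-db-credentials` might
+look like before SOPS encryption follows; the password value is a placeholder,
+and where the file lives in the repo is left open.
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: gitea-db-credentials
+  namespace: databases
+type: kubernetes.io/basic-auth
+stringData:
+  username: gitea              # must match initdb.owner
+  password: ENC[age,...]       # placeholder; real value is SOPS-managed
+```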
+
+Add `make db-deploy` target that applies the manifest to the `databases`
+namespace. Add `make db-status` that shows cluster health via `kubectl cnpg
+status`.
+
+**Done when:** `make db-deploy` succeeds; `kubectl get cluster -n databases`
+shows `gitea-db` in `Cluster in healthy state`; credentials secret present.
+
+---
+
+### T04 — Migrate Gitea to external cnpg database
+
+```task
+id: RAIL-HO-WP-0004-T04
+status: todo
+priority: high
+state_hub_task_id: "4f4196b5-4d84-4648-b470-e6941444ea46"
+```
+
+**Pre-condition:** T03 done and gitea-db cluster healthy.
+
+Migration steps (execute from CoulombCore with kubectl access):
+
+1. Backup: `make backup` in railiance-cluster — verify success.
+2. Dump current Gitea DB:
+   ```bash
+   kubectl exec -n gitea deploy/gitea -- \
+     pg_dump -h localhost -U gitea gitea > /tmp/gitea-dump.sql
+   ```
+   (Gitea's built-in postgresql-ha is at localhost within the pod)
+3. Restore into cnpg cluster:
+   ```bash
+   kubectl exec -i -n databases gitea-db-1 -- \
+     psql -U gitea gitea < /tmp/gitea-dump.sql   # -i forwards the dump on stdin
+   ```
+4. Update Gitea Helm values to disable subchart and point to cnpg:
+   ```yaml
+   postgresql-ha:
+     enabled: false
+   redis-cluster:
+     enabled: false   # Valkey handled in T05
+   gitea:
+     config:
+       database:
+         DB_TYPE: postgres
+         HOST: gitea-db-rw.databases.svc.cluster.local:5432
+         NAME: gitea
+         USER: gitea
+         PASSWD: <from cnpg secret>
+   ```
+5. `helm upgrade gitea` — verify login and all repos intact.
+6. Confirm old postgresql-ha pods are terminated.
+
+**Done when:** Gitea login works; all repos accessible; no postgresql-ha pods
+running; `kubectl cnpg status gitea-db -n databases` healthy.
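+
+Rather than writing `PASSWD` into the values file, the password could be
+injected from a Kubernetes Secret. The fragment below is a sketch that assumes
+the Gitea chart's `gitea.additionalConfigFromEnvs` value and a copy of the
+credentials secret in the `gitea` namespace, since a `secretKeyRef` cannot
+reach into `databases`. The secret name here is hypothetical.
+
+```yaml
+gitea:
+  additionalConfigFromEnvs:
+    - name: GITEA__DATABASE__PASSWD      # maps to [database] PASSWD in app.ini
+      valueFrom:
+        secretKeyRef:
+          name: gitea-db-credentials     # assumed copy in the gitea namespace
+          key: password
+```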
+
+---
+
+### T05 — Codify Valkey as standalone S3 asset
+
+```task
+id: RAIL-HO-WP-0004-T05
+status: todo
+priority: medium
+state_hub_task_id: "36c66ceb-ebc9-425a-a329-c37496278c6b"
+```
+
+Create `railiance-platform/helm/valkey-values.sops.yaml` and deploy Valkey
+as a standalone Helm release in the `platform` namespace (independent of Gitea
+subchart).
+
+```yaml
+# helm/valkey-values.sops.yaml
+auth:
+  enabled: true
+  password: ENC[age,...]
+replica:
+  replicaCount: 1
+persistence:
+  enabled: true
+  size: 2Gi
+```
+
+Add `make valkey-deploy` and `make valkey-status` to `railiance-platform/Makefile`.
+
+Update Gitea Helm values to point to standalone Valkey:
+```yaml
+redis-cluster:
+  enabled: false
+gitea:
+  config:
+    cache:
+      ADAPTER: redis
+      HOST: redis://:password@valkey.platform.svc.cluster.local:6379/0
+    session:
+      PROVIDER: redis
+      PROVIDER_CONFIG: redis://:password@valkey.platform.svc.cluster.local:6379/1
+    queue:
+      TYPE: redis
+      CONN_STR: redis://:password@valkey.platform.svc.cluster.local:6379/2
+```
+
+**Done when:** `make valkey-deploy` succeeds; Gitea session/cache operational
+on standalone Valkey; no redis subchart pods running.
+
+---
+
+### T06 — Move Gitea Helm values to railiance-apps (boundary fix)
+
+```task
+id: RAIL-HO-WP-0004-T06
+status: todo
+priority: medium
+state_hub_task_id: "6d8323b3-e842-4dc1-9a12-2b153b2afcce"
+```
+
+**Pre-condition:** T04 done (Gitea on external DB; Helm values updated).
+
+```bash
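+# In railiance-cluster (git mv cannot cross repository boundaries, so copy, then remove):
+cp helm/gitea-values.sops.yaml ../railiance-apps/helm/gitea-values.sops.yaml
+git rm helm/gitea-values.sops.yaml
+git -C ../railiance-apps add helm/gitea-values.sops.yaml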
+```
+
+Add to `railiance-apps/Makefile`:
+```makefile
+SHELL := /bin/bash   # the <(...) process substitution below needs bash, not /bin/sh
+
+gitea-deploy: ## Deploy / upgrade Gitea (S5 workload)
+	helm upgrade --install gitea gitea-charts/gitea \
+		-f <(sops -d helm/gitea-values.sops.yaml) \
+		--namespace gitea --create-namespace
+
+gitea-status: ## Check Gitea health
+	kubectl get pods -n gitea
+	kubectl cnpg status gitea-db -n databases
+```
+
+Add tombstone in `railiance-cluster/helm/MOVED.md`:
+```
+gitea-values.sops.yaml → railiance-apps/helm/ (2026-03-xx, RAIL-HO-WP-0004-T06)
+```
+
+Update `railiance-cluster/SCOPE.md` to remove Gitea boundary violation note.
+Update `railiance-apps/SCOPE.md` Current State to reflect resolved violation.
+
+**Done when:** `make gitea-deploy` from railiance-apps converges correctly;
+Gitea operational; tombstone in place in railiance-cluster.
+
+---
+
+### T07 — SSH remote for git operations from CoulombCore
+
+```task
+id: RAIL-HO-WP-0004-T07
+status: done
+priority: high
+state_hub_task_id: "3d76754d-2dc0-4fe5-8bf2-c74d77cebe36"
+```
+
+CoulombCore cannot push to Gitea via HTTP (NodePort hairpin, no stored
+credentials). Fix by configuring SSH-based remotes for all repos on CoulombCore.
+
+Steps:
+1. Generate an SSH key for the `tegwick` user on CoulombCore if not present:
+   ```bash
+   ssh-keygen -t ed25519 -C "tegwick@coulombcore" -f ~/.ssh/id_ed25519_gitea
+   ```
+2. Add the public key to Gitea (coulomb user or a dedicated `coulombcore` bot
+   account via Gitea admin UI or API).
+3. Add SSH config on CoulombCore:
+   ```
+   # ~/.ssh/config
+   Host gitea-local
+       HostName localhost
+       Port <Gitea SSH NodePort>
+       User git
+       IdentityFile ~/.ssh/id_ed25519_gitea
+   ```
+   Note: Gitea exposes SSH on a NodePort (check current value: `kubectl get svc -n gitea`).
+4. Update remotes for all repos on CoulombCore:
+   ```bash
+   git remote set-url origin ssh://git@gitea-local/coulomb/<repo>.git
+   ```
+5. Test: `git push origin main` from a repo on CoulombCore.
+
+Codify the SSH key deployment step into Ansible
+(`roles/base` or `roles/git-access`): ensure the key is present and the SSH
+config block is templated.
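+
+A rough sketch of that Ansible step is shown below. It assumes the home
+directory `/home/tegwick`, a hypothetical inventory variable
+`gitea_ssh_nodeport` holding the NodePort, and the `community.crypto`
+collection being available. Registering the public key in Gitea (step 2) is
+not covered.
+
+```yaml
+# roles/git-access/tasks/main.yml (hypothetical path)
+- name: Generate the Gitea SSH key for tegwick if absent
+  community.crypto.openssh_keypair:
+    path: /home/tegwick/.ssh/id_ed25519_gitea
+    type: ed25519
+    comment: tegwick@coulombcore
+    owner: tegwick
+    group: tegwick
+    mode: "0600"
+
+- name: Manage the gitea-local block in the SSH config
+  ansible.builtin.blockinfile:
+    path: /home/tegwick/.ssh/config
+    create: true
+    owner: tegwick
+    group: tegwick
+    mode: "0600"
+    marker: "# {mark} ANSIBLE MANAGED: gitea-local"
+    block: |
+      Host gitea-local
+          HostName localhost
+          Port {{ gitea_ssh_nodeport }}
+          User git
+          IdentityFile ~/.ssh/id_ed25519_gitea
+```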
+
+**Done when:** `git push` from CoulombCore to Gitea succeeds over SSH without
+prompts; Ansible role deploys the key idempotently.
+
+---
+
+### T08 — Automated stack deploy documentation + Makefile
+
+```task
+id: RAIL-HO-WP-0004-T08
+status: todo
+priority: medium
+state_hub_task_id: "b076e540-2d81-4be8-a454-61cfd329bb05"
+```
+
+Write `railiance-infra/docs/deploy-stack.md` — the operator runbook for
+standing up the full Railiance stack from scratch. This is the canonical
+"I have a clean server, now what?" reference.
+
+Standard sequence:
+```
+S1: make tf-apply && make converge && make verify  (railiance-infra)
+S2: make converge && make smoke                    (railiance-cluster)
+S3: make db-deploy && make valkey-deploy           (railiance-platform)
+S4: (ArgoCD already at cluster level; no S4 workplan yet)
+S5: make gitea-deploy                              (railiance-apps)
+    make state-hub-deploy                          (railiance-apps, T09)
+    make activity-core-deploy                      (railiance-apps, T10)
+```
+
+Add a `make deploy-stack` target in `railiance-infra/Makefile` that prints
+the ordered sequence with per-step instructions (not a single runaway script —
+operator confirms each layer before proceeding).
+
+Document:
+- Pre-conditions checklist (Hetzner/HostEurope creds, age key, SOPS key)
+- State Hub tunnel bring-up (ops-bridge)
+- Recovery runbook pointer (INC-002 pattern)
+
+**Done when:** `docs/deploy-stack.md` accurate and reviewed; `make deploy-stack`
+prints the sequence; a new operator could follow it end-to-end without prior
+context.
+
+---
+
+### T09 — Deploy state-hub to cluster (S5)
+
+```task
+id: RAIL-HO-WP-0004-T09
+status: todo
+priority: medium
+state_hub_task_id: "d2afe78a-eb51-4ce9-b332-f181323d2370"
+```
+
+**Pre-condition:** T04 done (cnpg Gitea DB working); T08 done (deploy sequence
+documented). State-hub needs a PostgreSQL database — use a cnpg cluster in
+`databases` namespace.
+
+Steps:
+1. Define `state-hub-db` cnpg Cluster in `railiance-platform` (same pattern as T03).
+2. Create a container image for state-hub (Dockerfile in `the-custodian/state-hub/`).
+3. Push image to Gitea's container registry (or ghcr.io as interim).
+4. Write Helm chart or plain manifests in `railiance-apps/apps/state-hub/`:
+   - Deployment (state-hub API, port 8000)
+   - Service + Ingress (https://state-hub.<domain>)
+   - ConfigMap for environment (DB URL, etc.)
+   - Secret for DB credentials (SOPS-managed)
+5. Migrate data: `pg_dump -Fc` from workstation postgres → `pg_restore` into
+   cnpg cluster (`pg_restore` needs a custom-format dump, not plain SQL).
+6. Update ops-bridge tunnel targets if the state-hub URL changes.
+7. Update `~/.claude/CLAUDE.md` global instructions to point to cluster URL.
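+
+The Deployment from step 4 might look something like the sketch below. Only
+port 8000 and the `/state/health` path come from this workplan; the namespace,
+image reference, ConfigMap name, and Secret name are placeholders.
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: state-hub
+  namespace: state-hub                 # assumed namespace
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: state-hub
+  template:
+    metadata:
+      labels:
+        app: state-hub
+    spec:
+      containers:
+        - name: state-hub
+          image: registry.example.invalid/state-hub:latest   # placeholder image
+          ports:
+            - containerPort: 8000
+          envFrom:
+            - configMapRef:
+                name: state-hub-env               # hypothetical ConfigMap
+            - secretRef:
+                name: state-hub-db-credentials    # hypothetical SOPS-managed Secret
+          readinessProbe:
+            httpGet:
+              path: /state/health
+              port: 8000
+```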
+
+**Done when:** `curl https://state-hub.<domain>/state/health` returns healthy;
+all MCP tools functional; workstation state-hub can be decommissioned.
+
+---
+
+### T10 — Deploy activity-core to cluster (S5)
+
+```task
+id: RAIL-HO-WP-0004-T10
+status: todo
+priority: low
+state_hub_task_id: "34d73215-f016-4750-8da5-69f82d63d619"
+```
+
+**Pre-condition:** T09 done (state-hub on cluster operational).
+
+Activity-core is the application currently running ad hoc on CoulombCore (its
+stack, whether Rails, Go, or something else, still needs to be confirmed from
+the repo). This task packages and deploys it as a proper S5 workload.
+
+Steps:
+1. Verify activity-core has a working Dockerfile (check repo).
+2. Define a cnpg database cluster for activity-core in `railiance-platform`
+   (if it uses postgres).
+3. Write Helm chart / manifests in `railiance-apps/apps/activity-core/`.
+4. Migrate any existing data from the ad-hoc CoulombCore deployment.
+5. Add to `railiance-apps/Makefile`:
+   ```makefile
+   activity-core-deploy: ## Deploy activity-core to cluster
+   activity-core-status: ## Check activity-core health
+   ```
+6. Remove or archive the ad-hoc CoulombCore deployment.
+
+**Done when:** Activity-core accessible at its cluster URL; no ad-hoc process
+remaining on CoulombCore; all prior functionality intact.
+
+---
+
+## Phasing and dependencies
+
+```
+T01 (swap) ─┐
+T02 (nproc) ─┴─ independent, can parallelize
+
+T03 (cnpg cluster def) ──► T04 (migrate Gitea DB) ──► T05 (Valkey standalone) ──► T06 (move Gitea to S5)
+
+T07 (SSH remotes) ─ independent, unblock early
+
+T08 (deploy docs) ─ can be written in parallel with T03-T06
+
+T09 (state-hub on cluster) ─ needs T04 (DB working) + T08 (deploy pattern)
+T10 (activity-core) ─ needs T09
+```
+
+Recommended order: T07 → T01+T02 → T03 → T04 → T05 → T06 → T08 → T09 → T10
+
+## References
+
+- ADR-003 (OAS boundary rules): `railiance-infra/docs/adr/ADR-003-railiance-5repo-stack-architecture.md`
+- ADR-004 (connectivity-first): `the-custodian/canon/architecture/adr-004-connectivity-first-network-posture.md`
+- INC-002 (overload incident): `the-custodian/ops/incidents/2026-03-26-coulombcore-runaway-agent-overload.md`
+- Superseded: `railiance-platform/workplans/RAIL-PL-WP-0001-platform-baseline.md`
+- ops-bridge runbook: `ops-bridge/docs/`