diff --git a/workplans/RAIL-HO-WP-0004-production-readiness.md b/workplans/RAIL-HO-WP-0004-production-readiness.md
new file mode 100644
index 0000000..e8980cf
--- /dev/null
+++ b/workplans/RAIL-HO-WP-0004-production-readiness.md
@@ -0,0 +1,516 @@
+---
+id: RAIL-HO-WP-0004
+type: workplan
+title: "Railiance Production Readiness — Automated, Reproducible Stack"
+domain: railiance
+repo: railiance-infra
+status: active
+owner: worsch
+topic_slug: railiance
+created: "2026-03-26"
+updated: "2026-03-27"
+supersedes: RAIL-PL-WP-0001
+state_hub_workstream_id: "cee078e9-b18c-4f84-8a8a-6f27c2f9f407"
+---
+
+# Railiance Production Readiness — Automated, Reproducible Stack
+
+## Goal
+
+Make the Railiance cluster fully reproducible from a clean server — zero manual
+interventions required. Then migrate operational workloads (state-hub,
+activity-core) from the workstation and ad-hoc CoulombCore setup onto the
+cluster, with confidence that a rebuild is always a `make deploy-stack` away.
+
+This workplan supersedes `RAIL-PL-WP-0001` (which targeted Bitnami
+postgresql-ha; cnpg is now the deployed and active operator).
+
+## Why now
+
+Three forcing functions are converging:
+1. **CoulombCore hardening applied manually** (swapfile, nproc limits, systemd
+   slice) after INC-002. These must be in Ansible before the next node
+   rebuild — or the next operator overstep will repeat the incident.
+2. **cnpg is deployed** (cnpg-system namespace, databases namespace active).
+   WP-0001 targeted Bitnami postgresql-ha which is now stale. A clean platform
+   baseline must match reality.
+3. **State-hub and activity-core live on the workstation** — fragile,
+   non-redundant, not self-documenting. Moving them to the cluster is the last
+   step to making Railiance the durable operational home it was designed to be.
+
+## Current deployed state (reference snapshot 2026-03-26)
+
+| Component | Namespace | Manager | Boundary status |
+|-----------|-----------|---------|-----------------|
+| cert-manager | cert-manager | Helm S2 | ✓ correct |
+| CloudNative PG operator | cnpg-system | Helm S2 | boundary violation: operator is S3 concern |
+| nginx ingress | ingress-nginx | Helm S2 | ✓ correct |
+| Gitea 12.5.0 | gitea | Helm S2 | boundary violation: should be S5 |
+| ArgoCD | argocd | kubectl S2 | boundary violation: S4 concern |
+| SSO/MFA stack | mfa + sso | ? | per net-kingdom |
+| cnpg databases | databases | kubectl S3 | ✓ correct layer, no cluster defined yet |
+
+## Scope
+
+### Phase 1 — Ansible-codify server hardening (S1)
+
+All manual CoulombCore interventions from INC-002 must become Ansible roles
+so they survive node rebuild. No more drift between code and reality.
+
+### Phase 2 — S3 platform baseline with cnpg (supersedes WP-0001)
+
+Define a cnpg `Cluster` resource for the Gitea database in `railiance-platform`.
+Migrate Gitea from its built-in postgresql-ha subchart to this cluster.
+Codify Valkey as a standalone S3 Helm release.
+
+### Phase 3 — S2 boundary cleanup
+
+Move `gitea-values.sops.yaml` from `railiance-cluster` to `railiance-apps`.
+Document remaining boundary violations (cnpg operator in S2, ArgoCD in S2)
+and create forward-dated migration stubs.
+
+### Phase 4 — Git operations from CoulombCore
+
+CoulombCore cannot push to Gitea via HTTP (NodePort hairpin). Configure SSH
+remote so all on-cluster git operations use SSH.
+
+### Phase 5 — Automated stack deploy
+
+Write a `deploy-stack` target (or script) that converges S1→S5 in dependency
+order. The goal: a new operator can onboard a server and reach a working
+cluster with one command sequence.
+
+### Phase 6 — Migrate operational workloads (S5)
+
+Deploy state-hub and activity-core to the cluster. This is the payoff phase —
+the cluster becomes the operational home, not the workstation.
+
+## Pre-conditions
+
+- railiance-cluster converged and all S2 workplans done (they are: ✓)
+- Gitea operational (it is: ✓, gitea namespace running)
+- ops-bridge state-hub tunnel active (bridge up state-hub-coulombcore)
+- Active backup before any phase touching live data (make backup in railiance-cluster)
+
+---
+
+## Tasks
+
+### T01 — Ansible: swapfile role for CoulombCore
+
+```task
+id: RAIL-HO-WP-0004-T01
+status: todo
+priority: high
+state_hub_task_id: "7c586940-f7b8-4e55-b1d6-72eba6a675b7"
+```
+
+Create an Ansible role `swapfile` (or extend `roles/base`) that provisions the
+4 GB swapfile applied manually after INC-002.
+
+Desired state:
+```yaml
+# inventory/host_vars/coulombcore.yml (or group_vars)
+swap_size_gb: 4
+swap_swappiness: 10
+```
+
+Role tasks:
+1. Check `/swapfile` existence + correct size (fallocate idempotent)
+2. `chmod 600 /swapfile`, `mkswap`, `swapon` if not already active
+3. Ensure `/etc/fstab` entry present
+4. Set `vm.swappiness=10` via `sysctl` module (persist in `/etc/sysctl.d/`)
+
+**Done when:** `make converge` is idempotent; `free -h` on CoulombCore shows
+4 GB swap; `make verify` passes.
+
+---
+
+### T02 — Ansible: nproc limits and systemd user slice
+
+```task
+id: RAIL-HO-WP-0004-T02
+status: todo
+priority: high
+state_hub_task_id: "42f1f02b-0d8b-432c-8bc8-4930417e15dd"
+```
+
+Codify the PAM nproc limits and systemd user slice hardening applied after
+INC-002 into Ansible (role `security` or a new `resource-limits` role).
+
+Desired state:
+```yaml
+nproc_soft: 512
+nproc_hard: 1024
+user_memory_max: "1500M"
+user_memory_swap_max: "512M"
+```
+
+Tasks:
+1. Template `/etc/security/limits.conf` entry for tegwick (nproc soft/hard)
+2. Create `/etc/systemd/system/user-1000.slice.d/limits.conf` via template
+3. `systemctl daemon-reload` handler
+
+**Done when:** `make converge` idempotent; `cat /proc//limits`
+reflects caps; `make verify` passes; Goss test for nproc limit added.
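The T02 "Done when" check on the limits file could be scripted along these lines. This is a minimal sketch, not part of the workplan: the helper names `nproc_limits` and `check_nproc` are hypothetical, and it assumes the usual kernel layout of the limits file, where the `Max processes` row carries the soft limit in the third column and the hard limit in the fourth.

```bash
# Sketch only: helper names are hypothetical, not existing repo tooling.

# Print "<soft> <hard>" from the "Max processes" row of a limits file
# (same layout as the per-process limits file under /proc).
nproc_limits() {
  awk '/^Max processes/ { print $3, $4 }' "$1"
}

# Succeed only if the file reports exactly the expected soft and hard caps.
# Usage: check_nproc <limits-file> <soft> <hard>
check_nproc() {
  [ "$(nproc_limits "$1")" = "$2 $3" ]
}
```

Run as the limited user, for example `check_nproc /proc/self/limits 512 1024`, this returns success only when both caps from the desired state are in effect, which makes it easy to wrap in a Goss `command` test or a `make verify` step.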
+
+---
+
+### T03 — Define Gitea cnpg database cluster in railiance-platform
+
+```task
+id: RAIL-HO-WP-0004-T03
+status: todo
+priority: high
+state_hub_task_id: "8e8cff04-96c6-4386-8caa-b0586114a49d"
+```
+
+Mark `RAIL-PL-WP-0001` as superseded (update its status field). Then define
+the Gitea database cluster using CloudNative PG in `railiance-platform`.
+
+Files to create in `railiance-platform/`:
+```
+helm/gitea-db-cluster.yaml   # cnpg Cluster manifest (SOPS-encrypted secrets inline or ref)
+Makefile targets: db-deploy, db-status, db-shell
+```
+
+Cluster manifest skeleton:
+```yaml
+apiVersion: postgresql.cnpg.io/v1
+kind: Cluster
+metadata:
+  name: gitea-db
+  namespace: databases
+spec:
+  instances: 1          # single-node to start; bump to 3 when RAM allows
+  # cnpg has no spec.postgresql.version; the major version comes from the image tag
+  imageName: ghcr.io/cloudnative-pg/postgresql:16
+  storage:
+    size: 10Gi
+  bootstrap:
+    initdb:
+      database: gitea
+      owner: gitea
+      secret:
+        name: gitea-db-credentials   # k8s Secret (SOPS-managed)
+```
+
+Add `make db-deploy` target that applies the manifest to the `databases`
+namespace. Add `make db-status` that shows cluster health via `kubectl cnpg
+status`.
+
+**Done when:** `make db-deploy` succeeds; `kubectl get cluster -n databases`
+shows `gitea-db` in `Cluster in healthy state`; credentials secret present.
+
+---
+
+### T04 — Migrate Gitea to external cnpg database
+
+```task
+id: RAIL-HO-WP-0004-T04
+status: todo
+priority: high
+state_hub_task_id: "4f4196b5-4d84-4648-b470-e6941444ea46"
+```
+
+**Pre-condition:** T03 done and gitea-db cluster healthy.
+
+Migration steps (execute from CoulombCore with kubectl access):
+
+1. Backup: `make backup` in railiance-cluster — verify success.
+2. Dump current Gitea DB:
+   ```bash
+   kubectl exec -n gitea deploy/gitea -- \
+     pg_dump -h localhost -U gitea gitea > /tmp/gitea-dump.sql
+   ```
+   (Gitea's built-in postgresql-ha is at localhost within the pod)
+3. Restore into cnpg cluster (`-i` is required so `kubectl exec` forwards stdin):
+   ```bash
+   kubectl exec -i -n databases gitea-db-1 -- \
+     psql -U gitea gitea < /tmp/gitea-dump.sql
+   ```
+4. Update Gitea Helm values to disable subchart and point to cnpg:
+   ```yaml
+   postgresql-ha:
+     enabled: false
+   redis-cluster:
+     enabled: false   # Valkey handled in T05
+   gitea:
+     config:
+       database:
+         DB_TYPE: postgres
+         HOST: gitea-db-rw.databases.svc.cluster.local:5432
+         NAME: gitea
+         USER: gitea
+         PASSWD:
+   ```
+5. `helm upgrade gitea` — verify login and all repos intact.
+6. Confirm old postgresql-ha pods are terminated.
+
+**Done when:** Gitea login works; all repos accessible; no postgresql-ha pods
+running; `kubectl cnpg status gitea-db -n databases` healthy.
+
+---
+
+### T05 — Codify Valkey as standalone S3 asset
+
+```task
+id: RAIL-HO-WP-0004-T05
+status: todo
+priority: medium
+state_hub_task_id: "36c66ceb-ebc9-425a-a329-c37496278c6b"
+```
+
+Create `railiance-platform/helm/valkey-values.sops.yaml` and deploy Valkey
+as a standalone Helm release in the `platform` namespace (independent of Gitea
+subchart).
+
+```yaml
+# helm/valkey-values.sops.yaml
+auth:
+  enabled: true
+  password: ENC[age,...]
+replica:
+  replicaCount: 1
+persistence:
+  enabled: true
+  size: 2Gi
+```
+
+Add `make valkey-deploy` and `make valkey-status` to `railiance-platform/Makefile`.
+
+Update Gitea Helm values to point to standalone Valkey:
+```yaml
+redis-cluster:
+  enabled: false
+gitea:
+  config:
+    cache:
+      ADAPTER: redis
+      HOST: redis://:password@valkey.platform.svc.cluster.local:6379/0
+    session:
+      PROVIDER: redis
+      PROVIDER_CONFIG: redis://:password@valkey.platform.svc.cluster.local:6379/1
+    queue:
+      TYPE: redis
+      CONN_STR: redis://:password@valkey.platform.svc.cluster.local:6379/2
+```
+
+**Done when:** `make valkey-deploy` succeeds; Gitea session/cache operational
+on standalone Valkey; no redis subchart pods running.
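The three Gitea connection strings in T05 differ only in the logical database index (0 for cache, 1 for session, 2 for queue). A small sketch of generating them from one password is shown here; the helper names `valkey_url` and `gitea_valkey_urls` are hypothetical, the host and port are copied from the values in T05, and the password is assumed to be URL-safe (no percent-encoding is done).

```bash
# Sketch only: hypothetical helpers, not part of the repo.

# Build the connection string for one logical Valkey database.
# Usage: valkey_url <password> <db-index>
valkey_url() {
  printf 'redis://:%s@valkey.platform.svc.cluster.local:6379/%s\n' "$1" "$2"
}

# Print the cache/session/queue URLs, using the db indices from T05.
# Usage: gitea_valkey_urls <password>
gitea_valkey_urls() {
  printf 'cache=%s\n'   "$(valkey_url "$1" 0)"
  printf 'session=%s\n' "$(valkey_url "$1" 1)"
  printf 'queue=%s\n'   "$(valkey_url "$1" 2)"
}
```

Deriving all three strings from a single decrypted secret keeps the password in one place and reduces the chance of the cache, session, and queue entries drifting apart when it is rotated.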
+
+---
+
+### T06 — Move Gitea Helm values to railiance-apps (boundary fix)
+
+```task
+id: RAIL-HO-WP-0004-T06
+status: todo
+priority: medium
+state_hub_task_id: "6d8323b3-e842-4dc1-9a12-2b153b2afcce"
+```
+
+**Pre-condition:** T04 done (Gitea on external DB; Helm values updated).
+
+```bash
+# In railiance-cluster. git mv cannot cross repositories: copy, remove here, add there.
+cp helm/gitea-values.sops.yaml ../railiance-apps/helm/gitea-values.sops.yaml
+git rm helm/gitea-values.sops.yaml
+git -C ../railiance-apps add helm/gitea-values.sops.yaml
+```
+
+Add to `railiance-apps/Makefile`:
+```makefile
+# Note: the <(...) process substitution below needs SHELL := /bin/bash in this Makefile.
+gitea-deploy:  ## Deploy / upgrade Gitea (S5 workload)
+	helm upgrade --install gitea gitea-charts/gitea \
+	  -f <(sops -d helm/gitea-values.sops.yaml) \
+	  --namespace gitea --create-namespace
+
+gitea-status:  ## Check Gitea health
+	kubectl get pods -n gitea
+	kubectl cnpg status gitea-db -n databases
+```
+
+Add tombstone in `railiance-cluster/helm/MOVED.md`:
+```
+gitea-values.sops.yaml → railiance-apps/helm/ (2026-03-xx, RAIL-HO-WP-0004-T06)
+```
+
+Update `railiance-cluster/SCOPE.md` to remove Gitea boundary violation note.
+Update `railiance-apps/SCOPE.md` Current State to reflect resolved violation.
+
+**Done when:** `make gitea-deploy` from railiance-apps converges correctly;
+Gitea operational; tombstone in place in railiance-cluster.
+
+---
+
+### T07 — SSH remote for git operations from CoulombCore
+
+```task
+id: RAIL-HO-WP-0004-T07
+status: done
+priority: high
+state_hub_task_id: "3d76754d-2dc0-4fe5-8bf2-c74d77cebe36"
+```
+
+CoulombCore cannot push to Gitea via HTTP (NodePort hairpin, no stored
+credentials). Fix by configuring SSH-based remotes for all repos on CoulombCore.
+
+Steps:
+1. Generate an SSH key for the `tegwick` user on CoulombCore if not present:
+   ```bash
+   ssh-keygen -t ed25519 -C "tegwick@coulombcore" -f ~/.ssh/id_ed25519_gitea
+   ```
+2. Add the public key to Gitea (coulomb user or a dedicated `coulombcore` bot
+   account via Gitea admin UI or API).
+3. Add SSH config on CoulombCore:
+   ```
+   # ~/.ssh/config
+   Host gitea-local
+     HostName localhost
+     Port
+     User git
+     IdentityFile ~/.ssh/id_ed25519_gitea
+   ```
+   Note: Gitea exposes SSH on a NodePort (check current value: `kubectl get svc -n gitea`).
+4. Update remotes for all repos on CoulombCore:
+   ```bash
+   git remote set-url origin ssh://git@gitea-local/coulomb/.git
+   ```
+5. Test: `git push origin main` from a repo on CoulombCore.
+
+Codify the SSH key deployment step into Ansible
+(`roles/base` or `roles/git-access`): ensure the key is present and the SSH
+config block is templated.
+
+**Done when:** `git push` from CoulombCore to Gitea succeeds over SSH without
+prompts; Ansible role deploys the key idempotently.
+
+---
+
+### T08 — Automated stack deploy documentation + Makefile
+
+```task
+id: RAIL-HO-WP-0004-T08
+status: todo
+priority: medium
+state_hub_task_id: "b076e540-2d81-4be8-a454-61cfd329bb05"
+```
+
+Write `railiance-infra/docs/deploy-stack.md` — the operator runbook for
+standing up the full Railiance stack from scratch. This is the canonical
+"I have a clean server, now what?" reference.
+
+Standard sequence:
+```
+S1: make tf-apply && make converge && make verify   (railiance-infra)
+S2: make converge && make smoke                     (railiance-cluster)
+S3: make db-deploy && make valkey-deploy            (railiance-platform)
+S4: (ArgoCD already at cluster level; no S4 workplan yet)
+S5: make gitea-deploy                               (railiance-apps)
+    make state-hub-deploy                           (railiance-apps, T09)
+    make activity-core-deploy                       (railiance-apps, T10)
+```
+
+Add a `make deploy-stack` target in `railiance-infra/Makefile` that prints
+the ordered sequence with per-step instructions (not a single runaway script —
+operator confirms each layer before proceeding).
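One possible shape for the script behind `make deploy-stack` is sketched below. It only prints the sequence listed above and, when asked, pauses for confirmation after each layer; it never runs the commands itself. The function name `print_stack`, the `interactive` argument, and the `layer|repo|command` table format are assumptions made for illustration.

```bash
# Sketch only: prints the S1-S5 sequence; it does not execute any step.
# Usage: print_stack              -> print all layers
#        print_stack interactive  -> ask for confirmation after each layer
print_stack() {
  mode="${1:-}"
  # The table is read on fd 3 so stdin stays free for the operator's answers.
  while IFS='|' read -r layer repo cmd <&3; do
    printf '%s (%s): %s\n' "$layer" "$repo" "$cmd"
    if [ "$mode" = "interactive" ]; then
      printf 'Layer %s done? [y/N] ' "$layer"
      read -r reply
      [ "$reply" = "y" ] || { echo "Stopping before the next layer."; return 1; }
    fi
  done 3<<'EOF'
S1|railiance-infra|make tf-apply && make converge && make verify
S2|railiance-cluster|make converge && make smoke
S3|railiance-platform|make db-deploy && make valkey-deploy
S4|none|ArgoCD already at cluster level; no S4 workplan yet
S5|railiance-apps|make gitea-deploy
S5|railiance-apps|make state-hub-deploy
S5|railiance-apps|make activity-core-deploy
EOF
}
```

A `deploy-stack` target could source this file and call `print_stack interactive`. Keeping the steps as data in one table means the runbook and the printed sequence can be checked against each other in review.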
+
+Document:
+- Pre-conditions checklist (Hetzner/HostEurope creds, age key, SOPS key)
+- State Hub tunnel bring-up (ops-bridge)
+- Recovery runbook pointer (INC-002 pattern)
+
+**Done when:** `docs/deploy-stack.md` accurate and reviewed; `make deploy-stack`
+prints the sequence; a new operator could follow it end-to-end without prior
+context.
+
+---
+
+### T09 — Deploy state-hub to cluster (S5)
+
+```task
+id: RAIL-HO-WP-0004-T09
+status: todo
+priority: medium
+state_hub_task_id: "d2afe78a-eb51-4ce9-b332-f181323d2370"
+```
+
+**Pre-condition:** T04 done (cnpg Gitea DB working); T08 done (deploy sequence
+documented). State-hub needs a PostgreSQL database — use a cnpg cluster in
+`databases` namespace.
+
+Steps:
+1. Define `state-hub-db` cnpg Cluster in `railiance-platform` (same pattern as T03).
+2. Create a container image for state-hub (Dockerfile in `the-custodian/state-hub/`).
+3. Push image to Gitea's container registry (or ghcr.io as interim).
+4. Write Helm chart or plain manifests in `railiance-apps/apps/state-hub/`:
+   - Deployment (state-hub API, port 8000)
+   - Service + Ingress (https://state-hub.)
+   - ConfigMap for environment (DB URL, etc.)
+   - Secret for DB credentials (SOPS-managed)
+5. Migrate data: `pg_dump` from workstation postgres → `pg_restore` into
+   cnpg cluster.
+6. Update ops-bridge tunnel targets if the state-hub URL changes.
+7. Update `~/.claude/CLAUDE.md` global instructions to point to cluster URL.
+
+**Done when:** `curl https://state-hub./state/health` returns healthy;
+all MCP tools functional; workstation state-hub can be decommissioned.
+
+---
+
+### T10 — Deploy activity-core to cluster (S5)
+
+```task
+id: RAIL-HO-WP-0004-T10
+status: todo
+priority: low
+state_hub_task_id: "34d73215-f016-4750-8da5-69f82d63d619"
+```
+
+**Pre-condition:** T09 done (state-hub on cluster operational).
+
+Activity-core is the Rails/Go/other application running on CoulombCore ad-hoc.
+This task packages and deploys it as a proper S5 workload.
+
+Steps:
+1. Verify activity-core has a working Dockerfile (check repo).
+2. Define a cnpg database cluster for activity-core in `railiance-platform`
+   (if it uses postgres).
+3. Write Helm chart / manifests in `railiance-apps/apps/activity-core/`.
+4. Migrate any existing data from the ad-hoc CoulombCore deployment.
+5. Add to `railiance-apps/Makefile`:
+   ```makefile
+   activity-core-deploy:  ## Deploy activity-core to cluster
+   activity-core-status:  ## Check activity-core health
+   ```
+6. Remove or archive the ad-hoc CoulombCore deployment.
+
+**Done when:** Activity-core accessible at its cluster URL; no ad-hoc process
+remaining on CoulombCore; all prior functionality intact.
+
+---
+
+## Phasing and dependencies
+
+```
+T01 (swap)  ─┐
+T02 (nproc) ─┴─ independent, can parallelize
+
+T03 (cnpg cluster def) ──► T04 (migrate Gitea DB) ──► T05 (Valkey standalone) ──► T06 (move Gitea to S5)
+
+T07 (SSH remotes) ─ independent, unblock early
+
+T08 (deploy docs) ─ can be written in parallel with T03-T06
+
+T09 (state-hub on cluster) ─ needs T04 (DB working) + T08 (deploy pattern)
+T10 (activity-core) ─ needs T09
+```
+
+Recommended order: T07 → T01+T02 → T03 → T04 → T05 → T06 → T08 → T09 → T10
+
+## References
+
+- ADR-003 (OAS boundary rules): `railiance-infra/docs/adr/ADR-003-railiance-5repo-stack-architecture.md`
+- ADR-004 (connectivity-first): `the-custodian/canon/architecture/adr-004-connectivity-first-network-posture.md`
+- INC-002 (overload incident): `the-custodian/ops/incidents/2026-03-26-coulombcore-runaway-agent-overload.md`
+- Superseded: `railiance-platform/workplans/RAIL-PL-WP-0001-platform-baseline.md`
+- ops-bridge runbook: `ops-bridge/docs/`