railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md
2026-05-03 00:03:29 +02:00


id: RAIL-HO-WP-0004
type: workplan
title: Railiance Production Readiness — Automated, Reproducible Stack
domain: railiance
repo: railiance-infra
status: active
owner: worsch
topic_slug: railiance
created: 2026-03-26
updated: 2026-05-02
supersedes: RAIL-PL-WP-0001
state_hub_workstream_id: cee078e9-b18c-4f84-8a8a-6f27c2f9f407

Railiance Production Readiness — Automated, Reproducible Stack

Goal

Make the Railiance cluster fully reproducible from a clean server — zero manual interventions required. Then migrate operational workloads (state-hub, activity-core) from the workstation and ad-hoc CoulombCore setup onto the cluster, with confidence that a rebuild is always a make deploy-stack away.

This workplan supersedes RAIL-PL-WP-0001 (which targeted Bitnami postgresql-ha; cnpg is now the deployed and active operator).

Why now

Three forcing functions are converging:

  1. CoulombCore hardening (swapfile, nproc limits, systemd slice) was applied manually after INC-002. It must be codified in Ansible before the next node rebuild; otherwise the rebuild drops the hardening and the next operator overstep will repeat the incident.
  2. cnpg is deployed (cnpg-system namespace; databases namespace active). WP-0001 targeted Bitnami postgresql-ha, which is now stale. A clean platform baseline must match reality.
  3. State-hub and activity-core live on the workstation: fragile, non-redundant, not self-documenting. Moving them to the cluster is the last step toward making Railiance the durable operational home it was designed to be.

Current deployed state (reference snapshot 2026-03-26)

| Component | Namespace | Manager | Boundary status |
|---|---|---|---|
| cert-manager | cert-manager | Helm | S2 ✓ correct |
| CloudNative PG operator | cnpg-system | Helm | S2 boundary violation: operator is S3 concern |
| nginx ingress | ingress-nginx | Helm | S2 ✓ correct |
| Gitea 12.5.0 | gitea | Helm | S2 boundary violation: should be S5 |
| ArgoCD | argocd | kubectl | S2 boundary violation: S4 concern |
| SSO/MFA stack | mfa + sso | ? | per net-kingdom |
| cnpg databases | databases | kubectl | S3 ✓ correct layer, no cluster defined yet |

Scope

Phase 1 — Ansible-codify server hardening (S1)

All manual CoulombCore interventions from INC-002 must become Ansible roles so they survive node rebuild. No more drift between code and reality.

Phase 2 — S3 platform baseline with cnpg (supersedes WP-0001)

Define a cnpg Cluster resource for the Gitea database in railiance-platform. Migrate Gitea from its built-in postgresql-ha subchart to this cluster. Codify Valkey as a standalone S3 Helm release.

Phase 3 — S2 boundary cleanup

Move gitea-values.sops.yaml from railiance-cluster to railiance-apps. Document remaining boundary violations (cnpg operator in S2, ArgoCD in S2) and create forward-dated migration stubs.

Phase 4 — Git operations from CoulombCore

CoulombCore cannot push to Gitea via HTTP (NodePort hairpin). Configure SSH remote so all on-cluster git operations use SSH.

Phase 5 — Automated stack deploy

Write a deploy-stack target (or script) that converges S1→S5 in dependency order. The goal: a new operator can onboard a server and reach a working cluster with one command sequence.

Phase 6 — Migrate operational workloads (S5)

Deploy state-hub and activity-core to the cluster. This is the payoff phase — the cluster becomes the operational home, not the workstation.

Pre-conditions

  • railiance-cluster converged and all S2 workplans done (they are: ✓)
  • Gitea operational (it is: ✓, gitea namespace running)
  • ops-bridge state-hub tunnel active (bridge up state-hub-coulombcore)
  • Active backup before any phase touching live data (make backup in railiance-cluster)

Tasks

T01 — Ansible: swapfile role for CoulombCore

id: RAIL-HO-WP-0004-T01
status: done
priority: high
state_hub_task_id: "7c586940-f7b8-4e55-b1d6-72eba6a675b7"

Create an Ansible role swapfile (or extend roles/base) that provisions the 4 GB swapfile applied manually after INC-002.

Desired state:

# inventory/host_vars/coulombcore.yml (or group_vars)
swap_size_gb: 4
swap_swappiness: 10

Role tasks:

  1. Check that /swapfile exists with the correct size; run fallocate only when it is missing or mis-sized, so that reruns are idempotent
  2. chmod 600 /swapfile, mkswap, swapon if not already active
  3. Ensure /etc/fstab entry present
  4. Set vm.swappiness=10 via sysctl module (persist in /etc/sysctl.d/)

Convergence pattern: Ansible is not installed on the workstation. Run convergence directly on CoulombCore (local Ansible, connection=local):

ssh -i ~/.ssh/id_ops tegwick@92.205.130.254 \
  'cd ~/railiance-infra && git pull && ansible-playbook ansible/playbooks/bootstrap.yaml -c local -u tegwick --become -l CoulombCore'

Done when: Convergence runs without errors; free -h on CoulombCore shows 4 GB swap; Goss verify passes.


T02 — Ansible: nproc limits and systemd user slice

id: RAIL-HO-WP-0004-T02
status: done
priority: high
state_hub_task_id: "42f1f02b-0d8b-432c-8bc8-4930417e15dd"

Codify the PAM nproc limits and systemd user slice hardening applied after INC-002 into Ansible (role security or a new resource-limits role).

Desired state:

nproc_soft: 512
nproc_hard: 1024
user_memory_max: "1500M"
user_memory_swap_max: "512M"

Tasks:

  1. Template /etc/security/limits.conf entry for tegwick (nproc soft/hard)
  2. Create /etc/systemd/system/user-1000.slice.d/limits.conf via template
  3. systemctl daemon-reload handler
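
To make the mapping from the desired-state variables to the two files concrete, here is a small sketch that renders their contents. It assumes user_memory_max and user_memory_swap_max map to the systemd MemoryMax and MemorySwapMax directives; the function names are hypothetical, and the real role would use Ansible templates instead.

```sh
# Render the two limits.conf lines for a user.
# Arguments: user, soft nproc limit, hard nproc limit.
render_nproc_limits() {
  printf '%s soft nproc %s\n%s hard nproc %s\n' "$1" "$2" "$1" "$3"
}

# Render the systemd slice drop-in body.
# Arguments: MemoryMax value, MemorySwapMax value.
render_slice_dropin() {
  printf '[Slice]\nMemoryMax=%s\nMemorySwapMax=%s\n' "$1" "$2"
}

# Usage sketch with the values from the desired state above:
#   render_nproc_limits tegwick 512 1024
#   render_slice_dropin 1500M 512M
```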

Done when: make converge is idempotent (a second run reports no changes); cat /proc/<tegwick-pid>/limits reflects the caps; make verify passes; a Goss test for the nproc limit has been added.


T03 — Define Gitea cnpg database cluster in railiance-platform

id: RAIL-HO-WP-0004-T03
status: done
priority: high
state_hub_task_id: "8e8cff04-96c6-4386-8caa-b0586114a49d"

Mark RAIL-PL-WP-0001 as superseded (update its status field). Then define the Gitea database cluster using CloudNative PG in railiance-platform.

Files to create in railiance-platform/:

helm/gitea-db-cluster.yaml       # cnpg Cluster manifest (SOPS-encrypted secrets inline or ref)
Makefile targets: db-deploy, db-status, db-shell

Cluster manifest skeleton:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: gitea-db
  namespace: databases
spec:
  instances: 1           # single-node to start; bump to 3 when RAM allows
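  imageName: ghcr.io/cloudnative-pg/postgresql:16   # cnpg selects the PostgreSQL major version via the image tag; spec.postgresql has no version field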
  storage:
    size: 10Gi
  bootstrap:
    initdb:
      database: gitea
      owner: gitea
      secret:
        name: gitea-db-credentials   # k8s Secret (SOPS-managed)

Add make db-deploy target that applies the manifest to the databases namespace. Add make db-status that shows cluster health via kubectl cnpg status.

Done when: make db-deploy succeeds; kubectl get cluster -n databases reports gitea-db with status "Cluster in healthy state"; credentials secret present.
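
The health condition above can be polled rather than checked by eye. The following is a sketch only: it assumes the cnpg Cluster resource exposes its phase under .status.phase and that kubectl is configured for the cluster; the function name and the default retry values are illustrative.

```sh
# Poll a cnpg Cluster until it reports the healthy phase, or give up.
# Arguments: cluster name, namespace, max attempts (default 30), delay in seconds (default 10).
wait_for_cluster_healthy() (
  name=$1; ns=$2; tries=${3:-30}; delay=${4:-10}
  i=0
  phase=
  while [ "$i" -lt "$tries" ]; do
    phase=$(kubectl get cluster "$name" -n "$ns" -o jsonpath='{.status.phase}' 2>/dev/null)
    [ "$phase" = "Cluster in healthy state" ] && exit 0
    i=$((i + 1))
    sleep "$delay"
  done
  echo "cluster $name not healthy after $tries checks (last phase: ${phase:-unknown})" >&2
  exit 1
)

# Usage sketch:
#   wait_for_cluster_healthy gitea-db databases
```

A helper like this could back the db-status target or gate T04 on T03.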


T04 — Migrate Gitea to external cnpg database

id: RAIL-HO-WP-0004-T04
status: done
priority: high
state_hub_task_id: "4f4196b5-4d84-4648-b470-e6941444ea46"

Pre-condition: T03 done and gitea-db cluster healthy.

Migration steps (execute from CoulombCore with kubectl access):

  1. Backup: make backup in railiance-cluster — verify success.
  2. Dump current Gitea DB:
    kubectl exec -n gitea deploy/gitea -- \
      pg_dump -h localhost -U gitea gitea > /tmp/gitea-dump.sql
    
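    (This assumes the bundled postgresql-ha database is reachable at localhost from inside the Gitea pod and that pg_dump is available there. If the subchart runs PostgreSQL in separate pods, run the dump against that pod or service instead.)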
  3. Restore into cnpg cluster:
    kubectl exec -i -n databases gitea-db-1 -- \
      psql -U gitea gitea < /tmp/gitea-dump.sql
    (-i is required so that kubectl forwards the redirected dump to psql's stdin)
  4. Update Gitea Helm values to disable subchart and point to cnpg:
    postgresql-ha:
      enabled: false
    redis-cluster:
      enabled: false   # Valkey handled in T05
    gitea:
      config:
        database:
          DB_TYPE: postgres
          HOST: gitea-db-rw.databases.svc.cluster.local:5432
          NAME: gitea
          USER: gitea
          PASSWD: <from cnpg secret>
    
  5. helm upgrade gitea — verify login and all repos intact.
  6. Confirm old postgresql-ha pods are terminated.

Done when: Gitea login works; all repos accessible; no postgresql-ha pods running; kubectl cnpg status gitea-db -n databases healthy.
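
Because the dump in step 2 is captured through a kubectl exec redirect, a failed or interrupted dump can leave a partial file behind. A small guard before the restore, sketched here, checks for the completion marker that pg_dump writes at the end of a plain-format dump; the function name is hypothetical.

```sh
# Return 0 only if the dump file is non-empty and ends with pg_dump's completion marker.
# Argument: path to a plain-format SQL dump.
dump_looks_complete() (
  dump=$1
  [ -s "$dump" ] || { echo "dump $dump is missing or empty" >&2; exit 1; }
  tail -n 20 "$dump" | grep -q 'PostgreSQL database dump complete' ||
    { echo "dump $dump has no completion marker" >&2; exit 1; }
)

# Usage sketch, between steps 2 and 3:
#   dump_looks_complete /tmp/gitea-dump.sql || exit 1
```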


T05 — Codify Valkey as standalone S3 asset

id: RAIL-HO-WP-0004-T05
status: done
priority: medium
state_hub_task_id: "36c66ceb-ebc9-425a-a329-c37496278c6b"

Create railiance-platform/helm/valkey-values.sops.yaml and deploy Valkey as a standalone Helm release in the platform namespace (independent of Gitea subchart).

# helm/valkey-values.sops.yaml
auth:
  enabled: true
  password: ENC[age,...]
replica:
  replicaCount: 1
persistence:
  enabled: true
  size: 2Gi

Add make valkey-deploy and make valkey-status to railiance-platform/Makefile.

Update Gitea Helm values to point to standalone Valkey:

redis-cluster:
  enabled: false
gitea:
  config:
    cache:
      ADAPTER: redis
      HOST: redis://:password@valkey.platform.svc.cluster.local:6379/0
    session:
      PROVIDER: redis
      PROVIDER_CONFIG: redis://:password@valkey.platform.svc.cluster.local:6379/1
    queue:
      TYPE: redis
      CONN_STR: redis://:password@valkey.platform.svc.cluster.local:6379/2

Done when: make valkey-deploy succeeds; Gitea session/cache operational on standalone Valkey; no redis subchart pods running.
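
The three connection strings above differ only in the database index, and a password containing characters such as @, : or / would break the URL unless it is percent-encoded. The sketch below builds the URLs from a password; it assumes the consumer decodes percent-encoded userinfo, which should be verified against Gitea before relying on it. The function names are illustrative.

```sh
# Percent-encode a string for use in the userinfo part of a URL.
urlencode() (
  LC_ALL=C; s=$1; out=
  while [ -n "$s" ]; do
    rest=${s#?}; c=${s%"$rest"}; s=$rest
    case $c in
      [A-Za-z0-9._~-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;
    esac
  done
  printf '%s' "$out"
)

# Build a redis:// URL for the standalone Valkey service.
# Arguments: password, host, database index.
valkey_url() {
  printf 'redis://:%s@%s:6379/%s\n' "$(urlencode "$1")" "$2" "$3"
}

# Usage sketch (cache, session, queue use databases 0, 1, 2 as above):
#   valkey_url "$VALKEY_PASSWORD" valkey.platform.svc.cluster.local 0
```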


T06 — Move Gitea Helm values to railiance-apps (boundary fix)

id: RAIL-HO-WP-0004-T06
status: done
priority: medium
state_hub_task_id: "6d8323b3-e842-4dc1-9a12-2b153b2afcce"

Pre-condition: T04 done (Gitea on external DB; Helm values updated).

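# In railiance-cluster (git mv cannot move a file across repository boundaries):
cp helm/gitea-values.sops.yaml ../railiance-apps/helm/gitea-values.sops.yaml
git rm helm/gitea-values.sops.yaml
git -C ../railiance-apps add helm/gitea-values.sops.yaml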

Add to railiance-apps/Makefile:

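SHELL := /bin/bash   # the <(...) process substitution below is not supported by make's default /bin/sh

gitea-deploy: ## Deploy / upgrade Gitea (S5 workload)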
    helm upgrade --install gitea gitea-charts/gitea \
        -f <(sops -d helm/gitea-values.sops.yaml) \
        --namespace gitea --create-namespace

gitea-status: ## Check Gitea health
    kubectl get pods -n gitea
    kubectl cnpg status gitea-db -n databases

Add tombstone in railiance-cluster/helm/MOVED.md:

gitea-values.sops.yaml → railiance-apps/helm/ (2026-03-xx, RAIL-HO-WP-0004-T06)

Update railiance-cluster/SCOPE.md to remove Gitea boundary violation note. Update railiance-apps/SCOPE.md Current State to reflect resolved violation.

Done when: make gitea-deploy from railiance-apps converges correctly; Gitea operational; tombstone in place in railiance-cluster.


T07 — SSH remote for git operations from CoulombCore

id: RAIL-HO-WP-0004-T07
status: done
priority: high
state_hub_task_id: "3d76754d-2dc0-4fe5-8bf2-c74d77cebe36"

CoulombCore cannot push to Gitea via HTTP (NodePort hairpin, no stored credentials). Fix by configuring SSH-based remotes for all repos on CoulombCore.

Steps:

  1. Generate an SSH key for the tegwick user on CoulombCore if not present:
    ssh-keygen -t ed25519 -C "tegwick@coulombcore" -f ~/.ssh/id_ed25519_gitea
    
  2. Add the public key to Gitea, either for the coulomb user or for a dedicated coulombcore bot account, via the Gitea admin UI or API.
  3. Add SSH config on CoulombCore:
    # ~/.ssh/config
    Host gitea-local
        HostName localhost
        Port <Gitea SSH NodePort>
        User git
        IdentityFile ~/.ssh/id_ed25519_gitea
    
    Note: Gitea exposes SSH on a NodePort (check current value: kubectl get svc -n gitea).
  4. Update remotes for all repos on CoulombCore:
    git remote set-url origin ssh://git@gitea-local/coulomb/<repo>.git
    
  5. Test: git push origin main from a repo on CoulombCore.

Codify the SSH key deployment step into Ansible (roles/base or roles/git-access): ensure the key is present and the SSH config block is templated.

Done when: git push from CoulombCore to Gitea succeeds over SSH without prompts; Ansible role deploys the key idempotently.
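
For the idempotency requirement on the SSH config block, the core logic is "append the Host block only if it is not already there". A shell sketch of that check follows; the function name is hypothetical, the port is passed in because the NodePort value is not fixed in this workplan, and an Ansible blockinfile-style task would be the more natural home for it.

```sh
# Append a Host block to an SSH config file unless a block for that alias already exists.
# Arguments: config file, host alias, port, identity file.
ensure_ssh_host_block() (
  config=$1; host_alias=$2; port=$3; identity=$4
  grep -q "^Host[[:space:]][[:space:]]*${host_alias}\$" "$config" 2>/dev/null && exit 0
  cat >> "$config" <<EOF
Host ${host_alias}
    HostName localhost
    Port ${port}
    User git
    IdentityFile ${identity}
EOF
  chmod 600 "$config"
)

# Usage sketch (replace the port with the value from kubectl get svc -n gitea):
#   ensure_ssh_host_block ~/.ssh/config gitea-local "$GITEA_SSH_NODEPORT" ~/.ssh/id_ed25519_gitea
```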


T08 — Automated stack deploy documentation + Makefile

id: RAIL-HO-WP-0004-T08
status: done
priority: medium
state_hub_task_id: "b076e540-2d81-4be8-a454-61cfd329bb05"

Write railiance-infra/docs/deploy-stack.md — the operator runbook for standing up the full Railiance stack from scratch. This is the canonical "I have a clean server, now what?" reference.

Standard sequence:

S1: make tf-apply && make converge && make verify  (railiance-infra)
S2: make converge && make smoke                    (railiance-cluster)
S3: make db-deploy && make valkey-deploy           (railiance-platform)
S4: (ArgoCD already at cluster level; no S4 workplan yet)
S5: make gitea-deploy                              (railiance-apps)
    make state-hub-deploy                          (railiance-apps, T09)
    make activity-core-deploy                      (railiance-apps, T10)

Add a make deploy-stack target in railiance-infra/Makefile that prints the ordered sequence with per-step instructions (not a single runaway script — operator confirms each layer before proceeding).

Document:

  • Pre-conditions checklist (Hetzner/HostEurope creds, age key, SOPS key)
  • State Hub tunnel bring-up (ops-bridge)
  • Recovery runbook pointer (INC-002 pattern)

Done when: docs/deploy-stack.md accurate and reviewed; make deploy-stack prints the sequence; a new operator could follow it end-to-end without prior context.
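
One possible shape for the logic behind make deploy-stack is sketched below: it prints the ordered sequence and offers a per-layer confirmation helper instead of running everything unattended. The function names and the prompt wording are assumptions; the steps are copied from the standard sequence above.

```sh
# Print the ordered layer / repo / command sequence (S4 has no step yet).
deploy_stack_plan() {
  cat <<'EOF'
S1  railiance-infra     make tf-apply && make converge && make verify
S2  railiance-cluster   make converge && make smoke
S3  railiance-platform  make db-deploy && make valkey-deploy
S5  railiance-apps      make gitea-deploy
S5  railiance-apps      make state-hub-deploy
S5  railiance-apps      make activity-core-deploy
EOF
}

# Ask the operator to confirm a layer; anything other than y or Y stops the walk-through.
# Argument: layer label, for example S1.
confirm_layer() {
  printf 'Layer %s converged and verified? [y/N] ' "$1" >&2
  read -r answer || answer=
  case $answer in
    y|Y) return 0 ;;
    *)   return 1 ;;
  esac
}

# Usage sketch:
#   deploy_stack_plan
#   confirm_layer S1 || exit 1
```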


T09 — Deploy state-hub to railiance01 as cluster primary (S5)

id: RAIL-HO-WP-0004-T09
status: todo
priority: medium
state_hub_task_id: "d2afe78a-eb51-4ce9-b332-f181323d2370"
needs_human: true
intervention_note: "Requires decisions: final hostname/domain or tunnel-only endpoint, registry choice, private exposure model, and approval before freezing workstation writes and migrating production State Hub data."

Pre-condition: T04 done (cnpg Gitea DB working); T08 done (deploy sequence documented). Custodian-side safety gate CUST-WP-0011-T01 must have passed: a fresh WSL2 State Hub backup restore drill with matching row counts.

State-hub needs a PostgreSQL database — use a cnpg cluster in databases namespace. This is the pragmatic railiance01 migration path; full multi-node ThreePhoenix HA remains a separate Custodian follow-up (CUST-WP-0038).

Steps:

  1. Define state-hub-db cnpg Cluster in railiance-platform (same pattern as T03).
  2. Create a container image for state-hub (Dockerfile in the-custodian/state-hub/).
  3. Push image to Gitea's container registry (or ghcr.io as interim).
  4. Write Helm chart or plain manifests in railiance-apps/apps/state-hub/:
    • Deployment (state-hub API, port 8000)
    • Service + Ingress (https://state-hub.)
    • ConfigMap for environment (DB URL, etc.)
    • Secret for DB credentials (SOPS-managed)
  5. Deploy empty State Hub and run Alembic migrations in-cluster.
  6. Restore a copy of WSL2 data into the cnpg cluster and compare table counts while the workstation remains the source of truth.
  7. With explicit human approval, freeze workstation writes, take a final dump, restore it to the cluster, and make railiance01 the primary endpoint.
  8. Update ops-bridge tunnel targets or MCP API_BASE if the State Hub URL changes.
  9. Update operator instructions to describe cluster primary plus WSL2 fallback.

Done when: the private State Hub endpoint returns healthy, MCP tools work against the cluster-backed API, and WSL2 is retained as documented fallback. Permanent WSL2 retirement is out of scope here and requires a later explicit approval after stabilisation.
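
The table-count comparison in step 6 can be reduced to a diff of two small text files, one per side, each holding "table count" lines. The sketch below assumes those files were produced beforehand (for example with one SELECT count(*) per table on each database); the file format and function name are illustrative.

```sh
# Compare two "table count" listings regardless of line order.
# Prints differing lines and returns non-zero when the listings do not match.
# Arguments: counts file from the source (WSL2), counts file from the cluster restore.
compare_row_counts() (
  src=$1; dst=$2
  tmp=$(mktemp -d) || exit 2
  trap 'rm -rf "$tmp"' EXIT
  sort "$src" > "$tmp/src"
  sort "$dst" > "$tmp/dst"
  diff "$tmp/src" "$tmp/dst"
)

# Usage sketch:
#   compare_row_counts wsl2-counts.txt cluster-counts.txt && echo "row counts match"
```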


T10 — Deploy activity-core to cluster (S5)

id: RAIL-HO-WP-0004-T10
status: todo
priority: low
state_hub_task_id: "34d73215-f016-4750-8da5-69f82d63d619"
needs_human: true
intervention_note: "activity-core architecture needs review before packaging — needs confirmation of runtime (Rails/Go/other), whether it uses postgres, and what the migration strategy is for any existing on-node data."

Pre-condition: T09 done (state-hub on cluster operational).

Activity-core is an application that currently runs ad hoc on CoulombCore; its runtime (Rails, Go, or other) is still to be confirmed. This task packages and deploys it as a proper S5 workload.

Steps:

  1. Verify activity-core has a working Dockerfile (check repo).
  2. Define a cnpg database cluster for activity-core in railiance-platform (if it uses postgres).
  3. Write Helm chart / manifests in railiance-apps/apps/activity-core/.
  4. Migrate any existing data from the ad-hoc CoulombCore deployment.
  5. Add to railiance-apps/Makefile:
    activity-core-deploy: ## Deploy activity-core to cluster
    activity-core-status: ## Check activity-core health
    
  6. Remove or archive the ad-hoc CoulombCore deployment.

Done when: Activity-core accessible at its cluster URL; no ad-hoc process remaining on CoulombCore; all prior functionality intact.


Phasing and dependencies

T01 (swap) ─┐
T02 (nproc) ─┴─ independent, can parallelize

T03 (cnpg cluster def) ──► T04 (migrate Gitea DB) ──► T05 (Valkey standalone) ──► T06 (move Gitea to S5)

T07 (SSH remotes) ─ independent, unblock early

T08 (deploy docs) ─ can be written in parallel with T03-T06

T09 (state-hub on cluster) ─ needs T04 (DB working) + T08 (deploy pattern)
T10 (activity-core) ─ needs T09

Recommended order: T07 → T01+T02 → T03 → T04 → T05 → T06 → T08 → T09 → T10
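
The dependency edges in the diagram can be checked mechanically with the standard tsort utility, which prints one valid execution order. This is a sketch for sanity-checking an order, not a replacement for the recommended order above; a pair of identical task IDs marks a task with no dependencies.

```sh
# Print one dependency-respecting order of the tasks.
# Each input line "A B" means A must finish before B.
workplan_order() {
  tsort <<'EOF'
T03 T04
T04 T05
T05 T06
T04 T09
T08 T09
T09 T10
T01 T01
T02 T02
T07 T07
EOF
}

# Usage sketch:
#   workplan_order
```

Any order tsort prints satisfies the arrows in the diagram; the recommended order is one such order, chosen additionally to unblock T07 early.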

References

  • ADR-003 (OAS boundary rules): railiance-infra/docs/adr/ADR-003-railiance-5repo-stack-architecture.md
  • ADR-004 (connectivity-first): the-custodian/canon/architecture/adr-004-connectivity-first-network-posture.md
  • INC-002 (overload incident): the-custodian/ops/incidents/2026-03-26-coulombcore-runaway-agent-overload.md
  • Superseded: railiance-platform/workplans/RAIL-PL-WP-0001-platform-baseline.md
  • ops-bridge runbook: ops-bridge/docs/