---
id: RAIL-HO-WP-0004
type: workplan
title: "Railiance Production Readiness — Automated, Reproducible Stack"
domain: railiance
repo: railiance-infra
status: active
owner: worsch
topic_slug: railiance
created: "2026-03-26"
updated: "2026-05-02"
supersedes: RAIL-PL-WP-0001
state_hub_workstream_id: "cee078e9-b18c-4f84-8a8a-6f27c2f9f407"
---

# Railiance Production Readiness — Automated, Reproducible Stack

## Goal

Make the Railiance cluster fully reproducible from a clean server — zero manual
interventions required. Then migrate operational workloads (state-hub,
activity-core) from the workstation and ad-hoc CoulombCore setup onto the
cluster, with confidence that a rebuild is always a `make deploy-stack` away.

This workplan supersedes `RAIL-PL-WP-0001` (which targeted Bitnami
postgresql-ha; cnpg is now the deployed and active operator).

## Why now

Three forcing functions are converging:

1. **CoulombCore hardening applied manually** (swapfile, nproc limits, systemd
   slice) after INC-002. These must be in Ansible before the next node
   rebuild — or the next operator overstep will repeat the incident.
2. **cnpg is deployed** (cnpg-system namespace, databases namespace active).
   WP-0001 targeted Bitnami postgresql-ha which is now stale. A clean platform
   baseline must match reality.
3. **State-hub and activity-core live on the workstation** — fragile,
   non-redundant, not self-documenting. Moving them to the cluster is the last
   step to making Railiance the durable operational home it was designed to be.

## Current deployed state (reference snapshot 2026-03-26)

| Component | Namespace | Manager | Boundary status |
|-----------|-----------|---------|-----------------|
| cert-manager | cert-manager | Helm S2 | ✓ correct |
| CloudNative PG operator | cnpg-system | Helm S2 | boundary violation: operator is S3 concern |
| nginx ingress | ingress-nginx | Helm S2 | ✓ correct |
| Gitea 12.5.0 | gitea | Helm S2 | boundary violation: should be S5 |
| ArgoCD | argocd | kubectl S2 | boundary violation: S4 concern |
| SSO/MFA stack | mfa + sso | ? | per net-kingdom |
| cnpg databases | databases | kubectl S3 | ✓ correct layer, no cluster defined yet |

## Scope

### Phase 1 — Ansible-codify server hardening (S1)

All manual CoulombCore interventions from INC-002 must become Ansible roles
so they survive node rebuild. No more drift between code and reality.

### Phase 2 — S3 platform baseline with cnpg (supersedes WP-0001)

Define a cnpg `Cluster` resource for the Gitea database in `railiance-platform`.
Migrate Gitea from its built-in postgresql-ha subchart to this cluster.
Codify Valkey as a standalone S3 Helm release.

### Phase 3 — S2 boundary cleanup

Move `gitea-values.sops.yaml` from `railiance-cluster` to `railiance-apps`.
Document remaining boundary violations (cnpg operator in S2, ArgoCD in S2)
and create forward-dated migration stubs.

### Phase 4 — Git operations from CoulombCore

CoulombCore cannot push to Gitea via HTTP (NodePort hairpin). Configure SSH
remote so all on-cluster git operations use SSH.

### Phase 5 — Automated stack deploy

Write a `deploy-stack` target (or script) that converges S1→S5 in dependency
order. The goal: a new operator can onboard a server and reach a working
cluster with one command sequence.

### Phase 6 — Migrate operational workloads (S5)

Deploy state-hub and activity-core to the cluster. This is the payoff phase —
the cluster becomes the operational home, not the workstation.

## Pre-conditions

- railiance-cluster converged and all S2 workplans done (they are: ✓)
- Gitea operational (it is: ✓, gitea namespace running)
- ops-bridge state-hub tunnel active (`bridge up state-hub-coulombcore`)
- Active backup before any phase touching live data (`make backup` in railiance-cluster)

---

## Tasks

### T01 — Ansible: swapfile role for CoulombCore

```task
id: RAIL-HO-WP-0004-T01
status: done
priority: high
state_hub_task_id: "7c586940-f7b8-4e55-b1d6-72eba6a675b7"
```

Create an Ansible role `swapfile` (or extend `roles/base`) that provisions the
4 GB swapfile applied manually after INC-002.

Desired state:
```yaml
# inventory/host_vars/coulombcore.yml (or group_vars)
swap_size_gb: 4
swap_swappiness: 10
```

Role tasks:
1. Check that `/swapfile` exists with the correct size; create it with
   `fallocate` only when it is missing or the wrong size (idempotent)
2. `chmod 600 /swapfile`, `mkswap`, `swapon` if not already active
3. Ensure `/etc/fstab` entry present
4. Set `vm.swappiness=10` via `sysctl` module (persist in `/etc/sysctl.d/`)

**Convergence pattern:** Ansible is not installed on the workstation. Run convergence
directly on CoulombCore (local Ansible, connection=local):
```bash
ssh -i ~/.ssh/id_ops tegwick@92.205.130.254 \
  'cd ~/railiance-infra && git pull && ansible-playbook ansible/playbooks/bootstrap.yaml -c local -u tegwick --become -l CoulombCore'
```

**Done when:** Convergence runs without errors; `free -h` on CoulombCore shows
4 GB swap; Goss verify passes.

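The "Done when" check above can also be scripted. The following is a minimal sketch, not the role's actual verification: the function names `swap_total_gib` and `check_swap` are hypothetical, and the files default to the standard Linux `/proc` paths. Calling `check_swap 4 10` on CoulombCore would succeed only if swap rounds to 4 GiB and swappiness is 10.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, not part of railiance-infra.

# Print total swap in GiB (rounded to the nearest integer), read from a
# meminfo-style file. SwapTotal is reported in kB (1 GiB = 1048576 kB).
swap_total_gib() {
  local meminfo="${1:-/proc/meminfo}"
  awk '/^SwapTotal:/ { printf "%d\n", ($2 + 524288) / 1048576 }' "$meminfo"
}

# Succeed only if swap size and swappiness match the desired state.
# Args: want_gib want_swappiness [meminfo_file] [swappiness_file]
check_swap() {
  local want_gib="$1" want_swappiness="$2"
  local meminfo="${3:-/proc/meminfo}"
  local swappiness_file="${4:-/proc/sys/vm/swappiness}"
  [ "$(swap_total_gib "$meminfo")" = "$want_gib" ] &&
    [ "$(cat "$swappiness_file")" = "$want_swappiness" ]
}
```
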
---

### T02 — Ansible: nproc limits and systemd user slice

```task
id: RAIL-HO-WP-0004-T02
status: done
priority: high
state_hub_task_id: "42f1f02b-0d8b-432c-8bc8-4930417e15dd"
```

Codify the PAM nproc limits and systemd user slice hardening applied after
INC-002 into Ansible (role `security` or a new `resource-limits` role).

Desired state:
```yaml
nproc_soft: 512
nproc_hard: 1024
user_memory_max: "1500M"
user_memory_swap_max: "512M"
```

Tasks:
1. Template `/etc/security/limits.conf` entry for tegwick (nproc soft/hard)
2. Create `/etc/systemd/system/user-1000.slice.d/limits.conf` via template
3. `systemctl daemon-reload` handler

**Done when:** `make converge` idempotent; `cat /proc/<tegwick-pid>/limits`
reflects caps; `make verify` passes; Goss test for nproc limit added.

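The `/proc/<tegwick-pid>/limits` check can be automated along these lines. This is a sketch under assumptions: `nproc_limits` and `check_nproc` are hypothetical helper names, and the parser relies on the usual `Max processes  <soft>  <hard>  processes` row layout of the Linux limits file. For example, `check_nproc 512 1024 /proc/<tegwick-pid>/limits` would succeed once the caps are in effect for that process.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are illustrative.

# Print "<soft> <hard>" for the "Max processes" row of a limits file.
nproc_limits() {
  local limits_file="${1:-/proc/self/limits}"
  awk '/^Max processes/ { print $3, $4 }' "$limits_file"
}

# Succeed only if the soft and hard nproc limits match the desired state.
# Args: want_soft want_hard [limits_file]
check_nproc() {
  local want_soft="$1" want_hard="$2"
  local limits_file="${3:-/proc/self/limits}"
  [ "$(nproc_limits "$limits_file")" = "$want_soft $want_hard" ]
}
```
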
---

### T03 — Define Gitea cnpg database cluster in railiance-platform

```task
id: RAIL-HO-WP-0004-T03
status: done
priority: high
state_hub_task_id: "8e8cff04-96c6-4386-8caa-b0586114a49d"
```

Mark `RAIL-PL-WP-0001` as superseded (update its status field). Then define
the Gitea database cluster using CloudNative PG in `railiance-platform`.

Files to create in `railiance-platform/`:
```
helm/gitea-db-cluster.yaml     # cnpg Cluster manifest (SOPS-encrypted secrets inline or ref)
Makefile targets: db-deploy, db-status, db-shell
```

Cluster manifest skeleton:
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: gitea-db
  namespace: databases
spec:
  instances: 1  # single-node to start; bump to 3 when RAM allows
  # cnpg selects the PostgreSQL major version (16) via the image tag
  imageName: ghcr.io/cloudnative-pg/postgresql:16
  storage:
    size: 10Gi
  bootstrap:
    initdb:
      database: gitea
      owner: gitea
      secret:
        name: gitea-db-credentials  # k8s Secret (SOPS-managed)
```

Add a `make db-deploy` target that applies the manifest to the `databases`
namespace. Add a `make db-status` target that shows cluster health via
`kubectl cnpg status`.

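The commands behind those two targets might look like the sketch below. It assumes the manifest can be applied as-is (any SOPS decryption step is left out), and the `KUBECTL`, `DB_NAMESPACE`, `DB_CLUSTER`, and `DB_MANIFEST` variables and the function names are illustrative, not existing Makefile variables.

```bash
#!/usr/bin/env bash
# Sketch only: variable and function names are hypothetical.
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"
DB_NAMESPACE="${DB_NAMESPACE:-databases}"
DB_CLUSTER="${DB_CLUSTER:-gitea-db}"
DB_MANIFEST="${DB_MANIFEST:-helm/gitea-db-cluster.yaml}"

# What `make db-deploy` could run: apply the Cluster manifest.
db_deploy() {
  "$KUBECTL" apply -n "$DB_NAMESPACE" -f "$DB_MANIFEST"
}

# What `make db-status` could run: cluster health via the cnpg kubectl plugin.
db_status() {
  "$KUBECTL" cnpg status "$DB_CLUSTER" -n "$DB_NAMESPACE"
}
```

Setting `KUBECTL=echo` before calling `db_deploy` prints the command line instead of running it, which is a cheap way to review the target before touching the cluster.
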
**Done when:** `make db-deploy` succeeds; `kubectl get cluster -n databases`
shows `gitea-db` in `Cluster in healthy state`; credentials secret present.

---

### T04 — Migrate Gitea to external cnpg database

```task
id: RAIL-HO-WP-0004-T04
status: done
priority: high
state_hub_task_id: "4f4196b5-4d84-4648-b470-e6941444ea46"
```

**Pre-condition:** T03 done and gitea-db cluster healthy.

Migration steps (execute from CoulombCore with kubectl access):

1. Backup: `make backup` in railiance-cluster — verify success.
2. Dump current Gitea DB:
   ```bash
   kubectl exec -n gitea deploy/gitea -- \
     pg_dump -h localhost -U gitea gitea > /tmp/gitea-dump.sql
   ```
   (Gitea's built-in postgresql-ha is at localhost within the pod)
3. Restore into cnpg cluster (`-i` is required so that `kubectl exec` forwards
   the dump on stdin):
   ```bash
   kubectl exec -i -n databases gitea-db-1 -- \
     psql -U gitea gitea < /tmp/gitea-dump.sql
   ```
4. Update Gitea Helm values to disable subchart and point to cnpg:
   ```yaml
   postgresql-ha:
     enabled: false
   redis-cluster:
     enabled: false  # Valkey handled in T05
   gitea:
     config:
       database:
         DB_TYPE: postgres
         HOST: gitea-db-rw.databases.svc.cluster.local:5432
         NAME: gitea
         USER: gitea
         PASSWD: <from cnpg secret>
   ```
5. `helm upgrade gitea` — verify login and all repos intact.
6. Confirm old postgresql-ha pods are terminated.

**Done when:** Gitea login works; all repos accessible; no postgresql-ha pods
running; `kubectl cnpg status gitea-db -n databases` healthy.

---

### T05 — Codify Valkey as standalone S3 asset

```task
id: RAIL-HO-WP-0004-T05
status: done
priority: medium
state_hub_task_id: "36c66ceb-ebc9-425a-a329-c37496278c6b"
```

Create `railiance-platform/helm/valkey-values.sops.yaml` and deploy Valkey
as a standalone Helm release in the `platform` namespace (independent of Gitea
subchart).

```yaml
# helm/valkey-values.sops.yaml
auth:
  enabled: true
  password: ENC[age,...]
replica:
  replicaCount: 1
  persistence:
    enabled: true
    size: 2Gi
```

Add `make valkey-deploy` and `make valkey-status` to `railiance-platform/Makefile`.

Update Gitea Helm values to point to standalone Valkey:
```yaml
redis-cluster:
  enabled: false
gitea:
  config:
    cache:
      ADAPTER: redis
      HOST: redis://:password@valkey.platform.svc.cluster.local:6379/0
    session:
      PROVIDER: redis
      PROVIDER_CONFIG: redis://:password@valkey.platform.svc.cluster.local:6379/1
    queue:
      TYPE: redis
      CONN_STR: redis://:password@valkey.platform.svc.cluster.local:6379/2
```

**Done when:** `make valkey-deploy` succeeds; Gitea session/cache operational
on standalone Valkey; no redis subchart pods running.

---

### T06 — Move Gitea Helm values to railiance-apps (boundary fix)

```task
id: RAIL-HO-WP-0004-T06
status: done
priority: medium
state_hub_task_id: "6d8323b3-e842-4dc1-9a12-2b153b2afcce"
```

**Pre-condition:** T04 done (Gitea on external DB; Helm values updated).

```bash
# In railiance-cluster (git mv cannot cross repository boundaries):
mv helm/gitea-values.sops.yaml ../railiance-apps/helm/gitea-values.sops.yaml
git rm --cached helm/gitea-values.sops.yaml
git -C ../railiance-apps add helm/gitea-values.sops.yaml
```

Add to `railiance-apps/Makefile`:
```makefile
# Process substitution <(...) in gitea-deploy requires bash, not /bin/sh
SHELL := /bin/bash

gitea-deploy: ## Deploy / upgrade Gitea (S5 workload)
	helm upgrade --install gitea gitea-charts/gitea \
	  -f <(sops -d helm/gitea-values.sops.yaml) \
	  --namespace gitea --create-namespace

gitea-status: ## Check Gitea health
	kubectl get pods -n gitea
	kubectl cnpg status gitea-db -n databases
```

Add tombstone in `railiance-cluster/helm/MOVED.md`:
```
gitea-values.sops.yaml → railiance-apps/helm/ (2026-03-xx, RAIL-HO-WP-0004-T06)
```

Update `railiance-cluster/SCOPE.md` to remove Gitea boundary violation note.
Update `railiance-apps/SCOPE.md` Current State to reflect resolved violation.

**Done when:** `make gitea-deploy` from railiance-apps converges correctly;
Gitea operational; tombstone in place in railiance-cluster.

---

### T07 — SSH remote for git operations from CoulombCore

```task
id: RAIL-HO-WP-0004-T07
status: done
priority: high
state_hub_task_id: "3d76754d-2dc0-4fe5-8bf2-c74d77cebe36"
```

CoulombCore cannot push to Gitea via HTTP (NodePort hairpin, no stored
credentials). Fix by configuring SSH-based remotes for all repos on CoulombCore.

Steps:
1. Generate an SSH key for the `tegwick` user on CoulombCore if not present:
   ```bash
   ssh-keygen -t ed25519 -C "tegwick@coulombcore" -f ~/.ssh/id_ed25519_gitea
   ```
2. Add the public key to Gitea (coulomb user or a dedicated `coulombcore` bot
   account via Gitea admin UI or API).
3. Add SSH config on CoulombCore:
   ```
   # ~/.ssh/config
   Host gitea-local
     HostName localhost
     Port <Gitea SSH NodePort>
     User git
     IdentityFile ~/.ssh/id_ed25519_gitea
   ```
   Note: Gitea exposes SSH on a NodePort (check current value: `kubectl get svc -n gitea`).
4. Update remotes for all repos on CoulombCore:
   ```bash
   git remote set-url origin ssh://git@gitea-local/coulomb/<repo>.git
   ```
5. Test: `git push origin main` from a repo on CoulombCore.

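Step 4 has to be repeated for every checkout, so a small loop can help. The sketch below is one possible approach, not the procedure that was actually run: `to_ssh_remote` and `rewrite_remotes` are hypothetical names, the repos are assumed to sit directly under one base directory, and only `http://` or `https://` origins are rewritten to the `gitea-local` host alias from step 3.

```bash
#!/usr/bin/env bash
# Sketch only: helper names and the directory layout are assumptions.

# Map an HTTP(S) remote URL to the SSH form used in step 4.
# Args: url [ssh_host_alias]
to_ssh_remote() {
  local url="$1" ssh_host="${2:-gitea-local}"
  local path="${url#*://}"   # drop the scheme
  path="${path#*/}"          # drop host[:port], keep <owner>/<repo>.git
  printf 'ssh://git@%s/%s\n' "$ssh_host" "$path"
}

# Rewrite origin for every git checkout directly under a base directory.
# Args: [base_dir]
rewrite_remotes() {
  local base="${1:-$HOME}" repo url
  for repo in "$base"/*/; do
    [ -d "${repo}.git" ] || continue
    url="$(git -C "$repo" remote get-url origin 2>/dev/null)" || continue
    case "$url" in
      http://*|https://*)
        git -C "$repo" remote set-url origin "$(to_ssh_remote "$url")"
        ;;
    esac
  done
}
```

Running `rewrite_remotes ~` and then `git -C <repo> remote -v` in one checkout shows whether the rewrite took effect before attempting the push in step 5.
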
Codify the SSH key deployment step into Ansible
(`roles/base` or `roles/git-access`): ensure the key is present and the SSH
config block is templated.

**Done when:** `git push` from CoulombCore to Gitea succeeds over SSH without
prompts; Ansible role deploys the key idempotently.

---

### T08 — Automated stack deploy documentation + Makefile

```task
id: RAIL-HO-WP-0004-T08
status: done
priority: medium
state_hub_task_id: "b076e540-2d81-4be8-a454-61cfd329bb05"
```

Write `railiance-infra/docs/deploy-stack.md` — the operator runbook for
standing up the full Railiance stack from scratch. This is the canonical
"I have a clean server, now what?" reference.

Standard sequence:
```
S1: make tf-apply && make converge && make verify   (railiance-infra)
S2: make converge && make smoke                     (railiance-cluster)
S3: make db-deploy && make valkey-deploy            (railiance-platform)
S4: (ArgoCD already at cluster level; no S4 workplan yet)
S5: make gitea-deploy                               (railiance-apps)
    make state-hub-deploy                           (railiance-apps, T09)
    make activity-core-deploy                       (railiance-apps, T10)
```

Add a `make deploy-stack` target in `railiance-infra/Makefile` that prints
the ordered sequence with per-step instructions (not a single runaway script —
operator confirms each layer before proceeding).

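A script behind that target could follow the "print, then confirm" pattern sketched here. This is an illustration under assumptions, not the implemented target: `deploy_stack_plan` and `deploy_stack` are hypothetical names, and the script only prints the commands from the standard sequence; it never runs them.

```bash
#!/usr/bin/env bash
# Sketch only: prints the sequence and waits for confirmation per layer.

# Emit the ordered layer sequence, one step per line.
deploy_stack_plan() {
  cat <<'PLAN'
S1 railiance-infra: make tf-apply && make converge && make verify
S2 railiance-cluster: make converge && make smoke
S3 railiance-platform: make db-deploy && make valkey-deploy
S4 (ArgoCD already at cluster level; no S4 workplan yet)
S5 railiance-apps: make gitea-deploy
S5 railiance-apps: make state-hub-deploy (T09)
S5 railiance-apps: make activity-core-deploy (T10)
PLAN
}

# Print each step, then read the operator's answer from the original stdin
# (kept on fd 3) while the loop itself reads the plan from a here-document.
deploy_stack() {
  local step answer
  {
    while IFS= read -r step; do
      printf '%s\n' "$step"
      printf 'Layer converged? [y/N] ' >&2
      IFS= read -r answer <&3 || return 1
      if [ "$answer" != "y" ]; then
        echo 'Stopped: fix this layer before continuing.' >&2
        return 1
      fi
    done
  } 3<&0 <<EOF
$(deploy_stack_plan)
EOF
}
```

Any answer other than `y` stops the walk-through with a non-zero status, so a layer that has not converged cannot be skipped by accident.
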
Document:
- Pre-conditions checklist (Hetzner/HostEurope creds, age key, SOPS key)
- State Hub tunnel bring-up (ops-bridge)
- Recovery runbook pointer (INC-002 pattern)

**Done when:** `docs/deploy-stack.md` accurate and reviewed; `make deploy-stack`
prints the sequence; a new operator could follow it end-to-end without prior
context.

---

### T09 — Deploy state-hub to railiance01 as cluster primary (S5)

```task
id: RAIL-HO-WP-0004-T09
status: todo
priority: medium
state_hub_task_id: "d2afe78a-eb51-4ce9-b332-f181323d2370"
needs_human: true
intervention_note: "Requires decisions: final hostname/domain or tunnel-only endpoint, registry choice, private exposure model, and approval before freezing workstation writes and migrating production State Hub data."
```

**Pre-condition:** T04 done (cnpg Gitea DB working); T08 done (deploy sequence
documented). Custodian-side safety gate `CUST-WP-0011-T01` must have passed:
a fresh WSL2 State Hub backup restore drill with matching row counts.

State-hub needs a PostgreSQL database — use a cnpg cluster in `databases`
namespace. This is the pragmatic railiance01 migration path; full multi-node
ThreePhoenix HA remains a separate Custodian follow-up (`CUST-WP-0038`).

Steps:
1. Define `state-hub-db` cnpg Cluster in `railiance-platform` (same pattern as T03).
2. Create a container image for state-hub (Dockerfile in `the-custodian/state-hub/`).
3. Push image to Gitea's container registry (or ghcr.io as interim).
4. Write Helm chart or plain manifests in `railiance-apps/apps/state-hub/`:
   - Deployment (state-hub API, port 8000)
   - Service + Ingress (https://state-hub.<domain>)
   - ConfigMap for environment (DB URL, etc.)
   - Secret for DB credentials (SOPS-managed)
5. Deploy empty State Hub and run Alembic migrations in-cluster.
6. Restore a copy of WSL2 data into the cnpg cluster and compare table counts
   while the workstation remains the source of truth.
7. With explicit human approval, freeze workstation writes, take a final dump,
   restore it to the cluster, and make railiance01 the primary endpoint.
8. Update ops-bridge tunnel targets or MCP `API_BASE` if the State Hub URL changes.
9. Update operator instructions to describe cluster primary plus WSL2 fallback.

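The table-count comparison in step 6 can be reduced to diffing two listings. The sketch below assumes each side has been exported to a tab-separated `table<TAB>count` file (how those files are produced is left to the operator); `compare_counts` is a hypothetical helper that prints every mismatch and fails if there is one.

```bash
#!/usr/bin/env bash
# Sketch only: expects two files with "table<TAB>row_count" per line.

# Args: source_counts_file target_counts_file
# Prints one line per mismatching or missing table; exit status 1 on any.
compare_counts() {
  awk -F'\t' '
    BEGIN { bad = 0 }
    FILENAME == ARGV[1] { src[$1] = $2; next }   # first file: source counts
    {
      seen[$1] = 1
      if (!($1 in src)) {
        printf "%s: source=missing target=%s\n", $1, $2; bad = 1
      } else if (src[$1] != $2) {
        printf "%s: source=%s target=%s\n", $1, src[$1], $2; bad = 1
      }
    }
    END {
      for (t in src) if (!(t in seen)) {
        printf "%s: source=%s target=missing\n", t, src[t]; bad = 1
      }
      exit bad
    }
  ' "$1" "$2"
}
```

An empty output with exit status 0 means both listings agree, which is the condition to check before asking for approval in step 7.
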
**Done when:** the private State Hub endpoint returns healthy, MCP tools work
against the cluster-backed API, and WSL2 is retained as documented fallback.
Permanent WSL2 retirement is out of scope here and requires a later explicit
approval after stabilisation.

---

### T10 — Deploy activity-core to cluster (S5)

```task
id: RAIL-HO-WP-0004-T10
status: todo
priority: low
state_hub_task_id: "34d73215-f016-4750-8da5-69f82d63d619"
needs_human: true
intervention_note: "activity-core architecture needs review before packaging — needs confirmation of runtime (Rails/Go/other), whether it uses postgres, and what the migration strategy is for any existing on-node data."
```

**Pre-condition:** T09 done (state-hub on cluster operational).

Activity-core is an application (runtime still unconfirmed: Rails, Go, or
other) that currently runs ad hoc on CoulombCore.
This task packages and deploys it as a proper S5 workload.

Steps:
1. Verify activity-core has a working Dockerfile (check repo).
2. Define a cnpg database cluster for activity-core in `railiance-platform`
   (if it uses postgres).
3. Write Helm chart / manifests in `railiance-apps/apps/activity-core/`.
4. Migrate any existing data from the ad-hoc CoulombCore deployment.
5. Add to `railiance-apps/Makefile`:
   ```makefile
   activity-core-deploy: ## Deploy activity-core to cluster
   activity-core-status: ## Check activity-core health
   ```
6. Remove or archive the ad-hoc CoulombCore deployment.

**Done when:** Activity-core accessible at its cluster URL; no ad-hoc process
remaining on CoulombCore; all prior functionality intact.

---

## Phasing and dependencies

```
T01 (swap)  ─┐
T02 (nproc) ─┴─ independent, can parallelize

T03 (cnpg cluster def) ──► T04 (migrate Gitea DB) ──► T05 (Valkey standalone) ──► T06 (move Gitea to S5)

T07 (SSH remotes)  ─ independent, unblock early

T08 (deploy docs)  ─ can be written in parallel with T03-T06

T09 (state-hub on cluster) ─ needs T04 (DB working) + T08 (deploy pattern)
T10 (activity-core)        ─ needs T09
```

Recommended order: T07 → T01+T02 → T03 → T04 → T05 → T06 → T08 → T09 → T10

## References

- ADR-003 (OAS boundary rules): `railiance-infra/docs/adr/ADR-003-railiance-5repo-stack-architecture.md`
- ADR-004 (connectivity-first): `the-custodian/canon/architecture/adr-004-connectivity-first-network-posture.md`
- INC-002 (overload incident): `the-custodian/ops/incidents/2026-03-26-coulombcore-runaway-agent-overload.md`
- Superseded: `railiance-platform/workplans/RAIL-PL-WP-0001-platform-baseline.md`
- ops-bridge runbook: `ops-bridge/docs/`