tegwick 2d7e0101bc feat(infra): UFW k3s routing + full deploy runbook

- base role: allow UFW routing direction (required for k3s flannel
  pod networking to function across nodes)
- docs/deploy-stack.md: full S1→S5 ordered deploy runbook with
  pre-conditions checklist and layer-by-layer steps

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-27 02:28:51 +01:00

Railiance Stack — Full Deploy Runbook

When to use this: when starting from a bare server (post-OS install) or rebuilding after a catastrophic failure. For day-to-day operations, use the individual layer repos. See ADR-003 for layer boundaries and ADR-004 for connectivity posture.

Pre-conditions checklist

Before starting, verify you have:

- SSH access to the target server (CoulombCore: 92.205.130.254, user: tegwick, key: ~/.ssh/id_ops)
- SOPS age private key available (~/.config/sops/age/keys.txt or the SOPS_AGE_KEY environment variable)
- ops-bridge running on the workstation (needed for state hub MCP): make mcp-http in ~/the-custodian/state-hub/
- Gitea accessible (needed for git pull on the remote): SSH via gitea-remote:coulomb/<repo>.git
- If re-provisioning from scratch: Hetzner/HostEurope API credentials decryptable via SOPS
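
The two file-based items in the checklist can be checked mechanically before starting. The following is a minimal sketch, not part of the repos: the function names are hypothetical, and the default paths are the ones named in the checklist above.

```shell
# Preflight sketch (hypothetical helper, not shipped with railiance-infra).

# Succeeds if the SSH private key file exists and is readable.
check_ssh_key() {
  local key="${1:-$HOME/.ssh/id_ops}"
  [ -r "$key" ]
}

# Succeeds if an age key is available, either through the
# SOPS_AGE_KEY environment variable or through the key file.
check_age_key() {
  local keyfile="${1:-$HOME/.config/sops/age/keys.txt}"
  [ -n "${SOPS_AGE_KEY:-}" ] || [ -r "$keyfile" ]
}

# Runs both checks, prints one line per check, and returns
# non-zero if any check failed.
preflight() {
  local failed=0
  if check_ssh_key "${1:-}"; then echo "ok   ssh key"; else echo "FAIL ssh key"; failed=1; fi
  if check_age_key "${2:-}"; then echo "ok   age key"; else echo "FAIL age key"; failed=1; fi
  return "$failed"
}
```

Calling `preflight` with no arguments uses the default paths. The remaining checklist items (ops-bridge, Gitea reachability, cloud credentials) still need a manual check.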

S1 — Infrastructure Substrate (`railiance-infra`)

# On workstation
cd ~/railiance-infra

# Provision server (skip if server already exists)
make tf-plan        # review Terraform plan
make tf-apply       # create/update server

# Converge OS baseline
# NOTE: Ansible runs locally on CoulombCore (workstation has no Ansible installed)
ssh -i ~/.ssh/id_ops tegwick@92.205.130.254 \
  'cd ~/railiance-infra && git pull && \
   cd ansible && ansible-playbook playbooks/bootstrap.yaml \
     -c local --become -l CoulombCore'

# Verify OS baseline
make verify

Checkpoint: UFW active, fail2ban running, swap enabled, nproc limits in place, SOPS/age installed.
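
If `make verify` is unavailable, some of the checkpoint items can be spot-checked by hand on CoulombCore. This is a rough sketch under that assumption; the helper names are hypothetical, and it covers only UFW and swap.

```shell
# Succeeds if the given `ufw status` output reports an active firewall.
ufw_is_active() {
  printf '%s\n' "$1" | grep -q '^Status: active'
}

# Succeeds if SwapTotal in a meminfo-style file is greater than zero.
# Defaults to /proc/meminfo.
swap_is_enabled() {
  local meminfo="${1:-/proc/meminfo}"
  awk '/^SwapTotal:/ { found = ($2 > 0) } END { exit !found }' "$meminfo"
}

# Example use on the server (left commented out on purpose):
# ufw_is_active "$(sudo ufw status)" \
#   && systemctl is-active --quiet fail2ban \
#   && swap_is_enabled \
#   && echo "S1 checkpoint looks good"
```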

S2 — Cluster Runtime (`railiance-cluster`)

# On CoulombCore (SSH in first)
ssh -i ~/.ssh/id_ops tegwick@92.205.130.254

cd ~/railiance-cluster
make converge       # installs k3s, Helm, cert-manager, nginx ingress, cnpg operator
make smoke          # runs cluster health assertions

Checkpoint: k3s running, Helm available, cert-manager and nginx-ingress pods Running, cnpg-system namespace active.
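
The "pods Running" part of the checkpoint can be checked with a small filter over `kubectl get pods` output. The sketch below is an illustration and may overlap with what `make smoke` already asserts; `unhealthy_pods` is a hypothetical name.

```shell
# Reads `kubectl get pods -A --no-headers` output on stdin and prints
# "namespace/name status" for every pod that is neither Running nor
# Completed. Returns non-zero if at least one such pod was found.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4; bad = 1 }
       END { exit bad + 0 }'
}

# Example use (commented out; needs a working kubeconfig):
# kubectl get pods -A --no-headers | unhealthy_pods && echo "all pods healthy"
```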

S3 — Platform Services (`railiance-platform`)

# On CoulombCore (kubectl available after S2)
cd ~/railiance-platform && git pull

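# Create Gitea DB credentials secret (one-time; do NOT commit plaintext)
# Replace <GITEA_DB_PASSWORD> with the real value; keep the quotes so the
# shell does not parse < and > as redirections
kubectl create secret generic gitea-db-credentials \
  --namespace databases \
  --from-literal=username=gitea \
  --from-literal=password='<GITEA_DB_PASSWORD>'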

# Deploy cnpg Gitea database cluster
make db-deploy

# Wait for cluster to be healthy (~60s)
make db-status

# Deploy Valkey cache (standalone, not as Gitea subchart)
# Requires: helm/valkey-values.sops.yaml with encrypted password
make valkey-deploy
make valkey-status

Checkpoint: kubectl get cluster -n databases shows gitea-db healthy; Valkey pod Running in platform namespace.
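
Typing the password directly into the `kubectl create secret` command from this section leaves it in shell history. One possible alternative is sketched below: a wrapper that takes the password as an argument, so it can be fed from an interactive prompt. The wrapper name is hypothetical; the secret name, namespace and keys are the ones used above.

```shell
# Creates the gitea-db-credentials secret from a password passed as $1.
create_gitea_db_secret() {
  local password="$1"
  kubectl create secret generic gitea-db-credentials \
    --namespace databases \
    --from-literal=username=gitea \
    --from-literal=password="$password"
}

# Example use in bash (commented out): prompt without echo, create, forget.
# read -rs -p 'Gitea DB password: ' pw && echo
# create_gitea_db_secret "$pw"; unset pw
```

The password is still visible in the process argument list while `kubectl` runs, so this only addresses shell history, not other local users on the host.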

S4 — Developer Enablement (`railiance-enablement`)

No formal workplan yet. ArgoCD is currently deployed at cluster level (S2 boundary violation, tracked in RAIL-HO-WP-0004). No S4-specific steps required at this time.

S5 — Workloads & Experience (`railiance-apps`)

# On CoulombCore
cd ~/railiance-apps && git pull

# Deploy Gitea (git hosting)
# Requires: helm/gitea-values.sops.yaml with encrypted values
make gitea-deploy
make gitea-status

# Deploy state-hub (Custodian cognitive infrastructure)
# See RAIL-HO-WP-0004-T09 for full steps
make state-hub-deploy   # (not yet implemented — pending T09)

# Deploy activity-core
# See RAIL-HO-WP-0004-T10 for full steps
make activity-core-deploy  # (not yet implemented — pending T10)

Checkpoint: Gitea accessible and all repos cloneable via SSH. Once state-hub is deployed (pending T09), state-hub /state/health returns 200.
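
Services often need a few seconds after a deploy before their health endpoint answers, so a retry loop is handy for this checkpoint. The helper below is a generic sketch, not part of railiance-apps. The host and port in the usage line are assumptions (port 18000 is taken from the tunnel list in this runbook), so adjust them to the actual setup.

```shell
# Runs a command until it succeeds or the attempt budget is used up.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while ! "$@"; do
    # Give up once the last allowed attempt has failed.
    [ "$i" -ge "$attempts" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
  return 0
}

# Example use (commented out; URL is an assumption):
# retry 10 3 curl -fsS -o /dev/null http://localhost:18000/state/health
```

`curl -f` exits non-zero on HTTP error responses, which is what lets `retry` treat a 5xx answer as "not ready yet".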

ops-bridge tunnel setup (workstation)

After S2 is up, establish the persistent tunnels from the workstation:

bridge up state-hub-coulombcore       # state-hub HTTP (port 18000 remote)
bridge up state-hub-mcp-coulombcore   # state-hub MCP (port 18001 remote)
bridge up k3s-api-coulombcore         # k3s API (port 16443 local)

Verify: bridge status shows all three connected.
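
The three `bridge up` calls can be wrapped so that a failing tunnel stops the sequence instead of passing unnoticed. This sketch assumes `bridge up` exits non-zero on failure, which the runbook does not state; `tunnels_up` is a hypothetical name.

```shell
# Brings up the three CoulombCore tunnels in order and stops at the
# first one that fails. Assumes `bridge up <name>` signals failure
# through its exit status.
tunnels_up() {
  local t
  for t in state-hub-coulombcore state-hub-mcp-coulombcore k3s-api-coulombcore; do
    bridge up "$t" || { echo "failed to bring up $t" >&2; return 1; }
  done
}

# Example use (commented out):
# tunnels_up && bridge status
```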

Recovery pointers

- Node overload / SSH unresponsive: See the-custodian/ops/runbooks/gitea-coulombcore.md Issue #3
- Incident report: the-custodian/ops/incidents/2026-03-26-coulombcore-runaway-agent-overload.md
- Cluster backup restore: railiance-cluster/tools/cmd/railiance-restore-s2
- Gitea SSH not working: Check that the gitea-ssh-nodeport service exists: kubectl get svc -n default gitea-ssh-nodeport

Layer dependency chain

S1 (infra) → S2 (cluster) → S3 (platform) → S4 (enablement) → S5 (workloads)

Each layer must be fully converged and verified before starting the next. Never configure S2 concerns from S3+ repos (ADR-003 boundary rule).
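
The "converge, verify, then move on" rule can be enforced in a wrapper that stops at the first failing step. The following is a minimal sketch with a hypothetical `run_in_order` helper. The example steps reuse the make targets from S2 and S3 and assume the repos are checked out under `$HOME` on CoulombCore.

```shell
# Runs each argument as a shell command, in order, and stops at the
# first failure, reporting which step broke the chain.
run_in_order() {
  local step
  for step in "$@"; do
    echo "==> $step"
    sh -c "$step" || { echo "stopped at: $step" >&2; return 1; }
  done
}

# Example use on CoulombCore (commented out):
# run_in_order \
#   'make -C "$HOME/railiance-cluster" converge' \
#   'make -C "$HOME/railiance-cluster" smoke' \
#   'make -C "$HOME/railiance-platform" db-deploy' \
#   'make -C "$HOME/railiance-platform" db-status'
```

Because a failed smoke or status target aborts the run, a later layer never starts on top of an unverified one.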
