From 2d7e0101bc0fe3912bccccfc3e6329407f409a34 Mon Sep 17 00:00:00 2001
From: tegwick
Date: Fri, 27 Mar 2026 02:28:51 +0100
Subject: [PATCH] feat(infra): UFW k3s routing + full deploy runbook
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- base role: allow UFW routing direction (required for k3s flannel pod
  networking to function across nodes)
- docs/deploy-stack.md: full S1→S5 ordered deploy runbook with
  pre-conditions checklist and layer-by-layer steps

Co-Authored-By: Claude Sonnet 4.6
---
 ansible/roles/base/tasks/main.yml |   7 +-
 docs/deploy-stack.md              | 152 ++++++++++++++++++++++++++++++
 2 files changed, 158 insertions(+), 1 deletion(-)
 create mode 100644 docs/deploy-stack.md

diff --git a/ansible/roles/base/tasks/main.yml b/ansible/roles/base/tasks/main.yml
index af6a177..d2abbee 100644
--- a/ansible/roles/base/tasks/main.yml
+++ b/ansible/roles/base/tasks/main.yml
@@ -30,12 +30,17 @@
     name: ssh
     state: restarted
 
-- name: Configure UFW
+- name: Configure UFW default incoming policy
   ansible.builtin.ufw:
     state: enabled
     policy: deny
     direction: incoming
 
+- name: Allow UFW routing (required for k3s flannel pod networking)
+  ansible.builtin.ufw:
+    policy: allow
+    direction: routed
+
 - name: Allow SSH in UFW
   ansible.builtin.ufw:
     rule: allow
diff --git a/docs/deploy-stack.md b/docs/deploy-stack.md
new file mode 100644
index 0000000..23350e5
--- /dev/null
+++ b/docs/deploy-stack.md
@@ -0,0 +1,152 @@
+# Railiance Stack — Full Deploy Runbook
+
+> **When to use this:** Starting from a bare server (post-OS install) or rebuilding
+> after a catastrophic failure. For day-to-day operations use the individual layer
+> repos. See ADR-003 for layer boundaries and ADR-004 for connectivity posture.
+
+## Pre-conditions checklist
+
+Before starting, verify you have:
+
+- [ ] SSH access to the target server (COULOMBCORE: 92.205.130.254, user: tegwick, key: `~/.ssh/id_ops`)
+- [ ] SOPS age private key available (`~/.config/sops/age/keys.txt` or `SOPS_AGE_KEY` env)
+- [ ] ops-bridge running on the workstation (needed for state hub MCP): `make mcp-http` in `~/the-custodian/state-hub/`
+- [ ] Gitea accessible (needed for git pull on remote): SSH via `gitea-remote:coulomb/.git`
+- [ ] If re-provisioning from scratch: Hetzner/HostEurope API credentials decryptable via SOPS
+
+---
+
+## S1 — Infrastructure Substrate (`railiance-infra`)
+
+```bash
+# On workstation
+cd ~/railiance-infra
+
+# Provision server (skip if server already exists)
+make tf-plan   # review Terraform plan
+make tf-apply  # create/update server
+
+# Converge OS baseline
+# NOTE: Ansible runs locally on CoulombCore (workstation has no Ansible installed)
+ssh -i ~/.ssh/id_ops tegwick@92.205.130.254 \
+  'cd ~/railiance-infra && git pull && \
+   cd ansible && ansible-playbook playbooks/bootstrap.yaml \
+     -c local --become -l CoulombCore'
+
+# Verify OS baseline
+make verify
+```
+
+**Checkpoint:** UFW active, fail2ban running, swap enabled, nproc limits in place,
+SOPS/age installed.
+
+---
+
+## S2 — Cluster Runtime (`railiance-cluster`)
+
+```bash
+# On CoulombCore (SSH in first)
+ssh -i ~/.ssh/id_ops tegwick@92.205.130.254
+
+cd ~/railiance-cluster
+make converge  # installs k3s, Helm, cert-manager, nginx ingress, cnpg operator
+make smoke     # runs cluster health assertions
+```
+
+**Checkpoint:** k3s running, Helm available, cert-manager and nginx-ingress pods Running,
+cnpg-system namespace active.
+
+---
+
+## S3 — Platform Services (`railiance-platform`)
+
+```bash
+# On CoulombCore (kubectl available after S2)
+cd ~/railiance-platform && git pull
+
+# Create Gitea DB credentials secret (one-time; do NOT commit plaintext)
+kubectl create secret generic gitea-db-credentials \
+  --namespace databases \
+  --from-literal=username=gitea \
+  --from-literal=password=
+
+# Deploy cnpg Gitea database cluster
+make db-deploy
+
+# Wait for cluster to be healthy (~60s)
+make db-status
+
+# Deploy Valkey cache (standalone, not as Gitea subchart)
+# Requires: helm/valkey-values.sops.yaml with encrypted password
+make valkey-deploy
+make valkey-status
+```
+
+**Checkpoint:** `kubectl get cluster -n databases` shows `gitea-db` healthy;
+Valkey pod Running in platform namespace.
+
+---
+
+## S4 — Developer Enablement (`railiance-enablement`)
+
+No formal workplan yet. ArgoCD is currently deployed at cluster level (S2 boundary
+violation, tracked in RAIL-HO-WP-0004). No S4-specific steps required at this time.
+
+---
+
+## S5 — Workloads & Experience (`railiance-apps`)
+
+```bash
+# On CoulombCore
+cd ~/railiance-apps && git pull
+
+# Deploy Gitea (git hosting)
+# Requires: helm/gitea-values.sops.yaml with encrypted values
+make gitea-deploy
+make gitea-status
+
+# Deploy state-hub (Custodian cognitive infrastructure)
+# See RAIL-HO-WP-0004-T09 for full steps
+make state-hub-deploy  # (not yet implemented — pending T09)
+
+# Deploy activity-core
+# See RAIL-HO-WP-0004-T10 for full steps
+make activity-core-deploy  # (not yet implemented — pending T10)
+```
+
+**Checkpoint:** Gitea accessible and all repos cloneable via SSH; state-hub
+`/state/health` returns 200.
+
+---
+
+## ops-bridge tunnel setup (workstation)
+
+After S2 is up, establish the persistent tunnels from the workstation:
+
+```bash
+bridge up state-hub-coulombcore      # state-hub HTTP (port 18000 remote)
+bridge up state-hub-mcp-coulombcore  # state-hub MCP (port 18001 remote)
+bridge up k3s-api-coulombcore        # k3s API (port 16443 local)
+```
+
+Verify: `bridge status` shows all three connected.
+
+---
+
+## Recovery pointers
+
+- **Node overload / SSH unresponsive:** See `the-custodian/ops/runbooks/gitea-coulombcore.md` Issue #3
+- **Incident report:** `the-custodian/ops/incidents/2026-03-26-coulombcore-runaway-agent-overload.md`
+- **Cluster backup restore:** `railiance-cluster/tools/cmd/railiance-restore-s2`
+- **Gitea SSH not working:** Check `gitea-ssh-nodeport` service exists: `kubectl get svc -n default gitea-ssh-nodeport`
+
+---
+
+## Layer dependency chain
+
+```
+S1 (infra) → S2 (cluster) → S3 (platform) → S4 (enablement) → S5 (workloads)
+```
+
+Each layer must be fully converged and verified before starting the next.
+Never configure S2 concerns from S3+ repos (ADR-003 boundary rule).
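
Note on the layer dependency rule: the "converge and verify before starting the next" ordering could be enforced by a small wrapper along these lines. This is a minimal sketch, not part of the patch. `converge_layer` and `verify_layer` are hypothetical placeholders standing in for each repo's deploy steps and checkpoint checks; the runbook defines no such commands.

```bash
#!/usr/bin/env bash
# Sketch only: converge_layer and verify_layer are hypothetical placeholders,
# not targets or scripts from the Railiance repos.

converge_layer() {  # placeholder: would run the layer's deploy steps
  echo "converge $1"
}

verify_layer() {    # placeholder: would run the layer's checkpoint checks
  echo "verify $1"
}

# Walk the layers in dependency order and stop at the first layer that fails
# to converge or verify, so later layers never start on a broken base.
run_layers() {
  local layer
  for layer in "$@"; do
    if ! converge_layer "$layer" || ! verify_layer "$layer"; then
      echo "stopped at $layer" >&2
      return 1
    fi
  done
}

run_layers S1 S2 S3 S4 S5
```

Stopping at the first failed layer, rather than continuing and reporting at the end, keeps an S3+ deploy from running against a cluster that has not passed its S2 checkpoint.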