feat(infra): UFW k3s routing + full deploy runbook

- base role: allow UFW routing direction (required for k3s flannel pod
  networking to function across nodes)
- docs/deploy-stack.md: full S1→S5 ordered deploy runbook with
  pre-conditions checklist and layer-by-layer steps

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -30,12 +30,17 @@
     name: ssh
     state: restarted
 
-- name: Configure UFW
+- name: Configure UFW default incoming policy
   ansible.builtin.ufw:
     state: enabled
     policy: deny
     direction: incoming
 
+- name: Allow UFW routing (required for k3s flannel pod networking)
+  ansible.builtin.ufw:
+    policy: allow
+    direction: routed
+
 - name: Allow SSH in UFW
   ansible.builtin.ufw:
     rule: allow
152
docs/deploy-stack.md
Normal file
@@ -0,0 +1,152 @@
# Railiance Stack — Full Deploy Runbook

> **When to use this:** Starting from a bare server (post-OS install) or rebuilding
> after a catastrophic failure. For day-to-day operations use the individual layer
> repos. See ADR-003 for layer boundaries and ADR-004 for connectivity posture.

## Pre-conditions checklist

Before starting, verify you have:

- [ ] SSH access to the target server (COULOMBCORE: 92.205.130.254, user: tegwick, key: `~/.ssh/id_ops`)
- [ ] SOPS age private key available (`~/.config/sops/age/keys.txt` or `SOPS_AGE_KEY` env)
- [ ] ops-bridge running on the workstation (needed for state hub MCP): `make mcp-http` in `~/the-custodian/state-hub/`
- [ ] Gitea accessible (needed for git pull on remote): SSH via `gitea-remote:coulomb/<repo>.git`
- [ ] If re-provisioning from scratch: Hetzner/HostEurope API credentials decryptable via SOPS
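
The two file-based items in this checklist (the SSH key and the SOPS age key) can be checked mechanically before starting. The following is a minimal sketch, not part of any repo: the function names `have_ssh_key`, `have_age_key`, and `preflight` are hypothetical, and only the paths and the `SOPS_AGE_KEY` variable come from the checklist above. It does not cover ops-bridge, Gitea, or the provider API credentials.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight helper; function names are illustrative only.

# Succeeds if the SSH key exists and is readable (default: ~/.ssh/id_ops).
have_ssh_key() {
  [ -r "${1:-$HOME/.ssh/id_ops}" ]
}

# Succeeds if a SOPS age key is available, either through the SOPS_AGE_KEY
# environment variable or as a readable key file.
have_age_key() {
  [ -n "${SOPS_AGE_KEY:-}" ] || [ -r "${1:-$HOME/.config/sops/age/keys.txt}" ]
}

# Prints one line per check and returns non-zero if any check failed.
preflight() {
  local rc=0
  if have_ssh_key; then echo "ok: ssh key"; else echo "MISSING: ssh key"; rc=1; fi
  if have_age_key; then echo "ok: age key"; else echo "MISSING: age key"; rc=1; fi
  return "$rc"
}

# Usage:
#   preflight || echo "fix the items above before starting S1"
```

Keeping each check in its own function makes it easy to add further checks later without changing how `preflight` reports failures.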

---

## S1 — Infrastructure Substrate (`railiance-infra`)

```bash
# On workstation
cd ~/railiance-infra

# Provision server (skip if server already exists)
make tf-plan    # review Terraform plan
make tf-apply   # create/update server

# Converge OS baseline
# NOTE: Ansible runs locally on CoulombCore (workstation has no Ansible installed)
ssh -i ~/.ssh/id_ops tegwick@92.205.130.254 \
  'cd ~/railiance-infra && git pull && \
   cd ansible && ansible-playbook playbooks/bootstrap.yaml \
   -c local --become -l CoulombCore'

# Verify OS baseline
make verify
```

**Checkpoint:** UFW active, fail2ban running, swap enabled, nproc limits in place,
SOPS/age installed.
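
For a manual spot check of this checkpoint on CoulombCore, something like the sketch below could be used. It is an illustration only and makes no claim about what `make verify` runs: `ufw_default_policy` and `s1_checkpoint` are hypothetical names, the parsing assumes the usual `Default: deny (incoming), allow (outgoing), allow (routed)` line printed by `ufw status verbose`, and the nproc limits are not checked.

```bash
#!/usr/bin/env bash
# Hypothetical S1 spot check; not the implementation of `make verify`.

# Reads `ufw status verbose` output on stdin and prints the default policy
# for one direction (incoming, outgoing, or routed).
ufw_default_policy() {
  local direction="$1"
  sed -n "s/^Default:.*[ ,]\([a-z]*\) (${direction}).*/\1/p"
}

# Checks UFW state, the routed policy needed by flannel, fail2ban, swap,
# and the presence of the sops and age binaries.
s1_checkpoint() {
  local status
  status="$(sudo ufw status verbose)" || return 1
  grep -q '^Status: active' <<<"$status" || { echo "UFW is not active"; return 1; }
  [ "$(ufw_default_policy routed <<<"$status")" = "allow" ] \
    || { echo "UFW routed policy is not allow"; return 1; }
  systemctl is-active --quiet fail2ban || { echo "fail2ban is not running"; return 1; }
  [ -n "$(swapon --show --noheadings)" ] || { echo "no swap enabled"; return 1; }
  command -v sops >/dev/null || { echo "sops not installed"; return 1; }
  command -v age >/dev/null || { echo "age not installed"; return 1; }
  echo "S1 checkpoint ok"
}

# Usage (on CoulombCore):
#   s1_checkpoint
```

The routed policy is checked explicitly because it is the setting the base role now changes for k3s flannel pod networking.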

---

## S2 — Cluster Runtime (`railiance-cluster`)

```bash
# On CoulombCore (SSH in first)
ssh -i ~/.ssh/id_ops tegwick@92.205.130.254

cd ~/railiance-cluster
make converge   # installs k3s, Helm, cert-manager, nginx ingress, cnpg operator
make smoke      # runs cluster health assertions
```

**Checkpoint:** k3s running, Helm available, cert-manager and nginx-ingress pods Running,
cnpg-system namespace active.
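
If `make smoke` fails, or you want to confirm the checkpoint by hand, a sketch along these lines may help. It is not what `make smoke` does. The helper names are hypothetical, and the namespace names `cert-manager` and `ingress-nginx` are assumptions (the common chart defaults); only `cnpg-system` is named in this runbook.

```bash
#!/usr/bin/env bash
# Hypothetical S2 spot check; namespace names below are ASSUMED defaults.

# Reads `kubectl get pods --no-headers` output on stdin and prints the names
# of pods whose STATUS column is neither Running nor Completed.
pods_not_ready() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}

s2_checkpoint() {
  local ns out bad
  kubectl get nodes || return 1
  helm version --short || return 1
  for ns in cert-manager ingress-nginx; do   # ASSUMPTION: adjust to your cluster
    out="$(kubectl get pods -n "$ns" --no-headers)" || return 1
    [ -n "$out" ] || { echo "no pods found in $ns"; return 1; }
    bad="$(pods_not_ready <<<"$out")"
    [ -z "$bad" ] || { echo "pods not Running in $ns: $bad"; return 1; }
  done
  kubectl get namespace cnpg-system || return 1
  echo "S2 checkpoint ok"
}

# Usage (on CoulombCore, after `make converge`):
#   s2_checkpoint
```

An empty pod list is treated as a failure so that a mistyped namespace does not pass silently.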

---

## S3 — Platform Services (`railiance-platform`)

```bash
# On CoulombCore (kubectl available after S2)
cd ~/railiance-platform && git pull

# Create Gitea DB credentials secret (one-time; do NOT commit plaintext)
# Replace <GITEA_DB_PASSWORD> with the real value. The placeholder is quoted
# so the shell does not parse < and > as redirections.
kubectl create secret generic gitea-db-credentials \
  --namespace databases \
  --from-literal=username=gitea \
  --from-literal=password='<GITEA_DB_PASSWORD>'

# Deploy cnpg Gitea database cluster
make db-deploy

# Wait for cluster to be healthy (~60s)
make db-status

# Deploy Valkey cache (standalone, not as Gitea subchart)
# Requires: helm/valkey-values.sops.yaml with encrypted password
make valkey-deploy
make valkey-status
```

**Checkpoint:** `kubectl get cluster -n databases` shows `gitea-db` healthy;
Valkey pod Running in platform namespace.
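
The "~60s" wait for the database cluster can be made explicit with a small polling helper instead of re-running `make db-status` by hand. The following is a minimal sketch, not part of the repos: `wait_for` and `gitea_db_ready` are hypothetical names, and the probe assumes the installed CloudNativePG version exposes a `Ready` condition on the `Cluster` resource.

```bash
#!/usr/bin/env bash
# Hypothetical polling helper for the S3 database wait.

# Runs a command until it succeeds, at most <attempts> times, sleeping
# <delay> seconds between tries. On failure, returns the last exit status.
wait_for() {
  local attempts="$1" delay="$2" i rc=1
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    rc=$?
    ((i < attempts)) && sleep "$delay"
  done
  return "$rc"
}

# ASSUMPTION: the cnpg Cluster resource reports a condition of type Ready.
gitea_db_ready() {
  [ "$(kubectl get cluster gitea-db -n databases \
        -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')" = "True" ]
}

# Usage: poll every 5 s and give up after 24 attempts (about two minutes).
#   wait_for 24 5 gitea_db_ready && make valkey-deploy
```

A bounded retry count keeps a stuck database from blocking the runbook indefinitely; if the loop gives up, inspect the cluster with `make db-status`.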

---

## S4 — Developer Enablement (`railiance-enablement`)

No formal workplan yet. ArgoCD is currently deployed at cluster level (S2 boundary
violation, tracked in RAIL-HO-WP-0004). No S4-specific steps required at this time.

---

## S5 — Workloads & Experience (`railiance-apps`)

```bash
# On CoulombCore
cd ~/railiance-apps && git pull

# Deploy Gitea (git hosting)
# Requires: helm/gitea-values.sops.yaml with encrypted values
make gitea-deploy
make gitea-status

# Deploy state-hub (Custodian cognitive infrastructure)
# See RAIL-HO-WP-0004-T09 for full steps
make state-hub-deploy       # (not yet implemented — pending T09)

# Deploy activity-core
# See RAIL-HO-WP-0004-T10 for full steps
make activity-core-deploy   # (not yet implemented — pending T10)
```

**Checkpoint:** Gitea accessible and all repos cloneable via SSH; state-hub
`/state/health` returns 200.
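
The state-hub half of this checkpoint can be scripted once the service is deployed. The sketch below is illustrative only: `http_status` and `state_hub_healthy` are hypothetical names, and the base URL is deliberately left as a placeholder because this runbook does not state where state-hub is reachable from the machine running the check.

```bash
#!/usr/bin/env bash
# Hypothetical health probe for the S5 checkpoint.

# Prints the HTTP status code returned for a URL (curl prints 000 when the
# request itself fails, for example on a refused connection).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# Succeeds only if <base-url>/state/health answers with HTTP 200.
state_hub_healthy() {
  [ "$(http_status "${1%/}/state/health")" = "200" ]
}

# Usage (STATE_HUB_URL is a placeholder you must set yourself):
#   state_hub_healthy "$STATE_HUB_URL" && echo "state-hub healthy"
```

Comparing the status code rather than curl's exit status matters here, because curl exits 0 for a 500 response unless told otherwise.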

---

## ops-bridge tunnel setup (workstation)

After S2 is up, establish the persistent tunnels from the workstation:

```bash
bridge up state-hub-coulombcore       # state-hub HTTP (port 18000 remote)
bridge up state-hub-mcp-coulombcore   # state-hub MCP (port 18001 remote)
bridge up k3s-api-coulombcore         # k3s API (port 16443 local)
```

Verify: `bridge status` shows all three connected.
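
Beyond `bridge status`, you can confirm that the k3s API tunnel actually accepts connections on the local port listed above. This is a sketch under assumptions: `port_open` is a hypothetical helper, it relies on bash's `/dev/tcp` pseudo-device and on the coreutils `timeout` command being present on the workstation, and it only proves that something is listening, not that the API behind it is healthy.

```bash
#!/usr/bin/env bash
# Hypothetical TCP reachability check for a forwarded local port.

# Succeeds if a TCP connection to <host>:<port> can be opened within 3 s.
port_open() {
  local host="$1" port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Usage (16443 is the local k3s API port named in this runbook):
#   port_open 127.0.0.1 16443 && echo "k3s API tunnel reachable"
```

Only the k3s API port is checked because it is the only tunnel this section describes with a local port.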

---

## Recovery pointers

- **Node overload / SSH unresponsive:** See `the-custodian/ops/runbooks/gitea-coulombcore.md` Issue #3
- **Incident report:** `the-custodian/ops/incidents/2026-03-26-coulombcore-runaway-agent-overload.md`
- **Cluster backup restore:** `railiance-cluster/tools/cmd/railiance-restore-s2`
- **Gitea SSH not working:** Check `gitea-ssh-nodeport` service exists: `kubectl get svc -n default gitea-ssh-nodeport`

---

## Layer dependency chain

```
S1 (infra) → S2 (cluster) → S3 (platform) → S4 (enablement) → S5 (workloads)
```

Each layer must be fully converged and verified before starting the next.
Never configure S2 concerns from S3+ repos (ADR-003 boundary rule).
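
The "verify before starting the next layer" rule can be expressed as a simple gate that runs per-layer checks in order and stops at the first failure. This is a minimal sketch: `run_in_order` is a hypothetical helper, and the check names in the usage line stand in for whatever verification you trust for each layer (for example `make verify` or `make smoke`).

```bash
#!/usr/bin/env bash
# Hypothetical ordering gate for the S1 → S5 chain.

# Runs the given check commands in order. Prints "ok: <name>" for each check
# that passes and stops at the first failure with a non-zero status.
run_in_order() {
  local check
  for check in "$@"; do
    if "$check"; then
      echo "ok: $check"
    else
      echo "blocked at: $check" >&2
      return 1
    fi
  done
}

# Usage (s1_ok, s2_ok, s3_ok are placeholder functions you define yourself):
#   run_in_order s1_ok s2_ok s3_ok && echo "safe to start S5"
```

Stopping at the first failing layer mirrors the dependency chain: later checks are not meaningful until the layer beneath them has converged.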