---
id: IHUB-WP-0018
type: workplan
title: "Railiance01 Deployment — Production Operations Scaffold"
domain: inter_hub
repo: inter-hub
status: finished
owner: custodian
topic_slug: inter_hub
created: "2026-04-29"
updated: "2026-06-14"
depends_on: IHUB-WP-0015
state_hub_workstream_id: "080d841a-3acd-4adf-b684-2d1890a5e986"
---
# IHUB-WP-0018 — Railiance01 Deployment: Production Operations Scaffold
## Goal
Deploy inter-hub to the Railiance01 Kubernetes cluster with fully automatic
deployment, SOPS-encrypted secrets, Traefik ingress, PostgreSQL HA, and a
Gitea Actions CI/CD pipeline. After this workplan, every push to `main`
automatically builds an OCI container image on haskelseed, pushes it to the
Railiance container registry, and deploys it — with automatic restart on node
reboot guaranteed by K3s.
## Background
inter-hub v0.2.0-alpha.1 is running on haskelseed (Alpine) via RunDevServer
and socat. That setup is a development convenience, not a production operations
scaffold. The target is the Railiance01 K3s cluster, which has:
- K3s (single-node for now; ThreePhoenix HA cluster is in progress)
- Traefik ingress with TLS
- PostgreSQL HA (repmgr + pgpool) managed by railiance-platform
- SOPS/age secret management
- Gitea with built-in container registry (or separate registry service)
- Staged Promotion Lifecycle CLI (`railiance run / deploy / promote / rollback`)
**Key constraint:** This workplan depends on Railiance01 K3s being operational.
Gate R3 verifies cluster readiness before any deployment work begins — if K3s
or the container registry is not ready, this workplan blocks there and the
cluster work must be completed first.
**IHP specifics:** IHP DevServer is a development server. For production we
build the IHP binary via `nix build` (which produces a self-contained binary)
and wrap it in a minimal OCI image using Nix's `dockerTools.buildImage`. The
app serves HTTP on port 8000; the socat workaround is not needed in Kubernetes
since Traefik routes directly to the pod's port.
## Architecture
```
git push → Gitea Actions
→ SSH to haskelseed: nix build → docker load → docker push registry/inter-hub:$SHA
→ helm upgrade inter-hub railiance-apps/helm/inter-hub
→ Deployment (1 replica): inter-hub:$SHA + env from Secrets
→ Service (ClusterIP :8000)
→ Ingress (Traefik): hub.coulomb.social → Service
→ PersistentVolumeClaim: /app/static (generated CSS/JS)
→ PostgreSQL: database 'interhub' on railiance-platform HA cluster
```
## Close-out Audit - 2026-06-04
WSJF triage flagged this workplan as a close-out candidate because State Hub had
no indexed task rows for it. The deployment work is not complete; this file now
contains explicit task blocks so the hub can track the remaining Railiance01
deployment work instead of treating the workplan as empty.
## Deployment Review - 2026-06-05
Review against the current repo and public Railiance endpoint shows the
deployment scaffold is partially implemented but the live deployment is behind
`origin/main`.
- `origin/main` is at `a3d980c`, which includes the completed ops-hub bootstrap
API work from `IHUB-WP-0019`.
- `https://hub.coulomb.social/` returns 200 and serves inter-hub.
- The public OpenAPI only lists the older v2 endpoints; it does not include
`/hubs`, `/hub-capability-manifests`, `/api-consumers`, or `/policy-scopes`.
- Unauthenticated `/api/v2/hubs` returns 404 publicly, while current source
should route it and return 401. This means ops-hub bootstrap cannot run
against production until the current image is deployed.
- The registry endpoint returns the expected unauthenticated `/v2/` 401
challenge, but this workspace does not have `kubectl`, so R3 cluster readiness
cannot be fully verified from here.
## Tasks
### R1 - Add OCI image build to flake.nix
```task
id: IHUB-WP-0018-T01
status: done
priority: high
state_hub_task_id: "27420bd7-0f70-4793-8805-393d8d5cacfd"
```
Add a `packages.docker` output to `flake.nix` using `pkgs.dockerTools.buildLayeredImage`.
The image wraps the IHP production binary produced by `nix build .#default`.
```nix
packages.docker = pkgs.dockerTools.buildLayeredImage {
  name = "inter-hub";
  tag = "latest";
  contents = [ self.packages.${system}.default pkgs.cacert ];
  config = {
    Cmd = [ "/bin/inter-hub" ];
    ExposedPorts = { "8000/tcp" = {}; };
    Env = [
      "PORT=8000"
      "IHP_ENV=Production"
    ];
  };
};
```
Test locally on haskelseed:
```bash
nix build .#docker
docker load < result
docker run --rm -p 8000:8000 -e DATABASE_URL=... -e IHP_SESSION_SECRET=... inter-hub:latest
```
**Note:** The first build pulls the full Haskell binary closure (~2 GB);
subsequent builds are incremental thanks to layer caching. The build must run
on haskelseed, the only machine with the Nix store populated for GHC 9.10.3.
**Implementation note (2026-06-05):** `flake.nix` exposes `packages.docker =
config.packages.unoptimized-docker-image`, the IHP-provided production OCI
image used by the Railiance runbook. The original `buildLayeredImage` sketch is
superseded by that IHP image path.
### R2 — Verify container runs correctly
```task
id: IHUB-WP-0018-T02
status: done
priority: high
state_hub_task_id: "5ab45e4e-16bc-4feb-8b1b-e8eeb05bf39a"
```
On haskelseed, run the container image against the existing `interhub` database.
Confirm:
- `curl http://localhost:8000/` returns 200 (LandingAction)
- `curl http://localhost:8000/api/v2/hubs` returns 401 (auth required)
- Static assets load (Tailwind CSS present in image)
- Container exits cleanly on SIGTERM
If Tailwind CSS output (`static/app.css`) is not bundled into the Nix binary
closure, add a pre-build step: run tailwindcss and include `static/` in the
image via `dockerTools.buildLayeredImage` `contents` or a NixOS module.
### R3 — Verify Railiance01 readiness (gate)
```task
id: IHUB-WP-0018-T03
status: done
priority: high
state_hub_task_id: "79b5cf2c-3a5b-4b4b-8f84-f635cb6891c1"
```
This is a dependency gate. Before proceeding, confirm:
```bash
# From CoulombCore (execution origin):
kubectl get nodes # must show Ready
kubectl get pods -n kube-system | grep traefik # Traefik must be running
kubectl get pods -n railiance-platform # PostgreSQL HA pods
```
Also confirm:
- The container registry is reachable from haskelseed (verify push access)
- The registry address is known (e.g., `registry.coulomb.social` or `gitea.coulomb.social`)
- The SOPS/age key is present on CoulombCore at `~/.config/sops/age/keys.txt`
If any check fails, block here and open the relevant Railiance workstream.
Do not proceed until all checks pass.
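The node check in this gate can also be evaluated mechanically rather than by eye. The following is a minimal sketch, assuming the input is the output of `kubectl get nodes -o json`; the function name `all_nodes_ready` is hypothetical and not part of the Railiance tooling.

```python
import json


def all_nodes_ready(nodes_json: str) -> bool:
    """Return True only if every node reports condition Ready=True.

    Expects the output of `kubectl get nodes -o json` (a NodeList).
    An empty node list counts as not ready.
    """
    items = json.loads(nodes_json).get("items", [])
    if not items:
        return False
    for node in items:
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            return False
    return True
```

On CoulombCore the input could come from `subprocess.run(["kubectl", "get", "nodes", "-o", "json"], capture_output=True, text=True).stdout`. The Traefik and PostgreSQL pod checks would need their own, similar parsing.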
**Review note (2026-06-05):** Public smoke probes show
`https://hub.coulomb.social/` returning 200 and the Gitea registry `/v2/`
endpoint returning the expected unauthenticated 401 challenge. Full R3 remains
blocked from this workspace because `kubectl` is not available here, and the
live app is not serving the current `origin/main` v2 bootstrap routes.
**Recovery note (2026-06-14):** Re-established the haskelseed ops-bridge path
and verified the runner substrate before deployment. `make runner-status` in
`railiance-forge` confirmed `act_runner` is registered to
`https://gitea.coulomb.social`, running under OpenRC, and has the expected
self-hosted labels and build/deploy tools. The K3s API path, Helm deploy path,
and Gitea registry host were exercised successfully by the production rollout.
### R4 — Provision inter-hub database on railiance-platform
```task
id: IHUB-WP-0018-T04
status: done
priority: high
state_hub_task_id: "c937cf36-3850-4ab3-aa83-2d846e1a378e"
```
On the PostgreSQL HA cluster, create the inter-hub database and user:
```sql
CREATE USER interhub WITH PASSWORD '<generated>';
CREATE DATABASE interhub OWNER interhub;
GRANT ALL PRIVILEGES ON DATABASE interhub TO interhub;
```
Run schema migration (IHP migrations) as part of the first deployment via an
init container or a manual `migrate` run inside the pod. Document the
migration procedure in `deploy/railiance/RUNBOOK.md`.
**Recovery note (2026-06-14):** Bootstrapped the production database manually on
the Railiance PostgreSQL cluster: role `interhub`, database `interhub`, schema
ownership, and privileges were created/updated. The running deployment now uses
that database through the `inter-hub-env` Kubernetes Secret.
### R5 — SOPS-encrypted secrets
```task
id: IHUB-WP-0018-T05
status: done
priority: high
state_hub_task_id: "926f82d1-15cd-425d-8a41-3d6b51c07f0b"
```
Create `deploy/railiance/secrets/inter-hub.env.sops.yaml` with:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: inter-hub-env
  namespace: inter-hub
type: Opaque
stringData:
  DATABASE_URL: postgresql://interhub:<pass>@net-kingdom-pg-rw.databases.svc.cluster.local:5432/interhub?sslmode=disable
  IHP_SESSION_SECRET: <64-char-hex>
  IHP_BASEURL: https://hub.coulomb.social
  PORT: "8000"
  IHP_ENV: Production
```
Encrypt with the age key:
```bash
sops --encrypt \
--age age1aq8twfd78wvpra0had8cezcnj96tj4q0068edrz5jez8d6xwmflqdepsh4 \
/tmp/inter-hub-env.yaml > deploy/railiance/secrets/inter-hub.env.sops.yaml
```
Commit only the encrypted file. Apply it with
`sops -d deploy/railiance/secrets/inter-hub.env.sops.yaml | kubectl apply -f -`.
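The `<pass>` and `<64-char-hex>` placeholders in the manifest need generated values. A minimal sketch using Python's standard `secrets` module follows; the helper names are hypothetical, and the default host and database name are simply the ones shown in the manifest.

```python
import secrets
from urllib.parse import quote


def new_session_secret() -> str:
    """Return 64 hex characters (32 random bytes), matching <64-char-hex>."""
    return secrets.token_hex(32)


def database_url(password: str,
                 user: str = "interhub",
                 host: str = "net-kingdom-pg-rw.databases.svc.cluster.local",
                 port: int = 5432,
                 dbname: str = "interhub") -> str:
    """Assemble DATABASE_URL, percent-encoding the password so that
    characters such as '@' or '/' cannot break the URL structure."""
    return (f"postgresql://{user}:{quote(password, safe='')}"
            f"@{host}:{port}/{dbname}?sslmode=disable")
```

The generated values belong only in the temporary plaintext input that is passed to `sops --encrypt`, never in a committed file or a terminal log.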
**Recovery note (2026-06-14):** Runtime secrets were bootstrapped manually in
Kubernetes so production could deploy safely. This task remains in progress
until the durable SOPS-encrypted source for `DATABASE_URL`, `IHP_SESSION_SECRET`,
and related runtime env is committed and wired into the deploy path.
**Progress note (2026-06-14):** Added repo root `.sops.yaml`, plaintext
guardrails under `deploy/railiance/secrets/`, an example Secret manifest, and
`k8s-secret-json-to-sops-input.py` to convert the live Kubernetes Secret into a
SOPS-ready manifest without printing values. At that point the encrypted source
file was still pending because local `sops` tooling was not available.
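Conceptually, that conversion amounts to base64-decoding the `data` map of the live Secret into `stringData`. The sketch below illustrates the idea only: it is not the repository's `k8s-secret-json-to-sops-input.py`, and the function name and the choice to keep only `name` and `namespace` from the metadata are assumptions.

```python
import base64
import json


def secret_json_to_sops_input(secret_json: str) -> dict:
    """Turn `kubectl get secret <name> -o json` output into a clean manifest.

    Kubernetes stores Secret values base64-encoded under `data`; they are
    decoded into `stringData`. Server-managed metadata (uid, resourceVersion,
    timestamps) is dropped by keeping only name and namespace.
    """
    live = json.loads(secret_json)
    meta = live.get("metadata", {})
    string_data = {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in live.get("data", {}).items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": meta.get("name"), "namespace": meta.get("namespace")},
        "type": live.get("type", "Opaque"),
        "stringData": string_data,
    }
```

A caller would write the returned manifest to a temporary file outside the repository and pass that file to `sops --encrypt`, without ever printing it.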
**Completion note (2026-06-14):** Created
`deploy/railiance/secrets/inter-hub.env.sops.yaml` from the live
`inter-hub/inter-hub-env` Kubernetes Secret using temporary `sops` v3.13.1 and
the shared Railiance age recipient. Verified the file is SOPS-encrypted, parses
as YAML, leaves only non-secret metadata reviewable, and does not contain the
checked plaintext runtime markers. Decryption/apply verification remains a
custody-backed operator capability because the private age identity is not
present in the normal workstation or haskelseed shell.
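A plaintext-marker check of this kind can be approximated with a plain text scan. The following is a minimal sketch, assuming SOPS's usual output shape (a top-level `sops:` metadata key and `ENC[...]` values); the marker list and the function name are hypothetical, not the checks that were actually run.

```python
# Hypothetical markers: strings that should never appear in the encrypted file.
PLAINTEXT_MARKERS = ("postgresql://", "hub.coulomb.social")


def looks_sops_encrypted(text: str, markers=PLAINTEXT_MARKERS) -> bool:
    """Heuristic: SOPS metadata present, ENC[ values present, no markers."""
    has_sops_block = any(line.startswith("sops:") for line in text.splitlines())
    has_encrypted_value = "ENC[" in text
    leaks = [m for m in markers if m in text]
    return has_sops_block and has_encrypted_value and not leaks
```

This is only a guardrail against committing plaintext by accident. It does not replace the custody-backed decrypt/apply verification.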
### R6 — Helm chart in railiance-apps
```task
id: IHUB-WP-0018-T06
status: done
priority: high
state_hub_task_id: "4c4acc98-5773-4289-ad57-03f3fd5c381c"
```
Create `charts/inter-hub/` in the `railiance-apps` repository following the
Railiance app.toml contract. Minimal chart:
```
charts/inter-hub/
  Chart.yaml          name: inter-hub, version: 0.1.0
  values.yaml         image.tag, ingress.host, resources
  templates/
    deployment.yaml   envFrom: secretRef inter-hub-env
    service.yaml      ClusterIP :8000
    ingress.yaml      Traefik annotations, TLS
helm/inter-hub-values.yaml
                      production non-secret overrides
```
`app.toml` in the inter-hub repo root for railiance CLI integration:
```toml
[app]
name = "inter-hub"
slug = "inter-hub"
kind = "native"
registry = "gitea.coulomb.social/coulomb/inter-hub"
[deploy]
chart = "railiance-apps/charts/inter-hub"
namespace = "inter-hub"
```
**Implementation note (2026-06-05):** A Helm chart exists in
`deploy/helm/inter-hub/` with Deployment, Service, Ingress, and values for the
current Gitea registry and `hub.coulomb.social`. Remaining gaps: no repo-root
`app.toml`, no committed SOPS secret manifest, and no separate
`railiance-apps/helm/inter-hub` handoff in this repo.
**Recovery note (2026-06-14):** The local chart under `deploy/helm/inter-hub/`
successfully deployed the app to Railiance01. This task remains in progress
because the repo-root `app.toml` and railiance-apps handoff are still not
completed.
**Completion note (2026-06-14):** Added repo-root `app.toml` in inter-hub and
added `charts/inter-hub`, `helm/inter-hub-values.yaml`, Makefile targets, and
server-dry-run coverage in `railiance-apps`. The chart rendered successfully on
haskelseed with `helm template`.
### R7 — Gitea Actions CI/CD pipeline
```task
id: IHUB-WP-0018-T07
status: done
priority: medium
state_hub_task_id: "ec25c67c-3cb0-4534-9fb0-9bd6578a2def"
```
Create `.gitea/workflows/deploy.yaml` in the inter-hub repo:
```yaml
on:
  push:
    branches: [main]
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest # or self-hosted if available
    env:
      # Map the Gitea secret so $REGISTRY expands in the run steps below
      REGISTRY: ${{ secrets.REGISTRY }}
    steps:
      - uses: actions/checkout@v4
      - name: Build OCI image on haskelseed
        run: |
          ssh haskelseed "cd /root/inter-hub && git pull && \
            nix build .#docker && \
            docker load < result && \
            docker tag inter-hub:latest $REGISTRY/inter-hub:${{ github.sha }} && \
            docker push $REGISTRY/inter-hub:${{ github.sha }}"
      - name: Deploy to Railiance01
        run: |
          ssh coulombcore "helm upgrade --install inter-hub \
            railiance-apps/helm/inter-hub \
            --namespace inter-hub --create-namespace \
            --set image.tag=${{ github.sha }} \
            -f railiance-apps/helm/inter-hub/values.prod.yaml"
```
Secrets in Gitea: `REGISTRY`, `SSH_KEY_HASKELSEED`, `SSH_KEY_COULOMBCORE`.
**Alternative if self-hosted runner is available on CoulombCore:** run the
deploy step directly without the SSH hop to coulombcore.
**Implementation note (2026-06-05):** `.gitea/workflows/deploy.yaml` exists and
builds `.#docker` on a self-hosted `haskelseed` runner, pushes to
`92.205.130.254:32166/coulomb/inter-hub`, deploys with Helm, and smoke-tests
the public endpoint. Remote `main` is already current, but production is still
serving an older API surface, so the workflow needs an attended rerun/inspection
or a new deployment trigger.
**Runner substrate finding (2026-06-07):** Pushed commits `fa96fb8` and
`7cc3173` to trigger the workflow, but public `/api/v2/hubs` remained `404`
while `/` stayed `200`, indicating the current image was not deployed. Repo
search shows `railiance-forge` owns Actions runner substrate, but its
2026-06-05 migration plan explicitly lists "No Actions runner deployment" as a
non-goal and no runner manifest/script/workplan exists there yet. `haskelseed`
itself is reachable on SSH and historical port 8080, but this workspace cannot
authenticate non-interactively. Treat R7 as blocked on a forge-owned runner
prerequisite rather than continuing to push commits as deployment probes.
**Recovery note (2026-06-14):** The runner prerequisite was restored through
the haskelseed ops-bridge path. The workflow now builds the Nix OCI image,
publishes to `gitea.coulomb.social/coulomb/inter-hub` using a registry bearer
token from the repo `REGISTRY_TOKEN` Actions secret, deploys with Helm, and
runs public smoke checks. Gitea Actions run `2913` completed successfully for
commit `5663fab`.
**Load-control note (2026-06-14):** Added workflow `paths-ignore` for docs,
workplans, `.custodian-brief.md`, `app.toml`, `.sops.yaml`, and
`deploy/railiance/**` so State Hub consistency/doc-only commits do not consume a
haskelseed build/deploy cycle.
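The effect of `paths-ignore` can be reasoned about as a predicate over the changed files: assuming Gitea Actions follows the GitHub Actions rule, a push is skipped only when every changed path matches an ignore pattern. The sketch below models that rule; the pattern spellings for docs and workplans are guesses, and the authoritative list is the one in `.gitea/workflows/deploy.yaml`.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns; the real list lives in .gitea/workflows/deploy.yaml.
IGNORED = ["docs/**", "workplans/**", ".custodian-brief.md",
           "app.toml", ".sops.yaml", "deploy/railiance/**"]


def triggers_deploy(changed_paths, ignored=IGNORED) -> bool:
    """True if at least one changed path is outside every ignore pattern.

    Note: fnmatch lets `*` cross `/`, which is looser than Actions globbing,
    so this is an approximation of the real matcher.
    """
    return any(
        not any(fnmatchcase(path, pattern) for pattern in ignored)
        for path in changed_paths
    )
```

For example, a commit touching only `docs/new-hub-quickstart.md` and `app.toml` would not trigger a build, while one that also touches `flake.nix` would.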
**Bootstrap-gate deploy note (2026-06-14):** Hardened the deployment workflow
smoke test so a production rollout only passes when `/api/v2/hubs` returns the
expected unauthenticated `401` and OpenAPI exposes `/hubs`,
`/hub-capability-manifests`, `/api-consumers`, and `/policy-scopes`. This
directly protects the ops-hub bootstrap gate instead of only checking the
landing page and generic widget auth gate.
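The gate condition described in this note reduces to a small predicate. A minimal sketch, assuming the OpenAPI document lists its routes under `paths` with keys spelled exactly as above (for example `/hubs` rather than `/api/v2/hubs`); the function name is hypothetical and the actual workflow step may be implemented differently.

```python
REQUIRED_PATHS = ("/hubs", "/hub-capability-manifests",
                  "/api-consumers", "/policy-scopes")


def bootstrap_gate_ok(hubs_status: int, openapi: dict) -> bool:
    """Pass only if unauthenticated /api/v2/hubs returned 401 and the
    OpenAPI document lists every bootstrap path."""
    paths = openapi.get("paths", {})
    return hubs_status == 401 and all(p in paths for p in REQUIRED_PATHS)
```

A 404 on `/api/v2/hubs`, as seen while the older image was still serving, fails the gate even when the landing page returns 200.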
**Authenticated inspection note (2026-06-14):** The stored local Tea token is
stale for `https://gitea.coulomb.social`, but runner-side inspection succeeded.
`make runner-status` in `railiance-forge` showed `act_runner` registered to
`https://gitea.coulomb.social`, started under OpenRC, and carrying the expected
`self-hosted`/`haskelseed` labels. The runner log shows task `19` for
`coulomb/inter-hub` starting at `2026-06-14T19:59:19+02:00`, matching the
`6455902` deploy trigger.
### R8 — Staged deployment and smoke test
```task
id: IHUB-WP-0018-T08
status: done
priority: high
state_hub_task_id: "2b02ae5c-47b9-4f09-88f0-a4af7900b38f"
```
Follow the Railiance staged promotion lifecycle:
1. **Local verify** (done in R2 — container runs correctly)
2. **Deploy to Railiance01:**
   ```bash
   railiance deploy inter-hub --tag <sha>
   ```
3. **Smoke test:**
   ```bash
   curl -s https://hub.coulomb.social/ | grep "Inter-Hub"   # Landing page
   curl -s https://hub.coulomb.social/capabilities           # Capabilities
   curl -H "Authorization: Bearer <key>" \
     https://hub.coulomb.social/api/v2/hubs                  # API (200)
   curl https://hub.coulomb.social/api/v2/hubs               # Unauthenticated (401)
   ```
4. **Verify restart persistence:**
   ```bash
   kubectl rollout restart deployment/inter-hub -n inter-hub
   kubectl rollout status deployment/inter-hub -n inter-hub
   # Then re-run smoke test
   ```
**Recovery note (2026-06-14):** Production is deployed from image
`gitea.coulomb.social/coulomb/inter-hub:5663fab`; Kubernetes reports the
`inter-hub` deployment ready with one replica. Public smoke checks pass:
`/` returns 200 and contains `inter-hub`, `/api/v2/openapi.json` returns 200,
and unauthenticated `/api/v2/widgets` returns 401.
**DNS gate finding (2026-06-14):** The deployment workflow did publish and
deploy `gitea.coulomb.social/coulomb/inter-hub:6455902`; Kubernetes reports the
`inter-hub` Deployment ready on the COULOMBCORE K3s node
`92.205.130.254`. An in-cluster probe to
`http://inter-hub:8000/api/v2/hubs` returned the expected unauthenticated
`401`, and forcing public TLS to `92.205.130.254` also returned `401`. The
public DNS record for `hub.coulomb.social`, however, resolves to
`92.205.62.239`, where `/api/v2/hubs` still returns `404` and OpenAPI lacks the
bootstrap paths. The remaining production gate is therefore DNS cutover (or an
intentional kubeconfig rotation to the cluster behind `92.205.62.239`), not a
runner, build, registry, Helm, or image-content issue.
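Whether the cutover has happened can be checked by comparing what the public name resolves to against the cluster ingress address. A minimal sketch using the standard `socket` module; the function names are hypothetical, and the result reflects only the resolver of the machine running it.

```python
import socket


def resolve_ipv4(hostname: str) -> set:
    """Return the set of IPv4 addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, 443, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def dns_cutover_done(resolved: set, cluster_ip: str) -> bool:
    """True once the public record resolves only to the cluster ingress IP."""
    return resolved == {cluster_ip}
```

With the values recorded above, `dns_cutover_done({"92.205.62.239"}, "92.205.130.254")` is false. In practice the first argument would come from `resolve_ipv4("hub.coulomb.social")`.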
### R9 — Document and register
```task
id: IHUB-WP-0018-T09
status: done
priority: medium
state_hub_task_id: "4d1e55c7-8dbb-480f-b07b-6c5e39a04218"
```
- Write `deploy/railiance/RUNBOOK.md`: image build, migration procedure,
secret rotation, rollback (`railiance rollback inter-hub`), log access
(`kubectl logs -n inter-hub -l app=inter-hub --tail=100`)
- Add progress event to state hub
- Remove haskelseed socat/OpenRC production role note from quickstart -
document it as the build machine only, not the production host
**Implementation note (2026-06-05):** `deploy/railiance/RUNBOOK.md` exists and
documents architecture, image build/push, Helm deployment, logs, restart,
rollback, secret rotation, and smoke checks. The deployment record remains
incomplete until current `main` is running and the ops-hub bootstrap smoke test
passes against production.
**Recovery note (2026-06-14):** Current `main` is running in production and the
deployment evidence has been recorded here. Remaining documentation work is to
capture the durable secret-management and railiance-apps handoff path once R5
and R6 are completed.
**Completion note (2026-06-14):** Updated `deploy/railiance/RUNBOOK.md` for the
current Gitea registry host, runner-based build/deploy path, SOPS secret handoff,
current smoke checks, and haskelseed's build-runner-only role. Updated
`docs/new-hub-quickstart.md` so haskelseed is no longer described as a
production/shared database runtime.
## Exit Criteria
- `https://hub.coulomb.social/` returns the Landing page (200, no auth)
- `/api/v2/hubs` returns 401 unauthenticated, 200 with valid API key
- All 12 IHF dashboards accessible after admin login
- `kubectl rollout restart` followed by smoke test passes (K3s restart
persistence confirmed)
- Gitea Actions pipeline: push to `main` → image built → deployed → smoke
test green within 15 minutes
- No dependency on haskelseed being up for the app to *run* (only for builds)
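The 15-minute pipeline criterion implies polling the smoke checks against a deadline. A minimal sketch of such a loop; `wait_until` and its parameters are hypothetical, and the `check` callable stands in for whatever smoke test is used.

```python
import time


def wait_until(check, timeout_s: float = 15 * 60, interval_s: float = 30,
               clock=time.monotonic, sleep=time.sleep) -> bool:
    """Call check() until it returns True or timeout_s has elapsed.

    clock and sleep are injectable so the loop can be exercised without
    real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Using a monotonic clock keeps the deadline stable even if the wall clock is adjusted during the wait.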
## Open Questions / Pre-flight Checks
1. **K3s status**: ThreePhoenix HA cluster workstream is active but not complete.
Confirm whether Railiance01 is a single-node cluster already accepting
workloads or still being provisioned. Gate R3 is the go/no-go check.
2. **Container registry**: Is Gitea's built-in registry available on Railiance01,
or is a separate registry service needed? If neither, add registry deployment
to the scope.
3. **PostgreSQL HA status**: railiance-platform baseline workstream is active.
Confirm whether the HA cluster (repmgr + pgpool) is operational before R4.
4. **Static asset bundling**: The Nix production binary may or may not include
`static/app.css` (Tailwind output). Verify in R2 and adjust image build
if needed.
5. **Anthropic API key**: Phase 5 AI-assisted distillation requires
`IHP_ANTHROPIC_API_KEY`. Add to SOPS secrets if the feature is to be
active on Railiance01.