inter-hub Production Deploy Runbook

Architecture

  • Deployment cluster: COULOMBCORE K3s (92.205.130.254) as observed from the haskelseed runner kube context on 2026-06-14.
  • Stale public DNS host: hub.coulomb.social still resolved to 92.205.62.239 on 2026-06-14, which served the older API surface.
  • Namespace: inter-hub
  • Image registry: gitea.coulomb.social/coulomb/inter-hub:<sha>
  • Database: CloudNativePG cluster net-kingdom-pg in databases namespace
    • RW endpoint: net-kingdom-pg-rw.databases.svc.cluster.local:5432
    • Database: interhub, User: interhub
  • Ingress: Traefik → hub.coulomb.social (TLS via letsencrypt-prod)
  • Secrets: inter-hub-env Secret in inter-hub namespace
  • App handoff: app.toml points Railiance operators to railiance-apps/charts/inter-hub with values from railiance-apps/helm/inter-hub-values.yaml

Public DNS Gate

The app deployment can be healthy while public smoke tests still fail if DNS points hub.coulomb.social at the stale host. On 2026-06-14:

  • Kubernetes reported image gitea.coulomb.social/coulomb/inter-hub:6455902 ready in namespace inter-hub on node 92.205.130.254.
  • An in-cluster probe to http://inter-hub:8000/api/v2/hubs returned 401.
  • Forcing public TLS to the cluster ingress also returned 401: curl --resolve hub.coulomb.social:443:92.205.130.254 https://hub.coulomb.social/api/v2/hubs.
  • Normal DNS resolved hub.coulomb.social to 92.205.62.239, where /api/v2/hubs returned 404 and OpenAPI lacked the bootstrap paths.

Before treating a deploy as failed, compare DNS and forced-ingress probes:

getent ahosts hub.coulomb.social
curl -s -o /dev/null -w "%{http_code}" https://hub.coulomb.social/api/v2/hubs
curl --resolve hub.coulomb.social:443:92.205.130.254 \
  -s -o /dev/null -w "%{http_code}" \
  https://hub.coulomb.social/api/v2/hubs

The public bootstrap gate passes when either of two conditions holds: the DNS A record for hub.coulomb.social points at the active ingress IP (92.205.130.254), or the workflow kubeconfig has been intentionally rotated so that deploys go to the cluster behind the current DNS target.
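
The comparison above can be captured as a small decision helper. This is a minimal sketch, not part of the deploy workflow: the function name classify_gate, the ACTIVE_INGRESS_IP variable, and the four result labels are illustrative assumptions. Only the IP addresses, the probed path, and the 401 expectation come from the observations in this section.

```shell
# Hypothetical helper: classify the public DNS gate from three observations.
#   $1  IP that normal DNS returns for hub.coulomb.social
#   $2  HTTP status of /api/v2/hubs through normal DNS
#   $3  HTTP status of /api/v2/hubs with --resolve forced to the ingress IP
ACTIVE_INGRESS_IP="${ACTIVE_INGRESS_IP:-92.205.130.254}"

classify_gate() {
  resolved_ip=$1 public_code=$2 forced_code=$3
  if [ "$forced_code" != "401" ]; then
    echo "deploy-suspect"        # the cluster ingress is not serving the v2 auth gate
  elif [ "$resolved_ip" != "$ACTIVE_INGRESS_IP" ]; then
    echo "dns-stale"             # deploy looks healthy, public DNS points elsewhere
  elif [ "$public_code" = "401" ]; then
    echo "pass"
  else
    echo "public-path-suspect"   # DNS is right but the public probe disagrees
  fi
}

# Example wiring (needs network access, so it is left commented out):
# resolved_ip=$(getent ahosts hub.coulomb.social | awk 'NR==1{print $1}')
# public_code=$(curl -s -o /dev/null -w "%{http_code}" https://hub.coulomb.social/api/v2/hubs)
# forced_code=$(curl --resolve "hub.coulomb.social:443:$ACTIVE_INGRESS_IP" \
#   -s -o /dev/null -w "%{http_code}" https://hub.coulomb.social/api/v2/hubs)
# classify_gate "$resolved_ip" "$public_code" "$forced_code"
```

With the 2026-06-14 observations (92.205.62.239, 404, 401) the helper reports dns-stale, which matches the conclusion that the deploy itself was healthy.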

Deployment

Normal deployment is handled by Gitea Actions on push to main:

  • runner labels: self-hosted, haskelseed
  • build: nix build .#docker
  • publish: gitea.coulomb.social/coulomb/inter-hub:<short-sha> and latest
  • deploy: helm upgrade --install inter-hub deploy/helm/inter-hub ...
  • smoke: public landing page and v2 auth gate

Manual deployment from this repo:

helm upgrade --install inter-hub deploy/helm/inter-hub \
  --namespace inter-hub --create-namespace \
  --set image.tag=<short-sha> \
  --wait --timeout 5m

Manual deployment through the Railiance app handoff chart:

helm upgrade --install inter-hub /home/worsch/railiance-apps/charts/inter-hub \
  --namespace inter-hub --create-namespace \
  -f /home/worsch/railiance-apps/helm/inter-hub-values.yaml \
  --set image.tag=<short-sha> \
  --wait --timeout 5m
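
After either manual path, it can help to confirm that the live Deployment carries the tag that was just passed to --set image.tag. The sketch below is an assumption-laden illustration: image_tag and tag_matches are hypothetical helper names, the reference is assumed to be tag-style rather than a digest, and the commented kubectl call assumes the app container is the first container in the pod spec.

```shell
# Hypothetical helper: print the tag of an image reference such as
# gitea.coulomb.social/coulomb/inter-hub:6455902
image_tag() {
  ref=$1
  last=${ref##*/}              # drop registry/path so a registry port is not read as a tag
  case $last in
    *@*) return 1 ;;           # digest reference, no tag to compare
    *:*) echo "${last##*:}" ;;
    *)   return 1 ;;           # no explicit tag
  esac
}

# Succeed only when the live image reference carries the expected tag.
tag_matches() {
  expected=$1 live_ref=$2
  [ "$(image_tag "$live_ref")" = "$expected" ]
}

# Example wiring (needs cluster access, so it is left commented out):
# live=$(kubectl get deployment inter-hub -n inter-hub \
#   -o jsonpath='{.spec.template.spec.containers[0].image}')
# tag_matches "$(git rev-parse --short HEAD)" "$live" && echo "image is current"
```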

Image Build (on haskelseed)

ssh root@192.168.178.135
cd /root/inter-hub
# Build:
nix build .#docker --log-format raw > /tmp/build.log 2>&1

# Push:
SHA=$(git rev-parse --short HEAD)
TOKEN=$(curl -fsS \
  "https://gitea.coulomb.social/v2/token?service=container_registry&scope=repository:coulomb/inter-hub:push,pull" \
  -u "tegwick:<REGISTRY_TOKEN>" | awk -F'"' '/token/{print $4}')
skopeo copy --insecure-policy \
  --dest-registry-token "$TOKEN" \
  docker-archive:result \
  docker://gitea.coulomb.social/coulomb/inter-hub:$SHA

Notes:

  • Haskelseed is a build/deploy runner, not the production app host.
  • The current IHP Nix Docker image has no /bin/sh. Prefer Kubernetes-native checks from other pods or from the database pod.

Gitea Registry Credentials

The deploy workflow uses the repository Actions secret REGISTRY_TOKEN to request a short-lived registry bearer token from https://gitea.coulomb.social/v2/token.

If publishing starts failing with an authentication error:

  1. Generate or rotate a Gitea token with package write access.
  2. Update the REGISTRY_TOKEN Actions secret for coulomb/inter-hub.
  3. Rerun the workflow or push a non-production test commit.

Do not print token values in logs, State Hub, or commits.

Runtime Secret Source

The live deployment currently consumes the Kubernetes Secret inter-hub/inter-hub-env. The durable source file is:

deploy/railiance/secrets/inter-hub.env.sops.yaml

Create or refresh it from the live Secret using:

tmp="$(mktemp)"
trap 'rm -f "$tmp"' EXIT

kubectl -n inter-hub get secret inter-hub-env -o json \
  | python3 deploy/railiance/secrets/k8s-secret-json-to-sops-input.py \
  > "$tmp"

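# The mktemp file has no extension, so state the format explicitly;
# otherwise sops falls back to its binary store.
sops --encrypt \
  --input-type yaml --output-type yaml \
  --age age1aq8twfd78wvpra0had8cezcnj96tj4q0068edrz5jez8d6xwmflqdepsh4 \
  "$tmp" > deploy/railiance/secrets/inter-hub.env.sops.yaml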

Apply the encrypted source:

sops -d deploy/railiance/secrets/inter-hub.env.sops.yaml \
  | kubectl apply -f -
kubectl rollout restart deployment/inter-hub -n inter-hub
kubectl rollout status deployment/inter-hub -n inter-hub

Custody-backed recovery verification:

# after the approved custody unlock makes the age identity available
make recovery-drill

The drill prints UTC/local timestamps, verifies that the committed SOPS file can be decrypted in memory, checks the expected Secret metadata and key names, and does not print secret values. Keep the PASS output as non-secret recovery evidence.

Database Migration

The current Nix production image is intentionally minimal: image metadata for 6455902 points at /nix/store/<hash>-inter-hub/bin/RunProdServer, and the package contains only RunProdServer and RunJobs. It has no shell and no packaged migration runner, so schema work is performed through the CloudNativePG pod.

Check schema state:

kubectl exec -n databases net-kingdom-pg-1 -- \
  psql -d interhub -Atc "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public';"
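
The count returned by this check decides whether the blank-database initialization applies at all. As a minimal sketch, assuming the count is captured into a shell variable, a guard could look like the following. The function name schema_action and its three result labels are hypothetical, not part of the repository.

```shell
# Hypothetical guard: map the public-table count to a next step.
schema_action() {
  count=$1
  case $count in
    ''|*[!0-9]*) echo "unknown"; return 2 ;;  # psql failed or printed something unexpected
    0)           echo "initialize" ;;          # blank database: Schema.sql may be applied
    *)           echo "skip" ;;                # tables already exist: do not re-run Schema.sql
  esac
}

# Example wiring (needs cluster access, so it is left commented out):
# count=$(kubectl exec -n databases net-kingdom-pg-1 -- \
#   psql -d interhub -Atc "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public';")
# schema_action "$count"
```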

Initialize a blank production database from the canonical schema:

kubectl exec -i -n databases net-kingdom-pg-1 -- \
  psql -d interhub -v ON_ERROR_STOP=1 -1 -f - < Application/Schema.sql

kubectl exec -i -n databases net-kingdom-pg-1 -- \
  psql -d interhub -v ON_ERROR_STOP=1 -1 -f - < Application/Migration/1744502400-seed-type-registries.sql

kubectl exec -i -n databases net-kingdom-pg-1 -- psql -d interhub -v ON_ERROR_STOP=1 -1 -f - <<'SQL'
GRANT USAGE ON SCHEMA public TO interhub;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO interhub;
GRANT USAGE, SELECT, UPDATE ON ALL SEQUENCES IN SCHEMA public TO interhub;
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA public TO interhub;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO interhub;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT USAGE, SELECT, UPDATE ON SEQUENCES TO interhub;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT EXECUTE ON FUNCTIONS TO interhub;
SQL

kubectl rollout restart deployment/inter-hub -n inter-hub
kubectl rollout status deployment/inter-hub -n inter-hub

Do not apply 1744416000-seed-admin-user.sql unattended in production; it uses a documented default password intended for initial local deployment only.
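
If migration files are ever applied from a script, this rule can be enforced mechanically. The following is only a sketch: unattended_ok is a hypothetical helper name, and the deny list holds just the one file named above.

```shell
# Hypothetical guard: succeed only for SQL files that may be applied unattended.
unattended_ok() {
  case ${1##*/} in
    1744416000-seed-admin-user.sql) return 1 ;;  # uses a documented default password
    *.sql) return 0 ;;
    *)     return 1 ;;                           # not a SQL file
  esac
}

# Example wiring for a single file (needs cluster access, so it is left commented out):
# f=Application/Migration/1744502400-seed-type-registries.sql
# unattended_ok "$f" && kubectl exec -i -n databases net-kingdom-pg-1 -- \
#   psql -d interhub -v ON_ERROR_STOP=1 -1 -f - < "$f"
```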

Logs

kubectl logs -n inter-hub -l app=inter-hub --tail=100 -f
# Previous pod logs:
kubectl logs -n inter-hub -l app=inter-hub --previous --tail=50

Restart / Rollback

# Restart:
kubectl rollout restart deployment/inter-hub -n inter-hub
kubectl rollout status deployment/inter-hub -n inter-hub

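# Roll back to the previous Deployment revision:
kubectl rollout undo deployment/inter-hub -n inter-hub

# Roll back to a specific Helm release revision (list revisions first):
helm history inter-hub --namespace inter-hub
helm rollback inter-hub <revision> --namespace inter-hub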

Secret Rotation

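To rotate the session secret, edit the encrypted source (sops opens the decrypted content in your editor), apply it, and restart the deployment: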

sops deploy/railiance/secrets/inter-hub.env.sops.yaml
sops -d deploy/railiance/secrets/inter-hub.env.sops.yaml | kubectl apply -f -
kubectl rollout restart deployment/inter-hub -n inter-hub

To rotate the database password:

  1. Update the password in PostgreSQL (via kubectl exec to the CNPG pod).
  2. Update the inter-hub-env Secret, editing the SOPS source file so that the durable copy does not drift from the live Secret.
  3. Restart the deployment.
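
Step 1 can be sketched as follows. This is an illustration under stated assumptions, not an established procedure from this repository: alter_role_sql is a hypothetical helper, NEW_PW is assumed to have been set without echoing it to the terminal or logs, and the quoting relies on standard_conforming_strings being on (the PostgreSQL default). The statement is sent over stdin so the password does not appear in process arguments. It may still be recorded in server logs if statement logging is enabled.

```shell
# Hypothetical helper: emit ALTER ROLE with the password as a quoted SQL literal.
# Single quotes inside the password are doubled.
alter_role_sql() {
  role=$1 password=$2
  escaped=$(printf '%s' "$password" | sed "s/'/''/g")
  printf "ALTER ROLE \"%s\" WITH PASSWORD '%s';\n" "$role" "$escaped"
}

# Example wiring (needs cluster access, so it is left commented out):
# alter_role_sql interhub "$NEW_PW" \
#   | kubectl exec -i -n databases net-kingdom-pg-1 -- \
#       psql -d interhub -v ON_ERROR_STOP=1 -f -
```

Steps 2 and 3 can then reuse the same edit, apply, and restart commands shown for the session secret.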

Smoke Test

getent ahosts hub.coulomb.social   # expected: 92.205.130.254
curl -fsS https://hub.coulomb.social/ | grep "inter-hub"
curl -fsS https://hub.coulomb.social/api/v2/openapi.json >/dev/null
curl -s -o /dev/null -w "%{http_code}" https://hub.coulomb.social/api/v2/widgets | grep 401
curl -s -o /dev/null -w "%{http_code}" https://hub.coulomb.social/api/v2/hubs | grep 401
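
The grep-based checks above fail silently when a status differs. As a minimal sketch, a reporting wrapper could name the failing probe and the status it saw. The function name expect_code and its message format are hypothetical. The 000 case reflects curl printing 000 for %{http_code} when no HTTP response was received.

```shell
# Hypothetical helper: compare an observed HTTP status with the expected one.
expect_code() {
  name=$1 expected=$2 actual=$3
  if [ "$actual" = "$expected" ]; then
    echo "ok   $name ($actual)"
  elif [ "$actual" = "000" ]; then
    echo "FAIL $name: no HTTP response (DNS, TLS, or connection error)" >&2
    return 1
  else
    echo "FAIL $name: expected $expected, got $actual" >&2
    return 1
  fi
}

# Example wiring (needs network access, so it is left commented out):
# code=$(curl -s -o /dev/null -w "%{http_code}" https://hub.coulomb.social/api/v2/hubs)
# expect_code "v2 hubs auth gate" 401 "$code"
```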

Database Connection Check

The IHP Nix image has no /bin/sh. Connect via the CNPG pod instead:

kubectl exec -n databases net-kingdom-pg-1 -- psql -U postgres -d interhub -c "SELECT version();"

Password Hashing

IHP uses pwstore-fast (Crypto.PasswordStore), not bcrypt. The stored hash has this format:

sha256|17|<base64-salt>|<base64-hash>
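
Before writing a hash into the database, a quick shape check can catch a bcrypt-style or truncated value. The sketch below only validates the four-field layout shown above. It does not verify the password, and the helper names is_pwstore_hash and pwstore_strength are hypothetical.

```shell
# Hypothetical check: does the value look like sha256|<strength>|<base64>|<base64>?
is_pwstore_hash() {
  printf '%s\n' "$1" \
    | grep -Eq '^sha256[|][0-9]+[|][A-Za-z0-9+/]+=*[|][A-Za-z0-9+/]+=*$'
}

# Print the strength field (second field) of a hash in that layout.
pwstore_strength() {
  printf '%s\n' "$1" | cut -d'|' -f2
}

# Example: reject anything that is not in the expected layout with strength 17.
# h='sha256|17|<base64-salt>|<base64-hash>'
# is_pwstore_hash "$h" && [ "$(pwstore_strength "$h")" = "17" ] || echo "unexpected hash format" >&2
```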

To generate a correct hash (requires GHC with pwstore-fast available on haskelseed):

ssh root@192.168.178.135
cat > /tmp/genhash.hs << 'EOF'
import qualified Crypto.PasswordStore as PS
import qualified Data.ByteString.Char8 as B8
main :: IO ()
main = do
    h <- PS.makePassword (B8.pack "yourpassword") 17
    B8.putStrLn h
EOF
/nix/store/yp23474ys67f1fd2z2ff1nn3q5wrmjng-ghc-9.10.3-with-packages/bin/runghc /tmp/genhash.hs

haskelseed Build VM

  • Host: 192.168.178.135
  • Access: ops-bridge SSH path with the approved operator key
  • Role: self-hosted Gitea Actions runner and Nix build machine only
  • Runner: OpenRC act_runner service registered to https://gitea.coulomb.social
  • Build logs: Gitea Actions logs and temporary runner work directories
  • Nix store: /dev/sdb1 (100 GB, mounted at /nix)