# inter-hub Production Deploy Runbook
## Architecture
- **Deployment cluster:** COULOMBCORE K3s (`92.205.130.254`) as observed from
the haskelseed runner kube context on 2026-06-14.
- **Stale public DNS host:** `hub.coulomb.social` still resolved to
`92.205.62.239` on 2026-06-14, which served the older API surface.
- **Namespace:** `inter-hub`
- **Image registry:** `gitea.coulomb.social/coulomb/inter-hub:<sha>`
- **Database:** CloudNativePG cluster `net-kingdom-pg` in `databases` namespace
  - RW endpoint: `net-kingdom-pg-rw.databases.svc.cluster.local:5432`
  - Database: `interhub`, User: `interhub`
- **Ingress:** Traefik → `hub.coulomb.social` (TLS via letsencrypt-prod)
- **Secrets:** `inter-hub-env` Secret in `inter-hub` namespace
- **App handoff:** `app.toml` points Railiance operators to
`railiance-apps/charts/inter-hub` with values from
`railiance-apps/helm/inter-hub-values.yaml`
## Public DNS Gate
The app deployment can be healthy while public smoke tests still fail if DNS
points `hub.coulomb.social` at the stale host. On 2026-06-14:
- Kubernetes reported image `gitea.coulomb.social/coulomb/inter-hub:6455902`
ready in namespace `inter-hub` on node `92.205.130.254`.
- An in-cluster probe to `http://inter-hub:8000/api/v2/hubs` returned `401`.
- Forcing public TLS to the cluster ingress also returned `401`:
`curl --resolve hub.coulomb.social:443:92.205.130.254 https://hub.coulomb.social/api/v2/hubs`.
- Normal DNS resolved `hub.coulomb.social` to `92.205.62.239`, where
`/api/v2/hubs` returned `404` and OpenAPI lacked the bootstrap paths.
Before treating a deploy as failed, compare DNS and forced-ingress probes:
```bash
getent ahosts hub.coulomb.social
curl -s -o /dev/null -w "%{http_code}" https://hub.coulomb.social/api/v2/hubs
curl --resolve hub.coulomb.social:443:92.205.130.254 \
  -s -o /dev/null -w "%{http_code}" \
  https://hub.coulomb.social/api/v2/hubs
```
The public bootstrap gate passes once either of these holds: the DNS A record
for `hub.coulomb.social` points at the active ingress IP (`92.205.130.254`), or
the workflow kubeconfig has been intentionally rotated so that deploys target
the cluster behind the current DNS record.
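
The comparison above can be folded into a small helper. This is a minimal sketch, not part of the deploy workflow: the function name `dns_gate_verdict` and its verdict strings are hypothetical, and it assumes that `401` from `/api/v2/hubs` is the healthy unauthenticated response, as observed on 2026-06-14.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify the public DNS gate from three observations.
#   $1  IP that public DNS currently returns for hub.coulomb.social
#   $2  HTTP status of /api/v2/hubs through normal DNS
#   $3  HTTP status of /api/v2/hubs with --resolve forcing the cluster ingress
ACTIVE_INGRESS_IP="92.205.130.254"   # active ingress IP from this runbook

dns_gate_verdict() {
  local resolved_ip="$1" public_code="$2" forced_code="$3"
  if [ "$forced_code" != "401" ]; then
    # The cluster itself is not answering as expected: a real deploy problem.
    echo "deploy-unhealthy"
  elif [ "$resolved_ip" != "$ACTIVE_INGRESS_IP" ]; then
    # Cluster is fine, but public traffic still lands on another host.
    echo "dns-stale"
  elif [ "$public_code" = "401" ]; then
    echo "gate-pass"
  else
    echo "public-path-unhealthy"
  fi
}

# Values observed on 2026-06-14 (stale DNS, healthy cluster):
dns_gate_verdict "92.205.62.239" "404" "401"   # prints: dns-stale
```

In practice the three arguments would come from the `getent` and `curl` commands shown above. Only a `deploy-unhealthy` verdict points at the deployment itself.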
## Deployment
Normal deployment is handled by Gitea Actions on push to `main`:
- runner labels: `self-hosted`, `haskelseed`
- build: `nix build .#docker`
- publish: `gitea.coulomb.social/coulomb/inter-hub:<short-sha>` and `latest`
- deploy: `helm upgrade --install inter-hub deploy/helm/inter-hub ...`
- smoke: public landing page and v2 auth gate
Manual deployment from this repo:
```bash
helm upgrade --install inter-hub deploy/helm/inter-hub \
  --namespace inter-hub --create-namespace \
  --set image.tag=<short-sha> \
  --wait --timeout 5m
```
Manual deployment through the Railiance app handoff chart:
```bash
helm upgrade --install inter-hub /home/worsch/railiance-apps/charts/inter-hub \
  --namespace inter-hub --create-namespace \
  -f /home/worsch/railiance-apps/helm/inter-hub-values.yaml \
  --set image.tag=<short-sha> \
  --wait --timeout 5m
```
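
Both manual commands take `<short-sha>` by hand, and a mistyped tag typically leaves `--wait` running until the 5m timeout while the pod fails to pull its image. A pre-flight check can catch that earlier. This is a sketch under assumptions: `is_short_sha` is a hypothetical name, and it assumes tags are lowercase hex short SHAs as produced by `git rev-parse --short HEAD`.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check for the value passed to `--set image.tag=`.
# Assumption: tags are lowercase hex short SHAs (7 to 40 characters);
# `latest` is rejected on purpose so a manual deploy pins an exact build.
is_short_sha() {
  printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{7,40}$'
}

for tag in 6455902 latest 64559; do
  if is_short_sha "$tag"; then
    echo "$tag: ok"
  else
    echo "$tag: rejected"
  fi
done
```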
## Image Build (on haskelseed)
```bash
ssh root@192.168.178.135
cd /root/inter-hub
# Build:
nix build .#docker --log-format raw > /tmp/build.log 2>&1
# Push:
SHA=$(git rev-parse --short HEAD)
TOKEN=$(curl -fsS \
  "https://gitea.coulomb.social/v2/token?service=container_registry&scope=repository:coulomb/inter-hub:push,pull" \
  -u "tegwick:<REGISTRY_TOKEN>" | awk -F'"' '/token/{print $4}')
skopeo copy --insecure-policy \
  --dest-registry-token "$TOKEN" \
  docker-archive:result \
  docker://gitea.coulomb.social/coulomb/inter-hub:$SHA
```
**Notes:**
- Haskelseed is a build/deploy runner, not the production app host.
- The IHP Nix Docker image has no `/bin/sh` (see Database Migration), so
  shell-based `kubectl exec` checks inside the app pod fail. Prefer
  Kubernetes-native checks from other pods or the database pod.
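
The `awk -F'"'` step in the push commands works only while the token endpoint returns a single-line JSON object whose first key is `token`. The sketch below isolates that parsing so it can be checked offline. The function name `extract_registry_token` and the sample response are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper mirroring the awk expression used in the push step.
# Splitting on double quotes turns {"token":"VALUE"} into the fields
#   $1={  $2=token  $3=:  $4=VALUE  $5=}
# so $4 is the token only while "token" is the first key on its line.
extract_registry_token() {
  awk -F'"' '/token/ {print $4; exit}'
}

# Made-up response body; a real token must never be echoed.
sample='{"token":"example-not-a-real-token"}'
token="$(printf '%s\n' "$sample" | extract_registry_token)"

# Fail closed: an empty token means the request or the parsing went wrong.
if [ -z "$token" ]; then
  echo "no token parsed" >&2
  exit 1
fi
echo "token length: ${#token}"
```

Checking for an empty `TOKEN` before calling `skopeo copy` makes an authentication problem surface at the token request rather than as a less obvious push failure.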
## Gitea Registry Credentials
The deploy workflow uses the repository Actions secret `REGISTRY_TOKEN` to
request a short-lived registry bearer token from
`https://gitea.coulomb.social/v2/token`.
If publishing starts failing with an authentication error:
1. Generate or rotate a Gitea token with package write access.
2. Update the `REGISTRY_TOKEN` Actions secret for `coulomb/inter-hub`.
3. Rerun the workflow or push a non-production test commit.
Do not print token values in logs, State Hub, or commits.
## Runtime Secret Source
The live deployment currently consumes the Kubernetes Secret
`inter-hub/inter-hub-env`. The durable source file is:
```text
deploy/railiance/secrets/inter-hub.env.sops.yaml
```
Create or refresh it from the live Secret using:
```bash
tmp="$(mktemp)"
trap 'rm -f "$tmp"' EXIT
kubectl -n inter-hub get secret inter-hub-env -o json \
  | python3 deploy/railiance/secrets/k8s-secret-json-to-sops-input.py \
  > "$tmp"
sops --encrypt \
  --age age1aq8twfd78wvpra0had8cezcnj96tj4q0068edrz5jez8d6xwmflqdepsh4 \
  "$tmp" > deploy/railiance/secrets/inter-hub.env.sops.yaml
```
Apply the encrypted source:
```bash
sops -d deploy/railiance/secrets/inter-hub.env.sops.yaml \
  | kubectl apply -f -
kubectl rollout restart deployment/inter-hub -n inter-hub
kubectl rollout status deployment/inter-hub -n inter-hub
```
Custody-backed recovery verification:
```bash
# after the approved custody unlock makes the age identity available
make recovery-drill
```
The drill prints UTC/local timestamps, verifies that the committed SOPS file can
be decrypted in memory, checks the expected Secret metadata and key names, and
does not print secret values. Keep the PASS output as non-secret recovery
evidence.
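
The key-name check can also be approximated by hand. This is a minimal sketch, not the implementation behind `make recovery-drill`: it assumes the decrypted file is a Kubernetes Secret manifest whose keys are indented two spaces under `data:` or `stringData:`. The function name `list_secret_keys` and the sample key names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print only the key names of a Secret manifest read
# from stdin, never the values.
# Assumption: keys sit two spaces deep under `data:` or `stringData:`.
list_secret_keys() {
  awk '
    /^(data|stringData):[ \t]*$/ { in_block = 1; next }   # enter the key block
    /^[^ \t]/                    { in_block = 0 }         # next top-level key
    in_block && /^  [A-Za-z0-9_.-]+:/ {
      sub(/^  /, ""); sub(/:.*/, ""); print
    }
  '
}

# Offline demonstration with made-up keys and values.
list_secret_keys <<'YAML'
apiVersion: v1
kind: Secret
metadata:
  name: inter-hub-env
  namespace: inter-hub
stringData:
  EXAMPLE_KEY_A: not-a-real-value
  EXAMPLE_KEY_B: not-a-real-value
type: Opaque
YAML
```

Against the real file the input would be `sops -d deploy/railiance/secrets/inter-hub.env.sops.yaml | list_secret_keys`, which keeps the decrypted values in the pipe and off the terminal.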
## Database Migration
The current Nix production image is intentionally minimal: image metadata for
`6455902` points at
`/nix/store/<hash>-inter-hub/bin/RunProdServer`, and the package contains only
`RunProdServer` and `RunJobs`. It has no shell and no packaged migration
runner, so schema work is performed through the CloudNativePG pod.
Check schema state:
```bash
kubectl exec -n databases net-kingdom-pg-1 -- \
  psql -d interhub -Atc "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public';"
```
Initialize a blank production database from the canonical schema:
```bash
kubectl exec -i -n databases net-kingdom-pg-1 -- \
  psql -d interhub -v ON_ERROR_STOP=1 -1 -f - < Application/Schema.sql
kubectl exec -i -n databases net-kingdom-pg-1 -- \
  psql -d interhub -v ON_ERROR_STOP=1 -1 -f - < Application/Migration/1744502400-seed-type-registries.sql
kubectl exec -i -n databases net-kingdom-pg-1 -- psql -d interhub -v ON_ERROR_STOP=1 -1 -f - <<'SQL'
GRANT USAGE ON SCHEMA public TO interhub;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO interhub;
GRANT USAGE, SELECT, UPDATE ON ALL SEQUENCES IN SCHEMA public TO interhub;
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA public TO interhub;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO interhub;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT USAGE, SELECT, UPDATE ON SEQUENCES TO interhub;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT EXECUTE ON FUNCTIONS TO interhub;
SQL
kubectl rollout restart deployment/inter-hub -n inter-hub
kubectl rollout status deployment/inter-hub -n inter-hub
```
Do not apply `1744416000-seed-admin-user.sql` unattended in production; it uses
a documented default password intended for initial local deployment only.
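
If the initialization commands are ever wrapped in a script, a deny-list can enforce the rule above. This is a sketch under assumptions: `migration_allowed` and `DENYLIST` are hypothetical names, and the admin seed file is assumed to live in `Application/Migration/` alongside the type-registry seed.

```bash
#!/usr/bin/env bash
# Hypothetical guard: refuse SQL files that must not run unattended.
# The deny-list holds the one file this runbook names; extend as needed.
DENYLIST="1744416000-seed-admin-user.sql"

migration_allowed() {
  local name denied
  name="$(basename "$1")"
  for denied in $DENYLIST; do
    if [ "$name" = "$denied" ]; then
      echo "refusing $name: not allowed unattended in production" >&2
      return 1
    fi
  done
  return 0
}

for f in Application/Migration/1744502400-seed-type-registries.sql \
         Application/Migration/1744416000-seed-admin-user.sql; do
  if migration_allowed "$f"; then
    echo "would apply: $f"
  else
    echo "skipped:     $f"
  fi
done
```

Matching on the basename keeps the guard effective regardless of the directory the file is referenced from.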
## Logs
```bash
kubectl logs -n inter-hub -l app=inter-hub --tail=100 -f
# Previous pod logs:
kubectl logs -n inter-hub -l app=inter-hub --previous --tail=50
```
## Restart / Rollback
```bash
# Restart:
kubectl rollout restart deployment/inter-hub -n inter-hub
kubectl rollout status deployment/inter-hub -n inter-hub
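# Roll back to the previous Deployment revision (Helm release history is not updated):
kubectl rollout undo deployment/inter-hub -n inter-hub
# Roll back to a specific Helm revision (list them with `helm history inter-hub -n inter-hub`):
helm rollback inter-hub <revision> --namespace inter-hub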
```
## Secret Rotation
To rotate the session secret:
```bash
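# Opens the decrypted file in $EDITOR and re-encrypts it on save:
sops deploy/railiance/secrets/inter-hub.env.sops.yaml
# Apply the updated Secret, then restart so the pods pick up the new value:
sops -d deploy/railiance/secrets/inter-hub.env.sops.yaml | kubectl apply -f -
kubectl rollout restart deployment/inter-hub -n inter-hub
kubectl rollout status deployment/inter-hub -n inter-hub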
```
To rotate the database password:
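1. Update the password in PostgreSQL (via `kubectl exec` to the CNPG pod).
2. Update the password in `deploy/railiance/secrets/inter-hub.env.sops.yaml` and
   apply it to the `inter-hub-env` Secret, so the durable source and the live
   Secret stay in sync.
3. Restart the deployment.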
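
For the first step, the statement sent to PostgreSQL has to quote the new password correctly. The sketch below only builds that statement. The function name `alter_role_sql` is hypothetical, and it assumes the `interhub` role's password is set directly with `ALTER ROLE`. If CloudNativePG manages the role declaratively, the operator may reconcile the password, so check that first.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the ALTER ROLE statement for step 1.
# The new password is read from stdin so it stays out of argv and `ps` output.
# Single quotes are doubled, which is how SQL string literals escape them.
alter_role_sql() {
  local role="$1" password
  IFS= read -r password
  password="$(printf '%s' "$password" | sed "s/'/''/g")"
  printf "ALTER ROLE \"%s\" WITH PASSWORD '%s';\n" "$role" "$password"
}

# Offline demonstration with a made-up password containing a quote.
printf '%s\n' "ex'ample" | alter_role_sql interhub
```

The output can be piped into `kubectl exec -i -n databases net-kingdom-pg-1 -- psql -d interhub -v ON_ERROR_STOP=1 -f -`, the same invocation pattern used under Database Migration. Depending on the server's logging configuration the statement may be logged, so treat the database logs as sensitive afterwards.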
## Smoke Test
```bash
getent ahosts hub.coulomb.social # expected: 92.205.130.254
curl -fsS https://hub.coulomb.social/ | grep "inter-hub"
curl -fsS https://hub.coulomb.social/api/v2/openapi.json >/dev/null
curl -s -o /dev/null -w "%{http_code}" https://hub.coulomb.social/api/v2/widgets | grep 401
curl -s -o /dev/null -w "%{http_code}" https://hub.coulomb.social/api/v2/hubs | grep 401
```
## Database Connection Check
The IHP Nix image has no `/bin/sh`. Connect via the CNPG pod instead:
```bash
kubectl exec -n databases net-kingdom-pg-1 -- psql -U postgres -d interhub -c "SELECT version();"
```
## Password Hashing
IHP uses `pwstore-fast` (`Crypto.PasswordStore`) — **not bcrypt**. Hash format:
```text
sha256|17|<base64-salt>|<base64-hash>
```
To generate a correct hash (requires GHC with pwstore-fast available on haskelseed):
```bash
ssh root@192.168.178.135
cat > /tmp/genhash.hs << 'EOF'
import qualified Crypto.PasswordStore as PS
import qualified Data.ByteString.Char8 as B8
main :: IO ()
main = do
  h <- PS.makePassword (B8.pack "yourpassword") 17
  B8.putStrLn h
EOF
/nix/store/yp23474ys67f1fd2z2ff1nn3q5wrmjng-ghc-9.10.3-with-packages/bin/runghc /tmp/genhash.hs
```
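
Before writing a generated hash into the database, it can help to confirm that it has the four-field layout shown above and is not a bcrypt string. This is a minimal sketch: `describe_pwstore_hash` is a hypothetical name, the sample salt and digest are placeholders, and the iteration count is assumed to be 2^strength as described in the pwstore-fast documentation.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split a pwstore-fast hash into its four fields and
# report the iteration count implied by the strength field.
# Assumption: the iteration count is 2^strength, per the pwstore-fast docs.
describe_pwstore_hash() {
  local algo strength salt digest
  IFS='|' read -r algo strength salt digest <<<"$1"
  if [ "$algo" != "sha256" ] || [ -z "$digest" ]; then
    echo "not a pwstore-fast hash" >&2
    return 1
  fi
  echo "algorithm:  $algo"
  echo "strength:   $strength"
  echo "iterations: $((1 << strength))"
}

# Made-up salt and digest placeholders; only the layout matters here.
describe_pwstore_hash 'sha256|17|c2FsdA==|ZGlnZXN0'
```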
## haskelseed Build VM
- **Host:** 192.168.178.135
- **Access:** ops-bridge SSH path with the approved operator key
- **Role:** self-hosted Gitea Actions runner and Nix build machine only
- **Runner:** OpenRC `act_runner` service registered to `https://gitea.coulomb.social`
- **Build logs:** Gitea Actions logs and temporary runner work directories
- **Nix store:** `/dev/sdb1` (100 GB, mounted at `/nix`)