---
id: RAILIANCE-WP-0004
type: workplan
title: "App deployment improvements (lessons from RAILIANCE-WP-0002)"
domain: railiance
repo: railiance-apps
status: backlog
owner: railiance
topic_slug: railiance
planning_priority: medium
created: "2026-05-19"
updated: "2026-05-19"
state_hub_workstream_id: "b61a9aca-4e43-4b3d-a48b-999e0fa842cf"
---

# App deployment improvements

This workplan collects concrete follow-ups surfaced while shipping `vergabe-teilnahme` under `RAILIANCE-WP-0002`. Each item is small, independent, and can be picked up in isolation when the next S5 app lands or when the next operator onboards.

Status is `backlog` — nothing here is blocking the live deployment.

## I01 — URL-encode DB passwords at Secret-build time

```task
id: RAILIANCE-WP-0004-I01
status: todo
priority: medium
state_hub_task_id: "a05a855a-00a0-4e0e-ba82-27e0a072f777"
```

**Problem.** cnpg-generated bootstrap passwords come from `openssl rand -base64 N` and contain `=`, `+`, `/`. Embedded raw in `DATABASE_URL`, those characters confuse `dj-database-url` (it parsed `vergabe:@apps-pg-rw:5432/vergabe_db` as having an 80-character database name). Cost us one Helm revision and one pod restart to diagnose.

**Fix.** Add a tiny helper (shell script or Makefile target) that takes the raw role password from the cnpg secret and emits the DSN-ready URL-encoded form into the consumer-namespace env Secret. Alternative: switch to individual env vars (`POSTGRES_HOST`, `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB`) so no URL parsing is needed at all.

**Where it lives:** new `tools/` script + Makefile target, or chart helper template.

---

## I02 — Document the Django + kube-probe Host-header pattern

```task
id: RAILIANCE-WP-0004-I02
status: todo
priority: low
state_hub_task_id: "22a212e6-31b1-490a-8d1c-0a33ddc62501"
```

**Problem.** The kube-probe sends `Host: :8000`.
With production Django settings (`DEBUG=False`, narrow `ALLOWED_HOSTS`), that fails the Host validation and returns `HTTP 400 Bad Request`, which the kubelet treats as Unhealthy. The first deploy revision restarted on liveness failure for ~5 minutes before diagnosis.

**Fix.** The `charts/vergabe-teilnahme` chart already sets `httpGet.httpHeaders[Host]` from `probes.hostHeader`. Promote this pattern into a documented "Django-on-Railiance" recipe (short doc in `docs/`) so the next Django app starts there rather than rediscovering the gotcha. A "common chart values" sketch is also worth adding if a second Django app justifies the abstraction.

---

## I03 — Publish `issue-core` to a Gitea Python package registry

```task
id: RAILIANCE-WP-0004-I03
status: todo
priority: medium
state_hub_task_id: "f412b874-0670-4a4a-89fc-575fe4994646"
```

**Problem.** `vergabe-teilnahme/pyproject.toml` has a path dependency on `../issue-core`. Building the container image therefore requires the `--build-context issue-core=/home/worsch/issue-core` BuildKit flag, which is operator-machine-specific and breaks CI builds, remote builds, and builds on other workstations.

**Fix.** Enable the Gitea Python package registry (analogous to the container registry from `RAIL-AP-WP-0001`), publish `issue-core` as a proper versioned wheel, and switch the dep to `issue-core>=0.2,<0.3` with a normal index URL. The image build then drops the `--build-context` flag and becomes portable.

**Coordination:** depends on Gitea PyPI enablement in `railiance-apps` (small Helm values change) and a release pipeline for `issue-core` (separate repo).

---

## I04 — Operator onboarding: install the `kubectl cnpg` plugin

```task
id: RAILIANCE-WP-0004-I04
status: todo
priority: low
state_hub_task_id: "2f44cad1-b70c-4406-91a9-0c0fa9c75583"
```

**Problem.** `make vergabe-status`, `apps-pg-status`, and `db-shell` use `kubectl cnpg ...` first and fall back to bare `kubectl` when the plugin is missing.
The fallback works, but the cnpg plugin gives much better cluster diagnostics (`status` table, primary/replica health, backup state).

**Fix.** Add the plugin install command to operator onboarding (one line: `kubectl krew install cnpg` or a direct binary download). Add a `make check-tools` target that warns when `kubectl cnpg` or `helm` is missing.

---

## I05 — Operator onboarding: SOPS / age key bootstrap

```task
id: RAILIANCE-WP-0004-I05
status: todo
priority: low
state_hub_task_id: "741d8a73-8cb0-40ac-a218-f1d3a74ebef3"
```

**Problem.** Several Makefile targets read `helm/*.sops.yaml` via `sops -d`. A new operator with no `~/.config/sops/age/keys.txt` sees a confusing decryption failure rather than a clear "you need the age key" message. The session that produced this workplan had to skip the SOPS template step for `apps-pg-secret.sops.yaml.template`.

**Fix.** Add a `docs/operator-setup.md` with the age key handoff procedure (where to put the key, how to verify, how to rotate). A `make check-sops` target that asserts the keys file exists and can decrypt a known sentinel would catch this at the first deploy attempt rather than at the failing apply.

---

## I06 — CI guard against stale committed manifests vs live CRD drift

```task
id: RAILIANCE-WP-0004-I06
status: todo
priority: medium
state_hub_task_id: "a319c20b-993c-46b7-889a-f0ac738056c4"
```

**Problem.** `helm/gitea-db-cluster.yaml` (in `railiance-platform`) had `spec.postgresql.version: "16"` — a field that has never existed in the CNPG v1 schema. The committed manifest had silently diverged from the live cluster for months and would have been rejected on the next `make db-deploy`. It was caught only by trying to apply a new file that copied the same stale shape.

**Fix.** Add a per-PR CI job that runs `kubectl apply --dry-run=server -f ` against a representative cluster (or a kind cluster seeded with the same CRDs).
The cnpg / cert-manager / Traefik CRDs change between operator releases; strict server-side decoding catches drift that `yamllint` and Helm template rendering miss.

**Note.** Primarily a `railiance-platform` and `railiance-cluster` concern, but mirrored here because every S5 manifest in `charts/` and `manifests/` carries the same risk.

---

## I07 — `kubectl run --rm -i` smoke pattern is unreliable

```task
id: RAILIANCE-WP-0004-I07
status: todo
priority: low
state_hub_task_id: "e3f59b3d-95c8-4cf9-9943-b1597954fd77"
```

**Problem.** Repeated false negatives when testing service-IP connectivity with `kubectl run --rm -i …`: the smoke pod exits before the connection completes, producing "Connection refused" output even though the destination service was fully healthy. This wasted significant debugging time during apps-pg verification before switching to a persistent pod + `kubectl exec`.

**Fix.** Add a `docs/operator-recipes.md` note (or inline in the runbook) recommending the persistent-pod-plus-exec pattern for any service-IP smoke check. Optional: ship `tools/smoke.sh` that wraps the pattern.

---

## Notes

- Items are individually `todo`; the workplan status is `backlog` so they don't show up in active-workstream lists. Promote an item to `active` (and its tasks to `in_progress`) when you pick it up.
- I06 is genuinely cross-repo; the others are local to `railiance-apps` or its operator workflow.
- The first three items (I01, I02, I03) are the highest-leverage for the second S5 app onboarding.
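
For I01, the URL-encoding step could look roughly like the sketch below. It is a minimal illustration, not the planned helper: the function name `build_database_url`, the `postgres://` scheme, and the sample password are assumptions made for this example, while the host, port, user, and database names are taken from the DSN quoted in I01. It relies only on the Python standard library and assumes the consumer percent-decodes the userinfo part of the URL, as URL-based DSN parsers normally do.

```python
from urllib.parse import quote, unquote, urlsplit


def build_database_url(user: str, password: str, host: str, port: int, dbname: str) -> str:
    """Return a DSN with user, password, and database name percent-encoded.

    Hypothetical helper for illustration; the name and signature are not
    taken from the repository.
    """
    # safe="" makes quote() escape "/" as well, which it leaves alone by default.
    enc_user = quote(user, safe="")
    enc_password = quote(password, safe="")
    enc_dbname = quote(dbname, safe="")
    return f"postgres://{enc_user}:{enc_password}@{host}:{port}/{enc_dbname}"


if __name__ == "__main__":
    # Sample value only: contains the "+", "/", "=" characters typical of base64 output.
    raw_password = "ab+/cd=="
    url = build_database_url("vergabe", raw_password, "apps-pg-rw", 5432, "vergabe_db")
    print(url)

    # Round trip: splitting the URL and decoding the password recovers the raw value.
    parts = urlsplit(url)
    print(unquote(parts.password) == raw_password)
```

Encoding each component separately, rather than the whole URL at once, keeps the structural `:`, `@`, and `/` delimiters intact while escaping the same characters when they occur inside the password.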