Propose RAILIANCE-WP-0004: app deployment improvements backlog

7 items surfaced during RAILIANCE-WP-0002 (vergabe-teilnahme launch): URL-encoding DB passwords at Secret-build time, Django+kube-probe Host-header pattern, publishing issue-core to a Gitea PyPI registry to remove the BuildKit --build-context dependency, kubectl cnpg plugin + SOPS/age in operator onboarding, CI guard against stale yaml vs live CRD drift, and persistent-pod smoke pattern over kubectl run --rm. Status backlog; pick up individually before the second S5 app onboards.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 20:38:13 +02:00
parent 962c5a1b36
commit 864bb9d1dc
1 changed files with 200 additions and 0 deletions
--- a/workplans/railiance-apps-WP-0004-app-deployment-improvements.md
+++ b/workplans/railiance-apps-WP-0004-app-deployment-improvements.md
@@ -0,0 +1,200 @@
+---
+id: RAILIANCE-WP-0004
+type: workplan
+title: "App deployment improvements (lessons from RAILIANCE-WP-0002)"
+domain: railiance
+repo: railiance-apps
+status: backlog
+owner: railiance
+topic_slug: railiance
+planning_priority: medium
+created: "2026-05-19"
+updated: "2026-05-19"
+---
+
+# App deployment improvements
+
+This workplan collects concrete follow-ups surfaced while shipping
+`vergabe-teilnahme` under `RAILIANCE-WP-0002`. Each item is small,
+independent, and can be picked up in isolation when the next S5 app
+lands or when the next operator onboards. Status is `backlog` —
+nothing here is blocking the live deployment.
+
+## I01 — URL-encode DB passwords at Secret-build time
+
+```task
+id: RAILIANCE-WP-0004-I01
+status: todo
+priority: medium
+```
+
+**Problem.** cnpg-generated bootstrap passwords come from
+`openssl rand -base64 N` and can contain `=`, `+`, `/`. Embedded raw in
+`DATABASE_URL`, an unescaped `/` ends the URL authority early, so
+`dj-database-url` parsed `vergabe:<pw>@apps-pg-rw:5432/vergabe_db` as
+having an 80-character database name. Diagnosing this cost one Helm
+revision and one pod restart.
+
+**Fix.** Add a tiny helper (shell script or Makefile target) that
+takes the raw role password from the cnpg secret and emits the
+DSN-ready URL-encoded form into the consumer-namespace env Secret.
+Alternative: switch to individual env vars (`POSTGRES_HOST`,
+`POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB`) so no URL
+parsing is needed at all.
+
+**Where it lives:** new `tools/` script + Makefile target, or chart
+helper template.
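+
+A minimal sketch of the encoding step, assuming the helper is written
+in Python and the consumer percent-decodes the password, as URL-based
+DSN parsers generally do. The name `build_database_url` and the sample
+password are illustrative and not taken from the repo.
+
+```python
+from urllib.parse import quote, unquote, urlsplit
+
+
+def build_database_url(user, password, host, port, dbname):
+    """Build a postgres:// DSN with the credentials percent-encoded.
+
+    Hypothetical helper: the name and signature are assumptions.
+    """
+    # safe="" is required: by default quote() leaves "/" unescaped.
+    return "postgres://{user}:{pw}@{host}:{port}/{db}".format(
+        user=quote(user, safe=""),
+        pw=quote(password, safe=""),
+        host=host,
+        port=port,
+        db=quote(dbname, safe=""),
+    )
+
+
+raw_password = "ab+/cd=="  # made-up value with the problematic characters
+
+# Raw embedding: the "/" in the password ends the authority part early.
+broken = urlsplit(f"postgres://vergabe:{raw_password}@apps-pg-rw:5432/vergabe_db")
+
+# Encoded embedding: host, port, and path parse as intended.
+dsn = build_database_url("vergabe", raw_password, "apps-pg-rw", 5432, "vergabe_db")
+fixed = urlsplit(dsn)
+
+print(broken.path)  # the path swallows the rest of the password and the host
+print(fixed.hostname, fixed.port, fixed.path)
+print(unquote(fixed.password) == raw_password)  # the encoding round-trips
+```
+
+The same call would fit in a Makefile target that reads the raw
+password from the cnpg secret and writes the encoded DSN into the env
+Secret. The individual-env-var alternative avoids the encoding step
+entirely.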
+
+---
+
+## I02 — Document the Django + kube-probe Host-header pattern
+
+```task
+id: RAILIANCE-WP-0004-I02
+status: todo
+priority: low
+```
+
+**Problem.** The kube-probe sends `Host: <pod-ip>:8000`. With
+production Django settings (`DEBUG=False`, narrow `ALLOWED_HOSTS`),
+that fails the Host validation and returns `HTTP 400 Bad Request`,
+which the kubelet treats as Unhealthy. The first deploy revision kept
+restarting on liveness failures for ~5 minutes before diagnosis.
+
+**Fix.** The `charts/vergabe-teilnahme` chart already sets
+`httpGet.httpHeaders[Host]` from `probes.hostHeader`. Promote this
+pattern into a documented "Django-on-Railiance" recipe (short doc in
+`docs/`) so the next Django app starts there rather than rediscovering
+the gotcha. Also worth a "common chart values" sketch if a second
+Django app justifies the abstraction.
+
+---
+
+## I03 — Publish `issue-core` to a Gitea Python package registry
+
+```task
+id: RAILIANCE-WP-0004-I03
+status: todo
+priority: medium
+```
+
+**Problem.** `vergabe-teilnahme/pyproject.toml` has a path dependency
+on `../issue-core`. Building the container image therefore requires
+the `--build-context issue-core=/home/worsch/issue-core` BuildKit
+flag, which is operator-machine-specific and breaks CI builds /
+remote builds / other workstations.
+
+**Fix.** Enable the Gitea Python package registry (analogous to the
+container registry from `RAIL-AP-WP-0001`), publish `issue-core` as a
+proper wheel with version, and switch the dep to
+`issue-core>=0.2,<0.3` with a normal index URL. The build command then
+drops the `--build-context` flag and the build becomes portable.
+
+**Coordination:** depends on Gitea PyPI enablement in `railiance-apps`
+(small Helm values change) and a release pipeline for `issue-core`
+(separate repo).
+
+---
+
+## I04 — Operator onboarding: install the `kubectl cnpg` plugin
+
+```task
+id: RAILIANCE-WP-0004-I04
+status: todo
+priority: low
+```
+
+**Problem.** The `make vergabe-status`, `apps-pg-status`, and `db-shell`
+targets try `kubectl cnpg ...` first and fall back to bare `kubectl`
+when the plugin is missing. The fallback works, but the cnpg plugin
+gives much better cluster diagnostics (`status` table, primary/replica
+health, backup state).
+
+**Fix.** Add the plugin install command to operator onboarding (one
+line: `kubectl krew install cnpg` or a direct binary download). Add
+a `make check-tools` target that warns when `kubectl cnpg` or `helm`
+is missing.
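+
+One possible shape for the check behind `make check-tools`, sketched
+in Python. It relies on kubectl discovering plugins as executables
+named `kubectl-<name>` on `PATH`. The tool list and the function name
+`missing_tools` are assumptions for illustration.
+
+```python
+import shutil
+
+# kubectl finds plugins as executables named kubectl-<name> on PATH,
+# so the cnpg plugin shows up as "kubectl-cnpg".
+REQUIRED_TOOLS = ["kubectl", "kubectl-cnpg", "helm"]
+
+
+def missing_tools(tools, which=shutil.which):
+    """Return the entries of `tools` that `which` cannot find on PATH."""
+    return [tool for tool in tools if which(tool) is None]
+
+
+def main():
+    missing = missing_tools(REQUIRED_TOOLS)
+    for tool in missing:
+        print(f"warning: {tool} not found on PATH")
+    return len(missing)  # number of missing tools; warn only, no hard failure
+
+
+if __name__ == "__main__":
+    main()
+```
+
+The `which` parameter exists only so the lookup can be replaced in
+tests. The script warns rather than fails, matching the bare-`kubectl`
+fallback the targets already have.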
+
+---
+
+## I05 — Operator onboarding: SOPS / age key bootstrap
+
+```task
+id: RAILIANCE-WP-0004-I05
+status: todo
+priority: low
+```
+
+**Problem.** Several Makefile targets read `helm/*.sops.yaml` via
+`sops -d`. A new operator with no `~/.config/sops/age/keys.txt`
+sees a confusing decryption failure rather than a clear "you need
+the age key" message. The session that produced this workplan had to
+skip the SOPS template step for `apps-pg-secret.sops.yaml.template`.
+
+**Fix.** Add a `docs/operator-setup.md` with the age key handoff
+procedure (where to put the key, how to verify, how to rotate). A
+`make check-sops` target that asserts the keys file exists and can
+decrypt a known sentinel would catch this at the first deploy attempt
+rather than at the failing apply.
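+
+A sketch of the file-level half of `make check-sops`, assuming a Python
+helper. It honours the `SOPS_AGE_KEY_FILE` override, otherwise falls
+back to `~/.config/sops/age/keys.txt`, and only checks that the file
+holds an age secret key. The function names are hypothetical, and the
+sentinel decryption is left out because it needs `sops` itself.
+
+```python
+from pathlib import Path
+
+
+def age_keys_file(env, home):
+    """Return the path where the age keys file is expected."""
+    # SOPS_AGE_KEY_FILE overrides the default location.
+    override = env.get("SOPS_AGE_KEY_FILE")
+    if override:
+        return Path(override)
+    return Path(home) / ".config" / "sops" / "age" / "keys.txt"
+
+
+def check_age_key(path):
+    """Return an error message, or None if the file looks usable."""
+    if not path.is_file():
+        return f"you need the age key: {path} does not exist"
+    # age secret keys are text lines starting with AGE-SECRET-KEY-
+    if "AGE-SECRET-KEY-" not in path.read_text():
+        return f"{path} exists but holds no AGE-SECRET-KEY- entry"
+    return None
+
+
+if __name__ == "__main__":
+    import os
+
+    problem = check_age_key(age_keys_file(os.environ, Path.home()))
+    print(problem or "age key file found")
+```
+
+A Makefile target could run this before any `sops -d` call, so a new
+operator sees the "you need the age key" message instead of a raw
+decryption failure.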
+
+---
+
+## I06 — CI guard against stale committed manifests vs live CRD drift
+
+```task
+id: RAILIANCE-WP-0004-I06
+status: todo
+priority: medium
+```
+
+**Problem.** `helm/gitea-db-cluster.yaml` (in `railiance-platform`)
+had `spec.postgresql.version: "16"` — a field that has never
+existed in the CNPG v1 schema. The committed manifest had silently
+diverged from the live cluster for months and would have been rejected
+on the next `make db-deploy`. It was caught only by trying to apply a
+new file that copied the same stale shape.
+
+**Fix.** Add a per-PR CI job that runs
+`kubectl apply --dry-run=server -f <changed-yaml>` against a
+representative cluster (or a kind cluster seeded with the same CRDs).
+The cnpg / cert-manager / Traefik CRDs change between operator
+releases; strict server-side decoding catches drift that
+`yamllint` and Helm template rendering miss.
+
+**Note.** Primarily a `railiance-platform` and `railiance-cluster`
+concern, but mirrored here because every S5 manifest in
+`charts/` and `manifests/` carries the same risk.
+
+---
+
+## I07 — `kubectl run --rm -i` smoke pattern is unreliable
+
+```task
+id: RAILIANCE-WP-0004-I07
+status: todo
+priority: low
+```
+
+**Problem.** Repeated false negatives when testing service-IP
+connectivity with `kubectl run --rm -i …`: the smoke pod exits
+before the connection completes, producing "Connection refused"
+output even though the destination service was fully healthy. Wasted
+significant debugging time during apps-pg verification before
+switching to a persistent pod + `kubectl exec`.
+
+**Fix.** Add a `docs/operator-recipes.md` note (or inline in the
+runbook) recommending the persistent-pod-plus-exec pattern for any
+service-IP smoke check. Optional: ship `tools/smoke.sh` that
+wraps the pattern.
+
+---
+
+## Notes
+
+- Items are individually `todo`; the workplan status is `backlog` so
+  they don't show up in active-workstream lists. Promote an item to
+  `active` (and its tasks to `in_progress`) when you pick it up.
+- I06 is genuinely cross-repo; the others are local to
+  `railiance-apps` or its operator workflow.
+- The first three items (I01, I02, I03) are the highest-leverage ones
+  for the second S5 app onboarding.