From 864bb9d1dc0b73fc675ed81fcabe05dff8c3c357 Mon Sep 17 00:00:00 2001
From: tegwick
Date: Tue, 19 May 2026 20:38:13 +0200
Subject: [PATCH] Propose RAILIANCE-WP-0004: app deployment improvements backlog

7 items surfaced during RAILIANCE-WP-0002 (vergabe-teilnahme launch):
URL-encoding DB passwords at Secret-build time, Django+kube-probe
Host-header pattern, publishing issue-core to a Gitea PyPI registry to
remove the BuildKit --build-context dependency, kubectl cnpg plugin +
SOPS/age in operator onboarding, CI guard against stale yaml vs live
CRD drift, and persistent-pod smoke pattern over kubectl run --rm.

Status backlog; pick up individually before the second S5 app onboards.

Co-Authored-By: Claude Opus 4.7
---
 ...pps-WP-0004-app-deployment-improvements.md | 200 ++++++++++++++++++
 1 file changed, 200 insertions(+)
 create mode 100644 workplans/railiance-apps-WP-0004-app-deployment-improvements.md

diff --git a/workplans/railiance-apps-WP-0004-app-deployment-improvements.md b/workplans/railiance-apps-WP-0004-app-deployment-improvements.md
new file mode 100644
index 0000000..33ab78c
--- /dev/null
+++ b/workplans/railiance-apps-WP-0004-app-deployment-improvements.md
@@ -0,0 +1,200 @@
+---
+id: RAILIANCE-WP-0004
+type: workplan
+title: "App deployment improvements (lessons from RAILIANCE-WP-0002)"
+domain: railiance
+repo: railiance-apps
+status: backlog
+owner: railiance
+topic_slug: railiance
+planning_priority: medium
+created: "2026-05-19"
+updated: "2026-05-19"
+---
+
+# App deployment improvements
+
+This workplan collects concrete follow-ups surfaced while shipping
+`vergabe-teilnahme` under `RAILIANCE-WP-0002`. Each item is small,
+independent, and can be picked up in isolation when the next S5 app
+lands or when the next operator onboards. Status is `backlog` —
+nothing here is blocking the live deployment.
+
+## I01 — URL-encode DB passwords at Secret-build time
+
+```task
+id: RAILIANCE-WP-0004-I01
+status: todo
+priority: medium
+```
+
+**Problem.** cnpg-generated bootstrap passwords come from
+`openssl rand -base64 N` and contain `=`, `+`, `/`. Embedded raw in
+`DATABASE_URL`, those characters confuse `dj-database-url` (it parsed
+`vergabe:@apps-pg-rw:5432/vergabe_db` as having an 80-character
+database name). Cost us one Helm revision and one pod restart to
+diagnose.
+
+**Fix.** Add a tiny helper (shell script or Makefile target) that
+takes the raw role password from the cnpg secret and emits the
+DSN-ready URL-encoded form into the consumer-namespace env Secret.
+Alternative: switch to individual env vars (`POSTGRES_HOST`,
+`POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB`) so no URL
+parsing is needed at all.
+
+**Where it lives:** new `tools/` script + Makefile target, or chart
+helper template.
+
+---
+
+## I02 — Document the Django + kube-probe Host-header pattern
+
+```task
+id: RAILIANCE-WP-0004-I02
+status: todo
+priority: low
+```
+
+**Problem.** The kube-probe sends `Host: :8000`. With
+production Django settings (`DEBUG=False`, narrow `ALLOWED_HOSTS`),
+that fails the Host validation and returns `HTTP 400 Bad Request`,
+which the kubelet treats as Unhealthy. First deploy revision
+restarted on liveness failure for ~5 minutes before diagnosis.
+
+**Fix.** The `charts/vergabe-teilnahme` chart already sets
+`httpGet.httpHeaders[Host]` from `probes.hostHeader`. Promote this
+pattern into a documented "Django-on-Railiance" recipe (short doc in
+`docs/`) so the next Django app starts there rather than rediscovering
+the gotcha. Also worth a "common chart values" sketch if a second
+Django app justifies the abstraction.
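
The I02 failure mode can be reproduced locally without Django or a cluster. The sketch below is a stand-in only: a stdlib HTTP server that rejects any request whose `Host` header is not on an allow-list, roughly the way Django's `ALLOWED_HOSTS` check behaves, plus a probe function that optionally overrides the header. The hostname `vergabe.example.org`, the `/healthz` path, and all function names are hypothetical and do not come from the chart.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical allow-list, standing in for a narrow Django ALLOWED_HOSTS.
ALLOWED_HOSTS = {"vergabe.example.org"}


class HostCheckingHandler(BaseHTTPRequestHandler):
    """Answers 200 only when the Host header (port stripped) is allowed."""

    def do_GET(self):
        host = (self.headers.get("Host") or "").partition(":")[0]
        status = 200 if host in ALLOWED_HOSTS else 400
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        # Keep the demo quiet; the default implementation logs to stderr.
        pass


def probe(port, host_header=None):
    """Send a GET like a probe would; optionally override the Host header."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    headers = {"Host": host_header} if host_header else {}
    conn.request("GET", "/healthz", headers=headers)
    response = conn.getresponse()
    response.read()
    conn.close()
    return response.status


def run_demo():
    """Return (status without override, status with override)."""
    server = HTTPServer(("127.0.0.1", 0), HostCheckingHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        return probe(port), probe(port, "vergabe.example.org")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

The first probe carries the address-and-port `Host` value that the HTTP client generates by default and is rejected; the second sets the header explicitly, which is the same effect the chart achieves through `httpGet.httpHeaders`.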
+
+---
+
+## I03 — Publish `issue-core` to a Gitea Python package registry
+
+```task
+id: RAILIANCE-WP-0004-I03
+status: todo
+priority: medium
+```
+
+**Problem.** `vergabe-teilnahme/pyproject.toml` has a path dependency
+on `../issue-core`. Building the container image therefore requires
+the `--build-context issue-core=/home/worsch/issue-core` BuildKit
+flag, which is operator-machine-specific and breaks CI builds /
+remote builds / other workstations.
+
+**Fix.** Enable the Gitea Python package registry (analogous to the
+container registry from `RAIL-AP-WP-0001`), publish `issue-core` as a
+proper wheel with version, and switch the dep to
+`issue-core>=0.2,<0.3` with a normal index URL. The Dockerfile then
+drops the `--build-context` and the build becomes portable.
+
+**Coordination:** depends on Gitea PyPI enablement in `railiance-apps`
+(small Helm values change) and a release pipeline for `issue-core`
+(separate repo).
+
+---
+
+## I04 — Operator onboarding: install the `kubectl cnpg` plugin
+
+```task
+id: RAILIANCE-WP-0004-I04
+status: todo
+priority: low
+```
+
+**Problem.** `make vergabe-status`, `apps-pg-status`, `db-shell` use
+`kubectl cnpg ...` first and fall back to bare `kubectl` when the
+plugin is missing. The fallback works but the cnpg plugin gives much
+better cluster diagnostics (`status` table, primary/replica health,
+backup state).
+
+**Fix.** Add the plugin install command to operator onboarding (one
+line: `kubectl krew install cnpg` or a direct binary download). Add
+a `make check-tools` target that warns when `kubectl cnpg` or `helm`
+is missing.
+
+---
+
+## I05 — Operator onboarding: SOPS / age key bootstrap
+
+```task
+id: RAILIANCE-WP-0004-I05
+status: todo
+priority: low
+```
+
+**Problem.** Several Makefile targets read `helm/*.sops.yaml` via
+`sops -d`. A new operator with no `~/.config/sops/age/keys.txt`
+sees a confusing decryption failure rather than a clear "you need
+the age key" message. The session that produced this workplan had to
+skip the SOPS template step for `apps-pg-secret.sops.yaml.template`.
+
+**Fix.** Add a `docs/operator-setup.md` with the age key handoff
+procedure (where to put the key, how to verify, how to rotate). A
+`make check-sops` target that asserts the keys file exists and can
+decrypt a known sentinel would catch this at the first deploy attempt
+rather than at the failing apply.
+
+---
+
+## I06 — CI guard against stale committed manifests vs live CRD drift
+
+```task
+id: RAILIANCE-WP-0004-I06
+status: todo
+priority: medium
+```
+
+**Problem.** `helm/gitea-db-cluster.yaml` (in `railiance-platform`)
+had `spec.postgresql.version: "16"` — a field that has never
+existed in the CNPG v1 schema. The committed manifest had silently
+diverged from the live cluster for months and would have been rejected
+on the next `make db-deploy`. Caught only by trying to apply a new file
+that copied the same stale shape.
+
+**Fix.** Add a per-PR CI job that runs
+`kubectl apply --dry-run=server -f ` against a
+representative cluster (or a kind cluster seeded with the same CRDs).
+The cnpg / cert-manager / Traefik CRDs change between operator
+releases; strict server-side decoding catches drift that
+`yamllint` and Helm template rendering miss.
+
+**Note.** Primarily a `railiance-platform` and `railiance-cluster`
+concern, but mirrored here because every S5 manifest in
+`charts/` and `manifests/` carries the same risk.
+
+---
+
+## I07 — `kubectl run --rm -i` smoke pattern is unreliable
+
+```task
+id: RAILIANCE-WP-0004-I07
+status: todo
+priority: low
+```
+
+**Problem.** Repeated false negatives when testing service-IP
+connectivity with `kubectl run --rm -i …`: the smoke pod exits
+before the connection completes, producing "Connection refused"
+output even though the destination service was fully healthy. Wasted
+significant debugging time during apps-pg verification before
+switching to a persistent pod + `kubectl exec`.
+
+**Fix.** Add a `docs/operator-recipes.md` note (or inline in the
+runbook) recommending the persistent-pod-plus-exec pattern for any
+service-IP smoke check. Optional: ship `tools/smoke.sh` that
+wraps the pattern.
+
+---
+
+## Notes
+
+- Items are individually `todo`; the workplan status is `backlog` so
+  they don't show up in active-workstream lists. Promote an item to
+  `active` (and its tasks to `in_progress`) when you pick it up.
+- I06 is genuinely cross-repo; the others are local to
+  `railiance-apps` or its operator workflow.
+- The first three items (I01, I02, I03) are the highest-leverage
+  for the second S5 app onboarding.
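
For I01, the encoding step itself is small. The following is a minimal sketch, assuming the helper ends up in Python rather than shell; the function name `build_database_url`, the sample password, and the `postgres://` scheme are illustrative assumptions, not taken from the repo. It percent-encodes the credentials with `urllib.parse.quote` and `safe=""`, so that `=`, `+`, and `/` can no longer be mistaken for URL structure.

```python
from urllib.parse import quote, unquote, urlsplit


def build_database_url(user, password, host, port, dbname):
    """Assemble a DSN with percent-encoded credentials.

    safe="" is deliberate: quote() leaves "/" unescaped by default,
    and "/" is one of the characters base64 passwords contain.
    """
    enc_user = quote(user, safe="")
    enc_password = quote(password, safe="")
    # The "postgres://" scheme is an assumption for this sketch.
    return f"postgres://{enc_user}:{enc_password}@{host}:{port}/{dbname}"


if __name__ == "__main__":
    # Made-up password containing the three problem characters.
    raw_password = "ab+/cd=="
    url = build_database_url(
        "vergabe", raw_password, "apps-pg-rw", 5432, "vergabe_db"
    )
    print(url)

    # Round-trip check: a URL parser now sees the intended pieces.
    parts = urlsplit(url)
    assert unquote(parts.password) == raw_password
    assert parts.hostname == "apps-pg-rw"
    assert parts.path == "/vergabe_db"
```

The round-trip at the end shows the property that matters: after encoding, the parser recovers the original password, host, and database name. The individual-env-var alternative named in I01 avoids this step entirely.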