Propose RAILIANCE-WP-0004: app deployment improvements backlog

7 items surfaced during RAILIANCE-WP-0002 (vergabe-teilnahme launch): URL-encoding DB passwords at Secret-build time, Django+kube-probe Host-header pattern, publishing issue-core to a Gitea PyPI registry to remove the BuildKit --build-context dependency, kubectl cnpg plugin + SOPS/age in operator onboarding, CI guard against stale yaml vs live CRD drift, and persistent-pod smoke pattern over kubectl run --rm.

Status: backlog; pick up individually before the second S5 app onboards.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 20:38:13 +02:00
parent 962c5a1b36
commit 864bb9d1dc


@@ -0,0 +1,200 @@
---
id: RAILIANCE-WP-0004
type: workplan
title: "App deployment improvements (lessons from RAILIANCE-WP-0002)"
domain: railiance
repo: railiance-apps
status: backlog
owner: railiance
topic_slug: railiance
planning_priority: medium
created: "2026-05-19"
updated: "2026-05-19"
---
# App deployment improvements
This workplan collects concrete follow-ups surfaced while shipping
`vergabe-teilnahme` under `RAILIANCE-WP-0002`. Each item is small,
independent, and can be picked up in isolation when the next S5 app
lands or when the next operator onboards. Status is `backlog`:
nothing here is blocking the live deployment.
## I01 — URL-encode DB passwords at Secret-build time
```task
id: RAILIANCE-WP-0004-I01
status: todo
priority: medium
```
**Problem.** cnpg-generated bootstrap passwords come from
`openssl rand -base64 N` and contain `=`, `+`, `/`. Embedded raw in
`DATABASE_URL`, those characters confuse `dj-database-url` (it parsed
`vergabe:<pw>@apps-pg-rw:5432/vergabe_db` as having an 80-character
database name). Diagnosing this cost one Helm revision and one pod
restart.
**Fix.** Add a tiny helper (shell script or Makefile target) that
takes the raw role password from the cnpg secret and emits the
DSN-ready URL-encoded form into the consumer-namespace env Secret.
Alternative: switch to individual env vars (`POSTGRES_HOST`,
`POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB`) so no URL
parsing is needed at all.
**Where it lives:** new `tools/` script + Makefile target, or chart
helper template.
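
The encoding step of the fix can be sketched as follows. This is a minimal illustration, not the project's helper: the function name `build_database_url` and the sample password are hypothetical, and it assumes the consumer percent-decodes the password component of the URL, as URL parsers conventionally do.

```python
from urllib.parse import quote


def build_database_url(user: str, password: str, host: str, port: int, dbname: str) -> str:
    """Return a postgres:// DSN with user and password percent-encoded."""
    # safe="" matters: by default quote() leaves "/" unescaped, and "/" is
    # one of the characters base64 output can contain.
    enc_user = quote(user, safe="")
    enc_password = quote(password, safe="")
    return f"postgres://{enc_user}:{enc_password}@{host}:{port}/{dbname}"


if __name__ == "__main__":
    # Hypothetical password containing "+", "/" and "=" padding.
    print(build_database_url("vergabe", "ab+cd/ef==", "apps-pg-rw", 5432, "vergabe_db"))
```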
---
## I02 — Document the Django + kube-probe Host-header pattern
```task
id: RAILIANCE-WP-0004-I02
status: todo
priority: low
```
**Problem.** The kube-probe sends `Host: <pod-ip>:8000`. With
production Django settings (`DEBUG=False`, narrow `ALLOWED_HOSTS`),
that fails the Host validation and returns `HTTP 400 Bad Request`,
which the kubelet treats as Unhealthy. The first deploy revision kept
restarting on liveness failures for ~5 minutes before the cause was
diagnosed.
**Fix.** The `charts/vergabe-teilnahme` chart already sets
`httpGet.httpHeaders[Host]` from `probes.hostHeader`. Promote this
pattern into a documented "Django-on-Railiance" recipe (short doc in
`docs/`) so the next Django app starts there rather than rediscovering
the gotcha. Also worth a "common chart values" sketch if a second
Django app justifies the abstraction.
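
For the recipe, the probe shape in question can be sketched as a plain data structure. The `httpGet.httpHeaders` list of `name`/`value` pairs is the Kubernetes probe field named above; the helper name `http_probe`, the `/healthz` path, and the hostname `vergabe.example.org` are hypothetical placeholders. The Host value would need to be one that the app's `ALLOWED_HOSTS` accepts.

```python
import json


def http_probe(path: str, port: int, host_header: str) -> dict:
    """Build a Kubernetes httpGet probe that sends an explicit Host header."""
    return {
        "httpGet": {
            "path": path,
            "port": port,
            # Without this entry the kubelet sends "Host: <pod-ip>:<port>",
            # which a narrow ALLOWED_HOSTS rejects with HTTP 400.
            "httpHeaders": [{"name": "Host", "value": host_header}],
        }
    }


if __name__ == "__main__":
    # Hypothetical path and hostname, for illustration only.
    print(json.dumps(http_probe("/healthz", 8000, "vergabe.example.org"), indent=2))
```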
---
## I03 — Publish `issue-core` to a Gitea Python package registry
```task
id: RAILIANCE-WP-0004-I03
status: todo
priority: medium
```
**Problem.** `vergabe-teilnahme/pyproject.toml` has a path dependency
on `../issue-core`. Building the container image therefore requires
the `--build-context issue-core=/home/worsch/issue-core` BuildKit
flag, which is operator-machine-specific and breaks CI builds /
remote builds / other workstations.
**Fix.** Enable the Gitea Python package registry (analogous to the
container registry from `RAIL-AP-WP-0001`), publish `issue-core` as a
proper wheel with version, and switch the dep to
`issue-core>=0.2,<0.3` with a normal index URL. The Dockerfile then
drops the `--build-context` and the build becomes portable.
**Coordination:** depends on Gitea PyPI enablement in `railiance-apps`
(small Helm values change) and a release pipeline for `issue-core`
(separate repo).
---
## I04 — Operator onboarding: install the `kubectl cnpg` plugin
```task
id: RAILIANCE-WP-0004-I04
status: todo
priority: low
```
**Problem.** The `make vergabe-status`, `apps-pg-status`, and `db-shell`
targets try `kubectl cnpg ...` first and fall back to bare `kubectl` when the
plugin is missing. The fallback works but the cnpg plugin gives much
better cluster diagnostics (`status` table, primary/replica health,
backup state).
**Fix.** Add the plugin install command to operator onboarding (one
line: `kubectl krew install cnpg` or a direct binary download). Add
a `make check-tools` target that warns when `kubectl cnpg` or `helm`
is missing.
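
What a `check-tools` target could run is sketched below. It relies on the fact that kubectl discovers plugins as executables named `kubectl-<plugin>` on `PATH`, so looking up `kubectl-cnpg` is enough to tell whether `kubectl cnpg` will work. The names `REQUIRED_TOOLS` and `missing_tools` are hypothetical, and the script only warns, as the fix describes.

```python
import shutil
import sys

# Label shown to the operator -> executable that must be on PATH.
# kubectl plugins are plain executables named "kubectl-<plugin>".
REQUIRED_TOOLS = {
    "kubectl": "kubectl",
    "kubectl cnpg": "kubectl-cnpg",
    "helm": "helm",
}


def missing_tools(required=REQUIRED_TOOLS, which=shutil.which):
    """Return the labels of all tools whose executable is not on PATH."""
    return [label for label, exe in required.items() if which(exe) is None]


if __name__ == "__main__":
    for label in missing_tools():
        # Warn only; the Makefile fallbacks still work without the plugin.
        print(f"warning: '{label}' not found on PATH", file=sys.stderr)
```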
---
## I05 — Operator onboarding: SOPS / age key bootstrap
```task
id: RAILIANCE-WP-0004-I05
status: todo
priority: low
```
**Problem.** Several Makefile targets read `helm/*.sops.yaml` via
`sops -d`. A new operator with no `~/.config/sops/age/keys.txt`
sees a confusing decryption failure rather than a clear "you need
the age key" message. The session that produced this workplan had to
skip the SOPS template step for `apps-pg-secret.sops.yaml.template`.
**Fix.** Add a `docs/operator-setup.md` with the age key handoff
procedure (where to put the key, how to verify, how to rotate). A
`make check-sops` target that asserts the keys file exists and can
decrypt a known sentinel would catch this at the first deploy attempt
rather than at the failing apply.
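
A possible shape for the `check-sops` logic is sketched here. It assumes the Linux default key location quoted above, honours the `SOPS_AGE_KEY_FILE` override that SOPS supports, and uses the same `sops -d` invocation as the Makefile targets. The function names and the sentinel file are hypothetical.

```python
import os
import subprocess
from pathlib import Path


def age_key_path(env) -> Path:
    """Locate the age keys file: explicit override first, then the default."""
    override = env.get("SOPS_AGE_KEY_FILE")
    if override:
        return Path(override)
    home = env.get("HOME") or str(Path.home())
    return Path(home) / ".config" / "sops" / "age" / "keys.txt"


def check_sops(sentinel, env=None, run=subprocess.run):
    """Return a list of problems; an empty list means decryption works."""
    env = dict(os.environ) if env is None else dict(env)
    key_file = age_key_path(env)
    if not key_file.is_file():
        return [f"you need the age key: nothing found at {key_file}"]
    try:
        # Decrypt a known sentinel file (hypothetical) to prove the key fits.
        result = run(["sops", "-d", str(sentinel)],
                     capture_output=True, text=True, env=env)
    except FileNotFoundError:
        return ["sops is not installed or not on PATH"]
    if result.returncode != 0:
        return [f"age key found at {key_file} but it cannot decrypt {sentinel}"]
    return []
```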
---
## I06 — CI guard against stale committed manifests vs live CRD drift
```task
id: RAILIANCE-WP-0004-I06
status: todo
priority: medium
```
**Problem.** `helm/gitea-db-cluster.yaml` (in `railiance-platform`)
had `spec.postgresql.version: "16"` — a field that has never
existed in the CNPG v1 schema. The committed manifest had silently
diverged from the live cluster for months and would have been rejected
on the next `make db-deploy`. It was caught only by trying to apply a
new file that copied the same stale shape.
**Fix.** Add a per-PR CI job that runs
`kubectl apply --dry-run=server -f <changed-yaml>` against a
representative cluster (or a kind cluster seeded with the same CRDs).
The cnpg / cert-manager / Traefik CRDs change between operator
releases; strict server-side decoding catches drift that
`yamllint` and Helm template rendering miss.
**Note.** Primarily a `railiance-platform` and `railiance-cluster`
concern, but mirrored here because every S5 manifest in
`charts/` and `manifests/` carries the same risk.
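
The CI job's core loop might look like the sketch below. It builds one `kubectl apply --dry-run=server -f` command per changed YAML file and counts failures. Two filters are assumptions, not part of the fix as written: files under a `templates` directory are skipped because Helm templates would need rendering first, and `*.sops.yaml` files are skipped because they are encrypted. How the changed-file list is obtained (for example from `git diff --name-only`) is left to the pipeline.

```python
import subprocess
from pathlib import PurePosixPath


def dry_run_commands(changed_files):
    """Build one server-side dry-run command per plain YAML manifest."""
    commands = []
    for name in changed_files:
        path = PurePosixPath(name)
        if path.suffix not in {".yaml", ".yml"}:
            continue
        # Assumption: Helm templates and SOPS-encrypted files cannot be
        # applied as-is, so they are out of scope for this check.
        if "templates" in path.parts or name.endswith((".sops.yaml", ".sops.yml")):
            continue
        commands.append(["kubectl", "apply", "--dry-run=server", "-f", name])
    return commands


def run_all(commands, run=subprocess.run) -> int:
    """Run every command and return how many were rejected by the server."""
    return sum(1 for cmd in commands if run(cmd).returncode != 0)
```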
---
## I07 — `kubectl run --rm -i` smoke pattern is unreliable
```task
id: RAILIANCE-WP-0004-I07
status: todo
priority: low
```
**Problem.** Repeated false negatives when testing service-IP
connectivity with `kubectl run --rm -i …`: the smoke pod exits
before the connection completes, producing "Connection refused"
output even though the destination service was fully healthy. Wasted
significant debugging time during apps-pg verification before
switching to a persistent pod + `kubectl exec`.
**Fix.** Add a `docs/operator-recipes.md` note (or inline in the
runbook) recommending the persistent-pod-plus-exec pattern for any
service-IP smoke check. Optional: ship `tools/smoke.sh` that
wraps the pattern.
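
The persistent-pod-plus-exec pattern can be sketched as a fixed command sequence: start a pod that just sleeps, wait until it is Ready, run the check through `kubectl exec`, then delete the pod. The pod name, image, and check command passed in are hypothetical; the image must provide `sleep` and whatever the check runs. Cleanup always runs, even when a step fails.

```python
import subprocess


def smoke_commands(namespace, pod, image, check):
    """Return the kubectl steps: start pod, wait, exec the check, delete."""
    kubectl = ["kubectl", "-n", namespace]
    return [
        kubectl + ["run", pod, f"--image={image}", "--restart=Never",
                   "--command", "--", "sleep", "3600"],
        kubectl + ["wait", "--for=condition=Ready", f"pod/{pod}", "--timeout=60s"],
        kubectl + ["exec", pod, "--", *check],
        kubectl + ["delete", "pod", pod, "--wait=false"],
    ]


def run_smoke(commands, run=subprocess.run) -> bool:
    """Run the steps in order; always attempt the final cleanup step."""
    *steps, cleanup = commands
    try:
        for cmd in steps:
            if run(cmd).returncode != 0:
                return False
        return True
    finally:
        run(cleanup)


if __name__ == "__main__":
    # Hypothetical values: a postgres image so that pg_isready is available.
    for cmd in smoke_commands("apps", "smoke", "postgres:16",
                              ["pg_isready", "-h", "apps-pg-rw", "-p", "5432"]):
        print(" ".join(cmd))
```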
---
## Notes
- Items are individually `todo`; the workplan status is `backlog` so
they don't show up in active-workstream lists. Promote an item to
`active` (and its tasks to `in_progress`) when you pick it up.
- I06 is genuinely cross-repo; the others are local to
`railiance-apps` or its operator workflow.
- The first three items (I01, I02, I03) are the highest-leverage
for the second S5 app onboarding.