---
id: RAILIANCE-WP-0004
type: workplan
title: "App deployment improvements (lessons from RAILIANCE-WP-0002)"
domain: railiance
repo: railiance-apps
status: backlog
owner: railiance
topic_slug: railiance
planning_priority: medium
created: "2026-05-19"
updated: "2026-05-19"
---

# App deployment improvements

This workplan collects concrete follow-ups surfaced while shipping
`vergabe-teilnahme` under `RAILIANCE-WP-0002`. Each item is small,
independent, and can be picked up in isolation when the next S5 app
lands or when the next operator onboards. Status is `backlog`:
nothing here is blocking the live deployment.

## I01 — URL-encode DB passwords at Secret-build time

```task
id: RAILIANCE-WP-0004-I01
status: todo
priority: medium
```

**Problem.** cnpg-generated bootstrap passwords come from
`openssl rand -base64 N` and can contain `=`, `+`, and `/`. Embedded
raw in `DATABASE_URL`, those characters confuse `dj-database-url` (it
parsed `vergabe:<pw>@apps-pg-rw:5432/vergabe_db` as having an
80-character database name). Diagnosing this cost one Helm revision
and one pod restart.

**Fix.** Add a tiny helper (shell script or Makefile target) that
takes the raw role password from the cnpg secret and emits the
DSN-ready URL-encoded form into the consumer-namespace env Secret.
Alternative: switch to individual env vars (`POSTGRES_HOST`,
`POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB`) so no URL
parsing is needed at all.

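A minimal Python sketch of the encoding step, using only the standard
library. The `postgres://` scheme, the function name
`build_database_url`, and the sample password are assumptions for
illustration; the helper could equally be the shell script mentioned
above. The sketch relies on the consumer percent-decoding the password
when it parses the URL, which should be confirmed against the
`dj-database-url` version in use.

```python
from urllib.parse import quote, unquote, urlsplit


def build_database_url(user: str, raw_password: str, host: str, port: int, db: str) -> str:
    """Return a postgres:// DSN with user and password percent-encoded."""
    # safe="" also encodes "/", which quote() leaves untouched by default.
    enc_user = quote(user, safe="")
    enc_password = quote(raw_password, safe="")
    return f"postgres://{enc_user}:{enc_password}@{host}:{port}/{db}"


# Made-up password containing the three problem characters.
raw = "ab/cd+ef=="
url = build_database_url("vergabe", raw, "apps-pg-rw", 5432, "vergabe_db")
print(url)

# Round trip: the parsed pieces match what went in.
parts = urlsplit(url)
assert parts.hostname == "apps-pg-rw"
assert parts.port == 5432
assert parts.path == "/vergabe_db"
assert unquote(parts.password) == raw
```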

**Where it lives:** new `tools/` script + Makefile target, or chart
helper template.

---

## I02 — Document the Django + kube-probe Host-header pattern

```task
id: RAILIANCE-WP-0004-I02
status: todo
priority: low
```

**Problem.** The kube-probe sends `Host: <pod-ip>:8000`. With
production Django settings (`DEBUG=False`, narrow `ALLOWED_HOSTS`),
that fails the Host validation and returns `HTTP 400 Bad Request`,
which the kubelet treats as Unhealthy. The first deploy revision kept
restarting on liveness failure for ~5 minutes before the cause was
diagnosed.

**Fix.** The `charts/vergabe-teilnahme` chart already sets
`httpGet.httpHeaders[Host]` from `probes.hostHeader`. Promote this
pattern into a documented "Django-on-Railiance" recipe (short doc in
`docs/`) so the next Django app starts there rather than rediscovering
the gotcha. Also worth a "common chart values" sketch if a second
Django app justifies the abstraction.

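To make the failure mode concrete, here is a simplified Python model of
the Host check. It is not Django's implementation, only an
approximation of the `ALLOWED_HOSTS` matching rules (exact match, `*`,
and leading-dot subdomain match); the hostname `vergabe.example.org`
and the pod IP are made up. It shows why the default probe Host is
rejected and why a probe that sends an allowed Host passes.

```python
def host_allowed(host_header: str, allowed_hosts: list[str]) -> bool:
    """Simplified model of the ALLOWED_HOSTS check (not Django's own code)."""
    # Drop the port; this simple split does not handle IPv6 literals.
    host = host_header.split(":", 1)[0].lower()
    for pattern in allowed_hosts:
        pattern = pattern.lower()
        if pattern == "*" or pattern == host:
            return True
        # A leading dot matches the domain itself and any subdomain.
        if pattern.startswith(".") and (host == pattern[1:] or host.endswith(pattern)):
            return True
    return False


ALLOWED_HOSTS = ["vergabe.example.org"]  # hypothetical production value

# Default probe: the pod IP is used as Host, so the check fails.
print(host_allowed("10.42.0.17:8000", ALLOWED_HOSTS))  # False
# Probe with an explicit Host header: the check passes.
print(host_allowed("vergabe.example.org", ALLOWED_HOSTS))  # True
```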

---

## I03 — Publish `issue-core` to a Gitea Python package registry

```task
id: RAILIANCE-WP-0004-I03
status: todo
priority: medium
```

**Problem.** `vergabe-teilnahme/pyproject.toml` has a path dependency
on `../issue-core`. Building the container image therefore requires
the `--build-context issue-core=/home/worsch/issue-core` BuildKit
flag, which is operator-machine-specific and breaks CI builds,
remote builds, and builds on other workstations.

**Fix.** Enable the Gitea Python package registry (analogous to the
container registry from `RAIL-AP-WP-0001`), publish `issue-core` as a
proper versioned wheel, and switch the dependency to
`issue-core>=0.2,<0.3` with a normal index URL. The image build then
drops `--build-context` and becomes portable.

**Coordination:** depends on Gitea PyPI enablement in `railiance-apps`
(small Helm values change) and a release pipeline for `issue-core`
(separate repo).

---

## I04 — Operator onboarding: install the `kubectl cnpg` plugin

```task
id: RAILIANCE-WP-0004-I04
status: todo
priority: low
```

**Problem.** The `make vergabe-status`, `apps-pg-status`, and
`db-shell` targets try `kubectl cnpg ...` first and fall back to bare
`kubectl` when the plugin is missing. The fallback works, but the cnpg
plugin gives much better cluster diagnostics (`status` table,
primary/replica health, backup state).

**Fix.** Add the plugin install command to operator onboarding (one
line: `kubectl krew install cnpg` or a direct binary download). Add
a `make check-tools` target that warns when `kubectl cnpg` or `helm`
is missing.

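A possible shape for the check behind `make check-tools`, sketched in
Python. It relies on kubectl discovering plugins as executables named
`kubectl-<name>` on `PATH`, so the cnpg plugin is looked up as
`kubectl-cnpg`; the function name and the tool list are assumptions.
It only warns and never fails, matching the intent above.

```python
import shutil
from typing import Callable, Iterable, Optional

# kubectl discovers plugins as executables named kubectl-<name> on PATH,
# so "kubectl cnpg" is looked up as "kubectl-cnpg".
REQUIRED_TOOLS = ("kubectl", "kubectl-cnpg", "helm")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    # Warn only; a missing plugin should not block the fallback path.
    for tool in missing_tools():
        print(f"warning: {tool} not found on PATH")
```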

---

## I05 — Operator onboarding: SOPS / age key bootstrap

```task
id: RAILIANCE-WP-0004-I05
status: todo
priority: low
```

**Problem.** Several Makefile targets read `helm/*.sops.yaml` via
`sops -d`. A new operator with no `~/.config/sops/age/keys.txt`
sees a confusing decryption failure rather than a clear "you need
the age key" message. The session that produced this workplan had to
skip the SOPS template step for `apps-pg-secret.sops.yaml.template`.

**Fix.** Add a `docs/operator-setup.md` with the age key handoff
procedure (where to put the key, how to verify, how to rotate). A
`make check-sops` target that asserts the keys file exists and can
decrypt a known sentinel would catch this at the first deploy attempt
rather than at the failing apply.

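The check behind `make check-sops` could look roughly like this Python
sketch. The sentinel file passed in is hypothetical (no such file is
named in this workplan), `SOPS_AGE_KEY_FILE` is the environment
variable sops reads for a non-default key location, and the error
wording is illustrative. A caller would run
`check_sops(sentinel, age_key_file())` and print any returned problems.

```python
import os
import subprocess
from pathlib import Path

DEFAULT_KEY_FILE = Path.home() / ".config" / "sops" / "age" / "keys.txt"


def age_key_file() -> Path:
    """Key file location: SOPS_AGE_KEY_FILE if set, else the default path."""
    return Path(os.environ.get("SOPS_AGE_KEY_FILE") or DEFAULT_KEY_FILE)


def check_sops(sentinel: Path, key_file: Path) -> list[str]:
    """Return human-readable problems; an empty list means the check passed."""
    if not key_file.is_file():
        return [f"age key not found at {key_file}: you need the age key"]
    try:
        # Decrypt the sentinel; only the exit status matters here.
        result = subprocess.run(
            ["sops", "-d", str(sentinel)], capture_output=True, text=True
        )
    except FileNotFoundError:
        return ["sops is not installed or not on PATH"]
    if result.returncode != 0:
        return [f"cannot decrypt {sentinel}: {result.stderr.strip()}"]
    return []
```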

---

## I06 — CI guard against stale committed manifests vs live CRD drift

```task
id: RAILIANCE-WP-0004-I06
status: todo
priority: medium
```

**Problem.** `helm/gitea-db-cluster.yaml` (in `railiance-platform`)
had `spec.postgresql.version: "16"`, a field that has never
existed in the CNPG v1 schema. The committed manifest had silently
diverged from the live cluster for months and would have been
rejected on the next `make db-deploy`. It was caught only by trying
to apply a new file that copied the same stale shape.

**Fix.** Add a per-PR CI job that runs
`kubectl apply --dry-run=server -f <changed-yaml>` against a
representative cluster (or a kind cluster seeded with the same CRDs).
The cnpg / cert-manager / Traefik CRDs change between operator
releases; strict server-side decoding catches drift that
`yamllint` and Helm template rendering miss.

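A sketch of the CI step in Python, assuming the list of changed files
is supplied by the CI system (for example from `git diff --name-only`).
The selection rules are assumptions: SOPS-encrypted files are skipped,
and files under `charts/` are skipped because chart templates would
have to be rendered first (for example with `helm template`), which
this sketch does not cover. A CI entry point would call
`dry_run(select_manifests(changed))` and fail the job on a non-zero
result.

```python
import subprocess
from pathlib import PurePosixPath


def select_manifests(changed_paths: list[str]) -> list[str]:
    """Keep plain YAML manifests; skip SOPS-encrypted files and chart files."""
    selected = []
    for path in changed_paths:
        p = PurePosixPath(path)
        if p.suffix not in (".yaml", ".yml"):
            continue
        if p.name.endswith((".sops.yaml", ".sops.yml")):
            continue  # encrypted, not directly applicable
        if "charts" in p.parts:
            continue  # needs rendering first; out of scope for this sketch
        selected.append(path)
    return selected


def dry_run(paths: list[str]) -> int:
    """Server-side dry-run each manifest; return the number of rejections."""
    failures = 0
    for path in paths:
        result = subprocess.run(["kubectl", "apply", "--dry-run=server", "-f", path])
        if result.returncode != 0:
            failures += 1
    return failures
```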

**Note.** Primarily a `railiance-platform` and `railiance-cluster`
concern, but mirrored here because every S5 manifest in
`charts/` and `manifests/` carries the same risk.

---

## I07 — `kubectl run --rm -i` smoke pattern is unreliable

```task
id: RAILIANCE-WP-0004-I07
status: todo
priority: low
```

**Problem.** Repeated false negatives when testing service-IP
connectivity with `kubectl run --rm -i …`: the smoke pod exits
before the connection completes, producing "Connection refused"
output even though the destination service was fully healthy. This
wasted significant debugging time during apps-pg verification before
the switch to a persistent pod + `kubectl exec`.

**Fix.** Add a `docs/operator-recipes.md` note (or inline in the
runbook) recommending the persistent-pod-plus-exec pattern for any
service-IP smoke check. Optional: ship `tools/smoke.sh` that
wraps the pattern.

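The persistent-pod-plus-exec pattern, sketched as a small Python
wrapper around `kubectl`. The pod name, the image, and the default
namespace are placeholders; the image only needs to contain the tool
used for the check. With a Postgres client image, a call such as
`run_smoke(["pg_isready", "-h", "apps-pg-rw", "-p", "5432"])` would
probe the service from inside the cluster.

```python
import subprocess

POD = "smoke"          # hypothetical pod name
IMAGE = "postgres:16"  # hypothetical image that ships pg_isready


def smoke_commands(check: list[str], namespace: str = "default") -> list[list[str]]:
    """kubectl invocations for the persistent-pod-plus-exec pattern, in order."""
    ns = ["-n", namespace]
    return [
        # 1. Start a pod that stays up instead of exiting with the check.
        ["kubectl", "run", POD, *ns, f"--image={IMAGE}", "--restart=Never",
         "--", "sleep", "3600"],
        # 2. Wait until it is Ready.
        ["kubectl", "wait", *ns, "--for=condition=Ready", f"pod/{POD}", "--timeout=60s"],
        # 3. Run the actual check inside the running pod.
        ["kubectl", "exec", *ns, POD, "--", *check],
        # 4. Clean up.
        ["kubectl", "delete", "pod", *ns, POD, "--wait=false"],
    ]


def run_smoke(check: list[str], namespace: str = "default") -> bool:
    """Run the four steps; True when the check command exits with 0."""
    start, wait, exec_check, cleanup = smoke_commands(check, namespace)
    try:
        subprocess.run(start, check=True)
        subprocess.run(wait, check=True)
        return subprocess.run(exec_check).returncode == 0
    finally:
        # Always remove the pod, even when start or wait failed.
        subprocess.run(cleanup)
```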

---

## Notes

- Items are individually `todo`; the workplan status is `backlog` so
  they don't show up in active-workstream lists. Promote an item to
  `active` (and its tasks to `in_progress`) when you pick it up.
- I06 is genuinely cross-repo; the others are local to
  `railiance-apps` or its operator workflow.
- The first three items (I01, I02, I03) are the highest-leverage
  for the second S5 app onboarding.