railiance-apps/workplans/railiance-apps-WP-0004-app-deployment-improvements.md
2026-05-23 06:43:30 +02:00


id: RAILIANCE-WP-0004
type: workplan
title: App deployment improvements (lessons from RAILIANCE-WP-0002)
domain: railiance
repo: railiance-apps
status: active
owner: railiance
topic_slug: railiance
planning_priority: medium
created: 2026-05-19
updated: 2026-05-23
state_hub_workstream_id: b61a9aca-4e43-4b3d-a48b-999e0fa842cf

App deployment improvements

This workplan collects concrete follow-ups surfaced while shipping vergabe-teilnahme under RAILIANCE-WP-0002. Each item is small, independent, and can be picked up in isolation when the next S5 app lands or when the next operator onboards. Activated on 2026-05-22; local railiance-apps guardrails are implemented, with the package publication item blocked on Gitea package publish credentials.

I01 — URL-encode DB passwords at Secret-build time

id: RAILIANCE-WP-0004-I01
status: done
priority: medium
state_hub_task_id: "a05a855a-00a0-4e0e-ba82-27e0a072f777"

Problem. cnpg-generated bootstrap passwords come from openssl rand -base64 N and can contain =, +, and /. Embedded raw in DATABASE_URL, those characters confuse dj-database-url (it parsed vergabe:<pw>@apps-pg-rw:5432/vergabe_db as having an 80-character database name). Diagnosing this cost one Helm revision and one pod restart.

Fix. Add a tiny helper (shell script or Makefile target) that takes the raw role password from the cnpg secret and emits the DSN-ready URL-encoded form into the consumer-namespace env Secret. Alternative: switch to individual env vars (POSTGRES_HOST, POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB) so no URL parsing is needed at all.

Where it lives: new tools/ script + Makefile target, or chart helper template.

Implemented 2026-05-22. Added tools/build-database-url-secret.sh and make vergabe-db-url-secret; updated the app runbook to use the helper during DB password rotation.
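
The encoding step from the Fix can be sketched in Python as follows. This is a minimal illustration, not the contents of tools/build-database-url-secret.sh: the function name build_database_url and the sample password are made up, and urlsplit from the standard library stands in for dj-database-url, whose parsing may differ in detail.

```python
from urllib.parse import quote, urlsplit


def build_database_url(user: str, password: str, host: str, port: int, dbname: str) -> str:
    """Return a PostgreSQL DSN with user and password percent-encoded (hypothetical helper)."""
    # safe="" forces "/" to be encoded too; quote() leaves it alone by default.
    return (
        f"postgres://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{dbname}"
    )


# Made-up password containing the three characters named in the Problem.
raw_password = "ab/cd+ef=="

raw_url = f"postgres://vergabe:{raw_password}@apps-pg-rw:5432/vergabe_db"
safe_url = build_database_url("vergabe", raw_password, "apps-pg-rw", 5432, "vergabe_db")

# Raw: the "/" in the password ends the authority part early, so the rest of
# the password, the host, and the port all end up in the path.
print(urlsplit(raw_url).path)

# Encoded: the path is just the database name again.
print(urlsplit(safe_url).path)
print(safe_url)
```

The alternative named in the Fix (separate POSTGRES_* variables) avoids this step entirely, since nothing is parsed as a URL.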


I02 — Document the Django + kube-probe Host-header pattern

id: RAILIANCE-WP-0004-I02
status: done
priority: low
state_hub_task_id: "22a212e6-31b1-490a-8d1c-0a33ddc62501"

Problem. The kube-probe sends Host: <pod-ip>:8000. With production Django settings (DEBUG=False, narrow ALLOWED_HOSTS), that fails the Host validation and returns HTTP 400 Bad Request, which the kubelet treats as Unhealthy. First deploy revision restarted on liveness failure for ~5 minutes before diagnosis.

Fix. The charts/vergabe-teilnahme chart already sets httpGet.httpHeaders[Host] from probes.hostHeader. Promote this pattern into a documented "Django-on-Railiance" recipe (short doc in docs/) so the next Django app starts there rather than rediscovering the gotcha. Also worth a "common chart values" sketch if a second Django app justifies the abstraction.

Implemented 2026-05-22. Added docs/django-on-railiance.md and cross-linked it from the vergabe-teilnahme runbook.
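
To illustrate why the default probe fails and the overridden one passes, here is a rough Python sketch. host_allowed is a simplified stand-in for Django's Host validation (names and IPv4 only), not Django's own code; the ALLOWED_HOSTS value, the pod IP, and the /healthz/ path are assumptions for illustration. http_probe builds the httpGet structure that the chart renders from probes.hostHeader.

```python
def host_allowed(host_header: str, allowed_hosts: list[str]) -> bool:
    """Simplified stand-in for Django's Host check (IPv4 and names only)."""
    host = host_header.rsplit(":", 1)[0].lower()  # drop an optional ":port"
    for pattern in allowed_hosts:
        pattern = pattern.lower()
        if pattern == "*" or pattern == host:
            return True
        # A leading dot matches the bare domain and any subdomain.
        if pattern.startswith(".") and (host.endswith(pattern) or host == pattern[1:]):
            return True
    return False


def http_probe(path: str, port: int, host_header: str) -> dict:
    """Build a Kubernetes httpGet probe that sends an explicit Host header."""
    return {
        "httpGet": {
            "path": path,
            "port": port,
            "httpHeaders": [{"name": "Host", "value": host_header}],
        }
    }


ALLOWED_HOSTS = ["vergabe.example.org"]  # hypothetical production value

# Default probe behaviour: Host is <pod-ip>:8000, which is not in ALLOWED_HOSTS.
print(host_allowed("10.42.0.17:8000", ALLOWED_HOSTS))

# With the Host header overridden to an allowed name, validation passes.
print(host_allowed("vergabe.example.org", ALLOWED_HOSTS))
print(http_probe("/healthz/", 8000, "vergabe.example.org"))
```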


I03 — Publish issue-core to a Gitea Python package registry

id: RAILIANCE-WP-0004-I03
status: blocked
priority: medium
state_hub_task_id: "f412b874-0670-4a4a-89fc-575fe4994646"

Problem. vergabe-teilnahme/pyproject.toml has a path dependency on ../issue-core. Building the container image therefore requires the --build-context issue-core=/home/worsch/issue-core BuildKit flag, which is operator-machine-specific and breaks CI builds / remote builds / other workstations.

Fix. Enable the Gitea Python package registry (analogous to the container registry from RAIL-AP-WP-0001), publish issue-core as a proper wheel with version, and switch the dep to issue-core>=0.2,<0.3 with a normal index URL. The Dockerfile then drops the --build-context and the build becomes portable.

Coordination: depends on Gitea PyPI enablement in railiance-apps (small Helm values change) and a release pipeline for issue-core (separate repo).

Local progress 2026-05-22. helm/gitea-registry-values.yaml now sets packages.LIMIT_SIZE_PYPI: -1, and docs/gitea-package-registry.md documents the Gitea PyPI endpoint plus the issue-core migration. The remaining release and dependency change must happen in the issue-core and vergabe-teilnahme repos.

Cross-repo progress 2026-05-23. issue-core now has a validated make package-check build and a Gitea Actions publish workflow for the 0.2.x package series. vergabe-teilnahme's pyproject.toml now depends on issue-core>=0.2,<0.3, and the named Docker build context issue-core has been removed in favor of the Gitea PyPI index. The final unblock still requires a Gitea package username/token to publish issue-core==0.2.0; once published, regenerate vergabe-teilnahme/uv.lock from the registry and mark this task done.


I04 — Operator onboarding: install the kubectl cnpg plugin

id: RAILIANCE-WP-0004-I04
status: done
priority: low
state_hub_task_id: "2f44cad1-b70c-4406-91a9-0c0fa9c75583"

Problem. make vergabe-status, apps-pg-status, db-shell use kubectl cnpg ... first and fall back to bare kubectl when the plugin is missing. The fallback works but the cnpg plugin gives much better cluster diagnostics (status table, primary/replica health, backup state).

Fix. Add the plugin install command to operator onboarding (one line: kubectl krew install cnpg or a direct binary download). Add a make check-tools target that warns when kubectl cnpg or helm is missing.

Implemented 2026-05-22. Added make check-tools, docs/operator-setup.md, and cnpg fallback status output for Gitea and the shared apps-pg cluster.
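
The kind of check that make check-tools performs could look roughly like this in Python. It is a sketch under assumptions, not the actual target: the names REQUIRED_TOOLS and missing_tools are hypothetical. It relies on the fact that kubectl discovers plugins as executables named kubectl-<name> on PATH, so kubectl cnpg is available when kubectl-cnpg resolves.

```python
import shutil
from typing import Callable, Optional

# "kubectl cnpg" works when an executable called "kubectl-cnpg" is on PATH
# (krew installs plugins under its own bin directory, which must be on PATH).
REQUIRED_TOOLS = ["kubectl", "helm", "kubectl-cnpg"]


def missing_tools(
    tools: list[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools that cannot be found on PATH, in the order given."""
    return [tool for tool in tools if which(tool) is None]


# Warn only, as the Fix describes; a missing plugin still has a fallback.
for tool in missing_tools(REQUIRED_TOOLS):
    print(f"warning: {tool} not found on PATH")
```

The which parameter is injectable only so the lookup can be exercised without touching the real PATH.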


I05 — Operator onboarding: SOPS / age key bootstrap

id: RAILIANCE-WP-0004-I05
status: done
priority: low
state_hub_task_id: "741d8a73-8cb0-40ac-a218-f1d3a74ebef3"

Problem. Several Makefile targets read helm/*.sops.yaml via sops -d. A new operator with no ~/.config/sops/age/keys.txt sees a confusing decryption failure rather than a clear "you need the age key" message. The session that produced this workplan had to skip the SOPS template step for apps-pg-secret.sops.yaml.template.

Fix. Add a docs/operator-setup.md with the age key handoff procedure (where to put the key, how to verify, how to rotate). A make check-sops target that asserts the keys file exists and can decrypt a known sentinel would catch this at the first deploy attempt rather than at the failing apply.

Implemented 2026-05-22. Added docs/operator-setup.md, tools/check-sops.sh, and make check-sops using helm/gitea-values.sops.yaml as the sentinel by default.
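
The two assertions described in the Fix (keys file exists, sentinel decrypts) can be sketched as below. This is an illustration, not tools/check-sops.sh: the function names and message wording are hypothetical. SOPS_AGE_KEY_FILE is the environment variable sops reads for a non-default key location, and the default path shown is the one named in the Problem.

```python
import os
import subprocess
from pathlib import Path


def default_keys_file() -> Path:
    """Key location: SOPS_AGE_KEY_FILE if set, else the path named in the Problem."""
    override = os.environ.get("SOPS_AGE_KEY_FILE")
    return Path(override) if override else Path.home() / ".config/sops/age/keys.txt"


def check_sops(keys_file: Path, sentinel: Path, run=subprocess.run) -> list[str]:
    """Return human-readable problems; an empty list means the check passed."""
    if not keys_file.is_file():
        return [f"you need the age key: nothing found at {keys_file}"]
    try:
        result = run(["sops", "-d", str(sentinel)], capture_output=True, text=True)
    except FileNotFoundError:
        return ["sops is not installed or not on PATH"]
    if result.returncode != 0:
        return [f"age key found, but it cannot decrypt {sentinel}"]
    return []


# Sentinel default taken from the implementation note above.
for problem in check_sops(default_keys_file(), Path("helm/gitea-values.sops.yaml")):
    print(f"check-sops: {problem}")
```

Checking for the file first is what turns the confusing decryption failure into a clear message about the missing key.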


I06 — CI guard against stale committed manifests vs live CRD drift

id: RAILIANCE-WP-0004-I06
status: done
priority: medium
state_hub_task_id: "a319c20b-993c-46b7-889a-f0ac738056c4"

Problem. helm/gitea-db-cluster.yaml (in railiance-platform) had spec.postgresql.version: "16", a field that has never existed in the CNPG v1 schema. The committed manifest had silently diverged from the live cluster for months and would have been rejected on the next make db-deploy. It was caught only when applying a new file that copied the same stale shape.

Fix. Add a per-PR CI job that runs kubectl apply --dry-run=server -f <changed-yaml> against a representative cluster (or a kind cluster seeded with the same CRDs). The cnpg / cert-manager / Traefik CRDs change between operator releases; strict server-side decoding catches drift that yamllint and Helm template rendering miss.

Note. Primarily a railiance-platform and railiance-cluster concern, but mirrored here because every S5 manifest in charts/ and manifests/ carries the same risk.

Implemented 2026-05-22. Added tools/k8s-server-dry-run.sh, make k8s-server-dry-run, and a .gitea/workflows/ PR workflow that runs the guard when charts, Helm values, manifests, or the dry-run tool change.
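
The core loop of such a guard might look like the following Python sketch. It is not tools/k8s-server-dry-run.sh: the function names are hypothetical, and the filter is deliberately naive. A real guard would likely need to render chart templates (for example with helm template) before applying them and skip YAML that is not a Kubernetes manifest, such as Helm values or SOPS-encrypted files.

```python
import subprocess
from pathlib import Path


def dry_run_command(manifest: str) -> list[str]:
    """Server-side dry run: the API server validates against the installed CRDs."""
    return ["kubectl", "apply", "--dry-run=server", "-f", manifest]


def yaml_manifests(changed_paths: list[str]) -> list[str]:
    """Keep only YAML files from a list of changed paths (naive suffix filter)."""
    return [p for p in changed_paths if Path(p).suffix in {".yaml", ".yml"}]


def failed_manifests(changed_paths: list[str], run=subprocess.run) -> list[str]:
    """Run the dry run for each changed manifest and collect the rejected ones."""
    failed = []
    for manifest in yaml_manifests(changed_paths):
        if run(dry_run_command(manifest)).returncode != 0:
            failed.append(manifest)
    return failed
```

In a PR job, changed_paths would come from the diff against the target branch, and a non-empty result from failed_manifests would fail the job.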


I07 — kubectl run --rm -i smoke pattern is unreliable

id: RAILIANCE-WP-0004-I07
status: done
priority: low
state_hub_task_id: "e3f59b3d-95c8-4cf9-9943-b1597954fd77"

Problem. Repeated false negatives when testing service-IP connectivity with kubectl run --rm -i …: the smoke pod exits before the connection completes, producing "Connection refused" output even though the destination service was fully healthy. Wasted significant debugging time during apps-pg verification before switching to a persistent pod + kubectl exec.

Fix. Add a docs/operator-recipes.md note (or inline in the runbook) recommending the persistent-pod-plus-exec pattern for any service-IP smoke check. Optional: ship tools/smoke.sh that wraps the pattern.

Implemented 2026-05-22. Added docs/operator-recipes.md and tools/smoke-service.sh.
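
The persistent-pod-plus-exec pattern amounts to four kubectl calls, sketched here as a Python function that only builds the command lines. It is an illustration rather than tools/smoke-service.sh: the pod name, namespace, image tag, and sleep duration are assumptions, and pg_isready is used as the check because the original case was apps-pg verification.

```python
def smoke_commands(pod: str, namespace: str, image: str, check: list[str]) -> list[list[str]]:
    """Return the kubectl calls for a persistent-pod smoke check, in order."""
    kubectl = ["kubectl", "-n", namespace]
    return [
        # 1. Start a pod that stays alive, instead of one that exits with the check.
        kubectl + ["run", pod, f"--image={image}", "--restart=Never",
                   "--command", "--", "sleep", "600"],
        # 2. Wait until it is Ready so the check cannot race pod startup.
        kubectl + ["wait", "--for=condition=Ready", f"pod/{pod}", "--timeout=60s"],
        # 3. Run the actual check inside the running pod; repeat as often as needed.
        kubectl + ["exec", pod, "--"] + check,
        # 4. Clean up explicitly, since --rm is no longer doing it.
        kubectl + ["delete", "pod", pod],
    ]


# Hypothetical values: namespace "apps" and image "postgres:16" are assumptions.
commands = smoke_commands(
    pod="smoke",
    namespace="apps",
    image="postgres:16",
    check=["pg_isready", "-h", "apps-pg-rw", "-p", "5432"],
)
for command in commands:
    print(" ".join(command))
```

Because the pod outlives any single check, a "Connection refused" from step 3 reflects the service rather than the smoke pod's own lifecycle.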


Notes

  • Items were activated on 2026-05-22. Local railiance-apps pieces are complete except I03, which is blocked on Gitea package publish credentials.
  • I06 is genuinely cross-repo; the others are local to railiance-apps or its operator workflow.
  • The first three items (I01, I02, I03) are the highest-leverage for the second S5 app onboarding.