# vergabe-teilnahme: operator runbook

Production deployment of the Django tender-management app, shipped
under `RAILIANCE-WP-0002`.

## Identity

| Item | Value |
|---|---|
| Public URL | https://vergabe-teilnahme.whywhynot.de |
| Namespace | `vergabe-teilnahme` |
| Helm release | `vergabe-teilnahme` |
| Chart | `charts/vergabe-teilnahme/` |
| Values | `helm/vergabe-teilnahme-values.yaml` (plain, no SOPS) |
| Ingress | `manifests/vergabe-teilnahme-ingress.yaml` |
| Image | `gitea.coulomb.social/coulomb/vergabe-teilnahme:<tag>` |
| Database | `vergabe_db` on the shared CNPG cluster `apps-pg` (see `railiance-platform/docs/apps-pg.md`) |
| TLS | `vergabe-teilnahme-tls`, issued by cert-manager `letsencrypt-prod` |

## Secrets

Two K8s Secrets in the `vergabe-teilnahme` namespace:

| Secret | Type | Source of truth | Used for |
|--------|------|-----------------|----------|
| `vergabe-app-credentials` | `kubernetes.io/basic-auth` | mirror of `databases/vergabe-app-credentials` (cnpg-owned) | raw DB role credential |
| `vergabe-teilnahme-env` | `Opaque` | created by operator | `SECRET_KEY` + URL-encoded `DATABASE_URL` (envFrom on the Deployment) |

**No SOPS encryption** for this app: all sensitive material lives in
K8s Secrets, not in committed values files.

### Rotating the DB password

1. Have `railiance-platform` rotate the cnpg-managed Secret
   (`databases/vergabe-app-credentials`).
2. Mirror the new password into `vergabe-teilnahme/vergabe-app-credentials`.
3. Rebuild `DATABASE_URL` in `vergabe-teilnahme-env`, **URL-encoding
   the password**. Base64-style passwords can contain `/`, `+`, and `=`,
   which break the URL parser if left unencoded (see
   `RAILIANCE-WP-0004 I01`):

   ```bash
   make vergabe-db-url-secret
   kubectl rollout restart deploy/vergabe-teilnahme -n vergabe-teilnahme
   ```

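The internals of `make vergabe-db-url-secret` are not shown in this runbook. As a rough sketch of what URL-encoding the password involves, the POSIX shell helpers below percent-encode a password and assemble a `DATABASE_URL`. The function names, the `vergabe_app` role, and the `apps-pg-rw.databases` host in the usage comment are assumptions for illustration, not values taken from the Makefile.

```bash
# Hypothetical helpers; not the actual Makefile implementation.

# Percent-encode every character outside the RFC 3986 unreserved set.
# The body runs in a subshell "( ... )" so its variables do not leak.
# Assumes ASCII input, which holds for base64-style passwords.
urlencode() (
  LC_ALL=C
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}        # everything after the first character
    c=${s%"$rest"}     # the first character itself
    case $c in
      [a-zA-Z0-9.~_-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;
    esac
    s=$rest
  done
  printf '%s\n' "$out"
)

# Assemble a postgres:// URL with an encoded password.
# usage: build_database_url <user> <password> <host> <port> <dbname>
build_database_url() {
  printf 'postgres://%s:%s@%s:%s/%s\n' \
    "$1" "$(urlencode "$2")" "$3" "$4" "$5"
}

# Example (role name and host are assumptions):
#   build_database_url vergabe_app "$PGPASSWORD" apps-pg-rw.databases 5432 vergabe_db
```

Encoding everything outside the unreserved set is deliberately conservative: it is always safe for the password part of a URL, at the cost of encoding a few characters that would have been legal unencoded.
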
### Rotating `SECRET_KEY`

Rotating the Django `SECRET_KEY` invalidates active sessions but
otherwise causes no downtime:

```bash
# Generate a random key; map '/', '+', '=' to letters so it embeds safely in the JSON patch.
NEW=$(openssl rand -base64 50 | tr -d '\n' | tr '/+=' 'abc')
# Write the new key into the Secret (stringData takes the plain-text value).
kubectl patch secret vergabe-teilnahme-env -n vergabe-teilnahme \
  --type=merge -p "{\"stringData\":{\"SECRET_KEY\":\"$NEW\"}}"
# Restart the pods so they pick up the new value via envFrom.
kubectl rollout restart deploy/vergabe-teilnahme -n vergabe-teilnahme
```

## Day-to-day commands

```bash
make vergabe-status     # pods, svc, ingress, certificate
make vergabe-logs       # tail app logs
make vergabe-dry-run    # helm template render (audit values)
make vergabe-deploy     # helm upgrade --install (idempotent)
make vergabe-migrate    # manage.py migrate against live deploy
make vergabe-seed       # seed_dev: DEV ONLY, creates max.muster/testpass123 (do not run in prod)
make vergabe-superuser  # interactive createsuperuser
```

## Promoting a new image tag

1. Build and push the image from the `vergabe-teilnahme` repo using the
   portable package path: `issue-core` must resolve from the Gitea PyPI
   registry, not from a sibling checkout. If `issue-core==0.2.0` is not
   published yet, keep `railiance-apps-WP-0004 I03` in `wait`.
2. Update `image.tag` in `helm/vergabe-teilnahme-values.yaml` to the
   new git SHA.
3. Run `make vergabe-deploy`. Helm rolls out a new ReplicaSet with
   zero downtime (`maxSurge: 1, maxUnavailable: 0`).
4. Verify via `make vergabe-status` and an HTTPS probe.
5. If migrations are needed, run `make vergabe-migrate` after the
   rollout completes.

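Updating `image.tag` is a one-line edit, but it is easy to script. The sketch below assumes the values file contains exactly one indented `tag:` key (under `image:`); the function name `set_image_tag` is hypothetical and not part of the repo's Makefile. It writes the tag as a quoted string so that YAML cannot read an all-digit short SHA such as `1234567` as an integer.

```bash
# Hypothetical helper: set the (single) indented `tag:` key in a values file.
# usage: set_image_tag <values-file> <tag>
set_image_tag() (
  file=$1
  tag=$2
  tmp=$(mktemp) || exit 1
  # Keep the key and its indentation, replace the value with a quoted tag.
  sed "s|^\(  *tag:\).*|\1 \"$tag\"|" "$file" > "$tmp" || exit 1
  # Overwrite in place (keeps the original file's permissions), then clean up.
  cat "$tmp" > "$file"
  rm -f "$tmp"
)

# Example:
#   set_image_tag helm/vergabe-teilnahme-values.yaml <new-git-sha>
#   make vergabe-dry-run   # review the rendered manifests before deploying
```

If the values file ever gains a second `tag:` key, this pattern would rewrite both, so a YAML-aware tool would be the safer choice at that point.
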
## Rollback

```bash
helm history vergabe-teilnahme -n vergabe-teilnahme
helm rollback vergabe-teilnahme <REVISION> -n vergabe-teilnahme
```

Rollback does **not** unwind DB migrations. For any rollback that
crosses a migration boundary, plan an explicit
`manage.py migrate <app> <name>` reverse step. Run it while the newer
release is still deployed, because Django can only unapply a migration
whose file is present in the running code.

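As a sketch of the `manage.py migrate <app> <name>` reverse step, the helper below wraps the command in `kubectl exec`. It assumes the container's working directory contains `manage.py` and that `python` is on the `PATH`; neither is confirmed by this runbook, and the function name is hypothetical.

```bash
# Hypothetical helper: migrate one app back to a named migration.
# usage: reverse_migration <app_label> <target_migration>
# <target_migration> is the last migration the OLDER release knows about;
# everything applied after it is unapplied.
reverse_migration() {
  kubectl exec -n vergabe-teilnahme deploy/vergabe-teilnahme -- \
    python manage.py migrate "$1" "$2"
}

# Example (placeholders), run BEFORE `helm rollback` so the newer
# migration files are still present in the running image:
#   reverse_migration <app> <name>
```

Running `python manage.py showmigrations <app>` the same way lists which migrations are currently applied, which helps pick the target name.
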
## Troubleshooting

### Pod stuck `Running` 0/1, kube-probe failing

Most likely the probe's `Host` header doesn't match `ALLOWED_HOSTS`.
The chart sets `probes.hostHeader: vergabe-teilnahme.whywhynot.de`
precisely to avoid this, so if you change `ALLOWED_HOSTS` in values,
also update `probes.hostHeader`. Symptom in `kubectl logs`: kube-probe
requests returning HTTP 400. See `docs/django-on-railiance.md` for the
reusable pattern.

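To reproduce the symptom outside the kubelet, one option is to port-forward to the pod and send the same request with different `Host` headers. The sketch below assumes the app listens on port 8000 and that `/` is an acceptable probe path; both are guesses, so substitute the chart's real values. `probe_with_host` is a hypothetical name.

```bash
# Hypothetical helper: print the HTTP status for a request sent with a given Host header.
# usage: probe_with_host <base-url> <host-header> [path]
probe_with_host() {
  curl -sS -o /dev/null -w '%{http_code}\n' -H "Host: $2" "$1${3:-/}"
}

# Example (port 8000 is an assumption):
#   kubectl port-forward -n vergabe-teilnahme deploy/vergabe-teilnahme 8000:8000 &
#   probe_with_host http://127.0.0.1:8000 vergabe-teilnahme.whywhynot.de   # should not be 400
#   probe_with_host http://127.0.0.1:8000 not-allowed.example              # Django answers 400
```

A 400 for the second call and anything else for the first suggests `ALLOWED_HOSTS` is fine and the probe's header is the part that drifted.
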
### `dj-database-url` error: "The database name 'XYZ...' is longer than 63 characters"

The password in `DATABASE_URL` isn't URL-encoded. See the rotation
recipe above. Tracked in `RAILIANCE-WP-0004 I01`.

### Cert-manager: cert stuck in `False`

Check the Order/Challenge resources:
```bash
kubectl get order,challenge -n vergabe-teilnahme
kubectl describe challenge -n vergabe-teilnahme
```
Common causes: DNS not yet propagated to all resolvers, Let's
Encrypt rate limits, or the ingress controller not forwarding
`/.well-known/acme-challenge/` requests.

### `make vergabe-status` shows certificate `False`

The chart leaves the certificate lifecycle to cert-manager. If a
renewal fails, the existing certificate stays in the
`vergabe-teilnahme-tls` Secret and keeps being served until it expires.
Investigate with
`kubectl describe certificate vergabe-teilnahme-tls -n vergabe-teilnahme`.

## Data durability and restore readiness

`vergabe_db` lives on the shared `apps-pg` CNPG cluster owned by
`railiance-platform`. S5 owns the app release runbook and post-restore app
checks; platform owns the database backup and restore mechanism.

Current status: `apps-pg` backup coverage is still platform follow-up work, so
`vergabe-teilnahme` should not be treated as production-critical data until the
gate in `docs/app-data-backup-restore-handoff.md` is satisfied.

A manual logical dump is a break-glass or inspection option, not the durable
backup contract:

```bash
kubectl exec -n databases apps-pg-1 -- pg_dump -U postgres -Fc vergabe_db > "vergabe_db-$(date +%F).dump"
```

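Because the shell redirect creates the local file even when `kubectl exec` fails, a quick sanity check on the result can be worthwhile. The sketch below relies on the fact that `pg_dump -Fc` archives begin with the bytes `PGDMP`; the function name `is_pg_custom_dump` is hypothetical.

```bash
# Hypothetical helper: succeed only if <file> is non-empty and starts with the
# custom-format magic bytes "PGDMP".
# usage: is_pg_custom_dump <file>
is_pg_custom_dump() {
  [ -s "$1" ] && [ "$(dd if="$1" bs=5 count=1 2>/dev/null)" = "PGDMP" ]
}

# Example:
#   is_pg_custom_dump "vergabe_db-$(date +%F).dump" || echo "dump looks broken" >&2
```

This only inspects the header. Where a matching `pg_restore` is installed, `pg_restore --list <file>` reads the archive's table of contents and is the more thorough check.
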
Before promotion beyond smoke or development use, record platform backup
evidence, an isolated restore drill, migration result, health check, HTTPS
smoke check, and representative app workflow verification.

## Deferred for v1

- Multi-replica HA (`replicaCount: 1`).
- Media-upload PVC (`persistence.media.enabled: false`; Django
  `MEDIA_ROOT` is in-pod ephemeral).
- 3-stage canary (the Staged Promotion Lifecycle workstream is still
  0/7).
- SSO / Keycloak integration (Django built-in auth only).
- Celery + Redis workers.

## Cross-references

- Workplan: `workplans/railiance-apps-WP-0002-vergabe-teilnahme-on-railiance01.md`
- Improvements backlog: `workplans/railiance-apps-WP-0004-app-deployment-improvements.md`
- Shared DB cluster: `railiance-platform/docs/apps-pg.md`
- Container registry: `/home/worsch/railiance-forge/docs/gitea-container-registry.md`
- Python package registry: `/home/worsch/railiance-forge/docs/gitea-package-registry.md`
- S5 app onboarding checklist: `docs/s5-app-onboarding-checklist.md`
- App data backup handoff: `docs/app-data-backup-restore-handoff.md`
- Manifest dry-run prerequisites: `docs/manifest-server-dry-run.md`
- Django deployment recipe: `docs/django-on-railiance.md`
- Operator setup: `docs/operator-setup.md`
- Operator recipes: `docs/operator-recipes.md`
- App source: https://gitea.coulomb.social/coulomb/vergabe-teilnahme