
tegwick 01d280120d feat(platform): T01 — standalone PostgreSQL HA chart scaffold

Lays out the S3 platform layer foundation for RAIL-PL-WP-0001 T01:

- .sops.yaml: age encryption policy (shared key, *.sops.yaml pattern)
- .gitignore: prevents accidental commit of decrypted values files
- Makefile: pg-deploy, pg-status, pg-pgpool-check, valkey-deploy,
  valkey-status, backup targets with KUBECONFIG/HELM wiring
- helm/postgresql-ha-values.yaml.template: annotated values schema
  with CHANGEME_ placeholders; includes pgpool-password fix from
  RAIL-BS-WP-0003; notes on single-node vs ThreePhoenix scaling
- docs/postgresql-ha.md: connection strings, DB creation, password
  rotation, pgpool-password critical note, HA failover test ref,
  ThreePhoenix scaling path

To complete T01: fill in CHANGEME_ values, encrypt with sops -e -i,
then run make pg-deploy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-11 02:17:55 +01:00

3.9 KiB


PostgreSQL HA — Platform Service

Chart: bitnami/postgresql-ha
Namespace: platform
Managed by: railiance-platform (S3)
Workplan: RAIL-PL-WP-0001

Architecture

Apps (S5)
  └── pgpool (load balancer / connection pooler)
        ├── postgresql-0  [Primary  — repmgr]
        ├── postgresql-1  [Standby  — repmgr]
        └── postgresql-2  [Standby  — repmgr]

pgpool-II distributes reads across the standbys and routes writes to the primary.
repmgr handles automatic failover if the primary disappears.
All pods run in the platform namespace; app pods connect via the pgpool service.

Connection string pattern

postgresql://DBUSER:DBPASS@postgresql-ha-pgpool.platform.svc.cluster.local:5432/DBNAME

Replace DBUSER, DBPASS, DBNAME with the database-specific credentials.
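Passwords that contain characters such as @, / or : have to be percent-encoded before they are placed in the URI. The following is a minimal sketch of building the connection string from its parts; urlencode and pg_uri are hypothetical helper names, not part of this repo, and the host and port are taken from the pattern above.

```shell
#!/bin/sh
# Sketch: build the pgpool connection URI with percent-encoded components.
# urlencode and pg_uri are hypothetical helpers (assumptions, not repo code).

PGPOOL_HOST="postgresql-ha-pgpool.platform.svc.cluster.local"
PGPOOL_PORT=5432

# Percent-encode every byte outside the RFC 3986 "unreserved" set.
urlencode() {
  printf '%s' "$1" | od -An -v -tx1 | tr -s ' ' '\n' | awk '
    BEGIN { for (i = 1; i < 256; i++) chr[sprintf("%02x", i)] = sprintf("%c", i) }
    NF {
      c = chr[$1 ""]
      if (c ~ /^[A-Za-z0-9._~-]$/) out = out c
      else out = out "%" toupper($1)
    }
    END { printf "%s", out }'
}

# pg_uri DBUSER DBPASS DBNAME
pg_uri() {
  printf 'postgresql://%s:%s@%s:%s/%s\n' \
    "$(urlencode "$1")" "$(urlencode "$2")" \
    "$PGPOOL_HOST" "$PGPOOL_PORT" "$(urlencode "$3")"
}

# Example call with a password containing reserved characters.
pg_uri myapp 'p@ss/word' myapp
```

Encoding all three components keeps the helper safe to reuse for database or user names that contain unusual characters, at the cost of a few extra subprocesses.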

Initial deployment

Prerequisites

railiance-cluster converged (make smoke passes)
SOPS age key accessible: sops -d helm/postgresql-ha-values.sops.yaml returns plaintext
Bitnami chart repo added and refreshed on the node: helm repo add bitnami https://charts.bitnami.com/bitnami && helm repo update
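Part of the prerequisite check can be automated. The sketch below only verifies that the command-line tools mentioned in this document are on PATH; require_cmds is a hypothetical helper name, and the check does not replace running make smoke or the sops -d test.

```shell
#!/bin/sh
# Sketch: fail early if a required CLI is missing from PATH.
# require_cmds is a hypothetical helper (an assumption, not repo code).
require_cmds() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing required command: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Tools referenced in this document.
require_cmds kubectl helm sops make \
  || echo "install the missing tools before running make pg-deploy" >&2
```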

Steps

# 1. Ensure the platform namespace exists
kubectl create namespace platform --dry-run=client -o yaml | kubectl apply -f -

# 2. Deploy (from railiance-platform/)
make pg-deploy

# 3. Verify
make pg-status
# Expected: 3 postgresql pods + 1 pgpool pod, all Running

# 4. Smoke test
make smoke
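The expectation in step 3 can be checked mechanically instead of by eye. The sketch below parses the output of kubectl get pods --no-headers; check_pg_pods is a hypothetical helper, and the pod name patterns assume the Helm release is named postgresql-ha, which this document implies but does not state outright.

```shell
#!/bin/sh
# Sketch: assert "3 postgresql pods + 1 pgpool pod, all Running".
# Reads `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE).
# check_pg_pods and the name patterns are assumptions, not repo code.
check_pg_pods() {
  awk '
    $3 == "Running" && $1 ~ /^postgresql-ha-postgresql-[0-9]+$/ { pg++ }
    $3 == "Running" && $1 ~ /^postgresql-ha-pgpool-/            { pool++ }
    END {
      printf "postgresql Running: %d, pgpool Running: %d\n", pg, pool
      exit !(pg == 3 && pool == 1)
    }'
}

# Usage against a live cluster:
#   kubectl get pods -n platform --no-headers | check_pg_pods
```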

Creating a new database for an app

# Connect via pgpool
kubectl exec -it -n platform \
  $(kubectl get pod -n platform -l app.kubernetes.io/component=pgpool -o name | head -1) \
  -- psql -U postgres

# Inside psql:
CREATE DATABASE myapp;
CREATE USER myapp WITH PASSWORD 'strong-password';
GRANT ALL PRIVILEGES ON DATABASE myapp TO myapp;
\c myapp
GRANT ALL ON SCHEMA public TO myapp;
\q

Add the user password to the app's own secrets (managed in the app's repo, not here). The connection string for the app will be:

postgresql://myapp:strong-password@postgresql-ha-pgpool.platform.svc.cluster.local:5432/myapp
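For repeatable onboarding, the interactive session above could be scripted. This is a sketch under assumptions: gen_password and new_db_sql are hypothetical helpers, the generated password is alphanumeric so it needs no SQL or URI escaping, and the database name doubles as the user name as in the example.

```shell
#!/bin/sh
# Sketch: generate a random password and the SQL for a new app database.
# gen_password and new_db_sql are hypothetical helpers (not repo code).

# gen_password [LENGTH]  -> alphanumeric string from /dev/urandom (default 32)
gen_password() {
  LC_ALL=C tr -dc 'A-Za-z0-9' </dev/urandom | head -c "${1:-32}"
}

# new_db_sql NAME PASSWORD  -> psql script on stdout
new_db_sql() {
  case "$1" in
    ""|*[!a-z0-9_]*) echo "invalid name: $1" >&2; return 1 ;;
  esac
  cat <<EOF
CREATE DATABASE $1;
CREATE USER $1 WITH PASSWORD '$2';
GRANT ALL PRIVILEGES ON DATABASE $1 TO $1;
\\c $1
GRANT ALL ON SCHEMA public TO $1;
EOF
}

# Usage against a live cluster (POD is the pgpool pod found as shown above):
#   PW=$(gen_password)
#   new_db_sql myapp "$PW" | kubectl exec -i -n platform "$POD" -- psql -U postgres
#   # then store $PW in the app's own secret
```

Restricting names to lowercase letters, digits and underscores avoids having to quote identifiers in the generated SQL.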

Password rotation

Update the password in the plaintext (decrypted) values file, not in the committed template with its CHANGEME_ placeholders
Re-encrypt: sops -e -i helm/postgresql-ha-values.sops.yaml
Upgrade: make pg-deploy
Update the app's connection secret to match
Perform a rolling restart of the app pods so they pick up the new credentials
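The rotation steps can be sketched as a script. Everything here is illustrative: APP_NS and APP_DEPLOY are hypothetical names for the consuming app, and the sketch uses the sops editor mode (sops FILE, which decrypts into $EDITOR and re-encrypts on save) in place of a separate manual decrypt and re-encrypt. DRY_RUN defaults to 1, so the script only prints what it would run.

```shell
#!/bin/sh
# Sketch of the rotation sequence. With DRY_RUN=1 (default) commands are
# printed, not executed. APP_NS and APP_DEPLOY are hypothetical app names.
VALUES="helm/postgresql-ha-values.sops.yaml"
APP_NS="${APP_NS:-myapp}"
APP_DEPLOY="${APP_DEPLOY:-myapp}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

rotate_pg_password() {
  # Steps 1-2: edit the encrypted values file in place.
  run sops "$VALUES" || return 1
  # Step 3: roll the change out via Helm.
  run make pg-deploy || return 1
  # Step 4 happens in the app's repo, so only a reminder is printed here.
  echo "reminder: update the app's connection secret before restarting"
  # Step 5: rolling restart, then wait for it to finish.
  run kubectl rollout restart "deployment/$APP_DEPLOY" -n "$APP_NS" || return 1
  run kubectl rollout status "deployment/$APP_DEPLOY" -n "$APP_NS"
}

rotate_pg_password
```

Run it with DRY_RUN=0 only after reviewing the printed commands against the actual app namespace and deployment name.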

pgpool-password — critical note

The postgresql.pgpoolPassword value in the Helm chart maps to the pgpool-password key in the postgresql-ha-postgresql Kubernetes Secret. The pgpool container mounts this key at startup; if it is absent, pgpool enters CrashLoopBackOff with no log output.

This was the root cause of the 2026-03-10 incident (RAIL-BS-WP-0003).

Always verify after helm upgrade:

kubectl get secret -n platform postgresql-ha-postgresql \
  -o jsonpath='{.data.pgpool-password}' | base64 -d && echo
# Must print a non-empty string
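In a pipeline the same check is more useful as a pass/fail gate that never prints the secret. A minimal sketch, assuming GNU-style base64 -d as used above; secret_key_nonempty is a hypothetical helper that reads the base64-encoded value on stdin.

```shell
#!/bin/sh
# Sketch: succeed only if stdin holds a base64 value that decodes to a
# non-empty string. secret_key_nonempty is a hypothetical helper name.
secret_key_nonempty() {
  decoded=$(base64 -d 2>/dev/null) || return 1
  [ -n "$decoded" ]
}

# Usage against a live cluster:
#   kubectl get secret -n platform postgresql-ha-postgresql \
#     -o jsonpath='{.data.pgpool-password}' | secret_key_nonempty \
#     || { echo "pgpool-password missing or empty" >&2; exit 1; }
```

If the key is absent, kubectl prints nothing for the jsonpath expression, so the helper sees empty input and fails.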

HA failover test

Per Decision D3, any change to this service requires a passing failover test:

# From railiance-cluster/
make test-ha-failover GITEA_URL=https://<gitea-hostname>

The test kills the primary PostgreSQL pod and asserts:

repmgr promotes a standby within 60s
All pods return to Running within 120s
pgpool returns to Running (catches the missing-key bug)
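Deadline assertions like these are usually built on a poll-with-timeout loop. The sketch below shows that pattern only; it is not the implementation behind make test-ha-failover, and wait_until is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: poll a command until it succeeds or a deadline passes.
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Elapsed time is approximated by counting sleep intervals, so the
# runtime of COMMAND itself is not included.
wait_until() {
  timeout=$1; interval=$2; shift 2
  elapsed=0
  while ! "$@" >/dev/null 2>&1; do
    elapsed=$((elapsed + interval))
    [ "$elapsed" -gt "$timeout" ] && return 1
    sleep "$interval"
  done
  return 0
}

# Usage against a live cluster (all_running is a hypothetical check):
#   all_running() { ! kubectl get pods -n platform --no-headers | grep -qv ' Running '; }
#   wait_until 120 5 all_running || echo "pods not Running within 120s" >&2
```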

Backup

Platform backup (PostgreSQL logical dump) is handled by the railiance-backup tool in this repo:

make backup

This produces an age-encrypted dump uploaded to Nextcloud. For cluster-level backup (etcd, kubeconfig), see railiance-cluster/.

Scaling to 3 nodes (ThreePhoenix)

When Railiance02 and Railiance03 join the cluster:

Switch StorageClass from local-path to longhorn in the values file
Change postgresql.podAntiAffinityPreset from soft to hard
Run make pg-deploy — Helm rolling update spreads pods across nodes
Run make test-ha-failover to confirm HA is genuine (not just replicated on one node)
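Before running the failover test it may be worth confirming that the PostgreSQL pods really landed on different nodes. A minimal sketch: pods_on_distinct_nodes is a hypothetical helper, and the component=postgresql label selector in the usage comment is an assumption by analogy with the component=pgpool selector used earlier.

```shell
#!/bin/sh
# Sketch: read "POD NODE" lines on stdin and succeed only if every pod
# runs on a different node. pods_on_distinct_nodes is a hypothetical helper.
pods_on_distinct_nodes() {
  awk '
    NF >= 2 { pods++; if (!seen[$2]++) nodes++ }
    END {
      printf "%d pods on %d nodes\n", pods, nodes
      exit !(pods > 0 && pods == nodes)
    }'
}

# Usage against a live cluster (label selector is an assumption):
#   kubectl get pods -n platform -l app.kubernetes.io/component=postgresql \
#     -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName --no-headers \
#     | pods_on_distinct_nodes
```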
