railiance-platform/docs/postgresql-ha.md
tegwick 01d280120d feat(platform): T01 — standalone PostgreSQL HA chart scaffold
Lays out the S3 platform layer foundation for RAIL-PL-WP-0001 T01:

- .sops.yaml: age encryption policy (shared key, *.sops.yaml pattern)
- .gitignore: prevents accidental commit of decrypted values files
- Makefile: pg-deploy, pg-status, pg-pgpool-check, valkey-deploy,
  valkey-status, backup targets with KUBECONFIG/HELM wiring
- helm/postgresql-ha-values.yaml.template: annotated values schema
  with CHANGEME_ placeholders; includes pgpool-password fix from
  RAIL-BS-WP-0003; notes on single-node vs ThreePhoenix scaling
- docs/postgresql-ha.md: connection strings, DB creation, password
  rotation, pgpool-password critical note, HA failover test ref,
  ThreePhoenix scaling path

To complete T01: fill in CHANGEME_ values, encrypt with sops -e -i,
then run make pg-deploy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-11 02:17:55 +01:00

PostgreSQL HA — Platform Service

Chart: bitnami/postgresql-ha
Namespace: platform
Managed by: railiance-platform (S3)
Workplan: RAIL-PL-WP-0001


Architecture

Apps (S5)
  └── pgpool (load balancer / connection pooler)
        ├── postgresql-0  [Primary  — repmgr]
        ├── postgresql-1  [Standby  — repmgr]
        └── postgresql-2  [Standby  — repmgr]

  • pgpool-II distributes reads across the standbys and routes writes to the primary
  • repmgr handles automatic failover if the primary disappears
  • All pods run in the platform namespace; app pods connect through the pgpool service

Connection string pattern

postgresql://DBUSER:DBPASS@postgresql-ha-pgpool.platform.svc.cluster.local:5432/DBNAME

Replace DBUSER, DBPASS, DBNAME with the database-specific credentials.
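
A password containing characters such as @, : or / has to be percent-encoded before it goes into the URL, otherwise clients may misparse the host part. The following is a minimal sketch of a helper that builds the connection string; the function names urlencode and pg_url are illustrative and not part of this repo, and the sketch assumes ASCII input.

```shell
# Percent-encode every character outside the RFC 3986 unreserved set.
# Sketch only: assumes ASCII input. Runs in a subshell so variables stay local.
urlencode() (
  LC_ALL=C
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character itself
    case $c in
      [A-Za-z0-9._~-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;
    esac
    s=$rest
  done
  printf '%s' "$out"
)

# Build the connection string for the pgpool service.
# Args: DBUSER DBPASS DBNAME [HOST]; HOST defaults to the service name above.
pg_url() (
  user=$1 pass=$2 db=$3
  host=${4:-postgresql-ha-pgpool.platform.svc.cluster.local}
  printf 'postgresql://%s:%s@%s:5432/%s\n' \
    "$(urlencode "$user")" "$(urlencode "$pass")" "$host" "$(urlencode "$db")"
)
```

For example, pg_url myapp 'p@ss' myapp yields a URL in which the password appears as p%40ss.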


Initial deployment

Prerequisites

  • railiance-cluster converged (make smoke passes)
  • SOPS age key accessible: sops -d helm/postgresql-ha-values.sops.yaml returns plaintext
  • Bitnami Helm repo added and refreshed on the node: helm repo add bitnami https://charts.bitnami.com/bitnami && helm repo update

Steps

# 1. Ensure the platform namespace exists
kubectl create namespace platform --dry-run=client -o yaml | kubectl apply -f -

# 2. Deploy (from railiance-platform/)
make pg-deploy

# 3. Verify
make pg-status
# Expected: 3 postgresql pods + 1 pgpool pod, all Running

# 4. Smoke test
make smoke

Creating a new database for an app

# Connect via pgpool
kubectl exec -it -n platform \
  $(kubectl get pod -n platform -l app.kubernetes.io/component=pgpool -o name | head -1) \
  -- psql -U postgres

# Inside psql:
CREATE DATABASE myapp;
CREATE USER myapp WITH PASSWORD 'strong-password';
GRANT ALL PRIVILEGES ON DATABASE myapp TO myapp;
\c myapp
GRANT ALL ON SCHEMA public TO myapp;
\q

Add the user password to the app's own secrets (managed in the app's repo, not here). The connection string for the app will be:

postgresql://myapp:strong-password@postgresql-ha-pgpool.platform.svc.cluster.local:5432/myapp
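
The 'strong-password' placeholder should be replaced with a generated value. One possible way to produce one is sketched below; gen_password is a hypothetical helper, not something this repo provides. Restricting the alphabet to letters and digits is a design choice that keeps the password safe to paste into both the CREATE USER statement and the connection URL without quoting or encoding.

```shell
# Generate a random alphanumeric password of the requested length (default 32).
# Sketch only: reads from /dev/urandom and keeps just the A-Za-z0-9 bytes.
gen_password() (
  len=${1:-32}
  out=
  # Each pass filters 256 random bytes; loop until enough characters survive.
  while [ "${#out}" -lt "$len" ]; do
    out=$out$(head -c 256 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9')
  done
  printf '%s\n' "$out" | cut -c "1-$len"
)
```

Usage: run gen_password 40 and store the result in the app's own secrets before using it in CREATE USER.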

Password rotation

  1. Update the password in the plaintext values file (the decrypted helm/postgresql-ha-values.sops.yaml, originally filled in from the template)
  2. Re-encrypt: sops -e -i helm/postgresql-ha-values.sops.yaml
  3. Upgrade: make pg-deploy
  4. Update the app's connection secret to match
  5. Perform a rolling restart of the app pods so they pick up the new credentials
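
The final rolling-restart step could look like the sketch below. It assumes the app runs as a Kubernetes Deployment; the function name restart_app, the namespace, the Deployment name, and the 120s timeout are all illustrative.

```shell
# Trigger a rolling restart of an app Deployment and wait for it to finish.
# Args: NAMESPACE DEPLOYMENT (both hypothetical; substitute the app's own).
restart_app() {
  kubectl rollout restart "deployment/$2" -n "$1" &&
    kubectl rollout status "deployment/$2" -n "$1" --timeout=120s
}
```

Usage: restart_app myapp-namespace myapp. The function returns non-zero if the restart cannot be triggered or the rollout does not complete within the timeout.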

pgpool-password — critical note

The postgresql.pgpoolPassword value in the Helm chart maps to the pgpool-password key in the postgresql-ha-postgresql Kubernetes Secret. The pgpool container mounts this key at startup; if it is absent, pgpool enters CrashLoopBackOff with no log output.

This was the root cause of the 2026-03-10 incident (RAIL-BS-WP-0003).

Always verify after helm upgrade:

kubectl get secret -n platform postgresql-ha-postgresql \
  -o jsonpath='{.data.pgpool-password}' | base64 -d && echo
# Must print a non-empty string
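
The command above always exits successfully, even when the key is missing, so it relies on someone reading the output. A scripted variant that fails on an empty or absent key might look like this sketch; secret_key_nonempty is a hypothetical helper name.

```shell
# Succeed only if KEY exists in SECRET and decodes to a non-empty value.
# Args: NAMESPACE SECRET KEY. Runs in a subshell so variables stay local.
secret_key_nonempty() (
  ns=$1 secret=$2 key=$3
  value=$(kubectl get secret -n "$ns" "$secret" \
    -o "jsonpath={.data.${key}}" | base64 -d) || exit 1
  [ -n "$value" ]
)
```

Usage: secret_key_nonempty platform postgresql-ha-postgresql pgpool-password || echo "pgpool-password missing". The exit status makes the check usable from a Makefile target or CI step.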

HA failover test

Per Decision D3, any change to this service requires a passing failover test:

# From railiance-cluster/
make test-ha-failover GITEA_URL=https://<gitea-hostname>

The test kills the primary PostgreSQL pod and asserts:

  1. repmgr promotes a standby within 60s
  2. All pods return to Running within 120s
  3. pgpool returns to Running (catches the missing pgpool-password key bug)
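
The implementation of make test-ha-failover lives in railiance-cluster and is not shown here. As a rough illustration of how assertions with a deadline like these can be expressed, the sketch below polls a condition until a timeout expires; wait_until and all_running are hypothetical helpers, not the test's actual code.

```shell
# Retry a command until it succeeds or TIMEOUT seconds have elapsed.
# Args: TIMEOUT INTERVAL COMMAND [ARGS...]
wait_until() (
  timeout=$1 interval=$2
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    [ "$(date +%s)" -ge "$deadline" ] && exit 1
    sleep "$interval"
  done
  exit 0
)

# Read `kubectl get pods --no-headers` output on stdin; succeed only if
# at least one pod is listed and every STATUS column (3rd) is "Running".
all_running() {
  awk '$3 != "Running" { bad = 1 } END { exit (NR == 0 || bad) }'
}
```

A 120s check could then be written as wait_until 120 5 sh -c 'kubectl get pods -n platform --no-headers | all_running', provided all_running is defined in the shell that sh -c starts (for example by sourcing a helper file first).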

Backup

Platform backup (PostgreSQL logical dump) is handled by the railiance-backup tool in this repo:

make backup

This produces an age-encrypted dump uploaded to Nextcloud. For cluster-level backup (etcd, kubeconfig), see railiance-cluster/.


Scaling to 3 nodes (ThreePhoenix)

When Railiance02 and Railiance03 join the cluster:

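  1. Switch StorageClass from local-path to longhorn in the values file (existing PVCs keep their original StorageClass, and a StatefulSet's volumeClaimTemplates cannot be changed in place, so this may require recreating the StatefulSet and restoring data)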
  2. Change postgresql.podAntiAffinityPreset from soft to hard
  3. Run make pg-deploy — Helm rolling update spreads pods across nodes
  4. Run make test-ha-failover (from railiance-cluster/) to confirm HA is genuine, with pods spread across nodes rather than replicated on one
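
Before running the failover test, it can help to confirm that the hard anti-affinity actually placed each PostgreSQL pod on its own node. The sketch below checks this from a list of "pod node" pairs; pods_on_distinct_nodes is a hypothetical helper, and the label selector in the usage note is an assumption modelled on the pgpool selector used earlier.

```shell
# Read "POD NODE" pairs on stdin; succeed only if there is at least one pod
# and no two pods share a node.
pods_on_distinct_nodes() {
  awk '{ pods++; if (!seen[$2]++) nodes++ }
       END { exit !(pods > 0 && pods == nodes) }'
}
```

Usage, assuming the PostgreSQL pods carry the label app.kubernetes.io/component=postgresql: kubectl get pods -n platform -l app.kubernetes.io/component=postgresql -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName --no-headers | pods_on_distinct_nodes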