Lays out the S3 platform layer foundation for RAIL-PL-WP-0001 T01:
- .sops.yaml: age encryption policy (shared key, *.sops.yaml pattern)
- .gitignore: prevents accidental commit of decrypted values files
- Makefile: pg-deploy, pg-status, pg-pgpool-check, valkey-deploy, valkey-status, backup targets with KUBECONFIG/HELM wiring
- helm/postgresql-ha-values.yaml.template: annotated values schema with CHANGEME_ placeholders; includes pgpool-password fix from RAIL-BS-WP-0003; notes on single-node vs ThreePhoenix scaling
- docs/postgresql-ha.md: connection strings, DB creation, password rotation, pgpool-password critical note, HA failover test ref, ThreePhoenix scaling path
To complete T01: fill in CHANGEME_ values, encrypt with sops -e -i, then run make pg-deploy.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PostgreSQL HA — Platform Service
Chart: bitnami/postgresql-ha
Namespace: platform
Managed by: railiance-platform (S3)
Workplan: RAIL-PL-WP-0001
Architecture
Apps (S5)
    └── pgpool (load balancer / connection pooler)
            ├── postgresql-0 [Primary, repmgr]
            ├── postgresql-1 [Standby, repmgr]
            └── postgresql-2 [Standby, repmgr]
- pgpool-II distributes reads across standbys, routes writes to primary
- repmgr handles automatic failover if the primary disappears
- All pods run in the platform namespace; app pods connect via the pgpool service
Connection string pattern
postgresql://DBUSER:DBPASS@postgresql-ha-pgpool.platform.svc.cluster.local:5432/DBNAME
Replace DBUSER, DBPASS, DBNAME with the database-specific credentials.
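Passwords containing characters such as @, : or / have to be percent-encoded before they go into the URL. A minimal sketch of building the string in POSIX sh follows; the function names urlencode and pg_url are hypothetical helpers, not part of this repo, and the encoder assumes ASCII input.

```shell
# Percent-encode a string for use inside a connection URL (ASCII input assumed).
# The function body runs in a subshell so its variables do not leak.
urlencode() (
  LC_ALL=C
  s=$1 out=
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character itself
    case $c in
      [a-zA-Z0-9.~_-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;
    esac
    s=$rest
  done
  printf '%s' "$out"
)

# Build the pgpool connection string: $1 = DBUSER, $2 = DBPASS, $3 = DBNAME.
pg_url() {
  printf 'postgresql://%s:%s@%s:5432/%s\n' \
    "$(urlencode "$1")" "$(urlencode "$2")" \
    "postgresql-ha-pgpool.platform.svc.cluster.local" \
    "$(urlencode "$3")"
}
```

For example, pg_url myapp 'p@ss' myapp prints postgresql://myapp:p%40ss@postgresql-ha-pgpool.platform.svc.cluster.local:5432/myapp.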
Initial deployment
Prerequisites
- railiance-cluster converged (make smoke passes)
- SOPS age key accessible: sops -d helm/postgresql-ha-values.sops.yaml returns plaintext
- helm repo add bitnami https://charts.bitnami.com/bitnami && helm repo update done on the node
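The tool-related prerequisites can be checked in one pass before deploying. The sketch below is illustrative only: require_cmds and preflight are hypothetical names, and preflight assumes it is run from railiance-platform/. It does not cover the make smoke check in railiance-cluster.

```shell
# Return non-zero and list any required tools that are missing from PATH.
require_cmds() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Check tools, SOPS decryption, and the bitnami repo; stop at the first failure.
preflight() {
  require_cmds kubectl helm sops make || return 1
  sops -d helm/postgresql-ha-values.sops.yaml >/dev/null || return 1
  helm repo list | grep -q '^bitnami' || return 1
}

# Usage (from railiance-platform/):
#   preflight && make pg-deploy
```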
Steps
# 1. Ensure the platform namespace exists
kubectl create namespace platform --dry-run=client -o yaml | kubectl apply -f -
# 2. Deploy (from railiance-platform/)
make pg-deploy
# 3. Verify
make pg-status
# Expected: 3 postgresql pods + 1 pgpool pod, all Running
# 4. Smoke test
make smoke
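The expected result of step 3 (3 postgresql pods plus 1 pgpool pod, all Running) can also be asserted by a script rather than by eye. This is a rough sketch, not what make pg-status does; it assumes the chart labels the database pods with app.kubernetes.io/component=postgresql, and the function names are hypothetical.

```shell
# Count Running pods in `kubectl get pods --no-headers` output read from stdin.
# Columns are NAME READY STATUS RESTARTS AGE, so STATUS is field 3.
count_running() {
  awk '$3 == "Running" { n++ } END { print n + 0 }'
}

# Succeed only when 3 postgresql pods and 1 pgpool pod are Running.
pg_topology_ok() {
  for spec in postgresql=3 pgpool=1; do
    component=${spec%=*}
    want=${spec#*=}
    have=$(kubectl get pods -n platform --no-headers \
      -l "app.kubernetes.io/component=$component" | count_running)
    if [ "$have" -ne "$want" ]; then
      echo "$component: $have of $want Running" >&2
      return 1
    fi
  done
}

# Usage:
#   pg_topology_ok && echo "topology OK"
```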
Creating a new database for an app
# Connect via pgpool
kubectl exec -it -n platform \
$(kubectl get pod -n platform -l app.kubernetes.io/component=pgpool -o name | head -1) \
-- psql -U postgres
# Inside psql:
CREATE DATABASE myapp;
CREATE USER myapp WITH PASSWORD 'strong-password';
GRANT ALL PRIVILEGES ON DATABASE myapp TO myapp;
\c myapp
GRANT ALL ON SCHEMA public TO myapp;
\q
Add the user password to the app's own secrets (managed in the app's repo, not here). The connection string for the app will be:
postgresql://myapp:strong-password@postgresql-ha-pgpool.platform.svc.cluster.local:5432/myapp
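For repeatable setups, the same statements can be generated and piped into psql instead of typed interactively. The sketch below is one possible approach, not an existing tool in this repo; new_db_sql is a hypothetical name, and it restricts app names to lowercase letters, digits and underscores so the name can be used unquoted.

```shell
# Emit the SQL for a new app database: $1 = app name, $2 = password.
new_db_sql() {
  case "$1" in
    ''|[!a-z_]*|*[!a-z0-9_]*) echo "invalid name: $1" >&2; return 1 ;;
  esac
  # Double any single quotes so the password is a valid SQL string literal.
  pw=$(printf '%s' "$2" | sed "s/'/''/g")
  cat <<EOF
CREATE DATABASE $1;
CREATE USER $1 WITH PASSWORD '$pw';
GRANT ALL PRIVILEGES ON DATABASE $1 TO $1;
\\c $1
GRANT ALL ON SCHEMA public TO $1;
EOF
}

# Usage (hex output avoids characters that need URL encoding):
#   PW=$(openssl rand -hex 24)
#   POD=$(kubectl get pod -n platform -l app.kubernetes.io/component=pgpool -o name | head -1)
#   new_db_sql myapp "$PW" | kubectl exec -i -n platform "$POD" -- psql -U postgres -v ON_ERROR_STOP=1
```

Generating the password with openssl rand -hex keeps it free of characters that would otherwise need escaping in the connection string.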
Password rotation
- Update the password in the plaintext values template
- Re-encrypt: sops -e -i helm/postgresql-ha-values.sops.yaml
- Upgrade: make pg-deploy
- Update the app's connection secret to match
- Rolling restart the app pods to pick up the new connection
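The rotation can be rehearsed with a dry-run wrapper before touching the cluster. The sketch below is a variation on the steps above, not the documented procedure: it edits the encrypted file in place with sops (which re-encrypts on save; newer sops versions may require the sops edit subcommand) instead of going through the plaintext template, and it leaves updating the app's own secret to the app repo. The names run and rotate_pg_password, and the Deployment name and namespace arguments, are hypothetical.

```shell
# Print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# $1 = app Deployment name, $2 = app namespace.
rotate_pg_password() {
  run sops helm/postgresql-ha-values.sops.yaml &&   # opens $EDITOR, re-encrypts on save
  run make pg-deploy &&
  run kubectl rollout restart "deployment/$1" -n "$2" &&
  run kubectl rollout status "deployment/$1" -n "$2" --timeout=120s
}

# Usage:
#   DRY_RUN=1 rotate_pg_password myapp apps   # show the commands only
#   rotate_pg_password myapp apps             # perform the rotation
```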
pgpool-password — critical note
The postgresql.pgpoolPassword value in the Helm chart maps to the
pgpool-password key in the postgresql-ha-postgresql Kubernetes Secret.
The pgpool container mounts this key at startup; if it is absent, pgpool
enters CrashLoopBackOff with no log output.
This was the root cause of the 2026-03-10 incident (RAIL-BS-WP-0003).
Always verify after helm upgrade:
kubectl get secret -n platform postgresql-ha-postgresql \
-o jsonpath='{.data.pgpool-password}' | base64 -d && echo
# Must print a non-empty string
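To make this check fail loudly in a script instead of relying on someone reading the output, it can be wrapped so that an empty or absent key returns a non-zero status. A minimal sketch; nonempty_b64 and check_pgpool_password are hypothetical names.

```shell
# Succeed only if stdin carries base64 that decodes to a non-empty value.
nonempty_b64() {
  decoded=$(base64 -d 2>/dev/null) || return 1
  [ -n "$decoded" ]
}

# Fail with a message when the pgpool-password key is missing or empty.
check_pgpool_password() {
  kubectl get secret -n platform postgresql-ha-postgresql \
    -o jsonpath='{.data.pgpool-password}' | nonempty_b64 ||
    { echo "pgpool-password missing or empty" >&2; return 1; }
}

# Usage:
#   make pg-deploy && check_pgpool_password
```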
HA failover test
Per Decision D3, any change to this service requires a passing failover test:
# From railiance-cluster/
make test-ha-failover GITEA_URL=https://<gitea-hostname>
The test kills the primary PostgreSQL pod and asserts:
- repmgr promotes a standby within 60s
- All pods return to Running within 120s
- pgpool returns to Running (catches the missing-key bug)
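The time-bounded assertions above follow a poll-until-deadline pattern. The sketch below illustrates that pattern only; it is not the implementation of make test-ha-failover, and wait_until and all_running are hypothetical names.

```shell
# Poll a command once per second until it succeeds or $1 seconds have elapsed.
wait_until() {
  deadline=$1
  shift
  elapsed=0
  until "$@"; do
    [ "$elapsed" -ge "$deadline" ] && return 1
    sleep 1
    elapsed=$((elapsed + 1))
  done
}

# True when no pod in the platform namespace reports a status other than Running.
all_running() {
  ! kubectl get pods -n platform --no-headers | awk '$3 != "Running"' | grep -q .
}

# Usage:
#   wait_until 120 all_running || echo "pods did not recover within 120s" >&2
```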
Backup
Platform backup (PostgreSQL logical dump) is handled by the railiance-backup
tool in this repo:
make backup
This produces an age-encrypted dump uploaded to Nextcloud. For cluster-level
backup (etcd, kubeconfig), see railiance-cluster/.
Scaling to 3 nodes (ThreePhoenix)
When Railiance02 and Railiance03 join the cluster:
- Switch StorageClass from local-path to longhorn in the values file
- Change postgresql.podAntiAffinityPreset from soft to hard
- Run make pg-deploy; the Helm rolling update spreads pods across nodes
- Run make test-ha-failover to confirm HA is genuine (not just replicated on one node)
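After the rolling update, it is worth confirming that the three PostgreSQL pods really landed on three different nodes. A small sketch of that check follows; it assumes the database pods carry the label app.kubernetes.io/component=postgresql, and pg_nodes and spread_across are hypothetical names.

```shell
# Print the node name of each PostgreSQL pod, one per line.
pg_nodes() {
  kubectl get pods -n platform -l app.kubernetes.io/component=postgresql \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}'
}

# Read node names on stdin; succeed if at least $1 of them are distinct.
spread_across() {
  [ "$(sort -u | grep -c .)" -ge "$1" ]
}

# Usage:
#   pg_nodes | spread_across 3 || echo "pods share a node" >&2
```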