railiance-platform/docs/postgresql-ha.md

# PostgreSQL HA — Platform Service

**Chart:** `bitnami/postgresql-ha`
**Namespace:** `platform`
**Managed by:** `railiance-platform` (S3)
**Workplan:** `RAIL-PL-WP-0001`

---

## Architecture

```
Apps (S5)
  └── pgpool (load balancer / connection pooler)
        ├── postgresql-0  [Primary  — repmgr]
        ├── postgresql-1  [Standby  — repmgr]
        └── postgresql-2  [Standby  — repmgr]
```

- **pgpool-II** routes writes to the primary and distributes reads across the standbys
- **repmgr** promotes a standby automatically if the primary becomes unavailable
- All pods run in the `platform` namespace; app pods connect through the pgpool service

## Connection string pattern

```
postgresql://DBUSER:DBPASS@postgresql-ha-pgpool.platform.svc.cluster.local:5432/DBNAME
```

Replace `DBUSER`, `DBPASS`, `DBNAME` with the database-specific credentials.
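
Credentials that contain characters such as `@`, `:` or `/` have to be percent-encoded before they go into a connection URI. The following is a minimal sketch of a helper that builds the string above. The function names `urlencode` and `pg_conn_url` are illustrative and not part of this repo, and the encoder assumes ASCII input.

```bash
# Percent-encode every character outside the RFC 3986 unreserved set.
# Assumption: ASCII input only; multi-byte characters are not handled.
urlencode() (
  LC_ALL=C
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character itself
    case $c in
      [A-Za-z0-9._~-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;
    esac
    s=$rest
  done
  printf '%s' "$out"
)

# Build the pgpool connection URI from user, password and database name.
pg_conn_url() {
  printf 'postgresql://%s:%s@postgresql-ha-pgpool.platform.svc.cluster.local:5432/%s\n' \
    "$(urlencode "$1")" "$(urlencode "$2")" "$(urlencode "$3")"
}

# Usage:
#   pg_conn_url myapp 'p@ss/w:rd' myapp
```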

---

## Initial deployment

### Prerequisites

- `railiance-cluster` converged (`make smoke` passes)
- SOPS age key accessible: `sops -d helm/postgresql-ha-values.sops.yaml` returns plaintext
- `helm repo add bitnami https://charts.bitnami.com/bitnami && helm repo update` done on the node

### Steps

```bash
# 1. Ensure the platform namespace exists
kubectl create namespace platform --dry-run=client -o yaml | kubectl apply -f -

# 2. Deploy (from railiance-platform/)
make pg-deploy

# 3. Verify
make pg-status
# Expected: 3 postgresql pods + 1 pgpool pod, all Running

# 4. Smoke test
make smoke
```
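
The expectation in step 3 can also be checked mechanically. This is a rough sketch, not what `make pg-status` does internally. It assumes the default `kubectl get pods --no-headers` column layout (name, ready, status) and pod names of the form `<release>-postgresql-N` and `<release>-pgpool-...`; the function name `check_pg_pods` is made up for illustration.

```bash
# Reads `kubectl get pods --no-headers` output on stdin.
# Succeeds only if exactly 3 postgresql pods and 1 pgpool pod are listed
# and every listed pod has STATUS Running.
check_pg_pods() {
  awk '
    $3 != "Running"            { bad++ }
    $1 ~ /-postgresql-[0-9]+$/ { pg++ }
    $1 ~ /-pgpool-/            { pool++ }
    END { exit !(bad == 0 && pg == 3 && pool == 1) }
  '
}

# Usage (the instance label value is an assumption about the release name):
#   kubectl get pods -n platform -l app.kubernetes.io/instance=postgresql-ha \
#     --no-headers | check_pg_pods && echo "pods OK"
```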

---

## Creating a new database for an app

```bash
# Connect via pgpool
kubectl exec -it -n platform \
  $(kubectl get pod -n platform -l app.kubernetes.io/component=pgpool -o name | head -1) \
  -- psql -U postgres

-- Inside psql (SQL comments start with --, not #):
CREATE DATABASE myapp;
CREATE USER myapp WITH PASSWORD 'strong-password';
GRANT ALL PRIVILEGES ON DATABASE myapp TO myapp;
\c myapp
GRANT ALL ON SCHEMA public TO myapp;
\q
```

Add the user password to the app's own secrets (managed in the app's repo,
not here). The connection string for the app will be:
```
postgresql://myapp:strong-password@postgresql-ha-pgpool.platform.svc.cluster.local:5432/myapp
```
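
For repeatable setups, the statements above can be generated instead of typed. The sketch below prints the same SQL for a given app name and password; `render_app_db_sql` is a hypothetical helper, not something that exists in this repo. It rejects names that are not plain lowercase identifiers and doubles single quotes in the password so that the string literal stays valid.

```bash
# Print the SQL that creates a database and an owner role of the same name.
# $1 = app/database name, $2 = password
render_app_db_sql() (
  LC_ALL=C
  name=$1
  case $name in
    '' | [!a-z_]* | *[!a-z0-9_]*)
      echo "invalid database name: $name" >&2
      exit 1
      ;;
  esac
  # Double any single quotes so the password is a valid SQL string literal.
  password=$(printf '%s' "$2" | sed "s/'/''/g")
  cat <<EOF
CREATE DATABASE $name;
CREATE USER $name WITH PASSWORD '$password';
GRANT ALL PRIVILEGES ON DATABASE $name TO $name;
\\c $name
GRANT ALL ON SCHEMA public TO $name;
EOF
)

# Usage (note -i rather than -it, because psql reads the SQL from stdin):
#   render_app_db_sql myapp "$MYAPP_DB_PASSWORD" | kubectl exec -i -n platform \
#     "$(kubectl get pod -n platform -l app.kubernetes.io/component=pgpool -o name | head -1)" \
#     -- psql -U postgres -v ON_ERROR_STOP=1
```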

---

## Password rotation

1. Update the password in the plaintext values template
2. Re-encrypt: `sops -e -i helm/postgresql-ha-values.sops.yaml`
3. Upgrade: `make pg-deploy`
4. Update the app's connection secret to match
5. Perform a rolling restart of the app pods so they pick up the new credentials

---

## pgpool-password — critical note

The `postgresql.pgpoolPassword` value in the Helm chart maps to the
`pgpool-password` key in the `postgresql-ha-postgresql` Kubernetes Secret.
The pgpool container mounts this key at startup; if it is absent, pgpool
enters CrashLoopBackOff with **no log output**.

**This was the root cause of the 2026-03-10 incident (RAIL-BS-WP-0003).**

Always verify after `helm upgrade`:
```bash
kubectl get secret -n platform postgresql-ha-postgresql \
  -o jsonpath='{.data.pgpool-password}' | base64 -d && echo
# Must print a non-empty string
```
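
To make the same check usable in a script or CI step, it helps to turn the "non-empty string" rule into an exit code. A minimal sketch, assuming `kubectl` and `base64` are on the `PATH`; the function name `require_secret_key` is illustrative only.

```bash
# Fail (non-zero exit) if the given key is missing or empty in a Secret.
# $1 = namespace, $2 = secret name, $3 = key
require_secret_key() (
  value=$(kubectl get secret -n "$1" "$2" -o "jsonpath={.data.$3}" | base64 -d) || exit 1
  if [ -z "$value" ]; then
    echo "secret $2 in namespace $1 has no usable key '$3'" >&2
    exit 1
  fi
)

# Usage:
#   require_secret_key platform postgresql-ha-postgresql pgpool-password \
#     && echo "pgpool-password present"
```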

---

## HA failover test

Per Decision D3, any change to this service requires a passing failover test:

```bash
# From railiance-cluster/
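# Replace <gitea-hostname> with the real host; the quotes keep the shell
# from treating < and > as redirections.
make test-ha-failover GITEA_URL='https://<gitea-hostname>'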
```

The test kills the primary PostgreSQL pod and asserts:
1. repmgr promotes a standby within 60s
2. All pods return to Running within 120s
3. pgpool returns to Running (this catches the missing `pgpool-password` key bug described above)
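
The "within N seconds" assertions follow a common poll-until-deadline pattern. The sketch below shows that pattern in isolation; it is not the actual test in `railiance-cluster/`, and `wait_until` is a name chosen here for illustration.

```bash
# Re-run a command until it succeeds or the timeout expires.
# $1 = timeout in seconds, $2 = poll interval in seconds, rest = command
wait_until() (
  deadline=$(( $(date +%s) + $1 ))
  interval=$2
  shift 2
  until "$@"; do
    # Give up once the deadline has passed.
    [ "$(date +%s)" -ge "$deadline" ] && exit 1
    sleep "$interval"
  done
)

# Usage: wait up to 120s for the pgpool pod to report Ready
#   wait_until 120 5 kubectl wait -n platform pod \
#     -l app.kubernetes.io/component=pgpool --for=condition=Ready --timeout=1s
```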

---

## Backup

Platform backup (PostgreSQL logical dump) is handled by the `railiance-backup`
tool in this repo:

```bash
make backup
```

This produces an age-encrypted dump uploaded to Nextcloud. For cluster-level
backup (etcd, kubeconfig), see `railiance-cluster/`.

---

## Scaling to 3 nodes (ThreePhoenix)

When Railiance02 and Railiance03 join the cluster:

1. Switch StorageClass from `local-path` to `longhorn` in the values file
2. Change `postgresql.podAntiAffinityPreset` from `soft` to `hard`
3. Run `make pg-deploy` — Helm rolling update spreads pods across nodes
4. Run `make test-ha-failover` from `railiance-cluster/` to confirm that HA is genuine (replicas on separate nodes, not all on one node)
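
A quick way to see whether the pods really landed on separate nodes is to count the distinct node names they are scheduled on. This is a small sketch; `count_distinct_nodes` is a hypothetical helper, and the `app.kubernetes.io/component=postgresql` selector in the usage comment is an assumption modelled on the pgpool selector used earlier.

```bash
# Reads one node name per line on stdin and prints how many distinct
# nodes appear. Blank lines are ignored.
count_distinct_nodes() {
  awk 'NF && !seen[$1]++ { n++ } END { print n + 0 }'
}

# Usage (with hard anti-affinity on three nodes the result should be 3):
#   kubectl get pods -n platform -l app.kubernetes.io/component=postgresql \
#     -o custom-columns=NODE:.spec.nodeName --no-headers | count_distinct_nodes
```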