RailianceThreePhoenix
*Three-machine failover and load balancing*
Architecture documentation for RailianceThreePhoenix service operations automation.
This document is intended as the "source of truth" for Railiance infrastructure. It enables deployment of future services (like Zulip, Matrix, ...) on a resilient load-balancing and failover pattern for running cloud services efficiently.
Setting up and running Gitea on PostgreSQL in Kubernetes on Ubuntu serves as the practical use case and reference implementation for this DevopsPattern.
ThreePhoenix System Architecture
Version: 1.0 | Status: Draft | Type: High-Availability Kubernetes Cluster
1. Executive Summary
The ThreePhoenix architecture is a self-healing, 3-node Kubernetes cluster designed for high availability and automated maintenance. It utilizes a "Phoenix Server" pattern where application components are regularly destroyed and recreated from scratch to prevent configuration drift, memory leaks, and state corruption.
2. Physical & Infrastructure Layer
- Hardware: 3x Ubuntu Server nodes (Physical or Virtual).
- Orchestration: K3s (Lightweight Kubernetes).
  - Topology: Multi-Master HA (Embedded etcd datastore).
  - Failure Tolerance: Cluster survives the loss of any single node (N-1 redundancy).
- Storage (CSI): Longhorn (Distributed Block Storage).
  - Replication: Volume data is synchronously replicated across all 3 nodes.
  - Access Mode: `ReadWriteMany` (RWX) enabled for shared application data (e.g., Gitea repositories).
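
The single-node failure tolerance follows from the majority quorum that etcd requires. The sketch below works through that arithmetic; the function names are illustrative and not part of K3s or etcd.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Number of members that can be lost while a majority remains."""
    return members - quorum(members)


# Compare a few cluster sizes: size, quorum, tolerated failures.
for size in (1, 3, 4, 5):
    print(size, quorum(size), tolerated_failures(size))
```

With 3 members the quorum is 2, so exactly one node may fail. A fourth node would raise the quorum to 3 without raising the tolerance, which is why odd cluster sizes are the usual choice.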
3. Application Stack (The Standard Unit)
Every stateful service deployed to the cluster (e.g., Gitea) must adhere to this topology:
| Layer | Component | Configuration Strategy |
|---|---|---|
| Ingress | Nginx Ingress | SSL Termination via Cert-Manager (Let's Encrypt). No ports exposed directly. |
| Traffic | ClusterIP | Internal-only communication. |
| Routing | Pgpool-II | Load Balancing: Reads (SELECT) distributed to 3 nodes. Writes (INSERT) sent to Primary. |
| Compute | Stateless App | ReplicaCount: 3. Pod anti-affinity ensures one pod per physical node. |
| Database | PostgreSQL HA | Repmgr Cluster: 1 Primary, 2 Standbys. Asynchronous replication. |
| Data | Persistent Volume | Longhorn StorageClass. ReclaimPolicy: Retain (for safety) or Delete (if relying on Phoenix). |
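
To illustrate the Routing row, here is a toy read/write splitter. It is a deliberately simplified sketch of the idea, not how Pgpool-II is implemented: the class name, the round-robin policy, and the keyword test are all assumptions made for illustration.

```python
from itertools import cycle


class ReadWriteRouter:
    """Toy router: SELECTs rotate over all nodes, everything else goes to the primary."""

    def __init__(self, primary: str, standbys: list[str]) -> None:
        self.primary = primary
        # Reads are spread over all three nodes, primary included.
        self._readers = cycle([primary, *standbys])

    def route(self, sql: str) -> str:
        words = sql.split(None, 1)
        keyword = words[0].upper() if words else ""
        if keyword == "SELECT":
            return next(self._readers)
        # INSERT, UPDATE, DELETE, DDL, and anything unrecognised: primary only.
        return self.primary


router = ReadWriteRouter("pg-0", ["pg-1", "pg-2"])  # hypothetical pod names
print(router.route("SELECT * FROM repository"))
print(router.route("INSERT INTO repository (name) VALUES ('demo')"))
```

A real proxy has to handle more cases than this, for example statements inside a write transaction or `SELECT ... FOR UPDATE`, which must also reach the primary.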
4. The "Phoenix" Automation Engine
A centralized CronJob (phoenix-maintenance) manages the lifecycle of stateful workloads.
- Schedule: Weekly (Sunday 03:00 UTC).
- Cycle: 3-Week Rotation.
  - Week 1: Destroy & Re-clone Standby Node B.
  - Week 2: Destroy & Re-clone Standby Node C.
  - Week 3: Switchover Event. Promote Standby B to Primary -> Destroy old Primary Node A.
- Objective: No database pod lives longer than 21 days.
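
The rotation above can be sketched as a function that maps a run date to its position in the 3-week cycle. This is a minimal illustration only: the `cycle_start` reference date and the function names are assumptions, and the actual `phoenix-maintenance` job may track its state differently.

```python
from datetime import date

# Actions per cycle week, as listed in the rotation above.
ROTATION = {
    1: "destroy and re-clone standby B",
    2: "destroy and re-clone standby C",
    3: "switchover: promote standby B, destroy old primary A",
}


def rotation_week(run_date: date, cycle_start: date) -> int:
    """Return 1, 2, or 3: the position of `run_date` in the 3-week cycle."""
    weeks_elapsed = (run_date - cycle_start).days // 7
    return weeks_elapsed % 3 + 1


def action_for(run_date: date, cycle_start: date) -> str:
    """Look up the maintenance action for a given run date."""
    return ROTATION[rotation_week(run_date, cycle_start)]


start = date(2024, 1, 7)  # a Sunday, chosen arbitrarily as week 1
for run in (date(2024, 1, 7), date(2024, 1, 14), date(2024, 1, 21), date(2024, 1, 28)):
    print(run, action_for(run, start))
```

Deriving the week from the calendar keeps the job stateless, so a missed run does not shift the cycle.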
Appendix A: Acceptance Criteria (The Audit Checklist)
Use this checklist for your monthly/quarterly "Health Check." If any item fails, the system is deteriorating.
I. Infrastructure Integrity
- Node Health: All 3 nodes report `Ready` status in `kubectl get nodes`.
- Distribution: `kubectl get pods -o wide` confirms Gitea pods are running on 3 different physical nodes (Anti-Affinity is working).
- Storage Sync: Longhorn UI shows all volumes have "Healthy" status with 3 replicas. No "Degraded" volumes allowed.
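
The distribution check can also be automated. The sketch below assumes the pod list has been obtained as JSON (for example with `kubectl get pods -o json` plus a label selector for the Gitea pods; the selector is deployment-specific and not given in this document) and inspects `spec.nodeName` and `status.phase` for each pod.

```python
def anti_affinity_ok(pod_list: dict, expected: int = 3) -> bool:
    """True if exactly `expected` Running pods sit on `expected` distinct nodes."""
    nodes = [
        pod["spec"]["nodeName"]
        for pod in pod_list.get("items", [])
        if pod.get("status", {}).get("phase") == "Running"
        and pod.get("spec", {}).get("nodeName")
    ]
    return len(nodes) == expected and len(set(nodes)) == expected


# Hypothetical sample in the shape of `kubectl get pods -o json` output.
sample = {
    "items": [
        {"spec": {"nodeName": "node-a"}, "status": {"phase": "Running"}},
        {"spec": {"nodeName": "node-b"}, "status": {"phase": "Running"}},
        {"spec": {"nodeName": "node-c"}, "status": {"phase": "Running"}},
    ]
}
print(anti_affinity_ok(sample))
```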
II. Database & Persistence
- Cluster State: `kubectl exec <primary-pod> -- repmgr cluster show` lists exactly 1 Primary and 2 Standbys.
- Replication Lag: Lag is `< 1 second` for all standbys (visible in Grafana or Pgpool status).
- Load Balancing: Pgpool logs confirm `SELECT` queries are being routed to Standby nodes (verifies Read-Scaling is active).
- Backup Validation: A backup file exists in the external S3 bucket/location with a timestamp `< 24 hours` old. Crucial: File size is consistent with previous days.
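
The backup validation item combines a freshness test with a size-consistency test. A minimal sketch follows; the 25% tolerance around the median of earlier backups is an assumption chosen for illustration, since this document does not define what "consistent" means numerically.

```python
from datetime import datetime, timedelta, timezone
from statistics import median


def backup_ok(
    latest_time: datetime,
    latest_size: int,
    previous_sizes: list[int],
    now: datetime,
    max_age: timedelta = timedelta(hours=24),
    tolerance: float = 0.25,  # assumed threshold, tune per deployment
) -> bool:
    """True if the newest backup is recent and close in size to earlier ones."""
    fresh = now - latest_time < max_age
    if not previous_sizes:
        # No history yet: only require a fresh, non-empty file.
        return fresh and latest_size > 0
    baseline = median(previous_sizes)
    consistent = abs(latest_size - baseline) <= tolerance * baseline
    return fresh and consistent


now = datetime(2024, 1, 8, 3, 0, tzinfo=timezone.utc)
print(backup_ok(now - timedelta(hours=2), 1010, [1000, 1020, 990], now))
```

A sudden drop in size is the case this guards against: a backup job can finish successfully while writing a nearly empty dump.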
III. Security & Network
- SSL Validity: `git.yourdomain.com` certificate expires in `> 30 days`.
- Port Scan: Running `nmap` against the public IP reveals ONLY ports 80 (HTTP) and 443 (HTTPS). Database ports (5432) must be `Closed/Filtered`.
- Ingress Check: Accessing the application via HTTP automatically redirects to HTTPS (301 Redirect).
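
The SSL validity item can be checked with the Python standard library. In this sketch `fetch_not_after` opens a TLS connection and reads the certificate's `notAfter` field, and `days_until_expiry` converts it into remaining days; both function names are illustrative.

```python
import socket
import ssl


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and a certificate's notAfter string."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate served by host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

The check then passes when `days_until_expiry(fetch_not_after("git.yourdomain.com"), time.time()) > 30`, with `time` imported from the standard library and the placeholder hostname replaced by the real one.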
IV. Phoenix Mechanics
- Job History: `kubectl get jobs` shows the last `phoenix-maintenance` job has status `Completed` (not `Failed`).
- Pod Age: No `postgresql` pod has an "Age" greater than 22 days. (If one is 170 days old, the automation is broken).
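
The pod age check can be scripted against the JSON form of the pod list. The sketch below assumes input in the shape of `kubectl get pods -o json` and compares each pod's `metadata.creationTimestamp` with the 22-day limit; the function name and sample pod names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_pods(
    pod_list: dict,
    now: datetime,
    max_age: timedelta = timedelta(days=22),
) -> list[str]:
    """Names of pods whose age exceeds `max_age`."""
    stale = []
    for pod in pod_list.get("items", []):
        meta = pod["metadata"]
        # Kubernetes reports creation time as an RFC 3339 UTC timestamp.
        created = datetime.strptime(
            meta["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if now - created > max_age:
            stale.append(meta["name"])
    return stale


sample = {
    "items": [
        {"metadata": {"name": "postgresql-0", "creationTimestamp": "2024-01-01T03:00:00Z"}},
        {"metadata": {"name": "postgresql-1", "creationTimestamp": "2024-02-20T03:00:00Z"}},
    ]
}
print(stale_pods(sample, datetime(2024, 3, 1, tzinfo=timezone.utc)))
```

An empty result means the Phoenix rotation is keeping up; any name in the list points at a pod the automation has skipped.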
V. Disaster Recovery Drill (Quarterly)
- The "Kill" Test: Manually delete a Gitea Pod.
  - Pass Criteria: Site remains accessible (via other 2 pods). New pod spawns and joins within 2 minutes.
- The "Restore" Test: Restore the database backup to a test namespace.
  - Pass Criteria: You can log in and see the latest repositories.