
RailianceThreePhoenix

*Three-machine failover and load balancing*

Architecture documentation for **RailianceThreePhoenix** service operations automation.

This document is intended as the source of truth for Railiance infrastructure. It enables future services (such as Zulip or Matrix) to be deployed on a resilient load-balancing and failover pattern that runs cloud services efficiently.

Setting up and running Gitea on PostgreSQL in Kubernetes on Ubuntu serves as the practical use case and reference implementation for this DevopsPattern.

# ThreePhoenix System Architecture

**Version:** 1.0 | **Status:** Draft | **Type:** High-Availability Kubernetes Cluster

### 1. Executive Summary

The ThreePhoenix architecture is a **self-healing, 3-node Kubernetes cluster** designed for high availability and automated maintenance. It utilizes a "Phoenix Server" pattern where application components are regularly destroyed and recreated from scratch to prevent configuration drift, memory leaks, and state corruption.

### 2. Physical & Infrastructure Layer

* **Hardware:** 3x Ubuntu Server nodes (Physical or Virtual).
* **Orchestration:** **K3s** (Lightweight Kubernetes).
    * **Topology:** Multi-Master HA (Embedded etcd datastore).
    * **Failure Tolerance:** Cluster survives the loss of any single node (N-1 redundancy).
* **Storage (CSI):** **Longhorn** (Distributed Block Storage).
    * **Replication:** Volume data is synchronously replicated across all 3 nodes.
    * **Access Mode:** `ReadWriteMany` (RWX) enabled for shared application data (e.g., Gitea repositories).
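
The single-node failure tolerance follows from etcd's majority rule: a cluster stays writable only while more than half of its members are reachable. The sketch below works through that arithmetic; the function names are illustrative and not part of K3s or etcd.

```python
def etcd_quorum(members: int) -> int:
    """Smallest majority for a cluster with `members` voting etcd members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """Number of members that can be lost while a majority remains."""
    return members - etcd_quorum(members)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"members={n} quorum={etcd_quorum(n)} tolerates={failure_tolerance(n)}")
```

With three members the quorum is two, so exactly one node may fail. A two-node cluster tolerates no failures, which is why three is the smallest useful size for this pattern.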


### 3. Application Stack (The Standard Unit)

Every stateful service deployed to the cluster (e.g., Gitea) must adhere to this topology:

| Layer | Component | Configuration Strategy |
| --- | --- | --- |
| **Ingress** | Nginx Ingress | **SSL Termination** via Cert-Manager (Let's Encrypt). No ports exposed directly. |
| **Traffic** | ClusterIP | Internal-only communication. |
| **Routing** | Pgpool-II | **Load Balancing:** Reads (SELECT) distributed to 3 nodes. Writes (INSERT, UPDATE, DELETE) sent to Primary. |
| **Compute** | Stateless App | **ReplicaCount: 3**. Pod anti-affinity ensures one pod per physical node. |
| **Database** | PostgreSQL HA | **Repmgr Cluster:** 1 Primary, 2 Standbys. Asynchronous replication. |
| **Data** | Persistent Volume | **Longhorn StorageClass.** ReclaimPolicy: Retain (for safety) or Delete (if relying on Phoenix). |
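
As an illustration of the Compute row, the following sketch builds the part of a Deployment spec that requests three replicas and a hard pod anti-affinity rule keyed on `kubernetes.io/hostname`. The `app` label key and the `gitea` value are assumptions for the example, and the fragment omits containers and everything else a real manifest needs.

```python
import json


def anti_affinity_for(app_label: str) -> dict:
    """Hard anti-affinity: never schedule two pods with this label on one host."""
    return {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"app": app_label}},
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }


def deployment_spec_fragment(app_label: str, replicas: int = 3) -> dict:
    """Partial Deployment `spec` (no containers); label key `app` is assumed."""
    return {
        "replicas": replicas,
        "selector": {"matchLabels": {"app": app_label}},
        "template": {
            "metadata": {"labels": {"app": app_label}},
            "spec": {"affinity": anti_affinity_for(app_label)},
        },
    }


if __name__ == "__main__":
    print(json.dumps(deployment_spec_fragment("gitea"), indent=2))
```

Because the rule is a `required` one, a fourth replica would stay `Pending` on a 3-node cluster rather than double up on a node, which matches the one-pod-per-node intent.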

### 4. The "Phoenix" Automation Engine

A centralized **CronJob** (`phoenix-maintenance`) manages the lifecycle of stateful workloads.

* **Schedule:** Weekly (Sunday 03:00 UTC).
* **Cycle:** 3-Week Rotation.
    * **Week 1:** Destroy & Re-clone Standby Node B.
    * **Week 2:** Destroy & Re-clone Standby Node C.
    * **Week 3:** **Switchover Event.** Promote Standby B to Primary -> Destroy old Primary Node A.
* **Objective:** No database pod lives longer than 21 days.
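
One way the job could decide which step of the rotation to run is to count whole weeks since a fixed starting Sunday. The sketch below assumes that approach. The anchor date and the action identifiers are hypothetical, and `0 3 * * 0` is the standard five-field cron expression for Sunday 03:00 (whether that means UTC depends on how the CronJob's time zone is configured).

```python
from datetime import date

# Five-field cron expression: minute 0, hour 3, any day/month, weekday 0 (Sunday).
CRON_SCHEDULE = "0 3 * * 0"

# Hypothetical action identifiers mirroring the three-week cycle above.
ACTIONS = {
    1: "reclone-standby-b",
    2: "reclone-standby-c",
    3: "switchover-b-then-destroy-a",
}


def cycle_week(run_date: date, anchor: date) -> int:
    """Return 1, 2 or 3 for the rotation week, counting from `anchor` as week 1."""
    weeks_elapsed = (run_date - anchor).days // 7
    return weeks_elapsed % 3 + 1


def action_for(run_date: date, anchor: date) -> str:
    """Pick the maintenance action for the run on `run_date`."""
    return ACTIONS[cycle_week(run_date, anchor)]


if __name__ == "__main__":
    anchor = date(2024, 1, 7)  # assumed first run; a Sunday
    for day in (7, 14, 21, 28):
        run = date(2024, 1, day)
        print(run.isoformat(), action_for(run, anchor))
```

Deriving the step from the calendar keeps the job stateless: a skipped or repeated run does not shift the rotation, at the cost of silently missing that week's action.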

---

### Appendix A: Acceptance Criteria (The Audit Checklist)

Use this checklist for your monthly/quarterly "Health Check." If any item fails, the system is deteriorating.

#### I. Infrastructure Integrity

* [ ] **Node Health:** All 3 nodes report `Ready` status in `kubectl get nodes`.
* [ ] **Distribution:** `kubectl get pods -o wide` confirms Gitea pods are running on 3 *different* physical nodes (Anti-Affinity is working).
* [ ] **Storage Sync:** Longhorn UI shows all volumes have "Healthy" status with **3 replicas**. No "Degraded" volumes allowed.
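
The first two checks can be scripted against the JSON form of the same commands (`kubectl get nodes -o json` and `kubectl get pods -o json`). This is a minimal sketch that takes the already-parsed JSON as input; the node and pod names in the sample data are made up.

```python
def not_ready_nodes(node_list: dict) -> list[str]:
    """Names of nodes whose `Ready` condition is not `True`."""
    bad = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            bad.append(node["metadata"]["name"])
    return bad


def pods_spread_across(pod_list: dict, expected_nodes: int = 3) -> bool:
    """True when the scheduled pods occupy `expected_nodes` distinct nodes, one each."""
    placed = [
        p["spec"]["nodeName"]
        for p in pod_list.get("items", [])
        if p.get("spec", {}).get("nodeName")
    ]
    return len(placed) == expected_nodes and len(set(placed)) == expected_nodes


if __name__ == "__main__":
    nodes = {"items": [
        {"metadata": {"name": f"node-{i}"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}}
        for i in "abc"
    ]}
    pods = {"items": [
        {"metadata": {"name": f"gitea-{i}"}, "spec": {"nodeName": f"node-{n}"}}
        for i, n in enumerate("abc")
    ]}
    print("not ready:", not_ready_nodes(nodes))
    print("spread ok:", pods_spread_across(pods))
```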

#### II. Database & Persistence

* [ ] **Cluster State:** `kubectl exec <primary-pod> -- repmgr cluster show` lists exactly **1 Primary** and **2 Standbys**.
* [ ] **Replication Lag:** Lag is `< 1 second` for all standbys (visible in Grafana or Pgpool status).
* [ ] **Load Balancing:** Pgpool logs confirm `SELECT` queries are being routed to Standby nodes (verifies Read-Scaling is active).
* [ ] **Backup Validation:** A backup file exists in the external S3 bucket/location with a timestamp `< 24 hours` old. **Crucial:** File size is consistent with previous days.
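
The backup check has two parts: freshness and a size sanity check. The sketch below shows one way to express both. The checklist does not define "consistent", so the 20 % tolerance around the median of previous sizes is an assumption, as are the function names. How the timestamp and sizes are fetched from the bucket is left out.

```python
from datetime import datetime, timedelta, timezone
from statistics import median


def backup_is_fresh(last_modified: datetime, now: datetime,
                    max_age: timedelta = timedelta(hours=24)) -> bool:
    """True when the newest backup is younger than `max_age`."""
    return now - last_modified < max_age


def size_is_consistent(size_bytes: int, previous_sizes: list[int],
                       tolerance: float = 0.2) -> bool:
    """True when the size is within `tolerance` of the median of earlier backups.

    The 20 % default is an assumed threshold, not a documented requirement.
    """
    if not previous_sizes:
        return size_bytes > 0
    baseline = median(previous_sizes)
    return abs(size_bytes - baseline) <= tolerance * baseline


if __name__ == "__main__":
    now = datetime(2024, 1, 8, 3, 0, tzinfo=timezone.utc)
    newest = now - timedelta(hours=5)
    print("fresh:", backup_is_fresh(newest, now))
    print("size ok:", size_is_consistent(1_020, [950, 1_000, 1_050]))
```

A sudden drop in size is the more important signal here: a dump that shrinks sharply often means the backup ran against an empty or partially restored database.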

#### III. Security & Network

* [ ] **SSL Validity:** `git.yourdomain.com` certificate expires in `> 30 days`.
* [ ] **Port Scan:** Running `nmap` against the public IP reveals **ONLY** ports 80 (HTTP) and 443 (HTTPS). Database ports (5432) must be `Closed`/`Filtered`.
* [ ] **Ingress Check:** Accessing the application via HTTP automatically redirects to HTTPS (301 Redirect).
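
The certificate check can be automated with Python's standard `ssl` module. The sketch below separates the network fetch from the date arithmetic so the latter can be tested offline. The 30-day threshold comes from the checklist; the function names are illustrative, and the hostname is whatever your ingress serves.

```python
import socket
import ssl


def days_until_expiry(not_after: str, now_epoch: float) -> float:
    """Days between `now_epoch` and a certificate's `notAfter` string."""
    return (ssl.cert_time_to_seconds(not_after) - now_epoch) / 86400


def certificate_passes(not_after: str, now_epoch: float, min_days: int = 30) -> bool:
    """True when the certificate is valid for more than `min_days`."""
    return days_until_expiry(not_after, now_epoch) > min_days


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Open a verified TLS connection and return the peer certificate's `notAfter`."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


if __name__ == "__main__":
    now = ssl.cert_time_to_seconds("Dec  1 00:00:00 2029 GMT")
    print(certificate_passes("Jan  1 00:00:00 2030 GMT", now))
```

In a real run, call `fetch_not_after("git.yourdomain.com")` and pass `time.time()` as `now_epoch`. Since Let's Encrypt certificates are short-lived, a failing check usually points at a stalled Cert-Manager renewal rather than a forgotten manual step.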

#### IV. Phoenix Mechanics

* [ ] **Job History:** `kubectl get jobs` shows that the last `phoenix-maintenance` job completed successfully (`COMPLETIONS` of `1/1`) and did not fail.
* [ ] **Pod Age:** No `postgresql` pod has an "Age" greater than **22 days**. (If one is 170 days old, the automation is broken).
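
The pod-age check can be derived from `metadata.creationTimestamp` in the output of `kubectl get pods -o json`. This is a minimal sketch assuming the parsed JSON is passed in; the pod names in the example are made up, and the 22-day limit is the checklist value.

```python
from datetime import datetime, timedelta, timezone

MAX_POD_AGE = timedelta(days=22)


def parse_k8s_timestamp(value: str) -> datetime:
    """Parse a Kubernetes RFC 3339 timestamp such as `2024-02-20T03:00:00Z`."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stale_pods(pod_list: dict, now: datetime,
               max_age: timedelta = MAX_POD_AGE) -> list[str]:
    """Names of pods created more than `max_age` before `now`."""
    stale = []
    for pod in pod_list.get("items", []):
        created = parse_k8s_timestamp(pod["metadata"]["creationTimestamp"])
        if now - created > max_age:
            stale.append(pod["metadata"]["name"])
    return stale


if __name__ == "__main__":
    pods = {"items": [
        {"metadata": {"name": "postgresql-0", "creationTimestamp": "2024-02-20T03:00:00Z"}},
        {"metadata": {"name": "postgresql-1", "creationTimestamp": "2023-09-13T03:00:00Z"}},
    ]}
    now = datetime(2024, 3, 1, 3, 0, tzinfo=timezone.utc)
    print("stale:", stale_pods(pods, now))
```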

#### V. Disaster Recovery Drill (Quarterly)

* [ ] **The "Kill" Test:** Manually delete a Gitea Pod.
    * *Pass Criteria:* Site remains accessible (via other 2 pods). New pod spawns and joins within 2 minutes.
* [ ] **The "Restore" Test:** Restore the database backup to a *test* namespace.
    * *Pass Criteria:* You can log in and see the latest repositories.
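
The 2-minute pass criterion of the "Kill" test can be timed with a small polling helper. The sketch below is generic: `check` is any callable you supply (for example, one that reports whether three Gitea pods are `Ready` again), and the 5-second polling interval is an assumption. The clock and sleep functions are injectable so the helper can be exercised without waiting.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool],
               timeout_s: float = 120.0,
               interval_s: float = 5.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `check` until it returns True or `timeout_s` elapses.

    Returns True on success and False on timeout. `check` always runs at
    least once, and once more at or after the deadline.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

During a drill, start the helper immediately after deleting the pod. A `False` result means the replacement did not join within the window and the item fails.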
