RailianceThreePhoenix: 3-node HA Kubernetes cluster with embedded etcd, Longhorn distributed storage, PostgreSQL HA (repmgr + Pgpool-II), and Phoenix CronJob for weekly node rotation to prevent configuration drift. ThreePhoenixWorkplan: 7-phase implementation plan from blank Ubuntu nodes to self-healing Gitea cluster with monitoring and alert silencing. Also adds CLAUDE.md with Custodian State Hub session protocol. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ThreePhoenixWorkplan

*Self-healing, load-balanced application and service hosting*

This is a plan for moving to a "3-Node Phoenix" architecture with High Availability (HA) at every layer, on bare metal.

Here is the staged workplan to go from **3 Blank Ubuntu Machines** to a **Self-Healing, Load-Balanced Gitea Cluster**.

### Prerequisite Checklist

* **Hardware:** 3x Ubuntu Servers (22.04 or 24.04 LTS).
* **Network:** All 3 nodes must be able to reach each other. For K3s with embedded etcd that means TCP 6443 (API server), TCP 2379-2380 (etcd), TCP 10250 (kubelet), and UDP 8472 (Flannel VXLAN).
* **DNS:** A domain pointing to your cluster (e.g., `git.yourdomain.com`).

---

### Phase 1: The Foundation (K3s Cluster)

We will use **K3s** with embedded etcd. This gives you a true HA control plane without the complexity of "The Hard Way."

**1. Prepare the Nodes (Run on All 3)**
Disable swap (Kubernetes requirement) and update.

```bash
sudo swapoff -a
# Comment out active swap entries; matches tab- or space-separated fstab fields.
sudo sed -i -E '/^[^#].*[[:space:]]swap[[:space:]]/ s/^/#/' /etc/fstab
sudo apt update && sudo apt upgrade -y
```

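
Before installing K3s it can be worth confirming that swap is really off. A minimal sketch, assuming a Linux node where `/proc/swaps` lists active swap areas under a single header line; `swap_is_off` is a hypothetical helper name, not part of K3s:

```bash
#!/usr/bin/env bash
# swap_is_off: hypothetical helper. Succeeds when the given swaps table
# (default /proc/swaps) lists no active swap area.
swap_is_off() {
  local table="${1:-/proc/swaps}"
  local active
  # The first line is a header; every further non-empty line is an active swap area.
  active="$(tail -n +2 "$table" | grep -c . || true)"
  [ "$active" -eq 0 ]
}

if swap_is_off; then
  echo "swap is off"
else
  echo "swap is still active; re-run swapoff -a" >&2
fi
```

Run it on each of the three nodes after the preparation step, and again after a reboot to confirm the `/etc/fstab` change took effect.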
**2. Initialize the First Node (The Seed)**
Run this on **Node 1**:

```bash
curl -sfL https://get.k3s.io | sh -s - server \
  --cluster-init \
  --tls-san git.yourdomain.com \
  --token SECRET_CLUSTER_TOKEN
```

* `--cluster-init`: Tells K3s this is the start of an HA cluster.
* `SECRET_CLUSTER_TOKEN`: Make up a strong password. You need this for the other nodes.

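
One way to produce the token rather than inventing it by hand: a minimal sketch that reads 32 random bytes from `/dev/urandom` and hex-encodes them. The function name `make_cluster_token` is illustrative; K3s accepts any string here and does not require this format.

```bash
#!/usr/bin/env bash
# make_cluster_token: illustrative helper that prints a 64-character hex string
# built from 32 random bytes.
make_cluster_token() {
  head -c 32 /dev/urandom | od -An -v -tx1 | tr -d ' \n'
  echo
}

SECRET_CLUSTER_TOKEN="$(make_cluster_token)"
echo "token length: ${#SECRET_CLUSTER_TOKEN}"
```

Pass the same value to `--token` on all three nodes, and keep it out of shell history and version control.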
**3. Join Nodes 2 & 3**
Run this on **Node 2** and **Node 3**:

```bash
curl -sfL https://get.k3s.io | sh -s - server \
  --server https://<IP_OF_NODE_1>:6443 \
  --token SECRET_CLUSTER_TOKEN
```

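**4. Verification**
On Node 1, run `sudo k3s kubectl get nodes`. You should see all 3 nodes in the `Ready` state, each with the `control-plane` and `etcd` roles.
*(Copy the `/etc/rancher/k3s/k3s.yaml` to your local machine as `~/.kube/config` to manage it remotely. The file points at `https://127.0.0.1:6443`, so change the `server:` address to a node IP or to the `--tls-san` hostname.)*
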
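
The kubeconfig that K3s writes uses the loopback address `https://127.0.0.1:6443`, which only works on the node itself. A minimal sketch of the copy-and-rewrite step, assuming SSH access to Node 1 as a user who can read the file via `sudo`; `rewrite_kubeconfig_server` and the example address `192.0.2.10` are placeholders:

```bash
#!/usr/bin/env bash
# rewrite_kubeconfig_server: placeholder helper. Reads a K3s kubeconfig on stdin
# and prints it with the loopback API address replaced by the given host.
rewrite_kubeconfig_server() {
  local host="$1"
  sed "s#https://127\.0\.0\.1:6443#https://${host}:6443#"
}

# Example (run from your workstation; 192.0.2.10 stands in for Node 1):
#   ssh user@192.0.2.10 sudo cat /etc/rancher/k3s/k3s.yaml \
#     | rewrite_kubeconfig_server 192.0.2.10 > ~/.kube/config
#   chmod 600 ~/.kube/config

# Local demonstration on a single line of input:
printf 'server: https://127.0.0.1:6443\n' | rewrite_kubeconfig_server git.yourdomain.com
# prints: server: https://git.yourdomain.com:6443
```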

---

### Phase 2: The Storage (Longhorn)

**Crucial:** Gitea HA requires a "Shared Filesystem" (ReadWriteMany) so all 3 Gitea pods see the same Git Repos. On bare metal, **Longhorn** is a common way to achieve this. It serves ReadWriteMany volumes over NFS, so each node needs an iSCSI initiator (`open-iscsi`) and an NFS client (`nfs-common`) installed.

**1. Install Longhorn**

```bash
helm repo add longhorn https://charts.longhorn.io
helm repo update
helm install longhorn longhorn/longhorn --namespace longhorn-system --create-namespace
```

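
A minimal pre-flight sketch for the node-side tools Longhorn relies on (an iSCSI initiator, plus an NFS client for ReadWriteMany volumes). It assumes Ubuntu, where `open-iscsi` provides `iscsiadm` and `nfs-common` provides `mount.nfs`; `missing_commands` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# missing_commands: hypothetical helper that prints each named command
# that cannot be found on PATH, one per line.
missing_commands() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

missing="$(missing_commands iscsiadm mount.nfs)"
if [ -n "$missing" ]; then
  echo "missing on this node:" $missing
  echo "install with: sudo apt install -y open-iscsi nfs-common"
else
  echo "Longhorn prerequisites found"
fi
```

Run it on each node before installing the chart.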
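**2. Verify Storage Class**
Run `kubectl get sc`. You should see `longhorn` (default). K3s also ships its own `local-path` class marked as default, so both may carry the `(default)` label; the values files in the following phases name `longhorn` explicitly.

---
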
### Phase 3: The Database (Postgres HA)

We deploy the 3-node Postgres cluster with `pgpool` load balancing.

**1. Create `postgres-values.yaml`**

```yaml
architecture: replication
postgresql:
  replicaCount: 3
pgpool:
  replicaCount: 3
  loadBalancing:
    mode: on # The magic setting for performance
persistence:
  storageClass: "longhorn"
  size: 10Gi
metrics:
  enabled: true # For monitoring later
  serviceMonitor:
    enabled: true # Needs the ServiceMonitor CRD, which only arrives with Phase 7
```

**2. Install**

```bash
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install gitea-db bitnami/postgresql-ha -f postgres-values.yaml
```

---

### Phase 4: The Application (Gitea HA)

Now for the complex part. We need Gitea to be stateless.

**1. Create `gitea-values.yaml`**

```yaml
replicaCount: 3 # Run 3 copies (a top-level key in the Gitea chart)

gitea:
  config:
    database:
      DB_TYPE: postgres
      HOST: gitea-db-postgresql-ha-pgpool:5432 # Point to Pgpool!
      NAME: gitea
      USER: postgres
    # CRITICAL: Shared Storage for Repos
    repository:
      ROOT: /data/git/repositories
    # Use Memcached/Redis for sessions (required for HA).
    # With "memory", each pod keeps its own cache and sessions, so a user
    # may be logged out when a request lands on a different replica.
    cache:
      ADAPTER: memory # Ideally switch to Redis for true HA later
    session:
      PROVIDER: memory # Ideally switch to Redis for true HA later

persistence:
  enabled: true
  accessModes:
    - ReadWriteMany # This demands Longhorn
  size: 20Gi
  storageClass: longhorn

service:
  http:
    type: ClusterIP # Don't expose directly!
```

**2. Install**

```bash
helm repo add gitea-charts https://dl.gitea.io/charts/
helm install gitea gitea-charts/gitea -f gitea-values.yaml
```

---

### Phase 5: The Security (Nginx + SSL)

We stop exposing ports directly and use an Ingress Controller.

**1. Install Nginx Ingress**
K3s comes with Traefik by default. You can keep using it, or disable it by passing `--disable traefik` to the K3s install command in Phase 1. If you prefer **Nginx**, add its chart repository and install it:

```bash
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx
```

**2. Install Cert-Manager (For Let's Encrypt)**

```bash
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
```

**3. Create the Ingress Resource**
Save as `gitea-ingress.yaml` and apply it with `kubectl apply -f gitea-ingress.yaml`. The `cert-manager.io/cluster-issuer` annotation assumes that a `ClusterIssuer` named `letsencrypt-prod` already exists; no certificate is issued until it does.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gitea-ingress
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - git.yourdomain.com
      secretName: gitea-tls-secret
  rules:
    - host: git.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: gitea-http
                port:
                  number: 3000
```

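
The Ingress above refers to a `letsencrypt-prod` issuer that this plan never creates. A hedged sketch of one way to define it, using cert-manager's ACME HTTP-01 solver through the nginx ingress class. The helper name `render_cluster_issuer`, the secret name `letsencrypt-prod-account-key`, and the contact address are assumptions to replace with your own:

```bash
#!/usr/bin/env bash
# render_cluster_issuer: assumed helper that prints a ClusterIssuer manifest
# for the Let's Encrypt production endpoint, using the given contact e-mail.
render_cluster_issuer() {
  local email="$1"
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ${email}
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
EOF
}

# Print the manifest for review. To apply it instead:
#   render_cluster_issuer "admin@yourdomain.com" | kubectl apply -f -
render_cluster_issuer "admin@yourdomain.com"
```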

---

### Phase 6: The "Phoenix" Automation (Bonus)

Deploy the logic we discussed to rotate nodes.

**1. Create the Service Account**
Save as `rbac.yaml`. This gives the bot permission to delete pods and PersistentVolumeClaims and to patch StatefulSets.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: phoenix-sa
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: phoenix-role
rules:
  - apiGroups: [""]
    resources: ["pods", "persistentvolumeclaims"]
    verbs: ["get", "list", "delete"]
  - apiGroups: ["apps"]
    resources: ["statefulsets"]
    verbs: ["get", "list", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: phoenix-binding
subjects:
  - kind: ServiceAccount
    name: phoenix-sa
    namespace: default
roleRef:
  kind: ClusterRole
  name: phoenix-role
  apiGroup: rbac.authorization.k8s.io
```

`kubectl apply -f rbac.yaml`

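
After applying the manifest you can ask the API server whether the binding took effect. A small sketch using `kubectl auth can-i` with impersonation; it assumes the ServiceAccount lives in the `default` namespace, as in the binding above, and that your own user is allowed to impersonate. `sa_subject` is a hypothetical helper that builds the impersonation user name:

```bash
#!/usr/bin/env bash
# sa_subject: hypothetical helper that prints the user name Kubernetes
# assigns to a ServiceAccount: system:serviceaccount:<namespace>:<name>.
sa_subject() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

subject="$(sa_subject default phoenix-sa)"
echo "checking permissions for ${subject}"

# Only query the cluster when kubectl is available on this machine.
# can-i exits non-zero on "no", so the result is printed rather than treated as an error.
if command -v kubectl >/dev/null 2>&1; then
  kubectl auth can-i delete pods --as="$subject" || true
  kubectl auth can-i patch statefulsets.apps --as="$subject" || true
  kubectl auth can-i delete nodes --as="$subject" || true
fi
```

The first two checks should answer `yes`. The third should answer `no`, because the role grants nothing on nodes.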
**2. Deploy the CronJob**
Use the **"3-Node Phoenix Script"** from my previous response. Save it as `phoenix-cron.yaml` and apply it.

---

### Phase 7: Monitoring & Notification (Bonus Material)

You want to know *before* your users do if something breaks.

**1. Install the "Kube-Prometheus-Stack"**
This gives you Prometheus (metrics database), Grafana (Dashboards), and Alertmanager (Notifications) in one go.

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack
```

**2. Configure Alerts (Email/Slack)**
Edit the `alertmanager` config to send notifications.

* **Trigger:** `PostgresqlDown` (If pgpool can't see a backend).
* **Trigger:** `KubePodCrashLooping` (If Gitea is restarting).

**3. The "Dead Man's Switch"**
Since you have a Phoenix strategy that *intentionally* kills pods, you need to **Silence Alerts** during that specific maintenance window (Sunday 3 AM), or you will wake up to panic emails every week.

* You can automate "Silence" creation via the Alertmanager API in your Phoenix script (confirm the Alertmanager service name with `kubectl get svc` first, since it depends on the Helm release name):

  ```bash
  curl -XPOST -H 'Content-Type: application/json' http://monitoring-alertmanager:9093/api/v2/silences -d '{...}'
  ```

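
The request body is left elided in the command above. One possible payload for Alertmanager's v2 silences endpoint is sketched below, assuming GNU `date`, a two-hour window, and a matcher on the `KubePodCrashLooping` alert named earlier. `silence_payload`, the `createdBy` value, and the comment text are illustrative, not taken from the Phoenix script:

```bash
#!/usr/bin/env bash
# silence_payload: illustrative helper that prints a JSON body for
# POST /api/v2/silences. Arguments: start time, end time (RFC 3339), alert name.
silence_payload() {
  local starts_at="$1" ends_at="$2" alertname="$3"
  cat <<EOF
{
  "matchers": [
    {"name": "alertname", "value": "${alertname}", "isRegex": false, "isEqual": true}
  ],
  "startsAt": "${starts_at}",
  "endsAt": "${ends_at}",
  "createdBy": "phoenix-cronjob",
  "comment": "Planned Phoenix rotation"
}
EOF
}

# Two-hour window starting now (GNU date syntax; an assumption about the host).
start="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
end="$(date -u -d '+2 hours' +%Y-%m-%dT%H:%M:%SZ)"
silence_payload "$start" "$end" "KubePodCrashLooping"

# To send it (URL copied from the example above; verify the service name first):
#   silence_payload "$start" "$end" "KubePodCrashLooping" \
#     | curl -sS -XPOST -H 'Content-Type: application/json' \
#         --data @- http://monitoring-alertmanager:9093/api/v2/silences
```

A silence matches on labels, so a single `alertname` matcher covers only that one alert. The rotation job would need one silence per alert it expects to trigger, or a broader matcher such as a namespace label.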
### Summary of Result

You now have:

1. **3 Physical Nodes** mirroring data via Longhorn.
2. **3 Database Replicas** load-balanced by Pgpool.
3. **SSL Termination** handling security.
4. **Auto-Rotation** killing and rebuilding servers weekly.
5. **Monitoring** watching it all.