tegwick eb8a6902b6 docs: add ThreePhoenix architecture concept and workplan

RailianceThreePhoenix: 3-node HA Kubernetes cluster with embedded etcd,
Longhorn distributed storage, PostgreSQL HA (repmgr + Pgpool-II), and
Phoenix CronJob for weekly node rotation to prevent configuration drift.

ThreePhoenixWorkplan: 7-phase implementation plan from blank Ubuntu nodes
to self-healing Gitea cluster with monitoring and alert silencing.

Also adds CLAUDE.md with Custodian State Hub session protocol.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-02-25 01:13:05 +01:00

ThreePhoenixWorkplan

Self-healing, load-balanced application and service hosting

This plan moves to a "3-Node Phoenix" architecture with High Availability (HA) at every layer, all on bare metal.

The staged workplan below goes from 3 blank Ubuntu machines to a self-healing, load-balanced Gitea cluster.

Prerequisite Checklist

Hardware: 3x Ubuntu Servers (22.04 or 24.04 LTS).
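Network: All 3 nodes must be able to reach each other. For K3s with embedded etcd this includes TCP 6443 (API server), TCP 2379-2380 (etcd), TCP 10250 (kubelet), and UDP 8472 (Flannel VXLAN).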
DNS: A domain pointing to your cluster (e.g., git.yourdomain.com).

Phase 1: The Foundation (K3s Cluster)

We will use K3s with embedded etcd. This gives you a true HA control plane without the complexity of "The Hard Way."

1. Prepare the Nodes (Run on All 3) Disable swap (Kubernetes requirement) and update.

sudo swapoff -a
sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
sudo apt update && sudo apt upgrade -y
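To confirm that swap is really off before installing K3s, a small check like the one below can help. It is a minimal sketch that assumes the usual Linux /proc/swaps layout (one header line, then one line per active swap area); the helper name active_swap_count is illustrative only.

```shell
#!/bin/sh
# Count active swap areas listed in a /proc/swaps-style table.
# The first line is a header; every further non-empty line is one active swap area.
# active_swap_count is a hypothetical helper name, not part of K3s or Ubuntu.
active_swap_count() {
  # $1: path to the swap table (defaults to /proc/swaps)
  awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "${1:-/proc/swaps}"
}

# Usage on a node (expected to print 0 after `sudo swapoff -a`):
#   active_swap_count
```

If the function prints anything other than 0 after a reboot, the /etc/fstab edit did not catch every swap entry.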

2. Initialize the First Node (The Seed) Run this on Node 1:

curl -sfL https://get.k3s.io | sh -s - server \
  --cluster-init \
  --tls-san git.yourdomain.com \
  --token SECRET_CLUSTER_TOKEN

--cluster-init: Tells K3s this is the start of an HA cluster.
SECRET_CLUSTER_TOKEN: Replace this with a strong random secret. The other nodes need the same value to join.
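One way to produce a strong token is to read random bytes from the kernel rather than inventing a password. The sketch below assumes /dev/urandom, head, od, and tr are available (true on stock Ubuntu); gen_cluster_token is a hypothetical helper name.

```shell
#!/bin/sh
# Print a 64-character hex string built from 32 random bytes.
# gen_cluster_token is a hypothetical helper; the token is just a shared secret string.
gen_cluster_token() {
  # od prints the bytes as hex pairs; tr removes the spaces and line breaks between them
  head -c 32 /dev/urandom | od -An -v -tx1 | tr -d ' \n'
  echo
}

# Usage: generate once, store it safely, and pass the same value on all 3 nodes:
#   TOKEN=$(gen_cluster_token)
#   ... --token "$TOKEN"
```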

3. Join Nodes 2 & 3 Run this on Node 2 and Node 3:

curl -sfL https://get.k3s.io | sh -s - server \
  --server https://<IP_OF_NODE_1>:6443 \
  --token SECRET_CLUSTER_TOKEN

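4. Verification On Node 1, run sudo k3s kubectl get nodes. You should see all 3 nodes in the Ready state, each with the control-plane and etcd roles. (To manage the cluster remotely, copy /etc/rancher/k3s/k3s.yaml to your local machine as ~/.kube/config and change the server address in it from 127.0.0.1 to a node IP or git.yourdomain.com.)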
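The verification step can be scripted. The following is a sketch under stated assumptions: count_ready_nodes and set_kubeconfig_server are hypothetical helper names, the first expects the output of kubectl get nodes --no-headers on stdin, and the second assumes the copied k3s.yaml still contains the default https://127.0.0.1:6443 server entry.

```shell
#!/bin/sh
# Count nodes whose STATUS column is exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Point a copied k3s.yaml at a reachable address instead of 127.0.0.1.
# $1: kubeconfig path
# $2: node IP or DNS name (must be covered by the API certificate, e.g. via --tls-san)
set_kubeconfig_server() {
  tmp="$1.tmp"
  sed "s#https://127\.0\.0\.1:6443#https://$2:6443#" "$1" > "$tmp" && mv "$tmp" "$1"
}

# Usage:
#   sudo k3s kubectl get nodes --no-headers | count_ready_nodes   # expect 3
#   set_kubeconfig_server ~/.kube/config git.yourdomain.com
```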

Phase 2: The Storage (Longhorn)

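Crucial: Gitea HA requires a shared filesystem (ReadWriteMany) so all 3 Gitea pods see the same Git repositories. On bare metal, Longhorn is a common way to achieve this. Longhorn needs open-iscsi on every node, and its ReadWriteMany volumes also need an NFSv4 client (nfs-common on Ubuntu).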

1. Install Longhorn

helm repo add longhorn https://charts.longhorn.io
helm repo update
helm install longhorn longhorn/longhorn --namespace longhorn-system --create-namespace

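2. Verify Storage Class Run kubectl get sc. You should see longhorn (default). K3s also ships a local-path class that may be marked default as well; if both are, either remove the default flag from one of them or always set storageClass explicitly, as the values files below do.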
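A quick way to spot more than one default StorageClass is to parse the kubectl get sc listing. This is a sketch that assumes the standard column layout, where the default marker appears as a separate "(default)" field after the name; default_storage_classes is a hypothetical helper name.

```shell
#!/bin/sh
# Print the names of all StorageClasses marked "(default)".
# Reads `kubectl get sc --no-headers` output on stdin.
default_storage_classes() {
  awk '$2 == "(default)" { print $1 }'
}

# Usage:
#   kubectl get sc --no-headers | default_storage_classes
#
# If both local-path and longhorn are listed, the default flag can be removed
# from local-path with the standard Kubernetes annotation:
#   kubectl patch storageclass local-path -p \
#     '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
```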

Phase 3: The Database (Postgres HA)

We deploy the 3-node Postgres cluster with pgpool load balancing.

1. Create postgres-values.yaml

architecture: replication
postgresql:
  replicaCount: 3
pgpool:
  replicaCount: 3
  useLoadBalancing: true # Spread reads across replicas; confirm the key with helm show values bitnami/postgresql-ha
persistence:
  storageClass: "longhorn"
  size: 10Gi
metrics:
  enabled: true # For monitoring later
  serviceMonitor:
    enabled: false # Set to true after Phase 7; ServiceMonitor needs the Prometheus Operator CRDs

2. Install

helm repo add bitnami https://charts.bitnami.com/bitnami
helm install gitea-db bitnami/postgresql-ha -f postgres-values.yaml

Phase 4: The Application (Gitea HA)

Now for the complex part. We need Gitea to be stateless.

1. Create gitea-values.yaml

replicaCount: 3 # Run 3 copies; this is a top-level key in the Gitea chart
gitea:
  config:
    database:
      DB_TYPE: postgres
      HOST: gitea-db-postgresql-ha-pgpool:5432 # Point to Pgpool!
      NAME: gitea
      USER: postgres
    # CRITICAL: Shared Storage for Repos
    repository:
      ROOT: /data/git/repositories
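    # Cache and sessions must be shared (Redis or Memcached) for real HA.
    # "memory" is per-pod, so logins can fail when requests hit different replicas.
    cache:
      ADAPTER: memory # Switch to Redis before relying on 3 replicas
    session:
      PROVIDER: memory # Switch to Redis before relying on 3 replicas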

persistence:
  enabled: true
  accessModes:
    - ReadWriteMany # This demands Longhorn
  size: 20Gi
  storageClass: longhorn

service:
  http:
    type: ClusterIP # Don't expose directly!

2. Install

helm repo add gitea-charts https://dl.gitea.io/charts/
helm install gitea gitea-charts/gitea -f gitea-values.yaml

Phase 5: The Security (Nginx + SSL)

We stop exposing ports directly and use an Ingress Controller.

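1. Install Nginx Ingress K3s ships with Traefik as its default ingress controller. Either keep Traefik (and set ingressClassName: traefik in the Ingress below) or disable it by installing K3s with --disable traefik, so the two controllers do not compete for ports 80 and 443. If you prefer Nginx:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx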

2. Install Cert-Manager (For Let's Encrypt)

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml

3. Create the Ingress Resource Save as gitea-ingress.yaml:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gitea-ingress
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - git.yourdomain.com
    secretName: gitea-tls-secret
  rules:
  - host: git.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: gitea-http
            port:
              number: 3000
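The Ingress above references a ClusterIssuer named letsencrypt-prod, which has to exist before cert-manager can request a certificate. The sketch below writes one possible manifest using the HTTP-01 solver through the nginx ingress class. The email address and the account-key Secret name are placeholder assumptions, not values from this workplan.

```shell
#!/bin/sh
# Write a ClusterIssuer manifest matching the Ingress annotation "letsencrypt-prod".
# ACME_EMAIL is a placeholder assumption; set it to a mailbox you actually read.
ACME_EMAIL="${ACME_EMAIL:-admin@yourdomain.com}"

cat > cluster-issuer.yaml <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ${ACME_EMAIL}
    privateKeySecretRef:
      name: letsencrypt-prod-account-key   # Secret that stores the ACME account key (name is an assumption)
    solvers:
    - http01:
        ingress:
          ingressClassName: nginx           # must match the class used by the Ingress
EOF

# Apply once the cert-manager pods are running:
#   kubectl apply -f cluster-issuer.yaml
#   kubectl apply -f gitea-ingress.yaml
```

HTTP-01 validation only works if git.yourdomain.com resolves publicly to the cluster and port 80 is reachable from the internet.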

Phase 6: The "Phoenix" Automation (Bonus)

Deploy the Phoenix CronJob, which rotates nodes weekly to prevent configuration drift.

1. Create the Service Account Save as rbac.yaml. This gives the Phoenix job cluster-wide permission to delete pods and PersistentVolumeClaims and to patch StatefulSets.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: phoenix-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: phoenix-role
rules:
- apiGroups: [""]
  resources: ["pods", "persistentvolumeclaims"]
  verbs: ["get", "list", "delete"]
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["get", "list", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: phoenix-binding
subjects:
- kind: ServiceAccount
  name: phoenix-sa
  namespace: default
roleRef:
  kind: ClusterRole
  name: phoenix-role
  apiGroup: rbac.authorization.k8s.io

kubectl apply -f rbac.yaml
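After applying the RBAC manifest, it may be worth confirming that the ServiceAccount received exactly the intended permissions. The sketch below uses kubectl auth can-i with impersonation; sa_subject and check_phoenix_rbac are hypothetical helper names, and the checks assume your own user is allowed to impersonate ServiceAccounts.

```shell
#!/bin/sh
# Build the impersonation subject string for a ServiceAccount.
# $1: namespace, $2: ServiceAccount name
sa_subject() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Ask the API server what phoenix-sa may do; prints yes/no per check.
# Requires kubectl and a working kubeconfig, so it is defined here but not called.
check_phoenix_rbac() {
  subject=$(sa_subject default phoenix-sa)
  kubectl auth can-i delete pods --as="$subject"                    # expected: yes
  kubectl auth can-i delete persistentvolumeclaims --as="$subject"  # expected: yes
  kubectl auth can-i patch statefulsets.apps --as="$subject"        # expected: yes
  kubectl auth can-i delete nodes --as="$subject"                   # expected: no
}

# Usage:
#   check_phoenix_rbac
```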

2. Deploy the CronJob Save the "3-Node Phoenix Script" (defined separately; it is not reproduced in this workplan) as phoenix-cron.yaml and apply it with kubectl apply -f phoenix-cron.yaml.
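For orientation only, the sketch below writes a CronJob skeleton that wires the phoenix-sa ServiceAccount to a weekly schedule. It is not the "3-Node Phoenix Script": the rotation logic is a placeholder, and the resource name, the output file name, and the container image are all assumptions made for illustration.

```shell
#!/bin/sh
# Write a CronJob skeleton (hypothetical; the real rotation logic is not part of this document).
# KUBECTL_IMAGE is an assumption: any image that ships kubectl and /bin/sh will do.
KUBECTL_IMAGE="${KUBECTL_IMAGE:-bitnami/kubectl:latest}"

cat > phoenix-cron-skeleton.yaml <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: phoenix-rotation            # hypothetical name
  namespace: default                # same namespace as phoenix-sa
spec:
  schedule: "0 3 * * 0"             # Sundays at 03:00
  concurrencyPolicy: Forbid         # never run two rotations at once
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: phoenix-sa
          restartPolicy: Never
          containers:
          - name: phoenix
            image: ${KUBECTL_IMAGE}
            command: ["/bin/sh", "-c"]
            args:
            - |
              # Placeholder only: replace with the actual rotation steps.
              kubectl get pods --all-namespaces
EOF
```

Writing to a separate file name keeps the skeleton from overwriting the real phoenix-cron.yaml.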

Phase 7: Monitoring & Notification (Bonus Material)

You want to know before your users do if something breaks.

1. Install the "Kube-Prometheus-Stack" This gives you Prometheus (Database), Grafana (Dashboards), and Alertmanager (Notifications) in one go.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack

2. Configure Alerts (Email/Slack) Edit the alertmanager config to send notifications.

Trigger: PostgresqlDown (If pgpool can't see a backend).
Trigger: KubePodCrashLooping (If Gitea is restarting).

3. The "Dead Man's Switch" Because the Phoenix strategy intentionally kills pods, silence the affected alerts during that maintenance window (Sunday 3 AM); otherwise you will wake up to panic emails every week.

You can automate silence creation via the Alertmanager API in your Phoenix script. The request body must be sent as JSON, and the Service name depends on your Helm release, so confirm it with kubectl get svc:

curl -XPOST -H 'Content-Type: application/json' http://monitoring-alertmanager:9093/api/v2/silences -d '{...}'
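The silence body follows the Alertmanager v2 API: a list of matchers plus start and end times in RFC 3339 format. The sketch below builds such a body for a single alert name. The helper name silence_payload, the createdBy value, and the comment text are assumptions, and the one-hour window in the usage example is only illustrative.

```shell
#!/bin/sh
# Print an Alertmanager v2 silence body for one alert name.
# $1: alertname to silence
# $2: startsAt (RFC 3339, e.g. 2026-03-01T03:00:00Z)
# $3: endsAt   (RFC 3339)
silence_payload() {
  cat <<EOF
{
  "matchers": [
    {"name": "alertname", "value": "$1", "isRegex": false, "isEqual": true}
  ],
  "startsAt": "$2",
  "endsAt": "$3",
  "createdBy": "phoenix-cronjob",
  "comment": "Scheduled Phoenix rotation"
}
EOF
}

# Usage (GNU date assumed for -d; the Alertmanager Service name must match your release):
#   start=$(date -u +%Y-%m-%dT%H:%M:%SZ)
#   end=$(date -u -d '+1 hour' +%Y-%m-%dT%H:%M:%SZ)
#   silence_payload KubePodCrashLooping "$start" "$end" |
#     curl -sS -XPOST -H 'Content-Type: application/json' \
#       http://monitoring-alertmanager:9093/api/v2/silences -d @-
```

One silence per alert name keeps each window easy to find and expire in the Alertmanager UI.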

Summary of Result

You now have:

3 Physical Nodes mirroring data via Longhorn.
3 Database Replicas load-balanced by Pgpool.
TLS termination at the Ingress, with certificates issued through cert-manager.
Auto-Rotation killing and rebuilding servers weekly.
Monitoring watching it all.
