From eb8a6902b6f3693233bbd13e9cc1c5f53ece8ca9 Mon Sep 17 00:00:00 2001
From: tegwick
Date: Wed, 25 Feb 2026 01:13:05 +0100
Subject: [PATCH] docs: add ThreePhoenix architecture concept and workplan

RailianceThreePhoenix: 3-node HA Kubernetes cluster with embedded etcd,
Longhorn distributed storage, PostgreSQL HA (repmgr + Pgpool-II), and
Phoenix CronJob for weekly node rotation to prevent configuration drift.

ThreePhoenixWorkplan: 7-phase implementation plan from blank Ubuntu nodes
to self-healing Gitea cluster with monitoring and alert silencing.

Also adds CLAUDE.md with Custodian State Hub session protocol.

Co-Authored-By: Claude Sonnet 4.6
---
 CLAUDE.md                     |  78 +++++++++
 wiki/RailianceThreePhoenix.md |  98 +++++++++++
 wiki/ThreePhoenixWorkplan.md  | 301 ++++++++++++++++++++++++++++++++++
 3 files changed, 477 insertions(+)
 create mode 100644 CLAUDE.md
 create mode 100644 wiki/RailianceThreePhoenix.md
 create mode 100644 wiki/ThreePhoenixWorkplan.md

diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..29af0db
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,78 @@
+# railiance-bootstrap — Claude Code Instructions
+
+## Custodian State Hub Integration
+
+This project is tracked as the **railiance** domain in the Custodian State Hub.
+Hub topic ID: `ca369340-a64e-442e-98f1-a4fa7dc74a38`
+
+The State Hub runs locally at http://127.0.0.1:8000. The MCP server (`state-hub`)
+exposes tools for reading and writing state without touching the API directly.
+
+### Session Protocol
+
+**On receiving your first message — before writing any response text — call
+`get_state_summary()` immediately.** Do not greet, do not ask what to do.
+Call the tool first, then respond based on what you find.
+
+**At the start of every session:**
+1. Call `get_state_summary()` — orients you to active workstreams, blocking decisions,
+   and recent progress. If it fails, the API is likely offline:
+   ```
+   cd ~/the-custodian/state-hub && make api
+   ```
+2. Check whether the `railiance` topic has any open workstreams in the summary.
+   - **If workstreams exist:** review blocking decisions before starting work.
+   - **If no workstreams exist:** follow the First Session Protocol below.
+
+**During work:**
+- Use `create_task()` / `update_task_status()` to track concrete deliverables.
+- Use `record_decision()` for any decision that affects direction or dependencies.
+- Use `add_progress_event()` for notable events (milestones, blockers, insights).
+
+**At the end of every session:**
+- Call `add_progress_event()` with a summary of what was accomplished or decided.
+  Include `topic_id: ca369340-a64e-442e-98f1-a4fa7dc74a38` and the relevant `workstream_id`.
+
+### First Session Protocol
+
+Triggered when `get_state_summary()` shows **no workstreams** for the `railiance` topic.
+This means the project is registered but work has not yet been structured.
+
+**Step 1 — Understand the project (read, don't write)**
+- `canon/projects/railiance/project_charter_v0.1.md` in `~/the-custodian/` — purpose, scope, success criteria
+- `canon/projects/railiance/roadmap_v0.1.md` — planned phases
+- Scan this repo root: README, directory structure, any existing code or docs
+
+**Step 2 — Survey in-progress work**
+- Look for TODOs, open branches, half-finished files, or notes
+- Note what is already done vs. what is clearly started but incomplete
+
+**Step 3 — Propose workstreams to Bernd**
+Based on what you found, propose 1–3 workstreams. Each workstream should be:
+- A coherent strand of work lasting weeks to months (not a single task)
+- Named clearly enough that its scope is obvious
+- Anchored to a phase in the roadmap if possible
+
+Present the proposals and **wait for approval before creating anything**.
+
+**Step 4 — Create and populate (after approval)**
+```
+create_workstream(topic_id="ca369340-a64e-442e-98f1-a4fa7dc74a38", title="...", owner="...", description="...")
+create_task(workstream_id="", title="...", priority="high|medium|low")
+# repeat for each task in the workstream
+```
+Aim for 3–7 tasks per workstream at this stage. Tasks should be concrete and actionable.
+
+**Step 5 — Record the setup**
+```
+add_progress_event(
+    summary="First session: structured railiance work into N workstreams, M tasks",
+    event_type="milestone",
+    topic_id="ca369340-a64e-442e-98f1-a4fa7dc74a38",
+    detail={"workstreams": [...], "tasks_created": M}
+)
+```
+
+### Quick Reference
+
+See `~/the-custodian/state-hub/mcp_server/TOOLS.md` for a compact tool reference.
diff --git a/wiki/RailianceThreePhoenix.md b/wiki/RailianceThreePhoenix.md
new file mode 100644
index 0000000..1430ac5
--- /dev/null
+++ b/wiki/RailianceThreePhoenix.md
@@ -0,0 +1,98 @@
+RailianceThreePhoenix
+
+*Three-machine failover and load balancing*
+
+Architecture documentation for **RailianceThreePhoenix** service operations automation.
+
+This document is designed to be the "source of truth" for Railiance infrastructure, enabling deployment of future services (like Zulip, Matrix, ...) using a resilient load-balancing and failover pattern to efficiently run cloud services.
+
+Setting up and running Gitea on PostgreSQL in Kubernetes on Ubuntu will serve as the practical use case and reference implementation for this DevopsPattern.
+
+# ThreePhoenix System Architecture
+
+**Version:** 1.0 | **Status:** Draft | **Type:** High-Availability Kubernetes Cluster
+
+### 1. Executive Summary
+
+The ThreePhoenix architecture is a **self-healing, 3-node Kubernetes cluster** designed for high availability and automated maintenance. It utilizes a "Phoenix Server" pattern where application components are regularly destroyed and recreated from scratch to prevent configuration drift, memory leaks, and state corruption.
+
+### 2. Physical & Infrastructure Layer
+
+* **Hardware:** 3x Ubuntu Server nodes (Physical or Virtual).
+* **Orchestration:** **K3s** (Lightweight Kubernetes).
+* **Topology:** Multi-Master HA (Embedded etcd datastore).
+* **Failure Tolerance:** Cluster survives the loss of any single node (N-1 redundancy).
+
+
+* **Storage (CSI):** **Longhorn** (Distributed Block Storage).
+* **Replication:** Volume data is synchronously replicated across all 3 nodes.
+* **Access Mode:** `ReadWriteMany` (RWX) enabled for shared application data (e.g., Gitea repositories).
+
+
+
+### 3. Application Stack (The Standard Unit)
+
+Every stateful service deployed to the cluster (e.g., Gitea) must adhere to this topology:
+
+| Layer | Component | Configuration Strategy |
+| --- | --- | --- |
+| **Ingress** | Nginx Ingress | **SSL Termination** via Cert-Manager (Let's Encrypt). No ports exposed directly. |
+| **Traffic** | ClusterIP | Internal-only communication. |
+| **Routing** | Pgpool-II | **Load Balancing:** Reads (SELECT) distributed to 3 nodes. Writes (INSERT) sent to Primary. |
+| **Compute** | Stateless App | **ReplicaCount: 3**. Pod anti-affinity ensures one pod per physical node. |
+| **Database** | PostgreSQL HA | **Repmgr Cluster:** 1 Primary, 2 Standbys. Asynchronous replication. |
+| **Data** | Persistent Volume | **Longhorn StorageClass.** ReclaimPolicy: Retain (for safety) or Delete (if relying on Phoenix). |
+
+### 4. The "Phoenix" Automation Engine
+
+A centralized **CronJob** (`phoenix-maintenance`) manages the lifecycle of stateful workloads.
+
+* **Schedule:** Weekly (Sunday 03:00 UTC).
+* **Cycle:** 3-Week Rotation.
+* **Week 1:** Destroy & Re-clone Standby Node B.
+* **Week 2:** Destroy & Re-clone Standby Node C.
+* **Week 3:** **Switchover Event.** Promote Standby B to Primary -> Destroy old Primary Node A.
+
+
+* **Objective:** No database pod lives longer than 21 days.
+
+---
+
+### Appendix A: Acceptance Criteria (The Audit Checklist)
+
+Use this checklist for your monthly/quarterly "Health Check." If any item fails, the system is deteriorating.
+
+#### I. Infrastructure Integrity
+
+* [ ] **Node Health:** All 3 nodes report `Ready` status in `kubectl get nodes`.
+* [ ] **Distribution:** `kubectl get pods -o wide` confirms Gitea pods are running on 3 *different* physical nodes (Anti-Affinity is working).
+* [ ] **Storage Sync:** Longhorn UI shows all volumes have "Healthy" status with **3 replicas**. No "Degraded" volumes allowed.
+
+#### II. Database & Persistence
+
+* [ ] **Cluster State:** `kubectl exec -- repmgr cluster show` lists exactly **1 Primary** and **2 Standbys**.
+* [ ] **Replication Lag:** Lag is `< 1 second` for all standbys (visible in Grafana or Pgpool status).
+* [ ] **Load Balancing:** Pgpool logs confirm `SELECT` queries are being routed to Standby nodes (verifies Read-Scaling is active).
+* [ ] **Backup Validation:** A backup file exists in the external S3 bucket/location with a timestamp `< 24 hours` old. **Crucial:** File size is consistent with previous days.
+
+#### III. Security & Network
+
+* [ ] **SSL Validity:** `git.yourdomain.com` certificate expires in `> 30 days`.
+* [ ] **Port Scan:** Running `nmap` against the public IP reveals **ONLY** ports 80 (HTTP) and 443 (HTTPS). Database ports (5432) must be `Closed`/`Filtered`.
+* [ ] **Ingress Check:** Accessing the application via HTTP automatically redirects to HTTPS (301 Redirect).
+
+#### IV. Phoenix Mechanics
+
+* [ ] **Job History:** `kubectl get jobs` shows the last `phoenix-maintenance` job has status `Completed` (not `Failed`).
+* [ ] **Pod Age:** No `postgresql` pod has an "Age" greater than **22 days**. (If one is 170 days old, the automation is broken).
+
+#### V. Disaster Recovery Drill (Quarterly)
+
+* [ ] **The "Kill" Test:** Manually delete a Gitea Pod.
+* *Pass Criteria:* Site remains accessible (via other 2 pods). New pod spawns and joins within 2 minutes.
+
+
+* [ ] **The "Restore" Test:** Restore the database backup to a *test* namespace.
+* *Pass Criteria:* You can log in and see the latest repositories.
+
+xxx
diff --git a/wiki/ThreePhoenixWorkplan.md b/wiki/ThreePhoenixWorkplan.md
new file mode 100644
index 0000000..db2b87c
--- /dev/null
+++ b/wiki/ThreePhoenixWorkplan.md
@@ -0,0 +1,301 @@
+ThreePhoenixWorkplan
+
+*Self-healing, load-balanced application and service hosting*
+
+ThreePhoenixWorkplan
+
+This is a plan for moving to a "3-Node Phoenix" architecture with High Availability (HA) at every layer—on bare metal.
+
+Here is the staged workplan to go from **3 Blank Ubuntu Machines** to a **Self-Healing, Load-Balanced Gitea Cluster**.
+
+### Prerequisite Checklist
+
+* **Hardware:** 3x Ubuntu Servers (22.04 or 24.04 LTS).
+* **Network:** All 3 nodes must be able to talk to each other.
+* **DNS:** A domain pointing to your cluster (e.g., `git.yourdomain.com`).
+
+---
+
+### Phase 1: The Foundation (K3s Cluster)
+
+We will use **K3s** with embedded etcd. This gives you a true HA control plane without the complexity of "The Hard Way."
+
+**1. Prepare the Nodes (Run on All 3)**
+Disable swap (Kubernetes requirement) and update.
+
+```bash
+sudo swapoff -a
+sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
+sudo apt update && sudo apt upgrade -y
+
+```
+
+**2. Initialize the First Node (The Seed)**
+Run this on **Node 1**:
+
+```bash
+curl -sfL https://get.k3s.io | sh -s - server \
+  --cluster-init \
+  --tls-san git.yourdomain.com \
+  --token SECRET_CLUSTER_TOKEN
+
+```
+
+* `--cluster-init`: Tells K3s this is the start of an HA cluster.
+* `SECRET_CLUSTER_TOKEN`: Make up a strong password. You need this for the other nodes.
+
+**3. Join Nodes 2 & 3**
+Run this on **Node 2** and **Node 3**:
+
+```bash
+curl -sfL https://get.k3s.io | sh -s - server \
+  --server https://:6443 \
+  --token SECRET_CLUSTER_TOKEN
+
+```
+
+**4. Verification**
+On Node 1, run `sudo k3s kubectl get nodes`. You should see 3 Masters.
+*(Copy the `/etc/rancher/k3s/k3s.yaml` to your local machine as `~/.kube/config` to manage it remotely.)*
+
+---
+
+### Phase 2: The Storage (Longhorn)
+
+**Crucial:** Gitea HA requires a "Shared Filesystem" (ReadWriteMany) so all 3 Gitea pods see the same Git Repos. On bare metal, **Longhorn** is the standard way to achieve this.
+
+**1. Install Longhorn**
+
+```bash
+helm repo add longhorn https://charts.longhorn.io
+helm repo update
+helm install longhorn longhorn/longhorn --namespace longhorn-system --create-namespace
+
+```
+
+**2. Verify Storage Class**
+Run `kubectl get sc`. You should see `longhorn` (default).
+
+---
+
+### Phase 3: The Database (Postgres HA)
+
+We deploy the 3-node Postgres cluster with `pgpool` load balancing.
+
+**1. Create `postgres-values.yaml`**
+
+```yaml
+architecture: replication
+postgresql:
+  replicaCount: 3
+pgpool:
+  replicaCount: 3
+  loadBalancing:
+    mode: on # The magic setting for performance
+persistence:
+  storageClass: "longhorn"
+  size: 10Gi
+metrics:
+  enabled: true # For monitoring later
+  serviceMonitor:
+    enabled: true
+
+```
+
+**2. Install**
+
+```bash
+helm repo add bitnami https://charts.bitnami.com/bitnami
+helm install gitea-db bitnami/postgresql-ha -f postgres-values.yaml
+
+```
+
+---
+
+### Phase 4: The Application (Gitea HA)
+
+Now for the complex part. We need Gitea to be stateless.
+
+**1. Create `gitea-values.yaml`**
+
+```yaml
+gitea:
+  replicaCount: 3 # Run 3 copies
+  config:
+    database:
+      DB_TYPE: postgres
+      HOST: gitea-db-postgresql-ha-pgpool:5432 # Point to Pgpool!
+      NAME: gitea
+      USER: postgres
+    # CRITICAL: Shared Storage for Repos
+    repository:
+      ROOT: /data/git/repositories
+    # Use Memcached/Redis for sessions (required for HA)
+    cache:
+      ADAPTER: memory # Ideally switch to Redis for true HA later
+    session:
+      PROVIDER: memory # Ideally switch to Redis for true HA later
+
+persistence:
+  enabled: true
+  accessModes:
+    - ReadWriteMany # This demands Longhorn
+  size: 20Gi
+  storageClass: longhorn
+
+service:
+  http:
+    type: ClusterIP # Don't expose directly!
+
+```
+
+**2. Install**
+
+```bash
+helm repo add gitea-charts https://dl.gitea.io/charts/
+helm install gitea gitea-charts/gitea -f gitea-values.yaml
+
+```
+
+---
+
+### Phase 5: The Security (Nginx + SSL)
+
+We stop exposing ports directly and use an Ingress Controller.
+
+**1. Install Nginx Ingress**
+K3s comes with Traefik by default. You can disable it or use it. If you prefer **Nginx**:
+
+```bash
+helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
+helm install ingress-nginx ingress-nginx/ingress-nginx
+```
+
+**2. Install Cert-Manager (For Let's Encrypt)**
+
+```bash
+kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
+
+```
+
+**3. Create the Ingress Resource**
+Save as `gitea-ingress.yaml`:
+
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  name: gitea-ingress
+  annotations:
+    cert-manager.io/cluster-issuer: "letsencrypt-prod"
+spec:
+  ingressClassName: nginx
+  tls:
+  - hosts:
+    - git.yourdomain.com
+    secretName: gitea-tls-secret
+  rules:
+  - host: git.yourdomain.com
+    http:
+      paths:
+      - path: /
+        pathType: Prefix
+        backend:
+          service:
+            name: gitea-http
+            port:
+              number: 3000
+
+```
+
+---
+
+### Phase 6: The "Phoenix" Automation (Bonus)
+
+Deploy the logic we discussed to rotate nodes.
+
+**1. Create the Service Account**
+Save as `rbac.yaml`. This gives the bot permission to kill pods.
+
+```yaml
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: phoenix-sa
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: phoenix-role
+rules:
+- apiGroups: [""]
+  resources: ["pods", "persistentvolumeclaims"]
+  verbs: ["get", "list", "delete"]
+- apiGroups: ["apps"]
+  resources: ["statefulsets"]
+  verbs: ["get", "list", "patch"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: phoenix-binding
+subjects:
+- kind: ServiceAccount
+  name: phoenix-sa
+  namespace: default
+roleRef:
+  kind: ClusterRole
+  name: phoenix-role
+  apiGroup: rbac.authorization.k8s.io
+
+```
+
+`kubectl apply -f rbac.yaml`
+
+**2. Deploy the CronJob**
+Use the **"3-Node Phoenix Script"** from my previous response. Save it as `phoenix-cron.yaml` and apply it.
+
+---
+
+### Phase 7: Monitoring & Notification (Bonus Material)
+
+You want to know *before* your users do if something breaks.
+
+**1. Install the "Kube-Prometheus-Stack"**
+This gives you Prometheus (Database), Grafana (Dashboards), and Alertmanager (Notifications) in one go.
+
+```bash
+helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+helm install monitoring prometheus-community/kube-prometheus-stack
+
+```
+
+**2. Configure Alerts (Email/Slack)**
+Edit the `alertmanager` config to send notifications.
+
+* **Trigger:** `PostgresqlDown` (If pgpool can't see a backend).
+* **Trigger:** `KubePodCrashLooping` (If Gitea is restarting).
+
+**3. The "Dead Man's Switch"**
+Since you have a Phoenix strategy that *intentionally* kills pods, you need to **Silence Alerts** during that specific maintenance window (Sunday 3 AM), or you will wake up to panic emails every week.
+
+* You can automate "Silence" creation via the Alertmanager API in your Phoenix script:
+```bash
+curl -XPOST http://monitoring-alertmanager:9093/api/v2/silences -d '{...}'
+
+```
+
+
+
+### Summary of Result
+
+You now have:
+
+1. **3 Physical Nodes** mirroring data via Longhorn.
+2. **3 Database Replicas** load-balanced by Pgpool.
+3. **SSL Termination** handling security.
+4. **Auto-Rotation** killing and rebuilding servers weekly.
+5. **Monitoring** watching it all.
+
+
+xxx
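
Phase 7, step 3 of `wiki/ThreePhoenixWorkplan.md` leaves the body of the silence request as `'{...}'`. The sketch below shows one possible shape for that body, assuming the Alertmanager v2 silence fields (`matchers`, `startsAt`, `endsAt`, `createdBy`, `comment`). The helper name `build_silence_payload`, the choice to match on `alertname=KubePodCrashLooping`, the one-hour window, and the `createdBy` and `comment` strings are illustrative assumptions, not part of the patch.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the patch): print an Alertmanager v2
# silence payload as a single line of JSON.
# $1 = window start, $2 = window end, both as RFC 3339 UTC timestamps.
build_silence_payload() {
  local starts_at="$1"
  local ends_at="$2"
  # Assumption: silence one alert by name; real setups may need more matchers.
  printf '{"matchers":[{"name":"%s","value":"%s","isRegex":false,"isEqual":true}],' \
    "alertname" "KubePodCrashLooping"
  printf '"startsAt":"%s","endsAt":"%s",' "$starts_at" "$ends_at"
  # Assumption: createdBy and comment are free-form labels chosen here.
  printf '"createdBy":"%s","comment":"%s"}\n' \
    "phoenix-maintenance" "Planned Phoenix rotation window"
}

# Example: a one-hour window starting on a Sunday at 03:00 UTC.
build_silence_payload "2026-03-01T03:00:00Z" "2026-03-01T04:00:00Z"
```

The output could be piped into the `curl` call from Phase 7, for example by adding `-H 'Content-Type: application/json' -d @-`, once the matcher and window length are adjusted to the alerts and maintenance duration actually in use.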