docs: add ThreePhoenix architecture concept and workplan

RailianceThreePhoenix: 3-node HA Kubernetes cluster with embedded etcd,
Longhorn distributed storage, PostgreSQL HA (repmgr + Pgpool-II), and
Phoenix CronJob for weekly node rotation to prevent configuration drift.

ThreePhoenixWorkplan: 7-phase implementation plan from blank Ubuntu nodes
to self-healing Gitea cluster with monitoring and alert silencing.

Also adds CLAUDE.md with Custodian State Hub session protocol.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 01:13:05 +01:00
parent b7696e657f
commit eb8a6902b6
3 changed files with 477 additions and 0 deletions

CLAUDE.md

@@ -0,0 +1,78 @@
# railiance-bootstrap — Claude Code Instructions
## Custodian State Hub Integration
This project is tracked as the **railiance** domain in the Custodian State Hub.
Hub topic ID: `ca369340-a64e-442e-98f1-a4fa7dc74a38`
The State Hub runs locally at http://127.0.0.1:8000. The MCP server (`state-hub`)
exposes tools for reading and writing state without touching the API directly.
### Session Protocol
**On receiving your first message — before writing any response text — call
`get_state_summary()` immediately.** Do not greet, do not ask what to do.
Call the tool first, then respond based on what you find.
**At the start of every session:**
1. Call `get_state_summary()` — orients you to active workstreams, blocking decisions,
and recent progress. If the call fails, the API is likely offline; start it with:
```
cd ~/the-custodian/state-hub && make api
```
2. Check whether the `railiance` topic has any open workstreams in the summary.
- **If workstreams exist:** review blocking decisions before starting work.
- **If no workstreams exist:** follow the First Session Protocol below.
**During work:**
- Use `create_task()` / `update_task_status()` to track concrete deliverables.
- Use `record_decision()` for any decision that affects direction or dependencies.
- Use `add_progress_event()` for notable events (milestones, blockers, insights).
**At the end of every session:**
- Call `add_progress_event()` with a summary of what was accomplished or decided.
Include `topic_id: ca369340-a64e-442e-98f1-a4fa7dc74a38` and the relevant `workstream_id`.
### First Session Protocol
Triggered when `get_state_summary()` shows **no workstreams** for the `railiance` topic.
This means the project is registered but work has not yet been structured.
**Step 1 — Understand the project (read, don't write)**
- `canon/projects/railiance/project_charter_v0.1.md` in `~/the-custodian/` — purpose, scope, success criteria
- `canon/projects/railiance/roadmap_v0.1.md` — planned phases
- Scan this repo root: README, directory structure, any existing code or docs
**Step 2 — Survey in-progress work**
- Look for TODOs, open branches, half-finished files, or notes
- Note what is already done vs. what is clearly started but incomplete
**Step 3 — Propose workstreams to Bernd**
Based on what you found, propose 1-3 workstreams. Each workstream should be:
- A coherent strand of work lasting weeks to months (not a single task)
- Named clearly enough that its scope is obvious
- Anchored to a phase in the roadmap if possible
Present the proposals and **wait for approval before creating anything**.
**Step 4 — Create and populate (after approval)**
```
create_workstream(topic_id="ca369340-a64e-442e-98f1-a4fa7dc74a38", title="...", owner="...", description="...")
create_task(workstream_id="<id>", title="...", priority="high|medium|low")
# repeat for each task in the workstream
```
Aim for 3-7 tasks per workstream at this stage. Tasks should be concrete and actionable.
**Step 5 — Record the setup**
```
add_progress_event(
summary="First session: structured railiance work into N workstreams, M tasks",
event_type="milestone",
topic_id="ca369340-a64e-442e-98f1-a4fa7dc74a38",
detail={"workstreams": [...], "tasks_created": M}
)
```
### Quick Reference
See `~/the-custodian/state-hub/mcp_server/TOOLS.md` for a compact tool reference.


@@ -0,0 +1,98 @@
RailianceThreePhoenix
*Three-machine failover and load balancing*
Architecture documentation for **RailianceThreePhoenix** service operations automation.
This document is intended as the "source of truth" for Railiance infrastructure. It enables future services (such as Zulip or Matrix) to be deployed on a resilient load-balancing and failover pattern.
Running Gitea on PostgreSQL in Kubernetes on Ubuntu serves as the practical use case and reference implementation for this DevOps pattern.
# ThreePhoenix System Architecture
**Version:** 1.0 | **Status:** Draft | **Type:** High-Availability Kubernetes Cluster
### 1. Executive Summary
The ThreePhoenix architecture is a **self-healing, 3-node Kubernetes cluster** designed for high availability and automated maintenance. It utilizes a "Phoenix Server" pattern where application components are regularly destroyed and recreated from scratch to prevent configuration drift, memory leaks, and state corruption.
### 2. Physical & Infrastructure Layer
* **Hardware:** 3x Ubuntu Server nodes (Physical or Virtual).
* **Orchestration:** **K3s** (Lightweight Kubernetes).
    * **Topology:** Multi-Master HA (Embedded etcd datastore).
    * **Failure Tolerance:** Cluster survives the loss of any single node (N-1 redundancy).
* **Storage (CSI):** **Longhorn** (Distributed Block Storage).
    * **Replication:** Volume data is synchronously replicated across all 3 nodes.
    * **Access Mode:** `ReadWriteMany` (RWX) enabled for shared application data (e.g., Gitea repositories).
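The single-node failure tolerance stated above follows from the majority quorum rule that etcd uses. The sketch below works through that arithmetic; the function names are illustrative and are not part of K3s or etcd.

```bash
#!/usr/bin/env bash
# Quorum arithmetic for a majority-based datastore such as etcd.
# Function names are hypothetical helpers for illustration only.

# Minimum number of members that must agree: floor(n/2) + 1
quorum() {
  local members=$1
  echo $(( members / 2 + 1 ))
}

# Number of members that can fail while quorum is still reachable
fault_tolerance() {
  local members=$1
  echo $(( members - $(quorum "$members") ))
}

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerates=$(fault_tolerance "$n")"
done
```

With 3 members the quorum is 2, so one node can be lost. A fourth member raises the quorum to 3 without improving tolerance, which is why odd cluster sizes are the usual choice.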
### 3. Application Stack (The Standard Unit)
Every stateful service deployed to the cluster (e.g., Gitea) must adhere to this topology:
| Layer | Component | Configuration Strategy |
| --- | --- | --- |
| **Ingress** | Nginx Ingress | **SSL Termination** via Cert-Manager (Let's Encrypt). No ports exposed directly. |
| **Traffic** | ClusterIP | Internal-only communication. |
| **Routing** | Pgpool-II | **Load Balancing:** Reads (SELECT) distributed across all 3 database nodes. Writes (INSERT, UPDATE, DELETE) sent to the Primary. |
| **Compute** | Stateless App | **ReplicaCount: 3**. Pod anti-affinity ensures one pod per physical node. |
| **Database** | PostgreSQL HA | **Repmgr Cluster:** 1 Primary, 2 Standbys. Asynchronous replication. |
| **Data** | Persistent Volume | **Longhorn StorageClass.** ReclaimPolicy: Retain (for safety) or Delete (if relying on Phoenix). |
### 4. The "Phoenix" Automation Engine
A centralized **CronJob** (`phoenix-maintenance`) manages the lifecycle of stateful workloads.
* **Schedule:** Weekly (Sunday 03:00 UTC).
* **Cycle:** 3-Week Rotation.
    * **Week 1:** Destroy & Re-clone Standby Node B.
    * **Week 2:** Destroy & Re-clone Standby Node C.
    * **Week 3:** **Switchover Event.** Promote Standby B to Primary -> Destroy old Primary Node A.
* **Objective:** No database pod lives longer than 21 days.
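One way the weekly job could decide which step of the 3-week cycle to run is to derive a counter from the current date. This is a minimal sketch, not the actual Phoenix script: the step names and the choice of which calendar week counts as "Week 1" are assumptions.

```bash
#!/usr/bin/env bash
# Sketch: map a weekly counter onto the 3-week Phoenix cycle.
# Step names and the week-to-step mapping are assumptions.

# Cron expression for "Sunday 03:00", usable in a CronJob schedule field
PHOENIX_SCHEDULE="0 3 * * 0"

# Whole weeks elapsed since the Unix epoch (UTC)
current_week_index() {
  echo $(( $(date -u +%s) / 604800 ))
}

# Translate a week counter into the action for that run
phoenix_step() {
  local week_index=$1
  case $(( week_index % 3 )) in
    0) echo "reclone-standby-b" ;;
    1) echo "reclone-standby-c" ;;
    2) echo "switchover-then-destroy-old-primary" ;;
  esac
}

echo "schedule: $PHOENIX_SCHEDULE"
echo "this week: $(phoenix_step "$(current_week_index)")"
```

Consecutive Sunday runs are exactly seven days apart, so the counter advances by one per run and each step occurs once every 21 days.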
---
### Appendix A: Acceptance Criteria (The Audit Checklist)
Use this checklist for your monthly/quarterly "Health Check." If any item fails, the system is deteriorating.
#### I. Infrastructure Integrity
* [ ] **Node Health:** All 3 nodes report `Ready` status in `kubectl get nodes`.
* [ ] **Distribution:** `kubectl get pods -o wide` confirms Gitea pods are running on 3 *different* physical nodes (Anti-Affinity is working).
* [ ] **Storage Sync:** Longhorn UI shows all volumes have "Healthy" status with **3 replicas**. No "Degraded" volumes allowed.
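The distribution check can be scripted by counting the distinct nodes that the pods run on. The sketch below assumes its input is one `POD NODE` pair per line; the pod and node names in the example are made up.

```bash
#!/usr/bin/env bash
# Sketch: count how many distinct nodes a set of pods is scheduled on.
# Reads "POD NODE" pairs on stdin, one per line.
distinct_node_count() {
  awk 'NF >= 2 { print $2 }' | sort -u | wc -l | tr -d ' '
}

# Succeeds only when the pods are spread over the expected number of nodes
pods_are_spread() {
  local expected=${1:-3}
  [ "$(distinct_node_count)" -eq "$expected" ]
}

# Example with hypothetical pod and node names
printf '%s\n' "gitea-0 node-a" "gitea-1 node-b" "gitea-2 node-c" | distinct_node_count
```

Against a live cluster, the input could come from `kubectl get pods -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName --no-headers`, filtered down to the Gitea pods.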
#### II. Database & Persistence
* [ ] **Cluster State:** `kubectl exec <primary-pod> -- repmgr cluster show` lists exactly **1 Primary** and **2 Standbys**.
* [ ] **Replication Lag:** Lag is `< 1 second` for all standbys (visible in Grafana or Pgpool status).
* [ ] **Load Balancing:** Pgpool logs confirm `SELECT` queries are being routed to Standby nodes (verifies Read-Scaling is active).
* [ ] **Backup Validation:** A backup file exists in the external S3 bucket/location with a timestamp `< 24 hours` old. **Crucial:** File size is consistent with previous days.
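The backup item combines two tests: age under 24 hours and a size close to the previous backup. A possible way to make "consistent" measurable is a percentage tolerance. The 20% threshold below is an assumed value, not one taken from this document.

```bash
#!/usr/bin/env bash
# Sketch: freshness and size checks for a backup file.
# The 20% default tolerance is an assumption.

# Succeeds if the backup is younger than 24 hours.
# Arguments: backup mtime and current time, both in epoch seconds.
backup_is_fresh() {
  local mtime_s=$1 now_s=$2
  (( now_s - mtime_s < 86400 ))
}

# Succeeds if the size is within tolerance_pct of the previous size.
size_is_consistent() {
  local size=$1 previous=$2 tolerance_pct=${3:-20}
  local diff=$(( size - previous ))
  (( diff < 0 )) && diff=$(( -diff ))
  (( diff * 100 <= previous * tolerance_pct ))
}

# Example with made-up numbers: 6 hours old, 4% larger than yesterday
backup_is_fresh 1000000 1021600 && echo "fresh"
size_is_consistent 104 100 && echo "size ok"
```

A sudden drop in size is the more worrying direction, since it often indicates a truncated or empty dump.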
#### III. Security & Network
* [ ] **SSL Validity:** `git.yourdomain.com` certificate expires in `> 30 days`.
* [ ] **Port Scan:** Running `nmap` against the public IP reveals **ONLY** ports 80 (HTTP) and 443 (HTTPS). Database ports (5432) must be `Closed`/`Filtered`.
* [ ] **Ingress Check:** Accessing the application via HTTP automatically redirects to HTTPS (301 Redirect).
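The certificate check can be automated by converting the certificate's `notAfter` date into days remaining. This sketch assumes GNU `date` (as shipped with Ubuntu) and takes the date string in the form that `openssl x509 -noout -enddate` prints after the `notAfter=` prefix.

```bash
#!/usr/bin/env bash
# Sketch: days remaining before a certificate's notAfter date.
# Assumes GNU date for parsing the date strings.

# Arguments: notAfter string as printed by openssl, optional reference time
days_until_expiry() {
  local not_after=$1 now=${2:-now}
  local expiry_s now_s
  expiry_s=$(date -u -d "$not_after" +%s)
  now_s=$(date -u -d "$now" +%s)
  echo $(( (expiry_s - now_s) / 86400 ))
}

# Example with a fixed reference time so the result is reproducible
days_until_expiry "Mar  1 12:00:00 2026 GMT" "2026-01-15T12:00:00Z"
```

The live value could be obtained with `openssl s_client -connect git.yourdomain.com:443 -servername git.yourdomain.com </dev/null 2>/dev/null | openssl x509 -noout -enddate | cut -d= -f2`.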
#### IV. Phoenix Mechanics
* [ ] **Job History:** `kubectl get jobs` shows the last `phoenix-maintenance` job has status `Completed` (not `Failed`).
* [ ] **Pod Age:** No `postgresql` pod has an "Age" greater than **22 days**. (If one is 170 days old, the automation is broken).
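The pod age item lends itself to a scripted check. The sketch below assumes GNU `date` and an input of one `POD CREATION_TIMESTAMP` pair per line; the pod names in the example are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: flag pods older than a maximum age in days.
# Assumes GNU date; the 22-day limit mirrors the checklist item.

# Whole days between a creation timestamp and a reference time
age_days() {
  local created=$1 now=${2:-now}
  echo $(( ( $(date -u -d "$now" +%s) - $(date -u -d "$created" +%s) ) / 86400 ))
}

# Reads "POD CREATION_TIMESTAMP" lines on stdin and prints pods over the limit
pods_over_age() {
  local max_days=$1 now=${2:-now} pod created
  while read -r pod created; do
    [ -n "$pod" ] || continue
    if (( $(age_days "$created" "$now") > max_days )); then
      echo "$pod"
    fi
  done
}

# Example with hypothetical pods and a fixed reference time
printf '%s\n' "pg-0 2026-02-20T03:00:00Z" "pg-1 2025-09-01T03:00:00Z" \
  | pods_over_age 22 "2026-02-25T00:00:00Z"
```

Real input could come from `kubectl get pods -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp --no-headers`. Any output from `pods_over_age 22` means the rotation has stalled.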
#### V. Disaster Recovery Drill (Quarterly)
* [ ] **The "Kill" Test:** Manually delete a Gitea Pod.
    * *Pass Criteria:* Site remains accessible (via other 2 pods). New pod spawns and joins within 2 minutes.
* [ ] **The "Restore" Test:** Restore the database backup to a *test* namespace.
    * *Pass Criteria:* You can log in and see the latest repositories.


@@ -0,0 +1,301 @@
ThreePhoenixWorkplan
*Self-healing, load-balanced application and service hosting*
This is a plan for moving to a "3-Node Phoenix" architecture with High Availability (HA) at every layer—on bare metal.
Here is the staged workplan to go from **3 Blank Ubuntu Machines** to a **Self-Healing, Load-Balanced Gitea Cluster**.
### Prerequisite Checklist
* **Hardware:** 3x Ubuntu Servers (22.04 or 24.04 LTS).
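* **Network:** All 3 nodes must be able to reach each other. K3s with embedded etcd needs TCP 6443 (API server), TCP 2379-2380 (etcd), UDP 8472 (Flannel VXLAN), and TCP 10250 (kubelet) open between nodes.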
* **DNS:** A domain pointing to your cluster (e.g., `git.yourdomain.com`).
---
### Phase 1: The Foundation (K3s Cluster)
We will use **K3s** with embedded etcd. This gives you a true HA control plane without the complexity of "The Hard Way."
**1. Prepare the Nodes (Run on All 3)**
Disable swap (Kubernetes requirement) and update.
```bash
sudo swapoff -a
sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
sudo apt update && sudo apt upgrade -y
```
**2. Initialize the First Node (The Seed)**
Run this on **Node 1**:
```bash
curl -sfL https://get.k3s.io | sh -s - server \
--cluster-init \
--tls-san git.yourdomain.com \
--token SECRET_CLUSTER_TOKEN
```
* `--cluster-init`: Tells K3s this is the start of an HA cluster.
* `SECRET_CLUSTER_TOKEN`: Make up a strong password. You need this for the other nodes.
**3. Join Nodes 2 & 3**
Run this on **Node 2** and **Node 3**:
```bash
curl -sfL https://get.k3s.io | sh -s - server \
--server https://<IP_OF_NODE_1>:6443 \
--token SECRET_CLUSTER_TOKEN
```
**4. Verification**
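On Node 1, run `sudo k3s kubectl get nodes`. You should see all 3 nodes in `Ready` state with the `control-plane` and `etcd` roles.
*(To manage the cluster remotely, copy `/etc/rancher/k3s/k3s.yaml` to your local machine as `~/.kube/config` and change its `server` address from `127.0.0.1` to a node IP or the `--tls-san` hostname.)*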
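The kubeconfig that K3s writes points at `https://127.0.0.1:6443`, so a copy used from another machine needs its server address rewritten. A minimal sketch of that rewrite follows; the function name is illustrative.

```bash
#!/usr/bin/env bash
# Sketch: point a copied K3s kubeconfig at a reachable address.
# Reads the kubeconfig on stdin and writes the rewritten file to stdout.
rewrite_kubeconfig_server() {
  local host=$1
  sed "s#https://127\.0\.0\.1:6443#https://${host}:6443#"
}

# Example on a single sample line
echo "    server: https://127.0.0.1:6443" | rewrite_kubeconfig_server git.yourdomain.com
```

The host passed in should be a node IP or a name covered by `--tls-san`, otherwise TLS verification against the API server certificate will fail.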
---
### Phase 2: The Storage (Longhorn)
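**Crucial:** Gitea HA requires a shared filesystem (`ReadWriteMany`) so all 3 Gitea pods see the same Git repos. On bare metal, **Longhorn** is a common way to achieve this. Longhorn needs `open-iscsi` on every node, and its RWX volumes additionally need an NFSv4 client (`nfs-common` on Ubuntu), so install both first.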
**1. Install Longhorn**
```bash
helm repo add longhorn https://charts.longhorn.io
helm repo update
helm install longhorn longhorn/longhorn --namespace longhorn-system --create-namespace
```
**2. Verify Storage Class**
Run `kubectl get sc`. You should see `longhorn` (default).
---
### Phase 3: The Database (Postgres HA)
We deploy the 3-node Postgres cluster with `pgpool` load balancing.
**1. Create `postgres-values.yaml`**
```yaml
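# Key names vary between chart versions; confirm with `helm show values bitnami/postgresql-ha`
architecture: replication
postgresql:
  replicaCount: 3
pgpool:
  replicaCount: 3
  loadBalancing:
    mode: on # The magic setting for performance
persistence:
  storageClass: "longhorn"
  size: 10Gi
metrics:
  enabled: true # For monitoring later
  serviceMonitor:
    enabled: true # Needs the Prometheus Operator CRDs from Phase 7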
```
**2. Install**
```bash
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install gitea-db bitnami/postgresql-ha -f postgres-values.yaml
```
---
### Phase 4: The Application (Gitea HA)
Now for the complex part. We need Gitea to be stateless.
**1. Create `gitea-values.yaml`**
```yaml
replicaCount: 3 # Run 3 copies (top-level key in the Gitea chart)
gitea:
  config:
    database:
      DB_TYPE: postgres
      HOST: gitea-db-postgresql-ha-pgpool:5432 # Point to Pgpool!
      NAME: gitea
      USER: postgres
    # CRITICAL: Shared Storage for Repos
    repository:
      ROOT: /data/git/repositories
    # Use Memcached/Redis for sessions (required for HA)
    cache:
      ADAPTER: memory # Ideally switch to Redis for true HA later
    session:
      PROVIDER: memory # Ideally switch to Redis for true HA later
persistence:
  enabled: true
  accessModes:
    - ReadWriteMany # This demands Longhorn
  size: 20Gi
  storageClass: longhorn
service:
  http:
    type: ClusterIP # Don't expose directly!
```
**2. Install**
```bash
helm repo add gitea-charts https://dl.gitea.io/charts/
helm install gitea gitea-charts/gitea -f gitea-values.yaml
```
---
### Phase 5: The Security (Nginx + SSL)
We stop exposing ports directly and use an Ingress Controller.
**1. Install Nginx Ingress**
K3s comes with Traefik by default. You can disable it or use it. If you prefer **Nginx**:
```bash
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx
```
**2. Install Cert-Manager (For Let's Encrypt)**
```bash
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
```
**3. Create the Ingress Resource**
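Save as `gitea-ingress.yaml`. The `cluster-issuer` annotation assumes a cert-manager `ClusterIssuer` named `letsencrypt-prod` has already been created: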
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gitea-ingress
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - git.yourdomain.com
      secretName: gitea-tls-secret
  rules:
    - host: git.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: gitea-http
                port:
                  number: 3000
```
---
### Phase 6: The "Phoenix" Automation (Bonus)
Deploy the node rotation logic described in section 4 of the architecture document.
**1. Create the Service Account**
Save as `rbac.yaml`. This gives the bot permission to kill pods.
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: phoenix-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: phoenix-role
rules:
  - apiGroups: [""]
    resources: ["pods", "persistentvolumeclaims"]
    verbs: ["get", "list", "delete"]
  - apiGroups: ["apps"]
    resources: ["statefulsets"]
    verbs: ["get", "list", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: phoenix-binding
subjects:
  - kind: ServiceAccount
    name: phoenix-sa
    namespace: default
roleRef:
  kind: ClusterRole
  name: phoenix-role
  apiGroup: rbac.authorization.k8s.io
```
`kubectl apply -f rbac.yaml`
**2. Deploy the CronJob**
Save the **"3-Node Phoenix Script"** (not included in this document) as `phoenix-cron.yaml` and apply it.
---
### Phase 7: Monitoring & Notification (Bonus Material)
You want to know *before* your users do if something breaks.
**1. Install the "Kube-Prometheus-Stack"**
This gives you Prometheus (Database), Grafana (Dashboards), and Alertmanager (Notifications) in one go.
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack
```
**2. Configure Alerts (Email/Slack)**
Edit the `alertmanager` config to send notifications.
* **Trigger:** `PostgresqlDown` (If pgpool can't see a backend).
* **Trigger:** `KubePodCrashLooping` (If Gitea is restarting).
**3. The "Dead Man's Switch"**
Since you have a Phoenix strategy that *intentionally* kills pods, you need to **Silence Alerts** during that specific maintenance window (Sunday 3 AM), or you will wake up to panic emails every week.
* You can automate "Silence" creation via the Alertmanager API in your Phoenix script:
```bash
curl -XPOST -H 'Content-Type: application/json' http://monitoring-alertmanager:9093/api/v2/silences -d '{...}'
```
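The request body for the silence has to carry matchers, a start time, and an end time. The sketch below builds one plausible body for the maintenance window. It assumes GNU `date`, and the matcher (`namespace=default`), the creator name, and the comment are illustrative choices rather than values from this workplan.

```bash
#!/usr/bin/env bash
# Sketch: build the JSON body for an Alertmanager v2 silence.
# Assumes GNU date. Matcher, creator, and comment are assumptions.

# Arguments: start time (RFC 3339, UTC) and window length in minutes
silence_payload() {
  local starts_at=$1 minutes=$2
  local start_s ends_at
  start_s=$(date -u -d "$starts_at" +%s)
  ends_at=$(date -u -d "@$(( start_s + minutes * 60 ))" +%Y-%m-%dT%H:%M:%SZ)
  printf '{"matchers":[{"name":"namespace","value":"default","isRegex":false,"isEqual":true}],'
  printf '"startsAt":"%s","endsAt":"%s",' "$starts_at" "$ends_at"
  printf '"createdBy":"phoenix-maintenance","comment":"Planned Phoenix rotation"}\n'
}

# Example: a 90-minute window starting Sunday 03:00 UTC
silence_payload "2026-03-01T03:00:00Z" 90
```

The output can be piped straight into the request with `curl -XPOST -H 'Content-Type: application/json' -d @- <alertmanager-url>/api/v2/silences`. A narrow matcher is preferable to a blanket silence, so that unrelated alerts still fire during the window.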
### Summary of Result
You now have:
1. **3 Physical Nodes** mirroring data via Longhorn.
2. **3 Database Replicas** load-balanced by Pgpool.
3. **SSL Termination** handling security.
4. **Auto-Rotation** killing and rebuilding servers weekly.
5. **Monitoring** watching it all.