docs: add ThreePhoenix architecture concept and workplan
RailianceThreePhoenix: 3-node HA Kubernetes cluster with embedded etcd, Longhorn distributed storage, PostgreSQL HA (repmgr + Pgpool-II), and Phoenix CronJob for weekly node rotation to prevent configuration drift.

ThreePhoenixWorkplan: 7-phase implementation plan from blank Ubuntu nodes to self-healing Gitea cluster with monitoring and alert silencing.

Also adds CLAUDE.md with Custodian State Hub session protocol.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
78
CLAUDE.md
Normal file
@@ -0,0 +1,78 @@
|
||||
# railiance-bootstrap — Claude Code Instructions
|
||||
|
||||
## Custodian State Hub Integration
|
||||
|
||||
This project is tracked as the **railiance** domain in the Custodian State Hub.
|
||||
Hub topic ID: `ca369340-a64e-442e-98f1-a4fa7dc74a38`
|
||||
|
||||
The State Hub runs locally at http://127.0.0.1:8000. The MCP server (`state-hub`)
|
||||
exposes tools for reading and writing state without touching the API directly.
|
||||
|
||||
### Session Protocol
|
||||
|
||||
**On receiving your first message — before writing any response text — call
|
||||
`get_state_summary()` immediately.** Do not greet, do not ask what to do.
|
||||
Call the tool first, then respond based on what you find.
|
||||
|
||||
**At the start of every session:**
|
||||
1. Call `get_state_summary()` to orient yourself on active workstreams, blocking decisions,
   and recent progress. If it fails, the API is likely offline:
   ```
   cd ~/the-custodian/state-hub && make api
   ```
2. Check whether the `railiance` topic has any open workstreams in the summary.
   - **If workstreams exist:** review blocking decisions before starting work.
   - **If no workstreams exist:** follow the First Session Protocol below.
|
||||
|
||||
**During work:**
|
||||
- Use `create_task()` / `update_task_status()` to track concrete deliverables.
|
||||
- Use `record_decision()` for any decision that affects direction or dependencies.
|
||||
- Use `add_progress_event()` for notable events (milestones, blockers, insights).
|
||||
|
||||
**At the end of every session:**
|
||||
- Call `add_progress_event()` with a summary of what was accomplished or decided.
|
||||
Include `topic_id: ca369340-a64e-442e-98f1-a4fa7dc74a38` and the relevant `workstream_id`.
|
||||
|
||||
### First Session Protocol
|
||||
|
||||
Triggered when `get_state_summary()` shows **no workstreams** for the `railiance` topic.
|
||||
This means the project is registered but work has not yet been structured.
|
||||
|
||||
**Step 1 — Understand the project (read, don't write)**
|
||||
- `canon/projects/railiance/project_charter_v0.1.md` in `~/the-custodian/` — purpose, scope, success criteria
|
||||
- `canon/projects/railiance/roadmap_v0.1.md` — planned phases
|
||||
- Scan this repo root: README, directory structure, any existing code or docs
|
||||
|
||||
**Step 2 — Survey in-progress work**
|
||||
- Look for TODOs, open branches, half-finished files, or notes
|
||||
- Note what is already done vs. what is clearly started but incomplete
|
||||
|
||||
**Step 3 — Propose workstreams to Bernd**
|
||||
Based on what you found, propose 1–3 workstreams. Each workstream should be:
|
||||
- A coherent strand of work lasting weeks to months (not a single task)
|
||||
- Named clearly enough that its scope is obvious
|
||||
- Anchored to a phase in the roadmap if possible
|
||||
|
||||
Present the proposals and **wait for approval before creating anything**.
|
||||
|
||||
**Step 4 — Create and populate (after approval)**
|
||||
```
|
||||
create_workstream(topic_id="ca369340-a64e-442e-98f1-a4fa7dc74a38", title="...", owner="...", description="...")
|
||||
create_task(workstream_id="<id>", title="...", priority="high|medium|low")
|
||||
# repeat for each task in the workstream
|
||||
```
|
||||
Aim for 3–7 tasks per workstream at this stage. Tasks should be concrete and actionable.
|
||||
|
||||
**Step 5 — Record the setup**
|
||||
```
add_progress_event(
    summary="First session: structured railiance work into N workstreams, M tasks",
    event_type="milestone",
    topic_id="ca369340-a64e-442e-98f1-a4fa7dc74a38",
    detail={"workstreams": [...], "tasks_created": M}
)
```
|
||||
|
||||
### Quick Reference
|
||||
|
||||
See `~/the-custodian/state-hub/mcp_server/TOOLS.md` for a compact tool reference.
|
||||
98
wiki/RailianceThreePhoenix.md
Normal file
@@ -0,0 +1,98 @@
|
||||
RailianceThreePhoenix
|
||||
|
||||
*Three-machine failover load balancing*
|
||||
|
||||
Architecture documentation for **RailianceThreePhoenix** service operations automation.
|
||||
|
||||
This document is designed to be the "source of truth" for Railiance infrastructure. It enables future services (such as Zulip or Matrix) to be deployed with a resilient load-balancing and failover pattern for running cloud services efficiently.

Setting up and running Gitea on PostgreSQL in Kubernetes on Ubuntu serves as the practical use case and reference implementation for this DevopsPattern.
|
||||
|
||||
# ThreePhoenix System Architecture
|
||||
|
||||
**Version:** 1.0 | **Status:** Draft | **Type:** High-Availability Kubernetes Cluster
|
||||
|
||||
### 1. Executive Summary
|
||||
|
||||
The ThreePhoenix architecture is a **self-healing, 3-node Kubernetes cluster** designed for high availability and automated maintenance. It utilizes a "Phoenix Server" pattern where application components are regularly destroyed and recreated from scratch to prevent configuration drift, memory leaks, and state corruption.
|
||||
|
||||
### 2. Physical & Infrastructure Layer
|
||||
|
||||
* **Hardware:** 3x Ubuntu Server nodes (Physical or Virtual).
* **Orchestration:** **K3s** (Lightweight Kubernetes).
  * **Topology:** Multi-Master HA (Embedded etcd datastore).
  * **Failure Tolerance:** Cluster survives the loss of any single node (N-1 redundancy).
* **Storage (CSI):** **Longhorn** (Distributed Block Storage).
  * **Replication:** Volume data is synchronously replicated across all 3 nodes.
  * **Access Mode:** `ReadWriteMany` (RWX) enabled for shared application data (e.g., Gitea repositories).
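
The single-node failure tolerance stated above follows from etcd's majority quorum: a cluster of `n` members needs `n/2 + 1` (integer division) healthy members to accept writes. The following is a minimal sketch of that arithmetic; the helper names are hypothetical and not part of K3s or etcd.

```bash
# Hypothetical helpers illustrating etcd majority-quorum arithmetic.

# Members that must be healthy for the cluster to accept writes.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Members that may fail while quorum is still held.
etcd_fault_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

for n in 1 3 5; do
  echo "members=$n quorum=$(etcd_quorum "$n") tolerated_failures=$(etcd_fault_tolerance "$n")"
done
```

With 3 members the quorum is 2, so exactly one node can be lost. A fourth node would not raise the tolerance, which is why odd member counts are the usual choice.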
|
||||
|
||||
|
||||
|
||||
### 3. Application Stack (The Standard Unit)
|
||||
|
||||
Every stateful service deployed to the cluster (e.g., Gitea) must adhere to this topology:
|
||||
|
||||
| Layer | Component | Configuration Strategy |
| --- | --- | --- |
| **Ingress** | Nginx Ingress | **SSL Termination** via Cert-Manager (Let's Encrypt). No ports exposed directly. |
| **Traffic** | ClusterIP | Internal-only communication. |
| **Routing** | Pgpool-II | **Load Balancing:** Reads (`SELECT`) distributed across all 3 database nodes. Writes (`INSERT`, `UPDATE`, `DELETE`) sent to the Primary. |
| **Compute** | Stateless App | **ReplicaCount: 3**. Pod anti-affinity ensures one pod per physical node. |
| **Database** | PostgreSQL HA | **Repmgr Cluster:** 1 Primary, 2 Standbys. Asynchronous replication. |
| **Data** | Persistent Volume | **Longhorn StorageClass.** ReclaimPolicy: Retain (for safety) or Delete (if relying on Phoenix). |
|
||||
|
||||
### 4. The "Phoenix" Automation Engine
|
||||
|
||||
A centralized **CronJob** (`phoenix-maintenance`) manages the lifecycle of stateful workloads.
|
||||
|
||||
* **Schedule:** Weekly (Sunday 03:00 UTC).
* **Cycle:** 3-Week Rotation.
  * **Week 1:** Destroy & Re-clone Standby Node B.
  * **Week 2:** Destroy & Re-clone Standby Node C.
  * **Week 3:** **Switchover Event.** Promote Standby B to Primary -> Destroy old Primary Node A.
* **Objective:** No database pod lives longer than 21 days.
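
One way a weekly job could decide which step of the 3-week cycle to run is to take a week counter modulo 3. The sketch below is an illustration only, not the actual `phoenix-maintenance` script: the step names are invented, and it counts weeks since the Unix epoch rather than the ISO week of the year, because years have 52 or 53 weeks and a week-of-year counter would repeat a step at the year boundary.

```bash
# Sketch only: the step names are made up for illustration.

# Whole weeks since the Unix epoch; consecutive Sunday runs differ by exactly 1.
weeks_since_epoch() {
  echo $(( $(date -u +%s) / 604800 ))
}

# Map a week counter onto the 3-week cycle.
phoenix_step() {
  case $(( $1 % 3 )) in
    0) echo "reclone-standby-b" ;;
    1) echo "reclone-standby-c" ;;
    2) echo "switchover-then-destroy-old-primary" ;;
  esac
}

phoenix_step "$(weeks_since_epoch)"
```

Because the step is derived from the clock rather than stored state, a skipped or failed run does not shift the rotation; the next run simply executes whichever step belongs to that week.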
|
||||
|
||||
---
|
||||
|
||||
### Appendix A: Acceptance Criteria (The Audit Checklist)
|
||||
|
||||
Use this checklist for your monthly/quarterly "Health Check." If any item fails, the system is deteriorating.
|
||||
|
||||
#### I. Infrastructure Integrity
|
||||
|
||||
* [ ] **Node Health:** All 3 nodes report `Ready` status in `kubectl get nodes`.
|
||||
* [ ] **Distribution:** `kubectl get pods -o wide` confirms Gitea pods are running on 3 *different* physical nodes (Anti-Affinity is working).
|
||||
* [ ] **Storage Sync:** Longhorn UI shows all volumes have "Healthy" status with **3 replicas**. No "Degraded" volumes allowed.
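
The node health item can be scripted by counting `Ready` rows in the `kubectl get nodes` output. This is a small sketch with hypothetical function names; it assumes the default column layout, where the second column is the node status.

```bash
# Reads `kubectl get nodes --no-headers` output on stdin and prints the
# number of nodes whose STATUS column is exactly "Ready".
# Cordoned nodes ("Ready,SchedulingDisabled") are deliberately not counted.
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Succeeds only if exactly the expected number of nodes is Ready.
all_nodes_ready() {
  [ "$(count_ready_nodes)" -eq "$1" ]
}

# Usage:
#   kubectl get nodes --no-headers | all_nodes_ready 3 && echo "nodes OK"
```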
|
||||
|
||||
#### II. Database & Persistence
|
||||
|
||||
* [ ] **Cluster State:** `kubectl exec <primary-pod> -- repmgr cluster show` lists exactly **1 Primary** and **2 Standbys**.
|
||||
* [ ] **Replication Lag:** Lag is `< 1 second` for all standbys (visible in Grafana or Pgpool status).
|
||||
* [ ] **Load Balancing:** Pgpool logs confirm `SELECT` queries are being routed to Standby nodes (verifies Read-Scaling is active).
|
||||
* [ ] **Backup Validation:** A backup file exists in the external S3 bucket/location with a timestamp `< 24 hours` old. **Crucial:** File size is consistent with previous days.
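
The backup validation item combines two checks: freshness (under 24 hours) and size consistency with the previous day. A minimal sketch of both follows. The function names and the percentage tolerance are assumptions, since the checklist does not say how much size variation counts as "consistent", and obtaining the timestamp and sizes from the bucket is left to whatever S3 tooling is in use.

```bash
# Succeeds if the backup timestamp (epoch seconds) is less than 24 hours
# older than "now" (epoch seconds).
backup_is_fresh() {
  local mtime="$1" now="$2"
  [ $(( now - mtime )) -lt 86400 ]
}

# Succeeds if the current size deviates from the previous size by at most
# the given percentage. Sizes are in bytes; the tolerance is an assumption.
backup_size_consistent() {
  local cur="$1" prev="$2" pct="$3" diff
  diff=$(( cur - prev ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( -diff ))
  fi
  [ $(( diff * 100 )) -le $(( prev * pct )) ]
}

# Usage (values are placeholders):
#   backup_is_fresh "$BACKUP_EPOCH" "$(date -u +%s)" && backup_size_consistent "$TODAY_BYTES" "$YESTERDAY_BYTES" 20
```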
|
||||
|
||||
#### III. Security & Network
|
||||
|
||||
* [ ] **SSL Validity:** `git.yourdomain.com` certificate expires in `> 30 days`.
|
||||
* [ ] **Port Scan:** Running `nmap` against the public IP reveals **ONLY** ports 80 (HTTP) and 443 (HTTPS). Database ports (5432) must be `Closed`/`Filtered`.
|
||||
* [ ] **Ingress Check:** Accessing the application via HTTP automatically redirects to HTTPS (301 Redirect).
|
||||
|
||||
#### IV. Phoenix Mechanics
|
||||
|
||||
* [ ] **Job History:** `kubectl get jobs` shows the last `phoenix-maintenance` job has status `Completed` (not `Failed`).
|
||||
* [ ] **Pod Age:** No `postgresql` pod has an "Age" greater than **22 days**. (If one is 170 days old, the automation is broken).
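
The pod age item can be checked from each pod's `creationTimestamp`. The sketch below uses a hypothetical function name and assumes GNU `date`; it reads `name timestamp` pairs on stdin and prints every pod older than the given number of days together with its age.

```bash
# Reads "pod-name creationTimestamp" lines on stdin and prints
# "pod-name age_in_days" for every pod older than max_days.
# The second argument (current time in epoch seconds) is optional.
pods_older_than() {
  local max_days="$1" now="${2:-$(date -u +%s)}" name created age
  while read -r name created; do
    [ -n "$name" ] || continue
    age=$(( (now - $(date -u -d "$created" +%s)) / 86400 ))
    if [ "$age" -gt "$max_days" ]; then
      echo "$name $age"
    fi
  done
}

# Usage (the label selector is an assumption about how the database pods are labelled):
#   kubectl get pods -l app.kubernetes.io/name=postgresql-ha \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.creationTimestamp}{"\n"}{end}' \
#     | pods_older_than 22
```

Empty output means the check passes; any printed line names a pod that the rotation has missed.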
|
||||
|
||||
#### V. Disaster Recovery Drill (Quarterly)
|
||||
|
||||
* [ ] **The "Kill" Test:** Manually delete a Gitea Pod.
  * *Pass Criteria:* Site remains accessible (via other 2 pods). New pod spawns and joins within 2 minutes.
* [ ] **The "Restore" Test:** Restore the database backup to a *test* namespace.
  * *Pass Criteria:* You can log in and see the latest repositories.
|
||||
|
||||
|
||||
301
wiki/ThreePhoenixWorkplan.md
Normal file
@@ -0,0 +1,301 @@
|
||||
ThreePhoenixWorkplan
|
||||
|
||||
*Self-healing, load-balanced application and service hosting*
|
||||
|
||||
|
||||
|
||||
This is a plan for moving to a "3-Node Phoenix" architecture with High Availability (HA) at every layer—on bare metal.
|
||||
|
||||
Here is the staged workplan to go from **3 Blank Ubuntu Machines** to a **Self-Healing, Load-Balanced Gitea Cluster**.
|
||||
|
||||
### Prerequisite Checklist
|
||||
|
||||
* **Hardware:** 3x Ubuntu Servers (22.04 or 24.04 LTS).
|
||||
* **Network:** All 3 nodes must be able to talk to each other.
|
||||
* **DNS:** A domain pointing to your cluster (e.g., `git.yourdomain.com`).
|
||||
|
||||
---
|
||||
|
||||
### Phase 1: The Foundation (K3s Cluster)
|
||||
|
||||
We will use **K3s** with embedded etcd. This gives you a true HA control plane without the complexity of "The Hard Way."
|
||||
|
||||
**1. Prepare the Nodes (Run on All 3)**
|
||||
Disable swap (Kubernetes requirement) and update.
|
||||
|
||||
```bash
|
||||
sudo swapoff -a
|
||||
sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
|
||||
sudo apt update && sudo apt upgrade -y
|
||||
|
||||
```
|
||||
|
||||
**2. Initialize the First Node (The Seed)**
|
||||
Run this on **Node 1**:
|
||||
|
||||
```bash
curl -sfL https://get.k3s.io | sh -s - server \
  --cluster-init \
  --tls-san git.yourdomain.com \
  --token SECRET_CLUSTER_TOKEN
```
|
||||
|
||||
* `--cluster-init`: Tells K3s this is the start of an HA cluster.
|
||||
* `SECRET_CLUSTER_TOKEN`: Make up a strong password. You need this for the other nodes.
|
||||
|
||||
**3. Join Nodes 2 & 3**
|
||||
Run this on **Node 2** and **Node 3**:
|
||||
|
||||
```bash
curl -sfL https://get.k3s.io | sh -s - server \
  --server https://<IP_OF_NODE_1>:6443 \
  --token SECRET_CLUSTER_TOKEN
```
|
||||
|
||||
**4. Verification**
|
||||
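On Node 1, run `sudo k3s kubectl get nodes`. You should see 3 nodes in `Ready` state, each carrying the `control-plane` and `etcd` roles.
*(To manage the cluster remotely, copy `/etc/rancher/k3s/k3s.yaml` to your local machine as `~/.kube/config` and change its `server:` address from `https://127.0.0.1:6443` to the address of one of the nodes.)*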
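
The kubeconfig written by K3s points at `https://127.0.0.1:6443`, so a copy used from another machine needs its `server:` address rewritten. A minimal sketch of that rewrite follows; the function name is hypothetical, and `NODE_IP` and `user` in the usage line are placeholders.

```bash
# Reads a K3s kubeconfig on stdin and prints it with the loopback API
# address replaced by the given node address.
kubeconfig_for_remote() {
  local node_addr="$1"
  sed "s#https://127\.0\.0\.1:6443#https://${node_addr}:6443#"
}

# Usage:
#   ssh user@NODE_IP sudo cat /etc/rancher/k3s/k3s.yaml \
#     | kubeconfig_for_remote NODE_IP > ~/.kube/config
```

The address used here should be one the API server certificate covers, such as a node IP or the name passed via `--tls-san`.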
|
||||
|
||||
---
|
||||
|
||||
### Phase 2: The Storage (Longhorn)
|
||||
|
||||
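**Crucial:** Gitea HA requires a shared filesystem (`ReadWriteMany`) so that all 3 Gitea pods see the same Git repositories. On bare metal, **Longhorn** is a common way to achieve this. Longhorn expects `open-iscsi` on every node, and its RWX volumes additionally need an NFSv4 client (`nfs-common` on Ubuntu).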
|
||||
|
||||
**1. Install Longhorn**
|
||||
|
||||
```bash
|
||||
helm repo add longhorn https://charts.longhorn.io
|
||||
helm repo update
|
||||
helm install longhorn longhorn/longhorn --namespace longhorn-system --create-namespace
|
||||
|
||||
```
|
||||
|
||||
**2. Verify Storage Class**
|
||||
Run `kubectl get sc`. You should see `longhorn` (default).
|
||||
|
||||
---
|
||||
|
||||
### Phase 3: The Database (Postgres HA)
|
||||
|
||||
We deploy the 3-node Postgres cluster with `pgpool` load balancing.
|
||||
|
||||
**1. Create `postgres-values.yaml`**
|
||||
|
||||
```yaml
# Check key names against `helm show values bitnami/postgresql-ha`
# for the chart version you install.
architecture: replication
postgresql:
  replicaCount: 3
pgpool:
  replicaCount: 3
  loadBalancing:
    mode: on # The magic setting for performance
persistence:
  storageClass: "longhorn"
  size: 10Gi
metrics:
  enabled: true # For monitoring later
  # ServiceMonitor needs the Prometheus Operator CRDs, which this workplan
  # only installs in Phase 7.
  serviceMonitor:
    enabled: true
```
|
||||
|
||||
**2. Install**
|
||||
|
||||
```bash
|
||||
helm repo add bitnami https://charts.bitnami.com/bitnami
|
||||
helm install gitea-db bitnami/postgresql-ha -f postgres-values.yaml
|
||||
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Phase 4: The Application (Gitea HA)
|
||||
|
||||
Now for the complex part. We need Gitea to be stateless.
|
||||
|
||||
**1. Create `gitea-values.yaml`**
|
||||
|
||||
```yaml
replicaCount: 3 # Run 3 copies (top-level key in the Gitea chart)

gitea:
  config:
    database:
      DB_TYPE: postgres
      HOST: gitea-db-postgresql-ha-pgpool:5432 # Point to Pgpool!
      NAME: gitea
      USER: postgres
    # CRITICAL: Shared Storage for Repos
    repository:
      ROOT: /data/git/repositories
    # Use Memcached/Redis for sessions (required for HA).
    # In-memory cache and sessions are per-pod, so with 3 replicas a user
    # can lose their session when a request lands on a different pod.
    cache:
      ADAPTER: memory # Ideally switch to Redis for true HA later
    session:
      PROVIDER: memory # Ideally switch to Redis for true HA later

persistence:
  enabled: true
  accessModes:
    - ReadWriteMany # This demands Longhorn
  size: 20Gi
  storageClass: longhorn

service:
  http:
    type: ClusterIP # Don't expose directly!
```
|
||||
|
||||
**2. Install**
|
||||
|
||||
```bash
|
||||
helm repo add gitea-charts https://dl.gitea.io/charts/
|
||||
helm install gitea gitea-charts/gitea -f gitea-values.yaml
|
||||
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Phase 5: The Security (Nginx + SSL)
|
||||
|
||||
We stop exposing ports directly and use an Ingress Controller.
|
||||
|
||||
**1. Install Nginx Ingress**
|
||||
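K3s ships with Traefik as its default ingress controller. You can keep using it, or, if you prefer **Nginx**, disable Traefik (for example by installing K3s with `--disable traefik`) so that the two controllers do not compete for ports 80 and 443: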
|
||||
|
||||
```bash
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx
```
|
||||
|
||||
**2. Install Cert-Manager (For Let's Encrypt)**
|
||||
|
||||
```bash
|
||||
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
|
||||
|
||||
```
|
||||
|
||||
**3. Create the Ingress Resource**
|
||||
Save as `gitea-ingress.yaml`:
|
||||
|
||||
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gitea-ingress
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - git.yourdomain.com
      secretName: gitea-tls-secret
  rules:
    - host: git.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: gitea-http
                port:
                  number: 3000
```
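
The Ingress above refers to a cluster issuer named `letsencrypt-prod`, which has to exist before cert-manager can issue the certificate. The following is a hedged sketch of how such an issuer could be rendered and applied. The function name, the account-key secret name, and the contact email are assumptions; the HTTP-01 solver is assumed to go through the `nginx` ingress class used in this workplan.

```bash
# Prints a cert-manager ClusterIssuer manifest for Let's Encrypt production.
# The argument is the contact email registered with the ACME account.
render_cluster_issuer() {
  local email="$1"
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ${email}
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
EOF
}

# Usage (the email address is a placeholder):
#   render_cluster_issuer admin@yourdomain.com | kubectl apply -f -
```

HTTP-01 validation requires that `git.yourdomain.com` already resolves to the cluster and that port 80 is reachable from the internet.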
|
||||
|
||||
---
|
||||
|
||||
### Phase 6: The "Phoenix" Automation (Bonus)
|
||||
|
||||
Deploy the node-rotation logic described in RailianceThreePhoenix (the `phoenix-maintenance` CronJob).
|
||||
|
||||
**1. Create the Service Account**
|
||||
Save as `rbac.yaml`. This gives the bot permission to kill pods.
|
||||
|
||||
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: phoenix-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: phoenix-role
rules:
  - apiGroups: [""]
    resources: ["pods", "persistentvolumeclaims"]
    verbs: ["get", "list", "delete"]
  - apiGroups: ["apps"]
    resources: ["statefulsets"]
    verbs: ["get", "list", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: phoenix-binding
subjects:
  - kind: ServiceAccount
    name: phoenix-sa
    namespace: default
roleRef:
  kind: ClusterRole
  name: phoenix-role
  apiGroup: rbac.authorization.k8s.io
```
|
||||
|
||||
`kubectl apply -f rbac.yaml`
|
||||
|
||||
**2. Deploy the CronJob**
|
||||
Save the **"3-Node Phoenix Script"** CronJob manifest as `phoenix-cron.yaml` and apply it. The script itself is not reproduced in this workplan.
|
||||
|
||||
---
|
||||
|
||||
### Phase 7: Monitoring & Notification (Bonus Material)
|
||||
|
||||
You want to know *before* your users do if something breaks.
|
||||
|
||||
**1. Install the "Kube-Prometheus-Stack"**
|
||||
This gives you Prometheus (Database), Grafana (Dashboards), and Alertmanager (Notifications) in one go.
|
||||
|
||||
```bash
|
||||
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
|
||||
helm install monitoring prometheus-community/kube-prometheus-stack
|
||||
|
||||
```
|
||||
|
||||
**2. Configure Alerts (Email/Slack)**
|
||||
Edit the `alertmanager` config to send notifications.
|
||||
|
||||
* **Trigger:** `PostgresqlDown` (If pgpool can't see a backend).
|
||||
* **Trigger:** `KubePodCrashLooping` (If Gitea is restarting).
|
||||
|
||||
**3. The "Dead Man's Switch"**
|
||||
Because the Phoenix strategy *intentionally* kills pods, you need to **silence alerts** during the maintenance window (Sunday 03:00 UTC); otherwise you will wake up to panic emails every week.
|
||||
|
||||
* You can automate "Silence" creation via the Alertmanager API in your Phoenix script:
|
||||
```bash
curl -XPOST -H 'Content-Type: application/json' \
  http://monitoring-alertmanager:9093/api/v2/silences -d '{...}'
```
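
The request body elided above could be generated along the following lines. This is a sketch under stated assumptions: the function name is hypothetical, GNU `date` is required, and silencing by a `namespace` label matcher is only one possible choice (matching on `alertname` would work the same way). The field names follow the Alertmanager v2 silence object.

```bash
# Prints a JSON silence for the Alertmanager v2 API.
# Arguments: start time (epoch seconds), duration in minutes, namespace to silence.
silence_payload() {
  local start_epoch="$1" minutes="$2" namespace="$3" starts ends
  starts=$(date -u -d "@${start_epoch}" +%Y-%m-%dT%H:%M:%SZ)
  ends=$(date -u -d "@$(( start_epoch + minutes * 60 ))" +%Y-%m-%dT%H:%M:%SZ)
  cat <<EOF
{"matchers":[{"name":"namespace","value":"${namespace}","isRegex":false}],"startsAt":"${starts}","endsAt":"${ends}","createdBy":"phoenix-maintenance","comment":"Planned Phoenix rotation"}
EOF
}

# Usage: silence the "default" namespace for 60 minutes starting now.
#   silence_payload "$(date -u +%s)" 60 default \
#     | curl -XPOST -H 'Content-Type: application/json' --data @- \
#         http://monitoring-alertmanager:9093/api/v2/silences
```

Keeping the silence short and scoped to the affected namespace limits the risk of masking an unrelated outage that happens to coincide with the maintenance window.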
|
||||
|
||||
|
||||
|
||||
### Summary of Result
|
||||
|
||||
You now have:
|
||||
|
||||
1. **3 Physical Nodes** mirroring data via Longhorn.
|
||||
2. **3 Database Replicas** load-balanced by Pgpool.
|
||||
3. **SSL Termination** handling security.
|
||||
4. **Auto-Rotation** killing and rebuilding servers weekly.
|
||||
5. **Monitoring** watching it all.
|
||||
|
||||
|
||||
|
||||