railiance-cluster/workplans/RAIL-BS-WP-0007-threephoenix-ha-cluster.md
tegwick 1eb8559f27
tools and workplans
2026-05-15 23:03:28 +02:00

---
id: RAIL-BS-WP-0007
type: workplan
title: "ThreePhoenix - HA Cluster Implementation"
domain: railiance
repo: railiance-cluster
status: active
owner: railiance
topic_slug: railiance
repo_goal_id: "6ea441f7-7fe3-4598-922b-38baf20c0580"
state_hub_workstream_id: "9e208376-23f1-40c7-9813-fac1f7d6ad3b"
created: "2026-02-25"
updated: "2026-05-03"
---
# ThreePhoenix - HA Cluster Implementation
## Goal
Implement the ThreePhoenix architecture: a self-healing three-node Kubernetes
cluster substrate for Railiance production systems.
The cluster target includes:
- k3s HA with embedded etcd.
- Distributed storage.
- High-availability database patterns.
- Ingress and certificate automation.
- Node rotation and recovery drills.
- Monitoring and acceptance audits.
## Why This Belongs Before Forgejo
Forgejo will be the source forge, package base, and Actions surface for the
Railiance stack. Moving it before the production cluster lifecycle is clear
would make Forgejo both the migration target and the infrastructure experiment.
ThreePhoenix should come first, or at least its lifecycle gates should be
designed first, so Forgejo is deployed onto a substrate whose failure and
promotion behavior is already understood.
## Boundary
This workplan is S2 cluster runtime work.
In scope for `railiance-cluster`:
- k3s HA topology and runtime configuration.
- Cluster-level storage/operator installation hooks.
- Ingress and certificate controllers.
- Cluster health, rotation, and acceptance checks.
Out of scope:
- Database cluster definitions and credentials: `railiance-platform`.
- Forgejo/Gitea application Helm values: `railiance-apps`.
- Developer workflows and Actions templates: `railiance-enablement`.
- OS provisioning and host hardening: `railiance-infra`.
## Tasks
### T01 - K3s HA cluster setup
```task
id: RAIL-BS-WP-0007-T01
status: todo
priority: high
state_hub_task_id: "1f8a8668-31eb-4d79-bbcd-50f6430a8d66"
```
Implement the three-node k3s HA cluster setup using embedded etcd.
Minimum scope:
- Define node roles and join sequence.
- Automate first server and additional server joins.
- Validate etcd quorum.
- Document failure behavior for one missing node.
**Done when:** three nodes can form a healthy k3s HA cluster from documented
commands.
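
The quorum and one-missing-node points above follow from majority arithmetic
in etcd's Raft consensus. A minimal sketch of that arithmetic, with
hypothetical function names that are not part of k3s or etcd:

```python
from __future__ import annotations


def etcd_quorum(members: int) -> int:
    """Votes needed for an etcd (Raft) cluster of this size to commit writes."""
    if members < 1:
        raise ValueError("cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Members that can be lost while the remaining ones still hold quorum."""
    return members - etcd_quorum(members)


def has_quorum(members: int, healthy: int) -> bool:
    """True if the healthy members form a majority of the configured members."""
    return healthy >= etcd_quorum(members)
```

For the three-node target, `etcd_quorum(3)` is 2 and `tolerated_failures(3)`
is 1: one missing node leaves the control plane writable, a second loss does
not. A fourth server would not raise the tolerated failure count.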
---
### T02 - Longhorn distributed storage
```task
id: RAIL-BS-WP-0007-T02
status: todo
priority: high
state_hub_task_id: "b1d4e0fa-da41-4b13-a7d6-34dd040cb605"
```
Install and validate distributed storage for stateful workloads.
Minimum scope:
- Storage prerequisites and node labeling.
- Longhorn installation or approved alternative.
- Default storage class decision.
- Volume replica and recovery behavior.
- Backup target handoff to `railiance-platform` where appropriate.
**Done when:** a test PVC survives a node disruption according to the
ThreePhoenix acceptance criteria.
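
The replica and recovery question above can be reasoned about before any
drill. The sketch below is an illustrative model only; the node names and
functions are assumptions and do not call the Longhorn API:

```python
from __future__ import annotations


def surviving_replicas(replica_nodes: set[str], failed_nodes: set[str]) -> set[str]:
    """Nodes that still hold a replica after the given nodes are lost."""
    return replica_nodes - failed_nodes


def volume_survives(replica_nodes: set[str], failed_nodes: set[str]) -> bool:
    """A volume stays readable while at least one replica node is still up."""
    return bool(surviving_replicas(replica_nodes, failed_nodes))


def survives_any_failures(replica_nodes: set[str], max_failed: int) -> bool:
    """Worst case: replicas on distinct nodes must outnumber the failed nodes."""
    return len(replica_nodes) > max_failed
```

With replicas on two distinct nodes, `survives_any_failures({"node-a",
"node-b"}, 1)` holds, while a single-replica volume fails the same check. The
real acceptance test still needs an actual PVC and an actual node disruption.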
---
### T03 - PostgreSQL HA pattern
```task
id: RAIL-BS-WP-0007-T03
status: todo
priority: high
state_hub_task_id: "11283b4c-7e4d-490d-91b3-0d06a593bdf0"
```
Define the PostgreSQL HA runtime pattern and handoff to S3.
The original State Hub task names repmgr and Pgpool-II. Before implementation,
reconcile that with the current Railiance production baseline, which uses
CloudNativePG.
**Done when:** the chosen HA database pattern is documented, tested, and
owned by the correct layer without conflicting with `railiance-platform`.
---
### T04 - Reference stateful application HA
```task
id: RAIL-BS-WP-0007-T04
status: todo
priority: high
state_hub_task_id: "4a20e593-a89d-43da-abcc-5a39a4c8b3c0"
```
Validate a representative stateful source-forge workload on the HA cluster.
The historical task names Gitea. In the current roadmap this should become
Forgejo unless a temporary Gitea reference drill is still useful.
Minimum checks:
- Repository storage survives pod reschedule and node disruption.
- Database failover behavior is understood.
- Package registry storage is covered by the backup/restore plan.
- Application-level rollback is compatible with the staged promotion lifecycle.
**Done when:** Railiance has a proven stateful source-forge deployment pattern
that can be reused for the Forgejo migration.
---
### T05 - Nginx ingress and cert-manager SSL
```task
id: RAIL-BS-WP-0007-T05
status: todo
priority: medium
state_hub_task_id: "68315a40-dd5b-4032-a9e7-1152e38f9807"
```
Implement and validate the production ingress and certificate path.
Minimum scope:
- Ingress controller topology.
- TLS certificate issuance and renewal.
- Private/public exposure rules.
- Health checks for ingress and certificate validity.
**Done when:** representative services can be exposed through the intended
ingress path with valid certificates.
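
A certificate validity health check can be as small as comparing the
certificate's expiry time with the current time. A minimal sketch, assuming
the expiry timestamp has already been read from the certificate; the 14-day
warning window and the status names are illustrative, not a cert-manager
default:

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def days_until_expiry(not_after: datetime, now: datetime) -> float:
    """Remaining certificate lifetime in days (negative once expired)."""
    return (not_after - now) / timedelta(days=1)


def certificate_status(not_after: datetime, now: datetime,
                       warn_days: float = 14.0) -> str:
    """Classify a certificate as 'ok', 'renewal-overdue', or 'expired'."""
    remaining = days_until_expiry(not_after, now)
    if remaining <= 0:
        return "expired"
    if remaining <= warn_days:
        # Automated renewal should have happened before this point.
        return "renewal-overdue"
    return "ok"


if __name__ == "__main__":
    now = datetime(2026, 5, 1, tzinfo=timezone.utc)
    print(certificate_status(datetime(2026, 5, 10, tzinfo=timezone.utc), now))
```

A certificate that lands in `renewal-overdue` indicates that automated renewal
did not run in time, which is the condition the health check should alert on.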
---
### T06 - Phoenix CronJob automation
```task
id: RAIL-BS-WP-0007-T06
status: todo
priority: medium
state_hub_task_id: "f658aa6a-1c48-4660-88fa-35eaa0137e12"
```
Implement weekly node rotation or equivalent Phoenix recovery automation.
Minimum scope:
- Define what "rotation" means given the current host setup.
- Automate safe cordon, drain, rebuild/rejoin, and validation steps where
feasible.
- Include explicit human gates for destructive host actions.
- Log rotation results to State Hub.
**Done when:** the cluster recovery rhythm is scripted, documented, and tested
without risking production data.
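
The cordon, drain, rebuild/rejoin, and validation sequence with a human gate
could be structured roughly as follows. This is a sketch under assumptions:
the `kubectl` commands are standard, but the rebuild/rejoin step is a
placeholder, and `execute` and `approve` are hypothetical callbacks supplied
by the caller:

```python
from __future__ import annotations

from typing import Callable


def rotation_steps(node: str) -> list[tuple[str, bool]]:
    """Ordered (step, needs_human_gate) pairs for rotating one node."""
    return [
        (f"kubectl cordon {node}", False),
        (f"kubectl drain {node} --ignore-daemonsets --delete-emptydir-data", False),
        # Placeholder: the real rebuild/rejoin procedure depends on the host setup.
        (f"rebuild and rejoin {node}", True),
        (f"kubectl wait --for=condition=Ready node/{node} --timeout=10m", False),
        (f"kubectl uncordon {node}", False),
    ]


def run_rotation(node: str,
                 execute: Callable[[str], None],
                 approve: Callable[[str], bool]) -> list[str]:
    """Run steps in order; stop before a gated step that is not approved."""
    done: list[str] = []
    for step, gated in rotation_steps(node):
        if gated and not approve(step):
            break  # node stays cordoned and drained until a human decides
        execute(step)
        done.append(step)
    return done
```

Passing `print` as `execute` gives a dry run. When approval is refused, the
node is left cordoned and drained rather than rebuilt, so the destructive
action never happens without an explicit decision.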
---
### T07 - Monitoring stack and acceptance audit checklist
```task
id: RAIL-BS-WP-0007-T07
status: todo
priority: medium
state_hub_task_id: "70f6c8ab-a700-4fb2-893e-cf5a40615044"
```
Add the monitoring stack and final acceptance audit checklist.
Minimum scope:
- Cluster health signals.
- Storage health.
- Database/operator health handoff.
- Ingress and certificate health.
- Backup/restore freshness.
- Promotion lifecycle readiness.
**Done when:** the monitoring stack is in place and the acceptance checklist
passes; ThreePhoenix is declared ready for critical workloads only after that.
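
The checklist gate can be expressed as an all-or-nothing audit in which a
missing check counts as a failure. A minimal sketch; the check names mirror
the scope list above, but they and the `CheckResult` type are illustrative
assumptions, not an existing Railiance or State Hub schema:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


# Hypothetical check names derived from the minimum scope list.
REQUIRED_CHECKS = (
    "cluster-health",
    "storage-health",
    "database-handoff",
    "ingress-certificates",
    "backup-freshness",
    "promotion-readiness",
)


def audit(results: list[CheckResult]) -> tuple[bool, list[str]]:
    """Return (ready, blockers); a missing or failed check blocks readiness."""
    by_name = {result.name: result for result in results}
    blockers = [
        name for name in REQUIRED_CHECKS
        if name not in by_name or not by_name[name].passed
    ]
    return (not blockers, blockers)
```

Treating an absent result as a blocker keeps a skipped check from being read
as a pass.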
## Dependencies
This workplan should precede the Forgejo production cutover. It should also
shape the Stage 2 and Stage 3 gates in `RAIL-BS-WP-0006` so canaries and
promotions operate against the real HA substrate.