---
id: RAIL-BS-WP-0007
type: workplan
title: "ThreePhoenix - HA Cluster Implementation"
domain: railiance
repo: railiance-cluster
status: active
owner: railiance
topic_slug: railiance
repo_goal_id: "6ea441f7-7fe3-4598-922b-38baf20c0580"
state_hub_workstream_id: "9e208376-23f1-40c7-9813-fac1f7d6ad3b"
created: "2026-02-25"
updated: "2026-05-03"
---

# ThreePhoenix - HA Cluster Implementation

## Goal

Implement the ThreePhoenix architecture: a self-healing three-node Kubernetes
cluster substrate for Railiance production systems.

The cluster target includes:

- k3s HA with embedded etcd.
- Distributed storage.
- High-availability database patterns.
- Ingress and certificate automation.
- Node rotation and recovery drills.
- Monitoring and acceptance audits.

## Why This Belongs Before Forgejo

Forgejo will be the source forge, package base, and Actions surface for the
Railiance stack. Moving it before the production cluster lifecycle is clear
would make Forgejo both the migration target and the infrastructure experiment.

ThreePhoenix should come first, or at least its lifecycle gates should be
designed first, so Forgejo is deployed onto a substrate whose failure and
promotion behavior is already understood.

## Boundary

This workplan is S2 cluster runtime work.

In scope for `railiance-cluster`:

- k3s HA topology and runtime configuration.
- Cluster-level storage/operator installation hooks.
- Ingress and certificate controllers.
- Cluster health, rotation, and acceptance checks.

Out of scope:

- Database cluster definitions and credentials: `railiance-platform`.
- Forgejo/Gitea application Helm values: `railiance-apps`.
- Developer workflows and Actions templates: `railiance-enablement`.
- OS provisioning and host hardening: `railiance-infra`.

## Tasks

### T01 - K3s HA cluster setup

```task
id: RAIL-BS-WP-0007-T01
status: todo
priority: high
state_hub_task_id: "1f8a8668-31eb-4d79-bbcd-50f6430a8d66"
```

Implement the three-node k3s HA cluster setup using embedded etcd.

Minimum scope:

- Define node roles and join sequence.
- Automate first server and additional server joins.
- Validate etcd quorum.
- Document failure behavior for one missing node.
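The quorum and one-missing-node items above rest on simple majority arithmetic: etcd needs `floor(n / 2) + 1` healthy members to keep accepting writes. The sketch below only illustrates that arithmetic; the function names are hypothetical and nothing here queries a real cluster.

```python
def quorum_size(members: int) -> int:
    """Votes needed for a majority among `members` etcd members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """How many members can be lost while a majority remains."""
    return members - quorum_size(members)


def has_quorum(members: int, healthy: int) -> bool:
    """True if the healthy members still form a majority."""
    return healthy >= quorum_size(members)


if __name__ == "__main__":
    # Three members: quorum is two, so exactly one node may be missing.
    for healthy in (3, 2, 1):
        print(f"3 members, {healthy} healthy -> quorum kept: {has_quorum(3, healthy)}")
```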

**Done when:** three nodes can form a healthy k3s HA cluster from documented
commands.

---

### T02 - Longhorn distributed storage

```task
id: RAIL-BS-WP-0007-T02
status: todo
priority: high
state_hub_task_id: "b1d4e0fa-da41-4b13-a7d6-34dd040cb605"
```

Install and validate distributed storage for stateful workloads.

Minimum scope:

- Storage prerequisites and node labeling.
- Longhorn installation or approved alternative.
- Default storage class decision.
- Volume replica and recovery behavior.
- Backup target handoff to `railiance-platform` where appropriate.
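One way to reason about replica and recovery behavior before running a real drill is to model replica placement and ask which volumes would have no replica left if a given node disappeared. The sketch below is such a model, assuming a hand-written placement map; the volume and node names are hypothetical and it does not call any Longhorn API.

```python
from typing import Dict, List, Set


def surviving_replicas(placement: Dict[str, Set[str]], failed_node: str) -> Dict[str, Set[str]]:
    """Replica locations per volume after removing the failed node."""
    return {volume: nodes - {failed_node} for volume, nodes in placement.items()}


def volumes_lost(placement: Dict[str, Set[str]], failed_node: str) -> List[str]:
    """Volumes that would have no replica left if `failed_node` went away."""
    remaining = surviving_replicas(placement, failed_node)
    return sorted(volume for volume, nodes in remaining.items() if not nodes)


if __name__ == "__main__":
    # Hypothetical placement: one fully replicated volume, one single-replica volume.
    placement = {
        "pvc-replicated": {"node-1", "node-2", "node-3"},
        "pvc-single": {"node-2"},
    }
    for node in ("node-1", "node-2", "node-3"):
        print(node, "->", volumes_lost(placement, node))
```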

**Done when:** a test PVC survives a node disruption according to the
ThreePhoenix acceptance criteria.

---

### T03 - PostgreSQL HA pattern

```task
id: RAIL-BS-WP-0007-T03
status: todo
priority: high
state_hub_task_id: "11283b4c-7e4d-490d-91b3-0d06a593bdf0"
```

Define the PostgreSQL HA runtime pattern and handoff to S3.

The original State Hub task names repmgr and Pgpool-II. Before implementation,
reconcile that with the current Railiance production baseline using
CloudNativePG.

**Done when:** the chosen HA database pattern is documented, tested, and
owned by the correct layer without conflicting with `railiance-platform`.

---

### T04 - Reference stateful application HA

```task
id: RAIL-BS-WP-0007-T04
status: todo
priority: high
state_hub_task_id: "4a20e593-a89d-43da-abcc-5a39a4c8b3c0"
```

Validate a representative stateful source-forge workload on the HA cluster.

The historical task names Gitea. In the current roadmap this should become
Forgejo unless a temporary Gitea reference drill is still useful.

Minimum checks:

- Repository storage survives pod reschedule and node disruption.
- Database failover behavior is understood.
- Package registry storage is included in backup/restore planning.
- Application-level rollback is compatible with the staged promotion lifecycle.

**Done when:** Railiance has a proven stateful source-forge deployment pattern
that can be reused for the Forgejo migration.

---

### T05 - Nginx ingress and cert-manager SSL

```task
id: RAIL-BS-WP-0007-T05
status: todo
priority: medium
state_hub_task_id: "68315a40-dd5b-4032-a9e7-1152e38f9807"
```

Implement and validate the production ingress and certificate path.

Minimum scope:

- Ingress controller topology.
- TLS certificate issuance and renewal.
- Private/public exposure rules.
- Health checks for ingress and certificate validity.
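A certificate validity check can be reduced to comparing the certificate's `notAfter` timestamp with the current time. The sketch below assumes the timestamp is already available as a string in the format Python's `ssl` module reports (for example from `SSLSocket.getpeercert()["notAfter"]`); the 30-day threshold and the function names are illustrative assumptions, not a Railiance policy.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def needs_renewal(not_after: str, now: datetime, threshold_days: float = 30) -> bool:
    """True if the certificate expires within `threshold_days` (assumed value)."""
    return days_until_expiry(not_after, now) < threshold_days


if __name__ == "__main__":
    # Fixed example timestamp so the result does not depend on the current date.
    not_after = "Jan  5 09:34:43 2018 GMT"
    now = datetime(2018, 1, 1, 9, 34, 43, tzinfo=timezone.utc)
    print("days left:", days_until_expiry(not_after, now))
    print("needs renewal:", needs_renewal(not_after, now))
```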

**Done when:** representative services can be exposed through the intended
ingress path with valid certificates.

---

### T06 - Phoenix CronJob automation

```task
id: RAIL-BS-WP-0007-T06
status: todo
priority: medium
state_hub_task_id: "f658aa6a-1c48-4660-88fa-35eaa0137e12"
```

Implement weekly node rotation or equivalent Phoenix recovery automation.

Minimum scope:

- Define what "rotation" means for the current host reality.
- Automate safe cordon, drain, rebuild/rejoin, and validation steps where
  feasible.
- Include explicit human gates for destructive host actions.
- Log rotation results to State Hub.
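The rotation steps and the human gate can be sketched as an ordered sequence in which destructive steps run only after explicit approval, and the run stops at the first failure. Everything below is a hypothetical outline: the step names follow the list above, marking only `rebuild` as destructive is an assumption, and `execute` and `approve` stand in for whatever tooling and approval channel the workplan eventually chooses. The returned log lines are the kind of result that could be reported to State Hub.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Step:
    name: str
    destructive: bool = False


# Assumed ordering; only the rebuild is treated as destructive here.
ROTATION_STEPS = [
    Step("cordon"),
    Step("drain"),
    Step("rebuild", destructive=True),
    Step("rejoin"),
    Step("validate"),
]


def run_rotation(
    node: str,
    execute: Callable[[str, str], bool],
    approve: Callable[[str, str], bool],
) -> List[str]:
    """Run the steps in order and return one log line per attempted step.

    Stops at the first failed step or the first refused approval.
    """
    log: List[str] = []
    for step in ROTATION_STEPS:
        if step.destructive and not approve(node, step.name):
            log.append(f"{node}: {step.name} blocked (no approval)")
            break
        ok = execute(node, step.name)
        log.append(f"{node}: {step.name} {'ok' if ok else 'failed'}")
        if not ok:
            break
    return log


if __name__ == "__main__":
    # Dry run: every step "succeeds", but nobody approves the rebuild.
    for line in run_rotation("node-1", lambda n, s: True, lambda n, s: False):
        print(line)
```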

**Done when:** the cluster recovery rhythm is scripted, documented, and tested
without risking production data.

---

### T07 - Monitoring stack and acceptance audit checklist

```task
id: RAIL-BS-WP-0007-T07
status: todo
priority: medium
state_hub_task_id: "70f6c8ab-a700-4fb2-893e-cf5a40615044"
```

Add the monitoring stack and final acceptance audit checklist.

Minimum scope:

- Cluster health signals.
- Storage health.
- Database/operator health handoff.
- Ingress and certificate health.
- Backup/restore freshness.
- Promotion lifecycle readiness.
- Promotion lifecycle readiness.
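The checklist works as an all-or-nothing gate: the cluster is ready only if every item passes, and an item with no recorded result counts as a failure. A minimal sketch of that rule follows, with hypothetical check identifiers derived from the list above.

```python
from typing import Dict, List, Tuple

# Hypothetical identifiers, one per checklist item above.
REQUIRED_CHECKS = (
    "cluster_health",
    "storage_health",
    "database_operator_handoff",
    "ingress_certificate_health",
    "backup_restore_freshness",
    "promotion_lifecycle_readiness",
)


def audit(results: Dict[str, bool]) -> Tuple[bool, List[str]]:
    """Return (ready, failing_checks); a missing result counts as failing."""
    failing = [name for name in REQUIRED_CHECKS if not results.get(name, False)]
    return (not failing, failing)


if __name__ == "__main__":
    results = {name: True for name in REQUIRED_CHECKS}
    del results["backup_restore_freshness"]  # simulate a check that never ran
    ready, failing = audit(results)
    print("ready:", ready, "failing:", failing)
```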

**Done when:** ThreePhoenix can be declared ready for critical workloads only
after the checklist passes.

## Dependencies

This workplan should precede the Forgejo production cutover. It should also
shape the Stage 2 and Stage 3 gates in `RAIL-BS-WP-0006` so canaries and
promotions operate against the real HA substrate.