| id | type | title | domain | repo | status | owner | topic_slug | repo_goal_id | state_hub_workstream_id | created | updated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RAIL-BS-WP-0007 | workplan | ThreePhoenix - HA Cluster Implementation | railiance | railiance-cluster | active | railiance | railiance | 6ea441f7-7fe3-4598-922b-38baf20c0580 | 9e208376-23f1-40c7-9813-fac1f7d6ad3b | 2026-02-25 | 2026-05-03 |
ThreePhoenix - HA Cluster Implementation
Goal
Implement the ThreePhoenix architecture: a self-healing three-node Kubernetes cluster substrate for Railiance production systems.
The cluster target includes:
- k3s HA with embedded etcd.
- Distributed storage.
- High-availability database patterns.
- Ingress and certificate automation.
- Node rotation and recovery drills.
- Monitoring and acceptance audits.
Why This Belongs Before Forgejo
Forgejo will be the source forge, package base, and Actions surface for the Railiance stack. Migrating to it before the production cluster lifecycle is clear would make Forgejo both the migration target and the infrastructure experiment.
ThreePhoenix should come first, or at least its lifecycle gates should be designed first, so Forgejo is deployed onto a substrate whose failure and promotion behavior is already understood.
Boundary
This workplan is S2 cluster runtime work.
In scope for railiance-cluster:
- k3s HA topology and runtime configuration.
- Cluster-level storage/operator installation hooks.
- Ingress and certificate controllers.
- Cluster health, rotation, and acceptance checks.
Out of scope:
- Database cluster definitions and credentials: railiance-platform.
- Forgejo/Gitea application Helm values: railiance-apps.
- Developer workflows and Actions templates: railiance-enablement.
- OS provisioning and host hardening: railiance-infra.
Tasks
T01 - K3s HA cluster setup
id: RAIL-BS-WP-0007-T01
status: todo
priority: high
state_hub_task_id: "1f8a8668-31eb-4d79-bbcd-50f6430a8d66"
Implement the three-node k3s HA cluster setup using embedded etcd.
Minimum scope:
- Define node roles and join sequence.
- Automate first server and additional server joins.
- Validate etcd quorum.
- Document failure behavior for one missing node.
Done when: three nodes can form a healthy k3s HA cluster from documented commands.
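The failure behavior for one missing node follows from etcd's majority quorum rule. The sketch below shows only that arithmetic; the function names are hypothetical and not part of any k3s or etcd tooling.

```python
def quorum_size(members: int) -> int:
    """Votes needed for a majority among `members` voting etcd members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """Number of members that can be lost while a majority remains."""
    return members - quorum_size(members)


def has_quorum(members: int, healthy: int) -> bool:
    """True if `healthy` out of `members` still form a majority."""
    return healthy >= quorum_size(members)
```

For three servers the quorum is two, so the cluster tolerates one missing node; with two nodes missing, etcd has no majority and cannot accept writes until a member returns.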
T02 - Longhorn distributed storage
id: RAIL-BS-WP-0007-T02
status: todo
priority: high
state_hub_task_id: "b1d4e0fa-da41-4b13-a7d6-34dd040cb605"
Install and validate distributed storage for stateful workloads.
Minimum scope:
- Storage prerequisites and node labeling.
- Longhorn installation or approved alternative.
- Default storage class decision.
- Volume replica and recovery behavior.
- Backup target handoff to railiance-platform where appropriate.
Done when: a test PVC survives a node disruption according to the ThreePhoenix acceptance criteria.
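The replica and recovery question can be reasoned about before any drill. The sketch below assumes a volume stays usable while at least one healthy replica remains on an undisrupted node; that threshold, the node names, and the function names are illustrative assumptions, and the real pass condition is whatever the ThreePhoenix acceptance criteria specify.

```python
def surviving_replicas(replica_nodes: set[str], disrupted: set[str]) -> set[str]:
    """Nodes that still hold a replica after the disrupted nodes are removed."""
    return replica_nodes - disrupted


def volume_survives(
    replica_nodes: set[str],
    disrupted: set[str],
    min_healthy: int = 1,  # assumed threshold, not taken from the acceptance criteria
) -> bool:
    """True if enough replicas remain outside the disrupted nodes."""
    return len(surviving_replicas(replica_nodes, disrupted)) >= min_healthy
```

With one replica per node on a three-node cluster, losing any single node leaves two replicas. A single-replica volume fails the same check as soon as its node is disrupted, which is why the replica count is part of the storage class decision.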
T03 - PostgreSQL HA pattern
id: RAIL-BS-WP-0007-T03
status: todo
priority: high
state_hub_task_id: "11283b4c-7e4d-490d-91b3-0d06a593bdf0"
Define the PostgreSQL HA runtime pattern and handoff to S3.
The original State Hub task names repmgr and Pgpool-II. Before implementation, reconcile that with the current Railiance production baseline, which uses CloudNativePG.
Done when: the chosen HA database pattern is documented, tested, and
owned by the correct layer without conflicting with railiance-platform.
T04 - Reference stateful application HA
id: RAIL-BS-WP-0007-T04
status: todo
priority: high
state_hub_task_id: "4a20e593-a89d-43da-abcc-5a39a4c8b3c0"
Validate a representative stateful source-forge workload on the HA cluster.
The historical task names Gitea. In the current roadmap this should become Forgejo unless a temporary Gitea reference drill is still useful.
Minimum checks:
- Repository storage survives pod reschedule and node disruption.
- Database failover behavior is understood.
- Package registry storage is included in backup/restore thinking.
- Application-level rollback is compatible with the staged promotion lifecycle.
Done when: Railiance has a proven stateful source-forge deployment pattern that can be reused for the Forgejo migration.
T05 - Nginx ingress and cert-manager SSL
id: RAIL-BS-WP-0007-T05
status: todo
priority: medium
state_hub_task_id: "68315a40-dd5b-4032-a9e7-1152e38f9807"
Implement and validate the production ingress and certificate path.
Minimum scope:
- Ingress controller topology.
- TLS certificate issuance and renewal.
- Private/public exposure rules.
- Health checks for ingress and certificate validity.
Done when: representative services can be exposed through the intended ingress path with valid certificates.
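A certificate validity health check reduces to comparing the certificate's notAfter timestamp against a renewal threshold. The sketch below uses only the Python standard library; the 30-day threshold and the function names are assumptions for illustration, not a Railiance or cert-manager setting.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter time.

    `not_after` uses the OpenSSL text form, e.g. "Jan 15 09:34:43 2026 GMT".
    `now` must be timezone-aware.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_attention(not_after: str, now: datetime, threshold_days: float = 30.0) -> bool:
    """True if the certificate expires within the (assumed) threshold."""
    return days_until_expiry(not_after, now) < threshold_days
```

Passing the current time in as an argument keeps the check deterministic, so the same function can run inside a monitoring probe and in a unit test.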
T06 - Phoenix CronJob automation
id: RAIL-BS-WP-0007-T06
status: todo
priority: medium
state_hub_task_id: "f658aa6a-1c48-4660-88fa-35eaa0137e12"
Implement weekly node rotation or equivalent Phoenix recovery automation.
Minimum scope:
- Define what "rotation" means for the current host reality.
- Automate safe cordon, drain, rebuild/rejoin, and validation steps where feasible.
- Include explicit human gates for destructive host actions.
- Log rotation results to State Hub.
Done when: the cluster recovery rhythm is scripted, documented, and tested without risking production data.
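One way to combine automation with explicit human gates is to model the rotation as an ordered plan whose destructive steps require approval. The sketch below is a hypothetical structure, not an existing Railiance tool: it only builds and walks the plan, the rebuild and rejoin entries are placeholders for procedures owned elsewhere, and the `approve` and `execute` callbacks are assumed to be supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    name: str
    command: str  # shown to the operator; this sketch never runs it itself
    needs_human_gate: bool = False


def rotation_plan(node: str) -> list[Step]:
    """Ordered steps for rotating one node, with the destructive step gated."""
    return [
        Step("cordon", f"kubectl cordon {node}"),
        Step("drain", f"kubectl drain {node} --ignore-daemonsets --delete-emptydir-data"),
        # Placeholder: the host rebuild belongs to railiance-infra.
        Step("rebuild", f"<host rebuild procedure for {node}>", needs_human_gate=True),
        # Placeholder: reuse the documented join sequence from T01.
        Step("rejoin", f"<k3s server join procedure for {node}>"),
        Step("validate", f"kubectl wait --for=condition=Ready node/{node} --timeout=600s"),
        # Only needed if the node object survives the rebuild.
        Step("uncordon", f"kubectl uncordon {node}"),
    ]


def run_plan(
    plan: list[Step],
    approve: Callable[[Step], bool],
    execute: Callable[[Step], None],
) -> list[str]:
    """Run steps in order and stop at the first gated step that is not approved."""
    completed: list[str] = []
    for step in plan:
        if step.needs_human_gate and not approve(step):
            break
        execute(step)
        completed.append(step.name)
    return completed
```

If the gate is declined, the run stops after the drain with the node still cordoned, which is a safe state to resume from. The returned list of completed step names is the kind of result that could be logged to State Hub.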
T07 - Monitoring stack and acceptance audit checklist
id: RAIL-BS-WP-0007-T07
status: todo
priority: medium
state_hub_task_id: "70f6c8ab-a700-4fb2-893e-cf5a40615044"
Add the monitoring stack and final acceptance audit checklist.
Minimum scope:
- Cluster health signals.
- Storage health.
- Database/operator health handoff.
- Ingress and certificate health.
- Backup/restore freshness.
- Promotion lifecycle readiness.
Done when: ThreePhoenix can be declared ready for critical workloads only after the checklist passes.
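The "only after the checklist passes" rule can be expressed as a gate in which a missing result counts as a failure. The check names below are hypothetical identifiers derived from the minimum scope list above; how each result is produced is left open.

```python
# Hypothetical identifiers for the six checklist areas in this task.
CHECKS = (
    "cluster_health",
    "storage_health",
    "database_operator_handoff",
    "ingress_and_certificates",
    "backup_restore_freshness",
    "promotion_lifecycle_readiness",
)


def failing_checks(results: dict[str, bool]) -> list[str]:
    """Checks that failed or were never reported (missing counts as failed)."""
    return [name for name in CHECKS if not results.get(name, False)]


def ready_for_critical_workloads(results: dict[str, bool]) -> bool:
    """True only when every checklist item has an explicit passing result."""
    return not failing_checks(results)
```

Treating an unreported check as a failure keeps the gate conservative: the cluster cannot be declared ready because an audit step was skipped.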
Dependencies
This workplan should precede the Forgejo production cutover. It should also
shape the Stage 2 and Stage 3 gates in RAIL-BS-WP-0006 so canaries and
promotions operate against the real HA substrate.