coulomb/railiance-cluster

Fork 0

Files

tegwick 1eb8559f27

railiance-tests / smoke (push) Has been cancelled

Details

tools and workplans

2026-05-15 23:03:28 +02:00

6.0 KiB

Raw Permalink Blame History

id, type, title, domain, repo, status, owner, topic_slug, repo_goal_id, state_hub_workstream_id, created, updated

id	type	title	domain	repo	status	owner	topic_slug	repo_goal_id	state_hub_workstream_id	created	updated
RAIL-BS-WP-0007	workplan	ThreePhoenix - HA Cluster Implementation	railiance	railiance-cluster	active	railiance	railiance	6ea441f7-7fe3-4598-922b-38baf20c0580	9e208376-23f1-40c7-9813-fac1f7d6ad3b	2026-02-25	2026-05-03

ThreePhoenix - HA Cluster Implementation

Goal

Implement the ThreePhoenix architecture: a self-healing three-node Kubernetes cluster substrate for Railiance production systems.

The cluster target includes:

k3s HA with embedded etcd.
Distributed storage.
High-availability database patterns.
Ingress and certificate automation.
Node rotation and recovery drills.
Monitoring and acceptance audits.

Why This Belongs Before Forgejo

Forgejo will be the source forge, package base, and Actions surface for the Railiance stack. Moving it before the production cluster lifecycle is clear would make Forgejo both the migration target and the infrastructure experiment.

ThreePhoenix should come first, or at least its lifecycle gates should be designed first, so Forgejo is deployed onto a substrate whose failure and promotion behavior is already understood.

Boundary

This workplan is S2 cluster runtime work.

In scope for railiance-cluster:

k3s HA topology and runtime configuration.
Cluster-level storage/operator installation hooks.
Ingress and certificate controllers.
Cluster health, rotation, and acceptance checks.

Out of scope:

Database cluster definitions and credentials: railiance-platform.
Forgejo/Gitea application Helm values: railiance-apps.
Developer workflows and Actions templates: railiance-enablement.
OS provisioning and host hardening: railiance-infra.

Tasks

T01 - K3s HA cluster setup

id: RAIL-BS-WP-0007-T01
status: todo
priority: high
state_hub_task_id: "1f8a8668-31eb-4d79-bbcd-50f6430a8d66"

Implement the three-node k3s HA cluster setup using embedded etcd.

Minimum scope:

Define node roles and join sequence.
Automate first server and additional server joins.
Validate etcd quorum.
Document failure behavior for one missing node.

Done when: three nodes can form a healthy k3s HA cluster from documented commands.

T02 - Longhorn distributed storage

id: RAIL-BS-WP-0007-T02
status: todo
priority: high
state_hub_task_id: "b1d4e0fa-da41-4b13-a7d6-34dd040cb605"

Install and validate distributed storage for stateful workloads.

Minimum scope:

Storage prerequisites and node labeling.
Longhorn installation or approved alternative.
Default storage class decision.
Volume replica and recovery behavior.
Backup target handoff to railiance-platform where appropriate.

Done when: a test PVC survives a node disruption according to the ThreePhoenix acceptance criteria.

T03 - PostgreSQL HA pattern

id: RAIL-BS-WP-0007-T03
status: todo
priority: high
state_hub_task_id: "11283b4c-7e4d-490d-91b3-0d06a593bdf0"

Define the PostgreSQL HA runtime pattern and handoff to S3.

The original State Hub task names repmgr and Pgpool-II. Before implementation, reconcile that with the current Railiance production baseline using CloudNative PG.

Done when: the chosen HA database pattern is documented, tested, and owned by the correct layer without conflicting with railiance-platform.

T04 - Reference stateful application HA

id: RAIL-BS-WP-0007-T04
status: todo
priority: high
state_hub_task_id: "4a20e593-a89d-43da-abcc-5a39a4c8b3c0"

Validate a representative stateful source-forge workload on the HA cluster.

The historical task names Gitea. In the current roadmap this should become Forgejo unless a temporary Gitea reference drill is still useful.

Minimum checks:

Repository storage survives pod reschedule and node disruption.
Database failover behavior is understood.
Package registry storage is included in backup/restore thinking.
Application-level rollback is compatible with the staged promotion lifecycle.

Done when: Railiance has a proven stateful source-forge deployment pattern that can be reused for the Forgejo migration.

T05 - Nginx ingress and cert-manager SSL

id: RAIL-BS-WP-0007-T05
status: todo
priority: medium
state_hub_task_id: "68315a40-dd5b-4032-a9e7-1152e38f9807"

Implement and validate the production ingress and certificate path.

Minimum scope:

Ingress controller topology.
TLS certificate issuance and renewal.
Private/public exposure rules.
Health checks for ingress and certificate validity.

Done when: representative services can be exposed through the intended ingress path with valid certificates.

T06 - Phoenix CronJob automation

id: RAIL-BS-WP-0007-T06
status: todo
priority: medium
state_hub_task_id: "f658aa6a-1c48-4660-88fa-35eaa0137e12"

Implement weekly node rotation or equivalent Phoenix recovery automation.

Minimum scope:

Define what "rotation" means for the current host reality.
Automate safe cordon, drain, rebuild/rejoin, and validation steps where feasible.
Include explicit human gates for destructive host actions.
Log rotation results to State Hub.

Done when: the cluster recovery rhythm is scripted, documented, and tested without risking production data.

T07 - Monitoring stack and acceptance audit checklist

id: RAIL-BS-WP-0007-T07
status: todo
priority: medium
state_hub_task_id: "70f6c8ab-a700-4fb2-893e-cf5a40615044"

Add the monitoring stack and final acceptance audit checklist.

Minimum scope:

Cluster health signals.
Storage health.
Database/operator health handoff.
Ingress and certificate health.
Backup/restore freshness.
Promotion lifecycle readiness.

Done when: ThreePhoenix can be declared ready for critical workloads only after the checklist passes.

Dependencies

This workplan should precede the Forgejo production cutover. It should also shape the Stage 2 and Stage 3 gates in RAIL-BS-WP-0006 so canaries and promotions operate against the real HA substrate.

6.0 KiB Raw Permalink Blame History

ThreePhoenix - HA Cluster Implementation

Goal

Why This Belongs Before Forgejo

Boundary

Tasks

T01 - K3s HA cluster setup

T02 - Longhorn distributed storage

T03 - PostgreSQL HA pattern

T04 - Reference stateful application HA

T05 - Nginx ingress and cert-manager SSL

T06 - Phoenix CronJob automation

T07 - Monitoring stack and acceptance audit checklist

Dependencies

6.0 KiB

Raw Permalink Blame History