CUST-WP-0054 T01-T03: fleet architecture, de-hub runbook, drain plan

Documents the three-machine role model, fleet mesh topology, coulombcore
freeze policy, and ordered drain sequence. Adds railiance01 systemd tunnel
install assets and refreshes ops service inventory to reflect 2026-07-03
production placement (cluster State Hub, fleet mesh, draining coulombcore).
codex
2026-07-04 00:29:55 +02:00
parent 0a77483861
commit cf4be716e1
10 changed files with 1050 additions and 34 deletions

@@ -3,7 +3,7 @@
<!-- generated by ops/render_service_inventory.py; edit ops/service-inventory.yml instead -->
Source: `ops/service-inventory.yml`
-Inventory last reviewed: `2026-06-05`
+Inventory last reviewed: `2026-07-03`
This is the repo-native first view for `CUST-WP-0047`. It exists so an
operator can answer what is running where before the full standalone
@@ -16,9 +16,9 @@ operator can answer what is running where before the full standalone
| Environments | 4 |
| Hosts | 3 |
| Clusters | 3 |
-| Services | 8 |
-| Services: observed_ok | 2 |
-| Services: unknown | 6 |
+| Services | 11 |
+| Services: observed_ok | 6 |
+| Services: unknown | 5 |
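The new summary rows can be cross-checked against the catalog in this revision. The following is a minimal sketch, not the logic of `ops/render_service_inventory.py`: the `SERVICES` list and the `summarize` helper are hypothetical, and the statuses are copied by hand from the post-change Service Catalog rows rather than read from `ops/service-inventory.yml`, whose schema is not shown here.

```python
from collections import Counter

# Hypothetical in-memory view of the catalog: (service id, status) pairs
# copied by hand from the post-change Service Catalog rows.
SERVICES = [
    ("gitea", "unknown"),
    ("gitea-database", "unknown"),
    ("gitea-shared-storage", "unknown"),
    ("state-hub", "observed_ok"),
    ("issue-core", "observed_ok"),
    ("core-hub", "observed_ok"),
    ("fleet-mesh-railiance01", "observed_ok"),
    ("inter-hub", "unknown"),
    ("activity-core", "observed_ok"),
    ("ops-bridge", "observed_ok"),
    ("haskell-build-agent", "unknown"),
]


def summarize(services):
    """Return (label, count) rows: the total first, then one row per status."""
    counts = Counter(status for _, status in services)
    rows = [("Services", len(services))]
    rows += [(f"Services: {status}", n) for status, n in sorted(counts.items())]
    return rows


for label, n in summarize(SERVICES):
    print(f"| {label} | {n} |")
```

Under these assumptions the sketch prints 11 services, 6 `observed_ok`, and 5 `unknown`, which agrees with the updated summary rows.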
## Service Catalog
@@ -27,10 +27,13 @@ operator can answer what is running where before the full standalone
| Gitea (gitea) | CoulombCore<br>type: k3s; cluster: coulombcore-k3s; namespace: default | railiance-apps | https://gitea.coulomb.social/v2/<br>Expected: status 401, OCI registry auth challenge | unknown<br>2026-05-16: Inventory draft records Helm release gitea, namespace default, app version 1.25.4, NodePort 32166, and registry auth challenge. | database:gitea-db<br>pvc:default/gitea-shared-storage | k8s: unknown (coulombcore-k3s/default) | Package token and push/pull verification need current evidence. |
| Gitea Database (gitea-database) | CoulombCore<br>type: k3s; cluster: coulombcore-k3s; namespace: databases | railiance-platform | - | unknown<br>2026-05-16: /home/worsch/helix-forge/wiki/OpsHubInventory.md | - | k8s: unknown (coulombcore-k3s/databases) | Backup and restore evidence not recorded in ops inventory. |
| Gitea Shared Storage (gitea-shared-storage) | CoulombCore<br>type: k3s; cluster: coulombcore-k3s; namespace: default | railiance-platform<br>railiance-apps | - | unknown<br>2026-05-16: /home/worsch/helix-forge/wiki/OpsHubInventory.md | - | k8s: unknown (coulombcore-k3s/default/pvc/gitea-shared-storage) | Package blob backup and restore evidence not confirmed. |
-| State Hub (state-hub) | Local Workstation<br>type: local-process; host: local-workstation; ports: 8000 | state-hub<br>the-custodian | http://127.0.0.1:8000/state/health<br>Expected: status 200, health response | observed_ok<br>2026-06-05: State Hub accepted inbox, task, and progress API calls. | postgresql:state-hub | http: observed_ok (http://127.0.0.1:8000) | Future cluster deployment readiness still needs ops evidence. |
+| State Hub (state-hub) | CoulombCore<br>type: k3s; cluster: coulombcore-k3s; namespace: state-hub | state-hub<br>the-custodian | http://127.0.0.1:8000/state/health<br>Expected: status 200, health response | observed_ok<br>2026-07-03: Cluster hub healthy; railiance01 reaches via fleet forward tunnel. | postgresql:state-hub-db | http: observed_ok (workstation tunnel state-hub-primary → cluster)<br>tunnel: observed_ok (railiance01 systemd fleet-state-hub-coulombcore → cluster) | Primary home must move to railiance01 per CUST-WP-0054-T05. |
+| issue-core (issue-core) | CoulombCore<br>type: k3s; cluster: coulombcore-k3s; namespace: issue-core | issue-core | http://127.0.0.1:8765/healthz<br>Expected: status 200, version response | observed_ok<br>2026-07-02: REST emission live via cross-machine fleet path. | postgresql:issue-core | tunnel: observed_ok (railiance01 fleet-issue-core-coulombcore → cluster) | Target railiance01 overlay per CUST-WP-0054 drain Wave 4. |
+| Core Hub (core-hub) | CoulombCore<br>type: k3s; cluster: coulombcore-k3s; namespace: core-hub-staging | core-hub | https://hub.coulomb.social/api/v2/hubs<br>Expected: status 200, hub list when authenticated | observed_ok<br>2026-07-02: Staging deployed; production cutover gated on CORE-WP-0005-T04. | postgresql:core-hub | k8s: observed_ok (coulombcore-k3s/core-hub-staging) | Production cutover to railiance01 pending operator approval. |
+| Fleet Mesh (railiance01) (fleet-mesh-railiance01) | Railiance01<br>type: systemd; host: railiance01 | the-custodian<br>ops-bridge | http://127.0.0.1:18000/state/health<br>Expected: status 200 | observed_ok<br>2026-07-03: Workstation reverse tunnels stopped; systemd forwards healthy. | - | ssh-tunnel: observed_ok (railiance01 → coulombcore ClusterIPs) | Migrate to atm-fleet-mesh cert_command when VAULT_TOKEN available. |
| Inter-Hub (inter-hub) | ThreePhoenix Production<br>type: external; public_endpoint: https://hub.coulomb.social | inter-hub | https://hub.coulomb.social/api/v2/openapi.json<br>Expected: status 200, OpenAPI document | unknown<br>2026-05-16: /home/worsch/helix-forge/wiki/OpsHubInventory.md | - | https: unknown (https://hub.coulomb.social) | ops-hub bootstrap requires authenticated UI flow or deployment-side migration. |
| activity-core (activity-core) | Railiance01<br>type: k3s; cluster: railiance01-k3s; namespace: activity-core | activity-core<br>the-custodian | activity-core API health endpoint<br>Expected: status 200, healthy DB and Temporal status | observed_ok<br>2026-05-23: API health, worker rollout, Temporal CLI schedule listing, and State Hub bridge were verified. | postgresql:activity-core<br>temporal:activity-core<br>nats:railiance01 | k8s: observed_ok (railiance01-k3s/activity-core) | Add explicit ops inventory probes and evidence events. |
-| Ops Bridge (ops-bridge) | Local Workstation<br>type: bridge; host: local-workstation | ops-bridge | - | unknown<br>2026-05-16: Bridge is useful for connected-server visibility but is not itself the service catalog. | - | ssh-tunnel: unknown (connected remote servers) | Emit reachability evidence into ops-hub instead of relying on bridge state as inventory. |
+| Ops Bridge (ops-bridge) | Local Workstation<br>type: bridge; host: local-workstation | ops-bridge | - | observed_ok<br>2026-07-03: state-hub-railiance01 and issue-core-railiance01 stopped; not production-critical. | - | ssh-tunnel: observed_ok (interactive dev tunnels only (k3s-api, state-hub-primary)) | Install ops-bridge on railiance01 or keep systemd fleet-mesh units. |
| Haskell Build Agent (haskell-build-agent) | Local Workstation<br>type: systemd; host: haskell-build-vm | the-custodian | http://127.0.0.1:18000<br>Expected: VM can reach State Hub through SSH forward | unknown<br>undated: Build agent is a systemd service and registers with State Hub on boot. | - | ssh: unknown (local workstation reverse tunnel port 12222) | Current tunnel and capability registration need live evidence in ops-hub. |
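Each catalog row pairs a health endpoint with an expected HTTP status (for example, 401 for the Gitea registry auth challenge and 200 for State Hub health). A probe that records such evidence might look like the sketch below. It is illustrative only: `Probe`, `http_status`, and `evaluate` are hypothetical names, and the `unreachable` and `unexpected_status` results are assumptions, since the inventory itself only uses `observed_ok` and `unknown`.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Probe:
    """One catalog row reduced to what a status check needs (hypothetical)."""
    service_id: str
    url: str
    expected_status: int


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the code is still evidence
        # (e.g. the expected 401 auth challenge from the registry).
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def evaluate(probe: Probe,
             fetch: Callable[[str], Optional[int]] = http_status) -> str:
    """Compare the fetched status with the expected one for this probe."""
    status = fetch(probe.url)
    if status is None:
        return "unreachable"  # assumed label, not from the inventory
    if status == probe.expected_status:
        return "observed_ok"
    return "unexpected_status"  # assumed label, not from the inventory


# Usage with a stubbed fetcher, so the example does not touch the network.
gitea = Probe("gitea", "https://gitea.coulomb.social/v2/", 401)
print(gitea.service_id, evaluate(gitea, fetch=lambda url: 401))
```

Passing the fetcher as a parameter keeps the comparison logic testable offline; with the default `http_status`, the tunnel-backed endpoints on `127.0.0.1` would only answer from a host where the corresponding forward is running.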
## Open Operating Gaps
@@ -50,7 +53,21 @@ operator can answer what is running where before the full standalone
### State Hub (`state-hub`)
-- Future cluster deployment readiness still needs ops evidence.
+- Primary home must move to railiance01 per CUST-WP-0054-T05.
+- Consistency sweep writebacks still target workstation paths.
+### issue-core (`issue-core`)
+- Target railiance01 overlay per CUST-WP-0054 drain Wave 4.
+### Core Hub (`core-hub`)
+- Production cutover to railiance01 pending operator approval.
+### Fleet Mesh (railiance01) (`fleet-mesh-railiance01`)
+- Migrate to atm-fleet-mesh cert_command when VAULT_TOKEN available.
+- Retire when State Hub and issue-core move to railiance01.
### Inter-Hub (`inter-hub`)
@@ -62,7 +79,7 @@ operator can answer what is running where before the full standalone
### Ops Bridge (`ops-bridge`)
-- Emit reachability evidence into ops-hub instead of relying on bridge state as inventory.
+- Install ops-bridge on railiance01 or keep systemd fleet-mesh units.
### Haskell Build Agent (`haskell-build-agent`)