the-custodian/workplans/CUST-WP-0051-infrastructure-stabilization-metaplan.md

id: CUST-WP-0051
type: workplan
title: Infrastructure Stabilization Metaplan
domain: infotech
repo: the-custodian
status: active
owner: codex
topic_slug: custodian
planning_priority: high
planning_order: 51
created: 2026-06-27
updated: 2026-06-27
state_hub_workstream_id: 21cabc98-3f80-4d00-b3b7-06e2ac2af88f

CUST-WP-0051 - Infrastructure Stabilization Metaplan

Goal

Drive the registered infrastructure workplans from a scattered blocked state to a stable checkpoint where:

  • active blockers have a named owner, route, and next command or decision;
  • production credential work uses approved custody paths only;
  • daily operational automation has one healthy runner and clean evidence;
  • State Hub registration reflects the real file state;
  • unfinished strategic work is sequenced into clear follow-on lanes.

This workplan does not replace the child workplans. It is the coordination lane for removing cross-workplan blocks and creating a reliable handoff point.

Review Snapshot

Reviewed on 2026-06-27 from State Hub and the repo workplan files.

Active registered workstreams with open work:

| Workstream | Open state | Main stabilization meaning |
|---|---|---|
| artifact-store-wp-0007 | 5 todo | Object-store compatibility and STS credential vending lane. |
| ihub-wp-0022 | 3 wait, 5 done | Ops-hub evidence intake waits on widget seed/runtime key/smoke. |
| cust-wp-0047 | 1 wait, 6 done | Ops-hub now view waits on Inter-Hub widget activation. |
| cust-wp-0049 | 1 wait, 5 done | Access lane is ready; live bootstrap needs approved admin execution. |
| activity-wp-0016 | 1 wait, 2 progress, 5 todo, 2 done | Daily-triage output robustness needs live deploy/smoke evidence. |
| three-phoenix-ha-cluster | 7 todo | Target HA substrate is planned but not executed. |
| staged-promotion-lifecycle | finished, 7 done | Promotion discipline ready for broad production cutovers. |
| rail-ho-wp-0005 | 11 todo, 1 progress | Forgejo production migration needs human design and cutover decisions. |
| cust-wp-0045-cutover-runbook | 0 tasks | Registered runbook is appearing as an active no-task workstream. |
| net-wp-0020 | 2 wait, 1 todo, 2 done | OpenBao unseal custody models still need operator profile decisions. |
| issue-wp-0003 | 2 progress, 5 done | issue-core deploy is close; finish live wiring and runbook evidence. |
| activity-wp-0006 | 1 wait, 1 todo, 6 done | Three-run calibration waits on the daily-triage live gate. |
| cust-wp-0038 | 8 todo | Full ThreePhoenix State Hub HA migration remains strategic follow-on. |
| cust-wp-0025 | 17 todo, 9 done | FOS hub bootstrap now depends on identity, ops-hub, and fin-hub lanes. |
| cust-wp-0011 | 3 todo, 6 done | Pragmatic State Hub railiance01 migration still needs cutover/stabilize/retire. |

Additional repo-local hygiene issue:

  • CUST-WP-0014 has frontmatter status: done but all six task blocks are still todo. Treat it as either superseded and archive it, or reopen it as a focused State Hub sync-health workplan.

State Hub hygiene issue:

  • There are stale needs_human flags on completed or cancelled tasks. These do not all block execution, but they make the operator view noisier and should be cleared or annotated after the source workplans are reconciled.
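
A minimal sketch of how stale flags could be selected before clearing them. The `Task` shape, the field names, and the sample rows are assumptions for illustration; they are not the State Hub schema or real task data.

```python
from dataclasses import dataclass

# Assumption: these two statuses mean a task can no longer be a live human gate.
CLOSED_STATUSES = {"done", "cancel"}


@dataclass
class Task:
    task_id: str       # hypothetical identifier, e.g. "EXAMPLE-T01"
    status: str        # todo / progress / wait / done / cancel
    needs_human: bool  # flag shown in the operator view


def stale_human_flags(tasks):
    """Return tasks that are closed but still carry the needs_human flag."""
    return [t for t in tasks if t.needs_human and t.status in CLOSED_STATUSES]


# Hypothetical sample rows, not real State Hub data.
sample = [
    Task("EXAMPLE-T01", "done", True),     # stale: closed but still flagged
    Task("EXAMPLE-T02", "wait", True),     # live gate: keep the flag
    Task("EXAMPLE-T03", "cancel", False),  # closed and already clean
]
print([t.task_id for t in stale_human_flags(sample)])  # ['EXAMPLE-T01']
```

Selecting by status first keeps live todo/progress/wait gates out of the cleanup set by construction.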

Dependency Shape

The critical path is:

  1. Credential and operator-access custody: OpenBao, Inter-Hub operator key, ops-hub runtime key, Forgejo SMTP/cutover approvals, and OpenBao unseal profile decisions.
  2. Ops evidence and daily automation: Inter-Hub ops-hub records, activity-core daily-triage robustness deployment, schema-valid smoke, then three clean scheduled runs.
  3. Production substrate and source forge: issue-core GitOps pilot, Forgejo production migration, artifact-store STS, staged promotion, and State Hub migration strategy.
  4. Federation buildout: identity completion, Core Hub replacement evidence, ops-hub scaffold reset, fin-hub scaffold, and business/runway canon.

Task: Normalize Registry And Workplan Hygiene

id: CUST-WP-0051-T01
status: done
priority: high
state_hub_task_id: "7e83bd50-5ca2-4341-9d18-65512e3f0442"

Clean up the planning substrate before execution work resumes.

Minimum scope:

  • Decide whether CUST-WP-0045-cutover-runbook should stay registered as an active workstream or be represented only as a runbook under CUST-WP-0045.
  • Resolve CUST-WP-0014: archive as superseded, or reopen and re-scope the six remaining State Hub sync-health tasks.
  • Clear or annotate stale needs_human flags on done/cancel tasks after source workplans confirm they are no longer live gates.
  • Run State Hub consistency after file changes.

Done when the active workstream list no longer contains no-task runbooks or contradictory done-with-todo files, and the human-needed view shows only live human gates.
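
The two hygiene conditions in the done criterion can be expressed as a small check. This is a sketch under assumptions: the function name and the finding strings are hypothetical, and it takes already-parsed statuses rather than reading workplan files.

```python
OPEN_STATES = {"todo", "progress", "wait"}
CLOSED_WORKPLAN_STATES = {"done", "finished"}


def workplan_contradictions(workplan_status, task_statuses):
    """Return hygiene findings for one workplan as short strings."""
    findings = []
    open_tasks = [s for s in task_statuses if s in OPEN_STATES]
    if workplan_status in CLOSED_WORKPLAN_STATES and open_tasks:
        findings.append(f"closed workplan still has {len(open_tasks)} open task(s)")
    if workplan_status == "active" and not task_statuses:
        findings.append("active workstream has no tasks")
    return findings


# The two cases described in the review snapshot:
print(workplan_contradictions("done", ["todo"] * 6))  # CUST-WP-0014 shape
print(workplan_contradictions("active", []))          # no-task runbook shape
```

An empty list for every registered workplan would correspond to the "no contradictory files, no no-task runbooks" half of the criterion; the human-needed view still needs its own check.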

Progress 2026-06-27:

  • CUST-WP-0045-cutover-runbook now has status: finished; State Hub no longer lists it as an active workstream.
  • CUST-WP-0014 is reopened as backlog with its task detail preserved, so it is no longer a contradictory done-with-todo file or an active queue item.
  • make fix-consistency REPO=the-custodian passed with pre-existing C-12 warnings and synced the lifecycle changes into State Hub.

Completed 2026-06-27: cleared 15 stale needs_human flags from tasks that were already done or cancel, leaving live todo/progress/wait human gates untouched. T01 is complete.

Task: Establish One Credential-Custody Unblock Board

id: CUST-WP-0051-T02
status: done
priority: high
state_hub_task_id: "312bde29-4370-4352-b5a3-00a8c4fe2059"

Collect the live operator-access decisions in one non-secret board.

Inputs:

  • CUST-WP-0049-T06: Inter-Hub admin access or deployment-side bootstrap path.
  • IHUB-WP-0022-T04: ops-hub runtime OPS_HUB_KEY custody.
  • NET-WP-0020: OpenBao unseal custody and SSH automation profile.
  • RAIL-HO-WP-0005: Forgejo hostname, SMTP, runner, backup, cutover, rollback, and retirement decisions.

Rules:

  • Do not put secrets in Git, State Hub, workplans, or chat.
  • Use warden route find / warden route show before requesting credentials.
  • Treat ops-warden as SSH certificate authority only, not as a secret store.

Done when each human/operator gate has an owner, approved route, expected execution host, non-secret evidence target, and fallback decision.
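
As an illustration of the done criterion, a board row could be modelled as a record whose completeness is checked mechanically. The class, the field names, and the example values below are hypothetical and are not taken from docs/credential-custody-unblock-board.md. The record deliberately has no field for a secret value.

```python
from dataclasses import dataclass, fields


@dataclass
class GateRecord:
    gate: str                    # e.g. a task id such as "IHUB-WP-0022-T04"
    owner: str = ""
    approved_route: str = ""     # route name only, never a credential
    execution_host: str = ""
    evidence_target: str = ""    # non-secret evidence location
    fallback_decision: str = ""


def missing_fields(record):
    """Return the names of fields that are still empty for this gate."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


# Hypothetical, partially filled row.
row = GateRecord(gate="IHUB-WP-0022-T04", owner="operator", approved_route="example-route")
print(missing_fields(row))
# ['execution_host', 'evidence_target', 'fallback_decision']
```

A gate would count as unblocked for planning purposes only when `missing_fields` returns an empty list.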

Completed 2026-06-27: added docs/credential-custody-unblock-board.md with route records, live gate owners, expected execution hosts, non-secret evidence targets, fallback decisions, and pickup order. Route lookup was verified through /home/worsch/ops-warden using uv run warden route show ... --json because the globally installed warden lacks the route subcommand.

Refined 2026-06-27: added docs/ops-warden-secret-posture-review.md and updated the unblock board/checkpoint to consume ops-warden's warden access assist boundary plus WARDEN-WP-0015 environment-posture/workload-maturity triage. This turns vague IT-security blockers into dev/test doubles, owner-routed production custody gates, or real maturity/posture violations.

Task: Close The Ops-Hub Inter-Hub Evidence Lane

id: CUST-WP-0051-T03
status: progress
priority: high
state_hub_task_id: "d6c3a39e-629d-47e4-b589-9e1a0273d9fa"

Finish the linked ops-hub activation chain:

  • Execute CUST-WP-0049-T06 using the approved access route.
  • Close CUST-WP-0047-T05 by proving ops-hub widgets exist and accept evidence events.
  • Unblock IHUB-WP-0022 by provisioning the runtime key through the approved secret path and running the end-to-end evidence submission smoke.

Done when ops inventory probes and activity-core evidence can land in Inter-Hub without manual SQL or secret exposure.

Progress 2026-06-27:

  • Added docs/ops-hub-interhub-evidence-lane-status.md with non-secret public probe evidence. Production Inter-Hub has an ops-hub row and the ops-hub seed vocabulary is visible on public registry endpoints.
  • Protected widget, manifest, and hub-registry surfaces correctly require authentication; no runtime-key smoke was attempted.
  • New blocker surfaced: the older IHUB-WP-0022 activity-core mapping contract names event types, policy scope, aggregate widget refs, and widget types that do not match the live ops-hub seed vocabulary. Align that contract before an attended bootstrap/runtime-key smoke, or the operator key may still hit manifest/schema failures.

Progress 2026-06-27 contract alignment:

  • Updated /home/worsch/inter-hub contract docs for IHUB-WP-0022 to target the live ops-hub seed vocabulary. Old ops-service-observed and ops-inventory-drift names are transition aliases, ops-access-path-checked is deferred to fallback until supported, and payload examples now post only live manifest event types.
  • Ran make fix-consistency REPO=inter-hub; it passed with pre-existing C-12 warnings and synced the IHUB-WP-0022 description drift into State Hub.
  • Remaining T03 gates are authenticated widget lookup, any missing backup/risk seed widget, runtime key custody, and protected submission smoke.

Progress 2026-06-27 Core Hub pivot:

  • Created CUST-WP-0052 to drive the reframe from old Inter-Hub production bootstrap toward Core Hub-owned replacement implementation.
  • Treat remaining Inter-Hub evidence as legacy compatibility or fallback evidence. Do not spend new design work on Haskell Inter-Hub unless it is needed for migration proof or rollback.
  • Next implementation lane should be Core Hub API first, CLI second, and web UI third, with whynot-design used for the rebuilt UI where practical.

Task: Stabilize Daily-Triage Automation

id: CUST-WP-0051-T04
status: progress
priority: high
state_hub_task_id: "42810d3b-5557-4efd-871b-65bef7c19e0e"

Finish the activity-core daily-triage reliability lane.

Sequence:

  1. Deploy the activity-wp-0016 robustness bundle: bounded prompt/schema, per-item parsing, quarantine lane, and producer guardrails.
  2. Run a schema-valid live daily-triage smoke on railiance01.
  3. Collect three clean scheduled runs with matching activity-core, State Hub, and working-memory evidence.
  4. Close activity-wp-0006 calibration and decide the fate of the CUST-WP-0045 cutover runbook registration.

Done when there is exactly one trusted daily triage runner and the fallback state is documented.
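
The per-item parsing and quarantine lane in step 1 can be sketched as follows. This is not the activity-wp-0016 implementation: the one-JSON-object-per-line framing, the `REQUIRED_KEYS` schema, and the function name are assumptions chosen to show how a malformed tail can degrade to partial output instead of failing the whole run.

```python
import json

# Hypothetical minimal item schema; the real schema lives in activity-core.
REQUIRED_KEYS = {"id", "summary"}


def parse_items(raw_output):
    """Parse one JSON object per line; quarantine lines that do not validate."""
    accepted, quarantined = [], []
    for line_no, line in enumerate(raw_output.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines between items
        try:
            item = json.loads(line)
        except json.JSONDecodeError as exc:
            quarantined.append({"line": line_no, "reason": f"invalid JSON: {exc.msg}"})
            continue
        if not isinstance(item, dict) or not REQUIRED_KEYS.issubset(item):
            quarantined.append({"line": line_no, "reason": "missing required keys"})
            continue
        accepted.append(item)
    return accepted, quarantined


# A truncated tail: the second item is cut off mid-string.
raw = '{"id": "a", "summary": "ok"}\n{"id": "b", "summ'
good, bad = parse_items(raw)
print(len(good), len(bad))  # 1 1
```

Quarantine entries record only the line number and a reason, so raw model output is not copied into the evidence trail.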

Progress 2026-06-27:

  • Added docs/daily-triage-stabilization-status.md with the current evidence chain. The 2026-06-24 and 2026-06-25 scheduled runs were schema-valid; the 2026-06-26 and 2026-06-27 runs reached State Hub and working memory but failed output validation around char 5.2k.
  • The primary blocker is no longer a silent schedule failure or a State Hub sink outage. The live runner still needs the ACTIVITY-WP-0016 code/schema bundle and the Railiance runtime prompt changes so that malformed tails degrade to quarantined partial output.
  • Pickup sequence: deploy WP-0016 code/schema together, update the runtime prompt bundle for bounded top-N/per-item framing/token headroom, run a live railiance01 smoke, then restart the three-clean-run gate.
  • Normalized ACTIVITY-WP-0016 source task status in activity-core: T04 is done and T05 is progress, matching its own progress notes.
  • Updated activity-core daily-triage source notes: ACTIVITY-WP-0010-T02 is now done, T03/T04 point at the post-WP-0016 live smoke and three-run gate, and ACTIVITY-WP-0006-T03 records the 2026-06-27 validation failure.
  • Cleared the stale human-needed flag from the completed bridge/config task and moved live intervention notes onto the deploy/smoke/calibration gate.
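
The three-clean-run gate can be checked with a small streak counter. This sketch assumes that "three clean scheduled runs" means three consecutive schema-valid runs ending at the most recent one. The function name is hypothetical, and the sample data is the run history recorded above.

```python
REQUIRED_CLEAN_RUNS = 3


def clean_run_streak(runs):
    """Count consecutive schema-valid runs, ending at the most recent run.

    `runs` is a list of (iso_date, schema_valid) pairs; ISO dates sort
    correctly as plain strings.
    """
    streak = 0
    for _date, schema_valid in sorted(runs, reverse=True):
        if not schema_valid:
            break
        streak += 1
    return streak


# Run history from the progress notes above.
runs = [
    ("2026-06-24", True),
    ("2026-06-25", True),
    ("2026-06-26", False),
    ("2026-06-27", False),
]
streak = clean_run_streak(runs)
print(streak, streak >= REQUIRED_CLEAN_RUNS)  # 0 False
```

Under this reading, the two earlier valid runs do not count toward the gate once a later run fails, which matches the note that the gate must be restarted after the live smoke.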

Task: Finish Near-Term Production Service Lanes

id: CUST-WP-0051-T05
status: progress
priority: medium
state_hub_task_id: "2083f0e4-e037-48bf-8069-f31e8db2fd95"

Move near-complete service workstreams to done before starting larger migrations.

Priority order:

  • issue-wp-0003: finish activity-core wiring and end-to-end GitOps runbook.
  • rail-ho-wp-0005: resolve Forgejo production decisions, email recovery, and cutover approval gates.
  • artifact-store-wp-0007: complete MinIO compatibility and STS credential vending assessment if it is required by backup, registry, or app lanes.
  • staged-promotion-lifecycle: make production promotion gates explicit before further cluster/source-forge cutovers.

Done when each lane is either finished or parked with a precise dependency and no ambiguous human-needed state.

Progress 2026-06-27:

  • Added docs/near-term-production-service-lanes-status.md with a lane board for issue-core, Forgejo, artifact-store, and staged promotion.
  • issue-core is the immediate near-done lane: the service itself is healthy, but activity-core still points at port 8010 and ISSUE_SINK_TYPE=null. Do not flip it to REST until ISSUE_CORE_API_KEY is injected into activity-core's runtime secret via route activity-core-issue-sink.
  • Forgejo remains parked behind explicit production design decisions, SMTP/email recovery, package registry, Actions, backup/restore, migration drill, and cutover approval.
  • artifact-store and staged promotion are executable planning/build lanes: artifact-store D7.1/D7.2 remain open; staged-promotion T02 is now complete, ahead of any broad production source-forge migration work.
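
The issue-core rule above ("do not flip to REST until the key is injected") can be expressed as a fail-closed guard. This is a sketch only: the function name and the literal value `"rest"` are assumptions, and the real activity-core configuration loader is not shown in this workplan. The variable names `ISSUE_SINK_TYPE` and `ISSUE_CORE_API_KEY` come from the note above.

```python
def safe_sink_type(env):
    """Return the sink type to use, falling back to "null" when REST is unsafe."""
    requested = env.get("ISSUE_SINK_TYPE", "null")
    has_key = bool(env.get("ISSUE_CORE_API_KEY", "").strip())
    if requested == "rest" and not has_key:
        # Fail closed: a REST sink without a key would only produce auth errors.
        return "null"
    return requested


print(safe_sink_type({"ISSUE_SINK_TYPE": "rest"}))  # null
print(safe_sink_type({"ISSUE_SINK_TYPE": "rest", "ISSUE_CORE_API_KEY": "set-by-secret-path"}))  # rest
```

The guard only tests for the presence of the key and never logs or returns its value.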

Progress 2026-06-27 staged promotion:

  • Completed RAIL-BS-WP-0006-T02 in /home/worsch/railiance-cluster. Added docs/app-toml-contract.md, schemas/railiance-app.schema.json, and examples/railiance/app.toml, defining the repository-local railiance/app.toml declaration for identity, ownership, source/artifact policy, platform dependencies, secret references without plaintext values, observability, stage commands/checks/evidence, canary/promotion modes, rollback, and human approval gates.
  • make fix-consistency REPO=railiance-cluster passed with pre-existing C-12 warnings and synced the T02 status into State Hub.
  • T02 through T07 are complete; the staged-promotion lifecycle is finished.

Progress 2026-06-27 staged promotion T03:

  • Completed RAIL-BS-WP-0006-T03 in /home/worsch/railiance-cluster. Added docs/overlay-repo-pattern.md, tools/create_railiance_overlay_repo.sh, and the bin/railiance create-overlay dispatcher entry. The scaffold writes a separate overlay repo with railiance/upstream.toml, schema-valid railiance/app.toml, stage values, a thin Helm chart, Stage 1 test script, rollback runbook, and promotion notes without cloning upstream code or handling secrets.
  • Verified the generated Forgejo overlay sample against schemas/railiance-app.schema.json; generated Stage 1 script ran with Helm skipped because Helm is unavailable in this environment.
  • make fix-consistency REPO=railiance-cluster passed with pre-existing C-12 warnings and synced the T03 status into State Hub.

Progress 2026-06-27 staged promotion T04:

  • Completed RAIL-BS-WP-0006-T04 in /home/worsch/railiance-cluster. Added tools/cmd/railiance-run, the bin/railiance run dispatcher entry, and docs/railiance-run-command.md. The command reads railiance/app.toml, runs Stage 1 commands and local checks, and emits a railiance.run-result.v1 JSON result with command references and scrubbed HTTP URLs rather than command logs, stdout/stderr, or secret-bearing URL details.
  • Updated generated overlays so a Forgejo overlay completes Stage 1 locally: stage1-script is required, local-health is optional when no local service is running, and Helm rendering remains optional when Helm is unavailable.
  • Verified a fresh generated Forgejo overlay against schemas/railiance-app.schema.json and bin/railiance run; the smoke passed with one command, two checks, and zero required failures.
  • make fix-consistency REPO=railiance-cluster passed with pre-existing C-12 warnings and synced the T04 status into State Hub.
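
To illustrate what "scrubbed HTTP URLs" could mean in a run result, the sketch below drops userinfo, query string, and fragment and keeps only scheme, host, port, and path. The actual scrubbing rule in tools/cmd/railiance-run is not shown here, so the function name and the exact policy are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def scrub_url(url):
    """Return the URL without credentials, query string, or fragment."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if ":" in host:
        host = f"[{host}]"  # restore brackets around IPv6 literals
    if parts.port:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, "", ""))


print(scrub_url("https://user:pw@example.test:8443/health?token=abc#frag"))
# https://example.test:8443/health
```

Rebuilding the URL from an allow-list of components is the safer choice here, because anything not explicitly kept cannot leak into evidence.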

Progress 2026-06-27 staged promotion T05:

  • Completed RAIL-BS-WP-0006-T05 in /home/worsch/railiance-cluster. Generated overlays now include a Stage 2 canary Helm template with stable/canary release identities, isolated ingress by default, optional Traefik weighted routing, Prometheus annotations, HTTP probes, conservative resource limits, rollback-safe Stage 2/Stage 3 values, and tests/stage2-template.sh.
  • Verified a fresh generated Forgejo overlay with schema validation, tests/stage1.sh, tests/stage2-template.sh, and bin/railiance run. Helm rendering was skipped because Helm is unavailable in this environment.
  • make fix-consistency REPO=railiance-cluster passed with pre-existing C-12 warnings and synced the T05 status into State Hub.

Progress 2026-06-27 staged promotion T06:

  • Completed RAIL-BS-WP-0006-T06 in /home/worsch/railiance-cluster. Added tools/cmd/railiance-stage2 plus bin/railiance deploy and bin/railiance observe dispatch. Both commands default to non-mutating plans; apply/live modes fail closed on missing prerequisites.
  • Verified a fresh generated Forgejo overlay with schema validation, tests/stage1.sh, tests/stage2-template.sh, Stage 2 deploy plan, Stage 2 observe plan, and blocked apply without approval/Helm.
  • make fix-consistency REPO=railiance-cluster passed with pre-existing C-12 warnings and synced the T06 status into State Hub.
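
The "plan by default, fail closed on apply" behaviour can be sketched as a pure decision function. The names, the result shape, and the two prerequisites (recorded approval and Helm availability) are assumptions based on the verification note above, not the interface of tools/cmd/railiance-stage2.

```python
def stage2_action(mode="plan", approval_recorded=False, helm_available=False):
    """Decide whether a Stage 2 invocation may mutate the cluster."""
    if mode == "plan":
        # Default path: describe what would happen, change nothing.
        return {"mode": "plan", "mutating": False, "blocked_by": []}
    blocked_by = []
    if mode != "apply":
        blocked_by.append(f"unknown mode: {mode}")
    if not approval_recorded:
        blocked_by.append("approval")
    if not helm_available:
        blocked_by.append("helm")
    # Mutation is allowed only when nothing at all is missing.
    return {"mode": mode, "mutating": not blocked_by, "blocked_by": blocked_by}


print(stage2_action())
print(stage2_action("apply"))
print(stage2_action("apply", approval_recorded=True, helm_available=True))
```

Unknown modes are treated as blocked rather than as plans, so a typo in the mode can never fall through to a mutating path.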

Progress 2026-06-27 staged promotion T07 and finish:

  • Completed RAIL-BS-WP-0006-T07 in /home/worsch/railiance-cluster. Added tools/cmd/railiance-stage3, bin/railiance promote, bin/railiance rollback, and docs/promote-rollback-onboarding.md. Generated overlays now declare promote/rollback plan commands.
  • Verified a fresh generated Forgejo overlay through Stage 1 run, Stage 2 deploy/observe plans, Stage 3 promote/rollback plans, and blocked apply paths for missing approval/Helm/revision evidence.
  • Marked RAIL-BS-WP-0006 status: finished; make fix-consistency REPO=railiance-cluster synced the finished workstream with only pre-existing C-12 orphan-row warnings.

Task: Decide State Hub Migration Strategy

id: CUST-WP-0051-T06
status: progress
priority: high
state_hub_task_id: "0ac3763f-eac0-4773-9be8-cb0a7979e444"

Choose and execute the State Hub stabilization path.

Decision:

  • If pragmatic railiance01 service is enough for the next operating period, finish CUST-WP-0011: cutover MCP config, observe the stabilization window, then retire or retain WSL2 fallback by explicit decision.
  • If HA is now required, promote CUST-WP-0038 and the ThreePhoenix HA cluster lane: readiness, storage/database strategy, HA API behavior, failover drill, restore drill, and endpoint/runbook update.

Done when the active State Hub path is singular, tested, and documented, and the alternate path is either cancelled, deferred, or explicitly retained as a future workplan.

Progress 2026-06-27:

  • Added docs/state-hub-migration-strategy-status.md and selected the pragmatic CUST-WP-0011 railiance01 path as the singular active State Hub stabilization lane.
  • CUST-WP-0011 is already through T01-T06: image pushed, cluster manifests defined, empty deploy healthy, migrations run, WSL2 data restored, row counts compared, and cluster API health/summary verified.
  • Next gate is CUST-WP-0011-T07: explicit approval to freeze WSL2 writes, restore the final dump, compare again, and redirect MCP/private access to the cluster endpoint.
  • CUST-WP-0038 and RAIL-BS-WP-0007 remain deferred HA lanes until the pragmatic path stabilizes and ThreePhoenix storage/database strategy is current.
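
The "restore the final dump, compare again" step in the T07 gate amounts to a per-table row-count comparison. The sketch below assumes the counts have already been collected into dictionaries. The function name and the table names are hypothetical.

```python
def row_count_diff(source_counts, target_counts):
    """Return {table: (source, target)} for every table whose counts differ.

    A table missing on one side is reported with None for that side.
    """
    diffs = {}
    for table in sorted(set(source_counts) | set(target_counts)):
        src = source_counts.get(table)
        dst = target_counts.get(table)
        if src != dst:
            diffs[table] = (src, dst)
    return diffs


# Hypothetical counts, not real State Hub data.
wsl2 = {"tasks": 120, "workstreams": 15, "events": 900}
cluster = {"tasks": 120, "workstreams": 14}
print(row_count_diff(wsl2, cluster))
# {'events': (900, None), 'workstreams': (15, 14)}
```

An empty result after the final restore would be the non-secret evidence that the cluster copy matches the frozen WSL2 source at the row-count level; it does not prove that the row contents match.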

Task: Sequence FOS Hub Bootstrap To Completion

id: CUST-WP-0051-T07
status: progress
priority: medium
state_hub_task_id: "27b6828a-9e87-4135-a036-bce760c3057c"

Use the stabilized substrate to finish CUST-WP-0025 without reviving the mega-hub pattern.

Recommended order:

  1. Finish identity foundations: NK-WP-0001, NK-WP-0002, then the IAM profile integration test.
  2. Create the standalone ops-hub repo from hub-core and ingest the inventory artifacts from CUST-WP-0047.
  3. Add ops models, MCP tools, Railiance integration, dev-hub coupling, dashboard, and MCP registration.
  4. Only then start the fin-hub/business-model tasks.

Done when CUST-WP-0025 has no open foundational identity or ops-hub tasks and fin-hub work is either started on a stable scaffold or deliberately deferred.

Progress 2026-06-27:

  • Added docs/fos-hub-bootstrap-sequence-status.md with the current sequence.
  • Corrected the identity foundation baseline in CUST-WP-0025: the old NK-WP-0001 Keycloak task is cancelled as superseded, NK-WP-0002 local identity is done, and the remaining identity gate is the IAM Profile v0.2 FastAPI integration test.
  • Current ops-hub reality is extension-first: ops-hub exists, OPS-WP-0001 is finished, and OPS-WP-0002 waits on authenticated Inter-Hub bootstrap/runtime-key evidence. Reconcile CUST-WP-0025-T13-T19 after the first governed ops event lands.
  • Fin-hub/business tasks remain deliberately deferred until identity integration and ops-hub extension evidence are proven.

Progress 2026-06-27 Core Hub reset:

  • CUST-WP-0052 now owns the reset criteria. CUST-WP-0025-T13 through T19 should not be executed literally as the old standalone ops-hub scaffold until Core Hub replacement evidence is good enough and the tasks are rewritten.
  • Core Hub is promising enough to stop expanding the Inter-Hub-first path: local ops-hub bootstrap compatibility and /console visual checks exist, but staging import, deployed dual-run smokes, and cutover evidence are still open.

Task: Create The Stable Pickup Checkpoint

id: CUST-WP-0051-T08
status: done
priority: high
state_hub_task_id: "2cc0a127-a749-4228-962e-f8c9b693a1b3"

Close this metaplan by creating an operator-friendly checkpoint.

Minimum contents:

  • active workstream list with zero stale runbooks and zero contradictory task states;
  • blocker board showing no unowned credential, access, or approval gates;
  • daily automation evidence from the latest successful scheduled run;
  • production service status summary for State Hub, Inter-Hub, ops-hub evidence, issue-core, Forgejo, and artifact-store;
  • explicit next-pick list for remaining strategic tasks.

Done when a future agent can start from the checkpoint and choose the next workplan without reconstructing this review.

Completed 2026-06-27: added docs/infrastructure-stabilization-pickup-checkpoint.md with the live active workstream list, named blocker board, latest daily-triage evidence, production service status summary, and next-pick sequence. This closes the handoff surface for future agents while the child workplans remain the execution source of truth.