---
id: CUST-WP-0051
type: workplan
title: "Infrastructure Stabilization Metaplan"
domain: infotech
repo: the-custodian
status: active
owner: codex
topic_slug: custodian
planning_priority: high
planning_order: 51
created: "2026-06-27"
updated: "2026-06-27"
state_hub_workstream_id: "21cabc98-3f80-4d00-b3b7-06e2ac2af88f"
---
# CUST-WP-0051 - Infrastructure Stabilization Metaplan
## Goal
Drive the registered infrastructure workplans from a scattered blocked state to
a stable checkpoint where:
- active blockers have a named owner, route, and next command or decision;
- production credential work uses approved custody paths only;
- daily operational automation has one healthy runner and clean evidence;
- State Hub registration reflects the real file state;
- unfinished strategic work is sequenced into clear follow-on lanes.
This workplan does not replace the child workplans. It is the coordination lane
for removing cross-workplan blocks and creating a reliable handoff point.
## Review Snapshot
Reviewed on 2026-06-27 from State Hub and the repo workplan files.
Active registered workstreams with open work:
| Workstream | Open state | Main stabilization meaning |
| --- | --- | --- |
| artifact-store-wp-0007 | 5 todo | Object-store compatibility and STS credential vending lane. |
| ihub-wp-0022 | 3 wait, 5 done | Ops-hub evidence intake waits on widget seed/runtime key/smoke. |
| cust-wp-0047 | 1 wait, 6 done | Ops-hub now view waits on Inter-Hub widget activation. |
| cust-wp-0049 | 1 wait, 5 done | Access lane is ready; live bootstrap needs approved admin execution. |
| activity-wp-0016 | 1 wait, 2 progress, 5 todo, 2 done | Daily-triage output robustness needs live deploy/smoke evidence. |
| three-phoenix-ha-cluster | 7 todo | Target HA substrate is planned but not executed. |
| staged-promotion-lifecycle | finished, 7 done | Promotion discipline ready for broad production cutovers. |
| rail-ho-wp-0005 | 11 todo, 1 progress | Forgejo production migration needs human design and cutover decisions. |
| cust-wp-0045-cutover-runbook | 0 tasks | Registered runbook is appearing as an active no-task workstream. |
| net-wp-0020 | 2 wait, 1 todo, 2 done | OpenBao unseal custody models still need operator profile decisions. |
| issue-wp-0003 | 2 progress, 5 done | issue-core deploy is close; finish live wiring and runbook evidence. |
| activity-wp-0006 | 1 wait, 1 todo, 6 done | Three-run calibration waits on the daily-triage live gate. |
| cust-wp-0038 | 8 todo | Full ThreePhoenix State Hub HA migration remains strategic follow-on. |
| cust-wp-0025 | 17 todo, 9 done | FOS hub bootstrap now depends on identity, ops-hub, and fin-hub lanes. |
| cust-wp-0011 | 3 todo, 6 done | Pragmatic State Hub railiance01 migration still needs cutover/stabilize/retire. |
Additional repo-local hygiene issue:
- `CUST-WP-0014` has frontmatter `status: done`, but all six task blocks are
still `todo`. Either archive it as superseded or reopen it as a focused State
Hub sync-health workplan.
State Hub hygiene issue:
- Stale `needs_human` flags remain on completed or cancelled tasks. Not all of
them block execution, but they add noise to the operator view and should be
cleared or annotated after the source workplans are reconciled.
## Dependency Shape
The critical path is:
1. Credential and operator-access custody:
OpenBao, Inter-Hub operator key, ops-hub runtime key, Forgejo SMTP/cutover
approvals, and OpenBao unseal profile decisions.
2. Ops evidence and daily automation:
Inter-Hub ops-hub records, activity-core daily-triage robustness deployment,
schema-valid smoke, then three clean scheduled runs.
3. Production substrate and source forge:
issue-core GitOps pilot, Forgejo production migration, artifact-store STS,
staged promotion, and State Hub migration strategy.
4. Federation buildout:
identity completion, Core Hub replacement evidence, ops-hub scaffold reset,
fin-hub scaffold, and business/runway canon.
## Task: Normalize Registry And Workplan Hygiene
```task
id: CUST-WP-0051-T01
status: done
priority: high
state_hub_task_id: "7e83bd50-5ca2-4341-9d18-65512e3f0442"
```
Clean up the planning substrate before execution work resumes.
Minimum scope:
- Decide whether `CUST-WP-0045-cutover-runbook` should stay registered as an
active workstream or be represented only as a runbook under `CUST-WP-0045`.
- Resolve `CUST-WP-0014`: archive as superseded, or reopen and re-scope the six
remaining State Hub sync-health tasks.
- Clear or annotate stale `needs_human` flags on done/cancel tasks after source
workplans confirm they are no longer live gates.
- Run State Hub consistency after file changes.
Done when the active workstream list no longer contains no-task runbooks or
contradictory done-with-todo files, and the human-needed view shows only live
human gates.
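
The "contradictory done-with-todo" condition can be checked mechanically. The following is a minimal sketch, assuming the layout used in this file (frontmatter between `---` lines and fenced `task` blocks that each carry a `status:` line). The function name and the sample workplan are hypothetical and are not part of `make fix-consistency`.

```python
import re

FENCE = "`" * 3  # built indirectly so the sample can live inside a fenced block

FRONTMATTER = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)
TASK_BLOCK = re.compile(FENCE + r"task\n(.*?)" + FENCE, re.DOTALL)
STATUS_LINE = re.compile(r"^status:[ \t]*(\S+)[ \t]*$", re.MULTILINE)

OPEN_TASK_STATUSES = {"todo", "progress", "wait"}
CLOSED_PLAN_STATUSES = {"done", "finished"}


def open_tasks_in_closed_workplan(text: str) -> list[str]:
    """Return the open task statuses found in a workplan marked as closed."""
    front = FRONTMATTER.search(text)
    if front is None:
        return []
    plan_status = STATUS_LINE.search(front.group(1))
    if plan_status is None or plan_status.group(1) not in CLOSED_PLAN_STATUSES:
        return []
    found = []
    for block in TASK_BLOCK.findall(text):
        task_status = STATUS_LINE.search(block)
        if task_status and task_status.group(1) in OPEN_TASK_STATUSES:
            found.append(task_status.group(1))
    return found


# Hypothetical workplan: frontmatter says done, but one task is still todo.
SAMPLE = "\n".join([
    "---",
    "id: EXAMPLE-WP-0001",
    "status: done",
    "---",
    "## Task: Example",
    FENCE + "task",
    "id: EXAMPLE-WP-0001-T01",
    "status: todo",
    FENCE,
])

print(open_tasks_in_closed_workplan(SAMPLE))  # ['todo']
```

An empty list means the file is not contradictory in this sense; any other result names the open statuses that conflict with the closed frontmatter.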
Progress 2026-06-27:
- `CUST-WP-0045-cutover-runbook` now has `status: finished`; State Hub no
longer lists it as an active workstream.
- `CUST-WP-0014` is reopened as `backlog` with its task detail preserved, so it
is no longer a contradictory done-with-todo file or an active queue item.
- `make fix-consistency REPO=the-custodian` passed with pre-existing C-12
warnings and synced the lifecycle changes into State Hub.
Completed 2026-06-27: cleared 15 stale `needs_human` flags from tasks that
were already `done` or `cancel`, leaving live `todo`/`progress`/`wait` human
gates untouched. T01 is complete.
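
The flag cleanup rule above (clear `needs_human` only on `done` or `cancel` tasks, leave live gates alone) can be sketched as a pure function. This is an illustration only: the task dictionaries and the function name are assumptions, not the State Hub data model or API.

```python
CLOSED_STATUSES = {"done", "cancel"}


def clear_stale_needs_human(tasks):
    """Return (updated_tasks, cleared_count) without mutating the input."""
    updated, cleared = [], 0
    for task in tasks:
        if task.get("needs_human") and task["status"] in CLOSED_STATUSES:
            # Stale flag: the task is closed, so no human gate can still be live.
            task = {**task, "needs_human": False}
            cleared += 1
        updated.append(task)
    return updated, cleared


# Hypothetical task rows for illustration.
tasks = [
    {"id": "EX-T01", "status": "done", "needs_human": True},
    {"id": "EX-T02", "status": "wait", "needs_human": True},
    {"id": "EX-T03", "status": "cancel", "needs_human": False},
]
updated, cleared = clear_stale_needs_human(tasks)
print(cleared)  # 1
```

The `wait` task keeps its flag because it is still a live human gate.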
## Task: Establish One Credential-Custody Unblock Board
```task
id: CUST-WP-0051-T02
status: done
priority: high
state_hub_task_id: "312bde29-4370-4352-b5a3-00a8c4fe2059"
```
Collect the live operator-access decisions in one non-secret board.
Inputs:
- `CUST-WP-0049-T06`: Inter-Hub admin access or deployment-side bootstrap path.
- `IHUB-WP-0022-T04`: ops-hub runtime `OPS_HUB_KEY` custody.
- `NET-WP-0020`: OpenBao unseal custody and SSH automation profile.
- `RAIL-HO-WP-0005`: Forgejo hostname, SMTP, runner, backup, cutover, rollback,
and retirement decisions.
Rules:
- Do not put secrets in Git, State Hub, workplans, or chat.
- Use `warden route find` / `warden route show` before requesting credentials.
- Treat ops-warden as SSH certificate authority only, not as a secret store.
Done when each human/operator gate has an owner, approved route, expected
execution host, non-secret evidence target, and fallback decision.
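
The done condition lists five attributes every gate must carry. A minimal sketch of a completeness check follows, assuming one record per gate. The class, the field names, and the example values are hypothetical and do not reflect the actual board format in `docs/credential-custody-unblock-board.md`.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GateRecord:
    """One non-secret row on the unblock board (hypothetical shape)."""
    gate_id: str
    owner: str
    approved_route: str
    execution_host: str
    evidence_target: str
    fallback_decision: str


def missing_fields(record: GateRecord) -> list[str]:
    """Name every attribute that is still blank for this gate."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


# Placeholder values only; no real owners, routes, or hosts are shown.
record = GateRecord(
    gate_id="EXAMPLE-GATE",
    owner="example-operator",
    approved_route="example-route",
    execution_host="example-host",
    evidence_target="docs/example-evidence.md",
    fallback_decision="",
)
print(missing_fields(record))  # ['fallback_decision']
```

A gate counts as ready for pickup only when `missing_fields` returns an empty list. The record deliberately has no field that could hold a secret value.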
Completed 2026-06-27: added `docs/credential-custody-unblock-board.md` with
route records, live gate owners, expected execution hosts, non-secret evidence
targets, fallback decisions, and pickup order. Route lookup was verified through
`/home/worsch/ops-warden` using `uv run warden route show ... --json` because
the globally installed `warden` lacks the `route` subcommand.
Refined 2026-06-27: added `docs/ops-warden-secret-posture-review.md` and updated
the unblock board/checkpoint to consume ops-warden's `warden access` assist
boundary plus WARDEN-WP-0015 environment-posture/workload-maturity triage. This
turns vague IT-security blockers into dev/test doubles, owner-routed production
custody gates, or real maturity/posture violations.
## Task: Close The Ops-Hub Inter-Hub Evidence Lane
```task
id: CUST-WP-0051-T03
status: progress
priority: high
state_hub_task_id: "d6c3a39e-629d-47e4-b589-9e1a0273d9fa"
```
Finish the linked ops-hub activation chain:
- Execute `CUST-WP-0049-T06` using the approved access route.
- Close `CUST-WP-0047-T05` by proving ops-hub widgets exist and accept evidence
events.
- Unblock `IHUB-WP-0022` by provisioning the runtime key through the approved
secret path and running the end-to-end evidence submission smoke.
Done when ops inventory probes and activity-core evidence can land in Inter-Hub
without manual SQL or secret exposure.
Progress 2026-06-27:
- Added `docs/ops-hub-interhub-evidence-lane-status.md` with non-secret public
probe evidence. Production Inter-Hub has an `ops-hub` row and the ops-hub seed
vocabulary is visible on public registry endpoints.
- Protected widget, manifest, and hub-registry surfaces correctly require
authentication; no runtime-key smoke was attempted.
- New blocker surfaced: the older `IHUB-WP-0022` activity-core mapping contract
names event types, policy scope, aggregate widget refs, and widget types that
do not match the live ops-hub seed vocabulary. Align that contract before an
attended bootstrap/runtime-key smoke, or the operator key may still hit
manifest/schema failures.
Progress 2026-06-27 contract alignment:
- Updated `/home/worsch/inter-hub` contract docs for `IHUB-WP-0022` to target
the live ops-hub seed vocabulary. Old `ops-service-observed` and
`ops-inventory-drift` names are transition aliases, `ops-access-path-checked`
is deferred to fallback until supported, and payload examples now post only
live manifest event types.
- Ran `make fix-consistency REPO=inter-hub`; it passed with pre-existing C-12
warnings and synced the IHUB-WP-0022 description drift into State Hub.
- The remaining T03 gates are authenticated widget lookup, any missing
backup/risk seed widget, runtime key custody, and the protected submission smoke.
Progress 2026-06-27 Core Hub pivot:
- Created `CUST-WP-0052` to drive the reframe from old Inter-Hub production
bootstrap toward Core Hub-owned replacement implementation.
- Treat remaining Inter-Hub evidence as legacy compatibility or fallback
evidence. Do not spend new design work on Haskell Inter-Hub unless it is
needed for migration proof or rollback.
- Next implementation lane should be Core Hub API first, CLI second, and web UI
third, with whynot-design used for the rebuilt UI where practical.
## Task: Stabilize Daily-Triage Automation
```task
id: CUST-WP-0051-T04
status: progress
priority: high
state_hub_task_id: "42810d3b-5557-4efd-871b-65bef7c19e0e"
```
Finish the activity-core daily-triage reliability lane.
Sequence:
1. Deploy the `activity-wp-0016` robustness bundle: bounded prompt/schema,
per-item parsing, quarantine lane, and producer guardrails.
2. Run a schema-valid live daily-triage smoke on railiance01.
3. Collect three clean scheduled runs with matching activity-core, State Hub,
and working-memory evidence.
4. Close `activity-wp-0006` calibration and decide the fate of the
`CUST-WP-0045` cutover runbook registration.
Done when there is exactly one trusted daily triage runner and the fallback
state is documented.
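
The three-clean-run gate in step 3 amounts to counting consecutive schema-valid runs, newest first. A minimal sketch follows, using the run outcomes recorded in this task's progress notes. The function name and the tuple layout are assumptions, and "clean" is reduced here to schema validity only.

```python
from datetime import date

REQUIRED_CLEAN_RUNS = 3


def clean_run_streak(runs):
    """Count consecutive clean runs, starting from the most recent one."""
    streak = 0
    for _run_date, schema_valid in sorted(runs, reverse=True):
        if not schema_valid:
            break  # any failed run resets the gate
        streak += 1
    return streak


# (run date, schema-valid) pairs taken from the progress notes of this task.
runs = [
    (date(2026, 6, 24), True),
    (date(2026, 6, 25), True),
    (date(2026, 6, 26), False),
    (date(2026, 6, 27), False),
]
streak = clean_run_streak(runs)
print(streak, streak >= REQUIRED_CLEAN_RUNS)  # 0 False
```

Because the two most recent runs failed validation, the earlier valid runs do not count toward the gate, which matches the need to restart it after the next successful smoke.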
Progress 2026-06-27:
- Added `docs/daily-triage-stabilization-status.md` with the current evidence
chain. The 2026-06-24 and 2026-06-25 scheduled runs were schema-valid; the
2026-06-26 and 2026-06-27 runs reached State Hub and working memory but failed
output validation roughly 5,200 characters into the output.
- Current primary blocker is no longer a silent schedule or State Hub sink
outage. The live runner still needs the `ACTIVITY-WP-0016` code/schema bundle
and Railiance runtime prompt changes so malformed tails degrade to quarantined
partial output.
- Pickup sequence: deploy WP-0016 code/schema together, update the runtime
prompt bundle for bounded top-N/per-item framing/token headroom, run a live
railiance01 smoke, then restart the three-clean-run gate.
- Normalized ACTIVITY-WP-0016 source task status in activity-core: T04 is done
and T05 is progress, matching its own progress notes.
- Updated activity-core daily-triage source notes: ACTIVITY-WP-0010-T02 is
now done, T03/T04 point at the post-WP-0016 live smoke and three-run gate,
and ACTIVITY-WP-0006-T03 records the 2026-06-27 validation failure.
- Cleared the stale human-needed flag from the completed bridge/config task and
moved live intervention notes onto the deploy/smoke/calibration gate.
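
The intended degradation (a malformed tail yields quarantined partial output instead of a failed run) can be illustrated with per-item parsing. This is a sketch under stated assumptions: it assumes one JSON object per line and two required keys, neither of which is confirmed as the actual `ACTIVITY-WP-0016` framing or schema.

```python
import json


def parse_items(raw_output, required_keys=("id", "summary")):
    """Split model output into accepted items and quarantined lines."""
    accepted, quarantined = [], []
    for line_number, line in enumerate(raw_output.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            item = json.loads(line)
        except json.JSONDecodeError as exc:
            quarantined.append({"line": line_number, "reason": f"invalid JSON: {exc.msg}"})
            continue
        if not isinstance(item, dict) or any(key not in item for key in required_keys):
            quarantined.append({"line": line_number, "reason": "missing required keys"})
            continue
        accepted.append(item)
    return accepted, quarantined


# The second line is cut off mid-object, as a truncated tail would be.
raw = '{"id": "a", "summary": "first item"}\n{"id": "b", "summ'
accepted, quarantined = parse_items(raw)
print(len(accepted), len(quarantined))  # 1 1
```

With whole-document validation the same input would fail entirely; with per-item framing the first item survives and only the truncated tail lands in quarantine.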
## Task: Finish Near-Term Production Service Lanes
```task
id: CUST-WP-0051-T05
status: progress
priority: medium
state_hub_task_id: "2083f0e4-e037-48bf-8069-f31e8db2fd95"
```
Move near-complete service workstreams to done before starting larger migrations.
Priority order:
- `issue-wp-0003`: finish activity-core wiring and end-to-end GitOps runbook.
- `rail-ho-wp-0005`: resolve Forgejo production decisions, email recovery, and
cutover approval gates.
- `artifact-store-wp-0007`: complete MinIO compatibility and STS credential
vending assessment if it is required by backup, registry, or app lanes.
- `staged-promotion-lifecycle`: make production promotion gates explicit before
further cluster/source-forge cutovers.
Done when each lane is either finished or parked with a precise dependency and
no ambiguous human-needed state.
Progress 2026-06-27:
- Added `docs/near-term-production-service-lanes-status.md` with a lane board
for issue-core, Forgejo, artifact-store, and staged promotion.
- issue-core is the immediate near-done lane: the service itself is healthy, but
activity-core still points at port `8010` and `ISSUE_SINK_TYPE=null`. Do not
flip it to REST until `ISSUE_CORE_API_KEY` is injected into activity-core's
runtime secret via route `activity-core-issue-sink`.
- Forgejo remains parked behind explicit production design decisions, SMTP/email
recovery, package registry, Actions, backup/restore, migration drill, and
cutover approval.
- artifact-store and staged promotion are executable planning/build lanes:
artifact-store D7.1/D7.2 remain open; staged-promotion T02 is now complete,
ahead of broad production source-forge migration work.
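
The issue-core rule above (keep the null sink until the API key is present) can be expressed as a pre-flight check. This is a hedged sketch: the variable names come from the notes above, but the `rest` value and the function itself are assumptions about how activity-core selects its sink.

```python
def resolve_issue_sink(env):
    """Return the sink type to use, refusing REST without an injected key."""
    requested = env.get("ISSUE_SINK_TYPE", "null")
    if requested == "rest" and not env.get("ISSUE_CORE_API_KEY"):
        # Fail loudly rather than start a REST sink that cannot authenticate.
        raise RuntimeError(
            "ISSUE_SINK_TYPE=rest requires ISSUE_CORE_API_KEY in the runtime secret"
        )
    return requested


print(resolve_issue_sink({"ISSUE_SINK_TYPE": "null"}))  # null
```

In a real service the mapping would be `os.environ`; a plain dictionary is used here so the check stays testable and never reads or prints a secret value.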
Progress 2026-06-27 artifact-store D7.1/D7.2:
- Advanced `/home/worsch/artifact-store` `ARTIFACT-STORE-WP-0007`: D7.1 is
done with `docs/minio-compatibility-landscape-2026-06-27.md`, deciding to
pursue a compatibility profile instead of a direct MaxIO server fork.
- D7.2 is now `progress` with an opt-in live MinIO compatibility pytest harness
(`tests/integration/test_storage_s3_minio.py`), `make test-minio`, and manual
smoke docs in `docs/OPERATOR.md`.
- Verified artifact-store with `make test` (`110 passed, 2 skipped`), targeted
Ruff checks for the new harness, direct harness execution (`2 skipped` without
endpoint variables), and `git diff --check`. Repo-wide `make lint` still
reports pre-existing Ruff format drift in seven untouched files.
- Remaining artifact-store gate is live evidence: run D7.2 against an approved
MinIO-compatible endpoint with non-secret health, round-trip, and multipart
output. D7.3 STS vending remains identity/platform-routed work.
Progress 2026-06-27 staged promotion:
- Completed `RAIL-BS-WP-0006-T02` in `/home/worsch/railiance-cluster`.
Added `docs/app-toml-contract.md`, `schemas/railiance-app.schema.json`,
and `examples/railiance/app.toml`, defining the repository-local
`railiance/app.toml` declaration for identity, ownership, source/artifact
policy, platform dependencies, secret references without plaintext values,
observability, stage commands/checks/evidence, canary/promotion modes,
rollback, and human approval gates.
- `make fix-consistency REPO=railiance-cluster` passed with pre-existing
C-12 warnings and synced the T02 status into State Hub.
- T02 through T07 are complete; the staged-promotion lifecycle is finished.
Progress 2026-06-27 staged promotion T03:
- Completed `RAIL-BS-WP-0006-T03` in `/home/worsch/railiance-cluster`.
Added `docs/overlay-repo-pattern.md`,
`tools/create_railiance_overlay_repo.sh`, and the `bin/railiance
create-overlay` dispatcher entry. The scaffold writes a separate overlay
repo with `railiance/upstream.toml`, schema-valid `railiance/app.toml`,
stage values, a thin Helm chart, Stage 1 test script, rollback runbook, and
promotion notes without cloning upstream code or handling secrets.
- Verified the generated Forgejo overlay sample against
`schemas/railiance-app.schema.json`; generated Stage 1 script ran with Helm
skipped because Helm is unavailable in this environment.
- `make fix-consistency REPO=railiance-cluster` passed with pre-existing
C-12 warnings and synced the T03 status into State Hub.
Progress 2026-06-27 staged promotion T04:
- Completed `RAIL-BS-WP-0006-T04` in `/home/worsch/railiance-cluster`.
Added `tools/cmd/railiance-run`, the `bin/railiance run` dispatcher entry,
and `docs/railiance-run-command.md`. The command reads `railiance/app.toml`,
runs Stage 1 commands and local checks, and emits a
`railiance.run-result.v1` JSON result with command references and scrubbed
HTTP URLs rather than command logs, stdout/stderr, or secret-bearing URL
details.
- Updated generated overlays so a Forgejo overlay completes Stage 1 locally:
`stage1-script` is required, `local-health` is optional when no local service
is running, and Helm rendering remains optional when Helm is unavailable.
- Verified a fresh generated Forgejo overlay against
`schemas/railiance-app.schema.json` and `bin/railiance run`; the smoke passed
with one command, two checks, and zero required failures.
- `make fix-consistency REPO=railiance-cluster` passed with pre-existing
C-12 warnings and synced the T04 status into State Hub.
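
The run result described above records scrubbed HTTP URLs rather than secret-bearing URL details. One way to scrub a URL is sketched below, assuming that credentials, query strings, and fragments are the parts to drop. This is an illustration, not the `railiance-run` implementation.

```python
from urllib.parse import urlsplit, urlunsplit


def scrub_url(url: str) -> str:
    """Keep scheme, host, port, and path; drop userinfo, query, and fragment."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if ":" in host:
        host = f"[{host}]"  # restore brackets for IPv6 literals
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, "", ""))


print(scrub_url("https://user:pw@example.test:8443/health?token=abc#frag"))
# https://example.test:8443/health
```

Rebuilding the URL from parsed parts is safer than pattern-based redaction, because anything not explicitly kept is discarded.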
Progress 2026-06-27 staged promotion T05:
- Completed `RAIL-BS-WP-0006-T05` in `/home/worsch/railiance-cluster`.
Generated overlays now include a Stage 2 canary Helm template with
stable/canary release identities, isolated ingress by default, optional
Traefik weighted routing, Prometheus annotations, HTTP probes, conservative
resource limits, rollback-safe Stage 2/Stage 3 values, and
`tests/stage2-template.sh`.
- Verified a fresh generated Forgejo overlay with schema validation,
`tests/stage1.sh`, `tests/stage2-template.sh`, and `bin/railiance run`.
Helm rendering was skipped because Helm is unavailable in this environment.
- `make fix-consistency REPO=railiance-cluster` passed with pre-existing
C-12 warnings and synced the T05 status into State Hub.
Progress 2026-06-27 staged promotion T06:
- Completed `RAIL-BS-WP-0006-T06` in `/home/worsch/railiance-cluster`.
Added `tools/cmd/railiance-stage2` plus `bin/railiance deploy` and
`bin/railiance observe` dispatch. Both commands default to non-mutating
plans; apply/live modes fail closed on missing prerequisites.
- Verified a fresh generated Forgejo overlay with schema validation,
`tests/stage1.sh`, `tests/stage2-template.sh`, Stage 2 deploy plan,
Stage 2 observe plan, and blocked apply without approval/Helm.
- `make fix-consistency REPO=railiance-cluster` passed with pre-existing
C-12 warnings and synced the T06 status into State Hub.
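
The plan-by-default, fail-closed behavior described for `bin/railiance deploy` and `bin/railiance observe` follows a common pattern, sketched here. The prerequisite names mirror the "approval/Helm" wording above; the class and function are hypothetical and are not the `railiance-stage2` code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prerequisites:
    approval_recorded: bool
    helm_available: bool


def stage2_deploy(prereqs: Prerequisites, mode: str = "plan") -> str:
    """Describe the action taken; only 'apply' can ever mutate anything."""
    if mode == "plan":
        return "plan: no changes applied"
    if mode != "apply":
        raise ValueError(f"unknown mode: {mode!r}")
    missing = [
        name
        for name, present in (
            ("approval", prereqs.approval_recorded),
            ("helm", prereqs.helm_available),
        )
        if not present
    ]
    if missing:
        # Fail closed: any missing prerequisite blocks the mutation.
        raise PermissionError("apply blocked, missing: " + ", ".join(missing))
    return "apply: prerequisites satisfied"


print(stage2_deploy(Prerequisites(False, False)))  # plan: no changes applied
```

The default mode needs no prerequisites because it changes nothing, while the mutating path has to prove every prerequisite before it proceeds.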
Progress 2026-06-27 staged promotion T07 and finish:
- Completed `RAIL-BS-WP-0006-T07` in `/home/worsch/railiance-cluster`.
Added `tools/cmd/railiance-stage3`, `bin/railiance promote`,
`bin/railiance rollback`, and `docs/promote-rollback-onboarding.md`.
Generated overlays now declare promote/rollback plan commands.
- Verified a fresh generated Forgejo overlay through Stage 1 run, Stage 2
deploy/observe plans, Stage 3 promote/rollback plans, and blocked apply paths
for missing approval/Helm/revision evidence.
- Marked `RAIL-BS-WP-0006` `status: finished`; `make fix-consistency
REPO=railiance-cluster` synced the finished workstream with only pre-existing
C-12 orphan-row warnings.
## Task: Decide State Hub Migration Strategy
```task
id: CUST-WP-0051-T06
status: progress
priority: high
state_hub_task_id: "0ac3763f-eac0-4773-9be8-cb0a7979e444"
```
Choose and execute the State Hub stabilization path.
Decision:
- If pragmatic railiance01 service is enough for the next operating period,
finish `CUST-WP-0011`: cut over the MCP config, observe the stabilization window,
then retire or retain the WSL2 fallback by explicit decision.
- If HA is now required, promote `CUST-WP-0038` and the ThreePhoenix HA cluster
lane: readiness, storage/database strategy, HA API behavior, failover drill,
restore drill, and endpoint/runbook update.
Done when the active State Hub path is singular, tested, and documented, and
the alternate path is either cancelled, deferred, or explicitly retained as a
future workplan.
Progress 2026-06-27:
- Added `docs/state-hub-migration-strategy-status.md` and selected
the pragmatic `CUST-WP-0011` railiance01 path as the singular active
State Hub stabilization lane.
- `CUST-WP-0011` is already through T01-T06: image pushed, cluster
manifests defined, empty deploy healthy, migrations run, WSL2 data restored,
row counts compared, and cluster API health/summary verified.
- Next gate is `CUST-WP-0011-T07`: explicit approval to freeze WSL2
writes, restore the final dump, compare again, and redirect MCP/private access
to the cluster endpoint.
- `CUST-WP-0038` and `RAIL-BS-WP-0007` remain deferred HA
lanes until the pragmatic path stabilizes and ThreePhoenix storage/database
strategy is current.
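
The "compare again" step of the cutover gate is a per-table row-count comparison between the frozen WSL2 source and the restored cluster database. A minimal sketch follows, assuming the counts have already been collected into dictionaries. The table names and numbers are made up for illustration.

```python
def compare_row_counts(source, target):
    """Return {table: (source_count, target_count)} for every mismatch."""
    mismatches = {}
    for table in sorted(set(source) | set(target)):
        source_count, target_count = source.get(table), target.get(table)
        if source_count != target_count:
            # None marks a table that is missing entirely on one side.
            mismatches[table] = (source_count, target_count)
    return mismatches


# Hypothetical counts, not real State Hub data.
wsl2_counts = {"workstreams": 40, "tasks": 300}
cluster_counts = {"workstreams": 40, "tasks": 299, "notes": 5}
print(compare_row_counts(wsl2_counts, cluster_counts))
# {'notes': (None, 5), 'tasks': (300, 299)}
```

An empty result is the condition for redirecting access to the cluster endpoint; any mismatch, including a table present on only one side, should hold the cutover.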
## Task: Sequence FOS Hub Bootstrap To Completion
```task
id: CUST-WP-0051-T07
status: progress
priority: medium
state_hub_task_id: "27b6828a-9e87-4135-a036-bce760c3057c"
```
Use the stabilized substrate to finish `CUST-WP-0025` without reviving the
mega-hub pattern.
Recommended order:
1. Keep `CUST-WP-0025-T03` as the remaining identity integration gate, targeting
the current IAM Profile v0.2 contract and local-identity or KeyCape issuer.
2. Execute the rewritten Core Hub Phase 3 lane: ops evidence contract/read-model
gaps, deployed Core Hub smoke, activity-core Core Hub sink smoke,
migration/cutover readiness, and whynot-aligned first UI screens.
3. Keep `CUST-WP-0047-T05` and `CUST-WP-0049-T06` as legacy/fallback Inter-Hub
records until deployed Core Hub evidence or an explicit supersede decision
closes them.
4. Start fin-hub/business-model tasks only after identity and Core Hub ops-hub
evidence are proven enough to demonstrate the multi-hub pattern.
Done when `CUST-WP-0025` has no open foundational identity or ops-hub tasks and
fin-hub work is either started on a stable Core Hub pattern or deliberately
deferred with a dated condition.
Progress 2026-06-27:
- Added `docs/fos-hub-bootstrap-sequence-status.md` with the current sequence.
- Corrected the identity foundation baseline in `CUST-WP-0025`: the old
`NK-WP-0001` Keycloak task is cancelled as superseded, `NK-WP-0002` local
identity is done, and the remaining identity gate is the IAM Profile v0.2
FastAPI integration test.
- Current ops-hub reality is Core Hub replacement-first: `CORE-WP-0008`
finished the API smoke harness, activity-core sink, staging profile, CLI
wrappers, UI rebuild backlog, and Custodian handoff. `CUST-WP-0025-T13` through
`T19` have been rewritten away from the obsolete standalone scaffold.
- Fin-hub/business tasks remain deliberately deferred until identity integration
and ops-hub extension evidence are proven.
Progress 2026-06-27 Core Hub reset:
- `CUST-WP-0052` completed the Phase 3 reset. `CUST-WP-0025-T13` through
`T19` now point at Core Hub-owned API evidence, CLI parity, deployed
smoke/cutover gates, whynot-aligned UI, and cancellation of immediate
standalone ops-hub MCP registration.
- Core Hub is now the preferred replacement lane, but staging import, deployed
dual-run smokes, cutover evidence, and Haskell retirement approval remain
open.
Progress 2026-06-27 CUST-WP-0052 closeout:
- `CUST-WP-0052` is finished. It closed the Core Hub reframe, rewrote
`CUST-WP-0025-T13` through `T19`, aligned the build/release lane with
HelixForge/Railiance Forge practice, and posted non-secret State Hub
requirements to `railiance-apps` and `railiance-forge`.
- The remaining T07 gates are execution gates, not sequencing ambiguity:
`CUST-WP-0025-T03` identity integration, `T14` Core Hub ops evidence contract
gaps, `T16/T17` deployed evidence/cutover waits, and `T18` Core Hub operator
UI first screens.
## Task: Create The Stable Pickup Checkpoint
```task
id: CUST-WP-0051-T08
status: done
priority: high
state_hub_task_id: "2cc0a127-a749-4228-962e-f8c9b693a1b3"
```
Close this metaplan by creating an operator-friendly checkpoint.
Minimum contents:
- active workstream list with zero stale runbooks and zero contradictory task
states;
- blocker board showing no unowned credential, access, or approval gates;
- daily automation evidence from the latest successful scheduled run;
- production service status summary for State Hub, Inter-Hub, ops-hub evidence,
issue-core, Forgejo, and artifact-store;
- explicit next-pick list for remaining strategic tasks.
Done when a future agent can start from the checkpoint and choose the next
workplan without reconstructing this review.
Completed 2026-06-27: added
`docs/infrastructure-stabilization-pickup-checkpoint.md` with the live active
workstream list, named blocker board, latest daily-triage evidence, production
service status summary, and next-pick sequence. This closes the handoff surface
for future agents while the child workplans remain the execution source of
truth.