Infrastructure Stabilization Pickup Checkpoint

Updated: 2026-06-27
Coordinator workplan: CUST-WP-0051

Purpose

This checkpoint is the restart surface for the infrastructure stabilization metaplan. It consolidates the workplan review, unblock boards, current State Hub registration state, and the next strategic picks.

Use this file first when resuming the lane. Then open the source workplan named in the relevant row and continue from its task state.

Registration State

State Hub active workstreams queried on 2026-06-27:

| Workstream | Current pickup meaning |
| --- | --- |
| artifact-store-wp-0007 | Start D7.1/D7.2 assessment and compatibility harness; D7.3 STS vending may route to NetKingdom. |
| ihub-wp-0022 | Ops-hub evidence intake contract is aligned to live vocabulary; runtime key custody, protected widget lookup, and smoke remain. |
| cust-wp-0047 | Now-view waits on the ops-hub Inter-Hub evidence lane, not on service inventory collection. |
| cust-wp-0049 | Bootstrap access helper/runbook is ready; authenticated execution is operator-gated. |
| cust-wp-0051 | This metaplan is the coordination layer for remaining cross-workplan gates. |
| activity-wp-0016-llm-output-robustness-trust-boundary | Repo-side output robustness bundle is prepared; live deploy/smoke proof remains. |
| three-phoenix-ha-cluster | HA substrate remains future critical-workload work, not the current State Hub cutover blocker. |
| staged-promotion-lifecycle | T02 railiance/app.toml contract and T03 overlay repo pattern/script are done; continue with T04/T05 command/canary implementation before broad production migrations. |
| rail-ho-wp-0005 | Forgejo production migration is parked behind explicit design, SMTP, backup, runner, and cutover decisions. |
| net-wp-0020 | OpenBao unseal/token custody remains an operator design and smoke gate. |
| issue-wp-0003 | issue-core service is healthy; activity-core REST emission wiring remains. |
| activity-wp-0006 | Calibration waits on the post-WP-0016 live daily-triage smoke and three clean scheduled runs. |
| cust-wp-0038 | Full State Hub HA migration is deferred until the pragmatic railiance01 path stabilizes. |
| cust-wp-0025 | FOS bootstrap resumes from identity integration and ops-hub evidence, not the old mega-hub scaffold. |
| cust-wp-0011 | Active State Hub migration path; next gate is explicit cutover approval. |

Hygiene status:

  • CUST-WP-0045-cutover-runbook is no longer active; it is a finished runbook record, not an empty active workstream.
  • CUST-WP-0014 is reopened as backlog; it is no longer a done workplan with todo task blocks.
  • Stale human-needed flags on completed or cancelled tasks were cleared during this stabilization session.
  • make fix-consistency REPO=the-custodian still reports pre-existing C-12 orphan-row warnings, but the relevant workplan lifecycle and task states sync.

Blocker Board

No live credential, access, or approval gate is unowned. Do not ask ops-warden for secret values; use the route catalog and the owning subsystem.

| Gate | Owner/route | Non-secret evidence to collect | Next action |
| --- | --- | --- | --- |
| State Hub pragmatic cutover | Custodian operator approval; CUST-WP-0011-T07 | Final dump id/time, row-count comparison, chosen private endpoint, stabilization notes | Approve freeze/final restore and make railiance01 State Hub primary, or leave WSL2 primary explicitly. |
| State Hub fallback retirement | Custodian/operator approval; CUST-WP-0038-T08 | HA failover drill id, restore drill id, stabilization pass | Keep deferred until after HA drills; do not retire WSL2 fallback early. |
| Inter-Hub ops-hub bootstrap | inter-hub-bootstrap-ssh, openbao-api-key, ssh-cert-host-access as needed | Hub id, manifest id, widget count, runtime key prefix only, smoke result | Use the aligned live-vocabulary mapping, then run attended bootstrap and protected widget lookup. |
| Ops-hub runtime evidence key | openbao-api-key / OpenBao custody | OpenBao path/version or populated key count, event smoke id | Store/provide OPS_HUB_KEY outside Git and run the protected evidence smoke. |
| Daily-triage live proof | activity-core deploy/runtime operator | State Hub daily_triage id, output-valid or partial/quarantine status, working-memory path | Deploy WP-0016 code/schema and bounded runtime prompt bundle, then run railiance01 smoke. |
| activity-core to issue-core route | activity-core-issue-sink | actcore-runtime-secret has key, activity-core points to issue-core port 8765, HTTP 201, Gitea issue id | Inject ISSUE_CORE_API_KEY through approved custody, set REST sink env, restart/sync, run safe emission. |
| Forgejo production design | Forgejo/operator decisions plus OpenBao/KeyCape/ops-bridge routes as needed | Decision id, SMTP smoke, backup/restore drill, package/action smoke, cutover approval id | Resolve T02 production choices before any production cutover work. |
| OpenBao unseal and credential helper | openbao-api-key, railiance-infra-principals, ssh-cert-host-access, key-cape-oidc-login | Policy names, role names, token accessor only, allow/deny smoke | Approve custody profile and apply narrow issuer policies before live helper smokes. |
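
Several rows above ask for a "runtime key prefix only" as evidence. One way to record a prefix without exposing the value is sketched below. The function name `key_prefix`, the default prefix length of 6, and the sample value are assumptions for illustration; they are not part of any existing tooling in this repo.

```shell
# Hypothetical helper: print a short prefix of a secret for evidence notes.
# The default length (6) is an assumed convention, not taken from the board.
key_prefix() {
    secret=$1
    len=${2:-6}
    # Refuse values that the prefix would reveal in full.
    if [ "${#secret}" -le "$len" ]; then
        echo "value too short to prefix safely" >&2
        return 1
    fi
    prefix=$(printf '%s' "$secret" | cut -c1-"$len")
    printf '%s...\n' "$prefix"
}

# Example with a made-up value; a real key should come from the custody
# route, not be typed on the command line.
key_prefix 'ohk_not-a-real-key-0000' 4
```

The guard on short values is a design choice: a prefix as long as the secret itself would defeat the purpose of collecting non-secret evidence.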

Daily Automation Evidence

The scheduled daily-triage runner is alive and writing State Hub plus working memory evidence. The current blocker is output validation, not scheduling or sink reachability.

Latest clean scheduled run:

  • 2026-06-25: State Hub event cbba6bc0-14cb-492b-ab23-74b9349326c8, schema-valid daily triage, working memory written.

Latest failed scheduled runs:

  • 2026-06-26: event 97fd20a0-eee0-45ea-8290-6d91874e1515, validation failed at char 5268, working memory written.
  • 2026-06-27: event c5ab50a8-404b-4e30-849f-841b059ace65, validation failed at char 5246, working memory written.
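
To see what the validator tripped on, it can help to print the text around the reported offset. The sketch below assumes the raw output of a failed run has been saved to a file, and it treats the reported char position as a zero-based byte offset, which matches a character offset only when the preceding text is plain ASCII. The helper name `show_context` and the default radius of 40 are hypothetical.

```shell
# Hypothetical helper: print the bytes around a reported validation offset.
# Assumes a zero-based byte offset; this equals the character offset only
# when the text before it is plain ASCII.
show_context() {
    file=$1
    offset=$2
    radius=${3:-40}
    start=$((offset - radius))
    if [ "$start" -lt 0 ]; then
        start=0
    fi
    # tail -c +N counts from 1, so zero-based `start` becomes N = start + 1.
    # The window covers the bytes before the offset, the byte at the offset,
    # and `radius` bytes after it.
    tail -c "+$((start + 1))" "$file" | head -c "$((offset - start + 1 + radius))"
    echo
}
```

A call such as `show_context saved-output.json 5268` (file name assumed) would print roughly 80 bytes centered on the failure point for the 2026-06-26 run.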

Resume from docs/daily-triage-stabilization-status.md and ACTIVITY-WP-0016 before restarting the three-clean-run gate.
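
The three-clean-run gate can be tracked with a small counter. The sketch below assumes the gate means three consecutive clean scheduled runs and that run statuses are passed oldest first as the words `clean` or `failed`; both the interpretation and the function names are assumptions, not existing tooling.

```shell
# Hypothetical helper: count the current streak of clean runs.
# Arguments are run statuses, oldest first; any non-clean run resets the streak.
clean_streak() {
    streak=0
    for status in "$@"; do
        if [ "$status" = "clean" ]; then
            streak=$((streak + 1))
        else
            streak=0
        fi
    done
    printf '%s\n' "$streak"
}

# Succeeds only when the latest three runs (or more) were clean.
gate_met() {
    [ "$(clean_streak "$@")" -ge 3 ]
}

# The runs listed above: 2026-06-25 clean, 2026-06-26 and 2026-06-27 failed.
clean_streak clean failed failed   # prints 0
```

With the runs recorded in this checkpoint the streak is 0, so the gate would restart from the first clean run after the WP-0016 deploy.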

Production Service Summary

| Surface | Stable fact | Remaining gate |
| --- | --- | --- |
| State Hub | Pragmatic railiance01 path has image, manifests, empty deploy, migrations, restored WSL2 data, row-count comparison, and healthy API through CUST-WP-0011-T06. | CUST-WP-0011-T07 cutover approval, then stabilization; HA path stays deferred. |
| Inter-Hub | Public https://hub.coulomb.social/api/v2/hubs exposes ops-hub; public registry vocabulary is visible; Inter-Hub contract docs now target the live seed vocabulary. | Protected widget lookup, runtime key custody, and authenticated event smoke remain. |
| ops-hub evidence | ops-hub exists as the Inter-Hub Operations extension; OPS-WP-0001 finished; OPS-WP-0002 has early seed tasks done. | Attended bootstrap, runtime key custody, protected widget/event smoke. |
| issue-core | ArgoCD service is healthy on port 8765; image 0.2.1; ExternalSecret Ready; authenticated smoke created Gitea issue 175. | activity-core still needs ISSUE_CORE_API_KEY, URL port 8765, ISSUE_SINK_TYPE=rest, and a safe emission smoke. |
| Forgejo | Migration inventory/design lane is active but pre-cutover. | Production design decisions, SMTP/email recovery, package registry, Actions, backup/restore, migration drill, cutover approval. |
| artifact-store | Workplan is active with all tasks open and no current live secret handoff recorded. | Start D7.1 fork/object-store landscape and D7.2 compatibility harness. |
| FOS hub | Old NK-WP-0001 Keycloak prerequisite is cancelled; NK-WP-0002 local identity and IAM Profile v0.2 are done; hub-core extraction/dev-hub work is done. | Keep CUST-WP-0025-T03 as the identity integration test, then reconcile old ops-hub scaffold tasks after first Inter-Hub ops event lands. |
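
The row-count comparison named as State Hub cutover evidence can be reproduced from two plain-text listings. The sketch below assumes each side has been exported to a file with one `table count` pair per line; how those files are produced, the file layout, and the function name `compare_row_counts` are assumptions for illustration.

```shell
# Hypothetical helper: compare per-table row counts from two listings.
# Each input file holds "table_name count" lines in any order.
# Prints "table left right" for every mismatch and returns 1 if any differ.
compare_row_counts() {
    awk '
        NR == FNR { a[$1] = $2; next }   # first file: source counts
        { b[$1] = $2 }                   # second file: target counts
        END {
            for (t in a) {
                if (!(t in b)) { print t, a[t], "missing"; bad = 1 }
                else if (a[t] != b[t]) { print t, a[t], b[t]; bad = 1 }
            }
            for (t in b) {
                if (!(t in a)) { print t, "missing", b[t]; bad = 1 }
            }
            exit bad + 0
        }
    ' "$1" "$2"
}
```

A call such as `compare_row_counts wsl2-counts.txt railiance01-counts.txt` (file names assumed) prints nothing and returns 0 when every table matches, which gives a simple pass/fail line for the evidence record.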

Next-Pick List

  1. Run the attended Inter-Hub ops-hub bootstrap with the aligned live-vocabulary mapping, confirm protected widget ids, and seed any missing backup/risk target widgets.
  2. Store/confirm OPS_HUB_KEY through approved custody and run the protected widget/hub-registry/event smoke.
  3. Deploy the activity-core WP-0016 code/schema and bounded runtime prompt bundle, then run the railiance01 daily-triage smoke.
  4. Complete the issue-core handoff by wiring activity-core to port 8765 with ISSUE_SINK_TYPE=rest and one known-safe emission smoke.
  5. Request explicit State Hub cutover approval for CUST-WP-0011-T07, or record that WSL2 remains primary for the next operating period.
  6. Continue staged-promotion T04/T05 and start artifact-store D7.1/D7.2 so Forgejo and storage work inherit clear production promotion gates.
  7. Keep Forgejo cutover and State Hub HA work parked until their human decision and drill gates are satisfied.
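
Before the issue-core emission smoke in pick 4, a preflight can confirm that the sink settings are present without printing any secret. This is a minimal sketch: `ISSUE_CORE_API_KEY` and `ISSUE_SINK_TYPE` are named in this checkpoint, while `ISSUE_CORE_URL` is a hypothetical variable name for the sink URL and the function itself is not existing tooling.

```shell
# Hypothetical preflight for the activity-core REST sink settings.
# Reports presence only; it never echoes the API key value.
# ISSUE_CORE_URL is an assumed variable name for the issue-core URL.
check_issue_sink_env() {
    rc=0
    if [ -n "${ISSUE_CORE_API_KEY:-}" ]; then
        echo "ISSUE_CORE_API_KEY: set"
    else
        echo "ISSUE_CORE_API_KEY: missing"
        rc=1
    fi
    if [ "${ISSUE_SINK_TYPE:-}" = "rest" ]; then
        echo "ISSUE_SINK_TYPE: rest"
    else
        echo "ISSUE_SINK_TYPE: expected rest"
        rc=1
    fi
    # Accept URLs that end in :8765 or continue with a path after it.
    case "${ISSUE_CORE_URL:-}" in
        *:8765|*:8765/*) echo "ISSUE_CORE_URL: port 8765" ;;
        *) echo "ISSUE_CORE_URL: expected port 8765"; rc=1 ;;
    esac
    return "$rc"
}
```

Running `check_issue_sink_env` in the activity-core runtime environment returns 0 only when all three settings look right, so it can gate the safe emission smoke.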

Resume Commands

cd /home/worsch/the-custodian
sed -n '1,260p' workplans/CUST-WP-0051-infrastructure-stabilization-metaplan.md
sed -n '1,260p' docs/infrastructure-stabilization-pickup-checkpoint.md
sed -n '1,260p' docs/credential-custody-unblock-board.md

After workplan edits, sync from State Hub:

cd /home/worsch/state-hub
make fix-consistency REPO=the-custodian