# Infrastructure Stabilization Pickup Checkpoint

Updated: 2026-06-30

Coordinator workplan: `CUST-WP-0051`

## Purpose

This checkpoint is the restart surface for the infrastructure stabilization
metaplan. It consolidates the workplan review, unblock boards, current State
Hub registration state, and the next strategic picks.

Use this file first when resuming the lane. Then open the source workplan named
in the relevant row and continue from its task state.

## Registration State

State Hub active workstreams queried on 2026-06-27:

| Workstream | Current pickup meaning |
| --- | --- |
| `artifact-store-wp-0007` | Start D7.1/D7.2 assessment and compatibility harness; D7.3 STS vending may route to NetKingdom. |
| `ihub-wp-0022` | Ops-hub evidence intake contract is aligned to live vocabulary; runtime key custody, protected widget lookup, and smoke remain. |
| `cust-wp-0047` | Now-view waits on the ops-hub Inter-Hub evidence lane, not on service inventory collection. |
| `cust-wp-0049` | Bootstrap access helper/runbook is ready; authenticated execution is operator-gated. |
| `cust-wp-0051` | This metaplan is the coordination layer for remaining cross-workplan gates. |
| `activity-wp-0016-llm-output-robustness-trust-boundary` | Repo-side output robustness bundle is prepared; live deploy/smoke proof remains. |
| `three-phoenix-ha-cluster` | HA substrate remains future critical-workload work, not the current State Hub cutover blocker. |
| `rail-ho-wp-0005` | Forgejo production migration is parked behind explicit design, SMTP, backup, runner, and cutover decisions. |
| `net-wp-0020` | OpenBao unseal/token custody remains an operator design and smoke gate. |
| `issue-wp-0003` | issue-core service is healthy; activity-core REST emission wiring remains. |
| `activity-wp-0006` | Calibration waits on the post-WP-0016 live daily-triage smoke and three clean scheduled runs. |
| `cust-wp-0038` | Full State Hub HA migration is deferred until the pragmatic railiance01 path stabilizes. |
| `cust-wp-0025` | FOS bootstrap resumes from identity integration and ops-hub evidence, not the old mega-hub scaffold. |
| `cust-wp-0011` | Active State Hub migration path; next gate is explicit cutover approval. |

Hygiene status:

- `CUST-WP-0045-cutover-runbook` is no longer active; it is a finished runbook
  record, not an empty active workstream.
- `CUST-WP-0014` is reopened as `backlog`; it is no longer a done workplan with
  todo task blocks.
- Completed or cancelled tasks no longer carry stale human-needed flags; these
  were cleared during this stabilization session.
- `make fix-consistency REPO=the-custodian` still reports pre-existing C-12
  orphan-row warnings, but the relevant workplan lifecycle and task states sync.
- `RAIL-BS-WP-0006-staged-promotion-lifecycle` is finished: all seven tasks
  are done, the workstream is finished in State Hub, and the file frontmatter
  is `status: finished`.

## Blocker Board

No live credential, access, or approval gate is unowned. Do not ask
`ops-warden` for secret values; use the route catalog, the `warden access`
assist/proxy surface where the catalog lane allows it, and the owning subsystem.

For credential-related blockers, classify the environment posture and workload
maturity first. Dev/test work can use synthetic contract doubles; production
real-value work needs owner custody, policy gates where applicable, and
non-secret evidence. See `docs/ops-warden-secret-posture-review.md`.

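Non-secret evidence for key material usually means recording a short prefix or an accessor rather than the value itself. A minimal sketch of that habit follows; the function name `evidence_prefix`, the default prefix length, and the sample value are hypothetical and not part of ops-warden or State Hub.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print only the first N characters of a value,
# followed by a fixed marker, so the full value never lands in evidence notes.
evidence_prefix() {
  local value="$1"
  local keep="${2:-6}"   # assumed default prefix length, not a project rule
  if [ -z "$value" ]; then
    echo "evidence_prefix: empty value" >&2
    return 1
  fi
  printf '%s...[redacted]\n' "${value:0:$keep}"
}
```

With a synthetic value, `evidence_prefix 'synthetic-key-0001' 4` prints `synt...[redacted]`. Whether even a prefix may be recorded for a given lane remains a decision for the owning subsystem.
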
Do not implement ops-warden changes from this Custodian lane. New ops-warden
needs should be posted through State Hub as requirements or suggestions for the
separate ops-warden worker.

| Gate | Owner/route | Non-secret evidence to collect | Next action |
| --- | --- | --- | --- |
| State Hub pragmatic cutover | Custodian operator approval; `CUST-WP-0011-T07` | Final dump id/time, row-count comparison, chosen private endpoint, stabilization notes | Approve freeze/final restore and make railiance01 State Hub primary, or leave WSL2 primary explicitly. |
| State Hub fallback retirement | Custodian/operator approval; `CUST-WP-0038-T08` | HA failover drill id, restore drill id, stabilization pass | Keep deferred until after HA drills; do not retire WSL2 fallback early. |
| Inter-Hub ops-hub bootstrap | `inter-hub-bootstrap-ssh`, `openbao-api-key`, `ssh-cert-host-access` as needed | Hub id, manifest id, widget count, runtime key prefix only, smoke result | Legacy/fallback only. Prefer Core Hub deployed smoke; run attended Inter-Hub bootstrap only by explicit operator supersede/rollback decision. |
| Ops-hub runtime evidence key | `openbao-api-key` / OpenBao custody | OpenBao path/version or populated key count, event smoke id | Do not materialize legacy `OPS_HUB_KEY` until a deployed Core Hub smoke or explicit legacy Inter-Hub smoke is ready to use it. |
| Daily-triage live proof | activity-core deploy/runtime operator | State Hub `daily_triage` id, output-valid or partial/quarantine status, working-memory path | Bank the 2026-06-28 / 2026-06-29 / 2026-06-30 clean streak, then have the activity-core owner land/sync the in-flight WP-0016 diagnostics and prove bounded top-N plus graceful-degradation smoke. |
| activity-core to issue-core | route `activity-core-issue-sink` | `actcore-runtime-secret` has key, activity-core points to issue-core port `8765`, HTTP 201, Gitea issue id | Inject `ISSUE_CORE_API_KEY` through approved custody, set REST sink env, restart/sync, run safe emission. |
| Forgejo production design | Forgejo/operator decisions plus OpenBao/KeyCape/ops-bridge routes as needed | Decision id, SMTP smoke, backup/restore drill, package/action smoke, cutover approval id | Resolve T02 production choices before any production cutover work. |
| OpenBao unseal and credential helper | `openbao-api-key`, `railiance-infra-principals`, `ssh-cert-host-access`, `key-cape-oidc-login` | Policy names, role names, token accessor only, allow/deny smoke | `warden-sign` lane is verified/banked; broader custody profile and issuer automation remain separate operator-design gates. |
| ops-warden policy gate / warden-sign lane | `SECRETS-WP-0004` + `FLEX-WP-0007` finished; ops-warden operator posture | `decision:032b096c433ad80c`, `ttl_out_of_bounds`, backend `vault`; no token/role/secret/accessor values | No Custodian action. Keep `policy.enabled` off until testing/production maturity. |

## Daily Automation Evidence

The scheduled daily-triage runner is alive and writing State Hub plus working
memory evidence. The current blocker is bounded output-contract adoption and
live graceful-degradation proof, not scheduling or sink reachability.

Latest clean scheduled streak:

- 2026-06-28: event `f0d8477e-1db9-4c07-bb8c-d28cbb868abc`, schema-valid daily
  triage, working memory written.
- 2026-06-29: event `176d2ea7-f0e3-48cd-999b-4ab6055c6a55`, schema-valid daily
  triage, working memory written.
- 2026-06-30: event `27d695b2-a537-481b-ada6-ca84ec24cd96`, schema-valid daily
  triage, working memory written.

Latest failed scheduled runs before the clean streak:

- 2026-06-26: event `97fd20a0-eee0-45ea-8290-6d91874e1515`, validation failed
  at char 5268, working memory written.
- 2026-06-27: event `c5ab50a8-404b-4e30-849f-841b059ace65`, validation failed
  at char 5246, working memory written.

Bank the three-run calibration streak, but keep the WP-0016 live-proof gate open
until the bounded top-N contract and graceful-degradation smoke are proven. The
activity-core worktree currently has in-flight uncommitted ACTIVITY-WP-0016
and ACTIVITY-WP-0018/0019 changes, so Custodian should wait for that owner to
commit/sync or explicitly hand off before treating those files as source truth.
Use the activity-core repo-native automation status surface once it lands; do
not use assistant-provided scheduling as operational evidence.

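The streak rule above (three clean scheduled runs in a row) can be checked mechanically from a run history. The sketch below is illustrative only: the `DATE STATUS` input format, the `clean`/`failed` labels, and the function name `clean_streak` are assumptions, not the activity-core status surface.

```bash
#!/usr/bin/env bash
# Count the trailing run of "clean" results in chronological "DATE STATUS" lines.
# Input format and status words are assumptions made for this illustration.
clean_streak() {
  local streak=0 day status
  while read -r day status; do
    [ -z "$day" ] && continue          # skip blank lines
    if [ "$status" = "clean" ]; then
      streak=$((streak + 1))
    else
      streak=0                         # any non-clean run resets the streak
    fi
  done
  echo "$streak"
}

# Dates come from the run history above; the labels are illustrative.
runs='2026-06-26 failed
2026-06-27 failed
2026-06-28 clean
2026-06-29 clean
2026-06-30 clean'

streak=$(printf '%s\n' "$runs" | clean_streak)
if [ "$streak" -ge 3 ]; then
  echo "streak=$streak: calibration streak can be banked"
else
  echo "streak=$streak: keep waiting"
fi
```

The check only covers the calibration streak. It says nothing about the bounded top-N contract or the graceful-degradation smoke, which stay as separate gates.
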
## Production Service Summary

| Surface | Stable fact | Remaining gate |
| --- | --- | --- |
| State Hub | Pragmatic railiance01 path has image, manifests, empty deploy, migrations, restored WSL2 data, row-count comparison, and healthy API through `CUST-WP-0011-T06`. | `CUST-WP-0011-T07` cutover approval, then stabilization; HA path stays deferred. |
| Inter-Hub / Core Hub | Public `https://hub.coulomb.social/api/v2/hubs` exposes `ops-hub`; `CORE-WP-0008` finished the Core Hub API smoke harness, activity-core sink, staging profile, CLI wrappers, UI backlog, and Custodian handoff. | Run deployed Core Hub smoke, staging import, activity-core sink smoke, and readiness summary; keep Haskell Inter-Hub only for migration/rollback proof. |
| ops-hub evidence | `CUST-WP-0025-T14` is done with the Core Hub ops evidence contract spec. `CUST-WP-0025-T13` through `T19` now use Core Hub API/CLI/UI gates; `CUST-WP-0047` and `CUST-WP-0049` remain legacy/fallback records. | Execute `CUST-WP-0025-T16`, `T17`, and `T18`; close legacy Inter-Hub waits only through deployed Core Hub evidence or explicit supersede decision. |
| issue-core | ArgoCD service is healthy on port `8765`; image `0.2.1`; ExternalSecret Ready; authenticated smoke created Gitea issue `175`. | activity-core still needs `ISSUE_CORE_API_KEY`, URL port `8765`, `ISSUE_SINK_TYPE=rest`, and a safe emission smoke. |
| Forgejo | Migration inventory/design lane is active but pre-cutover. | Production design decisions, SMTP/email recovery, package registry, Actions, backup/restore, migration drill, cutover approval. |
| artifact-store | D7.1 is done; D7.2 has an opt-in live MinIO compatibility harness and manual smoke docs. No live secret handoff is recorded. | Run D7.2 against an approved MinIO-compatible endpoint, then route D7.3 STS vending through identity/platform custody before changing credential behavior. |
| secrets-engine | `SECRETS-WP-0004` is finished: the scoped `warden-sign` lane supported the vault-backed policy-gate smoke without exposing token material. `SECRETS-WP-0003` remains active for the real whynot-design npm publish pilot. | Finish or park `SECRETS-WP-0003` behind Gitea bot/package-token provisioning, OpenBao custody, ops-warden route confirmation, and real package publish evidence. |
| FOS hub | Old NK-WP-0001 Keycloak prerequisite is cancelled; NK-WP-0002 local identity, IAM Profile v0.2, the Core Hub FastAPI IAM Profile integration test, and Core Hub operator UI first screens are done; hub-core extraction/dev-hub work is done; CUST-WP-0025 Phase 3 has been rewritten for Core Hub. | Execute the remaining Core Hub deployed evidence and cutover gates: `CUST-WP-0025-T16` and `T17`. |

## Next-Pick List

1. Execute the remaining rewritten `CUST-WP-0025` Core Hub gates: deployed
   smoke and activity-core proof (`T16`) and cutover decision coupling (`T17`).
   T03, T14, and T18 are complete as the identity integration template, ops
   evidence/read-model contract, and operator UI first-screen gates.
2. Keep `CUST-WP-0047` and `CUST-WP-0049` as legacy evidence/fallback until
   Core Hub deployed smoke evidence or an explicit supersede decision closes
   them.
3. Bank the 2026-06-28 / 2026-06-29 / 2026-06-30 clean daily-triage
   streak for calibration, then have the activity-core owner land/sync the
   in-flight WP-0016 diagnostics/status work and prove the bounded top-N plus
   graceful-degradation smoke.
4. Complete the issue-core handoff by wiring activity-core to port `8765` with
   `ISSUE_SINK_TYPE=rest` and one known-safe emission smoke.
5. Request explicit State Hub cutover approval for `CUST-WP-0011-T07`, or
   record that WSL2 remains primary for the next operating period.
6. Run artifact-store D7.2 live MinIO-compatible evidence; Forgejo and storage
   work can now inherit the finished staged-promotion gates.
7. Keep `SECRETS-WP-0003` parked until Gitea bot/package-token provisioning,
   OpenBao custody, route confirmation, and a coordinated whynot-design version
   bump are available.
8. Keep Forgejo cutover and State Hub HA work parked until their human decision
   and drill gates are satisfied.

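The issue-core wiring in pick 4 can be checked before the emission smoke without printing any secret. The sketch below is a hypothetical pre-flight: the function name `check_issue_sink_env` and the variable name `ISSUE_CORE_URL` are assumptions, since this checkpoint only names `ISSUE_CORE_API_KEY`, `ISSUE_SINK_TYPE=rest`, and port `8765`.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight for the activity-core -> issue-core REST sink settings.
# ISSUE_CORE_URL is an assumed variable name; adjust to whatever activity-core uses.
check_issue_sink_env() {
  local failed=0
  if [ -z "${ISSUE_CORE_API_KEY:-}" ]; then
    echo "missing: ISSUE_CORE_API_KEY"
    failed=1
  else
    # Report presence only; the value itself is never echoed.
    echo "present: ISSUE_CORE_API_KEY (value not printed)"
  fi
  if [ "${ISSUE_SINK_TYPE:-}" != "rest" ]; then
    echo "wrong: ISSUE_SINK_TYPE must be rest"
    failed=1
  fi
  case "${ISSUE_CORE_URL:-}" in
    *:8765|*:8765/*) ;;   # URL targets the issue-core port
    *) echo "wrong: ISSUE_CORE_URL must target port 8765"; failed=1 ;;
  esac
  return "$failed"
}
```

Run `check_issue_sink_env` in the environment where activity-core starts. A non-zero exit status means at least one setting still needs attention. The check does not replace the safe emission smoke or the HTTP 201 evidence.
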
## Resume Commands

```bash
cd /home/worsch/the-custodian
sed -n '1,260p' workplans/CUST-WP-0051-infrastructure-stabilization-metaplan.md
sed -n '1,260p' docs/infrastructure-stabilization-pickup-checkpoint.md
sed -n '1,260p' docs/credential-custody-unblock-board.md
```

After workplan edits, sync from State Hub:

```bash
cd /home/worsch/state-hub
make fix-consistency REPO=the-custodian
```