# Infrastructure Stabilization Pickup Checkpoint

Updated: 2026-07-02
Coordinator workplan: `CUST-WP-0051`

## End-of-Day Checkpoint (2026-07-02 evening)

Nine workplans finished today and the Core Hub replacement lane was driven
from no deployed evidence to real data migrated with cutover blockers named.

**Finished this session:** ARTIFACT-STORE-WP-0007, RAIL-BS-WP-0008,
RAIL-BS-WP-0009, ACTIVITY-WP-0016, RAILIANCE-WP-0008, ISSUE-WP-0003,
CORE-WP-0004 (+ CUST-WP-0053 earlier). RAILIANCE-WP-0009/0010 are done bar
ops-warden catalog confirmation; NET-WP-0020-T02 is wired (needs a greenfield
slate for live proof).

**Live infrastructure now running (was blocked at session start):**

- Daily-triage robustness deployed on railiance01; bounded top-7 proven
  (State Hub event `24d2d321`). Three scheduled runs (Jul 3–5) close
  ACTIVITY-WP-0006 calibration on their own.
- Credential lanes active: issue-core (`CCR-2026-0002`) and llm-connect
  OpenRouter (`CCR-2026-0003`), applied via the constrained prod-applier,
  both ExternalSecrets syncing. Lifecycle runbook at
  `railiance-platform/docs/credential-lane-lifecycle-runbook.md`.
- issue-core REST emission live end-to-end (Gitea issue `176`) over a new
  cross-machine ops-bridge lane (`remote_host` feature added to ops-bridge).
- Core Hub staging deployed on CoulombCore (`core-hub-staging`, image
  `gitea.coulomb.social/coulomb/core-hub:3ed8531`): deployed API smoke +
  activity-core sink smoke both green; full Inter-Hub data (28 records)
  migrated idempotently.

**Open items requiring a decision (not agent-executable):**

1. **Core Hub cutover prerequisites (core-hub owner).** T03 dual-run found (a)
   a blocking catalog gap: migrated widgets reference 7 `ops-*` widget types
   absent from Core Hub's seeded registries because the migration bundle
   omits the type registries (design choice: extend the bundle schema vs.
   seed the vocabularies); (b) `/api/v2/hubs` auth-posture break (public →
   protected). Both on `CORE-WP-0005-T04`; flagged via hub message `4b859f9b`.
2. **Core Hub production cutover** (`CORE-WP-0005-T04`): operator approval +
   rollback plan, after prerequisite 1. Then `CORE-WP-0007` Haskell
   retirement unblocks.
3. **State Hub pragmatic cutover** (`CUST-WP-0011-T07`): still the standing
   operator freeze/restore/redirect approval (unchanged from prior board).
4. **ops-warden confirmations** close `RAILIANCE-WP-0009-T06` /
   `RAILIANCE-WP-0010-T06` (no custodian action).
5. **NET-WP-0020-T02** greenfield live proof needs a rebuild slate.

Prior morning decision set (Core Hub staging cluster/secrets/image) is
resolved: CoulombCore + generated Secrets + Gitea registry, all executed.

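
The catalog gap in open item 1 amounts to a set difference: the widget types referenced by migrated widgets, minus the types present in Core Hub's seeded registries. A minimal sketch of that check, assuming both sets have been exported to plain text files with one type name per line. The function name and the file layout are hypothetical and are not part of the Core Hub tooling.

```bash
# Print widget types that are referenced but not registered.
# $1: file of referenced type names, one per line (assumed export format)
# $2: file of registered type names, one per line (assumed export format)
missing_widget_types() {
  # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
  grep -Fxv -f "$2" "$1" | sort -u
}
```

Called as `missing_widget_types referenced-types.txt seeded-types.txt`, an empty result would indicate that every referenced type is registered, whichever design choice (bundle schema extension or vocabulary seeding) closes the gap.
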
## Original morning list (2026-07-02, all closed)

Every remaining execution lane converged on operator gates in the 2026-07-02
session (agent policy correctly blocks unattended production writes/reads on
railiance01, credential-bootstrap script edits, and OIDC/MFA logins). Each item
below is prepared to one command or one decision:

1. **Daily-triage robustness deploy** (`RAIL-BS-WP-0008`): image
   `activity-core:railiance01-prod` is rebuilt locally from activity-core
   `7612112` (T02 prompt contract included and gate-checked). Operator: run the
   save/scp/import block from `activity-core/k8s/railiance/README.md`, sync the
   repo *with `.git`* to `railiance01:~/activity-core` (the copy there has no
   git metadata and the revision gate needs it), then
   `cd ~/railiance-cluster && make deploy-activity-core-triage-robustness`.
   Afterwards `make admin-sync-smoke` closes `RAIL-BS-WP-0009`.
2. **CCR approvals** (`RAILIANCE-WP-0009`/`0010`): `CCR-2026-0002`
   (issue-core ingestion) and `CCR-2026-0003` (llm-connect OpenRouter) are
   reviewed and binding-confirmed but still `proposed`. Approve, then
   `make credential-change-applier-apply` per CCR; the issue-core
   ExternalSecret already syncs, so verification is mostly confirm-not-create.
3. **Broker live evidence** (`RAILIANCE-WP-0005-T09`): needs one
   KeyCape-OIDC-authenticated session to collect OpenBao audit-log references
   and response-wrap unwrap-once evidence.
4. **Non-prod applier proof** (`RAILIANCE-WP-0008-T03`): mint one token from
   `auth/token/roles/credential-change-nonprod-applier` and record apply +
   denial probes.
5. **OpenBao unseal automation** (`NET-WP-0020-T02`, advanced 2026-07-02):
   `make -C ~/net-kingdom openbao-init-unseal` exists with custody-model gate
   and non-secret evidence; operator review still needed to wire it as a phase
   inside `creds-bootstrap-agent.sh`, and greenfield live proof needs a rebuild
   slate.

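
Item 1 notes that the revision gate needs git metadata in the synced copy. A pre-flight check along the following lines could catch a metadata-less copy before the deploy target runs. This is only a sketch: `check_repo_revision` is a hypothetical helper, and the real revision gate in `railiance-cluster` may check more than this.

```bash
# Succeed only if the directory carries git metadata and, when an expected
# revision prefix is given, HEAD starts with it.
# $1: repository directory, $2: expected revision prefix (may be empty)
check_repo_revision() {
  repo="$1"
  expected="$2"
  if [ ! -e "$repo/.git" ]; then
    echo "no git metadata in $repo" >&2
    return 1
  fi
  if [ -n "$expected" ]; then
    actual="$(git -C "$repo" rev-parse HEAD)" || return 1
    case "$actual" in
      "$expected"*) ;;  # HEAD matches the expected prefix
      *) echo "HEAD is $actual, expected $expected" >&2; return 1 ;;
    esac
  fi
}
```

On railiance01 this might be used as `check_repo_revision ~/activity-core 7612112` before `cd ~/railiance-cluster && make deploy-activity-core-triage-robustness`.
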
## Purpose

This checkpoint is the restart surface for the infrastructure stabilization
metaplan. It consolidates the workplan review, unblock boards, current State
Hub registration state, and the next strategic picks.

Use this file first when resuming the lane. Then open the source workplan named
in the relevant row and continue from its task state.

## Registration State

State Hub active workstreams queried on 2026-06-27:

| Workstream | Current pickup meaning |
| --- | --- |
| `artifact-store-wp-0007` | Start D7.1/D7.2 assessment and compatibility harness; D7.3 STS vending may route to NetKingdom. |
| `ihub-wp-0022` | Ops-hub evidence intake contract is aligned to live vocabulary; runtime key custody, protected widget lookup, and smoke remain. |
| `cust-wp-0047` | Now-view waits on the ops-hub Inter-Hub evidence lane, not on service inventory collection. |
| `cust-wp-0049` | Bootstrap access helper/runbook is ready; authenticated execution is operator-gated. |
| `cust-wp-0051` | This metaplan is the coordination layer for remaining cross-workplan gates. |
| `activity-wp-0016-llm-output-robustness-trust-boundary` | Repo-side output robustness bundle is prepared; live deploy/smoke proof remains. |
| `three-phoenix-ha-cluster` | HA substrate remains future critical-workload work, not the current State Hub cutover blocker. |
| `rail-ho-wp-0005` | Forgejo production migration is parked behind explicit design, SMTP, backup, runner, and cutover decisions. |
| `net-wp-0020` | OpenBao unseal/token custody remains an operator design and smoke gate. |
| `issue-wp-0003` | issue-core service is healthy; activity-core REST emission wiring remains. |
| `activity-wp-0006` | Calibration waits on the post-WP-0016 live daily-triage smoke and three clean scheduled runs. |
| `cust-wp-0038` | Full State Hub HA migration is deferred until the pragmatic railiance01 path stabilizes. |
| `cust-wp-0025` | FOS bootstrap resumes from identity integration and ops-hub evidence, not the old mega-hub scaffold. |
| `cust-wp-0011` | Active State Hub migration path; next gate is explicit cutover approval. |

Hygiene status:

- `CUST-WP-0045-cutover-runbook` is no longer active; it is a finished runbook
  record, not an empty active workstream.
- `CUST-WP-0014` is reopened as `backlog`; it is no longer a done workplan with
  todo task blocks.
- Completed or cancelled tasks no longer carry the stale human-needed flags
  cleared during this stabilization session.
- `make fix-consistency REPO=the-custodian` still reports pre-existing C-12
  orphan-row warnings, but the relevant workplan lifecycle and task states sync.
- `RAIL-BS-WP-0006-staged-promotion-lifecycle` is finished: all seven tasks
  are done, the workstream is finished in State Hub, and the file frontmatter
  is `status: finished`.

## Blocker Board

No live credential, access, or approval gate is unowned. Do not ask
`ops-warden` for secret values; use the route catalog, the `warden access`
assist/proxy surface where the catalog lane allows it, and the owning subsystem.

For credential-related blockers, classify the environment posture and workload
maturity first. Dev/test work can use synthetic contract doubles; production
real-value work needs owner custody, policy gates where applicable, and
non-secret evidence. See `docs/ops-warden-secret-posture-review.md`.

Do not implement ops-warden changes from this Custodian lane. New ops-warden
needs should be posted through State Hub as requirements or suggestions for the
separate ops-warden worker.

| Gate | Owner/route | Non-secret evidence to collect | Next action |
| --- | --- | --- | --- |
| State Hub pragmatic cutover | Custodian operator approval; `CUST-WP-0011-T07` | Final dump id/time, row-count comparison, chosen private endpoint, stabilization notes | Approve freeze/final restore and make railiance01 State Hub primary, or leave WSL2 primary explicitly. |
| State Hub fallback retirement | Custodian/operator approval; `CUST-WP-0038-T08` | HA failover drill id, restore drill id, stabilization pass | Keep deferred until after HA drills; do not retire WSL2 fallback early. |
| Inter-Hub ops-hub bootstrap | `inter-hub-bootstrap-ssh`, `openbao-api-key`, `ssh-cert-host-access` as needed | Hub id, manifest id, widget count, runtime key prefix only, smoke result | Legacy/fallback only. Prefer Core Hub deployed smoke; run attended Inter-Hub bootstrap only by explicit operator supersede/rollback decision. |
| Ops-hub runtime evidence key | `openbao-api-key` / OpenBao custody | OpenBao path/version or populated key count, event smoke id | Do not materialize legacy `OPS_HUB_KEY` until a deployed Core Hub smoke or explicit legacy Inter-Hub smoke is ready to use it. |
| Daily-triage live proof | activity-core deploy/runtime operator | State Hub `daily_triage` id, output-valid or partial/quarantine status, working-memory path | Bank the 2026-06-28 / 2026-06-29 / 2026-06-30 clean streak, then have the activity-core owner land/sync the in-flight WP-0016 diagnostics and prove bounded top-N plus graceful-degradation smoke. |
| activity-core to issue-core | route `activity-core-issue-sink` | `actcore-runtime-secret` has key, activity-core points to issue-core port `8765`, HTTP 201, Gitea issue id | Inject `ISSUE_CORE_API_KEY` through approved custody, set REST sink env, restart/sync, run safe emission. |
| Forgejo production design | Forgejo/operator decisions plus OpenBao/KeyCape/ops-bridge routes as needed | Decision id, SMTP smoke, backup/restore drill, package/action smoke, cutover approval id | Resolve T02 production choices before any production cutover work. |
| OpenBao unseal and credential helper | `openbao-api-key`, `railiance-infra-principals`, `ssh-cert-host-access`, `key-cape-oidc-login` | Policy names, role names, token accessor only, allow/deny smoke | `warden-sign` lane is verified/banked; broader custody profile and issuer automation remain separate operator-design gates. |
| ops-warden policy gate / warden-sign lane | `SECRETS-WP-0004` + `FLEX-WP-0007` finished; ops-warden operator posture | `decision:032b096c433ad80c`, `ttl_out_of_bounds`, backend `vault`; no token/role/secret/accessor values | No Custodian action. Keep `policy.enabled` off until testing/production maturity. |

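
For the activity-core to issue-core row, the sink preconditions can be checked without ever printing the key, in keeping with the non-secret evidence rule. A minimal sketch follows. `ISSUE_SINK_TYPE` and `ISSUE_CORE_API_KEY` are named in this document, while `ISSUE_CORE_URL` and the function name are assumptions made for illustration.

```bash
# Report REST sink readiness without echoing any secret value.
check_issue_sink_env() {
  rc=0
  if [ "${ISSUE_SINK_TYPE:-}" != "rest" ]; then
    echo "ISSUE_SINK_TYPE is not 'rest'"
    rc=1
  fi
  if [ -z "${ISSUE_CORE_API_KEY:-}" ]; then
    echo "ISSUE_CORE_API_KEY is unset"
    rc=1
  else
    # Only presence is reported, never the value.
    echo "ISSUE_CORE_API_KEY is set (value not shown)"
  fi
  # ISSUE_CORE_URL is a hypothetical variable name for the sink endpoint.
  case "${ISSUE_CORE_URL:-}" in
    *:8765|*:8765/*) ;;
    *) echo "ISSUE_CORE_URL does not target port 8765"; rc=1 ;;
  esac
  return "$rc"
}
```

A zero exit status only says the environment looks wired; the HTTP 201 and Gitea issue id from a safe emission remain the actual evidence.
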
## Daily Automation Evidence

The scheduled daily-triage runner is alive and writing State Hub plus working
memory evidence. The current blocker is bounded output-contract adoption and
live graceful-degradation proof, not scheduling or sink reachability.

Latest clean scheduled streak:

- 2026-06-28: event `f0d8477e-1db9-4c07-bb8c-d28cbb868abc`, schema-valid daily
  triage, working memory written.
- 2026-06-29: event `176d2ea7-f0e3-48cd-999b-4ab6055c6a55`, schema-valid daily
  triage, working memory written.
- 2026-06-30: event `27d695b2-a537-481b-ada6-ca84ec24cd96`, schema-valid daily
  triage, working memory written.

Latest failed scheduled runs before the clean streak:

- 2026-06-26: event `97fd20a0-eee0-45ea-8290-6d91874e1515`, validation failed
  at char 5268, working memory written.
- 2026-06-27: event `c5ab50a8-404b-4e30-849f-841b059ace65`, validation failed
  at char 5246, working memory written.

Bank the three-run calibration streak, but keep the WP-0016 live-proof gate open
until the bounded top-N contract and graceful-degradation smoke are proven. The
activity-core worktree currently has in-flight uncommitted ACTIVITY-WP-0016
and ACTIVITY-WP-0018/0019 changes, so Custodian should wait for that owner to
commit/sync or explicitly hand off before treating those files as source truth.
Use the activity-core repo-native automation status surface once it lands; do
not use assistant-provided scheduling as operational evidence.

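
The three-run calibration criterion is a trailing-streak count: consecutive clean runs at the end of the history, reset by any failure. A small sketch of that count, assuming the run history has been reduced to `DATE RESULT` lines where `RESULT` is `clean` or anything else. That input format and the function name are hypothetical; State Hub events do not arrive in this shape.

```bash
# Read "DATE RESULT" lines on stdin (oldest first) and print the length
# of the trailing run of clean results.
trailing_clean_streak() {
  streak=0
  while read -r day result; do
    if [ "$result" = "clean" ]; then
      streak=$((streak + 1))
    else
      streak=0  # any non-clean run resets the streak
    fi
  done
  echo "$streak"
}

# The five runs listed above, reduced to the assumed format.
trailing_clean_streak <<'EOF'
2026-06-26 failed
2026-06-27 failed
2026-06-28 clean
2026-06-29 clean
2026-06-30 clean
EOF
# prints 3
```
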
## Production Service Summary

| Surface | Stable fact | Remaining gate |
| --- | --- | --- |
| State Hub | Pragmatic railiance01 path has image, manifests, empty deploy, migrations, restored WSL2 data, row-count comparison, and healthy API through `CUST-WP-0011-T06`. | `CUST-WP-0011-T07` cutover approval, then stabilization; HA path stays deferred. |
| Inter-Hub / Core Hub | Public `https://hub.coulomb.social/api/v2/hubs` exposes `ops-hub`; `CORE-WP-0008` finished the Core Hub API smoke harness, activity-core sink, staging profile, CLI wrappers, UI backlog, and Custodian handoff. | Run deployed Core Hub smoke, staging import, activity-core sink smoke, and readiness summary; keep Haskell Inter-Hub only for migration/rollback proof. |
| ops-hub evidence | `CUST-WP-0025-T14` is done with the Core Hub ops evidence contract spec. `CUST-WP-0025-T13` through `T19` now use Core Hub API/CLI/UI gates; `CUST-WP-0047` and `CUST-WP-0049` remain legacy/fallback records. | Execute `CUST-WP-0025-T16`, `T17`, and `T18`; close legacy Inter-Hub waits only through deployed Core Hub evidence or explicit supersede decision. |
| issue-core | ArgoCD service is healthy on port `8765`; image `0.2.1`; ExternalSecret Ready; authenticated smoke created Gitea issue `175`. | activity-core still needs `ISSUE_CORE_API_KEY`, URL port `8765`, `ISSUE_SINK_TYPE=rest`, and a safe emission smoke. |
| Forgejo | Migration inventory/design lane is active but pre-cutover. | Production design decisions, SMTP/email recovery, package registry, Actions, backup/restore, migration drill, cutover approval. |
| artifact-store | D7.1 is done; D7.2 has an opt-in live MinIO compatibility harness and manual smoke docs. No live secret handoff is recorded. | Run D7.2 against an approved MinIO-compatible endpoint, then route D7.3 STS vending through identity/platform custody before changing credential behavior. |
| secrets-engine | `SECRETS-WP-0004` is finished: the scoped `warden-sign` lane supported the vault-backed policy-gate smoke without exposing token material. `SECRETS-WP-0003` remains active for the real whynot-design npm publish pilot. | Finish or park `SECRETS-WP-0003` behind Gitea bot/package-token provisioning, OpenBao custody, ops-warden route confirmation, and real package publish evidence. |
| FOS hub | Old NK-WP-0001 Keycloak prerequisite is cancelled; NK-WP-0002 local identity, IAM Profile v0.2, the Core Hub FastAPI IAM Profile integration test, and Core Hub operator UI first screens are done; hub-core extraction/dev-hub work is done; CUST-WP-0025 Phase 3 has been rewritten for Core Hub. | Execute the remaining Core Hub deployed evidence and cutover gates: `CUST-WP-0025-T16` and `T17`. |

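
The row-count comparison cited for the State Hub row can be expressed as a diff over per-table counts. A sketch, assuming each side has been dumped to a file of `TABLE COUNT` lines. The file format, table names, and function name are illustrative only and do not describe how the `CUST-WP-0011-T06` comparison was actually run.

```bash
# Print tables whose row counts differ between two snapshots, or that
# appear in only one of them.
# $1: source counts file, $2: target counts file ("TABLE COUNT" per line)
# Assumes the source file is non-empty.
rowcount_mismatches() {
  awk 'NR == FNR { src[$1] = $2; next }
       { seen[$1] = 1; if (!($1 in src) || src[$1] != $2) print $1 }
       END { for (t in src) if (!(t in seen)) print t }' "$1" "$2" | sort
}
```

An empty result would mean both snapshots agree table by table; only table names are printed, so the output is safe to attach as non-secret evidence.
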
## Next-Pick List

1. Execute the remaining rewritten `CUST-WP-0025` Core Hub gates: deployed
   smoke and activity-core proof (`T16`) and cutover decision coupling (`T17`).
   T03, T14, and T18 are complete as the identity integration template, ops
   evidence/read-model contract, and operator UI first-screen gates.
2. Keep `CUST-WP-0047` and `CUST-WP-0049` as legacy evidence/fallback until
   Core Hub deployed smoke evidence or an explicit supersede decision closes
   them.
3. Bank the 2026-06-28 / 2026-06-29 / 2026-06-30 clean daily-triage
   streak for calibration, then have the activity-core owner land/sync the
   in-flight WP-0016 diagnostics/status work and prove the bounded top-N plus
   graceful-degradation smoke.
4. Complete the issue-core handoff by wiring activity-core to port `8765` with
   `ISSUE_SINK_TYPE=rest` and one known-safe emission smoke.
5. Request explicit State Hub cutover approval for `CUST-WP-0011-T07`, or
   record that WSL2 remains primary for the next operating period.
6. Run artifact-store D7.2 live MinIO-compatible evidence; Forgejo and storage
   work can now inherit the finished staged-promotion gates.
7. Keep `SECRETS-WP-0003` parked until Gitea bot/package-token provisioning,
   OpenBao custody, route confirmation, and a coordinated whynot-design version
   bump are available.
8. Keep Forgejo cutover and State Hub HA work parked until their human decision
   and drill gates are satisfied.

## Resume Commands

```bash
cd /home/worsch/the-custodian
sed -n '1,260p' workplans/CUST-WP-0051-infrastructure-stabilization-metaplan.md
sed -n '1,260p' docs/infrastructure-stabilization-pickup-checkpoint.md
sed -n '1,260p' docs/credential-custody-unblock-board.md
```

After workplan edits, sync from State Hub:

```bash
cd /home/worsch/state-hub
make fix-consistency REPO=the-custodian
```