docs(ACTIVITY-WP-0014): T01 root cause — State Hub Connection refused, not misfire
Live inspection of railiance01 (ssh + in-node kubectl/temporal) overturns the catchup_window hypothesis: the daily-triage schedule is healthy (CatchupWindow 365d default, 0 MissedCatchupWindow). The 2026-06-23T05:20Z fire ran but Failed at the report sink with '[Errno 111] Connection refused' posting to State Hub. railiance01 reaches State Hub via a reverse tunnel back to the workstation, which is unreachable at 07:20 Europe/Berlin (102 resolver timeouts in 24h). Mark T01 done; add T05 for resilient sinks/resolvers as the real incident fix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -64,18 +64,45 @@ Proposed mapping to a new `misfire_policy` value set (names open to review):
 
 ```task
 id: ACTIVITY-WP-0014-T01
-status: todo
+status: done
 priority: high
 state_hub_task_id: "c90ff214-9214-48c7-96b9-7d699528d5ab"
 ```
 
-Bring up the ops-bridge tunnel (`bridge up state-hub-coulombcore`,
-`bridge up state-hub-mcp-coulombcore`; tunnel to railiance01 was down from the
-workstation during diagnosis). Inspect the live Temporal schedule
-`activity-schedule-<daily-triage-id>`: paused state, configured catchup window,
-recent action/fire history, and worker pod status. Confirm whether the
-06-22/06-23 07:20 UTC fires were dropped vs. failed. Backfill the missed runs if
-calibration evidence is still wanted. Record findings in the workplan.
+Inspected via `ssh railiance01` + in-node `kubectl`/`temporal` (no k3s tunnel is
+defined for railiance01; the documented access path is SSH to the host).
+
+**Findings (2026-06-23) — the WP-0014 premise was wrong for this incident:**
+
+- All pods healthy; `actcore-worker` up 44h, 0 restarts. Not a crash.
+- The daily-triage Temporal schedule (`activity-schedule-6fca51fa-…`) is
+**healthy**: `Paused false`, `OverlapPolicy Skip`, **`CatchupWindow 365d`**
+(Temporal's *default* when unset), `ActionCounts {Total:8, MissedCatchupWindow:0}`.
+So fires were **not** silently dropped — my original "no catchup window → silent
+drop" hypothesis does not hold; the server default is already 365d.
+- The `2026-06-23T05:20:00Z` fire **did fire and ran**, then **Failed at the report
+sink**: `report sink failure: state-hub-progress … '[Errno 111] Connection
+refused'`. The run produced a report but could not deliver it to State Hub, so
+no `daily_triage` progress event (not even a "could not run" one) was posted →
+the silence. The 06-22 fire has no execution in retention (bridge likely down
+then too / schedule update window at `LastUpdateAt 1d ago`).
+- Root cause is **State Hub connectivity from railiance01**, not Temporal. The
+in-cluster `actcore-state-hub-bridge` (`hostNetwork`) proxies to
+`127.0.0.1:18000` on the node — the local end of the ops-bridge **reverse tunnel
+back to the workstation's State Hub**. At 07:20 Europe/Berlin (= 05:20 UTC) the
+workstation/tunnel was unreachable → `Connection refused`. Chronic flakiness
+confirmed: 102 State Hub resolver timeouts in 24h (69 `recently_on_scope`,
+33 `consistency_sweep`).
+
+**Implication:** the trigger *is* independent of the laptop, but the triage's
+**data dependencies (State Hub context resolution + report delivery) still route
+back to the workstation State Hub**, which is asleep at 07:20 Berlin. WP-0014's
+misfire policies are still good robustness, but the real fix is (a) State Hub
+reachable from railiance01 independent of the workstation, and/or (b) sinks/
+resolvers resilient to transient State Hub unavailability (retry/backoff,
+store-and-forward) instead of hard-failing the workflow. Tracked as follow-up
+below. Backfill deferred: a replay only succeeds while the workstation State Hub
+is reachable.
 
 ## Implement explicit misfire recovery modes
 
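The T01 findings separate a `Connection refused` at the report sink from the resolver timeouts. A plain TCP probe run on the node can reproduce that distinction. The following is a minimal sketch, assuming Python is available on railiance01. `probe_tcp` is a hypothetical helper and not part of activity-core; `127.0.0.1:18000` is the local tunnel end named in the findings.

```python
import errno
import socket


def probe_tcp(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify one TCP connect attempt as 'open', 'refused', 'timeout' or 'error:<name>'.

    Hypothetical diagnostic helper; not an activity-core or ops-bridge API.
    """
    try:
        # create_connection performs the TCP handshake and raises on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # errno 111 (ECONNREFUSED) on Linux: the attempt was actively rejected,
        # typically because nothing is listening on that port.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No answer within `timeout` seconds.
        return "timeout"
    except OSError as exc:
        # Anything else (unreachable network, etc.), reported by errno name.
        return f"error:{errno.errorcode.get(exc.errno, 'unknown')}"


if __name__ == "__main__":
    # Local end of the reverse tunnel, as described in the findings.
    print(probe_tcp("127.0.0.1", 18000))
```

Run at the schedule's fire time, a `refused` result would be consistent with the tunnel's local listener being down, whereas `timeout` would point at a listener that accepts nothing in time. Either reading is an interpretation of the probe, not something the commit itself verifies.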
@@ -122,3 +149,27 @@ Choose and set the appropriate `misfire_policy` for `daily-statehub-wsjf-triage`
 multi-day outage should not flood the triage feed). Update the Railiance runtime
 ConfigMap / bundle, redeploy, and document the run-miss options + per-definition
 guidance in `docs/runbook.md`. Depends on T01 (confirm) and T02 (modes exist).
+
+## Resilient State Hub sinks/resolvers (real incident fix)
+
+```task
+id: ACTIVITY-WP-0014-T05
+status: todo
+priority: high
+state_hub_task_id: ""
+```
+
+T01 proved the 06-22/06-23 silence was **not** a Temporal misfire but a State Hub
+**`Connection refused` at the report sink** (and chronic resolver timeouts) because
+railiance01 reaches State Hub via a reverse tunnel back to the workstation, which
+is asleep at 07:20 Berlin. Misfire policies do not help: the run fires and fails
+the same way. Make activity-core resilient to transient State Hub unavailability:
+
+- Report sinks should retry with backoff and **not hard-fail the workflow** when
+the only failure is transient State Hub delivery; preserve the generated report
+(working-memory note + a deferred/outbox state-hub-progress) for later flush.
+- Required State Hub context resolvers should retry/backoff and surface a clear,
+single diagnostic rather than a bare `timed out`.
+- Separately (out of this repo): give railiance01 a State Hub endpoint that does
+not depend on the workstation being awake, or run the triage at a time the
+workstation is reliably up. Owner decision needed.
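The first T05 bullet (retry with backoff, then defer instead of hard-failing) could look roughly like the sketch below. It is written under stated assumptions and is not activity-core's implementation: `deliver_with_retry`, `deliver_or_defer`, `flush_outbox`, the JSON-lines outbox file, and the `send` callable are all hypothetical. A real sink might instead lean on Temporal activity retry policies for the retry part.

```python
import json
import time
from pathlib import Path
from typing import Callable

# Errors treated as transient delivery failures. ConnectionRefusedError
# ('[Errno 111] Connection refused') is a subclass of ConnectionError.
TRANSIENT = (ConnectionError, TimeoutError)


def deliver_with_retry(send: Callable[[dict], None], payload: dict,
                       attempts: int = 4, base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try `send` up to `attempts` times with exponential backoff between tries."""
    for attempt in range(attempts):
        try:
            send(payload)
            return True
        except TRANSIENT:
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return False


def deliver_or_defer(send: Callable[[dict], None], payload: dict,
                     outbox: Path, **retry_kwargs) -> str:
    """Deliver if possible; otherwise append to the outbox instead of raising."""
    if deliver_with_retry(send, payload, **retry_kwargs):
        return "delivered"
    with outbox.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(payload) + "\n")
    return "deferred"


def flush_outbox(send: Callable[[dict], None], outbox: Path) -> int:
    """Resend deferred payloads in order; stop at the first transient failure."""
    if not outbox.exists():
        return 0
    lines = outbox.read_text(encoding="utf-8").splitlines()
    pending = [json.loads(line) for line in lines if line.strip()]
    sent = 0
    for payload in pending:
        try:
            send(payload)
        except TRANSIENT:
            break
        sent += 1
    # Keep only what could not be delivered yet.
    outbox.write_text("".join(json.dumps(p) + "\n" for p in pending[sent:]),
                      encoding="utf-8")
    return sent
```

With this shape, a 05:20 UTC run whose delivery is refused would end in a `deferred` state with the report preserved, and a later `flush_outbox` call (for example once the workstation State Hub is reachable again) would post it. Whether deferral should count as workflow success is a design decision the task leaves open.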