Make Temporal activity timeout env-configurable (ADHOC-2026-06-01-T03)

The CUST-WP-0045 daily triage canary on 2026-06-01 hit a BrokenPipeError
on the llm-connect side. Two 5-minute timeouts were racing:

- _ACTIVITY_TIMEOUT = timedelta(minutes=5) in workflows.py
- LLM_CONNECT_TIMEOUT_SECONDS default 300 in llm_client.py

The 10KB curated digest + max_depth:2 + JSON schema enforcement pushed
Claude past 5 minutes. Whichever timer fired first killed the httpx call;
the model's late response arrived to a closed socket.

Read _ACTIVITY_TIMEOUT from ACTIVITY_TIMEOUT_SECONDS env (default 900 —
15 minutes) so judgement-call activities have headroom for slow LLM runs.
Operators should also widen httpx via LLM_CONNECT_TIMEOUT_SECONDS=840 so
httpx still times out slightly before Temporal, preserving the
clean-error contract.

Tests: 120 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
2026-06-02 08:10:24 +02:00
parent a8d3cc2782
commit c79d0980a9
2 changed files with 34 additions and 1 deletions

View File

@@ -11,6 +11,7 @@ Workflow IDs follow the conventions in docs/conventions.md:
from __future__ import annotations from __future__ import annotations
import os
import uuid import uuid
from datetime import timedelta from datetime import timedelta
@@ -42,7 +43,9 @@ _RETRY_POLICY = RetryPolicy(
maximum_attempts=10, maximum_attempts=10,
) )
_ACTIVITY_TIMEOUT = timedelta(minutes=5) _ACTIVITY_TIMEOUT = timedelta(
seconds=int(os.environ.get("ACTIVITY_TIMEOUT_SECONDS", "900"))
)
_TASK_QUEUE = "task-execution-tq" _TASK_QUEUE = "task-execution-tq"

View File

@@ -167,3 +167,33 @@ Done when either: (a) rule action fields interpolate `context.*`
expressions and a stale-repo workflow run emits a TaskSpec with the actual expressions and a stale-repo workflow run emits a TaskSpec with the actual
repo slug, or (b) a recorded decision explicitly defers/declines the change repo slug, or (b) a recorded decision explicitly defers/declines the change
with reasoning. with reasoning.
### T03 - Make activity-core's Temporal activity timeout env-configurable
```task
id: ADHOC-2026-06-01-T03
status: done
priority: low
state_hub_task_id: "bc9c9edb-e20b-4ff9-a15d-6e3e81f9b5e1"
```
Discovered during the CUST-WP-0045 T06 canary on 2026-06-01. The daily
triage instruction call hit `BrokenPipeError` on the llm-connect side
because two 5-minute timeouts were racing:
- `_ACTIVITY_TIMEOUT = timedelta(minutes=5)` in `workflows.py`
- `LLM_CONNECT_TIMEOUT_SECONDS` default `300` in `llm_client.py`
The 10KB curated digest + `max_depth: 2` + JSON schema enforcement pushed
Claude past 5 minutes. Whichever timer fired first killed the httpx call,
and the model's late response arrived to a closed socket.
Fix: read `_ACTIVITY_TIMEOUT` from env `ACTIVITY_TIMEOUT_SECONDS` (default
`900` — 15 minutes), so the Temporal activity outlives a normal slow LLM
run. Operators are expected to also widen httpx via
`LLM_CONNECT_TIMEOUT_SECONDS=840` (or similar) so httpx still times out
slightly *before* Temporal, preserving the clean-error contract.
The activity timeout default is now larger by design — Temporal will still
heartbeat and Temporal-side cancellation still works; this only widens the
upper bound for long judgment-call activities like the daily triage.