generated from coulomb/repo-seed
Make Temporal activity timeout env-configurable (ADHOC-2026-06-01-T03)
The CUST-WP-0045 daily triage canary on 2026-06-01 hit a BrokenPipeError on the llm-connect side. Two 5-minute timeouts were racing:

- _ACTIVITY_TIMEOUT = timedelta(minutes=5) in workflows.py
- LLM_CONNECT_TIMEOUT_SECONDS default 300 in llm_client.py

The 10KB curated digest + max_depth:2 + JSON schema enforcement pushed Claude past 5 minutes. Whichever timer fired first killed the httpx call; the model's late response arrived to a closed socket.

Read _ACTIVITY_TIMEOUT from ACTIVITY_TIMEOUT_SECONDS env (default 900 — 15 minutes) so judgement-call activities have headroom for slow LLM runs. Operators should also widen httpx via LLM_CONNECT_TIMEOUT_SECONDS=840 so httpx still times out slightly before Temporal, preserving the clean-error contract.

Tests: 120 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
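
The ordering described in the message (httpx gives up slightly before Temporal) could be checked at startup. The following is a minimal sketch, not code from this commit: `load_timeouts`, `http_fires_first`, and the 60-second margin are hypothetical, and only the two environment variable names and their defaults (900 and 300) come from the message.

```python
from datetime import timedelta

# Defaults as stated in the commit message.
DEFAULT_ACTIVITY_TIMEOUT_SECONDS = 900
DEFAULT_LLM_CONNECT_TIMEOUT_SECONDS = 300


def load_timeouts(env):
    """Return (activity_timeout, http_timeout) read from an env-like mapping."""
    activity = timedelta(
        seconds=int(env.get("ACTIVITY_TIMEOUT_SECONDS", DEFAULT_ACTIVITY_TIMEOUT_SECONDS))
    )
    http = timedelta(
        seconds=int(env.get("LLM_CONNECT_TIMEOUT_SECONDS", DEFAULT_LLM_CONNECT_TIMEOUT_SECONDS))
    )
    return activity, http


def http_fires_first(activity, http, margin=timedelta(seconds=60)):
    """True when the HTTP timeout expires at least `margin` before the activity timeout.

    The 60-second margin is an assumption chosen to match the 840/900 pairing
    suggested in the commit message.
    """
    return http + margin <= activity


if __name__ == "__main__":
    import os

    activity, http = load_timeouts(os.environ)
    if not http_fires_first(activity, http):
        raise SystemExit(
            f"LLM_CONNECT_TIMEOUT_SECONDS ({http}) should expire before "
            f"ACTIVITY_TIMEOUT_SECONDS ({activity})"
        )
```

Under this check the pre-fix configuration (both timers at 300 seconds) is rejected, while the suggested 840/900 pairing passes.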
@@ -11,6 +11,7 @@ Workflow IDs follow the conventions in docs/conventions.md:
 
 from __future__ import annotations
 
+import os
 import uuid
 from datetime import timedelta
 
@@ -42,7 +43,9 @@ _RETRY_POLICY = RetryPolicy(
     maximum_attempts=10,
 )
 
-_ACTIVITY_TIMEOUT = timedelta(minutes=5)
+_ACTIVITY_TIMEOUT = timedelta(
+    seconds=int(os.environ.get("ACTIVITY_TIMEOUT_SECONDS", "900"))
+)
 _TASK_QUEUE = "task-execution-tq"
 
 
@@ -167,3 +167,33 @@ Done when either: (a) rule action fields interpolate `context.*`
 expressions and a stale-repo workflow run emits a TaskSpec with the actual
 repo slug, or (b) a recorded decision explicitly defers/declines the change
 with reasoning.
+
+### T03 - Make activity-core's Temporal activity timeout env-configurable
+
+```task
+id: ADHOC-2026-06-01-T03
+status: done
+priority: low
+state_hub_task_id: "bc9c9edb-e20b-4ff9-a15d-6e3e81f9b5e1"
+```
+
+Discovered during the CUST-WP-0045 T06 canary on 2026-06-01. The daily
+triage instruction call hit `BrokenPipeError` on the llm-connect side
+because two 5-minute timeouts were racing:
+
+- `_ACTIVITY_TIMEOUT = timedelta(minutes=5)` in `workflows.py`
+- `LLM_CONNECT_TIMEOUT_SECONDS` default `300` in `llm_client.py`
+
+The 10KB curated digest + `max_depth: 2` + JSON schema enforcement pushed
+Claude past 5 minutes. Whichever timer fired first killed the httpx call,
+and the model's late response arrived to a closed socket.
+
+Fix: read `_ACTIVITY_TIMEOUT` from env `ACTIVITY_TIMEOUT_SECONDS` (default
+`900` — 15 minutes), so the Temporal activity outlives a normal slow LLM
+run. Operators are expected to also widen httpx via
+`LLM_CONNECT_TIMEOUT_SECONDS=840` (or similar) so httpx still times out
+slightly *before* Temporal, preserving the clean-error contract.
+
+The activity timeout default is now larger by design — Temporal will still
+heartbeat and Temporal-side cancellation still works; this only widens the
+upper bound for long judgment-call activities like the daily triage.
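
The change above parses the variable with a bare `int(...)` at module import, so a non-numeric `ACTIVITY_TIMEOUT_SECONDS` raises `ValueError` when the module loads, and a zero or negative value passes through unchecked. A stricter reader might look like the sketch below; `read_timeout` and its error messages are hypothetical and not part of this commit.

```python
import os
from datetime import timedelta


def read_timeout(name, default_seconds, env=None):
    """Read a positive whole-second timeout from the environment.

    Falls back to `default_seconds` when the variable is unset or blank,
    and raises ValueError with the variable name on bad input.
    """
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return timedelta(seconds=default_seconds)
    try:
        seconds = int(raw)
    except ValueError:
        raise ValueError(
            f"{name} must be an integer number of seconds, got {raw!r}"
        ) from None
    if seconds <= 0:
        raise ValueError(f"{name} must be positive, got {seconds}")
    return timedelta(seconds=seconds)


# Same default as the diff: 900 seconds (15 minutes).
_ACTIVITY_TIMEOUT = read_timeout("ACTIVITY_TIMEOUT_SECONDS", 900)
```

The trade-off is a few extra lines in exchange for an error message that names the offending variable, which may matter when the value is set by operators rather than developers.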