From c79d0980a9b214e31540be441d56967a64f575c0 Mon Sep 17 00:00:00 2001
From: tegwick
Date: Tue, 2 Jun 2026 08:10:24 +0200
Subject: [PATCH] Make Temporal activity timeout env-configurable (ADHOC-2026-06-01-T03)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The CUST-WP-0045 daily triage canary on 2026-06-01 hit a BrokenPipeError
on the llm-connect side. Two 5-minute timeouts were racing:

- _ACTIVITY_TIMEOUT = timedelta(minutes=5) in workflows.py
- LLM_CONNECT_TIMEOUT_SECONDS default 300 in llm_client.py

The 10KB curated digest + max_depth:2 + JSON schema enforcement pushed
Claude past 5 minutes. Whichever timer fired first killed the httpx call;
the model's late response arrived to a closed socket.

Read _ACTIVITY_TIMEOUT from ACTIVITY_TIMEOUT_SECONDS env (default 900,
i.e. 15 minutes) so judgement-call activities have headroom for slow LLM
runs. Operators should also widen httpx via LLM_CONNECT_TIMEOUT_SECONDS=840
so httpx still times out slightly before Temporal, preserving the
clean-error contract.

Tests: 120 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.7
---
 src/activity_core/workflows.py |  5 ++++-
 workplans/ADHOC-2026-06-01.md  | 30 ++++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/src/activity_core/workflows.py b/src/activity_core/workflows.py
index c2f1ee4..afc84ae 100644
--- a/src/activity_core/workflows.py
+++ b/src/activity_core/workflows.py
@@ -11,6 +11,7 @@ Workflow IDs follow the conventions in docs/conventions.md:
 
 from __future__ import annotations
 
+import os
 import uuid
 from datetime import timedelta
 
@@ -42,7 +43,9 @@ _RETRY_POLICY = RetryPolicy(
     maximum_attempts=10,
 )
 
-_ACTIVITY_TIMEOUT = timedelta(minutes=5)
+_ACTIVITY_TIMEOUT = timedelta(
+    seconds=int(os.environ.get("ACTIVITY_TIMEOUT_SECONDS", "900"))
+)
 
 _TASK_QUEUE = "task-execution-tq"
 
diff --git a/workplans/ADHOC-2026-06-01.md b/workplans/ADHOC-2026-06-01.md
index 8435379..1c2c716 100644
--- a/workplans/ADHOC-2026-06-01.md
+++ b/workplans/ADHOC-2026-06-01.md
@@ -167,3 +167,33 @@ Done when either:
 (a) rule action fields interpolate `context.*` expressions and a
 stale-repo workflow run emits a TaskSpec with the actual repo slug, or
 (b) a recorded decision explicitly defers/declines the change with reasoning.
+
+### T03 - Make activity-core's Temporal activity timeout env-configurable
+
+```task
+id: ADHOC-2026-06-01-T03
+status: done
+priority: low
+state_hub_task_id: "bc9c9edb-e20b-4ff9-a15d-6e3e81f9b5e1"
+```
+
+Discovered during the CUST-WP-0045 T06 canary on 2026-06-01. The daily
+triage instruction call hit `BrokenPipeError` on the llm-connect side
+because two 5-minute timeouts were racing:
+
+- `_ACTIVITY_TIMEOUT = timedelta(minutes=5)` in `workflows.py`
+- `LLM_CONNECT_TIMEOUT_SECONDS` default `300` in `llm_client.py`
+
+The 10KB curated digest + `max_depth: 2` + JSON schema enforcement pushed
+Claude past 5 minutes. Whichever timer fired first killed the httpx call,
+and the model's late response arrived to a closed socket.
+
+Fix: read `_ACTIVITY_TIMEOUT` from env `ACTIVITY_TIMEOUT_SECONDS` (default
+`900`, i.e. 15 minutes), so the Temporal activity outlives a normal slow LLM
+run. Operators are expected to also widen httpx via
+`LLM_CONNECT_TIMEOUT_SECONDS=840` (or similar) so httpx still times out
+slightly *before* Temporal, preserving the clean-error contract.
+
+The activity timeout default is now larger by design. Temporal will still
+heartbeat and Temporal-side cancellation still works; this only widens the
+upper bound for long judgment-call activities like the daily triage.
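
The ordering constraint described above (httpx must give up slightly before the Temporal activity does) could be checked at startup. The following is a minimal sketch under stated assumptions, not code from this patch: `read_timeouts` is a hypothetical helper, and the defaults are the ones named in the commit message (900 for `ACTIVITY_TIMEOUT_SECONDS`, 300 for `LLM_CONNECT_TIMEOUT_SECONDS`).

```python
import os
from datetime import timedelta


def read_timeouts(env=None):
    """Return (activity_timeout, http_timeout) as timedeltas.

    Hypothetical helper, not part of the patch. It raises ValueError when
    the HTTP client would not time out strictly before the Temporal
    activity, which is the "clean-error contract" the commit describes.
    """
    env = os.environ if env is None else env

    # Defaults as stated in the commit message: 900 s for the activity,
    # 300 s for the llm_client.py httpx timeout.
    activity_s = int(env.get("ACTIVITY_TIMEOUT_SECONDS", "900"))
    http_s = int(env.get("LLM_CONNECT_TIMEOUT_SECONDS", "300"))

    if activity_s <= 0 or http_s <= 0:
        raise ValueError("timeouts must be positive integers (seconds)")
    if http_s >= activity_s:
        raise ValueError(
            f"LLM_CONNECT_TIMEOUT_SECONDS ({http_s}) must be smaller than "
            f"ACTIVITY_TIMEOUT_SECONDS ({activity_s})"
        )
    return timedelta(seconds=activity_s), timedelta(seconds=http_s)


if __name__ == "__main__":
    # The operator setting suggested in the commit message.
    activity, http = read_timeouts(
        {"ACTIVITY_TIMEOUT_SECONDS": "900", "LLM_CONNECT_TIMEOUT_SECONDS": "840"}
    )
    print(f"activity={activity}, http={http}, headroom={activity - http}")
```

With the suggested values the headroom is 60 seconds. The pre-patch pairing of two 300-second timers would be rejected, since neither timer is guaranteed to fire first.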