Add project scaffold: contracts, schemas, docker-compose, workplans

Phase 0 contracts (event envelope, ActivityDefinition, idempotency doc, naming conventions) and Phase 1 Temporal cluster setup (docker-compose.dev.yml, Temporal dynamic config) are complete. Includes Pydantic models, JSON schemas, wiki architecture docs, and ADR-001 workplan files for both workstreams. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-04 22:45:40 +01:00
parent 098c13febd
commit 6f9132314f
16 changed files with 2652 additions and 0 deletions
--- a/wiki/ActivityCorePlan_chtgpt.md
+++ b/wiki/ActivityCorePlan_chtgpt.md
@@ -0,0 +1,214 @@
+This is a protplan that should be digested and compared to form an actionable workplan.
+
+If you’re leaning toward **self-hosted Temporal**, your “Activity = event-driven task factory” maps almost perfectly to:
+
+* **Temporal Schedules** → create a durable, centrally managed trigger that starts a workflow on a cadence (better than “cron workflows”). ([docs.temporal.io][1])
+* **Workflows** → your “activity processor” (evaluate context + decide 0..N tasks)
+* **Activities** → your concrete “tasks” (side effects, API calls, jobs, human-task creation, etc.)
+* **Signals / events** → external event ingestion into running workflows (or into starter workflows)
+
+And Temporal’s server side is explicitly designed for HA scaling across its core subsystems (Frontend/History/Matching/Worker). ([docs.temporal.io][2])
+
+Below is a consolidated, practical workplan to set up a **backbone service** for a robust event-driven architecture, using Temporal as the orchestration spine.
+
+---
+
+## Consolidated backbone architecture (Temporal-centered EDA)
+
+### Backbone components
+
+1. **Temporal Service (server)**
+
+   * Temporal Server (Frontend, History, Matching, Worker services) ([docs.temporal.io][2])
+   * Persistence store (SQL or Cassandra) + Visibility store (SQL and/or Elasticsearch depending on features) ([docs.temporal.io][3])
+2. **Temporal Workers (your code)**
+
+   * “Activity Orchestrator Workflows” (your Activity runtime)
+   * Activities (task executors / integrators)
+3. **Event ingress/egress**
+
+   * Ingress: broker subscriptions → “event router” → Temporal (start workflow / signal workflow)
+   * Egress: Temporal activities publish domain events to broker
+4. **Admin + Observability**
+
+   * Temporal Web UI (ops visibility, schedules page, etc.) ([docs.temporal.io][4])
+   * Prometheus/Grafana + logs + tracing (OpenTelemetry if you want end-to-end)
+
+---
+
+## Workplan (phased, production-minded)
+
+### Phase 0 — Decide the minimum “contract” for your EDA
+
+**Deliverable:** a stable event & workflow contract so everything stays modular.
+
+* **Event envelope (internal standard)**:
+  `event_id`, `type`, `source`, `occurred_at`, `subject`, `trace_id`, `schema_version`, `payload`
+* **Idempotency standard**:
+
+  * Every inbound event has a stable `event_id`
+  * Every scheduled run has stable `(activity_id, scheduled_for)`
+* **Naming/partitioning conventions**:
+
+  * Temporal **Namespace** strategy (e.g., `prod`, `stage`, or per-tenant)
+  * Task Queues per service boundary (e.g., `billing-tq`, `notifications-tq`)
+
+---
+
+### Phase 1 — Stand up Temporal Service on Kubernetes (self-hosted)
+
+**Deliverable:** a working Temporal cluster with persistence + UI.
+
+1. **Provision persistence + visibility dependencies**
+
+   * Choose **PostgreSQL/MySQL** (common) or Cassandra, plus optional Elasticsearch for advanced visibility. Temporal self-hosted deployments need you to provide these stores. ([docs.temporal.io][5])
+2. **Deploy Temporal via official Helm chart**
+
+   * Temporal maintains official Helm charts for Kubernetes deployments. ([docs.temporal.io][5])
+3. **Deploy Temporal Web UI**
+
+   * Enable the UI so you can inspect workflows and schedules. ([docs.temporal.io][4])
+4. **Production hardening basics**
+
+   * NetworkPolicies, PodSecurity, resource limits, HPA
+   * Backups for DB/ES
+   * Separate node pools if needed for noisy workloads
+
+*Note:* `temporalio/auto-setup` is excellent for dev or quick bootstrap (Docker), but for production you typically run server components + managed/provisioned DB/ES explicitly. ([Docker Hub][6])
+
+---
+
+### Phase 2 — Establish the “Activity Orchestrator” as a workflow pattern
+
+**Deliverable:** one end-to-end ActivityDefinition that spawns tasks robustly.
+
+Implement this canonical workflow:
+
+**Workflow: `RunActivity(activity_id, trigger)`**
+
+1. Load `ActivityDefinition` (versioned)
+2. Resolve context snapshot (query DB/APIs)
+3. Evaluate rules → decide `TaskInstances[]`
+4. Execute tasks as Temporal **activities** (or create “human tasks” in your DB)
+5. Emit `TaskCreated` / `TaskCompleted` events (activities publish to broker)
+6. Record run audit (context hash, produced tasks, version)
+
+**Key guardrails**
+
+* **Idempotency:** use deterministic workflow IDs for scheduled runs:
+  `workflow_id = activity_id + ":" + scheduled_for`
+* **Exactly-once effect:** for side effects, prefer **outbox** in your DB or make activities idempotent (store `event_id` / `task_instance_id`).
+
+---
+
+### Phase 3 — Replace cron with Temporal Schedules (first-class triggers)
+
+**Deliverable:** schedules are managed in Temporal, not in random cronjobs.
+
+* Use **Temporal Schedules** to start `RunActivity(...)` at times/intervals (and manage them centrally). ([docs.temporal.io][1])
+* Store your “human editable schedule spec” in your ActivityRegistry, but **materialize** it into Temporal schedules.
+* Decide “missed run” policy:
+
+  * catch up (bounded)
+  * skip
+  * compress (run once with widened context)
+
+This is the cleanest alignment with your research draft: “timer ingress → trigger event → processor → spawn tasks”, except Temporal gives you durable state, retries, and execution history by default.
+
+---
+
+### Phase 4 — Add external events: broker → Temporal
+
+**Deliverable:** event-driven triggers land reliably in Temporal.
+
+1. Introduce an **Event Router** service:
+
+   * Subscribes to Kafka/NATS/Rabbit/etc.
+   * Validates schema + authN/authZ
+   * Applies routing rules: `(event.type, filters) -> activity_id(s)`
+2. For each match, it either:
+
+   * **Starts** `RunActivity(activity_id, trigger_event)` (if no long-lived instance)
+   * Or **Signals** an existing workflow instance (if you have “stateful ongoing activities”)
+
+**Rule of thumb**
+
+* If the “activity” is inherently recurring and stateless per run → start per trigger.
+* If the “activity” is an ongoing coordination process (state machine) → signal a long-lived workflow.
+
+---
+
+### Phase 5 — Observability and operability as first-class product
+
+**Deliverable:** you can run this for months without fear.
+
+* Temporal UI for inspection (workflows + schedules). ([docs.temporal.io][4])
+* Metrics dashboards:
+
+  * schedule trigger rate, workflow start rate
+  * activity failures, retries, latency
+  * queue depth by task queue
+* Alerts:
+
+  * sustained workflow task failures
+  * persistence latency spikes
+  * DLQ / poison message rate (in broker layer)
+* Runbooks:
+
+  * how to replay/repair
+  * how to disable an activity safely
+  * how to backfill events
+
+---
+
+### Phase 6 — Security hardening (when you’re past MVP)
+
+**Deliverable:** “robust backbone” in the real sense.
+
+* mTLS between workers and Temporal (and between services)
+* Namespace isolation (and per-namespace access controls if applicable)
+* Event authentication (signed events / broker ACLs)
+* Secrets management (KMS/Vault/SealedSecrets)
+
+---
+
+## How this consolidates with your draft research
+
+Your draft blueprint is structurally correct; the consolidation with Temporal is mainly:
+
+* **Trigger Layer** becomes:
+
+  * Temporal **Schedules** for time triggers ([docs.temporal.io][1])
+  * Broker + Event Router for external events
+* **Activity Processor** becomes a **Workflow** (durable, replayable, stateful)
+* **Task Queue & Execution** becomes Temporal **Task Queues** + workers (no separate Celery/RQ needed unless you have reasons)
+* **Durability / retries / history** are native properties of Temporal’s server subsystems ([docs.temporal.io][2])
+
+---
+
+## Recommended “backbone MVP” scope (small but real)
+
+If you want the minimum slice that proves the architecture:
+
+1. Temporal on K8s via Helm (Postgres + UI) ([docs.temporal.io][5])
+2. ActivityRegistry (Postgres table)
+3. `RunActivity` workflow + 2 example activities:
+
+   * one “reads context”
+   * one “creates tasks” (writes TaskInstances into DB)
+4. One Temporal Schedule that triggers every minute/hour ([docs.temporal.io][1])
+5. One external event route (broker → router → start workflow)
+
+That’s enough to call it a **robust event-driven backbone**.
+
+---
+
+If you tell me your preferred broker (Kafka vs NATS vs RabbitMQ) and your worker language (Go/TS/Java/Python), I’ll turn the plan into a concrete component diagram + Kubernetes deployment layout (namespaces, services, HPAs) + the exact “ActivityDefinition” schema I’d use.
+
+[1]: https://docs.temporal.io/evaluate/development-production-features/schedules?utm_source=chatgpt.com "Schedules - Temporal feature"
+[2]: https://docs.temporal.io/temporal-service/temporal-server?utm_source=chatgpt.com "Temporal Server | Temporal Platform Documentation"
+[3]: https://docs.temporal.io/temporal-service?utm_source=chatgpt.com "Temporal Service | Temporal Platform Documentation"
+[4]: https://docs.temporal.io/web-ui?utm_source=chatgpt.com "Temporal Web UI | Temporal Platform Documentation"
+[5]: https://docs.temporal.io/self-hosted-guide/deployment?utm_source=chatgpt.com "Deploying a Temporal Service"
+[6]: https://hub.docker.com/r/temporalio/auto-setup?utm_source=chatgpt.com "temporalio/auto-setup - Docker Image"
+
--- a/wiki/ActivityCorePlan_grok.md
+++ b/wiki/ActivityCorePlan_grok.md
@@ -0,0 +1,203 @@
+# This is a protoplan to be compared with others and then compiled into an actionalble plan.
+
+
+**Comparison of my initial blueprint with your provided research**  
+Your research description (Activity = event-driven “task factory” with trigger + context resolver + task templates) maps **almost 1:1** to the architecture I outlined earlier.  
+
+Key alignments:  
+- **Core objects** (ActivityDefinition, TriggerEvent, TaskInstance) = my Activity Registry + internal TriggerEvent + spawned tasks.  
+- **Services** (Triggering layer → Activity Orchestrator → Task execution layer + Storage) = my Trigger Layer + Activity Processor + Task Queue/Workers + Storage.  
+- **Data flow, non-negotiables** (idempotency via `activity_id + scheduled_for`, misfire policy, versioning, timezone, observability) and the three implementation levels are identical.  
+- **Suggested reference architectures** Option A (Celery Beat etc.) vs Option B (Temporal) directly match my “lightweight” vs “workflow-engine” paths.  
+
+**Temporal self-hosted is the clear winner** for a production-grade backbone. It implements **Level 3 / Option B** natively and “for free”:  
+- Durable workflows = your Activity Orchestrator (state, retries, history survive restarts/crashes).  
+- Temporal **Schedules** = superior time-based triggering (built-in jitter, backfill/misfire handling, pause/resume, overlap policies, timezone support).  
+- Signals + Client.start_workflow = external event triggering with built-in deduplication.  
+- Workflow ID uniqueness + deterministic execution = perfect idempotency and versioning.  
+- Temporal UI + metrics + visibility = observability out of the box.  
+- No need for separate Celery/Quartz/APScheduler/Beat — one system handles everything reliably at scale.  
+
+Your research’s “practical next steps” are exactly what we will execute below, but powered by Temporal instead of lighter schedulers.
+
+### Consolidated Architecture (research objects mapped to Temporal)
+
+| Research Concept          | Temporal Implementation                              | Notes |
+|---------------------------|------------------------------------------------------|-------|
+| ActivityDefinition        | Your app DB (Postgres etc.) + versioned record       | Workflow loads it via Activity (safe side-effect) |
+| TriggerEvent              | Temporal Schedule (time) or Client.start_workflow/signal (events) | Schedule ID = activity_id |
+| Activity Orchestrator     | Durable Workflow (RunActivityWorkflow)               | Queries context, evaluates rules, spawns tasks |
+| TaskInstance              | Child Workflow or Activity call (parametrized)       | Full history, retries, observability |
+| Storage (registry, run log, tasks) | Your app DB + Temporal history (automatic)     | Temporal stores every run/context snapshot |
+| Idempotency / Misfire / Versioning | Workflow ID + Schedule policies + patching | All built-in |
+| Observability             | Temporal Web UI + Prometheus + your run-log table    | Zero extra work |
+
+**High-level flow**  
+External event or Schedule fires → Temporal starts/runs Orchestrator Workflow → Workflow loads ActivityDefinition → Activity queries “current situation” → Workflow spawns 0..N parametrized child workflows/activities → Everything durable + auditable.
+
+### Consolidated Workplan: Set Up the Backbone Service (Self-Hosted Temporal)
+
+**Goal**: A robust, event-driven “task factory” service that you can run today and scale forever without rewriting.
+
+#### Phase 0: Prerequisites (1–2 hours)
+- Docker + Docker Compose (for dev) or Kubernetes (for prod).
+- PostgreSQL (or MySQL/Cassandra) for Temporal persistence + one for your app data (ActivityDefinitions, tasks, run logs).
+- Choose SDK: I’ll use **Python** below (mature, matches your research examples). Other SDKs (Go, TypeScript, Java, .NET) follow the same patterns — tell me if you prefer another.
+
+#### Phase 1: Deploy Self-Hosted Temporal Cluster (30–60 min)
+**Dev / Quickstart** (recommended first):
+```bash
+git clone https://github.com/temporalio/docker-compose.git
+cd docker-compose
+docker compose up -d
+```
+- Temporal gRPC: `127.0.0.1:7233`  
+- Temporal Web UI: `http://127.0.0.1:8080` (inspect workflows, schedules, history)  
+- Default namespace: `default` (create more if needed).
+
+**Production / Kubernetes**:
+Use official Helm charts: https://github.com/temporalio/helm-charts  
+Follow the production checklist (security, visibility with Elasticsearch, monitoring, encryption, backups).
+
+Connect your workers/clients to the cluster. The same code runs unchanged on Temporal Cloud later if you ever migrate.
+
+#### Phase 2: Define Core Domain Model (your app DB)
+Create these tables (Postgres example):
+
+```sql
+CREATE TABLE activity_definitions (
+    id UUID PRIMARY KEY,
+    name TEXT,
+    enabled BOOLEAN DEFAULT true,
+    trigger_type TEXT, -- 'cron' | 'event'
+    trigger_config JSONB, -- cron expr, interval, timezone, jitter, misfire_policy
+    context_sources JSONB,
+    task_templates JSONB[],
+    dedupe_key_strategy TEXT,
+    version INT
+);
+
+CREATE TABLE activity_runs (
+    run_id UUID PRIMARY KEY,
+    activity_id UUID,
+    scheduled_for TIMESTAMPTZ,
+    fired_at TIMESTAMPTZ,
+    context_snapshot JSONB,
+    tasks_spawned INT,
+    version_used INT
+);
+```
+
+(Plus a tasks table if you want human-facing tasks.)
+
+#### Phase 3: Implement the Orchestrator Workflow (the “task factory”)
+This is your Activity Processor — durable by design.
+
+**Python skeleton** (in `workflows.py`):
+
+```python
+from temporalio import workflow
+from temporalio.exceptions import ApplicationError
+import uuid
+from datetime import timedelta
+
+@workflow.defn
+class RunActivityWorkflow:
+    @workflow.run
+    async def run(self, activity_id: str, trigger_event: dict):
+        # Step 1: Load definition (via Activity — safe DB call)
+        activity_def = await workflow.execute_activity(
+            load_activity_definition,
+            activity_id,
+            start_to_close_timeout=timedelta(seconds=10)
+        )
+
+        # Step 2: Resolve current situation (context)
+        context = await workflow.execute_activity(
+            resolve_context,
+            (activity_def["context_sources"], trigger_event),
+            start_to_close_timeout=timedelta(seconds=30)
+        )
+
+        # Step 3: Evaluate rules & instantiate tasks
+        task_instances = evaluate_templates(activity_def["task_templates"], context)  # pure Python
+
+        # Step 4: Spawn parametrized tasks (child workflows or activities)
+        for task in task_instances:
+            await workflow.start_child_workflow(
+                task["type"],  # e.g. "SendEmailWorkflow" or generic TaskExecutor
+                task["params"],
+                id=f"task-{uuid.uuid4()}",
+                task_queue="task-execution-queue"
+            )
+
+        # Step 5: Log run (via Activity)
+        await workflow.execute_activity(
+            log_run,
+            {"activity_id": activity_id, "context": context, "tasks_spawned": len(task_instances)},
+            start_to_close_timeout=timedelta(seconds=5)
+        )
+```
+
+Implement the Activities (`activities.py`) for DB access, context resolution, etc.
+
+#### Phase 4: Set Up Time-Based Triggers (Temporal Schedules)
+Register a Schedule per enabled ActivityDefinition (run once at startup or on definition change).
+
+**Python example** (using Temporal Client):
+
+```python
+await client.create_schedule(
+    schedule_id=f"activity-schedule-{activity_id}",
+    schedule=Schedule(
+        action=ScheduleActionStartWorkflow(
+            RunActivityWorkflow.run,
+            args=[activity_id, {"scheduled_for": datetime.now()}],
+            id=f"wf-{activity_id}-{scheduled_for_iso}",
+            task_queue="orchestrator-queue"
+        ),
+        spec=ScheduleSpec(
+            cron_expressions=[activity_def["trigger_config"]["cron"]],
+            timezone=activity_def["trigger_config"]["timezone"],
+            jitter=timedelta(minutes=1) if needed,
+        ),
+        state=ScheduleState(paused=not activity_def["enabled"])
+    )
+)
+```
+
+- Misfire/backfill: Use `handle.backfill(...)`  
+- Pause/resume: `handle.pause()` / `handle.unpause()`  
+- Manual trigger: `handle.trigger()`
+
+#### Phase 5: Event-Driven Triggers (external events)
+Build a tiny “Event Ingress” service (microservice or even a Temporal Activity):
+- On incoming event → compute unique workflow ID (`f"activity-{activity_id}-{event_key}"`)
+- `client.start_workflow(RunActivityWorkflow.run, ... , id=unique_id)`  
+→ Temporal guarantees exactly-once execution.
+
+#### Phase 6: Task Execution Layer
+- Automatable tasks → Temporal Activities or child workflows (retries, heartbeats, timeouts built-in).
+- Human tasks → Persist to your DB + notify UI (or use Temporal for human-in-the-loop workflows).
+
+#### Phase 7: Observability, Admin & Production Hardening
+- Use Temporal Web UI (search runs by activity_id, view context snapshots, replay history).
+- Add Prometheus metrics + your run-log table.
+- Admin API: CRUD ActivityDefinitions → recreate/update Schedules.
+- Idempotency, versioning, rate limits: all handled by Temporal + your dedupe strategy.
+- Monitoring: Temporal production checklist (visibility, encryption, backups).
+
+### Timeline & Effort Estimate
+- Day 1: Deploy cluster + basic Workflow (Phase 1–3) → you can already trigger manually.  
+- Day 2: Schedules + event ingress (Phase 4–5).  
+- Day 3: Task spawning + observability (Phase 6–7).  
+- Week 2: Production hardening, testing misfires, scaling workers.
+
+This gives you a **single, durable, observable backbone** that replaces every scheduler/queue you were considering and satisfies every non-negotiable in your research.
+
+**Next step for you**: Tell me  
+1. Your preferred language (Python/Go/TypeScript/Java/.NET)?  
+2. Are you on Kubernetes already (Helm) or prefer plain Docker?  
+3. Do you want the full minimal repo structure + DB migration scripts + complete working example (I can provide a GitHub-ready skeleton next)?
+
+I’ll deliver the exact files/commands you can run today. This is the robust event-driven architecture you asked for — built once, reliable forever. Ready when you are!