Phase 0 contracts (event envelope, ActivityDefinition, idempotency doc, naming conventions) and Phase 1 Temporal cluster setup (docker-compose.dev.yml, Temporal dynamic config) are complete. Includes Pydantic models, JSON schemas, wiki architecture docs, and ADR-001 workplan files for both workstreams. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
8.9 KiB
This is a protplan that should be digested and compared to form an actionable workplan.
If you’re leaning toward self-hosted Temporal, your “Activity = event-driven task factory” maps almost perfectly to:
- Temporal Schedules → create a durable, centrally managed trigger that starts a workflow on a cadence (better than “cron workflows”). (docs.temporal.io)
- Workflows → your “activity processor” (evaluate context + decide 0..N tasks)
- Activities → your concrete “tasks” (side effects, API calls, jobs, human-task creation, etc.)
- Signals / events → external event ingestion into running workflows (or into starter workflows)
And Temporal’s server side is explicitly designed for HA scaling across its core subsystems (Frontend/History/Matching/Worker). (docs.temporal.io)
Below is a consolidated, practical workplan to set up a backbone service for a robust event-driven architecture, using Temporal as the orchestration spine.
Consolidated backbone architecture (Temporal-centered EDA)
Backbone components
-
Temporal Service (server)
- Temporal Server (Frontend, History, Matching, Worker services) (docs.temporal.io)
- Persistence store (SQL or Cassandra) + Visibility store (SQL and/or Elasticsearch depending on features) (docs.temporal.io)
-
Temporal Workers (your code)
- “Activity Orchestrator Workflows” (your Activity runtime)
- Activities (task executors / integrators)
-
Event ingress/egress
- Ingress: broker subscriptions → “event router” → Temporal (start workflow / signal workflow)
- Egress: Temporal activities publish domain events to broker
-
Admin + Observability
- Temporal Web UI (ops visibility, schedules page, etc.) (docs.temporal.io)
- Prometheus/Grafana + logs + tracing (OpenTelemetry if you want end-to-end)
Workplan (phased, production-minded)
Phase 0 — Decide the minimum “contract” for your EDA
Deliverable: a stable event & workflow contract so everything stays modular.
-
Event envelope (internal standard):
event_id,type,source,occurred_at,subject,trace_id,schema_version,payload -
Idempotency standard:
- Every inbound event has a stable
event_id - Every scheduled run has stable
(activity_id, scheduled_for)
- Every inbound event has a stable
-
Naming/partitioning conventions:
- Temporal Namespace strategy (e.g.,
prod,stage, or per-tenant) - Task Queues per service boundary (e.g.,
billing-tq,notifications-tq)
- Temporal Namespace strategy (e.g.,
Phase 1 — Stand up Temporal Service on Kubernetes (self-hosted)
Deliverable: a working Temporal cluster with persistence + UI.
-
Provision persistence + visibility dependencies
- Choose PostgreSQL/MySQL (common) or Cassandra, plus optional Elasticsearch for advanced visibility. Temporal self-hosted deployments need you to provide these stores. (docs.temporal.io)
-
Deploy Temporal via official Helm chart
- Temporal maintains official Helm charts for Kubernetes deployments. (docs.temporal.io)
-
Deploy Temporal Web UI
- Enable the UI so you can inspect workflows and schedules. (docs.temporal.io)
-
Production hardening basics
- NetworkPolicies, PodSecurity, resource limits, HPA
- Backups for DB/ES
- Separate node pools if needed for noisy workloads
Note: temporalio/auto-setup is excellent for dev or quick bootstrap (Docker), but for production you typically run server components + managed/provisioned DB/ES explicitly. (Docker Hub)
Phase 2 — Establish the “Activity Orchestrator” as a workflow pattern
Deliverable: one end-to-end ActivityDefinition that spawns tasks robustly.
Implement this canonical workflow:
Workflow: RunActivity(activity_id, trigger)
- Load
ActivityDefinition(versioned) - Resolve context snapshot (query DB/APIs)
- Evaluate rules → decide
TaskInstances[] - Execute tasks as Temporal activities (or create “human tasks” in your DB)
- Emit
TaskCreated/TaskCompletedevents (activities publish to broker) - Record run audit (context hash, produced tasks, version)
Key guardrails
- Idempotency: use deterministic workflow IDs for scheduled runs:
workflow_id = activity_id + ":" + scheduled_for - Exactly-once effect: for side effects, prefer outbox in your DB or make activities idempotent (store
event_id/task_instance_id).
Phase 3 — Replace cron with Temporal Schedules (first-class triggers)
Deliverable: schedules are managed in Temporal, not in random cronjobs.
-
Use Temporal Schedules to start
RunActivity(...)at times/intervals (and manage them centrally). (docs.temporal.io) -
Store your “human editable schedule spec” in your ActivityRegistry, but materialize it into Temporal schedules.
-
Decide “missed run” policy:
- catch up (bounded)
- skip
- compress (run once with widened context)
This is the cleanest alignment with your research draft: “timer ingress → trigger event → processor → spawn tasks”, except Temporal gives you durable state, retries, and execution history by default.
Phase 4 — Add external events: broker → Temporal
Deliverable: event-driven triggers land reliably in Temporal.
-
Introduce an Event Router service:
- Subscribes to Kafka/NATS/Rabbit/etc.
- Validates schema + authN/authZ
- Applies routing rules:
(event.type, filters) -> activity_id(s)
-
For each match, it either:
- Starts
RunActivity(activity_id, trigger_event)(if no long-lived instance) - Or Signals an existing workflow instance (if you have “stateful ongoing activities”)
- Starts
Rule of thumb
- If the “activity” is inherently recurring and stateless per run → start per trigger.
- If the “activity” is an ongoing coordination process (state machine) → signal a long-lived workflow.
Phase 5 — Observability and operability as first-class product
Deliverable: you can run this for months without fear.
-
Temporal UI for inspection (workflows + schedules). (docs.temporal.io)
-
Metrics dashboards:
- schedule trigger rate, workflow start rate
- activity failures, retries, latency
- queue depth by task queue
-
Alerts:
- sustained workflow task failures
- persistence latency spikes
- DLQ / poison message rate (in broker layer)
-
Runbooks:
- how to replay/repair
- how to disable an activity safely
- how to backfill events
Phase 6 — Security hardening (when you’re past MVP)
Deliverable: “robust backbone” in the real sense.
- mTLS between workers and Temporal (and between services)
- Namespace isolation (and per-namespace access controls if applicable)
- Event authentication (signed events / broker ACLs)
- Secrets management (KMS/Vault/SealedSecrets)
How this consolidates with your draft research
Your draft blueprint is structurally correct; the consolidation with Temporal is mainly:
-
Trigger Layer becomes:
- Temporal Schedules for time triggers (docs.temporal.io)
- Broker + Event Router for external events
-
Activity Processor becomes a Workflow (durable, replayable, stateful)
-
Task Queue & Execution becomes Temporal Task Queues + workers (no separate Celery/RQ needed unless you have reasons)
-
Durability / retries / history are native properties of Temporal’s server subsystems (docs.temporal.io)
Recommended “backbone MVP” scope (small but real)
If you want the minimum slice that proves the architecture:
-
Temporal on K8s via Helm (Postgres + UI) (docs.temporal.io)
-
ActivityRegistry (Postgres table)
-
RunActivityworkflow + 2 example activities:- one “reads context”
- one “creates tasks” (writes TaskInstances into DB)
-
One Temporal Schedule that triggers every minute/hour (docs.temporal.io)
-
One external event route (broker → router → start workflow)
That’s enough to call it a robust event-driven backbone.
If you tell me your preferred broker (Kafka vs NATS vs RabbitMQ) and your worker language (Go/TS/Java/Python), I’ll turn the plan into a concrete component diagram + Kubernetes deployment layout (namespaces, services, HPAs) + the exact “ActivityDefinition” schema I’d use.