activity-core/wiki/ActivityCorePlan_chtgpt.md at ea5fbe0bf347a91223a54ce25bdf4c618dda6be0

tegwick 6f9132314f Add project scaffold: contracts, schemas, docker-compose, workplans

Phase 0 contracts (event envelope, ActivityDefinition, idempotency doc,
naming conventions) and Phase 1 Temporal cluster setup (docker-compose.dev.yml,
Temporal dynamic config) are complete. Includes Pydantic models, JSON schemas,
wiki architecture docs, and ADR-001 workplan files for both workstreams.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-04 22:45:40 +01:00


This is a proto-plan, to be digested and compared in order to form an actionable workplan.

If you’re leaning toward self-hosted Temporal, your “Activity = event-driven task factory” maps almost perfectly to:

Temporal Schedules → create a durable, centrally managed trigger that starts a workflow on a cadence (better than “cron workflows”). (docs.temporal.io)
Workflows → your “activity processor” (evaluate context + decide 0..N tasks)
Activities → your concrete “tasks” (side effects, API calls, jobs, human-task creation, etc.)
Signals / events → external event ingestion into running workflows (or into starter workflows)

And Temporal’s server side is explicitly designed for HA scaling across its core subsystems (Frontend/History/Matching/Worker). (docs.temporal.io)

Below is a consolidated, practical workplan to set up a backbone service for a robust event-driven architecture, using Temporal as the orchestration spine.

Consolidated backbone architecture (Temporal-centered EDA)

Backbone components

Temporal Service (server)
- Temporal Server (Frontend, History, Matching, Worker services) (docs.temporal.io)
- Persistence store (SQL or Cassandra) + Visibility store (SQL and/or Elasticsearch depending on features) (docs.temporal.io)
Temporal Workers (your code)
- “Activity Orchestrator Workflows” (your Activity runtime)
- Activities (task executors / integrators)
Event ingress/egress
- Ingress: broker subscriptions → “event router” → Temporal (start workflow / signal workflow)
- Egress: Temporal activities publish domain events to broker
Admin + Observability
- Temporal Web UI (ops visibility, schedules page, etc.) (docs.temporal.io)
- Prometheus/Grafana + logs + tracing (OpenTelemetry if you want end-to-end)

Workplan (phased, production-minded)

Phase 0 — Decide the minimum “contract” for your EDA

Deliverable: a stable event & workflow contract so everything stays modular.

Event envelope (internal standard): event_id, type, source, occurred_at, subject, trace_id, schema_version, payload
Idempotency standard:
- Every inbound event has a stable event_id
- Every scheduled run has stable (activity_id, scheduled_for)
Naming/partitioning conventions:
- Temporal Namespace strategy (e.g., prod, stage, or per-tenant)
- Task Queues per service boundary (e.g., billing-tq, notifications-tq)
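
As a minimal sketch of the Phase 0 contract, the envelope fields listed above can be written down as a plain dataclass. The field names come from the list above; the defaults, the `EventEnvelope` class name, and the `scheduled_run_key` helper are assumptions made for illustration (the repository's own models are Pydantic-based and may differ).

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class EventEnvelope:
    """Internal event envelope; field names follow the Phase 0 list."""
    type: str
    source: str
    subject: str
    payload: dict[str, Any]
    schema_version: str = "1"  # assumed default
    # Producers should normally supply a stable event_id; a random one is
    # generated here only so the sketch is usable on its own.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    occurred_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def scheduled_run_key(activity_id: str, scheduled_for: datetime) -> str:
    """Stable key for a scheduled run: same inputs always give the same key."""
    if scheduled_for.tzinfo is None:
        raise ValueError("scheduled_for must be timezone-aware")
    # Normalise to UTC so the key does not depend on the caller's timezone.
    return f"{activity_id}:{scheduled_for.astimezone(timezone.utc).isoformat()}"
```

Normalising `scheduled_for` to UTC before formatting keeps the key identical across hosts in different timezones, which is what makes it usable for deduplication.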

Phase 1 — Stand up Temporal Service on Kubernetes (self-hosted)

Deliverable: a working Temporal cluster with persistence + UI.

Provision persistence + visibility dependencies
- Choose PostgreSQL/MySQL (common) or Cassandra, plus optional Elasticsearch for advanced visibility. Temporal self-hosted deployments need you to provide these stores. (docs.temporal.io)
Deploy Temporal via official Helm chart
- Temporal maintains official Helm charts for Kubernetes deployments. (docs.temporal.io)
Deploy Temporal Web UI
- Enable the UI so you can inspect workflows and schedules. (docs.temporal.io)
Production hardening basics
- NetworkPolicies, PodSecurity, resource limits, HPA
- Backups for DB/ES
- Separate node pools if needed for noisy workloads

Note: temporalio/auto-setup is excellent for dev or quick bootstrap (Docker), but for production you typically run server components + managed/provisioned DB/ES explicitly. (Docker Hub)

Phase 2 — Establish the “Activity Orchestrator” as a workflow pattern

Deliverable: one end-to-end ActivityDefinition that spawns tasks robustly.

Implement this canonical workflow:

Workflow: RunActivity(activity_id, trigger)

Load ActivityDefinition (versioned)
Resolve context snapshot (query DB/APIs)
Evaluate rules → decide TaskInstances[]
Execute tasks as Temporal activities (or create “human tasks” in your DB)
Emit TaskCreated / TaskCompleted events (activities publish to broker)
Record run audit (context hash, produced tasks, version)
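
The steps above can be sketched in plain Python to make the data flow concrete. This is not Temporal SDK code: in a real deployment the loading, context resolution, task execution, and event publishing would be Temporal activities called from a deterministic workflow. The `Rule` shape, the `RunAudit` fields, and the callback parameters are assumptions for illustration.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Callable

Context = dict[str, Any]
Rule = Callable[[Context], list[str]]  # hypothetical: a rule yields 0..N task names


@dataclass(frozen=True)
class ActivityDefinition:
    activity_id: str
    version: int
    rules: tuple[Rule, ...]


@dataclass(frozen=True)
class RunAudit:
    activity_id: str
    version: int
    context_hash: str
    produced_tasks: tuple[str, ...]


def run_activity(
    definition: ActivityDefinition,          # step 1: loaded, versioned definition
    context: Context,                        # step 2: resolved context snapshot
    execute_task: Callable[[str], None],     # step 4: stand-in for a Temporal activity
    emit_event: Callable[[str, str], None],  # step 5: stand-in for a broker publish
) -> RunAudit:
    # Step 3: evaluate every rule against the snapshot and collect the tasks.
    tasks = [task for rule in definition.rules for task in rule(context)]
    for task in tasks:
        emit_event("TaskCreated", task)
        execute_task(task)
        emit_event("TaskCompleted", task)
    # Step 6: hash the context canonically so identical snapshots compare equal.
    digest = hashlib.sha256(
        json.dumps(context, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return RunAudit(definition.activity_id, definition.version, digest, tuple(tasks))
```

Passing the side-effecting steps in as callables mirrors the workflow/activity split: the orchestration logic stays free of I/O, and only the injected functions touch the outside world.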

Key guardrails

Idempotency: use deterministic workflow IDs for scheduled runs: workflow_id = activity_id + ":" + scheduled_for
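Effectively-once side effects: Temporal may retry an activity, so activity execution is at-least-once. For side effects, prefer a transactional outbox in your DB or make activities idempotent (deduplicate on event_id / task_instance_id).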
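
One way to picture the outbox-plus-deduplication guardrail is the sketch below, which uses an in-memory SQLite database purely for illustration. The table names (`processed`, `outbox`) and the `complete_task_once` helper are hypothetical; the point is that the deduplication record and the outgoing event are written in a single transaction, so a retried activity becomes a no-op.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    """Create an in-memory store with a dedup table and an outbox table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed (task_instance_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "event_type TEXT NOT NULL, "
        "task_instance_id TEXT NOT NULL)"
    )
    return conn


def complete_task_once(conn: sqlite3.Connection, task_instance_id: str) -> bool:
    """Record completion and enqueue TaskCompleted; return False on a retry."""
    try:
        with conn:  # one transaction: commit on success, roll back on error
            conn.execute(
                "INSERT INTO processed (task_instance_id) VALUES (?)",
                (task_instance_id,),
            )
            conn.execute(
                "INSERT INTO outbox (event_type, task_instance_id) VALUES (?, ?)",
                ("TaskCompleted", task_instance_id),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key already exists: this task was handled before.
        return False
```

A separate relay process would then read the outbox table and publish to the broker; that part is omitted here.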

Phase 3 — Replace cron with Temporal Schedules (first-class triggers)

Deliverable: schedules are managed in Temporal, not in random cronjobs.

Use Temporal Schedules to start RunActivity(...) at times/intervals (and manage them centrally). (docs.temporal.io)
Store your “human editable schedule spec” in your ActivityRegistry, but materialize it into Temporal schedules.
Decide “missed run” policy:
- catch up (bounded)
- skip
- compress (run once with widened context)

This is the cleanest alignment with your research draft: “timer ingress → trigger event → processor → spawn tasks”, except Temporal gives you durable state, retries, and execution history by default.
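
The three "missed run" policies listed above can be made concrete with a small planning function. This is an application-level sketch under assumed semantics, independent of whatever catch-up settings Temporal Schedules themselves provide: each run covers a context window `(window_start, scheduled_for)`, and the `MissedRunPolicy` enum, `plan_runs` name, and `max_catch_up` parameter are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class MissedRunPolicy(Enum):
    CATCH_UP = "catch_up"  # replay missed runs, bounded by max_catch_up
    SKIP = "skip"          # drop missed runs, resume with the next regular one
    COMPRESS = "compress"  # run once with a context window covering all misses


def plan_runs(
    missed: list[datetime],
    interval: timedelta,
    policy: MissedRunPolicy,
    max_catch_up: int = 3,
) -> list[tuple[datetime, datetime]]:
    """Return the (window_start, window_end) pairs to execute for missed runs."""
    if not missed or policy is MissedRunPolicy.SKIP:
        return []
    ordered = sorted(missed)
    if policy is MissedRunPolicy.CATCH_UP:
        # Keep only the most recent runs; guard against a non-positive bound.
        recent = ordered[-max_catch_up:] if max_catch_up > 0 else []
        return [(t - interval, t) for t in recent]
    # COMPRESS: one run whose window spans the first to the last missed slot.
    return [(ordered[0] - interval, ordered[-1])]
```

For five missed hourly runs, `CATCH_UP` with the default bound replays the last three, `SKIP` replays none, and `COMPRESS` produces a single run with a five-hour window.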

Phase 4 — Add external events: broker → Temporal

Deliverable: event-driven triggers land reliably in Temporal.

Introduce an Event Router service:
- Subscribes to Kafka/NATS/Rabbit/etc.
- Validates schema + authN/authZ
- Applies routing rules: (event.type, filters) -> activity_id(s)
For each match, it either:
- Starts RunActivity(activity_id, trigger_event) (if no long-lived instance)
- Or Signals an existing workflow instance (if you have “stateful ongoing activities”)

Rule of thumb

If the “activity” is inherently recurring and stateless per run → start per trigger.
If the “activity” is an ongoing coordination process (state machine) → signal a long-lived workflow.
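
The routing rule `(event.type, filters) -> activity_id(s)` and the start-versus-signal rule of thumb can be sketched as a pure function. The `Route` shape, the equality-only filter semantics, and the `long_lived` flag are assumptions for illustration; the returned actions are only labels, and the actual Temporal client calls are left out.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Route:
    event_type: str
    activity_id: str
    # Hypothetical filter semantics: payload key must equal the given value.
    filters: dict[str, Any] = field(default_factory=dict)
    # True: signal an ongoing workflow; False: start a workflow per trigger.
    long_lived: bool = False


def route_event(event: dict[str, Any], routes: list[Route]) -> list[tuple[str, str]]:
    """Return (action, activity_id) pairs; action is 'start' or 'signal'."""
    payload = event.get("payload", {})
    decisions: list[tuple[str, str]] = []
    for route in routes:
        if route.event_type != event.get("type"):
            continue
        if any(payload.get(key) != value for key, value in route.filters.items()):
            continue
        action = "signal" if route.long_lived else "start"
        decisions.append((action, route.activity_id))
    return decisions
```

Keeping the routing decision separate from the client call makes the rules easy to unit-test and to load from configuration later.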

Phase 5 — Observability and operability as first-class product

Deliverable: you can run this for months without fear.

Temporal UI for inspection (workflows + schedules). (docs.temporal.io)
Metrics dashboards:
- schedule trigger rate, workflow start rate
- activity failures, retries, latency
- queue depth by task queue
Alerts:
- sustained workflow task failures
- persistence latency spikes
- DLQ / poison message rate (in broker layer)
Runbooks:
- how to replay/repair
- how to disable an activity safely
- how to backfill events

Phase 6 — Security hardening (when you’re past MVP)

Deliverable: “robust backbone” in the real sense.

mTLS between workers and Temporal (and between services)
Namespace isolation (and per-namespace access controls if applicable)
Event authentication (signed events / broker ACLs)
Secrets management (KMS/Vault/SealedSecrets)

How this consolidates with your draft research

Your draft blueprint is structurally correct; the consolidation with Temporal is mainly:

Trigger Layer becomes:
- Temporal Schedules for time triggers (docs.temporal.io)
- Broker + Event Router for external events
Activity Processor becomes a Workflow (durable, replayable, stateful)
Task Queue & Execution becomes Temporal Task Queues + workers (no separate Celery/RQ needed unless you have reasons)
Durability / retries / history are native properties of Temporal’s server subsystems (docs.temporal.io)

Recommended “backbone MVP” scope (small but real)

If you want the minimum slice that proves the architecture:

Temporal on K8s via Helm (Postgres + UI) (docs.temporal.io)
ActivityRegistry (Postgres table)
RunActivity workflow + 2 example activities:
- one “reads context”
- one “creates tasks” (writes TaskInstances into DB)
One Temporal Schedule that triggers every minute/hour (docs.temporal.io)
One external event route (broker → router → start workflow)

That’s enough to call it a robust event-driven backbone.

If you tell me your preferred broker (Kafka vs NATS vs RabbitMQ) and your worker language (Go/TS/Java/Python), I’ll turn the plan into a concrete component diagram + Kubernetes deployment layout (namespaces, services, HPAs) + the exact “ActivityDefinition” schema I’d use.
