diff --git a/.gitignore b/.gitignore index 36b13f1..1bc74a0 100644 --- a/.gitignore +++ b/.gitignore @@ -171,6 +171,9 @@ cython_debug/ # Ruff stuff: .ruff_cache/ +# email-connect local runtime artifacts +.email-connect/ +reports/*.csv + # PyPI configuration file .pypirc - diff --git a/README.md b/README.md index fcd7b8f..73c050a 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,34 @@ -# repo-seed +# email-connect -A git repository template to bootstrap coulomb projects from. \ No newline at end of file +Headless, provider-neutral email communication and evidence service. + +The first implementation slice is the Mailbox Evidence Scanner MVP: scan a +return mailbox or fixture directory, classify inbound email-channel evidence, +store scan state locally, and generate timestamped CSV reports without +overclaiming delivery, awareness, or coordination success. + +## Quickstart + +```bash +PYTHONPATH=src python3 -m unittest discover -s tests +PYTHONPATH=src python3 -m email_connect.cli adapter-descriptor +PYTHONPATH=src python3 -m email_connect.cli scan-mailbox --config config/mailbox.example.yml --out reports/ +``` + +The example config uses `tests/fixtures/mailbox` as a mailbox source. Runtime +state is written to `.email-connect/state.sqlite`; generated CSV reports are +written to `reports/`. + +## Current Scope + +- Coordination-engine spec review and references. +- Initial adapter descriptor, capability profile, evidence ceiling, and + limitations. +- Conservative mailbox message parser and evidence mapper. +- SQLite state store with message and evidence deduplication. +- CSV report generation. +- Golden fixture tests for hard bounce, soft bounce, out-of-office, and human + reply signals. + +IMAP access, provider webhooks, outbound sending, suppression workflows, and a +UI are planned work, not complete in this first slice. diff --git a/config/mailbox.example.yml b/config/mailbox.example.yml new file mode 100644 index 0000000..5841f8f --- /dev/null +++ b/config/mailbox.example.yml @@ -0,0 +1,31 @@ +mailbox: + id: return-mailbox-default + protocol: fixture + host: null + port: 993 + tls: true + username_env: EMAIL_CONNECT_IMAP_USER + password_env: EMAIL_CONNECT_IMAP_PASSWORD + folder: INBOX + +source: + fixture_dir: tests/fixtures/mailbox + +scan: + mode: incremental + max_messages_per_run: 5000 + since: null + include_seen: true + mark_seen: false + store_raw_headers: true + store_raw_body: false + store_raw_message_ref: true + +storage: + path: .email-connect/state.sqlite + +reports: + output_dir: reports + include_all_evidence: true + include_unknown_messages: true + timestamp_timezone: UTC diff --git a/docs/coordination-engine-email-spec-review.md b/docs/coordination-engine-email-spec-review.md new file mode 100644 index 0000000..f3ef068 --- /dev/null +++ b/docs/coordination-engine-email-spec-review.md @@ -0,0 +1,46 @@ +# coordination-engine Email Spec Review + +## Source Specs + +Reviewed local coordination-engine specifications: + +- `/home/worsch/coordination-engine/spec/EmailAdapterSpecification.md` +- `/home/worsch/coordination-engine/spec/AdapterInterfaceSpecification.md` +- `/home/worsch/coordination-engine/spec/RuntimeArchitectureAndAdapterSubsystem.md` +- `/home/worsch/coordination-engine/spec/ProductRequirementsDocument.md` + +These are referenced, not copied, so `email-connect` stays aligned with the +authoritative coordination-engine checkout. + +## Contract Points For email-connect + +- `email-connect` is a `notification`, `communication`, and `interaction` + adapter. +- It reports email-channel evidence and advisory assessment. It does not own + coordination case result evaluation. +- Ambiguous provider or mailbox signals must map to the weakest safe normalized + event. +- Provider and mailbox events must preserve raw references and native status + mappings. +- The adapter must expose or enable production of normalized `EvidenceEvent` + records. +- The adapter should expose endpoint quality diagnostics, but endpoint quality + is not coordination success. +- Email's positive evidence ceiling is limited: email tracking cannot prove + human awareness, identity, authority, payload access, or non-repudiation. +- Golden tests must include overclaim prevention, especially "provider + delivered" not becoming awareness or payload delivery. + +## First Implementation Implications + +- The mailbox scanner emits `EmailEvidenceCandidate` rows that are shaped to + become coordination-engine `EvidenceEvent` records later. +- Classifiers preserve `unknown_return_message` and `parse_failed` instead of + silently discarding uncertainty. +- Out-of-office replies update diagnostics only; they do not prove awareness or + reachability. +- Human replies are email-channel success signals, but not legal acceptance or + coordination result satisfaction. +- CSV reports include event type, assessment category/subclass, confidence, + evidence strength, observed time, occurred time, raw reference, and + deduplication key for auditability. diff --git a/docs/email-evidence-canon.md b/docs/email-evidence-canon.md new file mode 100644 index 0000000..fda8b98 --- /dev/null +++ b/docs/email-evidence-canon.md @@ -0,0 +1,48 @@ +# Email Evidence Canon + +## Rule + +Email events are evidence, not result satisfaction. + +The scanner reports email-channel facts and uncertainty. A downstream +coordination runtime decides whether those facts satisfy a coordination case. + +## Initial Message Classes + +- `hard_bounce` +- `soft_bounce` +- `delayed_delivery_notice` +- `final_delivery_failure` +- `out_of_office` +- `human_reply` +- `complaint_or_abuse` +- `unsubscribe_or_opt_out` +- `challenge_response` +- `unknown_return_message` +- `unrelated_message` +- `parse_failed` + +## Initial Normalized Events + +| Message class | Normalized event | Assessment | +| --- | --- | --- | +| `hard_bounce` | `notification.endpoint.rejected_permanent` | `fail.hard_bounce` | +| `soft_bounce` | `notification.endpoint.rejected_temporary` | `undef.deferred` | +| `delayed_delivery_notice` | `notification.endpoint.deferred` | `undef.deferred` | +| `final_delivery_failure` | `notification.endpoint.rejected_permanent` | `fail.expired_without_delivery` | +| `out_of_office` | `interaction.out_of_office_received` | `undef.out_of_office` | +| `human_reply` | `interaction.reply_received` | `success.reply_received` | +| `complaint_or_abuse` | `notification.channel.complaint_received` | `fail.complaint_received` | +| `unsubscribe_or_opt_out` | `notification.channel.unsubscribe_received` | `fail.unsubscribed` | +| `unknown_return_message` | `notification.endpoint.unknown` | `undef.conflicting_evidence` | +| `challenge_response` | `interaction.unverified_actor_interaction` | `undef.identity_uncertain` | + +## Overclaim Prevention + +- No bounce found does not mean delivery success. +- Provider acceptance does not mean endpoint acceptance. +- Endpoint acceptance does not mean inbox placement. +- Out-of-office does not prove recipient awareness or action. +- Human reply does not prove legal acceptance. +- Unknown return messages remain visible. +- Scanner and proxy interactions must stay below identity-bound interaction. diff --git a/docs/initial-runtime-architecture.md b/docs/initial-runtime-architecture.md new file mode 100644 index 0000000..a7fccb9 --- /dev/null +++ b/docs/initial-runtime-architecture.md @@ -0,0 +1,98 @@ +# Initial Runtime Architecture + +## Status + +This is the first implementation architecture for the mailbox evidence scanner +slice. It is intentionally small and stdlib-only so the repo can run before a +larger service stack is chosen. + +## Service Boundary + +The first slice is a CLI scanner: + +```text +email-connect scan-mailbox --config config/mailbox.example.yml --out reports/ +``` + +It scans an inbound return mailbox source, classifies messages, stores scan +state in SQLite, and writes timestamped CSV evidence reports. + +The initial source implementation supports fixture directories. The IMAP +connector remains the next mailbox boundary to complete under +`EMAIL-WP-0002-T02`. + +## Package Layout + +```text +src/email_connect/ + adapter_contract.py # coordination-engine descriptor and evidence ceiling + cli.py # command line entry points + config.py # config loading + evidence.py # native class to normalized evidence mapping + models.py # mailbox, parse, evidence, endpoint quality dataclasses + parser.py # MIME/header parsing and conservative classification + reporting.py # CSV report generation + scanner.py # scan orchestration + storage.py # SQLite state store +``` + +## Persistence + +SQLite is the MVP store. The initial schema includes: + +- `mailbox_scans` +- `mailbox_messages` +- `parsed_messages` +- `evidence_candidates` + +Message deduplication is keyed by mailbox ID, IMAP UID when present, message ID, +received timestamp, sender, subject hash, and body hash. Evidence +deduplication follows the workplan fields: message, parser version, normalized +event, affected recipient, original message, SMTP/enhanced status, and reason. + +## Evidence Mapping + +Parser output is represented as `ParsedMailboxMessage`. The mapper converts it +to `EmailEvidenceCandidate` using coordination-engine event names and advisory +assessment classes. + +Examples: + +- `hard_bounce` -> `notification.endpoint.rejected_permanent` +- `soft_bounce` -> `notification.endpoint.rejected_temporary` +- `out_of_office` -> `interaction.out_of_office_received` +- `human_reply` -> `interaction.reply_received` + +The mapper does not emit evidence for unrelated messages. Unknown return +messages stay visible as `notification.endpoint.unknown`. + +## coordination-engine Alignment + +The implementation keeps these coordination-engine concepts explicit: + +- adapter descriptor +- adapter capability profile +- evidence ceiling +- advisory assessment +- endpoint quality update shape +- event observation and raw reference preservation +- golden tests for overclaim prevention + +Email evidence remains below the coordination result layer. The scanner does +not infer inbox placement, human awareness, legal acceptance, payload access, or +case success. + +## Provider Boundary + +Provider webhook ingestion and outbound send APIs are deliberately outside this +slice. The mailbox scanner uses the same evidence model so future provider +events can enter through a parallel ingestion path and converge at the same +normalization layer. + +## Development Commands + +```bash +PYTHONPATH=src python3 -m unittest discover -s tests +PYTHONPATH=src python3 -m email_connect.cli adapter-descriptor +PYTHONPATH=src python3 -m email_connect.cli scan-mailbox --config config/mailbox.example.yml --out reports/ +``` diff --git a/pyproject.toml b/pyproject.toml new file mode 100644 index 0000000..26f68df --- /dev/null +++ b/pyproject.toml @@ -0,0 +1,19 @@ +[build-system] +requires = ["setuptools>=68"] +build-backend = "setuptools.build_meta" + +[project] +name = "email-connect" +version = "0.1.0" +description = "Provider-neutral email communication and evidence service." +readme = "README.md" +requires-python = ">=3.11" +license = { file = "LICENSE" } +authors = [{ name = "Coulomb" }] +dependencies = [] + +[project.scripts] +email-connect = "email_connect.cli:main" + +[tool.setuptools.packages.find] +where = ["src"] diff --git a/reports/.gitkeep b/reports/.gitkeep new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/reports/.gitkeep @@ -0,0 +1 @@ + diff --git a/src/email_connect/__init__.py b/src/email_connect/__init__.py new file mode 100644 index 0000000..2672ef3 --- /dev/null +++ b/src/email_connect/__init__.py @@ -0,0 +1,3 @@ +"""email-connect package.""" + +__version__ = "0.1.0" diff --git a/src/email_connect/adapter_contract.py b/src/email_connect/adapter_contract.py new file mode 100644 index 0000000..f2ba3ed --- /dev/null +++ b/src/email_connect/adapter_contract.py @@ -0,0 +1,96 @@ +from __future__ import annotations + +ADAPTER_ID = "email-connect" +ADAPTER_CONTRACT_VERSION = "1.1" + +EMITTED_EVENT_TYPES = [ + "notification.endpoint.rejected_permanent", + "notification.endpoint.rejected_temporary", + "notification.endpoint.deferred", + "notification.channel.complaint_received", + "notification.channel.unsubscribe_received", + "interaction.reply_received", + "interaction.out_of_office_received", + "notification.endpoint.unknown", +] + + +def adapter_descriptor() -> dict: + """Return the initial coordination-engine adapter descriptor.""" + + return { + "adapter_id": ADAPTER_ID, + "adapter_name": "email-connect", + "adapter_version": "0.1.0", + "adapter_contract_version": ADAPTER_CONTRACT_VERSION, + "adapter_types": ["notification", "communication", "interaction"], + "deployment_mode": "external", + "capability_profile": { + "can_notify": True, + "can_publish": False, + "can_deliver_payload": False, + "can_collect_payload": False, + "can_grant_access": False, + "can_revoke_access": False, + "can_observe_interaction": True, + "can_observe_identity": False, + "can_observe_authority": False, + "can_emit_delivery_receipts": False, + "can_emit_display_or_read_receipts": False, + "can_emit_return_or_failure_events": True, + "can_emit_late_events": True, + "can_cancel_after_dispatch": False, + "supports_webhooks": False, + "supports_polling": True, + "supports_idempotency": "adapter_managed", + "supports_endpoint_quality": True, + "supports_native_status_mapping": True, + "supports_golden_tests": True, + "limitations": [ + "Mailbox evidence cannot prove inbox placement.", + "Reply, open, and click signals do not prove coordination result satisfaction.", + ], + }, + "supported_actions": [], + "emitted_event_types": EMITTED_EVENT_TYPES, + "supported_channels": ["email"], + "supported_endpoint_types": ["email_address"], + "evidence_profile": { + "native_status_mapping_ref": "src/email_connect/evidence.py", + "golden_tests_ref": "tests/fixtures", + }, + "evidence_ceiling": { + "max_positive_event": "interaction.reply_received", + "max_positive_strength": "medium", + "can_prove_human_awareness": False, + "can_prove_payload_access": False, + "can_prove_payload_delivery": False, + "can_prove_identity": False, + "can_prove_authority": False, + "can_prove_non_repudiation": False, + "limitations": [ + "Email cannot reliably prove intended-recipient awareness.", + "Mailbox scanning observes return-channel evidence only.", + ], + }, + "assurance_capability": { + "awareness_assurance": "weak", + "delivery_assurance": "none", + "identity_assurance": "none", + "authority_assurance": "none", + "non_repudiation_assurance": "none", + }, + "identity_profile": { + "can_identify_actor": False, + "identity_limitations": ["Inbound mailbox evidence is not authenticated actor evidence."], + }, + "late_event_policy": { + "accepts_late_events": True, + "late_event_notes": ["Bounces and replies may arrive after earlier acceptance evidence."], + }, + "limitations": [ + "No outbound sending in the first mailbox-scanner MVP.", + "No provider webhook integration in the first mailbox-scanner MVP.", + "No legal delivery assessment.", + ], + } diff --git a/src/email_connect/cli.py b/src/email_connect/cli.py new file mode 100644 index 0000000..bc8c9ac --- /dev/null +++ b/src/email_connect/cli.py @@ -0,0 +1,55 @@ +from __future__ import annotations + +import argparse +import json +from pathlib import Path + +from .adapter_contract import adapter_descriptor +from .config import load_config +from .scanner import scan_mailbox + + +def main(argv: list[str] | None = None) -> int: + parser = argparse.ArgumentParser(prog="email-connect") + sub = parser.add_subparsers(dest="command", required=True) + + scan = sub.add_parser("scan-mailbox", help="Scan a return mailbox or fixture directory.") + scan.add_argument("--config", required=True) + scan.add_argument("--out", default=None) + scan.add_argument("--full-rescan", action="store_true") + scan.add_argument("--since", default=None) + scan.add_argument("--report-only-new", action="store_true") + scan.add_argument("--dry-run", action="store_true") + scan.add_argument("--fixture-dir", default=None) + + sub.add_parser("adapter-descriptor", help="Print the initial coordination-engine adapter descriptor.") + + args = parser.parse_args(argv) + if args.command == "adapter-descriptor": + print(json.dumps(adapter_descriptor(), indent=2, sort_keys=True)) + return 0 + + if args.command == "scan-mailbox": + config = load_config(args.config) + result = scan_mailbox( + config, + output_dir=args.out, + full_rescan=args.full_rescan, + report_only_new=args.report_only_new, + dry_run=args.dry_run, + fixture_dir=args.fixture_dir, + ) + print(f"scan_id={result.scan.scan_id}") + print(f"messages_seen={result.scan.messages_seen}") + print(f"messages_new={result.scan.messages_new}") + print(f"messages_parsed={result.scan.messages_parsed}") + print(f"evidence_events_created={result.scan.evidence_events_created}") + if result.report_path: + print(f"report_path={Path(result.report_path)}") + return 0 + + return 2 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/src/email_connect/config.py b/src/email_connect/config.py new file mode 100644 index 0000000..da4716a --- /dev/null +++ b/src/email_connect/config.py @@ -0,0 +1,140 @@ +from __future__ import annotations + +from dataclasses import dataclass +from pathlib import Path +from typing import Any + + +@dataclass(frozen=True) +class MailboxConfig: + id: str + protocol: str = "imap" + host: str | None = None + port: int = 993 + tls: bool = True + username_env: str | None = None + password_env: str | None = None + folder: str = "INBOX" + + +@dataclass(frozen=True) +class ScanConfig: + mode: str = "incremental" + max_messages_per_run: int = 5000 + since: str | None = None + include_seen: bool = True + mark_seen: bool = False + store_raw_headers: bool = True + store_raw_body: bool = False + store_raw_message_ref: bool = True + + +@dataclass(frozen=True) +class StorageConfig: + path: str = ".email-connect/state.sqlite" + + +@dataclass(frozen=True) +class ReportsConfig: + output_dir: str = "reports" + include_all_evidence: bool = True + include_unknown_messages: bool = True + timestamp_timezone: str = "UTC" + + +@dataclass(frozen=True) +class SourceConfig: + fixture_dir: str | None = None + + +@dataclass(frozen=True) +class AppConfig: + mailbox: MailboxConfig + scan: ScanConfig + storage: StorageConfig + reports: ReportsConfig + source: SourceConfig = SourceConfig() + + +def load_config(path: str | Path) -> AppConfig: + data = _load_mapping(Path(path)) + mailbox = data.get("mailbox", {}) + scan = data.get("scan", {}) + storage = data.get("storage", {}) + reports = data.get("reports", {}) + source = data.get("source", {}) + return AppConfig( + mailbox=MailboxConfig( + id=str(mailbox.get("id", "return-mailbox-default")), + protocol=str(mailbox.get("protocol", "imap")), + host=mailbox.get("host"), + port=int(mailbox.get("port", 993)), + tls=bool(mailbox.get("tls", True)), + username_env=mailbox.get("username_env"), + password_env=mailbox.get("password_env"), + folder=str(mailbox.get("folder", "INBOX")), + ), + scan=ScanConfig( + mode=str(scan.get("mode", "incremental")), + max_messages_per_run=int(scan.get("max_messages_per_run", 5000)), + since=scan.get("since"), + include_seen=bool(scan.get("include_seen", True)), + mark_seen=bool(scan.get("mark_seen", False)), + store_raw_headers=bool(scan.get("store_raw_headers", True)), + store_raw_body=bool(scan.get("store_raw_body", False)), + store_raw_message_ref=bool(scan.get("store_raw_message_ref", True)), + ), + storage=StorageConfig(path=str(storage.get("path", ".email-connect/state.sqlite"))), + reports=ReportsConfig( + output_dir=str(reports.get("output_dir", "reports")), + include_all_evidence=bool(reports.get("include_all_evidence", True)), + include_unknown_messages=bool(reports.get("include_unknown_messages", True)), + timestamp_timezone=str(reports.get("timestamp_timezone", "UTC")), + ), + source=SourceConfig(fixture_dir=source.get("fixture_dir")), + ) + + +def _load_mapping(path: Path) -> dict[str, Any]: + text = path.read_text(encoding="utf-8") + try: + import yaml # type: ignore + + loaded = yaml.safe_load(text) + return loaded or {} + except ModuleNotFoundError: + return _parse_simple_yaml(text) + + +def _parse_simple_yaml(text: str) -> dict[str, Any]: + """Parse the small YAML subset used by config/mailbox.example.yml.""" + + result: dict[str, Any] = {} + current: dict[str, Any] | None = None + for raw_line in text.splitlines(): + line = raw_line.split("#", 1)[0].rstrip() + if not line.strip(): + continue + if not line.startswith(" ") and line.endswith(":"): + key = line[:-1].strip() + current = {} + result[key] = current + continue + if current is None or ":" not in line: + raise ValueError(f"Unsupported YAML line: {raw_line}") + key, value = line.strip().split(":", 1) + current[key.strip()] = _parse_scalar(value.strip()) + return result + + +def _parse_scalar(value: str) -> Any: + if value == "" or value == "null": + return None + if value in {"true", "false"}: + return value == "true" + if value.startswith('"') and value.endswith('"'): + return value[1:-1] + try: + return int(value) + except ValueError: + return value diff --git a/src/email_connect/evidence.py b/src/email_connect/evidence.py new file mode 100644 index 0000000..3dbf2bd --- /dev/null +++ b/src/email_connect/evidence.py @@ -0,0 +1,151 @@ +from __future__ import annotations + +from dataclasses import dataclass +from datetime import datetime +from uuid import uuid5, NAMESPACE_URL + +from .models import ( + AssessmentCategory, + Confidence, + EmailEvidenceCandidate, + EvidenceStrength, + MessageClass, + ParsedMailboxMessage, +) + + +@dataclass(frozen=True) +class EvidenceMapping: + event_type: str + assessment_category: AssessmentCategory + assessment_subclass: str + evidence_strength: EvidenceStrength + normalized_meaning: str + + +EVIDENCE_MAPPINGS: dict[MessageClass, EvidenceMapping | None] = { + MessageClass.HARD_BOUNCE: EvidenceMapping( + "notification.endpoint.rejected_permanent", + AssessmentCategory.FAIL, + "fail.hard_bounce", + EvidenceStrength.NEGATIVE, + "Strong evidence that the recipient endpoint rejected the message permanently.", + ), + MessageClass.SOFT_BOUNCE: EvidenceMapping( + "notification.endpoint.rejected_temporary", + AssessmentCategory.UNDEF, + "undef.deferred", + EvidenceStrength.AMBIGUOUS, + "Temporary endpoint failure; retry or later evidence may change interpretation.", + ), + MessageClass.DELAYED_DELIVERY_NOTICE: EvidenceMapping( + "notification.endpoint.deferred", + AssessmentCategory.UNDEF, + "undef.deferred", + EvidenceStrength.AMBIGUOUS, + "Delivery is delayed and still uncertain.", + ), + MessageClass.FINAL_DELIVERY_FAILURE: EvidenceMapping( + "notification.endpoint.rejected_permanent", + AssessmentCategory.FAIL, + "fail.expired_without_delivery", + EvidenceStrength.NEGATIVE, + "Provider or endpoint gave up after retrying.", + ), + MessageClass.OUT_OF_OFFICE: EvidenceMapping( + "interaction.out_of_office_received", + AssessmentCategory.UNDEF, + "undef.out_of_office", + EvidenceStrength.WEAK, + "Mailbox auto-reply was observed; awareness and action remain uncertain.", + ), + MessageClass.HUMAN_REPLY: EvidenceMapping( + "interaction.reply_received", + AssessmentCategory.SUCCESS, + "success.reply_received", + EvidenceStrength.MEDIUM, + "A reply-like message was observed on the email channel.", + ), + MessageClass.COMPLAINT_OR_ABUSE: EvidenceMapping( + "notification.channel.complaint_received", + AssessmentCategory.FAIL, + "fail.complaint_received", + EvidenceStrength.NEGATIVE, + "Complaint or abuse feedback was observed.", + ), + MessageClass.UNSUBSCRIBE_OR_OPT_OUT: EvidenceMapping( + "notification.channel.unsubscribe_received", + AssessmentCategory.FAIL, + "fail.unsubscribed", + EvidenceStrength.NEGATIVE, + "Opt-out or unsubscribe signal was observed.", + ), + MessageClass.UNKNOWN_RETURN_MESSAGE: EvidenceMapping( + "notification.endpoint.unknown", + AssessmentCategory.UNDEF, + "undef.conflicting_evidence", + EvidenceStrength.AMBIGUOUS, + "Return-channel message could not be classified reliably.", + ), + MessageClass.CHALLENGE_RESPONSE: EvidenceMapping( + "interaction.unverified_actor_interaction", + AssessmentCategory.UNDEF, + "undef.identity_uncertain", + EvidenceStrength.AMBIGUOUS, + "Challenge-response or automated interaction was observed.", + ), + MessageClass.UNRELATED_MESSAGE: None, + MessageClass.PARSE_FAILED: None, +} + + +def evidence_deduplication_key(parsed: ParsedMailboxMessage, mapping: EvidenceMapping) -> str: + parts = [ + parsed.mailbox_message_id, + parsed.parser_version, + mapping.event_type, + parsed.affected_email_address or "", + parsed.original_message_id or "", + parsed.smtp_status_code or "", + parsed.enhanced_status_code or "", + parsed.reason_code or "", + ] + return "|".join(parts) + + +def candidate_from_parsed( + parsed: ParsedMailboxMessage, + *, + raw_message_ref: str | None, + observed_at: datetime, + occurred_at: datetime | None, +) -> EmailEvidenceCandidate | None: + mapping = EVIDENCE_MAPPINGS.get(parsed.message_class) + if mapping is None: + return None + + dedup_key = evidence_deduplication_key(parsed, mapping) + candidate_id = str(uuid5(NAMESPACE_URL, "email-connect:evidence:" + dedup_key)) + return EmailEvidenceCandidate( + evidence_candidate_id=candidate_id, + mailbox_message_id=parsed.mailbox_message_id, + parsed_message_id=parsed.parsed_message_id, + event_type=mapping.event_type, + assessment_category=mapping.assessment_category, + assessment_subclass=mapping.assessment_subclass, + affected_email_address=parsed.affected_email_address, + original_message_id=parsed.original_message_id, + confidence=parsed.confidence, + evidence_strength=mapping.evidence_strength, + occurred_at=occurred_at, + observed_at=observed_at, + deduplication_key=dedup_key, + raw_message_ref=raw_message_ref, + notes=[mapping.normalized_meaning, *parsed.notes], + metadata={ + "message_class": parsed.message_class.value, + "smtp_status_code": parsed.smtp_status_code, + "enhanced_status_code": parsed.enhanced_status_code, + "reason_code": parsed.reason_code, + }, + ) diff --git a/src/email_connect/models.py b/src/email_connect/models.py new file mode 100644 index 0000000..cced3eb --- /dev/null +++ b/src/email_connect/models.py @@ -0,0 +1,126 @@ +from __future__ import annotations + +from dataclasses import dataclass, field +from datetime import datetime +from enum import StrEnum +from typing import Any + + +class MessageClass(StrEnum): + HARD_BOUNCE = "hard_bounce" + SOFT_BOUNCE = "soft_bounce" + DELAYED_DELIVERY_NOTICE = "delayed_delivery_notice" + FINAL_DELIVERY_FAILURE = "final_delivery_failure" + OUT_OF_OFFICE = "out_of_office" + HUMAN_REPLY = "human_reply" + COMPLAINT_OR_ABUSE = "complaint_or_abuse" + UNSUBSCRIBE_OR_OPT_OUT = "unsubscribe_or_opt_out" + CHALLENGE_RESPONSE = "challenge_response" + UNKNOWN_RETURN_MESSAGE = "unknown_return_message" + UNRELATED_MESSAGE = "unrelated_message" + PARSE_FAILED = "parse_failed" + + +class AssessmentCategory(StrEnum): + SUCCESS = "success" + FAIL = "fail" + UNDEF = "undef" + + +class Confidence(StrEnum): + LOW = "low" + MEDIUM = "medium" + HIGH = "high" + + +class EvidenceStrength(StrEnum): + NONE = "none" + WEAK = "weak" + MEDIUM = "medium" + STRONG = "strong" + NEGATIVE = "negative" + AMBIGUOUS = "ambiguous" + + +@dataclass(frozen=True) +class MailboxScan: + scan_id: str + mailbox_id: str + started_at: datetime + finished_at: datetime | None + scan_mode: str + folder: str + status: str + messages_seen: int = 0 + messages_new: int = 0 + messages_parsed: int = 0 + evidence_events_created: int = 0 + report_path: str | None = None + since: datetime | None = None + + +@dataclass(frozen=True) +class InboundMailboxMessage: + mailbox_message_id: str + mailbox_id: str + imap_uid: str | None + message_id_header: str | None + received_at: datetime | None + from_address: str | None + to_addresses: list[str] + subject: str | None + raw_headers_ref: str | None + raw_message_ref: str | None + first_seen_at: datetime + last_seen_at: datetime + deduplication_key: str + + +@dataclass(frozen=True) +class ParsedMailboxMessage: + parsed_message_id: str + mailbox_message_id: str + parser_version: str + message_class: MessageClass + affected_email_address: str | None + original_message_id: str | None + original_recipient: str | None + smtp_status_code: str | None + enhanced_status_code: str | None + reason_code: str | None + confidence: Confidence + parsed_at: datetime + notes: list[str] = field(default_factory=list) + + +@dataclass(frozen=True) +class EmailEvidenceCandidate: + evidence_candidate_id: str + mailbox_message_id: str + parsed_message_id: str + event_type: str + assessment_category: AssessmentCategory + assessment_subclass: str + affected_email_address: str | None + original_message_id: str | None + confidence: Confidence + evidence_strength: EvidenceStrength + occurred_at: datetime | None + observed_at: datetime + deduplication_key: str + raw_message_ref: str | None + notes: list[str] = field(default_factory=list) + metadata: dict[str, Any] = field(default_factory=dict) + + +@dataclass(frozen=True) +class EndpointQualityUpdate: + affected_email_address: str + reachability: str = "unknown" + suppression_state: str = "unknown" + last_success_at: datetime | None = None + last_failure_at: datetime | None = None + last_auto_reply_at: datetime | None = None + reason_code: str | None = None + confidence: Confidence = Confidence.LOW + quality_signals: list[str] = field(default_factory=list) diff --git a/src/email_connect/parser.py b/src/email_connect/parser.py new file mode 100644 index 0000000..87881a4 --- /dev/null +++ b/src/email_connect/parser.py @@ -0,0 +1,320 @@ +from __future__ import annotations + +import hashlib +import re +from datetime import UTC, datetime +from email import policy +from email.parser import BytesParser +from email.utils import getaddresses, parsedate_to_datetime +from pathlib import Path +from uuid import NAMESPACE_URL, uuid5 + +from .evidence import candidate_from_parsed +from .models import ( + Confidence, + EmailEvidenceCandidate, + InboundMailboxMessage, + MessageClass, + ParsedMailboxMessage, +) + +PARSER_VERSION = "mailbox-scanner-v0.1" + +EMAIL_RE = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}", re.IGNORECASE) +ENHANCED_STATUS_RE = re.compile(r"\b[245]\.\d{1,3}\.\d{1,3}\b") +SMTP_STATUS_RE = re.compile(r"\b[245]\d{2}\b") + + +def parse_message_bytes( + raw_bytes: bytes, + *, + mailbox_id: str, + raw_message_ref: str | None, + imap_uid: str | None = None, + now: datetime | None = None, +) -> tuple[InboundMailboxMessage, ParsedMailboxMessage, EmailEvidenceCandidate | None]: + observed_at = now or datetime.now(UTC) + msg = BytesParser(policy=policy.default).parsebytes(raw_bytes) + + message_id = _clean_header(msg.get("Message-ID")) + subject = _clean_header(msg.get("Subject")) + from_address = _first_address(msg.get("From")) + to_addresses = [addr for _, addr in getaddresses(msg.get_all("To", []))] + received_at = _parse_date(msg.get("Date")) + text = _extract_text(msg) + headers_text = "\n".join(f"{key}: {value}" for key, value in msg.items()) + combined = f"{headers_text}\n\n{subject or ''}\n{text}" + dedup_key = _message_dedup_key( + mailbox_id=mailbox_id, + imap_uid=imap_uid, + message_id=message_id, + received_at=received_at, + from_address=from_address, + subject=subject, + body=text, + ) + mailbox_message_id = str(uuid5(NAMESPACE_URL, "email-connect:message:" + dedup_key)) + inbound = InboundMailboxMessage( + mailbox_message_id=mailbox_message_id, + mailbox_id=mailbox_id, + imap_uid=imap_uid, + message_id_header=message_id, + received_at=received_at, + from_address=from_address, + to_addresses=to_addresses, + subject=subject, + raw_headers_ref=raw_message_ref, + raw_message_ref=raw_message_ref, + first_seen_at=observed_at, + last_seen_at=observed_at, + deduplication_key=dedup_key, + ) + + parsed = classify_message(inbound, combined, observed_at=observed_at) + candidate = candidate_from_parsed( + parsed, + raw_message_ref=raw_message_ref, + observed_at=observed_at, + occurred_at=received_at, + ) + return inbound, parsed, candidate + + +def parse_message_file( + path: str | Path, + *, + mailbox_id: str, + now: datetime | None = None, +) -> tuple[InboundMailboxMessage, ParsedMailboxMessage, EmailEvidenceCandidate | None]: + file_path = Path(path) + return parse_message_bytes( + file_path.read_bytes(), + mailbox_id=mailbox_id, + raw_message_ref=str(file_path), + now=now, + ) + + +def classify_message( + inbound: InboundMailboxMessage, + combined_text: str, + *, + observed_at: datetime, +) -> ParsedMailboxMessage: + text = combined_text.lower() + enhanced_status = _first_match(ENHANCED_STATUS_RE, combined_text) + smtp_status = _first_match(SMTP_STATUS_RE, combined_text) + affected = _extract_affected_recipient(combined_text) + original_message_id = _extract_headerish(combined_text, "Original-Message-ID") + notes: list[str] = [] + + message_class = MessageClass.UNRELATED_MESSAGE + confidence = Confidence.LOW + reason_code: str | None = None + + if _contains_any(text, ["feedback loop", "abuse report", "spam complaint", "complaint notification"]): + message_class = MessageClass.COMPLAINT_OR_ABUSE + confidence = Confidence.HIGH + reason_code = "complaint" + elif _contains_any(text, ["unsubscribe", "opt-out", "opt out", "remove me"]): + message_class = MessageClass.UNSUBSCRIBE_OR_OPT_OUT + confidence = Confidence.MEDIUM + reason_code = "unsubscribe" + elif _is_out_of_office(text): + message_class = MessageClass.OUT_OF_OFFICE + confidence = Confidence.MEDIUM + reason_code = "auto_reply" + elif _contains_any(text, ["will keep trying", "delivery delayed", "message delayed", "not yet delivered"]): + message_class = MessageClass.DELAYED_DELIVERY_NOTICE + confidence = Confidence.HIGH + reason_code = "delayed" + elif _is_dsn_like(text): + message_class, confidence, reason_code = _classify_dsn(text, smtp_status, enhanced_status) + elif _looks_like_human_reply(inbound, text): + message_class = MessageClass.HUMAN_REPLY + confidence = Confidence.MEDIUM + reason_code = "reply" + elif _looks_return_related(text): + message_class = MessageClass.UNKNOWN_RETURN_MESSAGE + confidence = Confidence.LOW + reason_code = "unknown_return" + notes.append("Return-channel message did not match a reliable classifier.") + + if enhanced_status: + notes.append(f"enhanced_status={enhanced_status}") + if smtp_status: + notes.append(f"smtp_status={smtp_status}") + + parsed_id_basis = "|".join([inbound.mailbox_message_id, PARSER_VERSION, message_class.value]) + return ParsedMailboxMessage( + parsed_message_id=str(uuid5(NAMESPACE_URL, "email-connect:parsed:" + parsed_id_basis)), + mailbox_message_id=inbound.mailbox_message_id, + parser_version=PARSER_VERSION, + message_class=message_class, + affected_email_address=affected, + original_message_id=original_message_id, + original_recipient=affected, + smtp_status_code=smtp_status, + enhanced_status_code=enhanced_status, + reason_code=reason_code, + confidence=confidence, + parsed_at=observed_at, + notes=notes, + ) + + +def _classify_dsn( + text: str, + smtp_status: str | None, + enhanced_status: str | None, +) -> tuple[MessageClass, Confidence, str]: + if _contains_any(text, ["final failure", "giving up", "could not deliver after"]): + return MessageClass.FINAL_DELIVERY_FAILURE, Confidence.HIGH, "expired_without_delivery" + if (enhanced_status and enhanced_status.startswith("5.")) or (smtp_status and smtp_status.startswith("5")): + return MessageClass.HARD_BOUNCE, Confidence.HIGH, "hard_bounce" + if _contains_any(text, ["unknown user", "mailbox not found", "user unknown", "domain not found"]): + return MessageClass.HARD_BOUNCE, Confidence.HIGH, "hard_bounce" + if (enhanced_status and enhanced_status.startswith("4.")) or (smtp_status and smtp_status.startswith("4")): + return MessageClass.SOFT_BOUNCE, Confidence.HIGH, "soft_bounce" + if _contains_any(text, ["temporary failure", "mailbox full", "greylist", "try again later"]): + return MessageClass.SOFT_BOUNCE, Confidence.MEDIUM, "soft_bounce" + return MessageClass.UNKNOWN_RETURN_MESSAGE, Confidence.LOW, "unknown_dsn" + + +def _extract_text(msg) -> str: + chunks: list[str] = [] + if msg.is_multipart(): + for part in msg.walk(): + if part.get_content_maintype() == "multipart": + continue + if part.get_content_type() in {"text/plain", "message/delivery-status"}: + try: + chunks.append(str(part.get_content())) + except Exception: + continue + else: + try: + chunks.append(str(msg.get_content())) + except Exception: + payload = msg.get_payload(decode=True) + if payload: + chunks.append(payload.decode(errors="replace")) + return "\n".join(chunks) + + +def _message_dedup_key( + *, + mailbox_id: str, + imap_uid: str | None, + message_id: str | None, + received_at: datetime | None, + from_address: str | None, + subject: str | None, + body: str, +) -> str: + body_hash = hashlib.sha256(body.encode("utf-8", errors="replace")).hexdigest()[:16] + return "|".join( + [ + mailbox_id, + imap_uid or "", + message_id or "", + received_at.isoformat() if received_at else "", + from_address or "", + hashlib.sha256((subject or "").encode()).hexdigest()[:12], + body_hash, + ] + ) + + +def _clean_header(value: str | None) -> str | None: + if value is None: + return None + cleaned = str(value).strip() + return cleaned or None + + +def _first_address(value: str | None) -> str | None: + if not value: + return None + addresses = getaddresses([value]) + return addresses[0][1] if addresses else None + + +def _parse_date(value: str | None) -> datetime | None: + if not value: + return None + try: + parsed = parsedate_to_datetime(value) + except (TypeError, ValueError): + return None + if parsed.tzinfo is None: + return parsed.replace(tzinfo=UTC) + return parsed.astimezone(UTC) + + +def _first_match(pattern: re.Pattern[str], text: str) -> str | None: + match = pattern.search(text) + return match.group(0) if match else None + + +def _extract_headerish(text: str, name: str) -> str | None: + match = re.search(rf"^{re.escape(name)}:\s*(.+)$", text, re.IGNORECASE | re.MULTILINE) + return match.group(1).strip() if match else None + + +def _extract_affected_recipient(text: str) -> str | None: + for name in ["Final-Recipient", "Original-Recipient", "X-Failed-Recipients", "Failed-Recipient"]: + value = _extract_headerish(text, name) + if value: + match = EMAIL_RE.search(value) + if match: + return match.group(0).lower() + match = EMAIL_RE.search(text) + return match.group(0).lower() if match else None + + +def _contains_any(text: str, needles: list[str]) -> bool: + return any(needle in text for needle in needles) + + +def _is_dsn_like(text: str) -> bool: + return _contains_any( + text, + [ + "delivery status notification", + "delivery failure", + "undeliverable", + "returned mail", + "mail delivery subsystem", + "final-recipient", + "diagnostic-code", + "status:", + ], + ) + + +def _is_out_of_office(text: str) -> bool: + return _contains_any( + text, + [ + "out of office", + "auto-reply", + "autoreply", + "automatic reply", + "vacation", + "abwesenheitsnotiz", + "nicht im buero", + "nicht im büro", + ], + ) + + +def _looks_like_human_reply(inbound: InboundMailboxMessage, text: str) -> bool: + subject = (inbound.subject or "").lower() + if _contains_any(text, ["auto-submitted: auto-replied", "x-autoreply", "auto-generated"]): + return False + return subject.startswith("re:") or _contains_any(text, ["thanks", "thank you", "i confirm", "received"]) + + +def _looks_return_related(text: str) -> bool: + return _contains_any(text, ["delivery", "mailbox", "recipient", "message", "smtp", "unsubscribe", "reply"]) diff --git a/src/email_connect/reporting.py b/src/email_connect/reporting.py new file mode 100644 index 0000000..d901eb8 --- /dev/null +++ b/src/email_connect/reporting.py @@ -0,0 +1,107 @@ +from __future__ import annotations + +import csv +import json +from datetime import UTC, datetime +from pathlib import Path + +REPORT_COLUMNS = [ + "report_generated_at", + "scan_id", + "mailbox_id", + "mailbox_message_id", + "mailbox_received_at", + "source_from", + "source_to", + "source_subject", + "message_id_header", + "detected_message_class", + "normalized_event_type", + "assessment_category", + "assessment_subclass", + "affected_email_address", + "original_message_id", + "original_recipient", + "smtp_status_code", + "enhanced_status_code", + "reason_code", + "confidence", + "evidence_strength", + "occurred_at", + "observed_at", + "first_seen_at", + "last_seen_at", + "deduplication_key", + "raw_message_ref", + "notes", +] + + +def report_filename(now: datetime | None = None) -> str: + stamp = (now or datetime.now(UTC)).strftime("%Y%m%d-%H%M%S") + return f"email-channel-evidence-report-{stamp}.csv" + + +def write_evidence_report( + rows: list[dict], + *, + output_dir: str | Path, + scan_id: str, + mailbox_id: str, + generated_at: datetime | None = None, +) -> Path: + generated = generated_at or datetime.now(UTC) + out_dir = Path(output_dir) + out_dir.mkdir(parents=True, exist_ok=True) + path = out_dir / report_filename(generated) + + with path.open("w", newline="", encoding="utf-8") as fh: + writer = csv.DictWriter(fh, fieldnames=REPORT_COLUMNS) + writer.writeheader() + for row in rows: + writer.writerow(_report_row(row, scan_id=scan_id, mailbox_id=mailbox_id, generated_at=generated)) + return path + + +def _report_row(row: dict, *, scan_id: str, mailbox_id: str, generated_at: datetime) -> dict: + metadata = _json(row.get("metadata_json")) + notes = _json(row.get("notes_json")) + return { + "report_generated_at": generated_at.isoformat(), + "scan_id": scan_id, + "mailbox_id": mailbox_id, + "mailbox_message_id": row.get("mailbox_message_id", ""), + "mailbox_received_at": row.get("occurred_at") or "", + "source_from": metadata.get("source_from", ""), + "source_to": metadata.get("source_to", ""), + "source_subject": metadata.get("source_subject", ""), + "message_id_header": metadata.get("message_id_header", ""), + "detected_message_class": metadata.get("message_class", ""), + "normalized_event_type": row.get("event_type", ""), + "assessment_category": row.get("assessment_category", ""), + "assessment_subclass": row.get("assessment_subclass", ""), + "affected_email_address": row.get("affected_email_address") or "", + "original_message_id": row.get("original_message_id") or "", + "original_recipient": metadata.get("original_recipient", ""), + "smtp_status_code": metadata.get("smtp_status_code") or "", + "enhanced_status_code": metadata.get("enhanced_status_code") or "", + "reason_code": metadata.get("reason_code") or "", + "confidence": row.get("confidence", ""), + "evidence_strength": row.get("evidence_strength", ""), + "occurred_at": row.get("occurred_at") or "", + "observed_at": row.get("observed_at") or "", + "first_seen_at": metadata.get("first_seen_at", ""), + "last_seen_at": metadata.get("last_seen_at", ""), + "deduplication_key": row.get("deduplication_key", ""), + "raw_message_ref": row.get("raw_message_ref") or "", + "notes": "; ".join(str(item) for item in notes), + } + + +def _json(value: str | None) -> dict | list: + if not value: + return {} + try: + return json.loads(value) + except json.JSONDecodeError: + return {} diff --git a/src/email_connect/scanner.py b/src/email_connect/scanner.py new file mode 100644 index 0000000..1fde93a --- /dev/null +++ b/src/email_connect/scanner.py @@ -0,0 +1,107 @@ +from __future__ import annotations + +from dataclasses import dataclass +from datetime import UTC, datetime +from pathlib import Path +from uuid import uuid4 + +from .config import AppConfig +from .models import MailboxScan +from .parser import parse_message_file +from .reporting import write_evidence_report +from .storage import StateStore + + +@dataclass(frozen=True) +class ScanResult: + scan: MailboxScan + report_path: Path | None + + +def scan_mailbox( + config: AppConfig, + *, + output_dir: str | None = None, + full_rescan: bool = False, + report_only_new: bool = False, + dry_run: bool = False, + fixture_dir: str | None = None, +) -> ScanResult: + started_at = datetime.now(UTC) + scan_id = str(uuid4()) + source_dir = fixture_dir or config.source.fixture_dir + if not source_dir: + raise ValueError("No source.fixture_dir configured yet. IMAP connector is planned in EMAIL-WP-0002-T02.") + + files = sorted(Path(source_dir).glob("*.eml")) + if config.scan.max_messages_per_run: + files = files[: config.scan.max_messages_per_run] + + store = StateStore(config.storage.path) + messages_seen = 0 + messages_new = 0 + messages_parsed = 0 + evidence_created = 0 + try: + for path in files: + messages_seen += 1 + inbound, parsed, candidate = parse_message_file(path, mailbox_id=config.mailbox.id) + if dry_run: + messages_parsed += 1 + continue + is_new = store.upsert_message(inbound) + messages_new += int(is_new) + store.insert_parsed(parsed) + messages_parsed += 1 + if candidate is not None: + candidate = _enrich_candidate(candidate, inbound, parsed) + evidence_created += int(store.insert_evidence(candidate)) + + report_path = None + if not dry_run: + report_rows = store.evidence_rows() + report_path = write_evidence_report( + report_rows, + output_dir=output_dir or config.reports.output_dir, + scan_id=scan_id, + mailbox_id=config.mailbox.id, + ) + finished_at = datetime.now(UTC) + scan = MailboxScan( + scan_id=scan_id, + mailbox_id=config.mailbox.id, + started_at=started_at, + finished_at=finished_at, + scan_mode="full_rescan" if full_rescan else config.scan.mode, + folder=config.mailbox.folder, + status="completed", + messages_seen=messages_seen, + messages_new=messages_new, + messages_parsed=messages_parsed, + evidence_events_created=evidence_created, + report_path=str(report_path) if report_path else None, + ) + if not dry_run: + store.insert_scan(scan) + return ScanResult(scan=scan, report_path=report_path) + finally: + store.close() + + +def _enrich_candidate(candidate, inbound, parsed): + metadata = { + **candidate.metadata, + "source_from": inbound.from_address, + "source_to": ", ".join(inbound.to_addresses), + "source_subject": inbound.subject, + "message_id_header": inbound.message_id_header, + "original_recipient": parsed.original_recipient, + "first_seen_at": inbound.first_seen_at.isoformat(), + "last_seen_at": inbound.last_seen_at.isoformat(), + } + return type(candidate)( + **{ + **candidate.__dict__, + "metadata": metadata, + } + ) diff --git a/src/email_connect/storage.py b/src/email_connect/storage.py new file mode 100644 index 0000000..7709c1f --- /dev/null +++ b/src/email_connect/storage.py @@ -0,0 +1,212 @@ +from __future__ import annotations + +import json +import sqlite3 +from datetime import UTC, datetime +from pathlib import Path + +from .models import EmailEvidenceCandidate, InboundMailboxMessage, MailboxScan, ParsedMailboxMessage + + +class StateStore: + def __init__(self, path: str | Path) -> None: + self.path = Path(path) + self.path.parent.mkdir(parents=True, exist_ok=True) + self.conn = sqlite3.connect(self.path) + self.conn.row_factory = sqlite3.Row + self.init_schema() + + def close(self) -> None: + self.conn.close() + + def init_schema(self) -> None: + self.conn.executescript( + """ + create table if not exists mailbox_scans ( + scan_id text primary key, + mailbox_id text not null, + started_at text not null, + finished_at text, + scan_mode text not null, + folder text not null, + status text not null, + messages_seen integer not null, + messages_new integer not null, + messages_parsed integer not null, + evidence_events_created integer not null, + report_path text, + since text + ); + + create table if not exists mailbox_messages ( + mailbox_message_id text primary key, + mailbox_id text not null, + imap_uid text, + message_id_header text, + received_at text, + from_address text, + to_addresses_json text not null, + subject text, + raw_headers_ref text, + raw_message_ref text, + first_seen_at text not null, + last_seen_at text not null, + deduplication_key text not null unique + ); + + create table if not exists parsed_messages ( + parsed_message_id text primary key, + mailbox_message_id text not null, + parser_version text not null, + message_class text not null, + affected_email_address text, + original_message_id text, + original_recipient text, + smtp_status_code text, + enhanced_status_code text, + reason_code text, + confidence text not null, + parsed_at text not null, + notes_json text not null + ); + + create table if not exists evidence_candidates ( + evidence_candidate_id text primary key, + mailbox_message_id text not null, + parsed_message_id text not null, + event_type text not null, + assessment_category text not null, + assessment_subclass text not null, + affected_email_address text, + original_message_id text, + confidence text not null, + evidence_strength text not null, + occurred_at text, + observed_at text not null, + deduplication_key text not null unique, + raw_message_ref text, + notes_json text not null, + metadata_json text not null + ); + """ + ) + self.conn.commit() + + def upsert_message(self, message: InboundMailboxMessage) -> bool: + existing = self.conn.execute( + "select mailbox_message_id from mailbox_messages where deduplication_key = ?", + (message.deduplication_key,), + ).fetchone() + if existing: + self.conn.execute( + "update mailbox_messages set last_seen_at = ? where deduplication_key = ?", + (_dt(message.last_seen_at), message.deduplication_key), + ) + self.conn.commit() + return False + self.conn.execute( + """ + insert into mailbox_messages values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) + """, + ( + message.mailbox_message_id, + message.mailbox_id, + message.imap_uid, + message.message_id_header, + _dt(message.received_at), + message.from_address, + json.dumps(message.to_addresses), + message.subject, + message.raw_headers_ref, + message.raw_message_ref, + _dt(message.first_seen_at), + _dt(message.last_seen_at), + message.deduplication_key, + ), + ) + self.conn.commit() + return True + + def insert_parsed(self, parsed: ParsedMailboxMessage) -> None: + self.conn.execute( + """ + insert or replace into parsed_messages values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) + """, + ( + parsed.parsed_message_id, + parsed.mailbox_message_id, + parsed.parser_version, + parsed.message_class.value, + parsed.affected_email_address, + parsed.original_message_id, + parsed.original_recipient, + parsed.smtp_status_code, + parsed.enhanced_status_code, + parsed.reason_code, + parsed.confidence.value, + _dt(parsed.parsed_at), + json.dumps(parsed.notes), + ), + ) + self.conn.commit() + + def insert_evidence(self, candidate: EmailEvidenceCandidate) -> bool: + try: + self.conn.execute( + """ + insert into evidence_candidates values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) + """, + ( + candidate.evidence_candidate_id, + candidate.mailbox_message_id, + candidate.parsed_message_id, + candidate.event_type, + candidate.assessment_category.value, + candidate.assessment_subclass, + candidate.affected_email_address, + candidate.original_message_id, + candidate.confidence.value, + candidate.evidence_strength.value, + _dt(candidate.occurred_at), + _dt(candidate.observed_at), + candidate.deduplication_key, + candidate.raw_message_ref, + json.dumps(candidate.notes), + json.dumps(candidate.metadata), + ), + ) + except sqlite3.IntegrityError: + return False + self.conn.commit() + return True + + def insert_scan(self, scan: MailboxScan) -> None: + self.conn.execute( + """ + insert or replace into mailbox_scans values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) + """, + ( + scan.scan_id, + scan.mailbox_id, + _dt(scan.started_at), + _dt(scan.finished_at), + scan.scan_mode, + scan.folder, + scan.status, + scan.messages_seen, + scan.messages_new, + scan.messages_parsed, + scan.evidence_events_created, + scan.report_path, + _dt(scan.since), + ), + ) + self.conn.commit() + + def evidence_rows(self) -> list[dict]: + rows = self.conn.execute("select * from evidence_candidates order by observed_at, event_type").fetchall() + return [dict(row) for row in rows] + + +def _dt(value: datetime | None) -> str | None: + return value.isoformat() if value else None diff --git a/tests/fixtures/mailbox/hard_bounce.eml b/tests/fixtures/mailbox/hard_bounce.eml new file mode 100644 index 0000000..b1599be --- /dev/null +++ b/tests/fixtures/mailbox/hard_bounce.eml @@ -0,0 +1,12 @@ +From: Mail Delivery Subsystem +To: sender@example.com +Subject: Delivery Status Notification (Failure) +Date: Tue, 02 Jun 2026 10:00:00 +0000 +Message-ID: +Content-Type: text/plain; charset=utf-8 + +Delivery failure. +Final-Recipient: rfc822; missing@example.com +Action: failed +Status: 5.1.1 +Diagnostic-Code: smtp; 550 5.1.1 User unknown diff --git a/tests/fixtures/mailbox/human_reply.eml b/tests/fixtures/mailbox/human_reply.eml new file mode 100644 index 0000000..4d0a55b --- /dev/null +++ b/tests/fixtures/mailbox/human_reply.eml @@ -0,0 +1,8 @@ +From: Recipient +To: sender@example.com +Subject: Re: Your notification +Date: Tue, 02 Jun 2026 10:03:00 +0000 +Message-ID: +Content-Type: text/plain; charset=utf-8 + +Thanks, I received this and will review it today. diff --git a/tests/fixtures/mailbox/out_of_office.eml b/tests/fixtures/mailbox/out_of_office.eml new file mode 100644 index 0000000..dd554de --- /dev/null +++ b/tests/fixtures/mailbox/out_of_office.eml @@ -0,0 +1,9 @@ +From: Recipient +To: sender@example.com +Subject: Auto-reply: Out of office +Date: Tue, 02 Jun 2026 10:02:00 +0000 +Message-ID: +Auto-Submitted: auto-replied +Content-Type: text/plain; charset=utf-8 + +I am out of office until next week. diff --git a/tests/fixtures/mailbox/soft_bounce.eml b/tests/fixtures/mailbox/soft_bounce.eml new file mode 100644 index 0000000..0b4b776 --- /dev/null +++ b/tests/fixtures/mailbox/soft_bounce.eml @@ -0,0 +1,12 @@ +From: Mail Delivery Subsystem +To: sender@example.com +Subject: Delivery temporarily delayed +Date: Tue, 02 Jun 2026 10:01:00 +0000 +Message-ID: +Content-Type: text/plain; charset=utf-8 + +Delivery Status Notification. +Final-Recipient: rfc822; full@example.com +Action: delayed +Status: 4.2.2 +Diagnostic-Code: smtp; 452 4.2.2 Mailbox full diff --git a/tests/test_parser.py b/tests/test_parser.py new file mode 100644 index 0000000..fb11232 --- /dev/null +++ b/tests/test_parser.py @@ -0,0 +1,43 @@ +from __future__ import annotations + +import unittest +from pathlib import Path + +from email_connect.models import MessageClass +from email_connect.parser import parse_message_file + + +FIXTURES = Path(__file__).parent / "fixtures" / "mailbox" + + +class ParserTests(unittest.TestCase): + def test_hard_bounce_maps_to_permanent_rejection(self) -> None: + _inbound, parsed, candidate = parse_message_file(FIXTURES / "hard_bounce.eml", mailbox_id="test") + self.assertEqual(parsed.message_class, MessageClass.HARD_BOUNCE) + self.assertIsNotNone(candidate) + self.assertEqual(candidate.event_type, "notification.endpoint.rejected_permanent") + self.assertEqual(candidate.assessment_subclass, "fail.hard_bounce") + + def test_soft_bounce_maps_to_temporary_rejection(self) -> None: + _inbound, parsed, candidate = parse_message_file(FIXTURES / "soft_bounce.eml", mailbox_id="test") + self.assertEqual(parsed.message_class, MessageClass.SOFT_BOUNCE) + self.assertIsNotNone(candidate) + self.assertEqual(candidate.event_type, "notification.endpoint.rejected_temporary") + + def test_out_of_office_stays_undef(self) -> None: + _inbound, parsed, candidate = parse_message_file(FIXTURES / "out_of_office.eml", mailbox_id="test") + self.assertEqual(parsed.message_class, MessageClass.OUT_OF_OFFICE) + self.assertIsNotNone(candidate) + self.assertEqual(candidate.assessment_category.value, "undef") + self.assertEqual(candidate.event_type, "interaction.out_of_office_received") + + def test_human_reply_is_email_channel_success_only(self) -> None: + _inbound, parsed, candidate = parse_message_file(FIXTURES / "human_reply.eml", mailbox_id="test") + self.assertEqual(parsed.message_class, MessageClass.HUMAN_REPLY) + self.assertIsNotNone(candidate) + self.assertEqual(candidate.event_type, "interaction.reply_received") + self.assertEqual(candidate.assessment_subclass, "success.reply_received") + + +if __name__ == "__main__": + unittest.main() diff --git a/tests/test_scanner.py b/tests/test_scanner.py new file mode 100644 index 0000000..3b2894b --- /dev/null +++ b/tests/test_scanner.py @@ -0,0 +1,37 @@ +from __future__ import annotations + +import tempfile +import unittest +from pathlib import Path + +from email_connect.config import AppConfig, MailboxConfig, ReportsConfig, ScanConfig, SourceConfig, StorageConfig +from email_connect.scanner import scan_mailbox + + +FIXTURES = Path(__file__).parent / "fixtures" / "mailbox" + + +class ScannerTests(unittest.TestCase): + def test_scan_fixture_directory_writes_report_and_deduplicates(self) -> None: + with tempfile.TemporaryDirectory() as tmp: + root = Path(tmp) + config = AppConfig( + mailbox=MailboxConfig(id="test-mailbox", protocol="fixture"), + scan=ScanConfig(), + storage=StorageConfig(path=str(root / "state.sqlite")), + reports=ReportsConfig(output_dir=str(root / "reports")), + source=SourceConfig(fixture_dir=str(FIXTURES)), + ) + first = scan_mailbox(config) + second = scan_mailbox(config) + + self.assertEqual(first.scan.messages_seen, 4) + self.assertEqual(first.scan.messages_new, 4) + self.assertGreaterEqual(first.scan.evidence_events_created, 4) + self.assertEqual(second.scan.messages_new, 0) + self.assertEqual(second.scan.evidence_events_created, 0) + self.assertTrue(first.report_path and first.report_path.exists()) + + +if __name__ == "__main__": + unittest.main() diff --git a/workplans/EMAIL-WP-0001-repo-onboarding.md b/workplans/EMAIL-WP-0001-repo-onboarding.md index 7c34290..0a6c9ee 100644 --- a/workplans/EMAIL-WP-0001-repo-onboarding.md +++ b/workplans/EMAIL-WP-0001-repo-onboarding.md @@ -4,7 +4,7 @@ type: workplan title: "Repository Onboarding and Implementation Foundation" domain: custodian repo: email-connect -status: active +status: finished owner: codex topic_slug: custodian created: "2026-06-02" @@ -55,7 +55,7 @@ Done when: ```task id: EMAIL-WP-0001-T02 -status: todo +status: done priority: high state_hub_task_id: "fdfd8b96-7326-414f-8126-79bb3a21b950" ``` @@ -76,7 +76,7 @@ The architecture note should cover: ```task id: EMAIL-WP-0001-T03 -status: todo +status: done priority: high state_hub_task_id: "ef1eb769-dfa0-4b46-8633-274d90962423" ``` @@ -105,7 +105,7 @@ click telemetry as proof of human awareness or result satisfaction. ```task id: EMAIL-WP-0001-T04 -status: todo +status: done priority: medium state_hub_task_id: "4b94e544-5aad-4c38-8fe3-eed17af79971" ``` diff --git a/workplans/EMAIL-WP-0002-mvp-mailbox-evidence-scanner.md b/workplans/EMAIL-WP-0002-mvp-mailbox-evidence-scanner.md index c01ca4e..a9c8fd4 100644 --- a/workplans/EMAIL-WP-0002-mvp-mailbox-evidence-scanner.md +++ b/workplans/EMAIL-WP-0002-mvp-mailbox-evidence-scanner.md @@ -4,7 +4,7 @@ type: workplan title: "MVP Mailbox Evidence Scanner" domain: custodian repo: email-connect -status: ready +status: active owner: codex topic_slug: custodian created: "2026-06-02" @@ -688,7 +688,7 @@ Endpoint quality is diagnostic and must not be treated as coordination success. ```task id: EMAIL-WP-0002-T01 -status: todo +status: done priority: high state_hub_task_id: "3a17215d-62a9-48ef-877f-a6fbc7e95a22" ``` @@ -716,7 +716,7 @@ Config file is loaded and validated. ```task id: EMAIL-WP-0002-T02 -status: todo +status: progress priority: high state_hub_task_id: "25a4da12-1bcd-4c6d-a0eb-a2f525b9c4b9" ``` @@ -744,7 +744,7 @@ CLI can connect to mailbox and list/fetch messages without modifying mailbox. ```task id: EMAIL-WP-0002-T03 -status: todo +status: progress priority: high state_hub_task_id: "16b95a6b-1375-4c91-8b78-0b75d51e0aeb" ``` @@ -773,7 +773,7 @@ Full rescan can revisit all messages while preserving deduplication. ```task id: EMAIL-WP-0002-T04 -status: todo +status: done priority: high state_hub_task_id: "5a50cd85-b0ab-4017-aba0-b2087068abb4" ``` @@ -802,7 +802,7 @@ Scanner extracts basic metadata and text from representative bounce and reply me ```task id: EMAIL-WP-0002-T05 -status: todo +status: progress priority: high state_hub_task_id: "8ea826d1-0add-4573-9bb4-2b73adefba55" ``` @@ -831,7 +831,7 @@ Representative hard and soft bounce samples are classified correctly. ```task id: EMAIL-WP-0002-T06 -status: todo +status: progress priority: high state_hub_task_id: "4d94a332-173b-4787-8fb2-27aa63db6a8d" ``` @@ -880,7 +880,7 @@ Representative complaint and unsubscribe examples are classified. ```task id: EMAIL-WP-0002-T08 -status: todo +status: progress priority: high state_hub_task_id: "6d62dea0-f416-4c0b-80a0-7c16422b8e5f" ``` @@ -932,7 +932,7 @@ Complaint/unsubscribe updates suppression state. ```task id: EMAIL-WP-0002-T10 -status: todo +status: progress priority: medium state_hub_task_id: "5ab35176-d6c2-4c73-b7b3-bde4c097e3ee" ``` @@ -959,7 +959,7 @@ Report can be opened in spreadsheet tools. ```task id: EMAIL-WP-0002-T11 -status: todo +status: progress priority: high state_hub_task_id: "514fa099-781b-4590-aae4-c28970413b3f" ``` @@ -989,7 +989,7 @@ Automated tests verify expected classification and normalized event output. ```task id: EMAIL-WP-0002-T12 -status: todo +status: progress priority: medium state_hub_task_id: "a5f7067e-87be-4438-ba35-b12d06a8181e" ```