diff --git a/spec/CoreArchitectureBlueprint.md b/spec/CoreArchitectureBlueprint.md
index cc6145d..80ff343 100644
--- a/spec/CoreArchitectureBlueprint.md
+++ b/spec/CoreArchitectureBlueprint.md
@@ -454,6 +454,43 @@
 solved): the exact convergence bound for high-write CRDT shards under partition, and whether
 per-equivalence-set divergence needs a vector clock vs. a simple base-rev comparison, are
 deferred to implementation spikes.
+### 8.7 Scaling the union — incremental-first, rebuild as fallback
+
+The derived tier is *recomputable* (I-2) but recompute must never be the **operational**
+mechanism. A from-scratch rebuild reads every page of every shard — including rate-limited,
+paginated external APIs (Notion) and irreducibly-live sources — which can take hours to days
+and directly fights the operational-envelope axis (review C-2). So:
+
+**Incremental, change-driven maintenance is the primary mechanism.** Each shard's `notify`
+capability (or a poll/ETag fallback where it has none, §8.8) emits **change events**; an event
+drives a **delta update** to exactly the affected union nodes, equivalence candidates, indexes,
+and projections. The derived tier is a continuously-maintained materialised view, not a
+periodically-recomputed one. Steady-state cost is O(changes), not O(corpus).
+
+**Full rebuild is a rare, bounded fallback** — for cold start, schema/algorithm change, or
+suspected corruption — and it is **explicitly not required to be cheap**. It respects each
+shard's envelope (it may be slow, throttled, or resumable for a rate-limited shard) and runs
+*concurrently with serving the existing derived tier*; it swaps in atomically on completion.
+I-2 guarantees rebuild is *possible and correct*, not instant.
+
+**Equivalence detection is indexed, not pairwise (review C-1).** Naive fingerprint/span-set
+comparison across all pages of all shards is O(N²) and is forbidden. Instead:
+
+1. **Blocking / candidate generation** — cheap keys bucket pages that *could* be equivalent:
+   normalised title, normalised path tail, explicit alias-table entries (coordination-
+   canonical), and **MinHash/LSH bands over content shingles** for near-duplicate and
+   derived-content detection. Only within-bucket pairs are considered — turning O(N²) into
+   ≈O(N) candidates.
+2. **Verification** — candidate pairs are confirmed by full fingerprint / span-set overlap and
+   any curator binding. Confirmed equivalences become union edges.
+3. **Incremental maintenance** — a changed page is re-bucketed and only its *new* candidate set
+   is re-verified; equivalence is maintained per-change, never recomputed globally.
+
+**The index is itself derived** (disposable, recomputable) and per-tenant-partitioned (§9).
+Its parameters (LSH band/row counts, shingle size, precision/recall) are tunable; the accepted
+**false-negative rate of blocking** is a known, tracked limitation (§12) — blocking trades a
+small miss rate for tractability, and curator bindings are the escape hatch for misses.
+
 ---
 
 ## 9. Cross-cut — Authorization (L5)
diff --git a/workplans/SHARD-WP-0005-architecture-hardening.md b/workplans/SHARD-WP-0005-architecture-hardening.md
index 1901329..1c32d9b 100644
--- a/workplans/SHARD-WP-0005-architecture-hardening.md
+++ b/workplans/SHARD-WP-0005-architecture-hardening.md
@@ -90,7 +90,7 @@
 sameness).
 ```task
 id: SHARD-WP-0005-T3
-status: todo
+status: done
 priority: high
 state_hub_task_id: "fb91f43f-3bf0-41f5-ad0a-bfd15a7fad17"
 ```
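
Note on §8.7: the three equivalence steps it adds (blocking, verification, incremental maintenance) could be sketched roughly as below. This is only an illustration under stated assumptions, not the blueprint's implementation: `BlockingIndex`, `upsert`, the word-level shingles, the salted BLAKE2b hash family, the Jaccard threshold, and the parameter values are all hypothetical choices made for the example. Only the MinHash/LSH bucket key is shown; the title, path-tail, and alias-table keys would feed the same buckets.

```python
import hashlib
from collections import defaultdict

# All values below are illustrative assumptions; the spec leaves them tunable.
NUM_HASHES = 32          # MinHash signature length
BANDS = 8                # LSH bands; rows per band = NUM_HASHES // BANDS
SHINGLE_WORDS = 3        # shingle size, in words


def shingles(text):
    """Return the set of word-level shingles of a page's content."""
    words = text.lower().split()
    if len(words) < SHINGLE_WORDS:
        return {" ".join(words)}
    return {" ".join(words[i:i + SHINGLE_WORDS])
            for i in range(len(words) - SHINGLE_WORDS + 1)}


def _hash(seed, shingle):
    """One member of the hash family: BLAKE2b salted with the seed."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def band_keys(text):
    """Blocking keys: the MinHash signature cut into BANDS bands."""
    sh = shingles(text)
    sig = tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_HASHES))
    rows = NUM_HASHES // BANDS
    return [(b, sig[b * rows:(b + 1) * rows]) for b in range(BANDS)]


def jaccard(a, b):
    """Verification stand-in: exact shingle-set overlap of two pages."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


class BlockingIndex:
    """Derived, disposable index; one instance per tenant partition."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.pages = {}                      # page_id -> content
        self.keys = {}                       # page_id -> its band keys
        self.buckets = defaultdict(set)      # band key -> page ids
        self.neighbours = defaultdict(set)   # confirmed union edges

    def remove(self, page_id):
        """Drop a page from its buckets and retract its edges."""
        for key in self.keys.pop(page_id, []):
            self.buckets[key].discard(page_id)
        for other in self.neighbours.pop(page_id, set()):
            self.neighbours[other].discard(page_id)
        self.pages.pop(page_id, None)

    def upsert(self, page_id, text):
        """Handle one change event: re-bucket, then verify only candidates."""
        self.remove(page_id)
        keys = band_keys(text)
        candidates = set()
        for key in keys:
            candidates |= self.buckets[key]   # within-bucket pairs only
            self.buckets[key].add(page_id)
        self.pages[page_id] = text
        self.keys[page_id] = keys
        for other in candidates:
            if jaccard(text, self.pages[other]) >= self.threshold:
                self.neighbours[page_id].add(other)
                self.neighbours[other].add(page_id)
        return candidates
```

Each `upsert` touches only the buckets and edges of the changed page, which is the O(changes) behaviour the section asks for. A real verifier would use the spec's fingerprint / span-set overlap and curator bindings rather than plain Jaccard, and pages that share no band remain unmatched, which corresponds to the false-negative rate the section tracks.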