spec(SHARD-WP-0005 T4): scale the union — incremental-first + indexed equivalence (§8.7)

Fixes C-1/C-2. Incremental change-driven maintenance (notify->delta) is primary; full rebuild is a rare, envelope-respecting, concurrent fallback (not required to be cheap). Equivalence uses blocking/LSH candidate generation + verification + incremental maintenance, replacing the O(N^2) pairwise comparison. The index is derived and per-tenant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 01:36:09 +02:00
parent 04be66161e
commit dc451b0f4e
2 changed files with 38 additions and 1 deletions
--- a/spec/CoreArchitectureBlueprint.md
+++ b/spec/CoreArchitectureBlueprint.md
@@ -454,6 +454,43 @@ solved): the exact convergence bound for
 high-write CRDT shards under partition, and whether per-equivalence-set divergence needs a
 vector clock vs. a simple base-rev comparison, are deferred to implementation spikes.

+### 8.7 Scaling the union — incremental-first, rebuild as fallback
+
+The derived tier is *recomputable* (I-2) but recompute must never be the **operational**
+mechanism. A from-scratch rebuild reads every page of every shard — including rate-limited,
+paginated external APIs (Notion) and irreducibly-live sources — which can take hours to days
+and directly fights the operational-envelope axis (review C-2). So:
+
+**Incremental, change-driven maintenance is the primary mechanism.** Each shard's `notify`
+capability (or a poll/ETag fallback where it has none, §8.8) emits **change events**; an event
+drives a **delta update** to exactly the affected union nodes, equivalence candidates, indexes,
+and projections. The derived tier is a continuously-maintained materialised view, not a
+periodically-recomputed one. Steady-state cost is O(changes), not O(corpus).
+
+**Full rebuild is a rare, bounded fallback** — for cold start, schema/algorithm change, or
+suspected corruption — and it is **explicitly not required to be cheap**. It respects each
+shard's envelope (it may be slow, throttled, or resumable for a rate-limited shard) and runs
+*concurrently with serving the existing derived tier*; it swaps in atomically on completion.
+I-2 guarantees rebuild is *possible and correct*, not instant.
+
+**Equivalence detection is indexed, not pairwise (review C-1).** Naive fingerprint/span-set
+comparison across all pages of all shards is O(N²) and is forbidden. Instead:
+
+1. **Blocking / candidate generation** — cheap keys bucket pages that *could* be equivalent:
+   normalised title, normalised path tail, explicit alias-table entries (coordination-
+   canonical), and **MinHash/LSH bands over content shingles** for near-duplicate and
+   derived-content detection. Only within-bucket pairs are considered, cutting the candidate
+   set from O(N²) pairs to ≈O(N) as long as bucket sizes stay bounded.
+2. **Verification** — candidate pairs are confirmed by full fingerprint / span-set overlap and
+   any curator binding. Confirmed equivalences become union edges.
+3. **Incremental maintenance** — a changed page is re-bucketed and only its *new* candidate set
+   is re-verified; equivalence is maintained per-change, never recomputed globally.
+
+**The index is itself derived** (disposable, recomputable) and per-tenant-partitioned (§9).
+Its parameters (LSH band/row counts, shingle size) are tunable against precision/recall targets; the accepted
+**false-negative rate of blocking** is a known, tracked limitation (§12) — blocking trades a
+small miss rate for tractability, and curator bindings are the escape hatch for misses.
+
 ---

 ## 9. Cross-cut — Authorization (L5)
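
Illustrative sketch of the §8.7 equivalence pipeline added by the hunk above (not part of the spec text). It shows one way the three steps (blocking, verification, incremental maintenance) could fit together for a single tenant. Everything named here is hypothetical and chosen for illustration: `EquivalenceIndex`, `upsert`, `block_keys`, the 8-band by 4-row LSH layout, the 5-character shingles, and the 0.6 Jaccard threshold. Plain Jaccard overlap stands in for the spec's "full fingerprint / span-set overlap" check, and the sketch assumes that a changed page's stale buckets and edges are dropped before it is re-bucketed, which the spec does not spell out.

```python
import hashlib
from collections import defaultdict

NUM_PERM, BANDS = 32, 8          # hypothetical tuning: 8 bands of 4 rows each
ROWS = NUM_PERM // BANDS
VERIFY_THRESHOLD = 0.6           # hypothetical Jaccard cut-off for verification


def shingles(text, k=5):
    """Character k-shingles of lower-cased, whitespace-normalised text."""
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(max(1, len(t) - k + 1))}


def _hash(seed, shingle):
    digest = hashlib.blake2b(f"{seed}:{shingle}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash(sh):
    """One minimum per seeded hash; matching slots estimate Jaccard similarity."""
    return [min(_hash(seed, s) for s in sh) for seed in range(NUM_PERM)]


def block_keys(title, sh):
    """Cheap blocking keys: the normalised title plus one key per LSH band."""
    sig = minhash(sh)
    keys = {("title", " ".join(title.lower().split()))}
    for b in range(BANDS):
        keys.add(("lsh", b, tuple(sig[b * ROWS:(b + 1) * ROWS])))
    return keys


class EquivalenceIndex:
    """Derived, disposable index for one tenant: buckets plus verified edges."""

    def __init__(self):
        self.buckets = defaultdict(set)   # block key -> page ids
        self.pages = {}                   # page id -> (block keys, shingle set)
        self.equiv = defaultdict(set)     # page id -> confirmed equivalent page ids

    def remove(self, page_id):
        """Drop a page's bucket memberships and edges (cost bounded by its own entries)."""
        keys, _ = self.pages.pop(page_id, (set(), set()))
        for key in keys:
            self.buckets[key].discard(page_id)
        for other in self.equiv.pop(page_id, set()):
            self.equiv[other].discard(page_id)

    def upsert(self, page_id, title, text):
        """Apply one change event: re-bucket the page, verify only its candidates."""
        self.remove(page_id)
        sh = shingles(text)
        keys = block_keys(title, sh)
        candidates = set()
        for key in keys:
            candidates |= self.buckets[key]   # step 1: within-bucket pairs only
            self.buckets[key].add(page_id)
        self.pages[page_id] = (keys, sh)
        for other in candidates:              # step 2: verify each candidate pair
            other_sh = self.pages[other][1]
            if len(sh & other_sh) / len(sh | other_sh) >= VERIFY_THRESHOLD:
                self.equiv[page_id].add(other)
                self.equiv[other].add(page_id)
        return candidates
```

Each change event maps to one `upsert` (or `remove` for a deletion), so the work per event is bounded by the page's own buckets rather than by the corpus. On the tunable parameters: with b bands of r rows, two pages whose shingle sets have Jaccard similarity s share at least one band with probability 1 - (1 - s^r)^b. For the 8 x 4 layout assumed here that is about 0.985 at s = 0.8 and about 0.06 at s = 0.3, which is the false-negative/false-positive trade-off the spec asks to track.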