spec(SHARD-WP-0005 T4): scale the union — incremental-first + indexed equivalence (§8.7)

Fixes C-1/C-2. Incremental change-driven maintenance (notify->delta) is
primary; full rebuild is a rare, envelope-respecting, concurrent fallback
(not required cheap). Equivalence via blocking/LSH candidate-gen + verify +
incremental maintenance, replacing O(N^2). Index is derived, per-tenant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 01:36:09 +02:00
parent 04be66161e
commit dc451b0f4e
2 changed files with 38 additions and 1 deletions

@@ -454,6 +454,43 @@ solved): the exact convergence bound for
high-write CRDT shards under partition, and whether per-equivalence-set divergence needs a
vector clock vs. a simple base-rev comparison, are deferred to implementation spikes.
### 8.7 Scaling the union — incremental-first, rebuild as fallback
The derived tier is *recomputable* (I-2) but recompute must never be the **operational**
mechanism. A from-scratch rebuild reads every page of every shard — including rate-limited,
paginated external APIs (Notion) and irreducibly-live sources — which can take hours to days
and directly fights the operational-envelope axis (review C-2). So:
**Incremental, change-driven maintenance is the primary mechanism.** Each shard's `notify`
capability (or a poll/ETag fallback where it has none, §8.8) emits **change events**; an event
drives a **delta update** to exactly the affected union nodes, equivalence candidates, indexes,
and projections. The derived tier is a continuously-maintained materialised view, not a
periodically-recomputed one. Steady-state cost is O(changes), not O(corpus).
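
A minimal sketch of the notify-to-delta path described above, assuming a toy derived tier that holds only a normalised-title index. `ChangeEvent`, `DerivedTier`, and their fields are hypothetical names used for illustration; the spec does not define them.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical change event emitted by a shard's notify capability."""
    shard: str
    page_id: str
    kind: str          # "upsert" or "delete"
    title: str = ""


@dataclass
class DerivedTier:
    """Toy materialised view: a normalised-title index kept current by deltas."""
    pages: dict = field(default_factory=dict)                      # (shard, page_id) -> title
    by_title: dict = field(default_factory=lambda: defaultdict(set))  # title -> page keys

    def apply(self, event: ChangeEvent) -> None:
        key = (event.shard, event.page_id)
        # Retract this page's previous contribution, if it had one.
        old = self.pages.pop(key, None)
        if old is not None:
            self.by_title[old].discard(key)
            if not self.by_title[old]:
                del self.by_title[old]
        # Add the new contribution for upserts; deletes stop after retraction.
        if event.kind == "upsert":
            title = " ".join(event.title.lower().split())
            self.pages[key] = title
            self.by_title[title].add(key)


tier = DerivedTier()
tier.apply(ChangeEvent("notion", "p1", "upsert", "Road  Map"))
tier.apply(ChangeEvent("wiki", "w9", "upsert", "road map"))
tier.apply(ChangeEvent("notion", "p1", "delete"))
```

Each call touches only the entries for the one changed page, which is the O(changes) property the paragraph asks for. A real derived tier would retract and re-add union nodes, equivalence candidates, and projections in the same way.
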
**Full rebuild is a rare, bounded fallback** — for cold start, schema/algorithm change, or
suspected corruption — and it is **explicitly not required to be cheap**. It respects each
shard's envelope (it may be slow, throttled, or resumable for a rate-limited shard) and runs
*concurrently with serving the existing derived tier*; it swaps in atomically on completion.
I-2 guarantees rebuild is *possible and correct*, not instant.
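
The rebuild-then-swap behaviour can be sketched as follows. This is an illustration under assumptions, not the specified design: `DerivedTierHandle`, `rebuild`, and the `checkpoint` argument are hypothetical, and the sketch runs the rebuild inline where a real system would run it on a background worker.

```python
import threading


class DerivedTierHandle:
    """Holds the derived tier currently being served."""

    def __init__(self, initial):
        self._current = initial
        self._lock = threading.Lock()

    def get(self):
        # Readers always see a complete tier: the old one or the new one.
        return self._current

    def swap(self, fresh):
        with self._lock:
            self._current = fresh


def rebuild(handle, shards, checkpoint=None):
    """Rebuild into a staging dict, then swap it in as a single step.

    `shards` maps a shard name to a callable returning (page_id, title) pairs.
    `checkpoint` carries shards that an interrupted run already finished.
    """
    staging = {} if checkpoint is None else checkpoint
    for name, read_pages in shards.items():
        if name in staging:
            continue  # resumed run: this shard was already read in full
        # The read may be slow or throttled; serving continues from handle.get().
        staging[name] = dict(read_pages())
    handle.swap(staging)
    return staging


handle = DerivedTierHandle({"notion": {"p1": "Old title"}})
rebuild(handle, {"notion": lambda: [("p1", "New title")]})
```

A shard is recorded in `staging` only after its read completes, so a failed run leaves a checkpoint that can be passed back in to resume without re-reading finished shards.
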
**Equivalence detection is indexed, not pairwise (review C-1).** Naive fingerprint/span-set
comparison across all pages of all shards is O(N²) and is forbidden. Instead:
1. **Blocking / candidate generation** — cheap keys bucket pages that *could* be equivalent:
normalised title, normalised path tail, explicit alias-table entries (coordination-
canonical), and **MinHash/LSH bands over content shingles** for near-duplicate and
derived-content detection. Only within-bucket pairs are considered, which cuts the candidate
set from O(N²) to ≈O(N) provided bucket sizes stay bounded.
2. **Verification** — candidate pairs are confirmed by full fingerprint / span-set overlap and
any curator binding. Confirmed equivalences become union edges.
3. **Incremental maintenance** — a changed page is re-bucketed and only its *new* candidate set
is re-verified; equivalence is maintained per-change, never recomputed globally.
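
The three steps above can be sketched with a small in-memory index. It is a minimal illustration, assuming word-level shingles, MinHash built from seeded BLAKE2b hashes, and exact Jaccard overlap as the verification step. `EquivalenceIndex`, its defaults (16 bands of 2 rows, threshold 0.5), and the page ids are hypothetical; title, path-tail, and alias keys would be added as further bucket keys in the same way.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Set of k-word shingles; short texts yield a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(shingle_set, num_perm):
    """One minimum per seeded hash function; deterministic across runs."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "little")
            for s in shingle_set))
    return sig


class EquivalenceIndex:
    def __init__(self, bands=16, rows=2, threshold=0.5):
        self.bands, self.rows, self.threshold = bands, rows, threshold
        self.buckets = defaultdict(set)     # (band number, band values) -> page ids
        self.keys = {}                      # page id -> bucket keys it occupies
        self.shingles = {}                  # page id -> shingle set, kept for verification
        self.neighbours = defaultdict(set)  # confirmed equivalence edges

    def remove(self, page_id):
        """Retract a page from its buckets and drop its edges."""
        for key in self.keys.pop(page_id, []):
            self.buckets[key].discard(page_id)
        self.shingles.pop(page_id, None)
        for other in self.neighbours.pop(page_id, set()):
            self.neighbours[other].discard(page_id)

    def upsert(self, page_id, text):
        """Re-bucket one changed page and verify only its new candidates."""
        self.remove(page_id)
        sh = shingles(text)
        sig = minhash(sh, self.bands * self.rows)
        keys = [(b, tuple(sig[b * self.rows:(b + 1) * self.rows]))
                for b in range(self.bands)]
        candidates = set()
        for key in keys:                    # step 1: candidate generation
            candidates |= self.buckets[key]
            self.buckets[key].add(page_id)
        self.keys[page_id], self.shingles[page_id] = keys, sh
        for other in candidates:            # step 2: verification by Jaccard overlap
            theirs = self.shingles[other]
            if len(sh & theirs) / len(sh | theirs) >= self.threshold:
                self.neighbours[page_id].add(other)
                self.neighbours[other].add(page_id)
        return candidates


index = EquivalenceIndex()
index.upsert("notion:p1", "the quick brown fox jumps over the lazy dog")
index.upsert("wiki:w9", "the quick brown fox jumps over the lazy dog")
```

Work per change is proportional to the number of bands plus the size of the changed page's candidate set; no other pair is revisited. This sketch does not prune empty buckets or handle curator bindings.
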
**The index is itself derived** (disposable, recomputable) and per-tenant-partitioned (§9).
Its parameters (LSH band/row counts, shingle size, target precision/recall) are tunable; the accepted
**false-negative rate of blocking** is a known, tracked limitation (§12) — blocking trades a
small miss rate for tractability, and curator bindings are the escape hatch for misses.
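
For the MinHash/LSH keys specifically, the miss rate follows from the standard banding result: with b bands of r rows, a pair with Jaccard similarity s shares at least one band with probability 1 − (1 − s^r)^b. The helper below is a sketch for exploring that trade-off when choosing band and row counts. The function names are hypothetical, and the formula covers only the LSH keys, not title, path, or alias blocking.

```python
def lsh_miss_rate(similarity, bands, rows):
    """Probability that a pair at this Jaccard similarity shares no LSH band."""
    return (1.0 - similarity ** rows) ** bands


def lsh_threshold(bands, rows):
    """Approximate similarity where the candidate probability rises steeply."""
    return (1.0 / bands) ** (1.0 / rows)


for s in (0.3, 0.5, 0.8):
    print(s, lsh_miss_rate(s, bands=20, rows=5))
```

With 20 bands of 5 rows, pairs at similarity 0.8 are missed far less than 1% of the time, while pairs at 0.5 are missed a little over half the time. More bands lower the miss rate at the cost of more candidates to verify.
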
---
## 9. Cross-cut — Authorization (L5)