generated from coulomb/repo-seed
spec(SHARD-WP-0005 T4): scale the union — incremental-first + indexed equivalence (§8.7)
Fixes C-1/C-2. Incremental change-driven maintenance (notify->delta) is primary; full rebuild is a rare, envelope-respecting, concurrent fallback (not required cheap). Equivalence via blocking/LSH candidate-gen + verify + incremental maintenance, replacing O(N^2). Index is derived, per-tenant. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
@@ -454,6 +454,43 @@ solved): the exact convergence bound for
|
||||
high-write CRDT shards under partition, and whether per-equivalence-set divergence needs a
|
||||
vector clock vs. a simple base-rev comparison, are deferred to implementation spikes.
|
||||
|
||||
### 8.7 Scaling the union — incremental-first, rebuild as fallback
|
||||
|
||||
The derived tier is *recomputable* (I-2) but recompute must never be the **operational**
|
||||
mechanism. A from-scratch rebuild reads every page of every shard — including rate-limited,
|
||||
paginated external APIs (Notion) and irreducibly-live sources — which can take hours to days
|
||||
and directly fights the operational-envelope axis (review C-2). So:
|
||||
|
||||
**Incremental, change-driven maintenance is the primary mechanism.** Each shard's `notify`
|
||||
capability (or a poll/ETag fallback where it has none, §8.8) emits **change events**; an event
|
||||
drives a **delta update** to exactly the affected union nodes, equivalence candidates, indexes,
|
||||
and projections. The derived tier is a continuously-maintained materialised view, not a
|
||||
periodically-recomputed one. Steady-state cost is O(changes), not O(corpus).
|
||||
|
||||
**Full rebuild is a rare, bounded fallback** — for cold start, schema/algorithm change, or
|
||||
suspected corruption — and it is **explicitly not required to be cheap**. It respects each
|
||||
shard's envelope (it may be slow, throttled, or resumable for a rate-limited shard) and runs
|
||||
*concurrently with serving the existing derived tier*; it swaps in atomically on completion.
|
||||
I-2 guarantees rebuild is *possible and correct*, not instant.
|
||||
|
||||
**Equivalence detection is indexed, not pairwise (review C-1).** Naive fingerprint/span-set
|
||||
comparison across all pages of all shards is O(N²) and is forbidden. Instead:
|
||||
|
||||
1. **Blocking / candidate generation** — cheap keys bucket pages that *could* be equivalent:
|
||||
normalised title, normalised path tail, explicit alias-table entries (coordination-
|
||||
canonical), and **MinHash/LSH bands over content shingles** for near-duplicate and
|
||||
derived-content detection. Only within-bucket pairs are considered — turning O(N²) into
|
||||
≈O(N) candidates.
|
||||
2. **Verification** — candidate pairs are confirmed by full fingerprint / span-set overlap and
|
||||
any curator binding. Confirmed equivalences become union edges.
|
||||
3. **Incremental maintenance** — a changed page is re-bucketed and only its *new* candidate set
|
||||
is re-verified; equivalence is maintained per-change, never recomputed globally.
|
||||
|
||||
**The index is itself derived** (disposable, recomputable) and per-tenant-partitioned (§9).
|
||||
Its parameters (LSH band/row counts, shingle size, precision/recall) are tunable; the accepted
|
||||
**false-negative rate of blocking** is a known, tracked limitation (§12) — blocking trades a
|
||||
small miss rate for tractability, and curator bindings are the escape hatch for misses.
|
||||
|
||||
---
|
||||
|
||||
## 9. Cross-cut — Authorization (L5)
|
||||
|
||||
@@ -90,7 +90,7 @@ sameness).
|
||||
|
||||
```task
|
||||
id: SHARD-WP-0005-T3
|
||||
status: todo
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "fb91f43f-3bf0-41f5-ad0a-bfd15a7fad17"
|
||||
```
|
||||
|
||||
Reference in New Issue
Block a user