spec(SHARD-WP-0005 T4): scale the union — incremental-first + indexed equivalence (§8.7)

Fixes C-1/C-2. Incremental change-driven maintenance (notify->delta) is primary; full rebuild is a rare, envelope-respecting, concurrent fallback (not required cheap). Equivalence via blocking/LSH candidate-gen + verify + incremental maintenance, replacing O(N^2). Index is derived, per-tenant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -454,6 +454,43 @@ solved): the exact convergence bound for
high-write CRDT shards under partition, and whether per-equivalence-set divergence needs a
vector clock vs. a simple base-rev comparison, are deferred to implementation spikes.

### 8.7 Scaling the union — incremental-first, rebuild as fallback

The derived tier is *recomputable* (I-2) but recompute must never be the **operational**
mechanism. A from-scratch rebuild reads every page of every shard, including rate-limited,
paginated external APIs (Notion) and irreducibly-live sources, which can take hours to days
and directly fights the operational-envelope axis (review C-2). So:

**Incremental, change-driven maintenance is the primary mechanism.** Each shard's `notify`
capability (or a poll/ETag fallback where it has none, §8.8) emits **change events**; an event
drives a **delta update** to exactly the affected union nodes, equivalence candidates, indexes,
and projections. The derived tier is a continuously-maintained materialised view, not a
periodically-recomputed one. Steady-state cost is O(changes), not O(corpus).

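As a rough illustration of the notify->delta path, the sketch below maintains a toy term index per change event. The `ChangeEvent` shape, the `DerivedTier` class, and the term index itself are hypothetical stand-ins (the real `notify` payload and the real union nodes and projections are not specified here); the only point is that each event touches the entries that changed and nothing else.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical event shape; not the actual `notify` payload."""
    shard: str
    page_id: str
    content: Optional[str]  # None means the page was deleted


@dataclass
class DerivedTier:
    """Toy materialised view: a term index maintained one change at a time."""
    terms_by_page: dict = field(default_factory=dict)  # (shard, page_id) -> set of terms
    pages_by_term: dict = field(default_factory=dict)  # term -> set of (shard, page_id)
    touched: int = 0  # index entries written so far, kept only for illustration

    def apply(self, event: ChangeEvent) -> None:
        key = (event.shard, event.page_id)
        old = self.terms_by_page.pop(key, set())
        new = set(event.content.lower().split()) if event.content is not None else set()
        # Only terms that actually changed are touched, so cost follows the delta.
        for term in old - new:
            self.pages_by_term[term].discard(key)
            if not self.pages_by_term[term]:
                del self.pages_by_term[term]
            self.touched += 1
        for term in new - old:
            self.pages_by_term.setdefault(term, set()).add(key)
            self.touched += 1
        if event.content is not None:
            self.terms_by_page[key] = new


tier = DerivedTier()
tier.apply(ChangeEvent("notion", "p1", "scaling the union"))
tier.apply(ChangeEvent("wiki", "p2", "the union tier"))
# A one-word edit to p1 writes a single index entry, regardless of corpus size.
tier.apply(ChangeEvent("notion", "p1", "scaling the derived union"))
```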
**Full rebuild is a rare, bounded fallback** (for cold start, schema/algorithm change, or
suspected corruption), and it is **explicitly not required to be cheap**. It respects each
shard's envelope (it may be slow, throttled, or resumable for a rate-limited shard) and runs
*concurrently with serving the existing derived tier*; it swaps in atomically on completion.
I-2 guarantees rebuild is *possible and correct*, not instant.

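One way to picture "build aside, serve the old tier, swap on completion" is the sketch below. Everything in it is an assumption made for illustration: `DerivedTierHolder`, `rebuild`, the `gate` event standing in for per-shard throttling, and the upper-casing standing in for real derivation. It relies on a reference assignment being the only step readers can observe, which holds for this in-process CPython toy and would need a different mechanism in a distributed deployment.

```python
import threading


class DerivedTierHolder:
    """Serves one complete snapshot while a replacement is built off to the side."""

    def __init__(self, snapshot: dict):
        self._snapshot = snapshot
        self._lock = threading.Lock()

    def read(self, key):
        # Readers only ever see a complete snapshot, either the old or the new one.
        return self._snapshot.get(key)

    def swap(self, fresh: dict) -> None:
        with self._lock:            # serialises competing rebuilds
            self._snapshot = fresh  # the swap is a single reference assignment


def rebuild(holder: DerivedTierHolder, shards: dict, gate: threading.Event) -> None:
    fresh = {}
    for shard_name, pages in shards.items():
        gate.wait()  # stand-in for throttling / rate-limit back-off on a slow shard
        for page_id, content in pages.items():
            fresh[(shard_name, page_id)] = content.upper()  # stand-in for derivation
    holder.swap(fresh)


holder = DerivedTierHolder({("wiki", "p1"): "STALE"})
gate = threading.Event()
worker = threading.Thread(target=rebuild, args=(holder, {"wiki": {"p1": "fresh"}}, gate))
worker.start()
during = holder.read(("wiki", "p1"))  # rebuild is still throttled: old tier is served
gate.set()
worker.join()
after = holder.read(("wiki", "p1"))   # rebuild finished: new tier is served
```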
**Equivalence detection is indexed, not pairwise (review C-1).** Naive fingerprint/span-set
comparison across all pages of all shards is O(N²) and is forbidden. Instead:

1. **Blocking / candidate generation.** Cheap keys bucket pages that *could* be equivalent:
   normalised title, normalised path tail, explicit alias-table entries
   (coordination-canonical), and **MinHash/LSH bands over content shingles** for near-duplicate
   and derived-content detection. Only within-bucket pairs are considered, turning O(N²)
   comparisons into ≈O(N) candidates as long as buckets stay small.
2. **Verification.** Candidate pairs are confirmed by full fingerprint / span-set overlap and
   any curator binding. Confirmed equivalences become union edges.
3. **Incremental maintenance.** A changed page is re-bucketed and only its *new* candidate set
   is re-verified; equivalence is maintained per-change, never recomputed globally.

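The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the spec's implementation: the class `EquivalenceIndex`, the word-level 3-shingles, the SHA-1 based MinHash, the 8x4 band layout, and the 0.8 Jaccard cut-off are all hypothetical choices, and plain Jaccard overlap stands in for the full fingerprint / span-set check. Path-tail keys, alias-table entries, and curator bindings are omitted.

```python
import hashlib
import re
from collections import defaultdict

NUM_HASHES, BANDS = 32, 8   # 8 bands x 4 rows; illustrative values only
ROWS = NUM_HASHES // BANDS
THRESHOLD = 0.8             # hypothetical verification cut-off


def shingles(text: str, k: int = 3) -> frozenset:
    words = re.findall(r"\w+", text.lower())
    return frozenset(" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1)))


def _hash(seed: int, shingle: str) -> int:
    digest = hashlib.sha1(f"{seed}:{shingle}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(sh: frozenset) -> tuple:
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_HASHES))


def blocking_keys(title: str, sh: frozenset) -> set:
    sig = minhash(sh)
    keys = {("title", " ".join(title.lower().split()))}  # normalised title
    for band in range(BANDS):                            # one key per LSH band
        keys.add(("lsh", band, sig[band * ROWS:(band + 1) * ROWS]))
    return keys


class EquivalenceIndex:
    def __init__(self):
        self.buckets = defaultdict(set)     # blocking key -> page ids
        self.pages = {}                     # page id -> (keys, shingles)
        self.neighbours = defaultdict(set)  # verified equivalence edges

    def remove(self, page_id: str) -> None:
        keys, _ = self.pages.pop(page_id, (set(), None))
        for key in keys:
            self.buckets[key].discard(page_id)
        for other in self.neighbours.pop(page_id, set()):
            self.neighbours[other].discard(page_id)

    def upsert(self, page_id: str, title: str, text: str) -> None:
        self.remove(page_id)                # step 3: re-bucket only the changed page
        sh = shingles(text)
        keys = blocking_keys(title, sh)
        candidates = set()
        for key in keys:                    # step 1: candidates come from shared buckets
            candidates |= self.buckets[key]
            self.buckets[key].add(page_id)
        self.pages[page_id] = (keys, sh)
        for other in candidates:            # step 2: verify candidates, nothing else
            other_sh = self.pages[other][1]
            if len(sh & other_sh) / len(sh | other_sh) >= THRESHOLD:
                self.neighbours[page_id].add(other)
                self.neighbours[other].add(page_id)


TEXT = "the derived tier is a continuously maintained materialised view"
index = EquivalenceIndex()
index.upsert("notion/1", "Roadmap", TEXT)
index.upsert("wiki/7", "Plan 2024", TEXT)  # same content, different title: LSH bucket hit
index.upsert("git/3", "roadmap", "an unrelated page about release checklists and tagging")
# git/3 shares a title bucket with notion/1 but fails verification, so no edge is added.
```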
**The index is itself derived** (disposable, recomputable) and per-tenant-partitioned (§9).
Its parameters (LSH band/row counts, shingle size, precision/recall) are tunable; the accepted
**false-negative rate of blocking** is a known, tracked limitation (§12): blocking trades a
small miss rate for tractability, and curator bindings are the escape hatch for misses.

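To make the band/row trade-off concrete: under the standard MinHash banding analysis, two pages whose shingle sets have Jaccard similarity s share at least one band with probability 1 - (1 - s^r)^b, for b bands of r rows. The helper names below are hypothetical and the sketch assumes ideal, independent hash functions; it is meant for reasoning about parameter choices, not as a statement of the values this spec uses.

```python
def candidate_probability(s: float, bands: int, rows: int) -> float:
    """Chance that a pair with Jaccard similarity s lands in a shared LSH bucket."""
    return 1.0 - (1.0 - s ** rows) ** bands


def false_negative_rate(s: float, bands: int, rows: int) -> float:
    """Chance that blocking misses such a pair (the tracked limitation)."""
    return (1.0 - s ** rows) ** bands


def approx_threshold(bands: int, rows: int) -> float:
    """Common rule of thumb for the similarity where the S-curve rises steeply."""
    return (1.0 / bands) ** (1.0 / rows)


# More bands catch more pairs (fewer misses, more candidates to verify);
# more rows per band make each bucket stricter (fewer candidates, more misses).
for bands, rows in [(8, 4), (16, 4), (8, 8)]:
    print(bands, rows, round(approx_threshold(bands, rows), 3),
          round(false_negative_rate(0.8, bands, rows), 4))
```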
---
## 9. Cross-cut — Authorization (L5)
@@ -90,7 +90,7 @@ sameness).
```task
id: SHARD-WP-0005-T3
-status: todo
+status: done
priority: high
state_hub_task_id: "fb91f43f-3bf0-41f5-ad0a-bfd15a7fad17"
```