# CoreArchitectureBlueprint — shard-wiki Status: **draft for review** · Date: 2026-06-15 · Owner: tegwick The whole-system architecture for shard-wiki, synthesised from `INTENT.md`, the 84-entry `UseCaseCatalog.md`, and the full research arc (`research/260608-*`, `research/260613-*`, `research/260614-*` — ~23 wiki/knowledge systems plus two cross-dive syntheses). This is the **core** blueprint: it defines the layers, the abstractions, and the load-bearing decisions that everything else implements. Scope relationship to the other specs: - **`ArchitectureBlueprint.md`** (existing) is the **authorization & history sub-blueprint** (the L0–L4 ladder). This document references it as the design of the cross-cutting authorization layer (§9) and does not restate it. - **`SHARD-WP-0002`** is the workplan that turns §6–§8 into `spec/FederationArchitecture.md` + the adapter-contract section of `spec/TechnicalSpecificationDocument.md`. - **`UseCaseCatalog.md`** is the demand this architecture must satisfy; UC references below are load tests, not decoration. --- ## 1. The thesis: *canonical vs derived* (three states) Everything in shard-wiki follows from one organising decision — that state comes in exactly **three kinds**, and only one of them is disposable: > **1. Sharded-canonical** — content owned by each shard (shard sovereignty). > **2. Coordination-canonical** — durable state *born inside shard-wiki* that encodes human > or cross-shard decisions and exists nowhere else: overlays (the local truth against a > read-only shard), curator equivalence bindings, alias tables, merge/reconciliation > decisions. It lives in the **Git coordination journal**. > **3. Derived-disposable** — everything shard-wiki *computes* from (1)+(2): the union graph, > equivalence index, query indexes, projections, views. It can be deleted and recomputed. > > **Canonical = sharded ∪ coordination. Derived = a pure function of canonical:** > `derived = f(sharded, coordination)`. This is the architectural form of "orchestrator, not engine." shard-wiki never *becomes* the source of truth; it composes sources and records the decisions it makes about them. The research earned the *files-canonical* half empirically — every serious system externalises its durable truth to files+VCS and treats the rest as derived: Logseq (DataScript index over plain files), ikiwiki (static HTML compiled from a git repo), Glamorous Toolkit / Lepiter (live views over git-versioned JSON), Pharo (Tonel/Iceberg code as git text), Jupyter teams (nbstripout — outputs are derived noise). The one tradition that refuses this — the Smalltalk **image** — is exactly the one we record as a *boundary, not a backend* (`research/260614-squeak-pharo-deep-dive`). The earlier draft of this blueprint used a two-bucket framing ("canonical at the edges, derived in the middle"). That was wrong by omission: it had no home for **coordination- canonical** state, and so contradicted itself by listing curator bindings and alias tables as "derived/rebuildable" when a human binding manifestly cannot be rebuilt. The three-state model fixes that crack (review finding A-1) and makes `derived = f(canonical)` *literally* true. Three consequences fall straight out, and they are the spine of the rest of this document: 1. **Graceful degradation is free.** If the derived tier is always recomputable, a backend that can only be read is still a first-class participant — you just derive less from it. 2. **Provenance is tractable.** Because shard-wiki never claims to *be* the source, every derived artifact can always point back to the canonical input it came from (union without erasure is a structural property, not a feature bolted on). 3. **The derived tier is a pure function of canonical state.** `derived = f(sharded, coordination)`. Bugs in the derived tier are recoverable by recompute; only the two canonical tiers must be durably protected — sharded by each shard, coordination by the Git journal (history). *Recomputability is a correctness property of the derived tier, not a promise that a from-scratch rebuild is operationally cheap — see §8.4 and the operational-envelope axis.* ### The dual narrow waist Heterogeneity is mediated at exactly two interfaces, and nowhere else: - **Bottom waist — the Shard Adapter Contract (§6).** Every backend, however weird, enters through one versioned, capability-described interface. - **Top waist — the Wiki Page Model (§7).** Every consumer, however demanding, sees one backend-neutral, Markdown-first-but-stretchable page model. Between the waists, core logic is written **once** against capabilities and the page model — never against a specific backend. Adding TiddlyWiki or Notion or a git forge is writing an adapter and declaring a capability profile, not editing core algorithms. --- ## 2. Architectural invariants These are non-negotiable. Violating one is a design bug, not a tradeoff. They are INTENT's principles fused with the research through-lines. | # | Invariant | Source | |---|-----------|--------| | I-1 | **Orchestrator, not engine.** Core composes shards; it never replaces or homogenises them. | INTENT Stability Note | | I-2 | **Three states; derived = f(canonical).** State is sharded-canonical, coordination-canonical (journal), or derived-disposable. The derived tier (union/index/projection) is a pure, recomputable function of the two canonical tiers; only canonical state is durably protected. | §1; Logseq/ikiwiki/GT through-line | | I-3 | **Capability-awareness is data.** A binding's abilities are a *profile* (positions on spectra), read by generic core logic — not per-backend branches. | synthesis v3 §2; INTENT capability-aware adapters | | I-4 | **Union without erasure.** Every page/revision/projection/overlay/view carries its provenance, freshness, liveness, and divergence. | INTENT; provenance-granularity spectrum (Wikibase) | | I-5 | **Overlay before mutation.** Writes to anything below write-through land as drafts/patches/MRs first; no silent remote mutation. | INTENT | | I-6 | **Git-addressable coordination.** Every information space has a Git-backed journal even when its shards are not git-native. | INTENT | | I-7 | **Mechanism over policy.** Canonical-source, conflict, editorial, sync cadence are configurable presets, never hard-coded. | INTENT | | I-8 | **Graceful degradation.** A limited backend is still usable as read-only / cache / projection / backup / patch target. | INTENT | | I-9 | **Identity ≠ placement.** A page is an entity that may occupy N locations; address by identity, not by path. | Trilium note/branch; ZigZag | | I-10 | **History is the floor.** Every write is a recoverable commit; recoverability, not gatekeeping, is the baseline protection. | ArchitectureBlueprint §2 | | I-11 | **Authorization in core, authentication delegated.** Core decides who-may; an external provider says who-is. | INTENT; ArchitectureBlueprint | | I-12 | **Not a file-sync daemon; not an execution platform.** Sync is wiki-page-semantic; computation is recognised+projected, not hosted. | INTENT; computational-page-model synthesis | --- ## 3. The layered architecture ``` ┌───────────────────────────────────────────────────────────────┐ │ L6 Consumers — Orchestrator API · CLI/agents · Web/Obsidian │ ├───────────────────────────────────────────────────────────────┤ X-cut │ L5 Authorization (PEP/PDP, identity-provider iface) → │ X-cut Prove- │ see ArchitectureBlueprint.md (L0–L4 ladder) │ Capa- nance ├───────────────────────────────────────────────────────────────┤ bility ▲ │ L4 Union & Projection (DERIVED, rebuildable cache) │ ▲ │ │ identity resolution · equivalence/chorus · union graph · │ │ │ │ replication+derivation projections · moldable view registry│ │ │ │ · derived query index │ │ │ ├───────────────────────────────────────────────────────────────┤ │ │ │ L3 Coordination (Git journal · overlay/patch engine · │ │ │ │ federation-model strategies · reconciliation) │ │ │ ├───────────────────────────────────────────────────────────────┤ │ │ │ L2 Wiki Page Model ── TOP WAIST ── │ │ │ │ backend-neutral pages · identity≠placement · span address ·│ │ │ │ provenance envelope · the page shapes │ │ │ ├───────────────────────────────────────────────────────────────┤ │ │ │ L1 Shard Adapter Contract ── BOTTOM WAIST ── │ │ │ │ versioned iface · capability profile (15 spectra) · │ │ │ │ attachment-mode binding · operation verbs │ │ └──── ├───────────────────────────────────────────────────────────────┤ ──┘ │ L0 Backends (not ours): git repos, wiki/ subdirs, Gitea/ │ │ GitLab/GitHub wikis, folders, Obsidian, WebDAV, Notion, │ │ Coulomb spaces, notebooks, … │ └───────────────────────────────────────────────────────────────┘ ``` **Provenance** and **Capability** are drawn as vertical rails because they are not layers — they are present at every layer. A page object at L2 carries provenance; a projection at L4 carries provenance; an authz decision at L5 records the context under which content was read. Likewise a capability profile declared at L1 is consulted at L3 (can we write-through?), L4 (can we delegate a query?), and L5 (can this principal even reach the op?). The dependency rule is strict and downward, and it tracks the **three states (§1)**, not the layer numbers: **the derived-disposable tier (the whole of L4) may be deleted and recomputed from canonical state (sharded content at L1 + coordination-canonical state in the L3 journal).** Nothing canonical may depend on derived state. Note the journal at L3 is *canonical* (it holds overlays, bindings, aliases, merges); only L4 is disposable. --- ## 4. Core abstractions (the vocabulary code must use) Straight from INTENT, sharpened by research. New code maps onto these; it does not invent parallel terms. - **Shard** — an independently meaningful page store attached to a root entity, with *sovereignty*: its own backend, capability profile, history, identity model, limits. - **Root entity / information space** — the joined space shards attach to; the unit of Git coordination and of multi-tenancy (a tenant maps to a root entity, ArchitectureBlueprint). - **Shard adapter contract** — the versioned L1 interface; the bottom waist. - **Capability profile** — a shard binding's position on each of the 15 spectra (§6) plus its supported verbs. *The* data structure that drives degradation. - **Wiki page model** — the L2 backend-neutral page; the top waist. - **Page identity vs placement** — a page is an entity (identity); it may have N placements (paths/shards). Addressing, equivalence, and transclusion key on identity (I-9). - **Provenance envelope** — the metadata wrapper every artifact carries: source shard, freshness, liveness, authorization context, overlay status, divergence, derivation lineage. - **Coordination journal** — the L3 Git-backed record of change flows for a space, and the durable home of all **coordination-canonical** state (§1): overlays, curator equivalence bindings, alias tables, merge/reconciliation decisions. This state is born inside shard-wiki, exists nowhere else, and is *not* derived — it must be committed, never recomputed. - **Overlay** — a non-destructive local edit against a remote/read-only/limited shard, representable as draft/patch/commit/MR before destructive apply. Coordination-canonical: an unapplied overlay is the local truth and lives in the journal. - **Projection** — a derived view of shard content, typed on two axes (§8): *kind* (replication | derivation) × *liveness* (static … irreducibly-live). - **Federation model** — the selected coordination strategy for a space (§ taxonomy, T17). - **Shard mode** — read-only · write-through · mirrored · projected · cached · canonical (a *policy* selection constrained by the capability profile). --- ## 5. Why "layered" and not "pipeline" or "plugin-bus" Two rejected alternatives, recorded so the choice is legible: - **A sync pipeline** (source → transform → sink) was rejected: it implies a privileged direction and a canonical sink, which violates shard sovereignty (I-1) and union-without- erasure (I-4). shard-wiki is a *star* (many shards ↔ one space), not a pipe. - **A flat plugin bus** (every backend a peer plugin emitting events) was rejected as the *top-level* shape: it has no narrow waist, so heterogeneity leaks into every consumer. We keep the plugin idea but confine it to L1 (adapters) and L3 (federation strategies), behind the waists. The layered-with-rails shape is what makes I-2/I-3/I-4 hold simultaneously. --- ## 6. Bottom waist — the Shard Adapter Contract (L1) The single most important design decision in the project: **the adapter contract models positions on capability spectra, not a flat checklist of boolean verbs.** A backend is not "can/can't merge"; it sits *somewhere* on the merge spectrum, and federation operations degrade by position. This is the lesson of putting ~23 systems in one matrix (`research/260614-shard-spectrum-synthesis`, v3). ### 6.1 The fifteen capability spectra Each binding declares a position on each axis. Core algorithms read these positions; there is no per-backend code in core (I-3). 1. **Addressing granularity** — none → path → page-level store-id → in-file span → in-file block id (Logseq `id::`) → store-UUID → portable tumbler (Xanadu, the unreached ideal) 2. **Content identity** — none → path/title → fingerprint → span-set 3. **Identity vs placement** — path=identity → identity separated from placement (Trilium note/branch = a DAG) 4. **Structure** — flat MD → frontmatter/`key::` → `%META%` → typed objects → DB schema+ relations → object-graph/ontology → computed (inherited+templated) → typed-graph statements 5. **History** — none → internal-only / CRDT-log → open-file → git-native 6. **Merge model** — none → git/text → conflict-notes/keep-both → native-CRDT → coexist-with-rank 7. **Native query** — none → text → build-your-own derived index → datalog/graph → DB query → SPARQL 8. **Translation** — native → lossless → lossy-with-fidelity-report (incl. HTML) 9. **Attachment mode** — file-store (native | interchange-mirror) → git-IS-store → in-engine-host → local-REST → external-API → direct-DB → CRDT-replica → P2P/no-central-endpoint 10. **Operational envelope** — local/unbounded → realtime CRDT/WebSocket → rate-limited/ eventually-consistent/paginated 11. **Access grant** — open → token → OAuth scoped+revocable → P2P key/invite → enterprise ACL 12. **Content opacity** — plaintext → structured re-evaluable value → encrypted whole-shard → per-item → proprietary-lossy-exportable 13. **Write granularity** — whole-file (TiddlyWiki) → per-page → section/anchor → per-block → story-item 14. **Provenance granularity** — per-shard → per-page → per-edit → per-statement/value (Wikibase rank+refs) 15. **Computational / liveness** — static source → captured-output snapshot → live-over-files → view-time render → irreducibly-live/temporal ### 6.2 Operation verbs `read, write, diff, merge, lock, version, publish, notify, transclude-source, translate-syntax, structured-payload, derive-projection, execute/evaluate`. The last two are **gated, off by default** (§8, computational content). Verb support is part of the profile and must reconcile with the federation-ops capability matrix (SHARD-WP-0002 T10). ### 6.3 Attachment-mode taxonomy (axis 9, expanded) A backend may offer **several** modes; attach mode is a **per-binding, capability-gated choice**, with one declared authoritative. Modes: file-store (native vault/folder *or* an interchange/sync mirror), **git-IS-store** (the home case — forge wikis & ikiwiki: git is the store *and* the journal at once, resolving the engine-mirror write-race), in-engine hosted adapter (XWiki component, Obsidian/Logseq/Roam plugin, Trilium script), local-REST (Joplin Data API, Trilium ETAPI), external-API-only (Notion), direct-DB (MojoMojo schema→model), CRDT-replica (Anytype/AFFiNE/AppFlowy), P2P/no-central-endpoint. **Boundary:** a monolithic live-memory blob (Smalltalk image, a kernel) is **never** an attach target — it participates only via export→files (I-12). ### 6.4 Contract rules - **Versioned interface** (Foswiki::Store + Foswiki::Meta is the proof that a stable store-interface-with-swappable-backends works). Capability discovery is a static profile with optional runtime negotiation. - **Backend-swap tolerance** — shard identity/provenance survives a substrate change (RCS↔PlainFile, folder→Git, Logseq file→SQLite): bind to *capabilities*, not to "it's files." - **Absence is first-class** — the profile must express *can't* cleanly (Oddmuse floor), so degradation paths are explicit, never guessed. --- ## 7. Top waist — the Wiki Page Model (L2) Backend-neutral, **Markdown-first but stretchable many ways at once**. The page model is the lingua franca every consumer sees; an adapter's job is to project its backend into this model (read) and accept overlays back (write), within its capabilities. ### 7.1 Page shapes the model must carry - **Prose Markdown** — the baseline. - **Typed / computed records** — frontmatter/`%META%`/XObjects/Notion DB rows; **computed metadata** (Trilium inherited+templated) represented as *effective-vs-own with per-attribute provenance*. - **Typed-graph statements** — Wikibase claim + qualifiers + references + rank (structure far-end). - **Inline-embedded objects** — Quip/Notion spreadsheets & live apps inside prose. - **Non-Markdown assets** — drawings, canvases, images: typed asset / opaque blob / pluggable content-type registry, never silent-flattened. - **The four computational shapes** (§8): one-source-many-projections, notebook (embedded computed output), program-as-page, live/temporal. All shapes reduce to a common skeleton: **`(content | source, structure, provenance envelope, [derivation rule])`**. The page model stores the richest faithful form as canonical and treats any Markdown rendering of a non-Markdown shape as a *lossy projection* (I-4 + fidelity report). ### 7.2 Identity, placement, addressing — three distinct concepts The earlier draft used "identity" for two different things and (worse) suggested deriving page identity from a content fingerprint — which would make *editing a page change its identity* and break every reference to it (review bug B-1). They are pulled apart here: - **Page identity — a *stable handle*.** A shard-scoped, durable key that **survives edits**: the backend's native page/note id where one exists (Roam/Notion/Trilium uid, a git path treated as a name, a wiki page name), wrapped in a shard scope so it survives projection and never collides across shards. Identity is *assigned/minted, not computed from content*. References, placement, transclusion targets, and overlays all key on identity. - **Placement — *where* an identity sits.** One identity → N placements (paths/shards) = a DAG; no single canonical path (I-9). Placement can change without changing identity. - **Content equivalence — *detecting sameness*, never identity.** A **content fingerprint** (or span-set overlap) identifies a *version / a piece of content*, used to detect that two *distinct identities* hold the same or derived content (the equivalence/chorus mechanism, §8.4). A fingerprint is never a page's identity: same page, edited → new fingerprint, **same identity**; two pages, identical content → same fingerprint, **different identities**. - **Span addressing** — a sub-page address within an identity: adopt native span IDs where minted (Roam `:block/uid`, Logseq `id::`, Notion/CRDT UUID); else a *position* address (path+range) or a *content-fingerprint* address for equivalence/transclusion. The Xanadu tumbler is the portable ideal the scheme aims at without requiring. - **Provenance envelope** rides on pages and spans (see §7.3 for its layered, low-cost form). So the chain is: **identity (stable) → placements (N, mutable) → equivalence (cross-identity sameness, fingerprint-based)** — three concepts, three mechanisms, never conflated. --- ## 8. Coordination, federation & projection ### 8.1 Coordination journal (L3) — Git as the spine Every information space has a Git-backed coordination journal (I-6). It records cross-shard operations (fork, import, reconcile, overlay-apply, space-branch) and **is** the history floor (I-10). For git-IS-store shards the shard's own git log *is* this journal; for non-git shards the journal supplements (begins-now / mirrors-forward / snapshots-replica) or imports (backfill open file history). History portability is a spectrum, handled per profile (axis 5). ### 8.2 Overlay / patch engine (L3) The default write path for anything below write-through capability (I-5): an edit becomes a draft → patch/commit → MR, applied destructively only on explicit intent and only where the profile + policy both permit. This is what lets a read-only or rate-limited or lossy backend still be *edited* safely. ### 8.3 Federation is plural & composable (L3) — the model taxonomy Federation is not one mechanism. shard-wiki selects a **federation model per space and composes per shard** (mechanism over policy, I-7): | Model | Anchor | Coordination shape | |-------|--------|--------------------| | **Fork + journal** (default home case) | Federated Wiki | copy-with-provenance + per-page action journal (story = replay) | | **VCS-replication + ping** | ikiwiki | git clone/pull/push + change-ping | | **Query-time graph-join** | Wikibase SPARQL `SERVICE` | join remote graphs at query time, no copy | | **Feed aggregation** | RSS/Atom | inbound feed → pages | | **Activity streams** | ActivityPub | Create/Update events, notify or content-bearing | | **Engine-mirror** | Wiki.js DB↔Git | engine syncs its own store to a git mirror | ### 8.4 Union & projection (L4) — the derived cache This whole layer is **derived-disposable**: recomputable from canonical state — sharded content + the **coordination-canonical** inputs in the journal (I-2). Crucially, the *automatic* equivalence results are derived, but the **human/curatorial inputs they consume — alias tables and curator equivalence bindings — are coordination-canonical (they live in the journal), not derived**; recompute reads them, never regenerates them. It comprises: - **Identity resolution & equivalence** — detect "same topic / derived content" path- independently from *derived* signals (content fingerprint, span-set overlap) **plus** the *coordination-canonical* inputs (alias table, curator binding); present as **chorus-of-voices** or designated-canonical (a *policy* preset). (Scaling: §8.7.) - **Union graph** — the navigable join of pages, links, and dimensions (namespace, genealogy, version, shard, equivalence). A *derived lens over canonical files+journal, never a new store* (the ZigZag boundary). - **Transclusion** — one **reference-not-copy** primitive unifying Xanadu transclusion, ZigZag clone, Roam/Obsidian/Logseq embed, Notion synced block, Trilium note-cloning, and literate named-chunk assembly, over the addressable union. - **Projection — the two-axis model:** - *Kind:* **replication-projection** (lazy cache of remote content — the default) vs **derivation-projection** (transform/compile/weave/evaluate a source). - *Liveness:* static → captured snapshot → live-over-files → view-time → irreducibly-live. - Derivation facets: materialization timing (ahead-of-time vs view-time), multiplicity (one output vs N co-equal), continuity (one-shot vs continuous). Every projection declares its liveness + freshness + provenance; the irreducibly-live far end has no faithful static form (source + a marked recording). - **Moldable view registry** — projection generalises to an **open, type-keyed set of co-equal, possibly-computed views, none canonical-by-fact** (display-canonical is policy). This unifies replication/derivation/dimensional/query projection and answers the "pluggable content-type registry" question (GT prior art). - **Derived query index** — delegate to a shard's native query engine where present (Roam/Logseq Datalog, Notion DB query, XWiki XWQL, Wikibase SPARQL); else build a derived index over the projection (the Logseq DataScript-over-files pattern). The index is disposable (I-2). ### 8.5 Computational / executable content — the scope decision **In scope as a page-model + projection concern; out of scope as an execution platform.** shard-wiki *recognises* computational types, attaches the **canonical source**, and presents derived forms as **provenance- and liveness-marked projections**. Driving a derivation (tangle/weave, re-execute a notebook, render a sketch, evaluate a pattern) is a **gated capability, off by default, with a trust/sandbox concern, degrading to a captured snapshot**. One snapshot-provenance record (run id, source rev, timestamp, environment "unguaranteed") serves notebooks, renders, and recordings alike. **No INTENT amendment is required** — this lives inside the existing page model (L2) and projection model (L4). ### 8.6 Consistency, concurrency & conflict model INTENT makes real-time cross-shard consistency a non-goal — but "no strong consistency" is not the same as "no defined consistency." This is the guarantee shard-wiki *does* offer, and the mechanism (not policy) that makes concurrent editing safe (review bug B-2). **The consistency guarantee — causal, anchored on the journal:** - **Read-your-writes for coordination-canonical state.** Once an overlay/binding/merge is committed to the journal, this client always sees it (the journal is the client's own causal spine). This is a *strong* local guarantee, cheap because the journal is local Git. - **Causal consistency across the derived tier.** The union/index/projections reflect a causal cut of `(sharded inputs seen so far, journal)`. Effects never appear before their causes; a projection that has seen journal commit *C* has seen everything *C* depends on. - **Eventual convergence for sharded-canonical inputs.** Remote shard content is pulled asynchronously (lazily or by notify/poll, §8.7); the union converges to each shard's latest *as observed*, bounded by the shard's operational envelope. Freshness is always *shown* (provenance envelope), never faked — a stale projection is labelled stale, not wrong. So: **strong + read-your-writes for what shard-wiki owns (the journal); causal for what it derives; eventual + freshness-labelled for what shards own.** No global clock, no distributed transaction, no two-phase commit across shards — none is needed, because shard-wiki coordinates rather than controls. **Conflict detection & representation is core mechanism; only resolution is policy (I-7).** The split the earlier draft elided: - **Detection (core).** Divergence is detected structurally: two identities resolve as equivalent (§8.4) but their content fingerprints differ, or an overlay's base revision no longer matches the shard's current revision. Detection is always on; it is never optional. - **Representation (core).** A detected conflict is **first-class data in the union**, not an error: equivalent-but-divergent pages are presented as a **coexisting set** (the chorus/keep-both representation), each fully attributed, with the divergence recorded in the provenance envelope (union without erasure — a conflict is information, not a failure). - **Resolution (policy).** *Which* version wins, or whether they stay coexisting, is a configurable preset (§10): chorus / designated-canonical / git-merge / vote-to-merge / overlay-only. Core never hard-codes one. **Overlay-apply under source drift (the concurrent-write case).** An overlay carries the **base revision** of the shard content it was authored against. On apply, core compares base to the shard's current revision: - *unchanged* → apply (fast-forward), commit to journal, propagate if the profile permits; - *changed, non-overlapping* → three-way merge where the merge capability allows (axis 6), else keep-both; - *changed, overlapping* → **refuse + re-present** as a conflict (above); never silently clobber (I-5, no silent remote mutation). The unapplied overlay remains coordination- canonical and valid against its base. **Ordering.** The journal commit is the ordering authority for coordination-canonical effects; a shard-native write is only *acknowledged* in the journal after the adapter confirms it, so a crash between journal-intent and shard-write is recoverable (the intent is replayable, the write is idempotent-keyed on identity+base-rev). Cross-shard operations are ordered by their journal commits, giving the causal cut above. **Residual open items** (tracked in *Known scaling risks & open problems*, §12, not pretended solved): the exact convergence bound for high-write CRDT shards under partition, and whether per-equivalence-set divergence needs a vector clock vs. a simple base-rev comparison, are deferred to implementation spikes. ### 8.7 Scaling the union — incremental-first, rebuild as fallback The derived tier is *recomputable* (I-2) but recompute must never be the **operational** mechanism. A from-scratch rebuild reads every page of every shard — including rate-limited, paginated external APIs (Notion) and irreducibly-live sources — which can take hours to days and directly fights the operational-envelope axis (review C-2). So: **Incremental, change-driven maintenance is the primary mechanism.** Each shard's `notify` capability (or a poll/ETag fallback where it has none, §8.8) emits **change events**; an event drives a **delta update** to exactly the affected union nodes, equivalence candidates, indexes, and projections. The derived tier is a continuously-maintained materialised view, not a periodically-recomputed one. Steady-state cost is O(changes), not O(corpus). **Full rebuild is a rare, bounded fallback** — for cold start, schema/algorithm change, or suspected corruption — and it is **explicitly not required to be cheap**. It respects each shard's envelope (it may be slow, throttled, or resumable for a rate-limited shard) and runs *concurrently with serving the existing derived tier*; it swaps in atomically on completion. I-2 guarantees rebuild is *possible and correct*, not instant. **Equivalence detection is indexed, not pairwise (review C-1).** Naive fingerprint/span-set comparison across all pages of all shards is O(N²) and is forbidden. Instead: 1. **Blocking / candidate generation** — cheap keys bucket pages that *could* be equivalent: normalised title, normalised path tail, explicit alias-table entries (coordination- canonical), and **MinHash/LSH bands over content shingles** for near-duplicate and derived-content detection. Only within-bucket pairs are considered — turning O(N²) into ≈O(N) candidates. 2. **Verification** — candidate pairs are confirmed by full fingerprint / span-set overlap and any curator binding. Confirmed equivalences become union edges. 3. **Incremental maintenance** — a changed page is re-bucketed and only its *new* candidate set is re-verified; equivalence is maintained per-change, never recomputed globally. **The index is itself derived** (disposable, recomputable) and per-tenant-partitioned (§9). Its parameters (LSH band/row counts, shingle size, precision/recall) are tunable; the accepted **false-negative rate of blocking** is a known, tracked limitation (§12) — blocking trades a small miss rate for tractability, and curator bindings are the escape hatch for misses. --- ## 9. Cross-cut — Authorization (L5) Fully specified in **`ArchitectureBlueprint.md`** (the access & history sub-blueprint); summarised here for completeness: - **One core, a ladder of modes** L0 (open/c2, zero deps) → L1 (attributed) → L2 (authenticated) → L3 (role/group) → L4 (multi-tenant enterprise). Climbing is configuration, not re-architecture. - **PEP** wraps every adapter op; **PDP** decides `(principal, action, target)` over actions `read/write/patch/merge/administer`, layered on the adapter's capability profile (a shard that can't write can't be written regardless of policy — L5 consults the L1 rail). - **Authentication delegated** to a pluggable IdentityProvider (null provider = L0 default); real identity from `user-engine` over `net-kingdom` IAM. - **Fail open only at L0, fail closed at L2+.** Authorization is pure/offline once a Principal is resolved. Provenance carries authz context so the union never leaks unreadable content (the L5↔provenance-rail interaction). --- ## 10. The policy surface (mechanism over policy, made concrete) I-7 only means something if the policy knobs are enumerated and kept *out* of core algorithms. The configurable presets are: - **Canonical-source policy** — chorus / designated-canonical / git-merge / overlay-only / vote-to-merge (per space or per equivalence set). - **Federation model** — the §8.3 taxonomy, per space, composable per shard. - **Shard mode** — read-only / write-through / mirrored / projected / cached / canonical (constrained by the capability profile). - **Reconciliation cadence & conflict exposure** — push/poll/manual; show-conflicts vs auto-merge-when-supported. - **Execution policy** — derive/execute off (default) / sandboxed / per-shard-allowed. - **Authorization mode** — the L0–L4 ladder. - **Projection materialization** — lazy/eager; snapshot vs view-time; recording retention. Core ships sane defaults (L0 open; fork+journal; lazy replication-projection; overlay-before- mutation; execution off) and never hard-codes any of the above. --- ## 11. Concrete module structure (bridge to implementation) A proposed package layout for `src/shard_wiki/`, mapping 1:1 to the layers so the dependency rule (downward only; L4 rebuildable) is enforceable by import lint: ``` src/shard_wiki/ model/ # L2 top waist: Page, Identity, Placement, ProvenanceEnvelope, # Span, the page-shape types; capability-spectrum value types adapters/ # L1 bottom waist: AdapterContract (versioned iface), CapabilityProfile, # attachment-mode binding; concrete adapters: git/ folder/ gitea/ obsidian/ webdav/ notion/ … # each: profile + verbs coordination/ # L3: GitJournal, OverlayEngine (draft→patch→MR), reconcile federation/ # L3: FederationModel strategies (fork_journal, vcs_ping, # graph_join, feed, activitypub, engine_mirror) union/ # L4 (derived): IdentityResolver, EquivalenceGraph, UnionGraph, # Transclusion (reference-not-copy) projection/ # L4 (derived): ReplicationProjection, DerivationProjection, # ViewRegistry (moldable), QueryIndex (delegate|derive) authz/ # L5 cross-cut: PDP, PEP, IdentityProvider iface, NullProvider provenance/ # cross-cut: the envelope plumbing used by every layer api/ # L6: orchestrator API (server-side union for agents/CLI) ``` Hard import rules: `union/` and `projection/` may import `model/`, `adapters/`, `coordination/` but **nothing may import them** (they are the disposable middle). `model/` and `adapters/` import nothing else in the tree except `provenance/` (the waists stay thin). --- ## 12. Canonical data flows (the architecture exercised) **A. Attach a shard.** Adapter binds (chosen attachment mode) → probes/declares a capability profile → core registers the shard under a root entity → if not git-native, the coordination journal is seeded (begin-now/mirror/import per axis 5). No union rebuild yet (lazy). **B. Read a page through the union.** Consumer asks the union for an identity → Identity resolver maps it to placements across shards → equivalence yields chorus or canonical → replication-projection lazily fetches from each shard (cache + freshness) → page returned wrapped in its provenance envelope → L5 filters anything the principal can't see at source. **C. Edit a read-only / limited shard.** Write request → L5 PDP allows → capability profile says < write-through → OverlayEngine records a draft → renders a patch/MR in the shard's native syntax (lossless) or Markdown (lossy-with-report) → on explicit apply, commit to the journal and (if the profile permits) propagate; otherwise the overlay stands as the local truth, fully attributed. **D. Attach a computational notebook.** Adapter declares profile (attachment=file-store, opacity=mixed, computational=captured-output). Core attaches the `.ipynb` **source** as canonical; presents cells + embedded outputs as **derivation-projection snapshots** marked "run N, env unguaranteed"; offers a static render via the view registry; re-execution stays gated off. History uses paired-text/nbdime per axis 5. --- ## 13. Key tradeoffs & decisions to confirm Resolved here: - **Capability spectra over a verb checklist** — accept richer contract complexity for precise, uniform degradation. (Decided: spectra.) - **Derived middle is a cache, not a store** — accept recompute cost for rebuildability, provenance, and graceful degradation. (Decided: cache.) - **Default federation = fork+journal over Git** — the home case; other models opt-in. (Decided.) - **Execution off by default** — recognise+project always; execute only when gated on. (Decided.) Open — to confirm before SHARD-WP-0002 spec-writing finalises: 1. **Union graph persistence.** Pure-recompute (simplest, honours I-2 hardest) vs a persisted- but-disposable cache (faster, must guarantee rebuild equivalence). *Recommendation: persisted-but-disposable with a `rebuild` that must reproduce it byte-for-byte.* 2. **Address scheme.** Ship shard-scoped native-id wrapping now and treat a portable tumbler as a later capability, or design the tumbler up front? *Recommendation: wrap native ids now.* 3. **L1 "attributed-but-open" mode** — ship it or jump L0→L2? (Carried from ArchitectureBlueprint.) 4. **Per-page ACL default** — off (per-shard/namespace) confirmed; revisit only if demand appears. --- ## 14. What this architecture is *not* - Not a wiki engine, UI, or rendering pipeline (those are consumers at L6). - Not a canonical-source-of-truth — shards keep sovereignty; the middle is derived. - Not a generic file-sync daemon — synchronisation is wiki-page-semantic. - Not an execution platform — computation is recognised and projected, not hosted. - Not a universal ontology — no single schema is imposed on all shards. - Not an authentication/identity store — that is delegated (authorization is owned). --- ## 15. Traceability - **INTENT** — every invariant in §2 cites an INTENT principle or boundary; no invariant contradicts the Stability Note. - **Research** — §6 (spectra) ← `260614-shard-spectrum-synthesis` v3; §8.3 (federation taxonomy) ← v3 §2.5; §8.4–§8.5 (two-axis projection, view registry, computational scope) ← `260614-computational-page-model-synthesis`; §7 page shapes ← the engine + modern-tool + computational dives; §1 thesis ← the files-canonical/index-derived through-line across Logseq/ikiwiki/GT/Pharo/Jupyter. - **Use cases** — the architecture is sized to UC-01–UC-84: federation/coordination (UC-01–07, 26–33, 56, 71–72, 79) → §8; attachment/adapter (UC-34–43, 50, 53, 57, 60–62, 64–66, 68–70, 76–82) → §6; page model & fidelity (UC-34, 39, 42, 55, 58–59, 67, 73, 80, 83–84) → §7/§8.5; addressing/identity/query (UC-32, 44–49, 51–52, 54, 63, 74) → §7.2/§8.4; provenance & metadata (UC-24–25, 75) → the provenance rail; collaboration & discovery (UC-08–23) → L6 consumers over the union. - **Workplans** — §6–§8 are the design target of `SHARD-WP-0002` (T11–T18); §9 is owned by `ArchitectureBlueprint.md`; §1 (yawex-derived resolution/overlay) aligns with `SHARD-WP-0001`. --- ## 16. Stability note This document defines shard-wiki's **internal** architecture; it may evolve as the spec workplans land. But the **thesis (§1)**, the **invariants (§2)**, and the **dual narrow waist (§1, §6, §7)** are load-bearing — changing any of them is an architectural change in the sense of INTENT's Stability Note and should be rare and deliberate.