# Wikibase / Wikidata — deep dive (findings)

**Date:** 2026-06-14 · **Source:** SHARD-WP-0003 T2 · **Subject:** Wikibase (MediaWiki extension) and its flagship instance **Wikidata**, incl. the Wikidata Query Service (SPARQL).

## Why this dive

Every structured shard so far tops out at *typed records in a database*: Notion's database-pages, XWiki's XObjects/classes, Trilium's typed relations, Roam/Logseq's attribute blocks. Wikibase is a *different kind of structure altogether* — a **typed knowledge graph of entities and provenance-bearing statements**, queried with **SPARQL** over an **RDF** projection. It is the **far end of the structure spectrum** and of the **native-query spectrum**, and it pushes **provenance down to the individual assertion**. The question for shard-wiki: what does a shard look like when its "page" is *not prose but a set of statements*, and what does the page model / adapter contract owe such a shard?

## 1. The data model — entities, statements, snaks

**Entities** are the top-level objects, each on its own MediaWiki page with a **stable opaque ID**:

- **Item** — `Q42`. Has multilingual **labels / descriptions / aliases**, a set of **statements**, and **sitelinks** (links to wiki articles). The label is *annotation*, not identity — `Q42` is the identity, "Douglas Adams" is just its English label.
- **Property** — `P31` ("instance of"). Also has labels/descriptions/aliases, plus a **fixed datatype** constraining its values (item-reference, string, time, globe-coordinate, quantity, monolingual-text, url, external-id, commons-media, …).

**Statement** = the unit of assertion on an item. Structure:

```
statement = claim + references[] + rank
claim     = mainSnak + qualifiers[]
snak      = property + snaktype + (value)   # snaktype ∈ value | somevalue | novalue
```

- **Main snak** — the core property→value assertion (e.g. `P31` → `Q5` "human").
- **Qualifiers** — snaks that *contextualize* the claim without being the subject (validity time, "as of", determination method, units). E.g. *population (P1082) = 8.4M, **point in time (P585) = 2020***.
- **References** — lists of snaks citing **where the claim comes from** (a source item, a URL, a page number). **Provenance attached to the individual statement, not the page.**
- **Rank** — `preferred` | `normal` | `deprecated`: relative importance among same-property statements (lets multiple, even contradictory, values coexist with a curation signal — the structured analogue of fedwiki's "chorus").
- Each statement carries a **stable GUID** (`Q42$` followed by a UUID), so statements are individually addressable.

`somevalue` (known to exist, value unknown) and `novalue` (known *not* to have a value) are **first-class** — the model represents *known-unknowns* explicitly, which prose and most DBs cannot.

## 2. The RDF / SPARQL surface

Wikibase **projects entities to RDF**; the **Wikidata Query Service (WDQS)** is a **Blazegraph** triple store exposing a **SPARQL** endpoint. The projection is deliberately multi-layered:

- **Truthy** triples (`wdt:` prefix) — the simple "best" value, for easy queries: `wd:Q42 wdt:P31 wd:Q5`.
- **Full** statements — reified so qualifiers/references/rank survive: `wd:Q42 p:P31 ?stmt . ?stmt ps:P31 wd:Q5 ; pq:P585 ?time ; prov:wasDerivedFrom ?ref`. (`p:`=statement node, `ps:`=statement value, `pq:`=qualifier, `pr:`/`prov:`=reference.)
- **Federated SPARQL** — the `SERVICE <endpoint> { … }` keyword runs a sub-query against *another* SPARQL endpoint and joins the results. **Query-level federation is built into the query language** — a different federation primitive from fedwiki's fork/neighborhood.
- **EntitySchemas / ShEx** — schemas (`E`-ids) that *validate* an entity's shape (Shape Expressions). Optional, declarative structure validation over the open graph.

## 3. Storage, identity, history

- **Storage:** each entity is a **JSON blob stored as a MediaWiki page** (`Item:` / `Property:` namespaces on a default install, each with its own content model). The RDF/SPARQL store is a **derived index** rebuilt from these canonical JSON entities (an *update stream* feeds WDQS) — exactly shard-wiki's "derived query index over a canonical store" pattern (UC-63), at planet scale.
- **Identity:** the **opaque Q/P/L IDs are the identity**, fully decoupled from human-readable labels and from language. This is the cleanest real-world instance of **stable, language-neutral identity ≠ label/placement** — a strong reinforcement of our identity model (T16).
- **History:** because each entity is one MediaWiki page, history is **page-level MediaWiki revisions** — every edit is a full-entity JSON snapshot with author/timestamp/comment. *Coarse* history granularity (whole entity per revision), but the **edit API is fine-grained** (`wbsetclaim`, `wbeditentity` patch individual statements). So: **fine write API over a coarse history unit** — a distinct point on the write/history spectra.

## 4. Capability profile

| Dimension (synthesis spectrum) | Wikibase / Wikidata |
|--------------------------------|---------------------|
| Attachment mode | **external-API** (MediaWiki Action API + REST) **and** a derived **SPARQL endpoint**; self-hostable |
| Addressing granularity | **statement** (each has a GUID) within an **entity** (Q/P id) |
| Content identity | **stable opaque ID** (Q/P/L); labels are multilingual annotations |
| Identity vs placement | **fully separated** — identity is language- and label-neutral |
| Structure | **typed knowledge graph**: entities + statements (claim+qualifiers+refs+rank) |
| History | **page-level revisions** (whole-entity JSON snapshots); fine-grained edit API |
| Merge model | MediaWiki last-writer / edit-conflict; rank lets contradictory values coexist |
| Native query | **SPARQL** (RDF) + **federated `SERVICE`** cross-endpoint join — the far end |
| Translation | **not Markdown** — content *is* statements; render to prose is a lossy projection |
| Attachment/write granularity | **statement-level writes** via API; coarse history unit |
| Operational envelope | huge derived index (Blazegraph), rate-limited public endpoints |
| Access grant | open read; MediaWiki user/permission model for write; self-host = own ACL |
| Content opacity | transparent (public JSON + RDF); not encrypted |
| Provenance | **statement-level** — references + rank per assertion (new far end) |

## 5. INTENT mapping

### Reinforcements

- **Stable identity ≠ placement** (T16): Q/P IDs decoupled from labels/language are the textbook case — adopt the principle that a page's *identity* is an opaque stable handle, display names are annotations.
- **Derived index over canonical store** (UC-63): WDQS is exactly a SPARQL index rebuilt from canonical JSON entities via an update stream — validates the projection pattern.
- **Union without erasure / chorus**: **rank** lets multiple (even contradictory) statements coexist with a curation signal rather than forcing one truth — the *structured* analogue of fedwiki's chorus (UC-72) and our "view multiple versions" (UC-27).
- **Mechanism over policy**: references + rank are *mechanism* for representing disagreement and sourcing; which statement "wins" is left to the consumer/query.

### Divergences (boundaries / design notes)

- **Content is not Markdown.** A Wikibase "page" is a set of statements; there is no prose body. This is the **structure far-end**: shard-wiki must either (a) treat such a shard as a **structured/typed shard** projected to a *lossy* Markdown/table rendering (UC-55/UC-73), or (b) model a page whose payload is typed statements (T12). Forcing it into Markdown-first erases the graph — a design-bug if done silently; render-with-provenance instead.
- **Provenance granularity is finer than ours.** Our provenance is per-page/per-shard; Wikibase is **per-statement** (references) and even per-value (rank). The page model and coordination journal should *allow* sub-page provenance (UC-75) even if MVP records it per page.
- **Query is graph, not text/datalog.** SPARQL over RDF (with federated `SERVICE`) is a richer query far-end than Roam/Logseq datalog or Notion filters (UC-52) — and its `SERVICE` federation is a *query-time* cross-shard join, distinct from fedwiki structural federation. Note both as native-query tiers.

### What to keep

1. **Opaque stable identity, labels-as-annotations** as the identity model (T16).
2. **Statement/assertion-level provenance** (references) and a **coexistence-with-rank** model as the structured form of union-without-erasure (UC-75).
3. **Derived SPARQL/graph index over a canonical entity store** as a projection pattern (UC-63/UC-74), incl. **federated query** as a first-class federation mode.
4. A **typed-graph page payload** option in the page model (T12), with **lossy render-to-Markdown** as the projection (never silent flattening).

## 6. UC seeds

| # | Seed | Disposition |
|---|------|-------------|
| UC-73 | Attach a **Wikibase** as a **typed entity-statement (RDF) shard** (items/properties/statements w/ qualifiers); project to a rendered page view, lossy to Markdown, preserving the graph | **new** |
| UC-74 | **Graph-query the union** via **SPARQL** and **federate queries across endpoints** (`SERVICE`) — graph query as a native-query tier + query-time cross-shard join | **new** |
| UC-75 | Preserve **statement-level provenance** — references + rank attached to each assertion (sub-page provenance granularity) | **new** |
| — | typed records → typed *graph* entities | enrich **UC-34** |
| — | inter-record relations → typed graph edges with qualifiers | enrich **UC-58** |
| — | native query → SPARQL/RDF + federated SERVICE | enrich **UC-52** |
| — | provenance → statement/assertion granularity | enrich **UC-24** |

## 7. Architecture notes for SHARD-WP-0002

- **T12 (structured/typed page model):** add a **typed-graph payload** tier above typed-records — a page whose content is **entities + statements (claim + qualifiers + references + rank)**, with `somevalue`/`novalue` known-unknowns. Render-to-Markdown is a **lossy projection**, not the canonical form.
- **T16 (identity / addressing):** adopt **opaque stable identity with labels-as-annotation** (Q/P model); record **statement GUIDs** as an example of *sub-page addressable units*.
- **Native-query tiering:** SPARQL/RDF + federated `SERVICE` is the **graph far-end** of the query spectrum (above datalog/filters); `SERVICE` is also a **query-time federation** mode to sit beside fedwiki's structural federation.
- **Provenance model:** allow **per-statement references + rank** (sub-page provenance, coexistence-with-curation) in the union, even if MVP collapses to per-page.
- **Derived index:** WDQS = canonical JSON entities → update stream → Blazegraph SPARQL index; the reference implementation of UC-63 at scale (per-shard or core-built index, Q16).

## 8. Open questions

1. Does shard-wiki model a **typed-graph page** natively (T12), or always treat Wikibase as a structured shard **projected to a Markdown/table rendering** (UC-55), or both (canonical graph + lossy view)?
2. Is **SPARQL/graph query** exposed as a union-level capability (translate to a common query layer) or only as a **pass-through** to graph-capable shards? How does federated `SERVICE` relate to shard-wiki's own cross-shard query?
3. At what granularity does the coordination journal record **provenance** — per page (MVP), per statement (Wikibase-native), or configurable?
4. Is **rank** (coexisting contradictory values w/ curation) representable in the union as a first-class "chorus of statements," unifying with fedwiki's page-level chorus (UC-72/27)?

## 9. Sources

- Wikibase/DataModel and **DataModel/Primer** — mediawiki.org
- Help:Qualifiers; Wikidata SPARQL query service + Query Help; SPARQL tutorial — wikidata.org
- Wikidata Query Service / User Manual — mediawiki.org; Wikitech (Blazegraph, updater)
- "The wikibase model" — Vanderbilt Libraries Digital Lab (heardlibrary.github.io)
- RaiseWikibase — Wikibase Data Model functions (ub-mannheim.github.io)
- WShEx / EntitySchemas (ShEx) — arxiv.org/abs/2208.02697, ceur-ws.org Vol-3262

## 10. Traceability

New UCs **UC-73–UC-75** carry the marker **⬡** in the wikiengines column of `spec/UseCaseCatalog.md`. Enriched: UC-34, UC-58, UC-52, UC-24. Architecture cross-refs: SHARD-WP-0002 T12, T16, native-query tiering, provenance model, UC-63 derived index.
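
## 11. Illustrative sketch: statements and best rank

The statement structure from section 1 and the "best value" rule behind the truthy projection in section 2 can be sketched in a few lines of Python. This is a minimal sketch, not Wikibase's or shard-wiki's actual code: the class and function names (`Snak`, `Statement`, `best_rank_statements`), the item ID `Q0`, the GUID suffixes, the 2010 population figure, and the reference URL are all hypothetical. It assumes the usual best-rank rule: preferred statements if any exist for the property, otherwise normal ones, and never deprecated ones.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class SnakType(Enum):
    VALUE = "value"
    SOMEVALUE = "somevalue"  # known to exist, value unknown
    NOVALUE = "novalue"      # known not to have a value


class Rank(Enum):
    PREFERRED = "preferred"
    NORMAL = "normal"
    DEPRECATED = "deprecated"


@dataclass(frozen=True)
class Snak:
    # snak = property + snaktype + (value)
    prop: str
    snaktype: SnakType = SnakType.VALUE
    value: Optional[str] = None


@dataclass
class Statement:
    # statement = claim (main snak + qualifiers) + references + rank
    guid: str
    main_snak: Snak
    qualifiers: List[Snak] = field(default_factory=list)
    references: List[List[Snak]] = field(default_factory=list)
    rank: Rank = Rank.NORMAL


def best_rank_statements(statements: List[Statement], prop: str) -> List[Statement]:
    """Preferred statements for `prop` if any exist, else normal ones; deprecated never."""
    same_prop = [s for s in statements if s.main_snak.prop == prop]
    preferred = [s for s in same_prop if s.rank is Rank.PREFERRED]
    if preferred:
        return preferred
    return [s for s in same_prop if s.rank is Rank.NORMAL]


# Hypothetical data: item ID, GUID suffixes, the 2010 figure, and the URL are made up.
statements = [
    Statement(
        guid="Q0$example-2010",
        main_snak=Snak("P1082", value="8100000"),
        qualifiers=[Snak("P585", value="2010")],
    ),
    Statement(
        guid="Q0$example-2020",
        main_snak=Snak("P1082", value="8400000"),
        qualifiers=[Snak("P585", value="2020")],
        references=[[Snak("P854", value="https://example.org/census-2020")]],
        rank=Rank.PREFERRED,
    ),
    Statement(
        guid="Q0$example-retracted",
        main_snak=Snak("P1082", value="1"),
        rank=Rank.DEPRECATED,
    ),
    # A known-unknown: the property applies, but the value is not known.
    Statement(guid="Q0$example-unknown", main_snak=Snak("P19", SnakType.SOMEVALUE)),
]

best = best_rank_statements(statements, "P1082")
```

All three population statements stay in the entity; only the view changes. A lossy Markdown rendering could show `best` by default and still link to the full set by GUID, which is one way to keep the render-with-provenance requirement from section 5.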