shard-wiki/research/260614-wikibase-deep-dive/findings.md
tegwick a6629bdb29 research: Wikibase/Wikidata deep dive (entity-statement graph, SPARQL); UC-73-75
SHARD-WP-0003 T2. Structure & native-query far-end: typed knowledge graph
(items/properties, statements = claim+qualifiers+references+rank), RDF
projection + SPARQL (WDQS/Blazegraph) incl. federated SERVICE, opaque stable
Q/P identity (labels-as-annotation), statement-level provenance. UC-73
(typed-graph shard, lossy render), UC-74 (SPARQL + federated query), UC-75
(per-assertion provenance). Enriched UC-34/58/52/24. Marks T2 done.
Feeds SHARD-WP-0002 T12/T16.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 19:31:08 +02:00


Wikibase / Wikidata — deep dive (findings)

Date: 2026-06-14 · Source: SHARD-WP-0003 T2 · Subject: Wikibase (MediaWiki extension) and its flagship instance Wikidata, incl. the Wikidata Query Service (SPARQL).

Why this dive

Every structured shard so far tops out at typed records in a database: Notion's database-pages, XWiki's XObjects/classes, Trilium's typed relations, Roam/Logseq's attribute blocks. Wikibase is a different kind of structure altogether — a typed knowledge graph of entities and provenance-bearing statements, queried with SPARQL over an RDF projection. It is the far end of the structure spectrum and of the native-query spectrum, and it pushes provenance down to the individual assertion. The question for shard-wiki: what does a shard look like when its "page" is not prose but a set of statements, and what does the page model / adapter contract owe such a shard?

1. The data model — entities, statements, snaks

Entities are the top-level objects, each on its own MediaWiki page with a stable opaque ID:

  • Item, e.g. Q42. Has multilingual labels / descriptions / aliases, a set of statements, and sitelinks (links to wiki articles). The label is annotation, not identity: Q42 is the identity, "Douglas Adams" is just its English label.
  • Property, e.g. P31 ("instance of"). Also has labels/descriptions/aliases, plus a fixed datatype constraining its values (item-reference, string, time, globe-coordinate, quantity, monolingual-text, url, external-id, commons-media, …).

Statement = the unit of assertion on an item. Structure:

statement = claim + references[] + rank
claim     = mainSnak + qualifiers[]
snak      = property + snaktype + (value)        # snaktype ∈ value | somevalue | novalue
  • Main snak: the core property→value assertion (e.g. P31 → Q5, "human").
  • Qualifiers — snaks that contextualize the claim without being the subject (validity time, "as of", determination method, units). E.g. population (P1082) = 8.4M, point in time (P585) = 2020.
  • References — lists of snaks citing where the claim comes from (a source item, a URL, a page number). Provenance attached to the individual statement, not the page.
  • Rank: preferred | normal | deprecated, the relative importance among same-property statements (lets multiple, even contradictory, values coexist with a curation signal; the structured analogue of fedwiki's "chorus").
  • Each statement carries a stable GUID (Q42$<uuid>), so statements are individually addressable.

somevalue (known to exist, value unknown) and novalue (known not to have a value) are first-class — the model represents known-unknowns explicitly, which prose and most DBs cannot.
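The grammar above can be sketched as plain Python dataclasses. This is a minimal illustration, not Wikibase's own JSON schema: the class and field names are hypothetical, the GUID suffix and the example.org URL are placeholders, and P854 is used as the reference-URL property as on Wikidata.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

SnakType = Literal["value", "somevalue", "novalue"]
Rank = Literal["preferred", "normal", "deprecated"]


@dataclass(frozen=True)
class Snak:
    prop: str                        # property ID, e.g. "P31"
    snaktype: SnakType = "value"
    value: Optional[object] = None   # present only when snaktype == "value"

    def __post_init__(self) -> None:
        # A value is required for "value" snaks and forbidden for the
        # known-unknown kinds ("somevalue", "novalue").
        if (self.snaktype == "value") != (self.value is not None):
            raise ValueError("value must be set iff snaktype == 'value'")


@dataclass
class Claim:
    main_snak: Snak
    qualifiers: list[Snak] = field(default_factory=list)


@dataclass
class Statement:
    guid: str                        # e.g. "Q42$<uuid>"; suffix below is a placeholder
    claim: Claim
    references: list[list[Snak]] = field(default_factory=list)  # each reference is a list of snaks
    rank: Rank = "normal"


# Q42 is an instance of (P31) human (Q5), with one placeholder reference.
human = Statement(
    guid="Q42$placeholder-guid",
    claim=Claim(Snak("P31", value="Q5")),
    references=[[Snak("P854", value="https://example.org/source")]],
)

# A known-unknown: the property applies, but the value is not known.
unknown = Snak("P585", snaktype="somevalue")
```

Keeping snaktype on every snak (main, qualifier, or reference) is what lets the known-unknowns appear at any position in a statement rather than only on the main value.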

2. The RDF / SPARQL surface

Wikibase projects entities to RDF; the Wikidata Query Service (WDQS) is a Blazegraph triple store exposing a SPARQL endpoint. The projection is deliberately multi-layered:

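  • Truthy triples (wdt: prefix): per property, the values of the best-ranked statements (preferred if any exist, otherwise normal; deprecated excluded), for easy queries: wd:Q42 wdt:P31 wd:Q5.
  • Full statements, reified so qualifiers/references/rank survive: wd:Q42 p:P31 ?stmt . ?stmt ps:P31 wd:Q5 ; pq:P585 ?time ; prov:wasDerivedFrom ?ref. (p: links the entity to its statement node, ps: carries the statement value, pq: a qualifier value; prov:wasDerivedFrom points to a reference node whose pr: triples carry the reference values.)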
  • Federated SPARQL — the SERVICE <endpoint> { … } keyword runs a sub-query against another SPARQL endpoint and joins the results. Query-level federation is built into the query language — a different federation primitive from fedwiki's fork/neighborhood.
  • EntitySchemas / ShEx — schemas (E-ids) that validate an entity's shape (Shape Expressions). Optional, declarative structure validation over the open graph.
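The two projection layers can be illustrated with a small Python sketch that turns statements into truthy and reified triples. It is a simplification under stated assumptions: the Stmt type and both function names are hypothetical, values are passed in as already-serialized RDF terms, the entity QX is a placeholder, and the wds: statement-node naming, the wikibase:rank terms, and the wdref: reference prefix follow the Wikibase RDF mapping as understood here rather than anything stated in the sources.

```python
from typing import NamedTuple


class Stmt(NamedTuple):
    guid: str                 # statement GUID, e.g. "Q42$<uuid>"
    prop: str                 # property ID, e.g. "P1082"
    value: str                # already-serialized RDF term
    rank: str = "normal"      # preferred | normal | deprecated
    qualifiers: tuple = ()    # (property ID, serialized value) pairs
    references: tuple = ()    # serialized reference-node terms


def truthy_triples(entity: str, statements: list) -> list:
    """wdt: layer: per property, keep only the best-ranked non-deprecated values."""
    by_prop: dict = {}
    for s in statements:
        if s.rank != "deprecated":
            by_prop.setdefault(s.prop, []).append(s)
    triples = []
    for prop, group in by_prop.items():
        best = "preferred" if any(s.rank == "preferred" for s in group) else "normal"
        triples += [(f"wd:{entity}", f"wdt:{prop}", s.value) for s in group if s.rank == best]
    return triples


def full_triples(entity: str, statements: list) -> list:
    """p:/ps:/pq: layer: one node per statement so qualifiers, references, rank survive."""
    triples = []
    for s in statements:
        node = "wds:" + s.guid.replace("$", "-")   # assumed node naming
        triples.append((f"wd:{entity}", f"p:{s.prop}", node))
        triples.append((node, f"ps:{s.prop}", s.value))
        triples.append((node, "wikibase:rank", f"wikibase:{s.rank.capitalize()}Rank"))
        triples += [(node, f"pq:{p}", v) for p, v in s.qualifiers]
        triples += [(node, "prov:wasDerivedFrom", r) for r in s.references]
    return triples


# Three population (P1082) statements on a placeholder entity "QX".
stmts = [
    Stmt("QX$b", "P1082", '"8400000"', rank="preferred",
         qualifiers=(("P585", '"2020"'),), references=("wdref:r1",)),
    Stmt("QX$c", "P1082", '"8000000"'),
    Stmt("QX$d", "P1082", '"1"', rank="deprecated"),
]
```

With these inputs the truthy layer keeps only the preferred value, while the full layer still carries all three statements, including the deprecated one with its rank. That asymmetry is the reason a consumer that needs disagreement or sourcing has to query the reified layer.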

3. Storage, identity, history

  • Storage: each entity is a JSON blob stored as a MediaWiki page (Item: / Property: content model). The RDF/SPARQL store is a derived index rebuilt from these canonical JSON entities (an update stream feeds WDQS) — exactly shard-wiki's "derived query index over a canonical store" pattern (UC-63), at planet scale.
  • Identity: the opaque Q/P/L IDs (items, properties, lexemes) are the identity, fully decoupled from human-readable labels and from language. This is the cleanest real-world instance of stable, language-neutral identity ≠ label/placement, and a strong reinforcement of our identity model (T16).
  • History: because each entity is one MediaWiki page, history is page-level MediaWiki revisions — every edit is a full-entity JSON snapshot with author/timestamp/comment. Coarse history granularity (whole entity per revision), but the edit API is fine-grained (wbsetclaim, wbeditentity patch individual statements). So: fine write API over a coarse history unit — a distinct point on the write/history spectra.
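The "fine write API over a coarse history unit" point, together with the derived-index pattern, can be sketched as a toy in-memory store. Everything here is hypothetical: EntityStore, set_claim, and rebuild_index are illustration names (not the MediaWiki API), the entity JSON is far simpler than Wikibase's real serialization, and P_X is a placeholder property.

```python
import json


class EntityStore:
    """Canonical store: one JSON snapshot of the whole entity per revision."""

    def __init__(self) -> None:
        self.revisions: dict = {}   # entity ID -> list of (comment, JSON snapshot)

    def head(self, eid: str) -> dict:
        # Latest revision, decoded into a fresh dict.
        return json.loads(self.revisions[eid][-1][1])

    def set_claim(self, eid: str, guid: str, prop: str, value: str, comment: str) -> None:
        # Fine-grained write: only one statement changes ...
        entity = self.head(eid) if eid in self.revisions else {"id": eid, "claims": {}}
        entity["claims"][guid] = {"property": prop, "value": value}
        # ... but history records the full entity (coarse history unit).
        snapshot = json.dumps(entity, sort_keys=True)
        self.revisions.setdefault(eid, []).append((comment, snapshot))


def rebuild_index(store: EntityStore) -> set:
    """Derived index: (entity, property, value) triples rebuilt from head revisions."""
    return {
        (eid, claim["property"], claim["value"])
        for eid in store.revisions
        for claim in store.head(eid)["claims"].values()
    }


store = EntityStore()
store.set_claim("Q42", "Q42$a", "P31", "Q5", "add instance of")
store.set_claim("Q42", "Q42$b", "P_X", "example", "add placeholder claim")
index = rebuild_index(store)
```

Because the index is computed only from head snapshots, it can be discarded and rebuilt at any time, which is the property that makes it a projection rather than a second source of truth.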

4. Capability profile

| Dimension (synthesis spectrum) | Wikibase / Wikidata |
|---|---|
| Attachment mode | external-API (MediaWiki Action API + REST) and a derived SPARQL endpoint; self-hostable |
| Addressing granularity | statement (each has a GUID) within an entity (Q/P id) |
| Content identity | stable opaque ID (Q/P/L); labels are multilingual annotations |
| Identity vs placement | fully separated: identity is language- and label-neutral |
| Structure | typed knowledge graph: entities + statements (claim+qualifiers+refs+rank) |
| History | page-level revisions (whole-entity JSON snapshots); fine-grained edit API |
| Merge model | MediaWiki last-writer / edit-conflict; rank lets contradictory values coexist |
| Native query | SPARQL (RDF) + federated SERVICE cross-endpoint join (the far end) |
| Translation | not Markdown: content is statements; render to prose is a lossy projection |
| Attachment/write granularity | statement-level writes via API; coarse history unit |
| Operational envelope | huge derived index (Blazegraph), rate-limited public endpoints |
| Access grant | open read; MediaWiki user/permission model for write; self-host = own ACL |
| Content opacity | transparent (public JSON + RDF); not encrypted |
| Provenance | statement-level: references + rank per assertion (new far end) |

5. INTENT mapping

Reinforcements

  • Stable identity ≠ placement (T16): Q/P IDs decoupled from labels/language are the textbook case — adopt the principle that a page's identity is an opaque stable handle, display names are annotations.
  • Derived index over canonical store (UC-63): WDQS is exactly a SPARQL index rebuilt from canonical JSON entities via an update stream — validates the projection pattern.
  • Union without erasure / chorus: rank lets multiple (even contradictory) statements coexist with a curation signal rather than forcing one truth — the structured analogue of fedwiki's chorus (UC-72) and our "view multiple versions" (UC-27).
  • Mechanism over policy: references + rank are mechanism for representing disagreement and sourcing; which statement "wins" is left to the consumer/query.

Divergences (boundaries / design notes)

  • Content is not Markdown. A Wikibase "page" is a set of statements; there is no prose body. This is the structure far-end: shard-wiki must either (a) treat such a shard as a structured/typed shard projected to a lossy Markdown/table rendering (UC-55/UC-73), or (b) model a page whose payload is typed statements (T12). Forcing it into Markdown-first erases the graph — a design-bug if done silently; render-with-provenance instead.
  • Provenance granularity is finer than ours. Our provenance is per-page/per-shard; Wikibase is per-statement (references) and even per-value (rank). The page model and coordination journal should allow sub-page provenance (UC-75) even if MVP records it per page.
  • Query is graph, not text/datalog. SPARQL over RDF (with federated SERVICE) is a richer query far-end than Roam/Logseq datalog or Notion filters (UC-52) — and its SERVICE federation is a query-time cross-shard join, distinct from fedwiki structural federation. Note both as native-query tiers.
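The "render-with-provenance" alternative to silent flattening could look roughly like the following sketch. It assumes a hypothetical render_markdown helper and a simplified dict shape for statements; neither is part of Wikibase or of any existing shard-wiki contract.

```python
def render_markdown(entity_id: str, label: str, statements: list) -> str:
    """Render statements as a Markdown table and say what the projection dropped."""
    lines = [
        f"# {label} ({entity_id})",
        "",
        "| Property | Value | Rank | Refs |",
        "|---|---|---|---|",
    ]
    dropped_qualifiers = 0
    for s in statements:
        refs = len(s.get("references", []))
        lines.append(f"| {s['property']} | {s['value']} | {s['rank']} | {refs} |")
        # Qualifiers do not fit the flat table, so count them instead of hiding the loss.
        dropped_qualifiers += len(s.get("qualifiers", []))
    lines += [
        "",
        f"_Lossy projection of {entity_id}: {dropped_qualifiers} qualifier(s) omitted; "
        "the canonical form is the statement graph._",
    ]
    return "\n".join(lines)


page = render_markdown(
    "Q42",
    "Douglas Adams",
    [
        {"property": "P31", "value": "Q5", "rank": "normal",
         "references": [["P854"]], "qualifiers": []},
        {"property": "P1082", "value": "8400000", "rank": "preferred",
         "qualifiers": [("P585", "2020")]},
    ],
)
```

The rendered view keeps the opaque ID in the heading and ends with an explicit loss note, so a reader of the Markdown can tell it is a projection and knows where the canonical statements live.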

What to keep

  1. Opaque stable identity, labels-as-annotations as the identity model (T16).
  2. Statement/assertion-level provenance (references) and a coexistence-with-rank model as the structured form of union-without-erasure (UC-75).
  3. Derived SPARQL/graph index over a canonical entity store as a projection pattern (UC-63/UC-74), incl. federated query as a first-class federation mode.
  4. A typed-graph page payload option in the page model (T12), with lossy render-to-Markdown as the projection (never silent flattening).

6. UC seeds

| # | Seed | Disposition |
|---|---|---|
| UC-73 | Attach a Wikibase as a typed entity-statement (RDF) shard (items/properties/statements w/ qualifiers); project to a rendered page view, lossy to Markdown, preserving the graph | new |
| UC-74 | Graph-query the union via SPARQL and federate queries across endpoints (SERVICE): graph query as a native-query tier + query-time cross-shard join | new |
| UC-75 | Preserve statement-level provenance: references + rank attached to each assertion (sub-page provenance granularity) | new |
| | typed records → typed graph entities | enrich UC-34 |
| | inter-record relations → typed graph edges with qualifiers | enrich UC-58 |
| | native query → SPARQL/RDF + federated SERVICE | enrich UC-52 |
| | provenance → statement/assertion granularity | enrich UC-24 |

7. Architecture notes for SHARD-WP-0002

  • T12 (structured/typed page model): add a typed-graph payload tier above typed-records — a page whose content is entities + statements (claim + qualifiers + references + rank), with somevalue/novalue known-unknowns. Render-to-Markdown is a lossy projection, not the canonical form.
  • T16 (identity / addressing): adopt opaque stable identity with labels-as-annotation (Q/P model); record statement GUIDs as an example of sub-page addressable units.
  • Native-query tiering: SPARQL/RDF + federated SERVICE is the graph far-end of the query spectrum (above datalog/filters); SERVICE is also a query-time federation mode to sit beside fedwiki's structural federation.
  • Provenance model: allow per-statement references + rank (sub-page provenance, coexistence-with-curation) in the union, even if MVP collapses to per-page.
  • Derived index: WDQS = canonical JSON entities → update stream → Blazegraph SPARQL index; the reference implementation of UC-63 at scale (per-shard or core-built index, Q16).

8. Open questions

  1. Does shard-wiki model a typed-graph page natively (T12), or always treat Wikibase as a structured shard projected to a Markdown/table rendering (UC-55), or both (canonical graph + lossy view)?
  2. Is SPARQL/graph query exposed as a union-level capability (translate to a common query layer) or only as a pass-through to graph-capable shards? How does federated SERVICE relate to shard-wiki's own cross-shard query?
  3. At what granularity does the coordination journal record provenance — per page (MVP), per statement (Wikibase-native), or configurable?
  4. Is rank (coexisting contradictory values w/ curation) representable in the union as a first-class "chorus of statements," unifying with fedwiki's page-level chorus (UC-72/27)?

9. Sources

  • Wikibase/DataModel and DataModel/Primer — mediawiki.org
  • Help:Qualifiers; Wikidata SPARQL query service + Query Help; SPARQL tutorial — wikidata.org
  • Wikidata Query Service / User Manual — mediawiki.org; Wikitech (Blazegraph, updater)
  • "The wikibase model" — Vanderbilt Libraries Digital Lab (heardlibrary.github.io)
  • RaiseWikibase — Wikibase Data Model functions (ub-mannheim.github.io)
  • WShEx / EntitySchemas (ShEx) — arxiv.org/abs/2208.02697, ceur-ws.org Vol-3262

10. Traceability

New UCs UC-73 to UC-75 carry the marker in the wikiengines column of spec/UseCaseCatalog.md. Enriched: UC-34, UC-58, UC-52, UC-24. Architecture cross-refs: SHARD-WP-0002 T12, T16, native-query tiering, provenance model, UC-63 derived index.