SHARD-WP-0003 T2. Structure & native-query far-end: typed knowledge graph (items/properties, statements = claim+qualifiers+references+rank), RDF projection + SPARQL (WDQS/Blazegraph) incl. federated SERVICE, opaque stable Q/P identity (labels-as-annotation), statement-level provenance. UC-73 (typed-graph shard, lossy render), UC-74 (SPARQL + federated query), UC-75 (per-assertion provenance). Enriched UC-34/58/52/24. Marks T2 done. Feeds SHARD-WP-0002 T12/T16. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Wikibase / Wikidata — deep dive (findings)
Date: 2026-06-14 · Source: SHARD-WP-0003 T2 · Subject: Wikibase (MediaWiki extension) and its flagship instance Wikidata, incl. the Wikidata Query Service (SPARQL).
Why this dive
Every structured shard so far tops out at typed records in a database: Notion's database-pages, XWiki's XObjects/classes, Trilium's typed relations, Roam/Logseq's attribute blocks. Wikibase is a different kind of structure altogether — a typed knowledge graph of entities and provenance-bearing statements, queried with SPARQL over an RDF projection. It is the far end of the structure spectrum and of the native-query spectrum, and it pushes provenance down to the individual assertion. The question for shard-wiki: what does a shard look like when its "page" is not prose but a set of statements, and what does the page model / adapter contract owe such a shard?
1. The data model — entities, statements, snaks
Entities are the top-level objects, each on its own MediaWiki page with a stable opaque ID:
- Item, e.g. Q42. Has multilingual labels / descriptions / aliases, a set of statements, and sitelinks (links to wiki articles). The label is annotation, not identity: Q42 is the identity, "Douglas Adams" is just its English label.
- Property, e.g. P31 ("instance of"). Also has labels/descriptions/aliases, plus a fixed datatype constraining its values (item-reference, string, time, globe-coordinate, quantity, monolingual-text, url, external-id, commons-media, …).
Statement = the unit of assertion on an item. Structure:
statement = claim + references[] + rank
claim = mainSnak + qualifiers[]
snak = property + snaktype + (value) # snaktype ∈ value | somevalue | novalue
- Main snak: the core property→value assertion (e.g. P31 → Q5, "human").
- Qualifiers: snaks that contextualize the claim without being the subject (validity time, "as of", determination method, units). E.g. population (P1082) = 8.4M, point in time (P585) = 2020.
- References: lists of snaks citing where the claim comes from (a source item, a URL, a page number). Provenance is attached to the individual statement, not the page.
- Rank: preferred | normal | deprecated. Relative importance among same-property statements (lets multiple, even contradictory, values coexist with a curation signal; the structured analogue of fedwiki's "chorus").
- Each statement carries a stable GUID (Q42$<uuid>), so statements are individually addressable.
somevalue (known to exist, value unknown) and novalue (known not to have a value) are
first-class — the model represents known-unknowns explicitly, which prose and most
DBs cannot.
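The structure above can be sketched as plain Python dataclasses. This is an illustrative model only, not Wikibase's own JSON serialization: the class and field names are assumptions chosen to mirror the grammar, and the GUID and reference URL are placeholders.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class SnakType(Enum):
    VALUE = "value"
    SOMEVALUE = "somevalue"  # known to exist, value unknown
    NOVALUE = "novalue"      # known not to have a value


class Rank(Enum):
    PREFERRED = "preferred"
    NORMAL = "normal"
    DEPRECATED = "deprecated"


@dataclass(frozen=True)
class Snak:
    prop: str                        # property ID, e.g. "P31"
    snaktype: SnakType = SnakType.VALUE
    value: Optional[object] = None   # set only when snaktype is VALUE

    def __post_init__(self):
        has_value = self.value is not None
        if (self.snaktype is SnakType.VALUE) != has_value:
            raise ValueError("value must be set if and only if snaktype is 'value'")


@dataclass
class Claim:
    main_snak: Snak
    qualifiers: list[Snak] = field(default_factory=list)


@dataclass
class Statement:
    guid: str                        # "<entity id>$<uuid>"
    claim: Claim
    references: list[list[Snak]] = field(default_factory=list)  # each reference is a list of snaks
    rank: Rank = Rank.NORMAL


# "Q42 is an instance of (P31) human (Q5)", with one placeholder reference.
human = Statement(
    guid="Q42$00000000-0000-0000-0000-000000000000",  # placeholder GUID
    claim=Claim(main_snak=Snak("P31", value="Q5")),
    references=[[Snak("P854", value="https://example.org/source")]],  # P854 = reference URL
)

# A known-unknown: a population value exists but is not known.
unknown_population = Snak("P1082", SnakType.SOMEVALUE)
```

The `__post_init__` check is a design choice of this sketch: it makes somevalue/novalue distinct states rather than an overloaded null.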
2. The RDF / SPARQL surface
Wikibase projects entities to RDF; the Wikidata Query Service (WDQS) is a Blazegraph triple store exposing a SPARQL endpoint. The projection is deliberately multi-layered:
- Truthy triples (wdt: prefix): the simple "best" value, for easy queries: wd:Q42 wdt:P31 wd:Q5.
- Full statements: reified so qualifiers/references/rank survive: wd:Q42 p:P31 ?stmt . ?stmt ps:P31 wd:Q5 ; pq:P585 ?time ; prov:wasDerivedFrom ?ref . (p: = statement node, ps: = statement value, pq: = qualifier, pr: / prov: = reference.)
- Federated SPARQL: the SERVICE <endpoint> { … } keyword runs a sub-query against another SPARQL endpoint and joins the results. Query-level federation is built into the query language, a different federation primitive from fedwiki's fork/neighborhood.
- EntitySchemas / ShEx: schemas (E-ids) that validate an entity's shape (Shape Expressions). Optional, declarative structure validation over the open graph.
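The two projection layers can be illustrated with a small in-memory sketch. It assumes the usual best-rank rule for truthy triples (preferred statements if any exist for a property, otherwise normal ones; deprecated statements never). The dict layout, the `project` function, the entity ID Q0, the statement-node naming and the population figures are all hypothetical, chosen only for illustration.

```python
# Lower number = better rank. Deprecated is absent: it is never truthy.
RANK_ORDER = {"preferred": 0, "normal": 1}


def project(entity_id, statements):
    """Return (subject, predicate, object) tuples for one entity."""
    triples = []

    # Full layer: every statement gets its own node, whatever its rank.
    for s in statements:
        node = "wds:" + s["guid"].replace("$", "-")  # node naming is an assumption
        triples.append((f"wd:{entity_id}", f"p:{s['property']}", node))
        triples.append((node, f"ps:{s['property']}", s["value"]))
        for qprop, qval in s.get("qualifiers", []):
            triples.append((node, f"pq:{qprop}", qval))

    # Truthy layer: only the best-ranked, non-deprecated statements per property.
    by_prop = {}
    for s in statements:
        if s["rank"] in RANK_ORDER:
            by_prop.setdefault(s["property"], []).append(s)
    for prop, group in by_prop.items():
        best = min(RANK_ORDER[s["rank"]] for s in group)
        for s in group:
            if RANK_ORDER[s["rank"]] == best:
                triples.append((f"wd:{entity_id}", f"wdt:{prop}", s["value"]))
    return triples


statements = [
    {"guid": "Q0$a", "property": "P1082", "value": "8100000",
     "rank": "normal", "qualifiers": [("P585", "2010")]},
    {"guid": "Q0$b", "property": "P1082", "value": "8400000",
     "rank": "preferred", "qualifiers": [("P585", "2020")]},
    {"guid": "Q0$c", "property": "P1082", "value": "9999999",
     "rank": "deprecated"},
]
triples = project("Q0", statements)
truthy = [t for t in triples if t[1].startswith("wdt:")]
```

Here `truthy` holds only the preferred 2020 value, while all three statements, including the deprecated one, remain reachable through the `p:` / `ps:` / `pq:` layer.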
3. Storage, identity, history
- Storage: each entity is a JSON blob stored as a MediaWiki page (Item: / Property: content model). The RDF/SPARQL store is a derived index rebuilt from these canonical JSON entities (an update stream feeds WDQS), which is exactly shard-wiki's "derived query index over a canonical store" pattern (UC-63), at planet scale.
- Identity: the opaque Q/P/L IDs are the identity, fully decoupled from human-readable labels and from language. This is the cleanest real-world instance of stable, language-neutral identity ≠ label/placement, and a strong reinforcement of our identity model (T16).
- History: because each entity is one MediaWiki page, history is page-level MediaWiki revisions: every edit is a full-entity JSON snapshot with author/timestamp/comment. History granularity is coarse (whole entity per revision), but the edit API is fine-grained (wbsetclaim, wbeditentity patch individual statements). So: fine write API over a coarse history unit, a distinct point on the write/history spectra.
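A toy version of this arrangement, a statement-level write that still produces a whole-entity revision and feeds a derived index, might look like the following. It is a minimal sketch, not the MediaWiki or WDQS implementation: `EntityStore`, `TripleIndex` and the local `set_claim` method are hypothetical names (the method is only loosely modelled on wbsetclaim), and P0 is a made-up property.

```python
import copy


class EntityStore:
    """Canonical store: every edit appends a full-entity snapshot."""

    def __init__(self):
        self.revisions = {}   # entity id -> list of revision dicts
        self.listeners = []   # update stream consumers

    def set_claim(self, entity_id, guid, prop, value, author):
        history = self.revisions.setdefault(entity_id, [])
        if history:
            entity = copy.deepcopy(history[-1]["entity"])  # start from latest snapshot
        else:
            entity = {"id": entity_id, "claims": {}}
        entity["claims"][guid] = {"property": prop, "value": value}  # fine-grained write
        revision = {"rev": len(history) + 1, "author": author, "entity": entity}
        history.append(revision)                                     # coarse history unit
        for listener in self.listeners:
            listener(entity)                                         # feed derived indexes
        return revision


class TripleIndex:
    """Derived index: rebuilt per entity from the canonical snapshot."""

    def __init__(self):
        self.by_entity = {}

    def on_change(self, entity):
        self.by_entity[entity["id"]] = {
            (entity["id"], c["property"], c["value"])
            for c in entity["claims"].values()
        }

    def query(self, prop, value):
        return sorted(
            s for triples in self.by_entity.values()
            for (s, p, o) in triples if p == prop and o == value
        )


store = EntityStore()
index = TripleIndex()
store.listeners.append(index.on_change)
store.set_claim("Q42", "Q42$a", "P31", "Q5", author="alice")
store.set_claim("Q42", "Q42$b", "P0", "example", author="bob")  # P0 is hypothetical
```

After the two writes the store holds two full snapshots of Q42, and the index can be thrown away and rebuilt from the latest snapshot of each entity at any time.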
4. Capability profile
| Dimension (synthesis spectrum) | Wikibase / Wikidata |
|---|---|
| Attachment mode | external-API (MediaWiki Action API + REST) and a derived SPARQL endpoint; self-hostable |
| Addressing granularity | statement (each has a GUID) within an entity (Q/P id) |
| Content identity | stable opaque ID (Q/P/L); labels are multilingual annotations |
| Identity vs placement | fully separated — identity is language- and label-neutral |
| Structure | typed knowledge graph: entities + statements (claim+qualifiers+refs+rank) |
| History | page-level revisions (whole-entity JSON snapshots); fine-grained edit API |
| Merge model | MediaWiki last-writer / edit-conflict; rank lets contradictory values coexist |
| Native query | SPARQL (RDF) + federated SERVICE cross-endpoint join — the far end |
| Translation | not Markdown — content is statements; render to prose is a lossy projection |
| Attachment/write granularity | statement-level writes via API; coarse history unit |
| Operational envelope | huge derived index (Blazegraph), rate-limited public endpoints |
| Access grant | open read; MediaWiki user/permission model for write; self-host = own ACL |
| Content opacity | transparent (public JSON + RDF); not encrypted |
| Provenance | statement-level — references + rank per assertion (new far end) |
5. INTENT mapping
Reinforcements
- Stable identity ≠ placement (T16): Q/P IDs decoupled from labels/language are the textbook case — adopt the principle that a page's identity is an opaque stable handle, display names are annotations.
- Derived index over canonical store (UC-63): WDQS is exactly a SPARQL index rebuilt from canonical JSON entities via an update stream — validates the projection pattern.
- Union without erasure / chorus: rank lets multiple (even contradictory) statements coexist with a curation signal rather than forcing one truth — the structured analogue of fedwiki's chorus (UC-72) and our "view multiple versions" (UC-27).
- Mechanism over policy: references + rank are mechanism for representing disagreement and sourcing; which statement "wins" is left to the consumer/query.
Divergences (boundaries / design notes)
- Content is not Markdown. A Wikibase "page" is a set of statements; there is no prose body. This is the structure far-end: shard-wiki must either (a) treat such a shard as a structured/typed shard projected to a lossy Markdown/table rendering (UC-55/UC-73), or (b) model a page whose payload is typed statements (T12). Forcing it into Markdown-first erases the graph — a design-bug if done silently; render-with-provenance instead.
- Provenance granularity is finer than ours. Our provenance is per-page/per-shard; Wikibase is per-statement (references) and even per-value (rank). The page model and coordination journal should allow sub-page provenance (UC-75) even if MVP records it per page.
- Query is graph, not text/datalog. SPARQL over RDF (with federated SERVICE) is a richer query far-end than Roam/Logseq datalog or Notion filters (UC-52), and its SERVICE federation is a query-time cross-shard join, distinct from fedwiki structural federation. Note both as native-query tiers.
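To make the query-time join concrete, here is a small helper that assembles a federated query string. It only builds text and does not contact any endpoint. The function name, the example.org endpoint and the vocabulary IRI are hypothetical, and the `wd:` / `wdt:` prefixes are assumed to be predeclared by the target service (a stand-alone query would need explicit PREFIX lines).

```python
def federated_query(select_vars, local_pattern, endpoint, remote_pattern):
    """Build a SPARQL SELECT that joins a local pattern with a SERVICE block."""
    if any(ch in endpoint for ch in '<> \n\t"'):
        raise ValueError("endpoint must be a plain IRI")
    projection = " ".join(f"?{v}" for v in select_vars)
    return (
        f"SELECT {projection} WHERE {{\n"
        f"  {local_pattern}\n"
        f"  SERVICE <{endpoint}> {{\n"
        f"    {remote_pattern}\n"
        f"  }}\n"
        f"}}"
    )


query = federated_query(
    ["item", "note"],
    "?item wdt:P31 wd:Q5 .",                           # local: instances of human
    "https://example.org/sparql",                      # hypothetical remote endpoint
    "?item <https://example.org/vocab/note> ?note .",  # hypothetical remote vocabulary
)
print(query)
```

The shared variable `?item` is what makes this a join: the remote endpoint contributes bindings for the same entities the local pattern selects.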
What to keep
- Opaque stable identity, labels-as-annotations as the identity model (T16).
- Statement/assertion-level provenance (references) and a coexistence-with-rank model as the structured form of union-without-erasure (UC-75).
- Derived SPARQL/graph index over a canonical entity store as a projection pattern (UC-63/UC-74), incl. federated query as a first-class federation mode.
- A typed-graph page payload option in the page model (T12), with lossy render-to-Markdown as the projection (never silent flattening).
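One way to picture a non-silent lossy projection is a renderer that flattens statements into a Markdown table while keeping the statement GUID, rank, and a reference count visible, and that says outright that it is a projection. This is a sketch under assumed input shapes: the dict layout, the `render_markdown` name and the Q0 example data are hypothetical.

```python
def render_markdown(entity_id, label, statements):
    """Render statements as a Markdown table that stays traceable to the graph."""
    lines = [
        f"## {label} ({entity_id})",
        "",
        "| Statement | Property | Value | Rank | Qualifiers | References |",
        "|---|---|---|---|---|---|",
    ]
    for s in statements:
        quals = "; ".join(f"{p}={v}" for p, v in s.get("qualifiers", [])) or "none"
        ref_count = len(s.get("references", []))
        lines.append(
            f"| {s['guid']} | {s['property']} | {s['value']} "
            f"| {s['rank']} | {quals} | {ref_count} |"
        )
    lines.append("")
    lines.append(
        f"_Lossy projection of {len(statements)} statement(s); "
        "the canonical form is the graph._"
    )
    return "\n".join(lines)


page = render_markdown(
    "Q0",
    "Example city",
    [
        {"guid": "Q0$b", "property": "P1082", "value": "8400000", "rank": "preferred",
         "qualifiers": [("P585", "2020")], "references": [["P854"]]},
        {"guid": "Q0$c", "property": "P1082", "value": "9999999", "rank": "deprecated"},
    ],
)
print(page)
```

Keeping the GUID column means each rendered row can be traced back to an addressable statement, so the flattening is visible and reversible by lookup rather than silent.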
6. UC seeds
| # | Seed | Disposition |
|---|---|---|
| UC-73 | Attach a Wikibase as a typed entity-statement (RDF) shard (items/properties/statements w/ qualifiers); project to a rendered page view, lossy to Markdown, preserving the graph | new |
| UC-74 | Graph-query the union via SPARQL and federate queries across endpoints (SERVICE): graph query as a native-query tier + query-time cross-shard join | new |
| UC-75 | Preserve statement-level provenance — references + rank attached to each assertion (sub-page provenance granularity) | new |
| — | typed records → typed graph entities | enrich UC-34 |
| — | inter-record relations → typed graph edges with qualifiers | enrich UC-58 |
| — | native query → SPARQL/RDF + federated SERVICE | enrich UC-52 |
| — | provenance → statement/assertion granularity | enrich UC-24 |
7. Architecture notes for SHARD-WP-0002
- T12 (structured/typed page model): add a typed-graph payload tier above typed-records: a page whose content is entities + statements (claim + qualifiers + references + rank), with somevalue/novalue known-unknowns. Render-to-Markdown is a lossy projection, not the canonical form.
- T16 (identity / addressing): adopt opaque stable identity with labels-as-annotation (Q/P model); record statement GUIDs as an example of sub-page addressable units.
- Native-query tiering: SPARQL/RDF + federated SERVICE is the graph far-end of the query spectrum (above datalog/filters); SERVICE is also a query-time federation mode to sit beside fedwiki's structural federation.
- Provenance model: allow per-statement references + rank (sub-page provenance, coexistence-with-curation) in the union, even if MVP collapses to per-page.
- Derived index: WDQS = canonical JSON entities → update stream → Blazegraph SPARQL index; the reference implementation of UC-63 at scale (per-shard or core-built index, Q16).
8. Open questions
- Does shard-wiki model a typed-graph page natively (T12), or always treat Wikibase as a structured shard projected to a Markdown/table rendering (UC-55), or both (canonical graph + lossy view)?
- Is SPARQL/graph query exposed as a union-level capability (translate to a common query layer) or only as a pass-through to graph-capable shards? How does federated SERVICE relate to shard-wiki's own cross-shard query?
- At what granularity does the coordination journal record provenance: per page (MVP), per statement (Wikibase-native), or configurable?
- Is rank (coexisting contradictory values w/ curation) representable in the union as a first-class "chorus of statements," unifying with fedwiki's page-level chorus (UC-72/27)?
9. Sources
- Wikibase/DataModel and DataModel/Primer — mediawiki.org
- Help:Qualifiers; Wikidata SPARQL query service + Query Help; SPARQL tutorial — wikidata.org
- Wikidata Query Service / User Manual — mediawiki.org; Wikitech (Blazegraph, updater)
- "The wikibase model" — Vanderbilt Libraries Digital Lab (heardlibrary.github.io)
- RaiseWikibase — Wikibase Data Model functions (ub-mannheim.github.io)
- WShEx / EntitySchemas (ShEx) — arxiv.org/abs/2208.02697, ceur-ws.org Vol-3262
10. Traceability
New UCs UC-73–UC-75 carry the marker ⬡ in the wikiengines column of
spec/UseCaseCatalog.md. Enriched: UC-34, UC-58, UC-52, UC-24. Architecture cross-refs:
SHARD-WP-0002 T12, T16, native-query tiering, provenance model, UC-63 derived index.