shard-wiki/research/260614-wikibase-deep-dive/findings.md
tegwick a6629bdb29 research: Wikibase/Wikidata deep dive (entity-statement graph, SPARQL); UC-73-75
SHARD-WP-0003 T2. Structure & native-query far-end: typed knowledge graph
(items/properties, statements = claim+qualifiers+references+rank), RDF
projection + SPARQL (WDQS/Blazegraph) incl. federated SERVICE, opaque stable
Q/P identity (labels-as-annotation), statement-level provenance. UC-73
(typed-graph shard, lossy render), UC-74 (SPARQL + federated query), UC-75
(per-assertion provenance). Enriched UC-34/58/52/24. Marks T2 done.
Feeds SHARD-WP-0002 T12/T16.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 19:31:08 +02:00

# Wikibase / Wikidata — deep dive (findings)
**Date:** 2026-06-14 · **Source:** SHARD-WP-0003 T2 · **Subject:** Wikibase (MediaWiki
extension) and its flagship instance **Wikidata**, incl. the Wikidata Query Service (SPARQL).
## Why this dive
Every structured shard so far tops out at *typed records in a database*: Notion's
database-pages, XWiki's XObjects/classes, Trilium's typed relations, Roam/Logseq's
attribute blocks. Wikibase is a *different kind of structure altogether* — a **typed
knowledge graph of entities and provenance-bearing statements**, queried with **SPARQL**
over an **RDF** projection. It is the **far end of the structure spectrum** and of the
**native-query spectrum**, and it pushes **provenance down to the individual assertion**.
The question for shard-wiki: what does a shard look like when its "page" is *not prose but a
set of statements*, and what does the page model / adapter contract owe such a shard?
## 1. The data model — entities, statements, snaks
**Entities** are the top-level objects, each on its own MediaWiki page with a **stable
opaque ID**:
- **Item** — `Q42`. Has multilingual **labels / descriptions / aliases**, a set of
**statements**, and **sitelinks** (links to wiki articles). The label is *annotation*,
not identity — `Q42` is the identity, "Douglas Adams" is just its English label.
- **Property** — `P31` ("instance of"). Also has labels/descriptions/aliases, plus a
**fixed datatype** constraining its values (item-reference, string, time,
globe-coordinate, quantity, monolingual-text, url, external-id, commons-media, …).
**Statement** = the unit of assertion on an entity (items and properties both carry statements). Structure:
```
statement = claim + references[] + rank
claim = mainSnak + qualifiers[]
snak = property + snaktype + (value) # snaktype ∈ value | somevalue | novalue
```
- **Main snak**: the core property→value assertion (e.g. `P31` → `Q5`, i.e. "instance of: human").
- **Qualifiers** — snaks that *contextualize* the claim without being the subject (validity
time, "as of", determination method, units). E.g. *population (P1082) = 8.4M, **point in
time (P585) = 2020***.
- **References** — lists of snaks citing **where the claim comes from** (a source item, a
URL, a page number). **Provenance attached to the individual statement, not the page.**
- **Rank** — `preferred` | `normal` | `deprecated`: relative importance among same-property
statements (lets multiple, even contradictory, values coexist with a curation signal —
the structured analogue of fedwiki's "chorus").
- Each statement carries a **stable GUID** (`Q42$<uuid>`), so statements are individually
addressable.
`somevalue` (known to exist, value unknown) and `novalue` (known *not* to have a value) are
**first-class** — the model represents *known-unknowns* explicitly, which prose and most
DBs cannot.
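
The grammar above can be sketched as plain Python types. This is a minimal illustration, not Wikibase's actual JSON schema: the class and field names are assumptions, values are simplified to strings (real values are typed by the property's datatype), and the item ID, GUID and reference URL in the example are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class SnakType(Enum):
    VALUE = "value"          # a concrete value is given
    SOMEVALUE = "somevalue"  # a value exists but is not known
    NOVALUE = "novalue"      # known to have no value


class Rank(Enum):
    PREFERRED = "preferred"
    NORMAL = "normal"
    DEPRECATED = "deprecated"


@dataclass(frozen=True)
class Snak:
    property_id: str
    snaktype: SnakType = SnakType.VALUE
    value: Optional[str] = None  # simplification: a string stands in for a typed value

    def __post_init__(self):
        # Only a "value" snak may carry a value; the two known-unknowns must not.
        has_value = self.value is not None
        if (self.snaktype is SnakType.VALUE) != has_value:
            raise ValueError("a value is required for 'value' snaks and forbidden otherwise")


@dataclass
class Claim:
    main_snak: Snak
    qualifiers: list[Snak] = field(default_factory=list)


@dataclass
class Statement:
    guid: str                                   # shaped like "<entity id>$<uuid>"
    claim: Claim
    references: list[list[Snak]] = field(default_factory=list)  # each reference is a list of snaks
    rank: Rank = Rank.NORMAL


# The population example from the text, on a hypothetical item.
population = Statement(
    guid="Q0$hypothetical-guid",
    claim=Claim(
        main_snak=Snak("P1082", value="8400000"),
        qualifiers=[Snak("P585", value="2020")],
    ),
    references=[[Snak("P854", value="https://example.org/census")]],
)

# A known-unknown: the property applies, but the value is not known.
unknown_birth_date = Snak("P569", SnakType.SOMEVALUE)
```

Keeping `snaktype` separate from `value` is what lets "unknown" and "none" be stated explicitly instead of being collapsed into a missing field.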
## 2. The RDF / SPARQL surface
Wikibase **projects entities to RDF**; the **Wikidata Query Service (WDQS)** is a
**Blazegraph** triple store exposing a **SPARQL** endpoint. The projection is deliberately
multi-layered:
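- **Truthy** triples (`wdt:` prefix): only the best-ranked values (preferred if any exist, otherwise normal, never deprecated), with qualifiers and references stripped, for easy queries:
`wd:Q42 wdt:P31 wd:Q5`.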
- **Full** statements: reified so qualifiers/references/rank survive: `wd:Q42 p:P31
?stmt . ?stmt ps:P31 wd:Q5 ; pq:P585 ?time ; prov:wasDerivedFrom ?ref`. (`p:` links the entity to its statement
node, `ps:` gives the statement value, `pq:` a qualifier; `prov:wasDerivedFrom` links to a reference node, whose own snaks use `pr:`.)
- **Federated SPARQL** — the `SERVICE <endpoint> { … }` keyword runs a sub-query against
*another* SPARQL endpoint and joins the results. **Query-level federation is built into
the query language** — a different federation primitive from fedwiki's fork/neighborhood.
- **EntitySchemas / ShEx** — schemas (`E`-ids) that *validate* an entity's shape (Shape
Expressions). Optional, declarative structure validation over the open graph.
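
The two projection layers can be illustrated with a small offline sketch that turns one item-valued statement into prefixed triples. The function name and argument shapes are assumptions made for illustration; the statement and reference node names are hypothetical placeholders, and only item values are handled.

```python
def project_statement(entity, prop, value, stmt_node, rank,
                      qualifiers=(), references=(), is_best_rank=True):
    """Return (truthy, full) lists of (subject, predicate, object) triples.

    Assumption: `value` is an item ID; literals and other datatypes are ignored.
    """
    truthy = []
    if is_best_rank:
        # Truthy layer: one direct triple, no qualifiers, references or rank.
        truthy.append((f"wd:{entity}", f"wdt:{prop}", f"wd:{value}"))

    # Full layer: the statement becomes a node of its own.
    full = [
        (f"wd:{entity}", f"p:{prop}", stmt_node),
        (stmt_node, f"ps:{prop}", f"wd:{value}"),
        (stmt_node, "wikibase:rank", rank),
    ]
    for q_prop, q_value in qualifiers:
        full.append((stmt_node, f"pq:{q_prop}", q_value))
    for ref_node in references:
        full.append((stmt_node, "prov:wasDerivedFrom", ref_node))
    return truthy, full


truthy, full = project_statement(
    "Q42", "P31", "Q5",
    stmt_node="wds:Q42-hypothetical-guid",      # hypothetical statement node
    rank="wikibase:NormalRank",
    references=["wdref:hypothetical-hash"],     # hypothetical reference node
)
```

A deprecated or out-ranked statement would be projected with `is_best_rank=False`: it stays queryable through the full layer but contributes nothing to the truthy one.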
## 3. Storage, identity, history
- **Storage:** each entity is a **JSON blob stored as a MediaWiki page** (`Item:` /
`Property:` content model). The RDF/SPARQL store is a **derived index** rebuilt from these
canonical JSON entities (an *update stream* feeds WDQS) — exactly shard-wiki's
"derived query index over a canonical store" pattern (UC-63), at planet scale.
- **Identity:** the **opaque Q/P/L IDs** (items, properties, lexemes) **are the identity**, fully decoupled from
human-readable labels and from language. This is the cleanest real-world instance of
**stable, language-neutral identity ≠ label/placement** — a strong reinforcement of our
identity model (T16).
- **History:** because each entity is one MediaWiki page, history is **page-level MediaWiki
revisions** — every edit is a full-entity JSON snapshot with author/timestamp/comment.
*Coarse* history granularity (whole entity per revision), but the **edit API is
fine-grained** (`wbsetclaim`, `wbeditentity` patch individual statements). So: **fine
write API over a coarse history unit** — a distinct point on the write/history spectra.
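
The combination described in this section (statement-level writes, whole-entity history, and a query index derived from the canonical store) can be sketched in a few lines. This is a toy in-memory model under stated assumptions: `EntityStore`, `TripleIndex` and their methods are hypothetical names, `set_claim` only loosely mirrors what `wbsetclaim` does, and the property/item IDs in the usage are illustrative.

```python
import copy


class EntityStore:
    """Canonical store: every write appends a whole-entity snapshot."""

    def __init__(self):
        self.revisions = {}   # entity id -> list of full snapshots, oldest first
        self.listeners = []   # callbacks standing in for the update stream

    def set_claim(self, entity_id, guid, prop, value):
        """Fine-grained write: change one statement, record a full revision."""
        history = self.revisions.setdefault(entity_id, [])
        if history:
            snapshot = copy.deepcopy(history[-1])  # never mutate an old revision
        else:
            snapshot = {"id": entity_id, "claims": {}}
        snapshot["claims"][guid] = {"property": prop, "value": value}
        history.append(snapshot)
        for notify in self.listeners:
            notify(copy.deepcopy(snapshot))


class TripleIndex:
    """Derived index: disposable, rebuilt per entity from the latest snapshot."""

    def __init__(self):
        self.by_entity = {}

    def apply(self, snapshot):
        # Replace the entity's triples wholesale rather than patching them.
        self.by_entity[snapshot["id"]] = {
            (snapshot["id"], c["property"], c["value"])
            for c in snapshot["claims"].values()
        }

    def subjects(self, prop, value):
        return sorted(
            s for triples in self.by_entity.values()
            for (s, p, o) in triples if p == prop and o == value
        )


store = EntityStore()
index = TripleIndex()
store.listeners.append(index.apply)

store.set_claim("Q42", "Q42$a", "P31", "Q5")
store.set_claim("Q42", "Q42$b", "P106", "Q36180")
```

After the two writes the store holds two full revisions of `Q42`, while the index holds only the triples of the latest one; dropping the index and replaying the latest snapshots would rebuild it.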
## 4. Capability profile
| Dimension (synthesis spectrum) | Wikibase / Wikidata |
|--------------------------------|---------------------|
| Attachment mode | **external-API** (MediaWiki Action API + REST) **and** a derived **SPARQL endpoint**; self-hostable |
| Addressing granularity | **statement** (each has a GUID) within an **entity** (Q/P id) |
| Content identity | **stable opaque ID** (Q/P/L); labels are multilingual annotations |
| Identity vs placement | **fully separated** — identity is language- and label-neutral |
| Structure | **typed knowledge graph**: entities + statements (claim+qualifiers+refs+rank) |
| History | **page-level revisions** (whole-entity JSON snapshots); fine-grained edit API |
| Merge model | MediaWiki last-writer / edit-conflict; rank lets contradictory values coexist |
| Native query | **SPARQL** (RDF) + **federated `SERVICE`** cross-endpoint join — the far end |
| Translation | **not Markdown** — content *is* statements; render to prose is a lossy projection |
| Attachment/write granularity | **statement-level writes** via API; coarse history unit |
| Operational envelope | huge derived index (Blazegraph), rate-limited public endpoints |
| Access grant | open read; MediaWiki user/permission model for write; self-host = own ACL |
| Content opacity | transparent (public JSON + RDF); not encrypted |
| Provenance | **statement-level** — references + rank per assertion (new far end) |
## 5. INTENT mapping
### Reinforcements
- **Stable identity ≠ placement** (T16): Q/P IDs decoupled from labels/language are the
textbook case — adopt the principle that a page's *identity* is an opaque stable handle,
display names are annotations.
- **Derived index over canonical store** (UC-63): WDQS is exactly a SPARQL index rebuilt
from canonical JSON entities via an update stream — validates the projection pattern.
- **Union without erasure / chorus**: **rank** lets multiple (even contradictory) statements
coexist with a curation signal rather than forcing one truth — the *structured* analogue
of fedwiki's chorus (UC-72) and our "view multiple versions" (UC-27).
- **Mechanism over policy**: references + rank are *mechanism* for representing disagreement
and sourcing; which statement "wins" is left to the consumer/query.
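
To make the mechanism/policy split concrete, here is a small sketch of two consumer policies applied to the same ranked statements. The dictionary shape and function names are assumptions, and the population figures are made up for illustration; `best_rank` follows the usual "preferred if any, otherwise normal" reading of rank.

```python
def best_rank(statements):
    """Preferred statements if any exist, otherwise normal ones; never deprecated."""
    preferred = [s for s in statements if s["rank"] == "preferred"]
    if preferred:
        return preferred
    return [s for s in statements if s["rank"] == "normal"]


def all_current(statements):
    """Every statement that has not been deprecated, regardless of preference."""
    return [s for s in statements if s["rank"] != "deprecated"]


# Hypothetical population statements for one item.
population = [
    {"value": "8.1M", "year": 2010, "rank": "normal"},
    {"value": "8.4M", "year": 2020, "rank": "preferred"},
    {"value": "9.9M", "year": 2020, "rank": "deprecated"},
]

headline = [s["value"] for s in best_rank(population)]
timeline = [s["value"] for s in all_current(population)]
```

The stored data is identical in both cases; only the consumer's selection differs, which is the property the union model wants to preserve.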
### Divergences (boundaries / design notes)
- **Content is not Markdown.** A Wikibase "page" is a set of statements; there is no prose
body. This is the **structure far-end**: shard-wiki must either (a) treat such a shard as
a **structured/typed shard** projected to a *lossy* Markdown/table rendering (UC-55/UC-73),
or (b) model a page whose payload is typed statements (T12). Forcing it into Markdown-first
erases the graph — a design-bug if done silently; render-with-provenance instead.
- **Provenance granularity is finer than ours.** Our provenance is per-page/per-shard;
Wikibase is **per-statement**: each statement carries its own references and its own rank. The page model and
coordination journal should *allow* sub-page provenance (UC-75) even if MVP records it per
page.
- **Query is graph, not text/datalog.** SPARQL over RDF (with federated `SERVICE`) is a
richer query far-end than Roam/Logseq datalog or Notion filters (UC-52) — and its
`SERVICE` federation is a *query-time* cross-shard join, distinct from fedwiki structural
federation. Note both as native-query tiers.
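
The query-time join in the last point can be sketched as a helper that composes a federated SPARQL query string. Nothing is sent over the network here. The helper name, the remote endpoint URL and the remote vocabulary are hypothetical; `wdt:P31`, `wd:Q5` and `wdt:P214` (VIAF ID) are used only as a plausible local pattern.

```python
def federated_query(variables, local_pattern, remote_endpoint, remote_pattern):
    """Compose a SELECT query whose second pattern runs on another endpoint."""
    select = " ".join(f"?{name}" for name in variables)
    return (
        f"SELECT {select} WHERE {{\n"
        f"  {local_pattern}\n"
        f"  SERVICE <{remote_endpoint}> {{\n"
        f"    {remote_pattern}\n"
        f"  }}\n"
        f"}}"
    )


query = federated_query(
    variables=["item", "title"],
    # Evaluated by the local store: humans with an external identifier.
    local_pattern="?item wdt:P31 wd:Q5 ; wdt:P214 ?viaf .",
    # Hypothetical remote endpoint and vocabulary.
    remote_endpoint="https://example.org/sparql",
    remote_pattern="?record <https://example.org/ns/viaf> ?viaf ; "
                   "<https://example.org/ns/title> ?title .",
)
```

The shared variable `?viaf` is the join key: it is bound by the local pattern and matched again inside the `SERVICE` block. Public endpoints may restrict which remote endpoints a query is allowed to call, so a pass-through design should treat `SERVICE` availability as a per-shard capability.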
### What to keep
1. **Opaque stable identity, labels-as-annotations** as the identity model (T16).
2. **Statement/assertion-level provenance** (references) and a **coexistence-with-rank**
model as the structured form of union-without-erasure (UC-75).
3. **Derived SPARQL/graph index over a canonical entity store** as a projection pattern
(UC-63/UC-74), incl. **federated query** as a first-class federation mode.
4. A **typed-graph page payload** option in the page model (T12), with **lossy
render-to-Markdown** as the projection (never silent flattening).
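
Point 4 can be sketched as a renderer that produces a Markdown table and reports, instead of hiding, what the table leaves out. This is an assumed shape rather than a shard-wiki API: `render_markdown`, the statement dictionaries and the example reference URL are hypothetical.

```python
def render_markdown(entity_id, label, statements):
    """Return (markdown, dropped); `dropped` lists what the view omits."""
    lines = [
        f"## {label} ({entity_id})",
        "",
        "| Property | Value | Rank | Refs |",
        "|---|---|---|---|",
    ]
    dropped = []
    for s in statements:
        lines.append(
            f"| {s['property']} | {s['value']} | {s['rank']} | {len(s['references'])} |"
        )
        # The table keeps counts only; record every detail it flattens away.
        if s["qualifiers"]:
            dropped.append(f"{s['guid']}: {len(s['qualifiers'])} qualifier(s)")
        if s["references"]:
            dropped.append(f"{s['guid']}: reference details")
    if dropped:
        lines += ["", "> Lossy view. Omitted: " + "; ".join(dropped)]
    return "\n".join(lines), dropped


markdown, dropped = render_markdown(
    "Q42",
    "Douglas Adams",
    [
        {
            "guid": "Q42$a",
            "property": "P31",
            "value": "Q5",
            "rank": "normal",
            "qualifiers": [],
            "references": ["https://example.org/bio"],  # hypothetical source
        }
    ],
)
```

Returning `dropped` alongside the text keeps the loss visible to callers, so a union view can link back to the canonical graph rather than presenting the table as complete.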
## 6. UC seeds
| # | Seed | Disposition |
|---|------|-------------|
| UC-73 | Attach a **Wikibase** as a **typed entity-statement (RDF) shard** (items/properties/statements w/ qualifiers); project to a rendered page view, lossy to Markdown, preserving the graph | **new** |
| UC-74 | **Graph-query the union** via **SPARQL** and **federate queries across endpoints** (`SERVICE`) — graph query as a native-query tier + query-time cross-shard join | **new** |
| UC-75 | Preserve **statement-level provenance** — references + rank attached to each assertion (sub-page provenance granularity) | **new** |
| — | typed records → typed *graph* entities | enrich **UC-34** |
| — | inter-record relations → typed graph edges with qualifiers | enrich **UC-58** |
| — | native query → SPARQL/RDF + federated SERVICE | enrich **UC-52** |
| — | provenance → statement/assertion granularity | enrich **UC-24** |
## 7. Architecture notes for SHARD-WP-0002
- **T12 (structured/typed page model):** add a **typed-graph payload** tier above
typed-records — a page whose content is **entities + statements (claim + qualifiers +
references + rank)**, with `somevalue`/`novalue` known-unknowns. Render-to-Markdown is a
**lossy projection**, not the canonical form.
- **T16 (identity / addressing):** adopt **opaque stable identity with labels-as-annotation**
(Q/P model); record **statement GUIDs** as an example of *sub-page addressable units*.
- **Native-query tiering:** SPARQL/RDF + federated `SERVICE` is the **graph far-end** of the
query spectrum (above datalog/filters); `SERVICE` is also a **query-time federation**
mode to sit beside fedwiki's structural federation.
- **Provenance model:** allow **per-statement references + rank** (sub-page provenance,
coexistence-with-curation) in the union, even if MVP collapses to per-page.
- **Derived index:** WDQS = canonical JSON entities → update stream → Blazegraph SPARQL
index; the reference implementation of UC-63 at scale (per-shard or core-built index, Q16).
## 8. Open questions
1. Does shard-wiki model a **typed-graph page** natively (T12), or always treat Wikibase as
a structured shard **projected to a Markdown/table rendering** (UC-55), or both
(canonical graph + lossy view)?
2. Is **SPARQL/graph query** exposed as a union-level capability (translate to a common
query layer) or only as a **pass-through** to graph-capable shards? How does federated
`SERVICE` relate to shard-wiki's own cross-shard query?
3. At what granularity does the coordination journal record **provenance** — per page
(MVP), per statement (Wikibase-native), or configurable?
4. Is **rank** (coexisting contradictory values w/ curation) representable in the union as a
first-class "chorus of statements," unifying with fedwiki's page-level chorus (UC-72/27)?
## 9. Sources
- Wikibase/DataModel and **DataModel/Primer** — mediawiki.org
- Help:Qualifiers; Wikidata SPARQL query service + Query Help; SPARQL tutorial — wikidata.org
- Wikidata Query Service / User Manual — mediawiki.org; Wikitech (Blazegraph, updater)
- "The wikibase model" — Vanderbilt Libraries Digital Lab (heardlibrary.github.io)
- RaiseWikibase — Wikibase Data Model functions (ub-mannheim.github.io)
- WShEx / EntitySchemas (ShEx) — arxiv.org/abs/2208.02697, ceur-ws.org Vol-3262
## 10. Traceability
New UCs **UC-73 to UC-75** carry the marker **⬡** in the wikiengines column of
`spec/UseCaseCatalog.md`. Enriched: UC-34, UC-58, UC-52, UC-24. Architecture cross-refs:
SHARD-WP-0002 T12, T16, native-query tiering, provenance model, UC-63 derived index.