# Wikibase / Wikidata — deep dive (findings)

**Date:** 2026-06-14 · **Source:** SHARD-WP-0003 T2 · **Subject:** Wikibase (MediaWiki
extension) and its flagship instance **Wikidata**, incl. the Wikidata Query Service (SPARQL).

## Why this dive

Every structured shard so far tops out at *typed records in a database*: Notion's
database-pages, XWiki's XObjects/classes, Trilium's typed relations, Roam/Logseq's
attribute blocks. Wikibase is a *different kind of structure altogether* — a **typed
knowledge graph of entities and provenance-bearing statements**, queried with **SPARQL**
over an **RDF** projection. It is the **far end of the structure spectrum** and of the
**native-query spectrum**, and it pushes **provenance down to the individual assertion**.
The question for shard-wiki: what does a shard look like when its "page" is *not prose but a
set of statements*, and what does the page model / adapter contract owe such a shard?

## 1. The data model — entities, statements, snaks

**Entities** are the top-level objects, each on its own MediaWiki page with a **stable
opaque ID**:

- **Item** — `Q42`. Has multilingual **labels / descriptions / aliases**, a set of
  **statements**, and **sitelinks** (links to wiki articles). The label is *annotation*,
  not identity — `Q42` is the identity, "Douglas Adams" is just its English label.
- **Property** — `P31` ("instance of"). Also has labels/descriptions/aliases, plus a
  **fixed datatype** constraining its values (item-reference, string, time,
  globe-coordinate, quantity, monolingual-text, url, external-id, commons-media, …).

**Statement** = the unit of assertion on an item. Structure:

```
statement = claim + references[] + rank
claim     = mainSnak + qualifiers[]
snak      = property + snaktype + (value)   # snaktype ∈ value | somevalue | novalue
```

- **Main snak** — the core property→value assertion (e.g. `P31` → `Q5` "human").
- **Qualifiers** — snaks that *contextualize* the claim without being the subject (validity
  time, "as of", determination method, units). E.g. *population (P1082) = 8.4M, **point in
  time (P585) = 2020***.
- **References** — lists of snaks citing **where the claim comes from** (a source item, a
  URL, a page number). **Provenance attached to the individual statement, not the page.**
- **Rank** — `preferred` | `normal` | `deprecated`: relative importance among same-property
  statements (lets multiple, even contradictory, values coexist with a curation signal —
  the structured analogue of fedwiki's "chorus").
- Each statement carries a **stable GUID** (`Q42$<uuid>`), so statements are individually
  addressable.

`somevalue` (known to exist, value unknown) and `novalue` (known *not* to have a value) are
**first-class** — the model represents *known-unknowns* explicitly, which prose and most
DBs cannot.
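The structure above can be sketched as plain Python types. This is a minimal illustration for shard-wiki discussion, not Wikibase's own classes: the class and field names are assumptions, and the item ID, GUID, and reference URL in the example data are placeholders.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, List, Optional


class SnakType(Enum):
    VALUE = "value"          # a concrete value is given
    SOMEVALUE = "somevalue"  # a value exists but is unknown
    NOVALUE = "novalue"      # known to have no value


class Rank(Enum):
    PREFERRED = "preferred"
    NORMAL = "normal"
    DEPRECATED = "deprecated"


@dataclass(frozen=True)
class Snak:
    property: str                        # e.g. "P31"
    snaktype: SnakType = SnakType.VALUE
    value: Optional[Any] = None          # only set when snaktype is VALUE

    def __post_init__(self) -> None:
        has_value = self.value is not None
        if (self.snaktype is SnakType.VALUE) != has_value:
            raise ValueError("value must be set if and only if snaktype is 'value'")


@dataclass
class Claim:
    main_snak: Snak
    qualifiers: List[Snak] = field(default_factory=list)


@dataclass
class Statement:
    guid: str                            # e.g. "Q42$<uuid>"
    claim: Claim
    references: List[List[Snak]] = field(default_factory=list)  # each reference is a list of snaks
    rank: Rank = Rank.NORMAL


# Placeholder data: a population claim with a point-in-time qualifier and one reference.
population = Statement(
    guid="Q1234$00000000-0000-0000-0000-000000000000",
    claim=Claim(
        main_snak=Snak("P1082", value=8_400_000),
        qualifiers=[Snak("P585", value="2020")],
    ),
    references=[[Snak("P854", value="https://example.org/census")]],
    rank=Rank.PREFERRED,
)

# A known-unknown: a value exists for this property, but it is not known.
unknown = Snak("P570", snaktype=SnakType.SOMEVALUE)
```

Keeping `snaktype` separate from `value` is what lets `somevalue` and `novalue` survive a round trip; collapsing both to `None` would erase the distinction.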

## 2. The RDF / SPARQL surface

Wikibase **projects entities to RDF**; the **Wikidata Query Service (WDQS)** is a
**Blazegraph** triple store exposing a **SPARQL** endpoint. The projection is deliberately
multi-layered:

- **Truthy** triples (`wdt:` prefix) — the best-ranked value(s) only, with qualifiers and
  references dropped, for easy queries: `wd:Q42 wdt:P31 wd:Q5`.
- **Full** statements — reified so qualifiers/references/rank survive: `wd:Q42 p:P31
  ?stmt . ?stmt ps:P31 wd:Q5 ; pq:P585 ?time ; prov:wasDerivedFrom ?ref`. (`p:` links the
  entity to its statement node, `ps:` is the statement value, `pq:` a qualifier, and
  `prov:wasDerivedFrom` points to a reference node whose values use `pr:`.)
- **Federated SPARQL** — the `SERVICE <endpoint> { … }` keyword runs a sub-query against
  *another* SPARQL endpoint and joins the results. **Query-level federation is built into
  the query language** — a different federation primitive from fedwiki's fork/neighborhood.
- **EntitySchemas / ShEx** — schemas (`E`-ids) that *validate* an entity's shape (Shape
  Expressions). Optional, declarative structure validation over the open graph.
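To make the three query layers concrete, here is a small sketch that only builds query strings and a request URL. The helper names are hypothetical; the prefixes (`wd:`, `wdt:`, `p:`, `ps:`, `prov:`, `wikibase:`) are the ones WDQS predefines, and the remote endpoint in the usage note is a placeholder. Nothing is sent over the network, and a public endpoint may restrict which remote endpoints `SERVICE` is allowed to reach.

```python
from urllib.parse import urlencode

# Public WDQS SPARQL endpoint; a self-hosted Wikibase would expose its own URL.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def truthy_query(prop: str, value: str, limit: int = 10) -> str:
    """Best-ranked values only: one triple pattern, no qualifiers or references."""
    return (
        "SELECT ?item WHERE {\n"
        f"  ?item wdt:{prop} wd:{value} .\n"
        "}\n"
        f"LIMIT {limit}"
    )


def full_statement_query(item: str, prop: str) -> str:
    """Reified form: go through the statement node to read rank and references."""
    return (
        "SELECT ?value ?rank ?ref WHERE {\n"
        f"  wd:{item} p:{prop} ?stmt .\n"
        f"  ?stmt ps:{prop} ?value ;\n"
        "        wikibase:rank ?rank .\n"
        "  OPTIONAL { ?stmt prov:wasDerivedFrom ?ref . }\n"
        "}"
    )


def federated_query(local_pattern: str, remote_endpoint: str, remote_pattern: str) -> str:
    """Join a local pattern with a sub-query evaluated at another endpoint."""
    return (
        "SELECT * WHERE {\n"
        f"  {local_pattern}\n"
        f"  SERVICE <{remote_endpoint}> {{\n"
        f"    {remote_pattern}\n"
        "  }\n"
        "}"
    )


def request_url(query: str, endpoint: str = WDQS_ENDPOINT) -> str:
    """Build a GET URL; actually sending it is left to the caller."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})
```

For example, `federated_query("?item wdt:P31 wd:Q5 .", "https://example.org/sparql", "?item ?p ?o .")` yields a query whose inner block is evaluated remotely and joined on `?item`. For shard-wiki this is the shape a pass-through adapter would need to forward unchanged.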

## 3. Storage, identity, history

- **Storage:** each entity is a **JSON blob stored as a MediaWiki page** (`wikibase-item` /
  `wikibase-property` content models, typically in the `Item:` / `Property:` namespaces).
  The RDF/SPARQL store is a **derived index** rebuilt from these
  canonical JSON entities (an *update stream* feeds WDQS) — exactly shard-wiki's
  "derived query index over a canonical store" pattern (UC-63), at planet scale.
- **Identity:** the **opaque Q/P/L IDs are the identity**, fully decoupled from
  human-readable labels and from language. This is the cleanest real-world instance of
  **stable, language-neutral identity ≠ label/placement** — a strong reinforcement of our
  identity model (T16).
- **History:** because each entity is one MediaWiki page, history is **page-level MediaWiki
  revisions** — every edit is a full-entity JSON snapshot with author/timestamp/comment.
  *Coarse* history granularity (whole entity per revision), but the **edit API is
  fine-grained** (`wbsetclaim`, `wbeditentity` patch individual statements). So: **fine
  write API over a coarse history unit** — a distinct point on the write/history spectra.
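The "derived index over a canonical store" relationship can be illustrated with a toy projection from one entity JSON document to truthy triples. This is a sketch under assumptions, not the WDQS updater: it follows the entity JSON layout (`claims` keyed by property, each statement with `mainsnak` and `rank`), selects best-rank statements (preferred if any, otherwise normal, never deprecated), and simply skips `somevalue` / `novalue` snaks. The function names and the `P9001` / `Q900000x` IDs are made up for the example.

```python
import json
from typing import Iterator, List, Tuple

Triple = Tuple[str, str, str]


def best_rank_statements(statements: List[dict]) -> List[dict]:
    """Keep preferred statements if any exist, otherwise the normal ones."""
    preferred = [s for s in statements if s.get("rank") == "preferred"]
    if preferred:
        return preferred
    return [s for s in statements if s.get("rank", "normal") == "normal"]


def truthy_triples(entity: dict) -> Iterator[Triple]:
    """Project one canonical entity JSON document to simple (s, p, o) triples."""
    subject = f"wd:{entity['id']}"
    for prop, statements in entity.get("claims", {}).items():
        for stmt in best_rank_statements(statements):
            snak = stmt["mainsnak"]
            if snak["snaktype"] != "value":
                continue  # this sketch skips somevalue / novalue snaks
            value = snak["datavalue"]["value"]
            if isinstance(value, dict) and "id" in value:
                obj = f"wd:{value['id']}"   # reference to another entity
            else:
                obj = json.dumps(value)     # crude literal encoding
            yield (subject, f"wdt:{prop}", obj)


def item_statement(prop: str, target: str, rank: str) -> dict:
    """Helper for the example: one item-valued statement in entity-JSON shape."""
    return {
        "rank": rank,
        "mainsnak": {
            "snaktype": "value",
            "property": prop,
            "datavalue": {"type": "wikibase-entityid",
                          "value": {"entity-type": "item", "id": target}},
        },
    }


entity = {
    "id": "Q42",
    "claims": {
        "P31": [item_statement("P31", "Q5", "normal")],
        "P9001": [  # made-up property with three competing values
            item_statement("P9001", "Q9000001", "normal"),
            item_statement("P9001", "Q9000002", "preferred"),
            item_statement("P9001", "Q9000003", "deprecated"),
        ],
    },
}

triples = list(truthy_triples(entity))
```

Because the triples are a pure function of the canonical JSON, the index can be dropped and rebuilt at any time, which is the property UC-63 relies on.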

## 4. Capability profile

| Dimension (synthesis spectrum) | Wikibase / Wikidata |
|--------------------------------|---------------------|
| Attachment mode | **external-API** (MediaWiki Action API + REST) **and** a derived **SPARQL endpoint**; self-hostable |
| Addressing granularity | **statement** (each has a GUID) within an **entity** (Q/P id) |
| Content identity | **stable opaque ID** (Q/P/L); labels are multilingual annotations |
| Identity vs placement | **fully separated** — identity is language- and label-neutral |
| Structure | **typed knowledge graph**: entities + statements (claim+qualifiers+refs+rank) |
| History | **page-level revisions** (whole-entity JSON snapshots); fine-grained edit API |
| Merge model | MediaWiki last-writer / edit-conflict; rank lets contradictory values coexist |
| Native query | **SPARQL** (RDF) + **federated `SERVICE`** cross-endpoint join — the far end |
| Translation | **not Markdown** — content *is* statements; render to prose is a lossy projection |
| Attachment/write granularity | **statement-level writes** via API; coarse history unit |
| Operational envelope | huge derived index (Blazegraph), rate-limited public endpoints |
| Access grant | open read; MediaWiki user/permission model for write; self-host = own ACL |
| Content opacity | transparent (public JSON + RDF); not encrypted |
| Provenance | **statement-level** — references + rank per assertion (new far end) |

## 5. INTENT mapping

### Reinforcements

- **Stable identity ≠ placement** (T16): Q/P IDs decoupled from labels/language are the
  textbook case — adopt the principle that a page's *identity* is an opaque stable handle,
  display names are annotations.
- **Derived index over canonical store** (UC-63): WDQS is exactly a SPARQL index rebuilt
  from canonical JSON entities via an update stream — validates the projection pattern.
- **Union without erasure / chorus**: **rank** lets multiple (even contradictory) statements
  coexist with a curation signal rather than forcing one truth — the *structured* analogue
  of fedwiki's chorus (UC-72) and our "view multiple versions" (UC-27).
- **Mechanism over policy**: references + rank are *mechanism* for representing disagreement
  and sourcing; which statement "wins" is left to the consumer/query.

### Divergences (boundaries / design notes)

- **Content is not Markdown.** A Wikibase "page" is a set of statements; there is no prose
  body. This is the **structure far-end**: shard-wiki must either (a) treat such a shard as
  a **structured/typed shard** projected to a *lossy* Markdown/table rendering (UC-55/UC-73),
  or (b) model a page whose payload is typed statements (T12). Forcing it into Markdown-first
  erases the graph, which is a design bug if done silently; render with provenance instead.
- **Provenance granularity is finer than ours.** Our provenance is per-page/per-shard;
  Wikibase is **per-statement** (references) and even per-value (rank). The page model and
  coordination journal should *allow* sub-page provenance (UC-75) even if MVP records it per
  page.
- **Query is graph, not text/datalog.** SPARQL over RDF (with federated `SERVICE`) is a
  richer query far-end than Roam/Logseq datalog or Notion filters (UC-52) — and its
  `SERVICE` federation is a *query-time* cross-shard join, distinct from fedwiki structural
  federation. Note both as native-query tiers.

### What to keep

1. **Opaque stable identity, labels-as-annotations** as the identity model (T16).
2. **Statement/assertion-level provenance** (references) and a **coexistence-with-rank**
   model as the structured form of union-without-erasure (UC-75).
3. **Derived SPARQL/graph index over a canonical entity store** as a projection pattern
   (UC-63/UC-74), incl. **federated query** as a first-class federation mode.
4. A **typed-graph page payload** option in the page model (T12), with **lossy
   render-to-Markdown** as the projection (never silent flattening).

## 6. UC seeds

| # | Seed | Disposition |
|---|------|-------------|
| UC-73 | Attach a **Wikibase** as a **typed entity-statement (RDF) shard** (items/properties/statements w/ qualifiers); project to a rendered page view, lossy to Markdown, preserving the graph | **new** |
| UC-74 | **Graph-query the union** via **SPARQL** and **federate queries across endpoints** (`SERVICE`) — graph query as a native-query tier + query-time cross-shard join | **new** |
| UC-75 | Preserve **statement-level provenance** — references + rank attached to each assertion (sub-page provenance granularity) | **new** |
| — | typed records → typed *graph* entities | enrich **UC-34** |
| — | inter-record relations → typed graph edges with qualifiers | enrich **UC-58** |
| — | native query → SPARQL/RDF + federated SERVICE | enrich **UC-52** |
| — | provenance → statement/assertion granularity | enrich **UC-24** |

## 7. Architecture notes for SHARD-WP-0002

- **T12 (structured/typed page model):** add a **typed-graph payload** tier above
  typed-records — a page whose content is **entities + statements (claim + qualifiers +
  references + rank)**, with `somevalue`/`novalue` known-unknowns. Render-to-Markdown is a
  **lossy projection**, not the canonical form.
- **T16 (identity / addressing):** adopt **opaque stable identity with labels-as-annotation**
  (Q/P model); record **statement GUIDs** as an example of *sub-page addressable units*.
- **Native-query tiering:** SPARQL/RDF + federated `SERVICE` is the **graph far-end** of the
  query spectrum (above datalog/filters); `SERVICE` is also a **query-time federation**
  mode to sit beside fedwiki's structural federation.
- **Provenance model:** allow **per-statement references + rank** (sub-page provenance,
  coexistence-with-curation) in the union, even if MVP collapses to per-page.
- **Derived index:** WDQS = canonical JSON entities → update stream → Blazegraph SPARQL
  index; the reference implementation of UC-63 at scale (per-shard or core-built index, Q16).
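One way to picture the T12 note is a renderer that turns a typed-graph payload into a Markdown table while saying out loud that the view is lossy. The sketch below is only an illustration: the flat statement dictionaries are a hypothetical shard-wiki payload shape (not Wikibase JSON), the function names are invented, and the IDs and URL are placeholders.

```python
from typing import Dict, List


def render_value(stmt: Dict) -> str:
    """Show known-unknowns explicitly instead of as an empty cell."""
    snaktype = stmt.get("snaktype", "value")
    if snaktype == "somevalue":
        return "*unknown value*"
    if snaktype == "novalue":
        return "*no value*"
    text = str(stmt["value"])
    qualifiers = stmt.get("qualifiers", {})
    if qualifiers:
        text += " (" + ", ".join(f"{p}: {v}" for p, v in qualifiers.items()) + ")"
    return text


def render_markdown(entity_id: str, label: str, statements: List[Dict]) -> str:
    """Render a typed-graph payload as a Markdown table (a lossy view)."""
    lines = [
        f"# {label} (`{entity_id}`)",  # the label is annotation, the ID is identity
        "",
        "| Property | Value | Rank | References |",
        "|----------|-------|------|------------|",
    ]
    for stmt in statements:
        lines.append(
            f"| {stmt['property']} | {render_value(stmt)} "
            f"| {stmt.get('rank', 'normal')} | {len(stmt.get('references', []))} |"
        )
    # Reference contents and statement GUIDs are not rendered; point back to the graph.
    lines += ["", f"*Lossy view of {len(statements)} statements; canonical source: `{entity_id}`.*"]
    return "\n".join(lines)


page = render_markdown(
    "Q1234",
    "Example city",
    [
        {"property": "P1082", "value": 8400000, "qualifiers": {"P585": "2020"},
         "rank": "preferred", "references": [{"P854": "https://example.org/census"}]},
        {"property": "P571", "snaktype": "somevalue"},
    ],
)
print(page)
```

The footer line is the important design choice: the rendering names its canonical source, so a reader (or a sync process) never mistakes the table for the data.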

## 8. Open questions

1. Does shard-wiki model a **typed-graph page** natively (T12), or always treat Wikibase as
   a structured shard **projected to a Markdown/table rendering** (UC-55), or both
   (canonical graph + lossy view)?
2. Is **SPARQL/graph query** exposed as a union-level capability (translate to a common
   query layer) or only as a **pass-through** to graph-capable shards? How does federated
   `SERVICE` relate to shard-wiki's own cross-shard query?
3. At what granularity does the coordination journal record **provenance** — per page
   (MVP), per statement (Wikibase-native), or configurable?
4. Is **rank** (coexisting contradictory values w/ curation) representable in the union as a
   first-class "chorus of statements," unifying with fedwiki's page-level chorus (UC-72/27)?

## 9. Sources

- Wikibase/DataModel and **DataModel/Primer** — mediawiki.org
- Help:Qualifiers; Wikidata SPARQL query service + Query Help; SPARQL tutorial — wikidata.org
- Wikidata Query Service / User Manual — mediawiki.org; Wikitech (Blazegraph, updater)
- "The wikibase model" — Vanderbilt Libraries Digital Lab (heardlibrary.github.io)
- RaiseWikibase — Wikibase Data Model functions (ub-mannheim.github.io)
- WShEx / EntitySchemas (ShEx) — arxiv.org/abs/2208.02697, ceur-ws.org Vol-3262

## 10. Traceability

New UCs **UC-73–UC-75** carry the marker **⬡** in the wikiengines column of
`spec/UseCaseCatalog.md`. Enriched: UC-34, UC-58, UC-52, UC-24. Architecture cross-refs:
SHARD-WP-0002 T12, T16, native-query tiering, provenance model, UC-63 derived index.