Jupyter Notebooks — deep dive (findings)
Date: 2026-06-14 · Source: SHARD-WP-0004 T3 · Subject: Jupyter Notebooks — the
.ipynb JSON document, kernels, embedded computed outputs, execution provenance.
Why this dive
T1 (literate programming) established one source → derived projections and split replication-projection from derivation-projection. Jupyter is the dominant modern computational document and the concrete case where the derived output is captured and stored inside the source — a non-Markdown, partially-executable content type whose provenance is real but fragile. It is the most plausible concrete "computational shard" content type, so it tests the page model (T12), lossy translation (T15), and the output-provenance question head-on.
1. The .ipynb document model
A notebook is a single JSON document (nbformat), not Markdown:
- `cells[]`: an ordered list. Each cell has a `cell_type`:
  - `markdown`: prose (Markdown + LaTeX), the human-readable part.
  - `code`: source text (`source`), plus an `execution_count` and an `outputs[]` array captured from the last run.
  - `raw`: passthrough.
- `outputs[]` (per code cell) carry results inline: `stream` (stdout/stderr), `execute_result` / `display_data` (a MIME bundle: `text/plain`, `text/html`, `image/png` base64, `application/json`, vendor MIME types), and `error` (traceback).
- `metadata` at notebook and cell level (`kernelspec`, `language_info`, tags like `hide-input`, `scrolled`, slideshow roles).
So an .ipynb is source + last-run computed outputs + environment metadata, fused in one
JSON file. The Markdown cells are an island inside a JSON envelope — relevant to how
shard-wiki extracts/round-trips content.
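The cell and output structure described above can be walked with nothing but a JSON parser. The following is a minimal sketch, assuming a hand-written sample in nbformat 4 shape; `SAMPLE` and `summarize_cells` are hypothetical names introduced for illustration, not part of any Jupyter library.

```python
import json

# Hand-written sample in nbformat 4 shape (illustrative data, not a real file).
SAMPLE = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3"}},
    "cells": [
        {"cell_type": "markdown", "id": "intro", "metadata": {},
         "source": ["# Title\n", "Some prose."]},
        {"cell_type": "code", "id": "calc", "metadata": {},
         "source": "1 + 1", "execution_count": 2,
         "outputs": [{"output_type": "execute_result", "execution_count": 2,
                      "metadata": {}, "data": {"text/plain": "2"}}]},
    ],
}


def summarize_cells(notebook):
    """Return (cell_type, execution_count, output_types) per cell, in document order."""
    rows = []
    for cell in notebook["cells"]:
        # Only code cells carry "outputs" and "execution_count".
        output_types = [out["output_type"] for out in cell.get("outputs", [])]
        rows.append((cell["cell_type"], cell.get("execution_count"), output_types))
    return rows


# Round-trip through JSON text to mimic reading a .ipynb file from disk.
notebook = json.loads(json.dumps(SAMPLE))
for row in summarize_cells(notebook):
    print(row)
```

The Markdown cell shows up as one entry among several, which is the "island inside a JSON envelope" point: an adapter has to go through the envelope to reach the prose.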
2. Kernels and execution
- A kernel is a separate language process (IPython, IRkernel, IJulia, …) speaking the Jupyter messaging protocol (ZeroMQ). The document is decoupled from the kernel: the `.ipynb` persists captured outputs; re-running requires a live kernel + the right environment.
- `execution_count` numbers the order cells were run, which need not match document order. This is the infamous hidden-state / out-of-order execution problem: stored outputs may reflect a run sequence that no longer corresponds to top-to-bottom reading.
- Reproducibility therefore depends on out-of-band state: package versions, data files, environment, random seeds. None of these is captured by `nbformat` itself.
Consequence for shard-wiki: the captured outputs are a snapshot projection with weak provenance — honest treatment must mark them as "computed at run N, environment not guaranteed," never as live or authoritative truth.
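The out-of-order problem can be detected from the stored `execution_count` values alone, without a kernel. The sketch below is one possible heuristic under stated assumptions: `execution_order_report` and its report keys are hypothetical, and a clean report does not prove the outputs are reproducible, only that the counts are consistent with a single top-to-bottom run.

```python
def execution_order_report(notebook):
    """Flag signs that stored outputs may not match top-to-bottom reading order."""
    counts = [cell.get("execution_count")
              for cell in notebook["cells"] if cell["cell_type"] == "code"]
    executed = [n for n in counts if n is not None]
    return {
        # Code cells with no recorded run.
        "never_run": counts.count(None),
        # Counts decrease somewhere when read in document order.
        "out_of_order": executed != sorted(executed),
        # Counts are not exactly 1..n: cells were re-run, deleted, or skipped.
        "gaps_or_reruns": sorted(executed) != list(range(1, len(executed) + 1)),
    }


# Illustrative notebook fragment: the second code cell was run before the first.
sample = {"cells": [
    {"cell_type": "code", "execution_count": 3, "source": "y = x + 1", "outputs": []},
    {"cell_type": "markdown", "source": "notes"},
    {"cell_type": "code", "execution_count": 1, "source": "x = 1", "outputs": []},
    {"cell_type": "code", "execution_count": None, "source": "z = 0", "outputs": []},
]}
report = execution_order_report(sample)
print(report)
```

A presentation layer could use such a report to decide how prominently to flag a snapshot as fragile.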
3. The ecosystem (relevant to attach/project/translate)
- nbconvert: derives other forms from a notebook (HTML, Markdown, LaTeX/PDF, slides, script). This is derivation-projection (T1): notebook source → rendered view, lossy in both directions (HTML keeps outputs; `--to script` keeps only code, like `tangle`).
- Jupytext: represents a notebook as a `.py`/`.md` text file (pairing), making it git-diffable plain text and round-trippable. Directly relevant to storing notebooks in a git shard without JSON-diff noise.
- papermill: parameterize + execute a notebook to produce a new output notebook (notebook as a runnable template, i.e. a derivation with inputs).
- JupyterLab / Notebook / nbviewer / Colab: front-ends; nbviewer renders a static read-only projection from a URL (a natural projection target).
- `nbstripout`: strips outputs before commit. Teams treat outputs as derived noise, keeping only source under version control: an explicit "source canonical, outputs derived" stance mirroring T1.
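The "source canonical, outputs derived" stance amounts to a small transformation on the notebook JSON. The following is a stdlib-only sketch in the spirit of output stripping; it is not nbstripout's implementation, and `strip_outputs` is a hypothetical helper that ignores details a real tool would handle (such as selected metadata keys).

```python
import copy


def strip_outputs(notebook):
    """Return a copy of the notebook with captured outputs and run counts removed."""
    stripped = copy.deepcopy(notebook)  # leave the caller's notebook untouched
    for cell in stripped["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


# Illustrative notebook with one executed code cell.
original = {"cells": [
    {"cell_type": "markdown", "source": "# Notes", "metadata": {}},
    {"cell_type": "code", "source": "1 + 1", "metadata": {}, "execution_count": 7,
     "outputs": [{"output_type": "execute_result", "execution_count": 7,
                  "metadata": {}, "data": {"text/plain": "2"}}]},
]}
source_only = strip_outputs(original)
print(source_only["cells"][1])
```

What remains after stripping is the part that diffs cleanly; what was removed is the snapshot projection discussed in section 2.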
4. Capability profile (as a shard / content type)
| Dimension (synthesis spectrum) | Jupyter notebook |
|---|---|
| Attachment mode | file-store (.ipynb JSON in a repo) or via Jupyter Server REST API |
| Addressing granularity | document; cell as sub-address (by index / id; nbformat 4.5+ adds stable cell id) |
| Content identity | file path; cell id (4.5+) else positional |
| Structure | ordered cell list (markdown / code+outputs / raw); MIME-bundle outputs |
| History | VCS on the file; JSON diffs are noisy unless paired (Jupytext) or stripped |
| Merge model | git on JSON (poor) → paired text (good) or nbdime (cell-aware diff/merge) |
| Native query | none |
| Translation | nbconvert → HTML/MD/script/PDF (lossy, directional); Jupytext text pairing |
| Write granularity | file / cell |
| Operational envelope | a kernel + environment to (re)execute; static render needs none |
| Content opacity | mixed: source transparent; outputs = MIME blobs (some opaque, e.g. base64 PNG) |
| Provenance | execution_count (weak, out-of-order); environment not captured |
| Computed-output | stored inline, snapshot, reproducibility out-of-band |
5. INTENT mapping
Reinforcements
- Replication- vs derivation-projection (T1) confirmed and extended. nbconvert (→HTML/script) and nbviewer are derivation-projections; `--to script` is literally `tangle`. Jupyter adds a third wrinkle: the derived output is also stored back inside the source (captured outputs), so the "source vs projection" line runs through the document.
- Union without erasure / provenance honesty. Captured outputs must be surfaced as snapshots with weak provenance (run N, environment unguaranteed), a concrete instance of "never hide freshness/authorship." The out-of-order `execution_count` is exactly the kind of fragility shard-wiki must show, not paper over.
- Non-Markdown content + lossy translation (UC-55/UC-59). `.ipynb` is JSON with embedded MIME-bundle outputs; any Markdown projection is lossy (loses live outputs, kernel, rich MIME). Surface the lossiness; keep the JSON as canonical payload (T12/T15).
- Markdown island. Markdown cells fit the text-first model, but only as fragments inside a JSON envelope: the adapter extracts/round-trips them, it does not pretend the notebook is a Markdown page.
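"Surface the lossiness" can be made concrete: a Markdown projection can return, alongside the text, a list of what it had to drop. The sketch below is a deliberately naive projection under stated assumptions; it is not how nbconvert works, and `markdown_projection` plus its `(cell index, what)` loss records are hypothetical.

```python
FENCE = "`" * 3  # built at runtime so this example nests cleanly inside Markdown


def _text(value):
    """nbformat allows multi-line strings as either a str or a list of str."""
    return "".join(value) if isinstance(value, list) else value


def markdown_projection(notebook):
    """Return (markdown_text, dropped) where dropped lists what the projection lost."""
    parts, dropped = [], []
    for index, cell in enumerate(notebook["cells"]):
        if cell["cell_type"] == "markdown":
            parts.append(_text(cell["source"]))
        elif cell["cell_type"] == "code":
            parts.append(f"{FENCE}\n{_text(cell['source'])}\n{FENCE}")
            for output in cell.get("outputs", []):
                if output["output_type"] == "stream":
                    parts.append(_text(output["text"]))
                elif output["output_type"] == "error":
                    dropped.append((index, "error traceback"))
                else:  # execute_result / display_data carry a MIME bundle
                    data = output.get("data", {})
                    if "text/plain" in data:
                        parts.append(_text(data["text/plain"]))
                    dropped.extend((index, mime) for mime in data if mime != "text/plain")
        else:
            dropped.append((index, "raw cell"))
    return "\n\n".join(parts), dropped


sample = {"cells": [
    {"cell_type": "markdown", "source": ["# Title\n", "Prose."]},
    {"cell_type": "code", "source": "1 + 1", "execution_count": 1,
     "outputs": [{"output_type": "execute_result", "execution_count": 1, "metadata": {},
                  "data": {"text/plain": "2", "image/png": "iVBORw0KGgo="}}]},
]}
markdown, dropped = markdown_projection(sample)
print(markdown)
print(dropped)
```

Returning the loss list next to the projection keeps the JSON canonical while letting a reader see that the Markdown view is partial.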
Divergences / boundaries
- shard-wiki is not a kernel host. Re-execution (driving a kernel) is out of scope / capability-gated; the default treatment is attach + present captured outputs as a snapshot projection + offer an nbconvert-style static render. Executing/parameterizing (papermill) is an optional capability, never assumed.
- Outputs-in-source is an anti-pattern to respect, not adopt. Teams strip/pair outputs precisely because mixing derived data into the source breaks diffs. shard-wiki should prefer the source-canonical, outputs-as-derived reading (Jupytext pairing / nbstripout ethos) and treat stored outputs as a capturable projection.
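The default treatment (present captured outputs as a snapshot, never as live truth) could be expressed as a small wrapper around each executed cell. This is a sketch under stated assumptions: `snapshot_views` and every field name in the returned records are hypothetical, not a shard-wiki schema.

```python
def snapshot_views(notebook):
    """Wrap each code cell's captured outputs in an explicit weak-provenance record."""
    kernel = notebook.get("metadata", {}).get("kernelspec", {}).get("name")
    views = []
    for index, cell in enumerate(notebook["cells"]):
        if cell["cell_type"] != "code" or not cell.get("outputs"):
            continue  # nothing was captured for this cell
        views.append({
            "cell": cell.get("id", index),        # stable id if present, else position
            "run": cell.get("execution_count"),   # "computed at run N"
            "kernel": kernel,                     # declared kernel name only
            "environment_guaranteed": False,      # packages/data/seeds are out of band
            "live": False,                        # a stored snapshot, not a re-execution
            "outputs": cell["outputs"],
        })
    return views


sample = {
    "metadata": {"kernelspec": {"name": "python3"}},
    "cells": [
        {"cell_type": "markdown", "id": "intro", "source": "# Notes"},
        {"cell_type": "code", "id": "calc", "source": "1 + 1", "execution_count": 4,
         "outputs": [{"output_type": "execute_result", "execution_count": 4,
                      "metadata": {}, "data": {"text/plain": "2"}}]},
        {"cell_type": "code", "id": "todo", "source": "pass",
         "execution_count": None, "outputs": []},
    ],
}
views = snapshot_views(sample)
print(views)
```

The two hard-coded `False` fields are the point of the sketch: without a kernel capability, nothing in the stored file can justify setting either to `True`.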
What to keep
- Computational-notebook as a first-class content type with cell structure + inline computed outputs carrying (weak) execution provenance (UC-84).
- Outputs = derivation-projection snapshot (T1 vocabulary): regenerable only with a kernel+environment; degrade gracefully to the stored snapshot / static render.
- Cell-level addressing (stable cell `id`, nbformat 4.5+) as the sub-page granularity for transclusion/anchoring (UC-32/UC-35).
- Text-pairing (Jupytext) as the git-friendly storage strategy; feeds the history-portability thread (poor JSON diffs → paired text / nbdime).
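Cell-level addressing with a positional fallback might look like the sketch below. The address convention is an assumption made for illustration (a string is treated as a cell `id`, an integer as a position), and `resolve_cell` is a hypothetical helper; it reports whether the address is stable so callers can surface that weakness.

```python
def resolve_cell(notebook, address):
    """Resolve a cell address to (cell, stable).

    A str address is matched against the cell "id" field (nbformat 4.5+);
    an int address is a positional index, which breaks when cells are reordered.
    """
    cells = notebook["cells"]
    if isinstance(address, int):
        return cells[address], False
    for cell in cells:
        if cell.get("id") == address:
            return cell, True
    raise KeyError(f"no cell with id {address!r}")


sample = {"cells": [
    {"cell_type": "markdown", "id": "intro", "source": "# Notes"},
    {"cell_type": "code", "id": "calc", "source": "1 + 1",
     "execution_count": 1, "outputs": []},
]}
cell, stable = resolve_cell(sample, "calc")
print(cell["source"], stable)
cell_by_position, stable_by_position = resolve_cell(sample, 1)
print(cell_by_position["source"], stable_by_position)
```

Both lookups reach the same cell here, but only the id-based anchor would survive inserting a new cell at the top of the notebook.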
6. UC seed
| # | Seed | Disposition |
|---|---|---|
| UC-84 | Attach/project a computational notebook (.ipynb): preserve cell structure (markdown / code / output) and embedded computed outputs, surfacing each output as a snapshot with its (weak) execution provenance (run count, environment not guaranteed). Re-execution is capability-gated; the default is present-the-snapshot + offer a static rendered projection | new |
| — | Notebook JSON / MIME-bundle outputs = non-Markdown content; Markdown projection is lossy | enrich UC-55, UC-59 |
| — | Computed/evaluated cell = computation-defined content | enrich UC-54 |
| — | Cell id (nbformat 4.5+) = sub-page address for anchor/transclusion | enrich UC-35, links UC-32 |
| — | Stored outputs as derived snapshot (nbstripout/Jupytext ethos) = source-canonical/outputs-derived | links UC-83, UC-79 |
7. Architecture notes for SHARD-WP-0002
- T12 (page model): add computational-notebook as a page shape — an ordered cell list where code cells own embedded computed outputs (MIME bundles) with weak execution provenance. Distinct from prose, typed records, query-defined, inline-embedded objects (Quip/Notion), typed-graph (Wikibase), and the literate one-source-many-projection shape (UC-83). The defining new attribute: derived output stored inside the source.
- T15 (translation / fidelity): `.ipynb` is non-Markdown; nbconvert→Markdown is lossy and directional (drops live outputs/kernel/rich MIME). Keep JSON canonical; any Markdown is a projection. MIME-bundle outputs map to the content-opacity spectrum (text→html→base64 image = transparent→opaque).
- T13 (history): JSON diffs are noisy; record text-pairing (Jupytext) and cell-aware diff/merge (nbdime) as history-portability strategies for embedded-output documents. Reinforces "source-canonical, outputs-derived."
- T16 (projection): captured outputs are a derivation-projection snapshot; re-execution (kernel) and parameterized execution (papermill) are capabilities, not assumptions; degrade to the stored snapshot / nbviewer-style static render.
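The content-opacity spectrum from T15 suggests one way to project a MIME bundle: pick the most transparent representation available. The ranking below is an assumption made for illustration (in particular, treating `application/json` as transparent and every unlisted type as opaque); `opacity` and `select_representation` are hypothetical names.

```python
def opacity(mime):
    """Rank a MIME type on a 0 (transparent) to 2 (opaque) scale; an assumed ordering."""
    if mime == "text/html":
        return 1  # readable, but needs rendering or sanitizing
    if mime.startswith("text/") or mime == "application/json":
        return 0  # plain, diffable text
    return 2      # images, vendor types, anything unrecognized


def select_representation(bundle):
    """Pick the most transparent entry from an output's MIME bundle."""
    mime = min(bundle, key=opacity)  # ties resolve to the first key in the bundle
    return mime, bundle[mime]


# Illustrative bundle as found in an execute_result / display_data "data" field.
bundle = {
    "image/png": "iVBORw0KGgo=",
    "text/html": "<b>2</b>",
    "text/plain": "2",
}
chosen = select_representation(bundle)
print(chosen)
image_only = select_representation({"image/png": "iVBORw0KGgo="})
print(image_only[0])
```

A selected-MIME projection like this is lossy by construction, so the unselected entries would still need to stay reachable in the canonical JSON.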
8. Open questions
- Does shard-wiki ever re-execute a notebook (host/broker a kernel), or strictly attach + present captured outputs + static render? (Same scope boundary as UC-83/UC-56 "do we ever drive the derivation.")
- Is UC-84 distinct from UC-83, or is a notebook just the "outputs-stored-in-source" special case of the literate one-source-many-projection pattern? (Kept separate: UC-84's defining trait is captured derived output embedded in the canonical source with weak provenance — a page-model attribute UC-83 doesn't carry.)
- How are MIME-bundle outputs represented in the page model — opaque provenance-tagged blobs, a typed-asset registry (UC-55 open question #10), or selected-MIME projection?
- Default storage: attach `.ipynb` as-is (JSON, noisy diffs) or prefer a paired text representation when the shard is a git repo? (Policy → configurable.)
9. Sources
- Jupyter `nbformat` reference (cells, outputs, MIME bundles, cell `id` 4.5+); Jupyter messaging protocol / kernels docs.
- nbconvert, nbviewer, JupyterLab, Colab docs.
- Jupytext, papermill, nbdime, nbstripout project docs.
- prior: `research/260614-literate-programming-deep-dive/` (replication- vs derivation-projection, UC-83); `research/260614-notion-deep-dive/` (block-JSON, external-API), `research/260614-quip-deep-dive/` (inline embedded objects, UC-55/58/59).
10. Traceability
New UC UC-84 carries the marker ⊜ in the wikiengines column of
spec/UseCaseCatalog.md (true lineage = this dive). Enriched: UC-54, UC-55, UC-59, UC-35;
links UC-32, UC-83, UC-79. Architecture cross-refs: SHARD-WP-0002 T12 (notebook page shape:
outputs embedded in source), T15 (lossy non-Markdown translation; MIME opacity), T13
(paired-text / nbdime history), T16 (output = derivation-projection snapshot; kernel =
capability).