shard-wiki/research/260614-jupyter-deep-dive/findings.md
tegwick 25a714efa7 research: Jupyter Notebooks deep dive; UC-84 (SHARD-WP-0004 T3)
.ipynb JSON cells + embedded computed outputs with fragile execution
provenance; derived output stored inside the source. Non-Markdown/lossy;
kernel = capability, default = present snapshot + static render.
Enriches UC-54/55/59/35; links UC-32/83/79. Marks T3 done.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 23:08:13 +02:00


# Jupyter Notebooks — deep dive (findings)
**Date:** 2026-06-14 · **Source:** SHARD-WP-0004 T3 · **Subject:** Jupyter Notebooks — the
`.ipynb` JSON document, kernels, embedded computed outputs, execution provenance.
## Why this dive
T1 (literate programming) established **one source → derived projections** and split
**replication-projection** from **derivation-projection**. Jupyter is the *dominant modern
computational document* and the concrete case where the **derived output is captured and
stored inside the source** — a non-Markdown, partially-executable content type whose
provenance is real but **fragile**. It is the most plausible concrete "computational shard"
content type, so it tests the page model (T12), lossy translation (T15), and the
output-provenance question head-on.
## 1. The `.ipynb` document model
A notebook is a single **JSON document** (`nbformat`), not Markdown:
- **`cells[]`**: an ordered list. Each cell has a `cell_type`:
  - `markdown`: prose (Markdown + LaTeX), the human-readable part.
  - `code`: source text (`source`), plus an **`execution_count`** and an **`outputs[]`**
    array captured from the last run.
  - `raw`: passthrough.
- **`outputs[]`** (per code cell) carry results inline: `stream` (stdout/stderr),
`execute_result` / `display_data` (a **MIME bundle**: `text/plain`, `text/html`,
`image/png` base64, `application/json`, vendor MIME types), and `error` (traceback).
- **`metadata`** at notebook and cell level (`kernelspec`, `language_info`, tags like
`hide-input`, `scrolled`, slideshow roles).
So an `.ipynb` is **source + last-run computed outputs + environment metadata, fused in one
JSON file**. The Markdown cells are an *island* inside a JSON envelope — relevant to how
shard-wiki extracts/round-trips content.
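
As a minimal sketch of that structure, using only the Python standard library, the following builds a tiny nbformat-4-shaped document and summarizes its cells. The notebook content and the `summarize_cells` helper are invented for illustration; real notebooks carry more metadata.

```python
import json

# A hand-written, illustrative notebook in nbformat 4 shape (not from a real project).
NOTEBOOK_JSON = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3"}},
    "cells": [
        {"cell_type": "markdown", "id": "intro", "metadata": {},
         "source": "# Title\nSome prose."},
        {
            "cell_type": "code",
            "id": "calc",
            "metadata": {},
            "execution_count": 3,
            "source": "1 + 1",
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 3,
                    "metadata": {},
                    "data": {"text/plain": "2"},  # the MIME bundle
                }
            ],
        },
    ],
})


def summarize_cells(notebook: dict) -> list:
    """Return (cell id, cell type, number of stored outputs) for each cell."""
    summary = []
    for index, cell in enumerate(notebook["cells"]):
        # Cell ids exist from nbformat 4.5 on; fall back to the position otherwise.
        cell_id = cell.get("id", f"#{index}")
        outputs = cell.get("outputs", [])  # only code cells carry outputs
        summary.append((cell_id, cell["cell_type"], len(outputs)))
    return summary


notebook = json.loads(NOTEBOOK_JSON)
summary = summarize_cells(notebook)
```

The point of the sketch is that prose, code, and captured results all sit in one JSON tree, so any adapter has to walk `cells[]` rather than parse a Markdown file.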
## 2. Kernels and execution
- A **kernel** is a separate language process (IPython, IRkernel, IJulia, …) speaking the
Jupyter messaging protocol (ZeroMQ). The document is **decoupled from the kernel**: the
`.ipynb` persists *captured* outputs; re-running requires a live kernel + the right
environment.
- **`execution_count`** numbers the order cells were *run*, which **need not match document
order** — the infamous **hidden-state / out-of-order execution** problem: stored outputs
may reflect a run sequence that no longer corresponds to top-to-bottom reading.
- Reproducibility therefore depends on **out-of-band state**: package versions, data files,
environment, random seeds — none captured by `nbformat` itself.
**Consequence for shard-wiki:** the captured outputs are a **snapshot projection with weak
provenance** — honest treatment must mark them as "computed at run N, environment not
guaranteed," never as live or authoritative truth.
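
One way to surface that fragility is to compare stored `execution_count` values against document order. The sketch below is illustrative only: the function name and the report keys are hypothetical, and a consistent ordering does not prove that outputs are reproducible, since the environment is still not captured.

```python
def execution_order_report(notebook: dict) -> dict:
    """Compare stored execution counts of code cells against document order."""
    counts = [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
    ]
    executed = [count for count in counts if count is not None]
    return {
        # code cells with no recorded run (never executed, or outputs stripped)
        "never_run": counts.count(None),
        # strictly increasing counts are consistent with a top-to-bottom run
        "in_document_order": all(a < b for a, b in zip(executed, executed[1:])),
        # 1..n without gaps hints at a fresh "restart and run all"
        "gap_free_from_one": executed == list(range(1, len(executed) + 1)),
    }


# Illustrative notebook: the second code cell ran before the first.
example = {
    "cells": [
        {"cell_type": "code", "execution_count": 5, "source": "y = x * 2", "outputs": []},
        {"cell_type": "markdown", "source": "Some prose."},
        {"cell_type": "code", "execution_count": 2, "source": "x = 21", "outputs": []},
        {"cell_type": "code", "execution_count": None, "source": "print(y)", "outputs": []},
    ]
}
report = execution_order_report(example)
```

A report like this could feed the "computed at run N" label directly, flagging notebooks whose stored outputs cannot correspond to a top-to-bottom read.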
## 3. The ecosystem (relevant to attach/project/translate)
- **nbconvert** derives other forms from a notebook: HTML, Markdown, LaTeX/PDF, slides,
script. This is **derivation-projection** (T1): notebook source → rendered view, one-way and
lossy in a target-dependent way (HTML keeps outputs; `--to script` keeps only code, like
`tangle`).
- **Jupytext** — represents a notebook **as** a `.py`/`.md` text file (pairing), making it
**git-diffable plain text** and round-trippable — directly relevant to storing notebooks
in a git shard without JSON-diff noise.
- **papermill** — parameterize + execute a notebook to produce a new output notebook
(notebook as a runnable template — a *derivation with inputs*).
- **JupyterLab / Notebook / nbviewer / Colab** — front-ends; nbviewer renders a static
read-only projection from a URL (a natural projection target).
- **`nbstripout`** — strips outputs before commit: teams treat **outputs as derived noise**,
keeping only source under version control — an explicit "source canonical, outputs
derived" stance mirroring T1.
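
The "source canonical, outputs derived" stance can be sketched on the raw JSON with the standard library alone. This is written in the spirit of nbstripout and of a code-only export; it is not the implementation of either tool, and the helper names are hypothetical.

```python
import copy


def cell_source(cell: dict) -> str:
    """nbformat stores `source` as a string or as a list of line strings."""
    source = cell.get("source", "")
    return "".join(source) if isinstance(source, list) else source


def strip_outputs(notebook: dict) -> dict:
    """Return a copy with outputs and execution counts cleared on code cells."""
    stripped = copy.deepcopy(notebook)  # leave the caller's notebook untouched
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


def code_only(notebook: dict) -> str:
    """Concatenate code cell sources, discarding prose and outputs."""
    return "\n\n".join(
        cell_source(cell)
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
    )


# Illustrative input, invented for this sketch.
example = {
    "cells": [
        {"cell_type": "markdown", "source": "# Demo"},
        {
            "cell_type": "code",
            "execution_count": 1,
            "source": ["x = 1\n", "x + 1"],
            "outputs": [{"output_type": "execute_result", "execution_count": 1,
                         "metadata": {}, "data": {"text/plain": "2"}}],
        },
    ]
}
stripped = strip_outputs(example)
script = code_only(example)
```

Both functions are lossy on purpose: `strip_outputs` discards the snapshot, and `code_only` discards the prose as well.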
## 4. Capability profile (as a shard / content type)
| Dimension (synthesis spectrum) | Jupyter notebook |
|--------------------------------|------------------|
| Attachment mode | file-store (`.ipynb` JSON in a repo) or via Jupyter Server REST API |
| Addressing granularity | document; **cell** as sub-address (by index / id; `nbformat 4.5+` adds stable cell `id`) |
| Content identity | file path; cell `id` (4.5+) else positional |
| Structure | **ordered cell list** (markdown / code+outputs / raw); MIME-bundle outputs |
| History | VCS on the file; **JSON diffs are noisy** unless paired (Jupytext) or stripped |
| Merge model | git on JSON (poor) → **paired text** (good) or nbdime (cell-aware diff/merge) |
| Native query | none |
| Translation | nbconvert → HTML/MD/script/PDF (lossy, directional); Jupytext text pairing |
| Write granularity | file / **cell** |
| Operational envelope | a kernel + environment to (re)execute; static render needs none |
| Content opacity | **mixed**: source transparent; outputs = MIME blobs (some opaque, e.g. base64 PNG) |
| Provenance | `execution_count` (weak, out-of-order); environment **not** captured |
| **Computed-output** | **stored inline**, snapshot, reproducibility out-of-band |
## 5. INTENT mapping
### Reinforcements
- **Replication- vs derivation-projection (T1) confirmed and extended.** nbconvert
(→ HTML/script) and nbviewer are derivation-projections; `--to script` is the direct analogue
of `tangle`.
Jupyter adds a third wrinkle: **the derived output is also stored back inside the source**
(captured outputs), so the "source vs projection" line runs *through* the document.
- **Union without erasure / provenance honesty.** Captured outputs must be surfaced **as
snapshots with weak provenance** (run N, environment unguaranteed) — a concrete instance
of "never hide freshness/authorship." The out-of-order `execution_count` is exactly the
kind of fragility shard-wiki must *show*, not paper over.
- **Non-Markdown content + lossy translation (UC-55/UC-59).** `.ipynb` is JSON with embedded
MIME-bundle outputs; any Markdown projection is **lossy** (loses live outputs, kernel,
rich MIME). Surface the lossiness; keep the JSON as canonical payload (T12/T15).
- **Markdown island.** Markdown cells fit the text-first model, but only as *fragments
inside* a JSON envelope — the adapter extracts/round-trips them, it does not pretend the
notebook is a Markdown page.
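
The "surface the lossiness" idea can be sketched as a projection that returns its losses alongside the Markdown. This is an assumption-laden illustration, not nbconvert's behaviour: the function name, the loss-message format, and the choice to keep only `text/plain` are all hypothetical.

```python
FENCE = "`" * 3  # built this way so the example nests cleanly inside a fenced block


def as_text(value) -> str:
    """nbformat allows multi-line text as a string or a list of line strings."""
    return "".join(value) if isinstance(value, list) else value


def project_to_markdown(notebook: dict) -> tuple:
    """Return (markdown_text, losses); losses name what the projection dropped."""
    parts, losses = [], []
    for index, cell in enumerate(notebook.get("cells", [])):
        source = as_text(cell.get("source", ""))
        kind = cell.get("cell_type")
        if kind == "markdown":
            parts.append(source)  # the Markdown island passes through unchanged
        elif kind == "code":
            parts.append(f"{FENCE}\n{source}\n{FENCE}")
            for output in cell.get("outputs", []):
                output_type = output.get("output_type")
                if output_type == "stream":
                    parts.append(as_text(output.get("text", "")))
                elif output_type == "error":
                    losses.append(f"cell {index}: dropped traceback")
                else:  # execute_result / display_data carry a MIME bundle
                    data = output.get("data", {})
                    if "text/plain" in data:
                        parts.append(as_text(data["text/plain"]))
                    for mime in sorted(data):
                        if mime != "text/plain":
                            losses.append(f"cell {index}: dropped {mime}")
        else:
            losses.append(f"cell {index}: dropped {kind} cell")
    return "\n\n".join(parts), losses


# Illustrative input; the PNG payload is a placeholder, not real image data.
example = {
    "cells": [
        {"cell_type": "markdown", "source": "# Plot"},
        {"cell_type": "code", "execution_count": 1, "source": "plot()",
         "outputs": [{"output_type": "display_data", "metadata": {},
                      "data": {"text/plain": "<Figure>", "image/png": "PLACEHOLDER"}}]},
    ]
}
markdown, losses = project_to_markdown(example)
```

Returning the loss list with the text keeps the projection honest: the caller can show "image/png dropped" next to the rendered page while the JSON stays the canonical payload.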
### Divergences / boundaries
- **shard-wiki is not a kernel host.** Re-execution (driving a kernel) is out of scope or, at
most, capability-gated; the default treatment is **attach + present captured outputs as a snapshot
projection** + offer nbconvert-style static render. Executing/parameterizing (papermill)
is an optional capability, never assumed.
- **Outputs-in-source is an anti-pattern to respect, not adopt.** Teams strip/pair outputs
precisely because mixing derived data into the source breaks diffs. shard-wiki should
prefer the **source-canonical, outputs-as-derived** reading (Jupytext pairing / nbstripout
ethos) and treat stored outputs as a capturable projection.
### What to keep
1. **Computational-notebook as a first-class content type** with cell structure + inline
**computed outputs carrying (weak) execution provenance** — UC-84.
2. **Outputs = derivation-projection snapshot** (T1 vocabulary): regenerable only with a
kernel+environment; degrade gracefully to the stored snapshot / static render.
3. **Cell-level addressing** (stable cell `id`, nbformat 4.5+) as the sub-page granularity
for transclusion/anchoring (UC-32/UC-35).
4. **Text-pairing (Jupytext)** as the git-friendly storage strategy — feeds the
history-portability thread (poor JSON diffs → paired text / nbdime).
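
Cell-level addressing (item 3) might look like the sketch below. The `id:` / `index:` address syntax and the `resolve_cell` name are hypothetical, invented here to show the stable-versus-positional distinction; they are not an existing shard-wiki or Jupyter convention.

```python
def resolve_cell(notebook: dict, address: str) -> tuple:
    """Resolve 'id:<cell id>' or 'index:<n>' to (index, cell, is_stable).

    The address syntax is hypothetical, invented for this sketch.
    """
    scheme, _, value = address.partition(":")
    cells = notebook.get("cells", [])
    if scheme == "id":
        for index, cell in enumerate(cells):
            if cell.get("id") == value:
                return index, cell, True  # ids survive reordering (nbformat 4.5+)
        raise KeyError(f"no cell with id {value!r}")
    if scheme == "index":
        index = int(value)
        return index, cells[index], False  # positional: breaks when cells move
    raise ValueError(f"unsupported address scheme: {scheme!r}")


# Illustrative notebook fragment.
example = {
    "cells": [
        {"cell_type": "markdown", "id": "intro", "source": "# Title"},
        {"cell_type": "code", "id": "calc", "source": "1 + 1", "outputs": []},
    ]
}
by_id = resolve_cell(example, "id:calc")
by_index = resolve_cell(example, "index:0")
```

The third element of the result lets a transclusion or anchor record whether its target address will survive cell reordering.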
## 6. UC seed
| # | Seed | Disposition |
|---|------|-------------|
| UC-84 | Attach/project a **computational notebook** (`.ipynb`): preserve **cell structure** (markdown / code / output) and **embedded computed outputs**, surfacing each output **as a snapshot with its (weak) execution provenance** (run count, environment not guaranteed) — re-execution is **capability-gated**, default is present-the-snapshot + offer a static rendered projection | **new** |
| — | Notebook JSON / MIME-bundle outputs = non-Markdown content; Markdown projection is lossy | enrich **UC-55**, **UC-59** |
| — | Computed/evaluated cell = computation-defined content | enrich **UC-54** |
| — | Cell `id` (nbformat 4.5+) = sub-page address for anchor/transclusion | enrich **UC-35**, links **UC-32** |
| — | Stored outputs as derived snapshot (nbstripout/Jupytext ethos) = source-canonical/outputs-derived | links **UC-83**, **UC-79** |
## 7. Architecture notes for SHARD-WP-0002
- **T12 (page model):** add **computational-notebook** as a page shape — an **ordered cell
list** where code cells own **embedded computed outputs** (MIME bundles) with weak
execution provenance. Distinct from prose, typed records, query-defined, inline-embedded
objects (Quip/Notion), typed-graph (Wikibase), and the literate one-source-many-projection
shape (UC-83). The defining new attribute: **derived output stored *inside* the source**.
- **T15 (translation / fidelity):** `.ipynb` is non-Markdown; nbconvert→Markdown is **lossy
and directional** (drops live outputs/kernel/rich MIME). Keep JSON canonical; any Markdown
is a projection. MIME-bundle outputs map to the content-opacity spectrum (text→html→base64
image = transparent→opaque).
- **T13 (history):** JSON diffs are **noisy**; record **text-pairing (Jupytext)** and
**cell-aware diff/merge (nbdime)** as history-portability strategies for embedded-output
documents. Reinforces "source-canonical, outputs-derived."
- **T16 (projection):** captured outputs are a **derivation-projection snapshot**;
re-execution (kernel) and parameterized execution (papermill) are **capabilities**, not
assumptions; degrade to the stored snapshot / nbviewer-style static render.
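
The T15 and T16 notes can be combined into one provenance-tagged record per stored output. The sketch below is an assumption throughout: the three opacity labels, the classification rules, and the `OutputSnapshot` shape are hypothetical and not a decided page-model design.

```python
from dataclasses import dataclass
from typing import Optional


def mime_opacity(mime: str) -> str:
    """Place a MIME type on an assumed three-step transparent-to-opaque scale."""
    if mime == "text/html" or mime.endswith("json") or mime.endswith("+xml"):
        return "structured"   # readable text, but needs a renderer to be meaningful
    if mime.startswith("text/"):
        return "transparent"  # plain text a reader or a diff can use directly
    return "opaque"           # e.g. base64-encoded image/png


@dataclass(frozen=True)
class OutputSnapshot:
    cell_id: str
    execution_count: Optional[int]      # "run N": weak provenance only
    opacity_by_mime: dict
    environment_captured: bool = False  # nbformat does not record the environment


def snapshot_outputs(cell: dict) -> list:
    """Wrap each stored output of a code cell as a provenance-tagged snapshot."""
    snapshots = []
    for output in cell.get("outputs", []):
        data = output.get("data", {})  # empty for stream and error outputs
        snapshots.append(OutputSnapshot(
            cell_id=cell.get("id", "<positional>"),
            execution_count=cell.get("execution_count"),
            opacity_by_mime={mime: mime_opacity(mime) for mime in data},
        ))
    return snapshots


# Illustrative code cell; payloads are placeholders.
example_cell = {
    "cell_type": "code", "id": "plot", "execution_count": 7, "source": "plot()",
    "outputs": [{"output_type": "display_data", "metadata": {},
                 "data": {"text/plain": "<Figure>", "text/html": "<div></div>",
                          "image/png": "PLACEHOLDER"}}],
}
snapshots = snapshot_outputs(example_cell)
```

Defaulting `environment_captured` to `False` encodes the document's stance that a stored output is a snapshot, never live or authoritative.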
## 8. Open questions
1. Does shard-wiki ever **re-execute** a notebook (host/broker a kernel), or strictly
attach + present captured outputs + static render? (Same scope boundary as UC-83/UC-56
"do we ever drive the derivation.")
2. Is **UC-84** distinct from **UC-83**, or is a notebook just the "outputs-stored-in-source"
special case of the literate one-source-many-projection pattern? (Kept separate: UC-84's
defining trait is *captured derived output embedded in the canonical source with weak
provenance* — a page-model attribute UC-83 doesn't carry.)
3. How are **MIME-bundle outputs** represented in the page model — opaque provenance-tagged
blobs, a typed-asset registry (UC-55 open question #10), or selected-MIME projection?
4. Default storage: attach `.ipynb` **as-is** (JSON, noisy diffs) or prefer a **paired text
representation** when the shard is a git repo? (Policy → configurable.)
## 9. Sources
- Jupyter `nbformat` reference (cells, outputs, MIME bundles, cell `id` 4.5+);
Jupyter messaging protocol / kernels docs.
- **nbconvert**, **nbviewer**, **JupyterLab**, **Colab** docs.
- **Jupytext**, **papermill**, **nbdime**, **nbstripout** project docs.
- prior: `research/260614-literate-programming-deep-dive/` (replication- vs
derivation-projection, UC-83); `research/260614-notion-deep-dive/` (block-JSON,
external-API), `research/260614-quip-deep-dive/` (inline embedded objects, UC-55/58/59).
## 10. Traceability
New UC **UC-84** carries the marker **⊜** in the wikiengines column of
`spec/UseCaseCatalog.md` (true lineage = this dive). Enriched: UC-54, UC-55, UC-59, UC-35;
links UC-32, UC-83, UC-79. Architecture cross-refs: SHARD-WP-0002 T12 (notebook page shape:
outputs embedded in source), T15 (lossy non-Markdown translation; MIME opacity), T13
(paired-text / nbdime history), T16 (output = derivation-projection snapshot; kernel =
capability).