tegwick 25a714efa7 research: Jupyter Notebooks deep dive; UC-84 (SHARD-WP-0004 T3)
.ipynb JSON cells + embedded computed outputs with fragile execution
provenance; derived output stored inside the source. Non-Markdown/lossy;
kernel = capability, default = present snapshot + static render.
Enriches UC-54/55/59/35; links UC-32/83/79. Marks T3 done.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 23:08:13 +02:00


Jupyter Notebooks — deep dive (findings)

Date: 2026-06-14 · Source: SHARD-WP-0004 T3 · Subject: Jupyter Notebooks — the .ipynb JSON document, kernels, embedded computed outputs, execution provenance.

Why this dive

T1 (literate programming) established one source → derived projections and split replication-projection from derivation-projection. Jupyter is the dominant modern computational document and the concrete case where the derived output is captured and stored inside the source — a non-Markdown, partially-executable content type whose provenance is real but fragile. It is the most plausible concrete "computational shard" content type, so it tests the page model (T12), lossy translation (T15), and the output-provenance question head-on.

1. The .ipynb document model

A notebook is a single JSON document (nbformat), not Markdown:

  • cells[] — an ordered list. Each cell has a cell_type:
    • markdown — prose (Markdown + LaTeX), the human-readable part.
    • code — source text (source), plus an execution_count and an outputs[] array captured from the last run.
    • raw — passthrough.
  • outputs[] (per code cell) carry results inline: stream (stdout/stderr), execute_result / display_data (a MIME bundle: text/plain, text/html, image/png base64, application/json, vendor MIME types), and error (traceback).
  • metadata at notebook and cell level (kernelspec, language_info, tags like hide-input, scrolled, slideshow roles).

So an .ipynb is source + last-run computed outputs + environment metadata, fused in one JSON file. The Markdown cells are an island inside a JSON envelope — relevant to how shard-wiki extracts/round-trips content.
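The structure above can be walked with nothing but a JSON parser. The following is a minimal sketch, not an adapter design: the embedded notebook is hand-written for illustration, and the helper names (`join_source`, `summarize`) are hypothetical. It assumes the nbformat 4 field names listed above.

```python
import json

# Hand-written minimal notebook in nbformat 4 shape; contents are illustrative only.
NOTEBOOK_JSON = """
{
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3"}},
  "cells": [
    {"cell_type": "markdown", "id": "intro", "metadata": {},
     "source": ["# Title\\n", "Some prose."]},
    {"cell_type": "code", "id": "calc", "metadata": {}, "execution_count": 3,
     "source": ["1 + 1"],
     "outputs": [{"output_type": "execute_result", "execution_count": 3,
                  "metadata": {}, "data": {"text/plain": ["2"]}}]},
    {"cell_type": "raw", "id": "passthru", "metadata": {}, "source": ["raw text"]}
  ]
}
"""


def join_source(source):
    """Return cell source as one string (it may be stored as a string or a list of lines)."""
    return source if isinstance(source, str) else "".join(source)


def summarize(nb):
    """List each cell's type, source, execution count, and output MIME types."""
    rows = []
    for index, cell in enumerate(nb["cells"]):
        row = {
            "index": index,
            "type": cell["cell_type"],
            "source": join_source(cell.get("source", "")),
            "execution_count": cell.get("execution_count"),
            "output_mimes": [],
        }
        for output in cell.get("outputs", []):
            # Only execute_result / display_data outputs carry a MIME bundle under "data".
            row["output_mimes"].extend(sorted(output.get("data", {})))
        rows.append(row)
    return rows


nb = json.loads(NOTEBOOK_JSON)
summary = summarize(nb)
for row in summary:
    print(row["index"], row["type"], row["execution_count"], row["output_mimes"])
```

The Markdown cell comes back as an ordinary string, which is the "island" an adapter would extract; everything else stays in the JSON envelope.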

2. Kernels and execution

  • A kernel is a separate language process (IPython, IRkernel, IJulia, …) speaking the Jupyter messaging protocol (ZeroMQ). The document is decoupled from the kernel: the .ipynb persists captured outputs; re-running requires a live kernel + the right environment.
  • execution_count numbers the order cells were run, which need not match document order — the infamous hidden-state / out-of-order execution problem: stored outputs may reflect a run sequence that no longer corresponds to top-to-bottom reading.
  • Reproducibility therefore depends on out-of-band state: package versions, data files, environment, random seeds — none captured by nbformat itself.

Consequence for shard-wiki: the captured outputs are a snapshot projection with weak provenance — honest treatment must mark them as "computed at run N, environment not guaranteed," never as live or authoritative truth.
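One way to surface this fragility is to compare execution_count against document order. The sketch below is an illustrative heuristic, not an established tool: the function name and report keys are hypothetical. Note its limit: strictly increasing counts do not prove a clean top-to-bottom run (cells may have been re-run or deleted), so it can only flag evidence of out-of-order execution, never certify reproducibility.

```python
def execution_order_report(cells):
    """Flag code cells whose execution_count contradicts document order."""
    counts = [
        (index, cell.get("execution_count"))
        for index, cell in enumerate(cells)
        if cell.get("cell_type") == "code"
    ]
    executed = [(i, c) for i, c in counts if c is not None]
    never_run = [i for i, c in counts if c is None]
    # A later cell with a count that is not higher than its predecessor's
    # was run before (or instead of) it: stored outputs may be stale.
    out_of_order = [
        (prev_i, cur_i)
        for (prev_i, prev_c), (cur_i, cur_c) in zip(executed, executed[1:])
        if cur_c <= prev_c
    ]
    return {
        "never_run": never_run,
        "out_of_order": out_of_order,
        "linear": not never_run and not out_of_order,
    }


# Illustrative cell list: the third code cell was run before the second.
cells = [
    {"cell_type": "code", "execution_count": 1},
    {"cell_type": "markdown"},
    {"cell_type": "code", "execution_count": 5},
    {"cell_type": "code", "execution_count": 3},
    {"cell_type": "code", "execution_count": None},
]
report = execution_order_report(cells)
print(report)
```

A report like this could feed the "computed at run N, environment not guaranteed" label, with the out-of-order pairs shown rather than hidden.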

3. The ecosystem (relevant to attach/project/translate)

  • nbconvert: derives other forms from a notebook (HTML, Markdown, LaTeX/PDF, slides, script). This is derivation-projection (T1): notebook source → rendered view, one-directional, and each target is lossy in its own way (HTML keeps outputs; --to script keeps only code, like tangle).
  • Jupytext — represents a notebook as a .py/.md text file (pairing), making it git-diffable plain text and round-trippable — directly relevant to storing notebooks in a git shard without JSON-diff noise.
  • papermill — parameterize + execute a notebook to produce a new output notebook (notebook as a runnable template — a derivation with inputs).
  • JupyterLab / Notebook / nbviewer / Colab — front-ends; nbviewer renders a static read-only projection from a URL (a natural projection target).
  • nbstripout — strips outputs before commit: teams treat outputs as derived noise, keeping only source under version control — an explicit "source canonical, outputs derived" stance mirroring T1.
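The "source canonical, outputs derived" stance can be illustrated with plain dictionary manipulation. This is a sketch of the idea only; it is not how nbstripout or nbconvert are implemented, and the function names `strip_outputs` and `tangle` are hypothetical. It assumes the nbformat 4 cell fields described in section 1.

```python
import copy


def strip_outputs(nb):
    """Return a copy of the notebook with captured outputs and run counts removed."""
    stripped = copy.deepcopy(nb)  # leave the caller's notebook untouched
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


def tangle(nb):
    """Concatenate code-cell sources only, dropping prose and outputs."""
    sources = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            source = cell.get("source", "")
            sources.append(source if isinstance(source, str) else "".join(source))
    return "\n\n".join(sources)


# Illustrative notebook with one prose cell and one executed code cell.
nb = {
    "cells": [
        {"cell_type": "markdown", "source": "# T"},
        {
            "cell_type": "code",
            "source": "print(1)\n",
            "execution_count": 7,
            "outputs": [{"output_type": "stream", "name": "stdout", "text": "1\n"}],
        },
    ]
}
stripped = strip_outputs(nb)
code_only = tangle(nb)
print(code_only)
```

Both functions are projections in the T1 sense: `strip_outputs` keeps the source and discards the snapshot, while `tangle` discards everything except code.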

4. Capability profile (as a shard / content type)

| Dimension (synthesis spectrum) | Jupyter notebook |
|---|---|
| Attachment mode | file-store (.ipynb JSON in a repo) or via Jupyter Server REST API |
| Addressing granularity | document; cell as sub-address (by index / id; nbformat 4.5+ adds stable cell id) |
| Content identity | file path; cell id (4.5+) else positional |
| Structure | ordered cell list (markdown / code+outputs / raw); MIME-bundle outputs |
| History | VCS on the file; JSON diffs are noisy unless paired (Jupytext) or stripped |
| Merge model | git on JSON (poor) → paired text (good) or nbdime (cell-aware diff/merge) |
| Native query | none |
| Translation | nbconvert → HTML/MD/script/PDF (lossy, directional); Jupytext text pairing |
| Write granularity | file / cell |
| Operational envelope | a kernel + environment to (re)execute; static render needs none |
| Content opacity | mixed: source transparent; outputs = MIME blobs (some opaque, e.g. base64 PNG) |
| Provenance | execution_count (weak, out-of-order); environment not captured |
| Computed-output | stored inline, snapshot, reproducibility out-of-band |
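The "cell id (4.5+) else positional" rule in the profile can be sketched as an address resolver. The address syntax (`id:<cell id>` / `index:<n>`) and both function names are assumptions made for this example; nothing in nbformat or shard-wiki defines them.

```python
def cell_address(cell, index):
    """Prefer the stable cell id (nbformat 4.5+); fall back to position."""
    return f"id:{cell['id']}" if "id" in cell else f"index:{index}"


def resolve_cell(nb, address):
    """Resolve a hypothetical 'id:<cell id>' or 'index:<n>' address to (index, cell)."""
    kind, _, value = address.partition(":")
    cells = nb.get("cells", [])
    if kind == "id":
        for index, cell in enumerate(cells):
            if cell.get("id") == value:
                return index, cell
        raise KeyError(f"no cell with id {value!r}")
    if kind == "index":
        index = int(value)
        if 0 <= index < len(cells):
            return index, cells[index]
        raise KeyError(f"no cell at index {index}")
    raise ValueError(f"unknown address kind {kind!r}")


# Illustrative notebook: inserting a cell shifts positions but not ids.
nb = {
    "cells": [
        {"cell_type": "markdown", "id": "intro", "source": "x"},
        {"cell_type": "code", "id": "calc", "source": "1 + 1"},
    ]
}
address = cell_address(nb["cells"][1], 1)
nb["cells"].insert(0, {"cell_type": "raw", "id": "new", "source": ""})
print(address, "->", resolve_cell(nb, address)[0])
print("index:1 ->", resolve_cell(nb, "index:1")[1]["id"])
```

After the insertion the id-based address still reaches the same cell, while the positional address silently points at a different one, which is why positional addressing is only a fallback.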

5. INTENT mapping

Reinforcements

  • Replication- vs derivation-projection (T1) confirmed and extended. nbconvert (→ HTML / script) and nbviewer are derivation-projections; --to script is literally tangle. Jupyter adds a third wrinkle: the derived output is also stored back inside the source (captured outputs), so the "source vs projection" line runs through the document.
  • Union without erasure / provenance honesty. Captured outputs must be surfaced as snapshots with weak provenance (run N, environment unguaranteed) — a concrete instance of "never hide freshness/authorship." The out-of-order execution_count is exactly the kind of fragility shard-wiki must show, not paper over.
  • Non-Markdown content + lossy translation (UC-55/UC-59). .ipynb is JSON with embedded MIME-bundle outputs; any Markdown projection is lossy (loses live outputs, kernel, rich MIME). Surface the lossiness; keep the JSON as canonical payload (T12/T15).
  • Markdown island. Markdown cells fit the text-first model, but only as fragments inside a JSON envelope — the adapter extracts/round-trips them, it does not pretend the notebook is a Markdown page.

Divergences / boundaries

  • shard-wiki is not a kernel host. Re-execution (driving a kernel) is out of scope / capability-gated; default treatment is attach + present captured outputs as a snapshot projection + offer nbconvert-style static render. Executing/parameterizing (papermill) is an optional capability, never assumed.
  • Outputs-in-source is an anti-pattern to respect, not adopt. Teams strip/pair outputs precisely because mixing derived data into the source breaks diffs. shard-wiki should prefer the source-canonical, outputs-as-derived reading (Jupytext pairing / nbstripout ethos) and treat stored outputs as a capturable projection.

What to keep

  1. Computational-notebook as a first-class content type with cell structure + inline computed outputs carrying (weak) execution provenance — UC-84.
  2. Outputs = derivation-projection snapshot (T1 vocabulary): regenerable only with a kernel+environment; degrade gracefully to the stored snapshot / static render.
  3. Cell-level addressing (stable cell id, nbformat 4.5+) as the sub-page granularity for transclusion/anchoring (UC-32/UC-35).
  4. Text-pairing (Jupytext) as the git-friendly storage strategy — feeds the history-portability thread (poor JSON diffs → paired text / nbdime).

6. UC seed

| # | Seed | Disposition |
|---|---|---|
| UC-84 | Attach/project a computational notebook (.ipynb): preserve cell structure (markdown / code / output) and embedded computed outputs, surfacing each output as a snapshot with its (weak) execution provenance (run count, environment not guaranteed); re-execution is capability-gated, default is present-the-snapshot + offer a static rendered projection | new |
| | Notebook JSON / MIME-bundle outputs = non-Markdown content; Markdown projection is lossy | enrich UC-55, UC-59 |
| | Computed/evaluated cell = computation-defined content | enrich UC-54 |
| | Cell id (nbformat 4.5+) = sub-page address for anchor/transclusion | enrich UC-35, links UC-32 |
| | Stored outputs as derived snapshot (nbstripout/Jupytext ethos) = source-canonical/outputs-derived | links UC-83, UC-79 |

7. Architecture notes for SHARD-WP-0002

  • T12 (page model): add computational-notebook as a page shape — an ordered cell list where code cells own embedded computed outputs (MIME bundles) with weak execution provenance. Distinct from prose, typed records, query-defined, inline-embedded objects (Quip/Notion), typed-graph (Wikibase), and the literate one-source-many-projection shape (UC-83). The defining new attribute: derived output stored inside the source.
  • T15 (translation / fidelity): .ipynb is non-Markdown; nbconvert→Markdown is lossy and directional (drops live outputs/kernel/rich MIME). Keep JSON canonical; any Markdown is a projection. MIME-bundle outputs map to the content-opacity spectrum (text→html→base64 image = transparent→opaque).
  • T13 (history): JSON diffs are noisy; record text-pairing (Jupytext) and cell-aware diff/merge (nbdime) as history-portability strategies for embedded-output documents. Reinforces "source-canonical, outputs-derived."
  • T16 (projection): captured outputs are a derivation-projection snapshot; re-execution (kernel) and parameterized execution (papermill) are capabilities, not assumptions; degrade to the stored snapshot / nbviewer-style static render.
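The T15 mapping of MIME-bundle outputs onto the content-opacity spectrum can be sketched as a small classifier plus a "most transparent representation" picker. The three-level scale, the assignment of MIME types to levels, and the function names are all assumptions for illustration; the note above only fixes the ordering text → html → base64 image.

```python
# Assumed opacity levels; only the text -> html -> image ordering comes from the note.
TRANSPARENT = {"text/plain", "text/markdown"}
INTERMEDIATE = {"text/html", "application/json", "image/svg+xml"}
RANK = {"transparent": 0, "intermediate": 1, "opaque": 2}


def opacity(mime):
    """Classify a MIME type; anything unrecognized (binary, vendor types) is opaque."""
    if mime in TRANSPARENT:
        return "transparent"
    if mime in INTERMEDIATE:
        return "intermediate"
    return "opaque"


def most_transparent(bundle):
    """Pick the least opaque MIME type in a bundle, or None if it is empty."""
    if not bundle:
        return None
    # sorted() makes tie-breaking deterministic (alphabetical within a level).
    return min(sorted(bundle), key=lambda mime: RANK[opacity(mime)])


# Illustrative bundle as found under "data" in an execute_result / display_data output.
bundle = {
    "image/png": "<base64 data>",
    "text/html": "<b>2</b>",
    "text/plain": "2",
}
choice = most_transparent(bundle)
print(choice, opacity(choice))
```

A selected-MIME projection like this is one of the options raised in open question 3 below; the sketch does not settle whether the remaining representations are kept as tagged blobs or dropped.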

8. Open questions

  1. Does shard-wiki ever re-execute a notebook (host/broker a kernel), or strictly attach + present captured outputs + static render? (Same scope boundary as UC-83/UC-56 "do we ever drive the derivation.")
  2. Is UC-84 distinct from UC-83, or is a notebook just the "outputs-stored-in-source" special case of the literate one-source-many-projection pattern? (Kept separate: UC-84's defining trait is captured derived output embedded in the canonical source with weak provenance — a page-model attribute UC-83 doesn't carry.)
  3. How are MIME-bundle outputs represented in the page model — opaque provenance-tagged blobs, a typed-asset registry (UC-55 open question #10), or selected-MIME projection?
  4. Default storage: attach .ipynb as-is (JSON, noisy diffs) or prefer a paired text representation when the shard is a git repo? (Policy → configurable.)

9. Sources

  • Jupyter nbformat reference (cells, outputs, MIME bundles, cell id 4.5+); Jupyter messaging protocol / kernels docs.
  • nbconvert, nbviewer, JupyterLab, Colab docs.
  • Jupytext, papermill, nbdime, nbstripout project docs.
  • prior: research/260614-literate-programming-deep-dive/ (replication- vs derivation-projection, UC-83); research/260614-notion-deep-dive/ (block-JSON, external-API), research/260614-quip-deep-dive/ (inline embedded objects, UC-55/58/59).

10. Traceability

New UC UC-84 carries the marker in the wikiengines column of spec/UseCaseCatalog.md (true lineage = this dive). Enriched: UC-54, UC-55, UC-59, UC-35; links UC-32, UC-83, UC-79. Architecture cross-refs: SHARD-WP-0002 T12 (notebook page shape: outputs embedded in source), T15 (lossy non-Markdown translation; MIME opacity), T13 (paired-text / nbdime history), T16 (output = derivation-projection snapshot; kernel = capability).