diff --git a/canon/standards/sbom-convention_v0.1.md b/canon/standards/sbom-convention_v0.1.md index 890f0c7..e65f478 100644 --- a/canon/standards/sbom-convention_v0.1.md +++ b/canon/standards/sbom-convention_v0.1.md @@ -6,7 +6,7 @@ domain: custodian status: active version: "0.1" created: "2026-03-01" -updated: "2026-03-01" +updated: "2026-03-12" --- # SBOM Convention v0.1 — Dependency Tracking & Licence Governance @@ -27,20 +27,23 @@ dashboard (`/sbom`) provides domain-level and repo-level drill-down. --- -## 1. Authoritative Lockfiles per Ecosystem +## 1. Capture Mechanisms -| Ecosystem | Authoritative file | Notes | -|-----------|-------------------|-------| -| Python | `uv.lock` | Preferred. `requirements.txt` accepted as fallback | -| Node / npm | `package-lock.json` | Preferred. `yarn.lock` accepted | -| Rust | `Cargo.lock` | Auto-detected | -| Terraform | `.terraform.lock.hcl` | Provider pins; ecosystem stored as `other` until ENUM extended | -| Go | `go.sum` | *Not yet parsed — planned* | -| Java / JVM | `gradle.lockfile` / `pom.xml` | *Not yet parsed — planned* | -| Ansible | `requirements.yml` | *Not yet parsed — planned* | +`ingest_sbom.py` runs all four mechanisms in a single scan when given `--repo-path`. +No flags needed — comprehensive detection is the default. -**Principle:** commit lockfiles to the repo. Lockfiles are the SBOM source -of truth; do not generate them at ingest time. +| Mechanism | File(s) | Ecosystem | Detection scope | +|-----------|---------|-----------|-----------------| +| **Package manager lockfiles** | `uv.lock`, `requirements.txt`, `package-lock.json`, `yarn.lock`, `Cargo.lock` | `python`, `node`, `rust` | Anywhere in tree | +| **Terraform provider lock** | `.terraform.lock.hcl` | `terraform` | Anywhere in tree | +| **Ansible Galaxy manifest** | `ansible/requirements.yml` or `.yaml` | `ansible` | Under directories named `ansible/` | +| **Tool manifest** | `sbom-tools.yaml` (repo root) | `tool`, `ansible`, `terraform`, etc. 
| Repo root only | + +**Go / Java parsers** (`go.sum`, `pom.xml`, `gradle.lockfile`) are *not yet +implemented* — planned for a future workplan. + +**Principle:** commit lockfiles and `sbom-tools.yaml` to the repo. These are +the SBOM source of truth; do not generate them at ingest time. --- @@ -64,27 +67,35 @@ curl -s http://127.0.0.1:8000/repos/ | python3 -m json.tool ## 3. SBOM Ingestion -### 3.1 Standard ingest (single lockfile at repo root) +### 3.1 Standard ingest (all mechanisms, recommended) ```bash cd ~/the-custodian/state-hub make ingest-sbom REPO= REPO_PATH=/path/to/repo ``` -The script auto-detects the first recognised lockfile at `REPO_PATH`. +`ingest_sbom.py` automatically runs all four mechanisms in one scan — lockfiles, +Terraform provider locks, Ansible Galaxy manifests, and `sbom-tools.yaml`. All +results are merged into a single snapshot. Non-dep directories (`.venv`, +`node_modules`, `.git`, `dist`, etc.) are automatically skipped. -### 3.2 Multi-ecosystem repos (recommended for complex repos) +### 3.2 Repos with system-level tools: capture first, then ingest -Use `SCAN=1` to walk the repo tree and combine **all** lockfiles into a single -snapshot. Non-dep directories (`.venv`, `node_modules`, `.git`, `dist`, etc.) -are automatically skipped. +For repos that use system-level tools not tracked by any lockfile (Terraform +binary, Helm, kubectl, k3s, goss, etc.): ```bash -make ingest-sbom REPO=the-custodian SCAN=1 REPO_PATH=/home/worsch/the-custodian -``` +# Step 1: generate sbom-tools.yaml via agent +make capture-tools REPO= REPO_PATH=/path/to/repo -This is the correct approach for repos that contain both a backend and a -frontend (e.g., a Python API + Node/Observable dashboard). 
+# Step 2: review sbom-tools.yaml — correct any confidence: low entries + +# Step 3: commit sbom-tools.yaml +git -C /path/to/repo add sbom-tools.yaml && git -C /path/to/repo commit -m "chore(sbom): add tool manifest" + +# Step 4: ingest everything +make ingest-sbom REPO= REPO_PATH=/path/to/repo +``` ### 3.3 Explicit lockfile path @@ -96,8 +107,7 @@ Multiple lockfiles can be passed by calling the script directly with repeated `--lockfile` flags: ```bash -cd ~/the-custodian/state-hub -.venv/bin/python scripts/ingest_sbom.py \ +uv run python scripts/ingest_sbom.py \ --repo \ --lockfile /path/to/uv.lock \ --lockfile /path/to/package-lock.json @@ -106,11 +116,40 @@ cd ~/the-custodian/state-hub ### 3.4 Dry run (inspect without submitting) ```bash -make ingest-sbom REPO= SCAN=1 REPO_PATH=/path/to/repo -# append: add --dry-run to the command, or run the script directly: -.venv/bin/python scripts/ingest_sbom.py --repo --scan --repo-path /path/to/repo --dry-run +make ingest-sbom REPO= REPO_PATH=/path/to/repo DRY_RUN=1 ``` +### 3.5 sbom-tools.yaml: the tool manifest + +Create `sbom-tools.yaml` at the repo root for any system-level tools not +covered by lockfiles. Schema: + +```yaml +# sbom-tools.yaml +tools: + - name: terraform + version: "1.9.5" # confidence: medium + ecosystem: terraform + license_spdx: BUSL-1.1 + is_direct: true + is_dev: false + - name: helm + version: null # confidence: low (no version pin found) + ecosystem: tool + license_spdx: Apache-2.0 + is_direct: true + is_dev: false +``` + +**Valid ecosystem values:** `python`, `node`, `rust`, `go`, `java`, `terraform`, +`ansible`, `tool`, `other` + +Annotate each version with a `# confidence: high/medium/low` comment. +Entries with `confidence: low` need human verification before committing. + +The `make capture-tools` command generates this file automatically using the +SBOM capture agent prompt (`state-hub/prompts/sbom-capture-agent.md`). + --- ## 4.
Snapshot Semantics @@ -248,10 +287,14 @@ The SBOM dashboard aggregates across all repos within a domain in the ## 10. Planned Enhancements -- **Go / Java parsers** — add to `ingest_sbom.py` +- **Go / Java parsers** — add `go.sum`, `pom.xml`, `gradle.lockfile` support to `ingest_sbom.py` - **Versioned snapshots** — retain history per repo for trend analysis - **Licence override file** — allow repos to document known-acceptable copyleft exceptions (`.sbom-overrides.yaml`) - **CI integration** — GitHub Actions step to run ingest on lockfile change - **Direct-dep detection for uv.lock** — parse `pyproject.toml` `[project.dependencies]` to mark direct deps accurately +- **Galaxy API licence lookup** — resolve `license_spdx` for Ansible collections + via the Galaxy API at ingest time +- **Tool version pinning guidance** — tooling to detect `confidence: low` entries + across all registered repos and flag them for resolution diff --git a/state-hub/Makefile b/state-hub/Makefile index 52db38f..9e6607a 100644 --- a/state-hub/Makefile +++ b/state-hub/Makefile @@ -133,16 +133,26 @@ list-repos: @test -n "$(DOMAIN)" || (echo "ERROR: DOMAIN is required."; exit 1) curl -sf "http://127.0.0.1:8000/repos/?domain=$(DOMAIN)" | python3 -m json.tool -## Ingest SBOM data for a repo. +## Ingest SBOM data for a repo (all mechanisms: lockfiles + ansible + sbom-tools.yaml). +## Auto-detect all sources: make ingest-sbom REPO=the-custodian REPO_PATH=/home/worsch/the-custodian ## Single lockfile (explicit): make ingest-sbom REPO=the-custodian LOCKFILE=/path/to/uv.lock -## Scan all lockfiles in tree: make ingest-sbom REPO=the-custodian SCAN=1 REPO_PATH=/home/worsch/the-custodian -## Auto-detect at repo root: make ingest-sbom REPO=the-custodian REPO_PATH=/home/worsch/the-custodian +## Dry-run (no submit): make ingest-sbom REPO=the-custodian REPO_PATH=... DRY_RUN=1 +## Tip: run capture-tools first for repos with system-level tool dependencies. 
ingest-sbom: @test -n "$(REPO)" || (echo "ERROR: REPO is required."; exit 1) uv run python scripts/ingest_sbom.py --repo "$(REPO)" \ $(if $(LOCKFILE),--lockfile "$(LOCKFILE)") \ - $(if $(SCAN),--scan) \ - $(if $(REPO_PATH),--repo-path "$(REPO_PATH)") + $(if $(REPO_PATH),--repo-path "$(REPO_PATH)") \ + $(if $(DRY_RUN),--dry-run) + +## Run SBOM capture agent for a repo — generates/updates sbom-tools.yaml. +## Usage: make capture-tools REPO=railiance-infra [REPO_PATH=/home/worsch/railiance-infra] +## Add DRY_RUN=1 to preview without writing. +capture-tools: + @test -n "$(REPO)" || (echo "ERROR: REPO is required."; exit 1) + uv run python scripts/capture_sbom_tools.py --repo "$(REPO)" \ + $(if $(REPO_PATH),--repo-path "$(REPO_PATH)") \ + $(if $(DRY_RUN),--dry-run) ## Check a repo for ADR-001 compliance: make validate-adr REPO=/path/to/repo [DOMAIN=custodian] validate-adr: diff --git a/state-hub/api/models/sbom_entry.py b/state-hub/api/models/sbom_entry.py index cece53e..70a15b1 100644 --- a/state-hub/api/models/sbom_entry.py +++ b/state-hub/api/models/sbom_entry.py @@ -15,6 +15,9 @@ class Ecosystem(str, enum.Enum): rust = "rust" go = "go" java = "java" + terraform = "terraform" + ansible = "ansible" + tool = "tool" other = "other" diff --git a/state-hub/migrations/versions/d6e7f8a9b0c1_sbom_ecosystem_expand.py b/state-hub/migrations/versions/d6e7f8a9b0c1_sbom_ecosystem_expand.py new file mode 100644 index 0000000..8e432b0 --- /dev/null +++ b/state-hub/migrations/versions/d6e7f8a9b0c1_sbom_ecosystem_expand.py @@ -0,0 +1,30 @@ +"""SBOM ecosystem enum expansion: add terraform, ansible, tool + +Revision ID: d6e7f8a9b0c1 +Revises: c5d6e7f8a9b0 +Create Date: 2026-03-12 00:00:00.000000 +""" +from typing import Sequence, Union + +from alembic import op + +revision: str = "d6e7f8a9b0c1" +down_revision: Union[str, None] = "c5d6e7f8a9b0" +branch_labels: Union[str, Sequence[str], None] = None +depends_on: Union[str, Sequence[str], None] = None + + +def upgrade() -> None: + # 
PostgreSQL adds one enum value per ALTER TYPE ... ADD VALUE statement. + # Since PG 12 the statement may run inside a transaction block (no COMMIT + # needed), but the new values cannot be used until that transaction commits. + op.execute("ALTER TYPE ecosystem ADD VALUE IF NOT EXISTS 'terraform'") + op.execute("ALTER TYPE ecosystem ADD VALUE IF NOT EXISTS 'ansible'") + op.execute("ALTER TYPE ecosystem ADD VALUE IF NOT EXISTS 'tool'") + + +def downgrade() -> None: + # PostgreSQL does not support removing enum values without recreating the + # type. Document the limitation and do nothing — reverting this migration + # requires a full type recreation if needed. + pass diff --git a/state-hub/prompts/sbom-capture-agent.md b/state-hub/prompts/sbom-capture-agent.md new file mode 100644 index 0000000..2061821 --- /dev/null +++ b/state-hub/prompts/sbom-capture-agent.md @@ -0,0 +1,90 @@ +# SBOM Capture Agent Prompt + +**Task:** Generate or update `sbom-tools.yaml` for the repository at `{repo_path}` (slug: `{repo_slug}`). + +This file captures system-level tool dependencies that are not tracked by any package manager lockfile — tools that are installed via provisioning, Homebrew, system packages, or assumed present in the environment. + +--- + +## Instructions + +1. **Read the following files** in `{repo_path}` (read each that exists; skip gracefully if absent): + - `CLAUDE.md` — look for stack declarations, tool prerequisites, dev commands + - `README.md` / `QUICKSTART.md` — prerequisites sections, tool version requirements + - `Makefile` — tool invocations, version variables (e.g.
`ANSIBLE_VERSION := 12.3`) + - `pyproject.toml` — Python tool dependencies (already covered by uv.lock; note but don't duplicate) + - `.tool-versions` — asdf version pins + - `.terraform-version` — tfenv pin + - `.ansible-version` — if present + - `Dockerfile` / `docker-compose.yml` — base image versions, tool installs + - `.github/workflows/*.yml` / `.gitlab-ci.yml` — CI tool install steps, version pins + - `ansible/requirements.yml` — **already captured by lockfile parser; do NOT include Galaxy collections here** + - Any `scripts/setup*.sh`, `scripts/bootstrap*.sh`, or `tools/` directory + +2. **Identify system-level tools only** — tools that: + - Are invoked as CLI commands (e.g. `ansible-playbook`, `terraform`, `helm`, `kubectl`, `k3s`, `goss`, `age`, `sops`) + - Are NOT installed via `uv`/`pip`/`npm`/`cargo` into a project virtualenv (those are in lockfiles) + - Note: `ansible` itself as a CLI tool is a system dep even if `ansible-core` appears in `uv.lock` + +3. **For each tool, determine**: + - `name`: canonical tool name (e.g. `ansible`, `terraform`, `helm`, `kubectl`, `k3s`, `goss`, `age`, `sops`, `cloud-init`) + - `version`: the pinned or documented version. Use `unknown` only if no evidence found anywhere. + - `ecosystem`: one of `python`, `node`, `rust`, `go`, `java`, `terraform`, `ansible`, `tool`, `other` + - Use `ansible` for Ansible itself; `terraform` for Terraform itself; `tool` for generic CLI tools + - `license_spdx`: the SPDX identifier. 
Common known licences (use these exact strings): + - ansible / ansible-core: `GPL-3.0-only` + - terraform ≤ 1.5.7: `MPL-2.0`; terraform ≥ 1.6.0: `BUSL-1.1` + - helm: `Apache-2.0` + - kubectl: `Apache-2.0` + - k3s: `Apache-2.0` + - goss: `Apache-2.0` + - age: `BSD-3-Clause` + - sops: `MPL-2.0` + - cloud-init: `Apache-2.0` (or `GPL-3.0-only` for older versions — check) + - docker: `Apache-2.0` + - If unknown, use `null` + - `is_direct`: `true` if this repo directly declares/uses it; `false` if it's a transitive dependency of another tool + - `is_dev`: `true` only if the tool is only used for development/testing, not production operation + +4. **Confidence annotation**: Add a `# confidence: high/medium/low` comment after each entry: + - `high`: version found explicitly pinned in a file + - `medium`: version inferred from context (e.g. "Ansible 12" in README) + - `low`: version not found; using `unknown` or a reasonable guess + +5. **Do NOT include**: + - Python packages already covered by `uv.lock` or `requirements.txt` + - Ansible Galaxy collections (covered by `ansible/requirements.yml`) + - Terraform providers (covered by `.terraform.lock.hcl`) + - Node packages, Rust crates, etc. (covered by their lockfiles) + - Operating system packages unless the repo explicitly declares them + +6. **Output format**: Emit ONLY the YAML block below — no prose, no markdown fences, no explanation. The output must be valid YAML that can be written directly to `sbom-tools.yaml`. + +--- + +## Output format + +```yaml +# sbom-tools.yaml — system-level tool dependencies for {repo_slug} +# Generated by sbom-capture-agent on {date} +# Review each entry before committing. Entries with confidence: low need human verification.
+tools: + - name: example-tool + version: "1.2.3" # confidence: high + ecosystem: tool + license_spdx: Apache-2.0 + is_direct: true + is_dev: false +``` + +If no system-level tools are found, output: +```yaml +# sbom-tools.yaml — system-level tool dependencies for {repo_slug} +# Generated by sbom-capture-agent on {date} +# No system-level tools identified — all dependencies are covered by lockfiles. +tools: [] +``` + +--- + +Now read `{repo_path}` and produce the `sbom-tools.yaml` content. diff --git a/state-hub/scripts/capture_sbom_tools.py b/state-hub/scripts/capture_sbom_tools.py new file mode 100644 index 0000000..a42ed6f --- /dev/null +++ b/state-hub/scripts/capture_sbom_tools.py @@ -0,0 +1,187 @@ +#!/usr/bin/env python3 +"""Invoke the SBOM capture agent to generate/update sbom-tools.yaml for a repo. + +Usage: + python capture_sbom_tools.py --repo [--repo-path ] [--dry-run] + +The script: +1. Resolves repo path from the state-hub API (if --repo-path is not given) +2. Loads the agent prompt from prompts/sbom-capture-agent.md +3. Substitutes {repo_slug}, {repo_path}, {date} placeholders +4. Invokes `claude -p ""` non-interactively +5. Extracts the YAML block from the response +6. 
Writes (or shows diff of) sbom-tools.yaml in the repo root + +Requirements: + - `claude` CLI must be on PATH (Claude Code) + - PyYAML must be available in the active venv +""" +from __future__ import annotations + +import argparse +import datetime +import difflib +import json +import os +import re +import subprocess +import sys +import urllib.error +import urllib.request +from pathlib import Path + +API_BASE = os.environ.get("API_BASE", "http://127.0.0.1:8000").rstrip("/") +SCRIPT_DIR = Path(__file__).parent +PROMPT_FILE = SCRIPT_DIR.parent / "prompts" / "sbom-capture-agent.md" + + +def resolve_repo_path(repo_slug: str) -> Path | None: + """Look up the registered path for a repo slug via the state-hub API.""" + url = f"{API_BASE}/repos/{repo_slug}/" + try: + with urllib.request.urlopen(url, timeout=10) as resp: + data = json.loads(resp.read()) + path_str = data.get("local_path") + if path_str: + return Path(path_str) + except (urllib.error.URLError, KeyError): + pass + return None + + +def load_prompt(repo_slug: str, repo_path: Path) -> str: + if not PROMPT_FILE.exists(): + print(f"Error: prompt file not found at {PROMPT_FILE}", file=sys.stderr) + sys.exit(1) + template = PROMPT_FILE.read_text() + today = datetime.date.today().isoformat() + return ( + template + .replace("{repo_slug}", repo_slug) + .replace("{repo_path}", str(repo_path)) + .replace("{date}", today) + ) + + +def invoke_agent(prompt: str) -> str: + """Run `claude -p ` and return stdout.""" + try: + result = subprocess.run( + ["claude", "-p", prompt], + capture_output=True, + text=True, + timeout=120, + ) + except FileNotFoundError: + print("Error: `claude` CLI not found on PATH. 
Install Claude Code.", file=sys.stderr) + sys.exit(1) + except subprocess.TimeoutExpired: + print("Error: claude invocation timed out after 120s.", file=sys.stderr) + sys.exit(1) + + if result.returncode != 0: + print(f"Error: claude exited with code {result.returncode}", file=sys.stderr) + if result.stderr: + print(result.stderr, file=sys.stderr) + sys.exit(1) + + return result.stdout + + +def extract_yaml(response: str) -> str: + """Extract YAML content from the agent response. + + Accepts: + - Raw YAML (starts with # or 'tools:') + - YAML wrapped in ```yaml ... ``` fences + """ + # Try fenced block first + m = re.search(r"```(?:yaml)?\s*\n(.*?)```", response, re.DOTALL) + if m: + return m.group(1).strip() + + # Otherwise treat entire response as YAML + stripped = response.strip() + if stripped.startswith("#") or stripped.startswith("tools:"): + return stripped + + print("Warning: could not extract YAML from agent response.", file=sys.stderr) + print("Raw response:", file=sys.stderr) + print(response[:500], file=sys.stderr) + sys.exit(1) + + +def show_diff(old: str | None, new: str, target: Path) -> None: + if old is None: + print(f"[new file] {target}") + for line in new.splitlines(): + print(f" + {line}") + else: + diff = list(difflib.unified_diff( + old.splitlines(keepends=True), + new.splitlines(keepends=True), + fromfile=f"a/{target.name}", + tofile=f"b/{target.name}", + )) + if diff: + print("".join(diff)) + else: + print(f"[no changes] {target}") + + +def main() -> None: + parser = argparse.ArgumentParser( + description="Generate/update sbom-tools.yaml for a repo using the SBOM capture agent." + ) + parser.add_argument("--repo", required=True, help="Repo slug (e.g. 
'railiance-infra')") + parser.add_argument("--repo-path", help="Path to repo root (auto-resolved from state-hub if omitted)") + parser.add_argument("--dry-run", action="store_true", + help="Show prompt and diff without writing sbom-tools.yaml") + parser.add_argument("--print-prompt", action="store_true", + help="Print the rendered prompt and exit (useful for inspection)") + args = parser.parse_args() + + # Resolve repo path + if args.repo_path: + repo_path = Path(args.repo_path).resolve() + else: + repo_path = resolve_repo_path(args.repo) + if repo_path is None: + # Fall back to ~/repo_slug convention + repo_path = Path.home() / args.repo + print(f"Could not resolve path from API; trying {repo_path}", file=sys.stderr) + + if not repo_path.exists(): + print(f"Error: repo path does not exist: {repo_path}", file=sys.stderr) + sys.exit(1) + + target = repo_path / "sbom-tools.yaml" + existing_content = target.read_text() if target.exists() else None + + prompt = load_prompt(args.repo, repo_path) + + if args.print_prompt: + print(prompt) + return + + print(f"Running SBOM capture agent for {args.repo} ({repo_path})…") + response = invoke_agent(prompt) + yaml_content = extract_yaml(response) + + # Ensure trailing newline + if not yaml_content.endswith("\n"): + yaml_content += "\n" + + show_diff(existing_content, yaml_content, target) + + if args.dry_run: + print("\n[dry-run] sbom-tools.yaml not written.") + return + + target.write_text(yaml_content) + print(f"\nWritten: {target}") + print("Review the file, correct any 'confidence: low' entries, then commit.") + + +if __name__ == "__main__": + main() diff --git a/state-hub/scripts/ingest_sbom.py b/state-hub/scripts/ingest_sbom.py index c024ac8..6696234 100644 --- a/state-hub/scripts/ingest_sbom.py +++ b/state-hub/scripts/ingest_sbom.py @@ -1,15 +1,19 @@ #!/usr/bin/env python3 -"""Ingest a repo's lockfile into the State Hub SBOM store. +"""Ingest a repo's lockfiles and tool manifests into the State Hub SBOM store. 
Usage: - python ingest_sbom.py --repo [--lockfile ] [--api-base ] + python ingest_sbom.py --repo [--repo-path ] [--dry-run] -Auto-detects lockfile type: - uv.lock → Python ecosystem - requirements.txt → Python ecosystem (basic) - package-lock.json → Node ecosystem - yarn.lock → Node ecosystem - Cargo.lock → Rust ecosystem +Auto-detects all of the following in one scan: + uv.lock → python + requirements.txt → python + package-lock.json → node + yarn.lock → node + Cargo.lock → rust + .terraform.lock.hcl → terraform (anywhere in tree) + ansible/requirements.yml → ansible (anywhere under ansible/ dirs) + ansible/requirements.yaml → ansible + sbom-tools.yaml → tool (repo root; agent-generated) """ from __future__ import annotations @@ -22,11 +26,17 @@ import urllib.error import urllib.request from pathlib import Path +try: + import yaml # optional; only needed for sbom-tools.yaml and ansible parsers + _YAML_AVAILABLE = True +except ImportError: + _YAML_AVAILABLE = False + API_BASE = os.environ.get("API_BASE", "http://127.0.0.1:8000").rstrip("/") # --------------------------------------------------------------------------- -# Lockfile parsers +# Lockfile parsers — each returns list[dict] # --------------------------------------------------------------------------- def _parse_uv_lock(path: Path) -> list[dict]: @@ -55,7 +65,7 @@ def _parse_uv_lock(path: Path) -> list[dict]: "package_version": e.get("package_version"), "ecosystem": "python", "license_spdx": None, - "is_direct": False, # uv.lock doesn't distinguish; treat all as transitive + "is_direct": False, "is_dev": False, } for e in entries @@ -70,7 +80,6 @@ def _parse_requirements_txt(path: Path) -> list[dict]: line = line.strip() if not line or line.startswith("#") or line.startswith("-"): continue - # Handle: pkg==1.2.3, pkg>=1.2, pkg m = re.match(r"^([A-Za-z0-9_.\-]+)(?:[>= list[dict]: packages = data.get("packages", {}) entries = [] for pkg_path, info in packages.items(): - if not pkg_path: # root package + if not 
pkg_path: continue name = info.get("name") or pkg_path.split("node_modules/")[-1] entries.append({ @@ -120,8 +129,6 @@ def _parse_yarn_lock(path: Path) -> list[dict]: if not stripped or stripped.startswith("#"): continue if not line.startswith(" ") and stripped.endswith(":"): - # New package block header: "name@version::" or "\"name@version\":" - # May list multiple versions: "name@^1.0, name@~1.0:" current_names = [] current_version = None for part in stripped.rstrip(":").split(","): @@ -188,12 +195,10 @@ def _parse_terraform_lock_hcl(path: Path) -> list[dict]: for line in path.read_text().splitlines(): stripped = line.strip() - # e.g.: provider "registry.terraform.io/hetznercloud/hcloud" { m = re.match(r'^provider\s+"([^"]+)"\s*\{', stripped) if m: - # Use full provider address as package_name, short name as display full = m.group(1) - current_name = full # e.g. "registry.terraform.io/hetznercloud/hcloud" + current_name = full current_version = None elif current_name is not None: vm = re.match(r'version\s*=\s*"([^"]+)"', stripped) @@ -203,7 +208,7 @@ def _parse_terraform_lock_hcl(path: Path) -> list[dict]: entries.append({ "package_name": current_name, "package_version": current_version, - "ecosystem": "other", # "terraform" not yet in ENUM; tracked as other + "ecosystem": "terraform", "license_spdx": None, "is_direct": True, "is_dev": False, @@ -214,7 +219,114 @@ def _parse_terraform_lock_hcl(path: Path) -> list[dict]: return entries -_LOCKFILE_PARSERS = { +def _parse_ansible_requirements(path: Path) -> list[dict]: + """Parse ansible/requirements.yml — collections and roles from Ansible Galaxy.""" + if not _YAML_AVAILABLE: + print(f"Warning: PyYAML not available; skipping {path}", file=sys.stderr) + return [] + + try: + data = yaml.safe_load(path.read_text()) + except yaml.YAMLError as e: + print(f"Warning: cannot parse {path}: {e}", file=sys.stderr) + return [] + + if not isinstance(data, dict): + return [] + + entries = [] + + for item in 
data.get("collections", []) or []: + if isinstance(item, str): + name, version = item, None + elif isinstance(item, dict): + name = item.get("name", "") + version = str(item.get("version", "")) if item.get("version") else None + else: + continue + if name: + entries.append({ + "package_name": name, + "package_version": version, + "ecosystem": "ansible", + "license_spdx": None, + "is_direct": True, + "is_dev": False, + }) + + for item in data.get("roles", []) or []: + if isinstance(item, str): + name, version = item, None + elif isinstance(item, dict): + name = item.get("name", item.get("src", "")) + version = str(item.get("version", "")) if item.get("version") else None + else: + continue + if name: + entries.append({ + "package_name": name, + "package_version": version, + "ecosystem": "ansible", + "license_spdx": None, + "is_direct": True, + "is_dev": False, + }) + + return entries + + +def _parse_sbom_tools_yaml(path: Path) -> list[dict]: + """Parse sbom-tools.yaml — agent-generated tool manifest at repo root.""" + if not _YAML_AVAILABLE: + print(f"Warning: PyYAML not available; skipping {path}", file=sys.stderr) + return [] + + try: + data = yaml.safe_load(path.read_text()) + except yaml.YAMLError as e: + print(f"Warning: cannot parse {path}: {e}", file=sys.stderr) + return [] + + if not isinstance(data, dict): + return [] + + entries = [] + valid_ecosystems = { + "python", "node", "rust", "go", "java", + "terraform", "ansible", "tool", "other", + } + + for item in data.get("tools", []) or []: + if not isinstance(item, dict): + continue + name = item.get("name", "") + version = str(item.get("version", "")) if item.get("version") else None + if version == "unknown": + print(f" Warning: tool '{name}' has version=unknown — flagged for review", file=sys.stderr) + version = None + ecosystem = item.get("ecosystem", "tool") + if ecosystem not in valid_ecosystems: + print(f" Warning: unknown ecosystem '{ecosystem}' for '{name}'; using 'tool'", file=sys.stderr) + 
ecosystem = "tool" + license_spdx = item.get("license_spdx") or None + entries.append({ + "package_name": name, + "package_version": version, + "ecosystem": ecosystem, + "license_spdx": license_spdx, + "is_direct": bool(item.get("is_direct", True)), + "is_dev": bool(item.get("is_dev", False)), + }) + + return entries + + +# --------------------------------------------------------------------------- +# Detection helpers +# --------------------------------------------------------------------------- + +# Filename → parser for standard lockfiles (detected by filename anywhere in tree) +_LOCKFILE_PARSERS: dict[str, object] = { "uv.lock": _parse_uv_lock, "requirements.txt": _parse_requirements_txt, "package-lock.json": _parse_package_lock_json, @@ -234,6 +346,47 @@ _SKIP_DIRS = { } +def detect_all(repo_path: Path) -> list[tuple[Path, str, object]]: + """Scan repo_path and return all discovered dependency sources. + + Returns list of (path, label, parser_fn) tuples covering: + - Standard lockfiles (anywhere in tree) + - Ansible requirements files (in ansible/ subdirs) + - sbom-tools.yaml at repo root + """ + found: list[tuple[Path, str, object]] = [] + seen_paths: set[Path] = set() + + # Walk tree for all source types + for dirpath, dirnames, filenames in os.walk(repo_path): + dirnames[:] = sorted(d for d in dirnames if d not in _SKIP_DIRS) + dirpath_p = Path(dirpath) + + # Standard lockfiles + for fname, parser in _LOCKFILE_PARSERS.items(): + if fname in filenames: + p = dirpath_p / fname + if p not in seen_paths: + found.append((p, fname, parser)) + seen_paths.add(p) + + # Ansible requirements files — only under directories named "ansible" + if dirpath_p.name == "ansible": + for fname in ("requirements.yml", "requirements.yaml"): + if fname in filenames: + p = dirpath_p / fname + if p not in seen_paths: + found.append((p, f"ansible/{fname}", _parse_ansible_requirements)) + seen_paths.add(p) + + # sbom-tools.yaml at repo root only + tools_manifest = repo_path / 
"sbom-tools.yaml" + if tools_manifest.exists() and tools_manifest not in seen_paths: + found.append((tools_manifest, "sbom-tools.yaml", _parse_sbom_tools_yaml)) + + return found + + def detect_lockfile(repo_path: Path) -> tuple[Path, str] | None: """Return (lockfile_path, filename) for the first recognised lockfile at repo root.""" for name in _LOCKFILE_PARSERS: @@ -244,7 +397,10 @@ def detect_lockfile(repo_path: Path) -> tuple[Path, str] | None: def detect_lockfiles_recursive(repo_path: Path) -> list[Path]: - """Walk repo_path and return all recognised lockfiles, skipping non-dep dirs.""" + """Walk repo_path and return all recognised lockfiles, skipping non-dep dirs. + + Kept for backwards compatibility; prefer detect_all() for new code. + """ found: list[Path] = [] for dirpath, dirnames, filenames in os.walk(repo_path): dirnames[:] = sorted(d for d in dirnames if d not in _SKIP_DIRS) @@ -292,52 +448,47 @@ def post_ingest(api_base: str, repo_slug: str, entries: list[dict]) -> dict: # --------------------------------------------------------------------------- def main() -> None: - parser = argparse.ArgumentParser(description="Ingest a repo's lockfiles into the State Hub SBOM store.") + parser = argparse.ArgumentParser( + description="Ingest a repo's lockfiles and tool manifests into the State Hub SBOM store." + ) parser.add_argument("--repo", required=True, help="Managed-repo slug (e.g. 
'the-custodian')") parser.add_argument("--lockfile", action="append", dest="lockfiles", metavar="PATH", help="Path to a specific lockfile (repeatable)") parser.add_argument("--repo-path", default=".", help="Repo root for auto-detection/scan (default: cwd)") parser.add_argument("--scan", action="store_true", - help="Recursively find ALL lockfiles under --repo-path (handles multi-ecosystem repos)") + help="Recursively find ALL lockfiles under --repo-path (deprecated; now default behaviour)") parser.add_argument("--api-base", default=API_BASE, help="State Hub API base URL") parser.add_argument("--dry-run", action="store_true", help="Parse only — do not submit") args = parser.parse_args() repo_root = Path(args.repo_path).resolve() - lockfile_paths: list[Path] = [] + all_entries: list[dict] = [] if args.lockfiles: - lockfile_paths = [Path(lf).resolve() for lf in args.lockfiles] - elif args.scan: - lockfile_paths = detect_lockfiles_recursive(repo_root) - if not lockfile_paths: - print(f"No lockfiles found under '{repo_root}'.", file=sys.stderr) - sys.exit(1) - print(f"Scan found {len(lockfile_paths)} lockfile(s):") - for lf in lockfile_paths: - print(f" {lf.relative_to(repo_root) if lf.is_relative_to(repo_root) else lf}") + # Explicit paths: parse each, detect parser by filename + for lf_str in args.lockfiles: + lf = Path(lf_str).resolve() + parsed = parse_lockfile(lf) + rel = lf.relative_to(repo_root) if lf.is_relative_to(repo_root) else lf + print(f" {rel}: {len(parsed)} packages") + all_entries.extend(parsed) else: - found = detect_lockfile(repo_root) - if not found: + # Comprehensive auto-detection: all mechanisms in one scan + sources = detect_all(repo_root) + if not sources: print( - f"No recognised lockfile found in '{repo_root}'. " - f"Supported: {', '.join(_LOCKFILE_PARSERS)}. 
" - "Use --scan to search subdirectories.", + f"No recognised dependency sources found in '{repo_root}'.", file=sys.stderr, ) sys.exit(1) - lockfile_path, _ = found - print(f"Auto-detected: {lockfile_path}") - lockfile_paths = [lockfile_path] - all_entries: list[dict] = [] - for lf in lockfile_paths: - parsed = parse_lockfile(lf) - rel = lf.relative_to(repo_root) if lf.is_relative_to(repo_root) else lf - print(f" {rel}: {len(parsed)} packages") - all_entries.extend(parsed) + for src_path, label, parser_fn in sources: + parsed = parser_fn(src_path) + rel = src_path.relative_to(repo_root) if src_path.is_relative_to(repo_root) else src_path + print(f" {label} ({rel}): {len(parsed)} entries") + all_entries.extend(parsed) - print(f"Total: {len(all_entries)} packages across {len(lockfile_paths)} lockfile(s)") + print(f"Total: {len(all_entries)} entries") if args.dry_run: print(json.dumps(all_entries[:5], indent=2)) diff --git a/state-hub/tests/test_ingest_sbom.py b/state-hub/tests/test_ingest_sbom.py new file mode 100644 index 0000000..51dc48f --- /dev/null +++ b/state-hub/tests/test_ingest_sbom.py @@ -0,0 +1,397 @@ +"""Unit tests for ingest_sbom.py parsers and auto-detection.""" +from __future__ import annotations + +import json +import sys +import textwrap +from pathlib import Path + +import pytest + +# Make scripts/ importable +sys.path.insert(0, str(Path(__file__).parent.parent / "scripts")) +import ingest_sbom as sb + + +# --------------------------------------------------------------------------- +# Terraform parser +# --------------------------------------------------------------------------- + +TERRAFORM_LOCK = textwrap.dedent("""\ + provider "registry.terraform.io/hashicorp/template" { + version = "2.2.0" + constraints = ">= 2.0.0" + hashes = [ + "h1:abc123", + ] + } + + provider "registry.terraform.io/hetznercloud/hcloud" { + version = "1.52.0" + constraints = ">= 1.40.0" + } +""") + + +def test_terraform_parser_ecosystem(tmp_path): + lock = tmp_path / 
".terraform.lock.hcl" + lock.write_text(TERRAFORM_LOCK) + entries = sb._parse_terraform_lock_hcl(lock) + assert len(entries) == 2 + for e in entries: + assert e["ecosystem"] == "terraform", f"expected terraform, got {e['ecosystem']}" + names = {e["package_name"] for e in entries} + assert "registry.terraform.io/hashicorp/template" in names + assert "registry.terraform.io/hetznercloud/hcloud" in names + + +def test_terraform_parser_versions(tmp_path): + lock = tmp_path / ".terraform.lock.hcl" + lock.write_text(TERRAFORM_LOCK) + entries = sb._parse_terraform_lock_hcl(lock) + by_name = {e["package_name"]: e for e in entries} + assert by_name["registry.terraform.io/hashicorp/template"]["package_version"] == "2.2.0" + assert by_name["registry.terraform.io/hetznercloud/hcloud"]["package_version"] == "1.52.0" + + +def test_terraform_parser_is_direct(tmp_path): + lock = tmp_path / ".terraform.lock.hcl" + lock.write_text(TERRAFORM_LOCK) + entries = sb._parse_terraform_lock_hcl(lock) + assert all(e["is_direct"] for e in entries) + + +def test_terraform_parser_empty(tmp_path): + lock = tmp_path / ".terraform.lock.hcl" + lock.write_text("# no providers\n") + entries = sb._parse_terraform_lock_hcl(lock) + assert entries == [] + + +# --------------------------------------------------------------------------- +# Ansible Galaxy parser +# --------------------------------------------------------------------------- + +ANSIBLE_REQUIREMENTS_FULL = textwrap.dedent("""\ + collections: + - name: community.general + version: "9.5.0" + - name: ansible.posix + version: "1.6.0" + - community.crypto + + roles: + - name: geerlingguy.docker + version: "6.1.0" + - geerlingguy.pip +""") + +ANSIBLE_REQUIREMENTS_EMPTY = textwrap.dedent("""\ + collections: [] + roles: [] +""") + +ANSIBLE_REQUIREMENTS_COLLECTIONS_ONLY = textwrap.dedent("""\ + collections: + - name: community.general + version: "9.0.0" +""") + + +def test_ansible_parser_collections_and_roles(tmp_path): + req = tmp_path / 
"requirements.yml" + req.write_text(ANSIBLE_REQUIREMENTS_FULL) + entries = sb._parse_ansible_requirements(req) + assert len(entries) == 5 + names = {e["package_name"] for e in entries} + assert "community.general" in names + assert "ansible.posix" in names + assert "community.crypto" in names + assert "geerlingguy.docker" in names + assert "geerlingguy.pip" in names + + +def test_ansible_parser_ecosystem(tmp_path): + req = tmp_path / "requirements.yml" + req.write_text(ANSIBLE_REQUIREMENTS_FULL) + entries = sb._parse_ansible_requirements(req) + for e in entries: + assert e["ecosystem"] == "ansible" + + +def test_ansible_parser_versions(tmp_path): + req = tmp_path / "requirements.yml" + req.write_text(ANSIBLE_REQUIREMENTS_FULL) + entries = sb._parse_ansible_requirements(req) + by_name = {e["package_name"]: e for e in entries} + assert by_name["community.general"]["package_version"] == "9.5.0" + assert by_name["ansible.posix"]["package_version"] == "1.6.0" + assert by_name["community.crypto"]["package_version"] is None # no version specified + assert by_name["geerlingguy.docker"]["package_version"] == "6.1.0" + assert by_name["geerlingguy.pip"]["package_version"] is None + + +def test_ansible_parser_is_direct(tmp_path): + req = tmp_path / "requirements.yml" + req.write_text(ANSIBLE_REQUIREMENTS_FULL) + entries = sb._parse_ansible_requirements(req) + assert all(e["is_direct"] for e in entries) + + +def test_ansible_parser_empty(tmp_path): + req = tmp_path / "requirements.yml" + req.write_text(ANSIBLE_REQUIREMENTS_EMPTY) + entries = sb._parse_ansible_requirements(req) + assert entries == [] + + +def test_ansible_parser_collections_only(tmp_path): + req = tmp_path / "requirements.yml" + req.write_text(ANSIBLE_REQUIREMENTS_COLLECTIONS_ONLY) + entries = sb._parse_ansible_requirements(req) + assert len(entries) == 1 + assert entries[0]["package_name"] == "community.general" + + +def test_ansible_parser_yaml_extension(tmp_path): + """Both .yml and .yaml extensions must 
work.""" + req = tmp_path / "requirements.yaml" + req.write_text(ANSIBLE_REQUIREMENTS_COLLECTIONS_ONLY) + entries = sb._parse_ansible_requirements(req) + assert len(entries) == 1 + + +def test_ansible_parser_invalid_yaml(tmp_path, capsys): + req = tmp_path / "requirements.yml" + req.write_text("collections: [unclosed") + entries = sb._parse_ansible_requirements(req) + assert entries == [] + captured = capsys.readouterr() + assert "Warning" in captured.err + + +# --------------------------------------------------------------------------- +# sbom-tools.yaml parser +# --------------------------------------------------------------------------- + +SBOM_TOOLS_YAML = textwrap.dedent("""\ + tools: + - name: ansible + version: "12.3.0" + ecosystem: ansible + license_spdx: GPL-3.0-only + is_direct: true + is_dev: false + - name: terraform + version: "1.10.5" + ecosystem: terraform + license_spdx: BSL-1.1 + is_direct: true + is_dev: false + - name: helm + version: "3.17.1" + ecosystem: tool + license_spdx: Apache-2.0 + is_direct: true + is_dev: false + - name: k3s + version: unknown + ecosystem: other + license_spdx: Apache-2.0 + is_direct: true + is_dev: false +""") + +SBOM_TOOLS_YAML_MINIMAL = textwrap.dedent("""\ + tools: + - name: kubectl + ecosystem: tool +""") + + +def test_sbom_tools_parser_basic(tmp_path): + manifest = tmp_path / "sbom-tools.yaml" + manifest.write_text(SBOM_TOOLS_YAML) + entries = sb._parse_sbom_tools_yaml(manifest) + assert len(entries) == 4 + names = {e["package_name"] for e in entries} + assert {"ansible", "terraform", "helm", "k3s"} == names + + +def test_sbom_tools_parser_ecosystems(tmp_path): + manifest = tmp_path / "sbom-tools.yaml" + manifest.write_text(SBOM_TOOLS_YAML) + entries = sb._parse_sbom_tools_yaml(manifest) + by_name = {e["package_name"]: e for e in entries} + assert by_name["ansible"]["ecosystem"] == "ansible" + assert by_name["terraform"]["ecosystem"] == "terraform" + assert by_name["helm"]["ecosystem"] == "tool" + assert 
by_name["k3s"]["ecosystem"] == "other" + + +def test_sbom_tools_parser_licenses(tmp_path): + manifest = tmp_path / "sbom-tools.yaml" + manifest.write_text(SBOM_TOOLS_YAML) + entries = sb._parse_sbom_tools_yaml(manifest) + by_name = {e["package_name"]: e for e in entries} + assert by_name["ansible"]["license_spdx"] == "GPL-3.0-only" + assert by_name["terraform"]["license_spdx"] == "BSL-1.1" + assert by_name["helm"]["license_spdx"] == "Apache-2.0" + + +def test_sbom_tools_parser_unknown_version_becomes_none(tmp_path, capsys): + """version: unknown must be converted to None and emit a warning.""" + manifest = tmp_path / "sbom-tools.yaml" + manifest.write_text(SBOM_TOOLS_YAML) + entries = sb._parse_sbom_tools_yaml(manifest) + by_name = {e["package_name"]: e for e in entries} + assert by_name["k3s"]["package_version"] is None + captured = capsys.readouterr() + assert "unknown" in captured.err + + +def test_sbom_tools_parser_minimal_entry(tmp_path): + """Only 'name' and 'ecosystem' required; version and license default to None.""" + manifest = tmp_path / "sbom-tools.yaml" + manifest.write_text(SBOM_TOOLS_YAML_MINIMAL) + entries = sb._parse_sbom_tools_yaml(manifest) + assert len(entries) == 1 + e = entries[0] + assert e["package_name"] == "kubectl" + assert e["ecosystem"] == "tool" + assert e["package_version"] is None + assert e["license_spdx"] is None + assert e["is_direct"] is True + assert e["is_dev"] is False + + +def test_sbom_tools_parser_invalid_ecosystem_falls_back(tmp_path, capsys): + manifest = tmp_path / "sbom-tools.yaml" + manifest.write_text("tools:\n - name: foo\n ecosystem: nonsense\n") + entries = sb._parse_sbom_tools_yaml(manifest) + assert entries[0]["ecosystem"] == "tool" + captured = capsys.readouterr() + assert "Warning" in captured.err + + +def test_sbom_tools_parser_empty_tools(tmp_path): + manifest = tmp_path / "sbom-tools.yaml" + manifest.write_text("tools: []\n") + entries = sb._parse_sbom_tools_yaml(manifest) + assert entries == [] + + +def 
test_sbom_tools_parser_invalid_yaml(tmp_path, capsys): + manifest = tmp_path / "sbom-tools.yaml" + manifest.write_text("tools: {bad yaml: [unclosed") + entries = sb._parse_sbom_tools_yaml(manifest) + assert entries == [] + captured = capsys.readouterr() + assert "Warning" in captured.err + + +# --------------------------------------------------------------------------- +# detect_all — comprehensive multi-parser scan +# --------------------------------------------------------------------------- + +def test_detect_all_uv_lock(tmp_path): + (tmp_path / "uv.lock").write_text("[[package]]\nname = \"typer\"\nversion = \"0.12.0\"\n") + sources = sb.detect_all(tmp_path) + labels = {label for _, label, _ in sources} + assert "uv.lock" in labels + + +def test_detect_all_terraform_lock(tmp_path): + tf_dir = tmp_path / "terraform" / "hetzner" + tf_dir.mkdir(parents=True) + (tf_dir / ".terraform.lock.hcl").write_text( + 'provider "registry.terraform.io/hetznercloud/hcloud" {\n version = "1.52.0"\n}\n' + ) + sources = sb.detect_all(tmp_path) + labels = {label for _, label, _ in sources} + assert ".terraform.lock.hcl" in labels + + +def test_detect_all_ansible_requirements(tmp_path): + ansible_dir = tmp_path / "ansible" + ansible_dir.mkdir() + (ansible_dir / "requirements.yml").write_text("collections:\n - name: community.general\n") + sources = sb.detect_all(tmp_path) + labels = {label for _, label, _ in sources} + assert "ansible/requirements.yml" in labels + + +def test_detect_all_sbom_tools_yaml(tmp_path): + (tmp_path / "sbom-tools.yaml").write_text("tools:\n - name: helm\n ecosystem: tool\n") + sources = sb.detect_all(tmp_path) + labels = {label for _, label, _ in sources} + assert "sbom-tools.yaml" in labels + + +def test_detect_all_multi_ecosystem(tmp_path): + """A repo with Python + Terraform + Ansible + tools manifest yields all four.""" + # Python + (tmp_path / "uv.lock").write_text("[[package]]\nname = \"typer\"\nversion = \"0.12.0\"\n") + # Terraform + tf_dir = 
tmp_path / "terraform" + tf_dir.mkdir() + (tf_dir / ".terraform.lock.hcl").write_text( + 'provider "registry.terraform.io/hashicorp/null" {\n version = "3.2.3"\n}\n' + ) + # Ansible + ansible_dir = tmp_path / "ansible" + ansible_dir.mkdir() + (ansible_dir / "requirements.yml").write_text("collections:\n - name: ansible.posix\n version: \"1.6.0\"\n") + # Tool manifest + (tmp_path / "sbom-tools.yaml").write_text("tools:\n - name: helm\n ecosystem: tool\n version: \"3.17.1\"\n") + + sources = sb.detect_all(tmp_path) + labels = {label for _, label, _ in sources} + assert "uv.lock" in labels + assert ".terraform.lock.hcl" in labels + assert "ansible/requirements.yml" in labels + assert "sbom-tools.yaml" in labels + + # Parse all and verify merged entries + all_entries = [] + for path, label, parser_fn in sources: + all_entries.extend(parser_fn(path)) + + ecosystems = {e["ecosystem"] for e in all_entries} + assert "python" in ecosystems + assert "terraform" in ecosystems + assert "ansible" in ecosystems + assert "tool" in ecosystems + + +def test_detect_all_skips_venv(tmp_path): + """Lockfiles inside .venv must be ignored.""" + venv_dir = tmp_path / ".venv" / "lib" + venv_dir.mkdir(parents=True) + (venv_dir / "requirements.txt").write_text("requests==2.31.0\n") + sources = sb.detect_all(tmp_path) + paths = {str(p) for p, _, _ in sources} + assert not any(".venv" in p for p in paths) + + +def test_detect_all_ansible_req_only_in_ansible_dir(tmp_path): + """requirements.yml at repo root (not in ansible/) should not be picked up as ansible.""" + (tmp_path / "requirements.yml").write_text("collections:\n - name: community.general\n") + sources = sb.detect_all(tmp_path) + labels = {label for _, label, _ in sources} + # Should NOT be detected since it's not under an 'ansible/' directory + assert "ansible/requirements.yml" not in labels + assert "ansible/requirements.yaml" not in labels + + +def test_detect_all_no_duplicates(tmp_path): + """Same file should not appear twice.""" 
+ (tmp_path / "uv.lock").write_text("[[package]]\nname = \"x\"\nversion = \"1.0\"\n") + sources = sb.detect_all(tmp_path) + paths = [p for p, _, _ in sources] + assert len(paths) == len(set(paths)) + + +def test_detect_all_empty_repo(tmp_path): + sources = sb.detect_all(tmp_path) + assert sources == [] diff --git a/workplans/CUST-WP-0013-sbom-infra-expansion.md b/workplans/CUST-WP-0013-sbom-infra-expansion.md new file mode 100644 index 0000000..e3b1df0 --- /dev/null +++ b/workplans/CUST-WP-0013-sbom-infra-expansion.md @@ -0,0 +1,386 @@ +--- +id: CUST-WP-0013 +type: workplan +title: "SBOM Infrastructure Expansion" +domain: custodian +repo: the-custodian +status: completed +owner: custodian +topic_slug: custodian +state_hub_workstream_id: f4ba84c8-4d47-492d-b65e-73b157271a2b +created: "2026-03-12" +updated: "2026-03-12" +--- + +# CUST-WP-0013 — SBOM Infrastructure Expansion + +**Scope:** Extend SBOM capture beyond Python packages to cover Terraform providers, +Ansible Galaxy collections, and system-level tools (Ansible, Terraform, Helm, k3s, +cloud-init, etc.). Introduces an agent-assisted tool manifest capture workflow, +new ecosystem enum values, comprehensive auto-detection in `ingest_sbom.py`, and +delivers full SBOM coverage for `railiance-infra` and `railiance-cluster`. + +**Drives:** Licence risk visibility across the full dependency graph, not just +language-level packages. + +--- + +## Design Decisions + +### Tool manifest: agent-generated, not hand-written + +System tools (Ansible, Terraform, Helm, k3s, etc.) live outside any lockfile — +they are provisioned, not installed by a package manager. Rather than asking +operators to maintain a hand-written manifest, the SBOM capture agent inspects +the repo and generates/updates `sbom-tools.yaml` automatically. + +The agent prompt (`state-hub/prompts/sbom-capture-agent.md`) is parameterised +per repo. 
It reads the repo's CLAUDE.md, Makefile, README, CI configs, version +pins, and provisioning files, then emits a structured `sbom-tools.yaml` with +tool name, version, ecosystem, SPDX licence, and directness flags. + +A thin wrapper script (`state-hub/scripts/capture_sbom_tools.py`) invokes the +agent prompt via `claude -p` (or prints it for manual use) and writes the result +to `/sbom-tools.yaml`. + +### Comprehensive ingest: all mechanisms per repo + +`make ingest-sbom REPO=` must run all applicable parsers, not just +whichever lockfile happens to be auto-detected first. The updated auto-detection +in `ingest_sbom.py` scans: + +1. Package manager lockfiles (`uv.lock`, `requirements.txt`, `package-lock.json`, + `yarn.lock`, `Cargo.lock`, `go.sum`) +2. Terraform provider locks (`.terraform.lock.hcl`, anywhere in the tree) +3. Ansible Galaxy manifests (`requirements.yml` / `requirements.yaml`, anywhere + in the tree under `ansible/`) +4. Agent-generated tool manifest (`sbom-tools.yaml` at repo root) + +All parsers run and their results are merged into a single snapshot. + +--- + +## Phase 1 — Schema: Ecosystem Enum Extension + +**Acceptance:** `terraform` and `ansible` are valid ecosystem values; existing +`other` entries are unaffected; migration applies cleanly. + +### T01 — Alembic migration: add terraform and ansible enum values + +```task +id: CUST-WP-0013-T01 +state_hub_task_id: c0b6edc4-86ab-4cee-88a8-6c66fb81adee +status: done +priority: high +``` + +Add `terraform` and `ansible` to the `Ecosystem` enum in the DB. Check whether +the column uses a native PostgreSQL ENUM type (requiring `ALTER TYPE`) or a +`String` column (requiring no migration). Write the migration accordingly. +Also add `tool` as a catch-all for tool-manifest entries that don't fit a +named ecosystem. 
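If the column turns out to be a native PostgreSQL ENUM, the migration in T01 comes down to a few `ALTER TYPE ... ADD VALUE` statements. The sketch below only builds those statements as strings. The type name `ecosystem` and the helper name `enum_add_value_statements` are assumptions for illustration, not names taken from the actual schema or migration.

```python
from __future__ import annotations


def enum_add_value_statements(type_name: str, values: list[str]) -> list[str]:
    """Build one ALTER TYPE statement per new enum value.

    Hypothetical helper: the real migration may be structured differently.
    IF NOT EXISTS makes each statement safe to re-run.
    """
    statements = []
    for value in values:
        # Double any single quote so the value is a valid SQL string literal.
        escaped = value.replace("'", "''")
        statements.append(
            f"ALTER TYPE {type_name} ADD VALUE IF NOT EXISTS '{escaped}'"
        )
    return statements


if __name__ == "__main__":
    # "ecosystem" is an assumed type name; check the actual schema first.
    for stmt in enum_add_value_statements("ecosystem", ["terraform", "ansible", "tool"]):
        print(stmt)
```

In an Alembic migration each statement would typically be passed to `op.execute()`. On older PostgreSQL releases `ADD VALUE` cannot run inside a transaction block, so the migration's transaction handling is worth checking before applying it.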
+ +--- + +## Phase 2 — Parser Improvements in ingest_sbom.py + +**Acceptance:** `--dry-run` on railiance-infra shows terraform providers and +ansible collections correctly labelled; tool manifest entries appear with the +declared ecosystem. + +### T02 — Promote Terraform parser: other → terraform ecosystem + +```task +id: CUST-WP-0013-T02 +state_hub_task_id: 7686bccd-022c-4e30-8081-c8487eb82253 +status: done +priority: high +``` + +The `.terraform.lock.hcl` parser already exists in `ingest_sbom.py` but stores +entries as `ecosystem="other"`. Change to `ecosystem="terraform"` after T01 +migration lands. Re-ingest any repos that previously ingested terraform entries +as `other` to correct the label. + +### T03 — Implement Ansible Galaxy requirements.yml parser + +```task +id: CUST-WP-0013-T03 +state_hub_task_id: 48658bdd-4d16-4be0-a87e-45df4f4901b0 +status: done +priority: high +``` + +Parse `requirements.yml` / `requirements.yaml` files found in `ansible/` +subdirectories. Standard format: + +```yaml +collections: + - name: community.general + version: "9.5.0" +roles: + - name: geerlingguy.docker + version: "6.x" +``` + +Store as `ecosystem="ansible"`, `is_direct=True`. Licence left `null` (Galaxy +API lookup is deferred). Handle both `collections:` and `roles:` blocks. + +### T04 — Implement sbom-tools.yaml manifest parser + +```task +id: CUST-WP-0013-T04 +state_hub_task_id: 4522ea04-134b-40ee-a7a2-ea0e4c1c061d +status: done +priority: high +``` + +Parse `sbom-tools.yaml` at the repo root (written by the capture agent). Schema: + +```yaml +# Generated by sbom-capture-agent — review before committing +tools: + - name: ansible + version: "12.3.0" + ecosystem: ansible # or: terraform, other, python, etc. + license_spdx: GPL-3.0-only + is_direct: true + is_dev: false + - name: helm + version: "3.17.x" + ecosystem: other + license_spdx: Apache-2.0 + is_direct: true + is_dev: false +``` + +Supports all existing ecosystem values plus `tool`. 
Pass entries through the +same normalisation as lockfile entries. Skip entries with `version: unknown` +with a warning (agent could not determine version). + +### T05 — Comprehensive auto-detection: all formats in one scan + +```task +id: CUST-WP-0013-T05 +state_hub_task_id: cdda6bf2-2a44-4444-a04a-ac2fe2314923 +status: done +priority: high +``` + +Refactor the `--repo-path` scan to discover and run all applicable parsers, +not just the first match. Scan order: + +1. Walk tree for all `uv.lock`, `requirements.txt`, `package-lock.json`, + `yarn.lock`, `Cargo.lock` +2. Walk tree for all `.terraform.lock.hcl` +3. Walk tree for `ansible/requirements.yml` and `ansible/requirements.yaml` +4. Check repo root for `sbom-tools.yaml` + +Merge all results into a single batch for the snapshot ingest call. Log a +summary line per parser: ` : N packages from `. + +### T06 — Unit tests for new parsers + +```task +id: CUST-WP-0013-T06 +state_hub_task_id: fee37e66-8f41-4dba-995b-97fc66493caf +status: done +priority: medium +``` + +Add test fixtures and unit tests for: +- Ansible Galaxy requirements.yml (collections + roles, version pinned and + unpinned) +- sbom-tools.yaml (valid, missing version, unknown ecosystem) +- Multi-parser scan: repo root with uv.lock + .terraform.lock.hcl + + sbom-tools.yaml produces merged results + +--- + +## Phase 3 — SBOM Capture Agent + +**Acceptance:** `make capture-tools REPO=railiance-infra` produces a reviewed +`sbom-tools.yaml` that correctly identifies Ansible, Terraform, Helm, and other +declared tools with versions and SPDX licences. + +### T07 — Write SBOM capture agent prompt + +```task +id: CUST-WP-0013-T07 +state_hub_task_id: a3b919b5-63b0-44f7-a048-ebfae603ef7b +status: done +priority: high +``` + +Write `state-hub/prompts/sbom-capture-agent.md` — a Claude agent prompt +parameterised with `{repo_slug}` and `{repo_path}`. The prompt instructs the +agent to: + +1. 
Read `CLAUDE.md`, `Makefile`, `README.md`, `pyproject.toml`, `.tool-versions`, + CI configs, Dockerfiles, and provisioning files in `{repo_path}` +2. Identify all system-level tools: name, version (from version pins, Makefile + vars, or documented prerequisites), ecosystem, SPDX licence +3. Identify indirect/transitive tool deps (e.g. Ansible → Python; Terraform → + provider plugins already captured by `.terraform.lock.hcl`) +4. Emit a well-formed `sbom-tools.yaml` with a comment header noting generation + date and confidence level per entry (`# confidence: high/medium/low`) +5. Flag any tools where version could not be determined (`version: unknown`) for + human review + +The prompt must not hallucinate versions — it must derive them from evidence in +the repo or mark them unknown. + +### T08 — Implement capture_sbom_tools.py + +```task +id: CUST-WP-0013-T08 +state_hub_task_id: 9593dca7-e713-4d7a-b4f2-c5333ae0b3d2 +status: done +priority: high +``` + +Write `state-hub/scripts/capture_sbom_tools.py`: + +- Accepts `--repo SLUG` and `--repo-path PATH` +- Resolves repo path from slug via the state-hub API if `--repo-path` is omitted +- Loads the agent prompt from `prompts/sbom-capture-agent.md`, substitutes + `{repo_slug}` and `{repo_path}` +- Invokes `claude -p ""` (non-interactive) and captures stdout +- Parses the YAML block from the response +- Writes or updates `/sbom-tools.yaml` +- Prints a diff of changes if the file already exists +- `--dry-run` flag: print the prompt and diff without writing + +### T09 — Add make capture-tools target + +```task +id: CUST-WP-0013-T09 +state_hub_task_id: 6948e1d2-9c97-4709-bdb0-4b6ded700a22 +status: done +priority: medium +``` + +Add to `state-hub/Makefile`: + +```makefile +capture-tools: ## Run SBOM capture agent for a repo (REPO=slug, REPO_PATH=path) + uv run python scripts/capture_sbom_tools.py --repo $(REPO) $(if $(REPO_PATH),--repo-path $(REPO_PATH),) +``` + +Also update `make ingest-sbom` to note that `capture-tools` 
should be run first +for repos that have system-level tool dependencies. + +--- + +## Phase 4 — Ingest railiance-infra + +**Acceptance:** `make ingest-sbom REPO=railiance-infra` shows terraform providers, +ansible collections, and tool manifest entries in one snapshot. + +### T10 — Capture tools manifest for railiance-infra + +```task +id: CUST-WP-0013-T10 +state_hub_task_id: 99b23998-5129-4777-9d42-7bee5981cdbb +status: done +priority: medium +``` + +Run `make capture-tools REPO=railiance-infra`. Review the generated +`railiance-infra/sbom-tools.yaml` — verify Ansible, Terraform, cloud-init, goss, +and any other tools with their versions and licences. Correct any `unknown` +versions by consulting the repo. Commit the file. + +### T11 — Ingest railiance-infra + +```task +id: CUST-WP-0013-T11 +state_hub_task_id: bb516909-f903-48ce-b60b-a24245e7382e +status: done +priority: medium +``` + +Run `make ingest-sbom REPO=railiance-infra REPO_PATH=~/railiance-infra`. Verify +the snapshot contains: +- Terraform providers (from `.terraform.lock.hcl`) +- Ansible Galaxy collections (from `ansible/requirements.yaml`) +- System tools (from `sbom-tools.yaml`) + +Check the licence report for any copyleft or BSL flags. + +--- + +## Phase 5 — Ingest railiance-cluster + +**Acceptance:** railiance-cluster SBOM covers both Python packages (uv.lock) and +system tools in a single snapshot. + +### T12 — Capture tools manifest for railiance-cluster + +```task +id: CUST-WP-0013-T12 +state_hub_task_id: 7a890f1a-da9f-4e6d-86a7-4fd1aefd5b3f +status: done +priority: medium +``` + +Run `make capture-tools REPO=railiance-cluster`. Review the generated +`railiance-cluster/sbom-tools.yaml` — verify Helm, kubectl, k3s, and any other +operational tools. Commit the file. 
+ +### T13 — Re-ingest railiance-cluster + +```task +id: CUST-WP-0013-T13 +state_hub_task_id: 789dbe93-011a-4470-9fec-ebf249cd7134 +status: done +priority: medium +``` + +Run `make ingest-sbom REPO=railiance-cluster REPO_PATH=~/railiance-cluster`. +Verify the snapshot merges uv.lock (Python packages including ansible-core) and +sbom-tools.yaml entries into one coherent snapshot. Confirm ansible-core GPL-3.0 +flag appears in the licence report. + +--- + +## Phase 6 — Convention Documentation + +**Acceptance:** A developer reading the SBOM convention doc knows exactly how to +add a new repo to SBOM coverage. + +### T14 — Document SBOM capture convention in canon/standards + +```task +id: CUST-WP-0013-T14 +state_hub_task_id: dc3bb2a3-882e-4dd7-ab7c-8b1e88279a7d +status: done +priority: low +``` + +Write `canon/standards/sbom-convention_v0.1.md` documenting: +- The four capture mechanisms and when each applies +- The `sbom-tools.yaml` schema (with confidence annotation convention) +- The `make capture-tools` → review → commit → `make ingest-sbom` workflow +- Licence risk thresholds: copyleft = flag for review; BSL = flag for review; + null licence = acceptable for infra tools if well-known open source + +--- + +## Licence Risk Preview + +Based on known tool licences, expect these flags once ingested: + +| Tool / Package | Licence | Risk level | +|---|---|---| +| ansible-core | GPL-3.0-only | Copyleft — flag (ops toolchain, not shipped) | +| terraform ≥ 1.5.6 | BSL-1.1 | Non-OSI — flag for review | +| hashicorp providers | BSL-1.1 | Same | +| community.general | GPL-3.0 | Copyleft — flag (ops toolchain) | +| Helm | Apache-2.0 | Clean | +| k3s | Apache-2.0 | Clean | +| cloud-init | Apache-2.0 / GPL-3.0 | Mixed — check version | +| goss | Apache-2.0 | Clean | + +All copyleft/BSL entries here are **operational toolchain** dependencies, not +shipped code — risk is low but worth tracking for compliance awareness.
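The risk levels in the table above could be derived mechanically from each entry's `license_spdx` field. This is a minimal sketch, assuming entries shaped like the parser output in the tests (`package_name`, `license_spdx`). The names `classify_licence` and `flag_for_review` are hypothetical, and the prefix matching is a rough heuristic, not the State Hub's actual licence-report logic.

```python
from __future__ import annotations

# Assumption: any identifier starting with one of these is treated as copyleft.
COPYLEFT_PREFIXES = ("GPL-", "AGPL-", "LGPL-")
# "BUSL-1.1" is the SPDX identifier for the Business Source License 1.1.
# "BSL-1.1" is accepted as well because this workplan spells it that way.
NON_OSI = frozenset({"BUSL-1.1", "BSL-1.1"})


def classify_licence(license_spdx: str | None) -> str:
    """Map a single SPDX identifier to a coarse risk bucket."""
    if not license_spdx:
        return "unknown"
    if license_spdx in NON_OSI:
        return "non-osi"
    if license_spdx.startswith(COPYLEFT_PREFIXES):
        return "copyleft"
    return "clean"


def flag_for_review(entries: list[dict]) -> list[tuple[str, str]]:
    """Return (package_name, risk) for entries that need a human look."""
    flagged = []
    for entry in entries:
        risk = classify_licence(entry.get("license_spdx"))
        if risk in ("copyleft", "non-osi"):
            flagged.append((entry["package_name"], risk))
    return flagged


if __name__ == "__main__":
    sample = [
        {"package_name": "ansible-core", "license_spdx": "GPL-3.0-only"},
        {"package_name": "terraform", "license_spdx": "BSL-1.1"},
        {"package_name": "helm", "license_spdx": "Apache-2.0"},
        {"package_name": "kubectl", "license_spdx": None},
    ]
    for name, risk in flag_for_review(sample):
        print(f"{name}: {risk}")
```

The sketch handles one identifier per entry. A mixed value such as the cloud-init row would need the expression split first, and entries with no licence are bucketed as `unknown` and left unflagged, following the T14 threshold for well-known infra tools.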