Add Railiance promote rollback tooling

2026-06-27 17:01:11 +02:00
parent 6d862e68be
commit 87bd73b26b
9 changed files with 484 additions and 15 deletions
--- a/bin/railiance
+++ b/bin/railiance
@@ -20,6 +20,8 @@ Commands:
  run           Run Stage 1 local validation from railiance/app.toml
  deploy        Plan/apply Stage 2 canary deployment
  observe       Plan/run Stage 2 observation checks
+  promote       Plan/apply Stage 3 stable promotion
+  rollback      Plan/apply rollback to previous stable
  build-spore   Build a distributable "Spore" bundle
  seed-local    Run the seed script on this machine
  checklist     Pre-VM checklist
@@ -47,6 +49,8 @@ case "$cmd" in
  run) exec railiance-run "$@" ;;
  deploy) exec railiance-stage2 deploy "$@" ;;
  observe) exec railiance-stage2 observe "$@" ;;
+  promote) exec railiance-stage3 promote "$@" ;;
+  rollback) exec railiance-stage3 rollback "$@" ;;
  build-spore) bash "$ROOT/tools/build_spore.sh" ;;
  seed-local) bash "$ROOT/tools/seed_node.sh" ;;
  checklist)
--- a/docs/README.md
+++ b/docs/README.md
@@ -78,6 +78,7 @@ From two bare Linux servers, a Git repo, and valid credentials, you can rebuild
 - [Railiance overlay repo pattern](overlay-repo-pattern.md)
 - [Canary Helm template](canary-helm-template.md)
 - [Stage 2 deploy and observe](stage2-deploy-observe.md)
+- [Promote, rollback, and onboarding](promote-rollback-onboarding.md)
 - [Railiance run command](railiance-run-command.md)

 ## 👥 Contributing
--- a/docs/app-toml-contract.md
+++ b/docs/app-toml-contract.md
@@ -186,17 +186,17 @@ records only the route, target object, and pass/fail state.

 ## Command Semantics

-Commands in `app.toml` are declarations for Railiance tooling. Stage 1 and
-Stage 2 commands now have local CLI support; Stage 3 commands may still point
-to existing scripts or runbook commands until T07 lands.
+Commands in `app.toml` are declarations for Railiance tooling. Stage 1, Stage
+2, and Stage 3 commands now have local CLI support; workload scripts may still
+wrap them for service-specific checks.

 Expected mapping:

 - Stage 1 commands are consumed by `bin/railiance run <overlay-dir>`.
 - Stage 2 commands are consumed by `bin/railiance deploy --stage 2 <overlay-dir>`
  and `bin/railiance observe --stage 2 <overlay-dir>`.
- Stage 3 commands are consumed by future `bin/railiance promote <overlay-dir>`
-  and `bin/railiance rollback <overlay-dir>` commands.
+- Stage 3 commands are consumed by `bin/railiance promote <overlay-dir>` and
+  `bin/railiance rollback <overlay-dir>`.

 Tooling must emit machine-readable results with workload identity, candidate
 revision, checks run, pass/fail status, non-secret evidence, rollback target,
--- a/docs/deployment-lifecycle.md
+++ b/docs/deployment-lifecycle.md
@@ -317,14 +317,14 @@ must not cut over to Stage 3.

 ## Minimum Command Contract

-Future CLI tasks should make these lifecycle operations repeatable:
+The Railiance CLI makes these lifecycle operations repeatable:

 ```text
-bin/railiance run <overlay-dir>                    # Stage 1 local validation
-bin/railiance deploy --stage 2 <overlay-dir> --plan  # Stage 2 canary plan
-bin/railiance observe --stage 2 <overlay-dir> --plan # Stage 2 evidence targets
-bin/railiance promote <overlay-dir>                  # Stage 3 production promotion
-bin/railiance rollback <overlay-dir>                 # rollback to previous stable
+bin/railiance run <overlay-dir>                         # Stage 1 local validation
+bin/railiance deploy --stage 2 <overlay-dir> --plan     # Stage 2 canary plan
+bin/railiance observe --stage 2 <overlay-dir> --plan    # Stage 2 evidence targets
+bin/railiance promote <overlay-dir> --plan              # Stage 3 production promotion
+bin/railiance rollback <overlay-dir> --plan             # rollback to previous stable
 ```

 The exact command names may change as implementation lands, but the behavior
--- a/docs/promote-rollback-onboarding.md
+++ b/docs/promote-rollback-onboarding.md
@@ -0,0 +1,71 @@
+# Promote, Rollback, and Onboarding
+
+This guide shows the representative Railiance lifecycle for an overlay repo.
+Commands default to plan mode so the path is repeatable before cluster access or
+operator approval exists.
+
+## Stage 1
+
+```bash
+bin/railiance run /path/to/overlay --pretty
+```
+
+Stage 1 validates `railiance/app.toml`, local commands, and local checks. Save
+the JSON result as non-secret evidence before Stage 2.
+
+## Stage 2
+
+```bash
+bin/railiance deploy --stage 2 /path/to/overlay --plan --pretty
+bin/railiance observe --stage 2 /path/to/overlay --plan --pretty
+```
+
+When Helm, kubectl, cluster access, and approval evidence are ready:
+
+```bash
+bin/railiance deploy --stage 2 /path/to/overlay --apply --approval-id <state-hub-id>
+bin/railiance observe --stage 2 /path/to/overlay --live --pretty
+```
+
+For critical workloads, Stage 2 apply must not run until the operator has
+approved canary exposure and rollback context is known.
+
+## Stage 3
+
+```bash
+bin/railiance promote /path/to/overlay --plan --pretty
+bin/railiance rollback /path/to/overlay --plan --pretty
+```
+
+Promotion plan mode emits a `railiance.stage3-promote-result.v1` JSON result
+with stable release identity, chart and values paths, previous-stable target,
+expected evidence, and approval requirements.
+
+Rollback plan mode emits a `railiance.stage3-rollback-result.v1` JSON result
+with rollback strategy, release identity, verification text, and apply-time
+requirements.
+
+When approval evidence and Helm access are ready:
+
+```bash
+bin/railiance promote /path/to/overlay --apply --approval-id <state-hub-id>
+bin/railiance rollback /path/to/overlay --apply --approval-id <state-hub-id> --revision <helm-revision>
+```
+
+Promotion apply fails closed if the chart or values file is missing, previous
+stable is not recorded, Helm is unavailable, or required approval is missing.
+Rollback apply fails closed if the rollback strategy or approval id is missing,
+Helm is unavailable, or `helm-revision` is used without `--revision`.
+
+## Human Approval Points
+
+Critical infrastructure workloads require explicit operator approval before:
+
+- Stage 2 canary exposure;
+- Stage 3 stable promotion;
+- rollback apply, unless an incident runbook defines a narrower break-glass
+  process and records the evidence id.
+
+Progress notes should include only non-secret result summaries: schema version,
+status, release, namespace, approval id, check counts, and command byte counts.
+Do not paste command logs, kubeconfigs, tokens, or private service output.
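
Reviewer sketch, not part of this commit: the progress-note guidance above could be followed with a small helper along these lines. It assumes the result shape that `railiance-stage3` emits in this change (`prechecks`, `actions`, `stdout_bytes`, `stderr_bytes`). The function name `summarize_result` and the sample values are hypothetical.

```python
from collections import Counter
from typing import Any


def summarize_result(result: dict[str, Any]) -> dict[str, Any]:
    """Reduce a Stage 3 result to the non-secret fields listed for progress notes."""
    prechecks = result.get("prechecks", [])
    actions = result.get("actions", [])
    return {
        "schema_version": result.get("schema_version"),
        "status": result.get("status"),
        "release": result.get("release"),
        "namespace": result.get("namespace"),
        "approval_id": result.get("approval_id"),
        # Count prechecks per status instead of copying their detail strings.
        "check_counts": dict(Counter(item.get("status") for item in prechecks)),
        # Only byte counts are kept; command output itself is never included.
        "command_bytes": {
            "stdout": sum(item.get("stdout_bytes", 0) for item in actions),
            "stderr": sum(item.get("stderr_bytes", 0) for item in actions),
        },
    }


if __name__ == "__main__":
    # Hypothetical plan-mode result, trimmed to the fields the helper reads.
    sample = {
        "schema_version": "railiance.stage3-promote-result.v1",
        "status": "planned",
        "release": "demo-app",
        "namespace": "demo-app",
        "approval_id": None,
        "prechecks": [
            {"name": "app.toml", "status": "passed", "required": True},
            {"name": "helm", "status": "not_required", "required": False},
        ],
        "actions": [],
    }
    print(summarize_result(sample))
```

Because the helper reads only named keys, fields such as chart paths or future additions to the result are not copied into notes unless someone adds them deliberately.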
--- a/tools/README_tools.md
+++ b/tools/README_tools.md
@@ -60,6 +60,11 @@ This model emphasizes:
 ---


+
+### `railiance-stage3`
+- Backs `bin/railiance promote` and `bin/railiance rollback`.
+- Emits non-secret JSON plans/results for stable promotion and rollback.
+
 ### `railiance-stage2`
 - Backs `bin/railiance deploy --stage 2` and `bin/railiance observe --stage 2`.
 - Emits non-secret JSON plans/results for canary deployment and observation.
--- a/tools/cmd/railiance-stage3
+++ b/tools/cmd/railiance-stage3
@@ -0,0 +1,377 @@
+#!/usr/bin/env python3
+"""Railiance Stage 3 promote and rollback tooling."""
+
+from __future__ import annotations
+
+import argparse
+import json
+import shutil
+import subprocess
+import sys
+import time
+import tomllib
+from datetime import UTC, datetime
+from pathlib import Path
+from typing import Any
+
+SUPPORTED_SCHEMA = "railiance.app.v1"
+
+
+def utc_now() -> str:
+    return datetime.now(UTC).replace(microsecond=0).isoformat().replace("+00:00", "Z")
+
+
+def load_contract(app_dir: Path) -> tuple[Path, dict[str, Any]]:
+    contract_path = app_dir / "railiance" / "app.toml"
+    if not contract_path.exists():
+        raise SystemExit(f"Missing Railiance contract: {contract_path}")
+    with contract_path.open("rb") as handle:
+        data = tomllib.load(handle)
+    if data.get("schema_version") != SUPPORTED_SCHEMA:
+        raise SystemExit(
+            f"Unsupported schema_version {data.get('schema_version')!r}; expected {SUPPORTED_SCHEMA}"
+        )
+    return contract_path, data
+
+
+def app_identity(data: dict[str, Any]) -> dict[str, Any]:
+    app = data.get("app", {})
+    source = data.get("source", {})
+    return {
+        "app": {
+            "id": app.get("id"),
+            "name": app.get("name"),
+            "repo": app.get("repo"),
+            "owner": app.get("owner"),
+            "criticality": app.get("criticality"),
+        },
+        "source": {
+            "revision": source.get("revision"),
+            "artifact": source.get("artifact"),
+            "digest_policy": source.get("digest_policy"),
+        },
+    }
+
+
+def checks_by_id(data: dict[str, Any]) -> dict[str, dict[str, Any]]:
+    return {check.get("id"): check for check in data.get("checks", [])}
+
+
+def stage_checks(data: dict[str, Any], stage_name: str) -> list[dict[str, Any]]:
+    stage = data.get("stages", {}).get(stage_name, {})
+    lookup = checks_by_id(data)
+    return [lookup[item] for item in stage.get("checks", []) if item in lookup]
+
+
+def stage2_helm_check(data: dict[str, Any]) -> dict[str, Any] | None:
+    for check in stage_checks(data, "stage2"):
+        if check.get("type") == "helm":
+            return check
+    return None
+
+
+def precheck(name: str, status: str, required: bool, detail: str | None = None) -> dict[str, Any]:
+    item: dict[str, Any] = {"name": name, "status": status, "required": required}
+    if detail:
+        item["detail"] = detail
+    return item
+
+
+def required_failures(items: list[dict[str, Any]]) -> list[dict[str, Any]]:
+    return [item for item in items if item.get("required", True) and item.get("status") != "passed"]
+
+
+def run_command(args: list[str], cwd: Path, timeout: int, command_ref: str) -> dict[str, Any]:
+    started = time.monotonic()
+    try:
+        completed = subprocess.run(
+            args,
+            cwd=cwd,
+            text=True,
+            capture_output=True,
+            timeout=timeout,
+            check=False,
+        )
+        return {
+            "command_ref": command_ref,
+            "status": "passed" if completed.returncode == 0 else "failed",
+            "exit_code": completed.returncode,
+            "duration_seconds": round(time.monotonic() - started, 3),
+            "stdout_bytes": len(completed.stdout.encode()),
+            "stderr_bytes": len(completed.stderr.encode()),
+        }
+    except subprocess.TimeoutExpired as exc:
+        stdout = exc.stdout or b""  # may be bytes even when text=True
+        stderr = exc.stderr or b""
+        return {
+            "command_ref": command_ref,
+            "status": "failed",
+            "exit_code": None,
+            "duration_seconds": round(time.monotonic() - started, 3),
+            "error": f"timeout after {timeout}s",
+            "stdout_bytes": len(stdout.encode() if isinstance(stdout, str) else stdout),
+            "stderr_bytes": len(stderr.encode() if isinstance(stderr, str) else stderr),
+        }
+
+
+def stage3_context(app_dir: Path, contract_path: Path, data: dict[str, Any]) -> dict[str, Any]:
+    stage = data.get("stages", {}).get("stage3", {})
+    if not stage.get("enabled", False):
+        raise SystemExit("Stage 3 is disabled in railiance/app.toml")
+    app = data.get("app", {})
+    helm = stage2_helm_check(data) or {}
+    chart = app_dir / str(helm.get("chart", f"charts/{app.get('id', 'app')}"))
+    values = app_dir / "values" / "stage3-production.yaml"
+    release = str(stage.get("release", app.get("id", "app")))
+    namespace = str(stage.get("namespace", app.get("id", "default")))
+    context = {
+        "contract": str(contract_path),
+        "app_dir": str(app_dir),
+        "stage": "stage3",
+        "namespace": namespace,
+        "release": release,
+        "chart": str(chart),
+        "values": str(values),
+        "promotion_mode": stage.get("promotion_mode"),
+        "previous_stable": stage.get("previous_stable"),
+        "requires_approval": bool(stage.get("requires_approval", False)),
+        "evidence_expected": list(stage.get("evidence", [])),
+        "checks_expected": list(stage.get("checks", [])),
+    }
+    context.update(app_identity(data))
+    return context
+
+
+def rollback_context(app_dir: Path, contract_path: Path, data: dict[str, Any]) -> dict[str, Any]:
+    context = stage3_context(app_dir, contract_path, data)
+    rollback = data.get("rollback", {})
+    context["rollback"] = {
+        "strategy": rollback.get("strategy"),
+        "command_ref": "rollback.command",
+        "verification": rollback.get("verification"),
+    }
+    return context
+
+
+def promote_prechecks(app_dir: Path, context: dict[str, Any], mode: str, approval_id: str | None) -> list[dict[str, Any]]:
+    checks = [precheck("app.toml", "passed", True)]
+    chart = Path(context["chart"])
+    values = Path(context["values"])
+    checks.append(precheck("stage3-chart", "passed" if chart.exists() else "failed", True, str(chart)))
+    checks.append(precheck("stage3-values", "passed" if values.exists() else "failed", True, str(values)))
+    checks.append(
+        precheck(
+            "previous-stable",
+            "passed" if context.get("previous_stable") else "failed",
+            True,
+            "Stage 3 must record the rollback target before promotion",
+        )
+    )
+    if mode == "apply":
+        checks.append(precheck("helm", "passed" if shutil.which("helm") else "failed", True, "helm executable"))
+    else:
+        checks.append(precheck("helm", "not_required", False, "plan mode does not execute helm"))
+    if mode == "apply" and context.get("requires_approval"):
+        checks.append(
+            precheck(
+                "approval-id",
+                "passed" if approval_id else "failed",
+                True,
+                "Stage 3 requires approval before stable promotion",
+            )
+        )
+    elif context.get("requires_approval"):
+        checks.append(precheck("approval-id", "required_before_apply", False))
+    return checks
+
+
+def rollback_prechecks(context: dict[str, Any], mode: str, approval_id: str | None, revision: str | None) -> list[dict[str, Any]]:
+    checks = [precheck("app.toml", "passed", True)]
+    strategy = context.get("rollback", {}).get("strategy")
+    checks.append(precheck("rollback-strategy", "passed" if strategy else "failed", True, str(strategy or "")))
+    if mode == "apply":
+        checks.append(precheck("helm", "passed" if shutil.which("helm") else "failed", True, "helm executable"))
+        checks.append(
+            precheck(
+                "approval-id",
+                "passed" if approval_id else "failed",
+                True,
+                "Rollback apply requires approval or incident evidence",
+            )
+        )
+        if strategy == "helm-revision":
+            checks.append(precheck("helm-revision", "passed" if revision else "failed", True))
+    else:
+        checks.append(precheck("helm", "not_required", False, "plan mode does not execute helm"))
+        checks.append(precheck("approval-id", "required_before_apply", False))
+        if strategy == "helm-revision":
+            checks.append(precheck("helm-revision", "required_before_apply", False))
+    return checks
+
+
+def promote_args(context: dict[str, Any], timeout: int) -> list[str]:
+    return [
+        "helm",
+        "upgrade",
+        "--install",
+        context["release"],
+        context["chart"],
+        "--namespace",
+        context["namespace"],
+        "--create-namespace",
+        "-f",
+        context["values"],
+        "--atomic",
+        "--wait",
+        "--timeout",
+        f"{timeout}m",
+    ]
+
+
+def rollback_args(context: dict[str, Any], revision: str, timeout: int) -> list[str]:
+    return [
+        "helm",
+        "rollback",
+        context["release"],
+        revision,
+        "--namespace",
+        context["namespace"],
+        "--wait",
+        "--timeout",
+        f"{timeout}m",
+    ]
+
+
+def promote(argv: list[str]) -> int:
+    parser = argparse.ArgumentParser(description="Plan or apply a Stage 3 stable promotion.")
+    parser.add_argument("app_dir", nargs="?", default=".")
+    parser.add_argument("--mode", choices=["plan", "apply"], default="plan")
+    parser.add_argument("--plan", action="store_const", const="plan", dest="mode")
+    parser.add_argument("--apply", action="store_const", const="apply", dest="mode")
+    parser.add_argument("--approval-id")
+    parser.add_argument("--timeout-minutes", type=int, default=10)
+    parser.add_argument("--json-out")
+    parser.add_argument("--pretty", action="store_true")
+    args = parser.parse_args(argv)
+
+    app_dir = Path(args.app_dir).resolve()
+    contract_path, data = load_contract(app_dir)
+    context = stage3_context(app_dir, contract_path, data)
+    checks = promote_prechecks(app_dir, context, args.mode, args.approval_id)
+    failures = required_failures(checks)
+    actions: list[dict[str, Any]] = []
+    status = "planned" if not failures else "blocked"
+    if args.mode == "apply" and not failures:
+        action = run_command(promote_args(context, args.timeout_minutes), app_dir, args.timeout_minutes * 60, "stage3.helm-promote")
+        actions.append(action)
+        status = "applied" if action.get("status") == "passed" else "failed"
+    result: dict[str, Any] = {
+        "schema_version": "railiance.stage3-promote-result.v1",
+        "status": status,
+        "mode": args.mode,
+        "generated_at": utc_now(),
+        **context,
+        "approval_id": args.approval_id,
+        "prechecks": checks,
+        "actions": actions,
+        "planned_actions": [
+            {
+                "action_ref": "stage3.helm-promote",
+                "tool": "helm",
+                "release": context["release"],
+                "namespace": context["namespace"],
+                "chart": context["chart"],
+                "values": context["values"],
+            }
+        ],
+        "summary": {
+            "required_prechecks_failed": len(failures),
+            "actions_total": len(actions),
+            "actions_failed": len([item for item in actions if item.get("status") != "passed"]),
+        },
+    }
+    return emit(result, args.json_out, args.pretty, {"planned", "applied"})
+
+
+def rollback(argv: list[str]) -> int:
+    parser = argparse.ArgumentParser(description="Plan or apply a rollback to the previous stable release.")
+    parser.add_argument("app_dir", nargs="?", default=".")
+    parser.add_argument("--mode", choices=["plan", "apply"], default="plan")
+    parser.add_argument("--plan", action="store_const", const="plan", dest="mode")
+    parser.add_argument("--apply", action="store_const", const="apply", dest="mode")
+    parser.add_argument("--approval-id")
+    parser.add_argument("--revision", help="Helm revision to roll back to for helm-revision strategy.")
+    parser.add_argument("--timeout-minutes", type=int, default=10)
+    parser.add_argument("--json-out")
+    parser.add_argument("--pretty", action="store_true")
+    args = parser.parse_args(argv)
+
+    app_dir = Path(args.app_dir).resolve()
+    contract_path, data = load_contract(app_dir)
+    context = rollback_context(app_dir, contract_path, data)
+    checks = rollback_prechecks(context, args.mode, args.approval_id, args.revision)
+    failures = required_failures(checks)
+    actions: list[dict[str, Any]] = []
+    status = "planned" if not failures else "blocked"
+    if args.mode == "apply" and not failures:
+        action = run_command(
+            rollback_args(context, args.revision or "0", args.timeout_minutes),  # Helm treats revision 0 as the previous release
+            app_dir,
+            args.timeout_minutes * 60,
+            "stage3.helm-rollback",
+        )
+        actions.append(action)
+        status = "applied" if action.get("status") == "passed" else "failed"
+    result: dict[str, Any] = {
+        "schema_version": "railiance.stage3-rollback-result.v1",
+        "status": status,
+        "mode": args.mode,
+        "generated_at": utc_now(),
+        **context,
+        "approval_id": args.approval_id,
+        "revision": args.revision,
+        "prechecks": checks,
+        "actions": actions,
+        "planned_actions": [
+            {
+                "action_ref": "stage3.helm-rollback",
+                "tool": "helm",
+                "release": context["release"],
+                "namespace": context["namespace"],
+                "revision": args.revision,
+            }
+        ],
+        "summary": {
+            "required_prechecks_failed": len(failures),
+            "actions_total": len(actions),
+            "actions_failed": len([item for item in actions if item.get("status") != "passed"]),
+        },
+    }
+    return emit(result, args.json_out, args.pretty, {"planned", "applied"})
+
+
+def emit(result: dict[str, Any], json_out: str | None, pretty: bool, success_statuses: set[str]) -> int:
+    rendered = json.dumps(result, indent=2 if pretty else None, sort_keys=True)
+    print(rendered)
+    if json_out:
+        output = Path(json_out)
+        output.parent.mkdir(parents=True, exist_ok=True)
+        output.write_text(rendered + "\n", encoding="utf-8")
+    return 0 if result["status"] in success_statuses else 1
+
+
+def main(argv: list[str]) -> int:
+    if not argv:
+        print("Usage: railiance-stage3 <promote|rollback> [args]", file=sys.stderr)
+        return 2
+    command = argv[0]
+    if command == "promote":
+        return promote(argv[1:])
+    if command == "rollback":
+        return rollback(argv[1:])
+    print(f"Unknown Stage 3 command: {command}", file=sys.stderr)
+    return 2
+
+
+if __name__ == "__main__":
+    raise SystemExit(main(sys.argv[1:]))
--- a/tools/create_railiance_overlay_repo.sh
+++ b/tools/create_railiance_overlay_repo.sh
@@ -152,7 +152,7 @@ digest_policy = "preferred"

 [rollback]
 strategy = "helm-revision"
-command = "bin/railiance rollback ${APP_ID}"
+command = "railiance rollback . --plan"
 verification = "Stable release health check returns 200 after rollback."

 [platform]
@@ -197,7 +197,7 @@ observation_minutes = 30
 enabled = true
 namespace = "${APP_ID}"
 release = "${APP_ID}"
-commands = ["bin/railiance promote ${APP_ID}", "bin/railiance observe ${APP_ID}"]
+commands = ["railiance promote . --plan", "railiance rollback . --plan"]
 checks = ["stage2-accepted", "rollback-target", "cluster-health"]
 evidence = ["promotion command id", "new stable digest", "post-promotion smoke"]
 requires_approval = true
@@ -748,7 +748,9 @@ change \`railiance.traffic.mode\` to \`weighted\`, set \`provider: traefik\`,
 and choose explicit stable/canary weights in \`values/stage2-canary.yaml\`.

 Before Stage 2 apply, fill in real image repositories, platform dependencies,
-observability endpoints, rollback target details, and approval evidence.
+observability endpoints, rollback target details, and approval evidence. Before
+Stage 3, run \`railiance promote . --plan\` and \`railiance rollback . --plan\`
+so stable promotion and rollback evidence can be reviewed together.
 EOF

 cat > "${OUT_DIR}/.gitignore" <<'EOF'
--- a/workplans/RAIL-BS-WP-0006-staged-promotion-lifecycle.md
+++ b/workplans/RAIL-BS-WP-0006-staged-promotion-lifecycle.md
@@ -224,7 +224,7 @@ generated overlays to declare the repeatable Stage 2 plan commands.

 ```task
 id: RAIL-BS-WP-0006-T07
-status: todo
+status: done
 priority: medium
 state_hub_task_id: "476198f6-0049-4ac4-9593-6723c86c9602"
 ```
@@ -242,6 +242,15 @@ Expected output:
 **Done when:** a representative app can move Stage 1 -> Stage 2 -> Stage 3 and
 back through rollback using documented commands.

+2026-06-27: Added `tools/cmd/railiance-stage3` and dispatcher entries for
+`bin/railiance promote` and `bin/railiance rollback`. Both commands default to
+non-mutating JSON plans, apply modes require approval evidence and Helm, and
+rollback apply also requires a Helm revision for `helm-revision` strategy.
+Added `docs/promote-rollback-onboarding.md` with the representative Stage 1 ->
+Stage 2 -> Stage 3 -> rollback path and explicit human approval points for
+critical workloads. Updated generated overlays to declare promote/rollback plan
+commands.
+
 ## Dependencies

 This workplan should be done before the Forgejo production cutover. It can run