From 87bd73b26beac77e6afa56e432e8b96bb379af94 Mon Sep 17 00:00:00 2001 From: tegwick Date: Sat, 27 Jun 2026 17:01:11 +0200 Subject: [PATCH] Add Railiance promote rollback tooling --- bin/railiance | 4 + docs/README.md | 1 + docs/app-toml-contract.md | 10 +- docs/deployment-lifecycle.md | 12 +- docs/promote-rollback-onboarding.md | 71 ++++ tools/README_tools.md | 5 + tools/cmd/railiance-stage3 | 377 ++++++++++++++++++ tools/create_railiance_overlay_repo.sh | 8 +- ...L-BS-WP-0006-staged-promotion-lifecycle.md | 11 +- 9 files changed, 484 insertions(+), 15 deletions(-) create mode 100644 docs/promote-rollback-onboarding.md create mode 100755 tools/cmd/railiance-stage3 diff --git a/bin/railiance b/bin/railiance index cbdbc5d..52d17ff 100755 --- a/bin/railiance +++ b/bin/railiance @@ -20,6 +20,8 @@ Commands: run Run Stage 1 local validation from railiance/app.toml deploy Plan/apply Stage 2 canary deployment observe Plan/run Stage 2 observation checks + promote Plan/apply Stage 3 stable promotion + rollback Plan/apply rollback to previous stable build-spore Build a distributable "Spore" bundle seed-local Run the seed script on this machine checklist Pre-VM checklist @@ -47,6 +49,8 @@ case "$cmd" in run) exec railiance-run "$@" ;; deploy) exec railiance-stage2 deploy "$@" ;; observe) exec railiance-stage2 observe "$@" ;; + promote) exec railiance-stage3 promote "$@" ;; + rollback) exec railiance-stage3 rollback "$@" ;; build-spore) bash "$ROOT/tools/build_spore.sh" ;; seed-local) bash "$ROOT/tools/seed_node.sh" ;; checklist) diff --git a/docs/README.md b/docs/README.md index 17dd73e..c8fb424 100644 --- a/docs/README.md +++ b/docs/README.md @@ -78,6 +78,7 @@ From two bare Linux servers, a Git repo, and valid credentials, you can rebuild - [Railiance overlay repo pattern](overlay-repo-pattern.md) - [Canary Helm template](canary-helm-template.md) - [Stage 2 deploy and observe](stage2-deploy-observe.md) +- [Promote, rollback, and onboarding](promote-rollback-onboarding.md) - 
[Railiance run command](railiance-run-command.md) ## 👥 Contributing diff --git a/docs/app-toml-contract.md b/docs/app-toml-contract.md index 7d73753..ca7f0c1 100644 --- a/docs/app-toml-contract.md +++ b/docs/app-toml-contract.md @@ -186,17 +186,17 @@ records only the route, target object, and pass/fail state. ## Command Semantics -Commands in `app.toml` are declarations for Railiance tooling. Stage 1 and -Stage 2 commands now have local CLI support; Stage 3 commands may still point -to existing scripts or runbook commands until T07 lands. +Commands in `app.toml` are declarations for Railiance tooling. Stage 1, Stage +2, and Stage 3 commands now have local CLI support; workload scripts may still +wrap them for service-specific checks. Expected mapping: - Stage 1 commands are consumed by `bin/railiance run `. - Stage 2 commands are consumed by `bin/railiance deploy --stage 2 ` and `bin/railiance observe --stage 2 `. -- Stage 3 commands are consumed by future `bin/railiance promote ` - and `bin/railiance rollback ` commands. +- Stage 3 commands are consumed by `bin/railiance promote ` and + `bin/railiance rollback `. Tooling must emit machine-readable results with workload identity, candidate revision, checks run, pass/fail status, non-secret evidence, rollback target, diff --git a/docs/deployment-lifecycle.md b/docs/deployment-lifecycle.md index 75e8e85..b7bcaad 100644 --- a/docs/deployment-lifecycle.md +++ b/docs/deployment-lifecycle.md @@ -317,14 +317,14 @@ must not cut over to Stage 3. 
## Minimum Command Contract -Future CLI tasks should make these lifecycle operations repeatable: +The Railiance CLI makes these lifecycle operations repeatable: ```text -bin/railiance run # Stage 1 local validation -bin/railiance deploy --stage 2 --plan # Stage 2 canary plan -bin/railiance observe --stage 2 --plan # Stage 2 evidence targets -bin/railiance promote # Stage 3 production promotion -bin/railiance rollback # rollback to previous stable +bin/railiance run # Stage 1 local validation +bin/railiance deploy --stage 2 --plan # Stage 2 canary plan +bin/railiance observe --stage 2 --plan # Stage 2 evidence targets +bin/railiance promote --plan # Stage 3 production promotion +bin/railiance rollback --plan # rollback to previous stable ``` The exact command names may change as implementation lands, but the behavior diff --git a/docs/promote-rollback-onboarding.md b/docs/promote-rollback-onboarding.md new file mode 100644 index 0000000..99ced46 --- /dev/null +++ b/docs/promote-rollback-onboarding.md @@ -0,0 +1,71 @@ +# Promote, Rollback, And Onboarding + +This guide shows the representative Railiance lifecycle for an overlay repo. +Commands default to plan mode so the path is repeatable before cluster access or +operator approval exists. + +## Stage 1 + +```bash +bin/railiance run /path/to/overlay --pretty +``` + +Stage 1 validates `railiance/app.toml`, local commands, and local checks. Save +the JSON result as non-secret evidence before Stage 2. + +## Stage 2 + +```bash +bin/railiance deploy --stage 2 /path/to/overlay --plan --pretty +bin/railiance observe --stage 2 /path/to/overlay --plan --pretty +``` + +When Helm, kubectl, cluster access, and approval evidence are ready: + +```bash +bin/railiance deploy --stage 2 /path/to/overlay --apply --approval-id +bin/railiance observe --stage 2 /path/to/overlay --live --pretty +``` + +For critical workloads, Stage 2 apply must not run until the operator has +approved canary exposure and rollback context is known. 
+ +## Stage 3 + +```bash +bin/railiance promote /path/to/overlay --plan --pretty +bin/railiance rollback /path/to/overlay --plan --pretty +``` + +Promotion plan mode emits a `railiance.stage3-promote-result.v1` JSON result +with stable release identity, chart and values paths, previous-stable target, +expected evidence, and approval requirements. + +Rollback plan mode emits a `railiance.stage3-rollback-result.v1` JSON result +with rollback strategy, release identity, verification text, and apply-time +requirements. + +When approval evidence and Helm access are ready: + +```bash +bin/railiance promote /path/to/overlay --apply --approval-id +bin/railiance rollback /path/to/overlay --apply --approval-id --revision +``` + +Stage 3 apply fails closed if the chart or values are missing, previous stable +is not recorded, Helm is unavailable, or approval evidence is missing. Rollback +apply fails closed if the rollback strategy is missing, Helm is unavailable, +approval evidence is missing, or a Helm revision is required but absent. + +## Human Approval Points + +Critical infrastructure workloads require explicit operator approval before: + +- Stage 2 canary exposure; +- Stage 3 stable promotion; +- rollback apply, unless an incident runbook defines a narrower break-glass + process and records the evidence id. + +Progress notes should include only non-secret result summaries: schema version, +status, release, namespace, approval id, check counts, and command byte counts. +Do not paste command logs, kubeconfigs, tokens, or private service output. diff --git a/tools/README_tools.md b/tools/README_tools.md index b57e6e9..512082e 100644 --- a/tools/README_tools.md +++ b/tools/README_tools.md @@ -60,6 +60,11 @@ This model emphasizes: --- + +### `railiance-stage3` +- Backs `bin/railiance promote` and `bin/railiance rollback`. +- Emits non-secret JSON plans/results for stable promotion and rollback. 
+ ### `railiance-stage2` - Backs `bin/railiance deploy --stage 2` and `bin/railiance observe --stage 2`. - Emits non-secret JSON plans/results for canary deployment and observation. diff --git a/tools/cmd/railiance-stage3 b/tools/cmd/railiance-stage3 new file mode 100755 index 0000000..0aa76e8 --- /dev/null +++ b/tools/cmd/railiance-stage3 @@ -0,0 +1,377 @@ +#!/usr/bin/env python3 +"""Railiance Stage 3 promote and rollback tooling.""" + +from __future__ import annotations + +import argparse +import json +import shutil +import subprocess +import sys +import time +import tomllib +from datetime import UTC, datetime +from pathlib import Path +from typing import Any + +SUPPORTED_SCHEMA = "railiance.app.v1" + + +def utc_now() -> str: + return datetime.now(UTC).replace(microsecond=0).isoformat().replace("+00:00", "Z") + + +def load_contract(app_dir: Path) -> tuple[Path, dict[str, Any]]: + contract_path = app_dir / "railiance" / "app.toml" + if not contract_path.exists(): + raise SystemExit(f"Missing Railiance contract: {contract_path}") + with contract_path.open("rb") as handle: + data = tomllib.load(handle) + if data.get("schema_version") != SUPPORTED_SCHEMA: + raise SystemExit( + f"Unsupported schema_version {data.get('schema_version')!r}; expected {SUPPORTED_SCHEMA}" + ) + return contract_path, data + + +def app_identity(data: dict[str, Any]) -> dict[str, Any]: + app = data.get("app", {}) + source = data.get("source", {}) + return { + "app": { + "id": app.get("id"), + "name": app.get("name"), + "repo": app.get("repo"), + "owner": app.get("owner"), + "criticality": app.get("criticality"), + }, + "source": { + "revision": source.get("revision"), + "artifact": source.get("artifact"), + "digest_policy": source.get("digest_policy"), + }, + } + + +def checks_by_id(data: dict[str, Any]) -> dict[str, dict[str, Any]]: + return {check.get("id"): check for check in data.get("checks", [])} + + +def stage_checks(data: dict[str, Any], stage_name: str) -> list[dict[str, Any]]: + 
stage = data.get("stages", {}).get(stage_name, {}) + lookup = checks_by_id(data) + return [lookup[item] for item in stage.get("checks", []) if item in lookup] + + +def stage2_helm_check(data: dict[str, Any]) -> dict[str, Any] | None: + for check in stage_checks(data, "stage2"): + if check.get("type") == "helm": + return check + return None + + +def precheck(name: str, status: str, required: bool, detail: str | None = None) -> dict[str, Any]: + item: dict[str, Any] = {"name": name, "status": status, "required": required} + if detail: + item["detail"] = detail + return item + + +def required_failures(items: list[dict[str, Any]]) -> list[dict[str, Any]]: + return [item for item in items if item.get("required", True) and item.get("status") != "passed"] + + +def run_command(args: list[str], cwd: Path, timeout: int, command_ref: str) -> dict[str, Any]: + started = time.monotonic() + try: + completed = subprocess.run( + args, + cwd=cwd, + text=True, + capture_output=True, + timeout=timeout, + check=False, + ) + return { + "command_ref": command_ref, + "status": "passed" if completed.returncode == 0 else "failed", + "exit_code": completed.returncode, + "duration_seconds": round(time.monotonic() - started, 3), + "stdout_bytes": len(completed.stdout.encode()), + "stderr_bytes": len(completed.stderr.encode()), + } + except subprocess.TimeoutExpired as exc: + stdout = exc.stdout if isinstance(exc.stdout, str) else "" + stderr = exc.stderr if isinstance(exc.stderr, str) else "" + return { + "command_ref": command_ref, + "status": "failed", + "exit_code": None, + "duration_seconds": round(time.monotonic() - started, 3), + "error": f"timeout after {timeout}s", + "stdout_bytes": len(stdout.encode()), + "stderr_bytes": len(stderr.encode()), + } + + +def stage3_context(app_dir: Path, contract_path: Path, data: dict[str, Any]) -> dict[str, Any]: + stage = data.get("stages", {}).get("stage3", {}) + if not stage.get("enabled", False): + raise SystemExit("Stage 3 is disabled in 
railiance/app.toml") + app = data.get("app", {}) + helm = stage2_helm_check(data) or {} + chart = app_dir / str(helm.get("chart", f"charts/{app.get('id', 'app')}")) + values = app_dir / "values" / "stage3-production.yaml" + release = str(stage.get("release", app.get("id", "app"))) + namespace = str(stage.get("namespace", app.get("id", "default"))) + context = { + "contract": str(contract_path), + "app_dir": str(app_dir), + "stage": "stage3", + "namespace": namespace, + "release": release, + "chart": str(chart), + "values": str(values), + "promotion_mode": stage.get("promotion_mode"), + "previous_stable": stage.get("previous_stable"), + "requires_approval": bool(stage.get("requires_approval", False)), + "evidence_expected": list(stage.get("evidence", [])), + "checks_expected": list(stage.get("checks", [])), + } + context.update(app_identity(data)) + return context + + +def rollback_context(app_dir: Path, contract_path: Path, data: dict[str, Any]) -> dict[str, Any]: + context = stage3_context(app_dir, contract_path, data) + rollback = data.get("rollback", {}) + context["rollback"] = { + "strategy": rollback.get("strategy"), + "command_ref": "rollback.command", + "verification": rollback.get("verification"), + } + return context + + +def promote_prechecks(app_dir: Path, context: dict[str, Any], mode: str, approval_id: str | None) -> list[dict[str, Any]]: + checks = [precheck("app.toml", "passed", True)] + chart = Path(context["chart"]) + values = Path(context["values"]) + checks.append(precheck("stage3-chart", "passed" if chart.exists() else "failed", True, str(chart))) + checks.append(precheck("stage3-values", "passed" if values.exists() else "failed", True, str(values))) + checks.append( + precheck( + "previous-stable", + "passed" if context.get("previous_stable") else "failed", + True, + "Stage 3 must record the rollback target before promotion", + ) + ) + if mode == "apply": + checks.append(precheck("helm", "passed" if shutil.which("helm") else "failed", True, 
"helm executable")) + else: + checks.append(precheck("helm", "not_required", False, "plan mode does not execute helm")) + if mode == "apply" and context.get("requires_approval"): + checks.append( + precheck( + "approval-id", + "passed" if approval_id else "failed", + True, + "Stage 3 requires approval before stable promotion", + ) + ) + elif context.get("requires_approval"): + checks.append(precheck("approval-id", "required_before_apply", False)) + return checks + + +def rollback_prechecks(context: dict[str, Any], mode: str, approval_id: str | None, revision: str | None) -> list[dict[str, Any]]: + checks = [precheck("app.toml", "passed", True)] + strategy = context.get("rollback", {}).get("strategy") + checks.append(precheck("rollback-strategy", "passed" if strategy else "failed", True, str(strategy or ""))) + if mode == "apply": + checks.append(precheck("helm", "passed" if shutil.which("helm") else "failed", True, "helm executable")) + checks.append( + precheck( + "approval-id", + "passed" if approval_id else "failed", + True, + "Rollback apply requires approval or incident evidence", + ) + ) + if strategy == "helm-revision": + checks.append(precheck("helm-revision", "passed" if revision else "failed", True)) + else: + checks.append(precheck("helm", "not_required", False, "plan mode does not execute helm")) + checks.append(precheck("approval-id", "required_before_apply", False)) + if strategy == "helm-revision": + checks.append(precheck("helm-revision", "required_before_apply", False)) + return checks + + +def promote_args(context: dict[str, Any], timeout: int) -> list[str]: + return [ + "helm", + "upgrade", + "--install", + context["release"], + context["chart"], + "--namespace", + context["namespace"], + "--create-namespace", + "-f", + context["values"], + "--atomic", + "--wait", + "--timeout", + f"{timeout}m", + ] + + +def rollback_args(context: dict[str, Any], revision: str, timeout: int) -> list[str]: + return [ + "helm", + "rollback", + context["release"], + 
revision, + "--namespace", + context["namespace"], + "--wait", + "--timeout", + f"{timeout}m", + ] + + +def promote(argv: list[str]) -> int: + parser = argparse.ArgumentParser(description="Plan or apply a Stage 3 stable promotion.") + parser.add_argument("app_dir", nargs="?", default=".") + parser.add_argument("--mode", choices=["plan", "apply"], default="plan") + parser.add_argument("--plan", action="store_const", const="plan", dest="mode") + parser.add_argument("--apply", action="store_const", const="apply", dest="mode") + parser.add_argument("--approval-id") + parser.add_argument("--timeout-minutes", type=int, default=10) + parser.add_argument("--json-out") + parser.add_argument("--pretty", action="store_true") + args = parser.parse_args(argv) + + app_dir = Path(args.app_dir).resolve() + contract_path, data = load_contract(app_dir) + context = stage3_context(app_dir, contract_path, data) + checks = promote_prechecks(app_dir, context, args.mode, args.approval_id) + failures = required_failures(checks) + actions: list[dict[str, Any]] = [] + status = "planned" if not failures else "blocked" + if args.mode == "apply" and not failures: + action = run_command(promote_args(context, args.timeout_minutes), app_dir, args.timeout_minutes * 60, "stage3.helm-promote") + actions.append(action) + status = "applied" if action.get("status") == "passed" else "failed" + result: dict[str, Any] = { + "schema_version": "railiance.stage3-promote-result.v1", + "status": status, + "mode": args.mode, + "generated_at": utc_now(), + **context, + "approval_id": args.approval_id, + "prechecks": checks, + "actions": actions, + "planned_actions": [ + { + "action_ref": "stage3.helm-promote", + "tool": "helm", + "release": context["release"], + "namespace": context["namespace"], + "chart": context["chart"], + "values": context["values"], + } + ], + "summary": { + "required_prechecks_failed": len(failures), + "actions_total": len(actions), + "actions_failed": len([item for item in actions if 
item.get("status") != "passed"]), + }, + } + return emit(result, args.json_out, args.pretty, {"planned", "applied"}) + + +def rollback(argv: list[str]) -> int: + parser = argparse.ArgumentParser(description="Plan or apply a rollback to the previous stable release.") + parser.add_argument("app_dir", nargs="?", default=".") + parser.add_argument("--mode", choices=["plan", "apply"], default="plan") + parser.add_argument("--plan", action="store_const", const="plan", dest="mode") + parser.add_argument("--apply", action="store_const", const="apply", dest="mode") + parser.add_argument("--approval-id") + parser.add_argument("--revision", help="Helm revision to roll back to for helm-revision strategy.") + parser.add_argument("--timeout-minutes", type=int, default=10) + parser.add_argument("--json-out") + parser.add_argument("--pretty", action="store_true") + args = parser.parse_args(argv) + + app_dir = Path(args.app_dir).resolve() + contract_path, data = load_contract(app_dir) + context = rollback_context(app_dir, contract_path, data) + checks = rollback_prechecks(context, args.mode, args.approval_id, args.revision) + failures = required_failures(checks) + actions: list[dict[str, Any]] = [] + status = "planned" if not failures else "blocked" + if args.mode == "apply" and not failures: + action = run_command( + rollback_args(context, args.revision or "0", args.timeout_minutes), # helm treats revision 0 as "previous release"; avoids passing the literal string "None" + app_dir, + args.timeout_minutes * 60, + "stage3.helm-rollback", + ) + actions.append(action) + status = "applied" if action.get("status") == "passed" else "failed" + result: dict[str, Any] = { + "schema_version": "railiance.stage3-rollback-result.v1", + "status": status, + "mode": args.mode, + "generated_at": utc_now(), + **context, + "approval_id": args.approval_id, + "revision": args.revision, + "prechecks": checks, + "actions": actions, + "planned_actions": [ + { + "action_ref": "stage3.helm-rollback", + "tool": "helm", + "release": context["release"], + "namespace": context["namespace"], +
"revision": args.revision, + } + ], + "summary": { + "required_prechecks_failed": len(failures), + "actions_total": len(actions), + "actions_failed": len([item for item in actions if item.get("status") != "passed"]), + }, + } + return emit(result, args.json_out, args.pretty, {"planned", "applied"}) + + +def emit(result: dict[str, Any], json_out: str | None, pretty: bool, success_statuses: set[str]) -> int: + rendered = json.dumps(result, indent=2 if pretty else None, sort_keys=True) + print(rendered) + if json_out: + output = Path(json_out) + output.parent.mkdir(parents=True, exist_ok=True) + output.write_text(rendered + "\n", encoding="utf-8") + return 0 if result["status"] in success_statuses else 1 + + +def main(argv: list[str]) -> int: + if not argv: + print("Usage: railiance-stage3 [args]", file=sys.stderr) + return 2 + command = argv[0] + if command == "promote": + return promote(argv[1:]) + if command == "rollback": + return rollback(argv[1:]) + print(f"Unknown Stage 3 command: {command}", file=sys.stderr) + return 2 + + +if __name__ == "__main__": + raise SystemExit(main(sys.argv[1:])) diff --git a/tools/create_railiance_overlay_repo.sh b/tools/create_railiance_overlay_repo.sh index ffc65f9..8a44e30 100755 --- a/tools/create_railiance_overlay_repo.sh +++ b/tools/create_railiance_overlay_repo.sh @@ -152,7 +152,7 @@ digest_policy = "preferred" [rollback] strategy = "helm-revision" -command = "bin/railiance rollback ${APP_ID}" +command = "railiance rollback . --plan" verification = "Stable release health check returns 200 after rollback." [platform] @@ -197,7 +197,7 @@ observation_minutes = 30 enabled = true namespace = "${APP_ID}" release = "${APP_ID}" -commands = ["bin/railiance promote ${APP_ID}", "bin/railiance observe ${APP_ID}"] +commands = ["railiance promote . --plan", "railiance rollback . 
--plan"] checks = ["stage2-accepted", "rollback-target", "cluster-health"] evidence = ["promotion command id", "new stable digest", "post-promotion smoke"] requires_approval = true @@ -748,7 +748,9 @@ change \`railiance.traffic.mode\` to \`weighted\`, set \`provider: traefik\`, and choose explicit stable/canary weights in \`values/stage2-canary.yaml\`. Before Stage 2 apply, fill in real image repositories, platform dependencies, -observability endpoints, rollback target details, and approval evidence. +observability endpoints, rollback target details, and approval evidence. Before +Stage 3, run \`railiance promote . --plan\` and \`railiance rollback . --plan\` +so stable promotion and rollback evidence can be reviewed together. EOF cat > "${OUT_DIR}/.gitignore" <<'EOF' diff --git a/workplans/RAIL-BS-WP-0006-staged-promotion-lifecycle.md b/workplans/RAIL-BS-WP-0006-staged-promotion-lifecycle.md index 488f971..96759d4 100644 --- a/workplans/RAIL-BS-WP-0006-staged-promotion-lifecycle.md +++ b/workplans/RAIL-BS-WP-0006-staged-promotion-lifecycle.md @@ -224,7 +224,7 @@ generated overlays to declare the repeatable Stage 2 plan commands. ```task id: RAIL-BS-WP-0006-T07 -status: todo +status: done priority: medium state_hub_task_id: "476198f6-0049-4ac4-9593-6723c86c9602" ``` @@ -242,6 +242,15 @@ Expected output: **Done when:** a representative app can move Stage 1 -> Stage 2 -> Stage 3 and back through rollback using documented commands. +2026-06-27: Added `tools/cmd/railiance-stage3` and dispatcher entries for +`bin/railiance promote` and `bin/railiance rollback`. Both commands default to +non-mutating JSON plans, apply modes require approval evidence and Helm, and +rollback apply also requires a Helm revision for `helm-revision` strategy. +Added `docs/promote-rollback-onboarding.md` with the representative Stage 1 -> +Stage 2 -> Stage 3 -> rollback path and explicit human approval points for +critical workloads. 
Updated generated overlays to declare promote/rollback plan +commands. + ## Dependencies This workplan should be done before the Forgejo production cutover. It can run