Add Railiance Stage 2 deploy observe tooling
Some checks failed
railiance-tests / smoke (push) Has been cancelled

2026-06-27 16:51:02 +02:00
parent 11ceeed03c
commit 9a463e0749
9 changed files with 529 additions and 20 deletions


@@ -18,6 +18,8 @@ Commands:
init-repo Idempotently furnish repo housekeeping
create-overlay Scaffold a Railiance overlay repo for an upstream app
run Run Stage 1 local validation from railiance/app.toml
deploy Plan/apply Stage 2 canary deployment
observe Plan/run Stage 2 observation checks
build-spore Build a distributable "Spore" bundle
seed-local Run the seed script on this machine
checklist Pre-VM checklist
@@ -43,6 +45,8 @@ case "$cmd" in
init-repo) bash "$ROOT/tools/furnish_railiance_repo.sh" ;;
create-overlay) bash "$ROOT/tools/create_railiance_overlay_repo.sh" "$@" ;;
run) exec railiance-run "$@" ;;
deploy) exec railiance-stage2 deploy "$@" ;;
observe) exec railiance-stage2 observe "$@" ;;
build-spore) bash "$ROOT/tools/build_spore.sh" ;;
seed-local) bash "$ROOT/tools/seed_node.sh" ;;
checklist)


@@ -77,6 +77,7 @@ From two bare Linux servers, a Git repo, and valid credentials, you can rebuild
- [Railiance app.toml contract](app-toml-contract.md)
- [Railiance overlay repo pattern](overlay-repo-pattern.md)
- [Canary Helm template](canary-helm-template.md)
- [Stage 2 deploy and observe](stage2-deploy-observe.md)
- [Railiance run command](railiance-run-command.md)
## 👥 Contributing


@@ -186,16 +186,17 @@ records only the route, target object, and pass/fail state.
## Command Semantics
-Commands in `app.toml` are declarations for future tooling. Until T04-T07
-implement the CLI, they may point to existing scripts or runbook commands.
+Commands in `app.toml` are declarations for Railiance tooling. Stage 1 and
+Stage 2 commands now have local CLI support; Stage 3 commands may still point
+to existing scripts or runbook commands until T07 lands.
Expected mapping:
-- Stage 1 commands are consumed by `bin/railiance run <app>`.
-- Stage 2 commands are consumed by `bin/railiance deploy --stage 2 <app>` and
-`bin/railiance observe <app>`.
-- Stage 3 commands are consumed by `bin/railiance promote <app>` and
-`bin/railiance rollback <app>`.
+- Stage 1 commands are consumed by `bin/railiance run <overlay-dir>`.
+- Stage 2 commands are consumed by `bin/railiance deploy --stage 2 <overlay-dir>`
+and `bin/railiance observe --stage 2 <overlay-dir>`.
+- Stage 3 commands are consumed by future `bin/railiance promote <overlay-dir>`
+and `bin/railiance rollback <overlay-dir>` commands.
Tooling must emit machine-readable results with workload identity, candidate
revision, checks run, pass/fail status, non-secret evidence, rollback target,


@@ -320,11 +320,11 @@ must not cut over to Stage 3.
Future CLI tasks should make these lifecycle operations repeatable:
```text
-bin/railiance run <app> # Stage 1 local validation
-bin/railiance deploy --stage 2 <app> # Stage 2 canary deployment
-bin/railiance observe <app> # Stage 2/3 evidence collection
-bin/railiance promote <app> # Stage 3 production promotion
-bin/railiance rollback <app> # rollback to previous stable
+bin/railiance run <overlay-dir> # Stage 1 local validation
+bin/railiance deploy --stage 2 <overlay-dir> --plan # Stage 2 canary plan
+bin/railiance observe --stage 2 <overlay-dir> --plan # Stage 2 evidence targets
+bin/railiance promote <overlay-dir> # Stage 3 production promotion
+bin/railiance rollback <overlay-dir> # rollback to previous stable
```
The exact command names may change as implementation lands, but the behavior


@@ -0,0 +1,49 @@
# Stage 2 Deploy And Observe
`bin/railiance deploy --stage 2` and `bin/railiance observe --stage 2` provide
the repeatable command path for production canaries declared in
`railiance/app.toml`.
Both commands default to non-mutating plan mode.
## Deploy
```bash
bin/railiance deploy --stage 2 /path/to/overlay --pretty
bin/railiance deploy --stage 2 /path/to/overlay --server-dry-run --pretty
bin/railiance deploy --stage 2 /path/to/overlay --apply --approval-id <state-hub-id>
```
Plan mode validates the local Stage 2 chart and values paths and emits a
`railiance.stage2-deploy-result.v1` JSON plan. It does not contact the cluster.
`--server-dry-run` runs `helm upgrade --install --dry-run=server` when Helm and
cluster access are available. `--apply` runs the Helm canary apply path with
`--atomic --wait`. If Stage 2 declares `requires_approval = true`, apply mode
fails closed unless `--approval-id` is provided.
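
As a rough illustration of how the three modes differ, the sketch below builds the Helm argument list for each mode. The flags are the ones named in this section and in `tools/cmd/railiance-stage2`; the function name `build_helm_args` and the sample release, chart, namespace, and values names are hypothetical and not part of the tool's interface.

```python
def build_helm_args(release, chart, namespace, values, mode, timeout_minutes=10):
    """Return the Helm argv for a Stage 2 mode (illustrative sketch only)."""
    if mode == "plan":
        # Plan mode never invokes Helm, so there is nothing to run.
        return []
    args = [
        "helm", "upgrade", "--install", release, chart,
        "--namespace", namespace, "--create-namespace",
        "-f", values,
    ]
    if mode == "server-dry-run":
        # Render and validate against the cluster API without changing it.
        args += ["--dry-run=server", "--debug"]
    elif mode == "apply":
        # Roll back automatically if the release does not become ready in time.
        args += ["--atomic", "--wait", "--timeout", f"{timeout_minutes}m"]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return args


# Hypothetical sample values, for demonstration only.
print(build_helm_args("demo-canary", "charts/demo", "demo",
                      "values/stage2-canary.yaml", "apply"))
```

Keeping plan mode free of any Helm invocation is what lets the default command run safely on a machine without cluster access.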
The result records release identity, namespace, chart path, values path,
expected checks/evidence, precheck status, and the byte counts of each
command's stdout and stderr. It does not embed Helm or kubectl logs.
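
A minimal sketch of how a consumer might read such a result and list the prechecks that block it. The field names (`schema_version`, `status`, `prechecks`, `name`, `required`) follow the script in this commit; the sample payload is made up and abbreviated, and `blocking_prechecks` is a hypothetical helper name.

```python
import json

# Hypothetical, abbreviated result; real output carries more fields.
SAMPLE = """{
  "schema_version": "railiance.stage2-deploy-result.v1",
  "status": "blocked",
  "mode": "apply",
  "prechecks": [
    {"name": "stage2-chart", "status": "passed", "required": true},
    {"name": "helm", "status": "passed", "required": true},
    {"name": "approval-id", "status": "failed", "required": true,
     "detail": "Stage 2 requires approval before canary exposure"},
    {"name": "stage1-result", "status": "recommended_before_apply", "required": false}
  ]
}"""


def blocking_prechecks(result):
    """Return the names of required prechecks that did not pass."""
    if result.get("schema_version") != "railiance.stage2-deploy-result.v1":
        raise ValueError("unexpected schema_version")
    return [
        item["name"]
        for item in result.get("prechecks", [])
        if item.get("required", True) and item.get("status") != "passed"
    ]


result = json.loads(SAMPLE)
print(result["status"], blocking_prechecks(result))
```

Non-required entries such as `recommended_before_apply` are advisory, so the helper ignores them in the same way the tool's own failure count does.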
## Observe
```bash
bin/railiance observe --stage 2 /path/to/overlay --pretty
bin/railiance observe --stage 2 /path/to/overlay --live --pretty
```
Plan mode emits the rollout, pod selector, ingress selector, health URL, and
metrics targets that live observation will query.
Live mode uses `kubectl` to check rollout status, deployment JSON, canary pods,
ingress/routing resources, and pod metrics when metrics-server is available.
When the metrics query does not succeed, that action is marked `unavailable`
rather than `failed` and counted separately, so missing observability stays
visible without being confused with rollout or readiness failures.
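
The separation between failed checks and unavailable metrics can be sketched as follows. The status values and `command_ref` names follow the script in this commit; `summarize_actions` and the sample action list are hypothetical and only illustrate the counting.

```python
def summarize_actions(actions):
    """Count hard failures and unavailable metrics separately (sketch only)."""
    failed = [a["command_ref"] for a in actions if a.get("status") == "failed"]
    unavailable = [a["command_ref"] for a in actions if a.get("status") == "unavailable"]
    return {
        # Only hard failures decide the overall status.
        "status": "failed" if failed else "passed",
        "actions_failed": len(failed),
        "metrics_unavailable": len(unavailable),
    }


# Hypothetical live-observe outcome on a cluster without metrics-server.
actions = [
    {"command_ref": "stage2.rollout-status", "status": "passed"},
    {"command_ref": "stage2.pods-json", "status": "passed"},
    {"command_ref": "stage2.ingress-json", "status": "passed"},
    {"command_ref": "stage2.metrics", "status": "unavailable", "optional": True},
]
print(summarize_actions(actions))
```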
## Safety
Stage 2 remains blocked when required local paths are missing, Helm is missing
for dry-run/apply, `kubectl` is missing for live observe, or approval evidence
is missing for an apply that requires approval. Blocked and failed results exit
with a non-zero status. Use the emitted JSON as non-secret evidence in State
Hub progress notes.
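
The fail-closed approval rule can be summarized in a few lines. This is a sketch of the decision only, modeled on the precheck statuses the script emits; `approval_precheck` is a hypothetical name and the approval id in the example is a placeholder.

```python
def approval_precheck(requires_approval, mode, approval_id=None):
    """Return the approval precheck entry for a deploy mode (sketch only)."""
    if mode == "apply" and requires_approval:
        # Fail closed: an apply without approval evidence is a required failure.
        status = "passed" if approval_id else "failed"
        return {"name": "approval-id", "status": status, "required": True}
    if requires_approval:
        # Plan and dry-run may proceed, but the result flags the pending gate.
        return {"name": "approval-id", "status": "required_before_apply", "required": False}
    return {"name": "approval-id", "status": "not_required", "required": False}


print(approval_precheck(True, "apply"))
print(approval_precheck(True, "apply", approval_id="example-progress-id"))
print(approval_precheck(True, "plan"))
```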


@@ -59,6 +59,11 @@ This model emphasizes:
---
### `railiance-stage2`
- Backs `bin/railiance deploy --stage 2` and `bin/railiance observe --stage 2`.
- Emits non-secret JSON plans/results for canary deployment and observation.
### `railiance-run`
- Executes Stage 1 local validation from `railiance/app.toml`.
- Emits a `railiance.run-result.v1` JSON result without command logs or secrets.

tools/cmd/railiance-stage2 Executable file

@@ -0,0 +1,439 @@
#!/usr/bin/env python3
"""Railiance Stage 2 deploy and observe tooling."""

from __future__ import annotations

import argparse
import json
import shutil
import subprocess
import sys
import time
import tomllib
import urllib.parse
import urllib.request
import urllib.error
from datetime import UTC, datetime
from pathlib import Path
from typing import Any

SUPPORTED_SCHEMA = "railiance.app.v1"


def utc_now() -> str:
    return datetime.now(UTC).replace(microsecond=0).isoformat().replace("+00:00", "Z")


def scrub_url(url: str) -> str:
    try:
        parts = urllib.parse.urlsplit(url)
    except ValueError:
        return "<invalid-url>"
    netloc = parts.netloc.rsplit("@", 1)[-1]
    return urllib.parse.urlunsplit((parts.scheme, netloc, parts.path, "", ""))


def load_contract(app_dir: Path) -> tuple[Path, dict[str, Any]]:
    contract_path = app_dir / "railiance" / "app.toml"
    if not contract_path.exists():
        raise SystemExit(f"Missing Railiance contract: {contract_path}")
    with contract_path.open("rb") as handle:
        data = tomllib.load(handle)
    if data.get("schema_version") != SUPPORTED_SCHEMA:
        raise SystemExit(
            f"Unsupported schema_version {data.get('schema_version')!r}; expected {SUPPORTED_SCHEMA}"
        )
    return contract_path, data


def check_required(check: dict[str, Any]) -> bool:
    return bool(check.get("required", True))


def checks_by_id(data: dict[str, Any]) -> dict[str, dict[str, Any]]:
    return {check.get("id"): check for check in data.get("checks", [])}


def stage2_checks(data: dict[str, Any]) -> list[dict[str, Any]]:
    stage = data.get("stages", {}).get("stage2", {})
    lookup = checks_by_id(data)
    return [lookup[item] for item in stage.get("checks", []) if item in lookup]


def helm_check(data: dict[str, Any]) -> dict[str, Any] | None:
    for check in stage2_checks(data):
        if check.get("type") == "helm":
            return check
    return None


def kubernetes_check(data: dict[str, Any]) -> dict[str, Any] | None:
    for check in stage2_checks(data):
        if check.get("type") == "kubernetes":
            return check
    return None


def http_checks(data: dict[str, Any]) -> list[dict[str, Any]]:
    return [check for check in stage2_checks(data) if check.get("type") == "http"]


def precheck(name: str, status: str, required: bool, detail: str | None = None) -> dict[str, Any]:
    item: dict[str, Any] = {"name": name, "status": status, "required": required}
    if detail:
        item["detail"] = detail
    return item


def required_failures(items: list[dict[str, Any]]) -> list[dict[str, Any]]:
    return [item for item in items if item.get("required", True) and item.get("status") != "passed"]


def run_command(args: list[str], cwd: Path, timeout: int, command_ref: str) -> dict[str, Any]:
    started = time.monotonic()
    try:
        completed = subprocess.run(
            args,
            cwd=cwd,
            text=True,
            capture_output=True,
            timeout=timeout,
            check=False,
        )
        return {
            "command_ref": command_ref,
            "status": "passed" if completed.returncode == 0 else "failed",
            "exit_code": completed.returncode,
            "duration_seconds": round(time.monotonic() - started, 3),
            "stdout_bytes": len(completed.stdout.encode()),
            "stderr_bytes": len(completed.stderr.encode()),
        }
    except subprocess.TimeoutExpired as exc:
        stdout = exc.stdout if isinstance(exc.stdout, str) else ""
        stderr = exc.stderr if isinstance(exc.stderr, str) else ""
        return {
            "command_ref": command_ref,
            "status": "failed",
            "exit_code": None,
            "duration_seconds": round(time.monotonic() - started, 3),
            "error": f"timeout after {timeout}s",
            "stdout_bytes": len(stdout.encode()),
            "stderr_bytes": len(stderr.encode()),
        }


def app_identity(data: dict[str, Any]) -> dict[str, Any]:
    app = data.get("app", {})
    source = data.get("source", {})
    return {
        "app": {
            "id": app.get("id"),
            "name": app.get("name"),
            "repo": app.get("repo"),
            "owner": app.get("owner"),
            "criticality": app.get("criticality"),
        },
        "source": {
            "revision": source.get("revision"),
            "artifact": source.get("artifact"),
            "digest_policy": source.get("digest_policy"),
        },
    }


def stage2_context(app_dir: Path, contract_path: Path, data: dict[str, Any]) -> dict[str, Any]:
    stage = data.get("stages", {}).get("stage2", {})
    if not stage.get("enabled", False):
        raise SystemExit("Stage 2 is disabled in railiance/app.toml")
    helm = helm_check(data) or {}
    chart = app_dir / str(helm.get("chart", f"charts/{data.get('app', {}).get('id', 'app')}"))
    values = app_dir / str(helm.get("values", "values/stage2-canary.yaml"))
    release = str(stage.get("release", f"{data.get('app', {}).get('id', 'app')}-canary"))
    namespace = str(stage.get("namespace", data.get("app", {}).get("id", "default")))
    context = {
        "contract": str(contract_path),
        "app_dir": str(app_dir),
        "stage": "stage2",
        "namespace": namespace,
        "release": release,
        "canary_mode": stage.get("canary_mode"),
        "observation_minutes": stage.get("observation_minutes"),
        "requires_approval": bool(stage.get("requires_approval", False)),
        "chart": str(chart),
        "values": str(values),
        "evidence_expected": list(stage.get("evidence", [])),
        "checks_expected": list(stage.get("checks", [])),
    }
    context.update(app_identity(data))
    return context


def local_prechecks(app_dir: Path, data: dict[str, Any], mode: str, approval_id: str | None) -> list[dict[str, Any]]:
    stage = data.get("stages", {}).get("stage2", {})
    helm = helm_check(data)
    checks: list[dict[str, Any]] = []
    checks.append(precheck("app.toml", "passed", True))
    if helm is None:
        checks.append(precheck("stage2-helm-check", "failed", True, "no Stage 2 helm check declared"))
    else:
        # Use the same defaults as stage2_context so the precheck validates the
        # paths Helm will actually receive. An empty default would resolve to
        # app_dir itself, which always exists and would pass vacuously.
        app_id = data.get("app", {}).get("id", "app")
        chart = app_dir / str(helm.get("chart", f"charts/{app_id}"))
        values = app_dir / str(helm.get("values", "values/stage2-canary.yaml"))
        checks.append(precheck("stage2-chart", "passed" if chart.exists() else "failed", True, str(chart)))
        checks.append(precheck("stage2-values", "passed" if values.exists() else "failed", True, str(values)))
    if mode in {"server-dry-run", "apply"}:
        checks.append(
            precheck("helm", "passed" if shutil.which("helm") else "failed", True, "helm executable")
        )
    else:
        checks.append(precheck("helm", "not_required", False, "plan mode does not execute helm"))
    if mode == "apply" and stage.get("requires_approval", False):
        checks.append(
            precheck(
                "approval-id",
                "passed" if approval_id else "failed",
                True,
                "Stage 2 requires approval before canary exposure",
            )
        )
    elif stage.get("requires_approval", False):
        checks.append(precheck("approval-id", "required_before_apply", False))
    else:
        checks.append(precheck("approval-id", "not_required", False))
    return checks


def helm_args(context: dict[str, Any], mode: str, timeout: int) -> list[str]:
    args = [
        "helm",
        "upgrade",
        "--install",
        context["release"],
        context["chart"],
        "--namespace",
        context["namespace"],
        "--create-namespace",
        "-f",
        context["values"],
    ]
    if mode == "server-dry-run":
        args.extend(["--dry-run=server", "--debug"])
    if mode == "apply":
        args.extend(["--atomic", "--wait", "--timeout", f"{timeout}m"])
    return args


def deploy(argv: list[str]) -> int:
    parser = argparse.ArgumentParser(description="Plan or apply a Stage 2 Railiance canary.")
    parser.add_argument("app_dir", nargs="?", default=".")
    parser.add_argument("--stage", default="2", choices=["2", "stage2"])
    parser.add_argument("--mode", choices=["plan", "server-dry-run", "apply"], default="plan")
    parser.add_argument("--plan", action="store_const", const="plan", dest="mode")
    parser.add_argument("--apply", action="store_const", const="apply", dest="mode")
    parser.add_argument("--server-dry-run", action="store_const", const="server-dry-run", dest="mode")
    parser.add_argument("--approval-id", help="Operator approval/progress id required before apply when declared.")
    parser.add_argument("--stage1-result", help="Optional Stage 1 result JSON for same-candidate evidence.")
    parser.add_argument("--timeout-minutes", type=int, default=10)
    parser.add_argument("--json-out")
    parser.add_argument("--pretty", action="store_true")
    args = parser.parse_args(argv)

    app_dir = Path(args.app_dir).resolve()
    contract_path, data = load_contract(app_dir)
    context = stage2_context(app_dir, contract_path, data)
    checks = local_prechecks(app_dir, data, args.mode, args.approval_id)
    if args.stage1_result:
        try:
            stage1 = json.loads(Path(args.stage1_result).read_text(encoding="utf-8"))
            checks.append(
                precheck(
                    "stage1-result",
                    "passed" if stage1.get("status") == "passed" else "failed",
                    args.mode == "apply",
                    Path(args.stage1_result).name,
                )
            )
        except (OSError, json.JSONDecodeError) as exc:
            checks.append(precheck("stage1-result", "failed", args.mode == "apply", str(exc)))
    else:
        checks.append(precheck("stage1-result", "recommended_before_apply", False))

    actions: list[dict[str, Any]] = []
    failures = required_failures(checks)
    status = "planned" if args.mode == "plan" else "blocked"
    if not failures and args.mode in {"server-dry-run", "apply"}:
        action = run_command(
            helm_args(context, args.mode, args.timeout_minutes),
            app_dir,
            args.timeout_minutes * 60,
            "stage2.helm",
        )
        actions.append(action)
        # A failed Helm run always fails the result; otherwise the status
        # reflects whether anything was actually applied.
        if action.get("status") != "passed":
            status = "failed"
        elif args.mode == "server-dry-run":
            status = "passed"
        else:
            status = "applied"
    elif failures:
        status = "blocked"

    result: dict[str, Any] = {
        "schema_version": "railiance.stage2-deploy-result.v1",
        "status": status,
        "mode": args.mode,
        "generated_at": utc_now(),
        **context,
        "approval_id": args.approval_id,
        "prechecks": checks,
        "actions": actions,
        "planned_actions": [
            {
                "action_ref": "stage2.helm",
                "tool": "helm",
                "mode": args.mode,
                "release": context["release"],
                "namespace": context["namespace"],
                "chart": context["chart"],
                "values": context["values"],
            }
        ],
        "summary": {
            "required_prechecks_failed": len(failures),
            "actions_total": len(actions),
            "actions_failed": len([item for item in actions if item.get("status") != "passed"]),
        },
    }
    rendered = json.dumps(result, indent=2 if args.pretty else None, sort_keys=True)
    print(rendered)
    if args.json_out:
        output = Path(args.json_out)
        output.parent.mkdir(parents=True, exist_ok=True)
        output.write_text(rendered + "\n", encoding="utf-8")
    return 0 if result["status"] in {"planned", "passed", "applied"} else 1


def observation_targets(data: dict[str, Any], context: dict[str, Any]) -> dict[str, Any]:
    kube = kubernetes_check(data) or {}
    return {
        "rollout": kube.get("resource", f"deploy/{context['release']}"),
        "pod_selector": f"app.kubernetes.io/instance={context['release']}",
        "ingress_selector": f"app.kubernetes.io/instance={context['release']}",
        "health_urls": [scrub_url(str(check.get("url", ""))) for check in http_checks(data)],
        "metrics": {
            "tool": "kubectl top pods",
            "selector": f"app.kubernetes.io/instance={context['release']}",
        },
    }


def observe(argv: list[str]) -> int:
    parser = argparse.ArgumentParser(description="Plan or run Stage 2 Railiance observation checks.")
    parser.add_argument("app_dir", nargs="?", default=".")
    parser.add_argument("--stage", default="2", choices=["2", "stage2"])
    parser.add_argument("--mode", choices=["plan", "live"], default="plan")
    parser.add_argument("--plan", action="store_const", const="plan", dest="mode")
    parser.add_argument("--live", action="store_const", const="live", dest="mode")
    parser.add_argument("--timeout-seconds", type=int, default=120)
    parser.add_argument("--json-out")
    parser.add_argument("--pretty", action="store_true")
    args = parser.parse_args(argv)

    app_dir = Path(args.app_dir).resolve()
    contract_path, data = load_contract(app_dir)
    context = stage2_context(app_dir, contract_path, data)
    targets = observation_targets(data, context)
    checks = [precheck("app.toml", "passed", True)]
    if args.mode == "live":
        checks.append(
            precheck("kubectl", "passed" if shutil.which("kubectl") else "failed", True, "kubectl executable")
        )
    else:
        checks.append(precheck("kubectl", "not_required", False, "plan mode does not query cluster"))

    actions: list[dict[str, Any]] = []
    failures = required_failures(checks)
    status = "planned"
    if args.mode == "live" and not failures:
        ns = context["namespace"]
        rollout = str(targets["rollout"])
        actions.append(
            run_command(
                ["kubectl", "-n", ns, "rollout", "status", rollout, f"--timeout={args.timeout_seconds}s"],
                app_dir,
                args.timeout_seconds,
                "stage2.rollout-status",
            )
        )
        actions.append(
            run_command(
                ["kubectl", "-n", ns, "get", rollout, "-o", "json"],
                app_dir,
                args.timeout_seconds,
                "stage2.rollout-json",
            )
        )
        actions.append(
            run_command(
                ["kubectl", "-n", ns, "get", "pods", "-l", str(targets["pod_selector"]), "-o", "json"],
                app_dir,
                args.timeout_seconds,
                "stage2.pods-json",
            )
        )
        actions.append(
            run_command(
                ["kubectl", "-n", ns, "get", "ingress", "-l", str(targets["ingress_selector"]), "-o", "json"],
                app_dir,
                args.timeout_seconds,
                "stage2.ingress-json",
            )
        )
        metrics = run_command(
            ["kubectl", "-n", ns, "top", "pods", "-l", str(targets["pod_selector"]), "--no-headers"],
            app_dir,
            args.timeout_seconds,
            "stage2.metrics",
        )
        if metrics.get("status") != "passed":
            metrics["optional"] = True
            metrics["status"] = "unavailable"
        actions.append(metrics)
        status = "passed" if not [item for item in actions if item.get("status") == "failed"] else "failed"
    elif failures:
        status = "blocked"

    result: dict[str, Any] = {
        "schema_version": "railiance.stage2-observe-result.v1",
        "status": status,
        "mode": args.mode,
        "generated_at": utc_now(),
        **context,
        "targets": targets,
        "prechecks": checks,
        "actions": actions,
        "summary": {
            "required_prechecks_failed": len(failures),
            "actions_total": len(actions),
            "actions_failed": len([item for item in actions if item.get("status") == "failed"]),
            "metrics_unavailable": len([item for item in actions if item.get("status") == "unavailable"]),
        },
    }
    rendered = json.dumps(result, indent=2 if args.pretty else None, sort_keys=True)
    print(rendered)
    if args.json_out:
        output = Path(args.json_out)
        output.parent.mkdir(parents=True, exist_ok=True)
        output.write_text(rendered + "\n", encoding="utf-8")
    return 0 if result["status"] in {"planned", "passed"} else 1


def main(argv: list[str]) -> int:
    parser = argparse.ArgumentParser(description="Railiance Stage 2 tooling.")
    subparsers = parser.add_subparsers(dest="command", required=True)
    deploy_parser = subparsers.add_parser("deploy", help="Plan or apply a Stage 2 canary.")
    deploy_parser.add_argument("args", nargs=argparse.REMAINDER)
    observe_parser = subparsers.add_parser("observe", help="Plan or run Stage 2 observation.")
    observe_parser.add_argument("args", nargs=argparse.REMAINDER)
    parsed = parser.parse_args(argv[:1])
    if parsed.command == "deploy":
        return deploy(argv[1:])
    if parsed.command == "observe":
        return observe(argv[1:])
    return 2


if __name__ == "__main__":
    raise SystemExit(main(sys.argv[1:]))
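
To illustrate how health URLs are kept non-secret before they are written into the observe targets, the standalone sketch below copies `scrub_url` from the script above and runs it on made-up URLs. The hostnames, credentials, and token are placeholders, not values from any real overlay.

```python
import urllib.parse


def scrub_url(url: str) -> str:
    """Drop userinfo, query, and fragment so the URL is safe to record."""
    try:
        parts = urllib.parse.urlsplit(url)
    except ValueError:
        return "<invalid-url>"
    # Everything before the last "@" in the netloc is userinfo (user:password).
    netloc = parts.netloc.rsplit("@", 1)[-1]
    return urllib.parse.urlunsplit((parts.scheme, netloc, parts.path, "", ""))


# Placeholder URLs for demonstration only.
print(scrub_url("https://user:secret@canary.example.test/healthz?token=abc#frag"))
print(scrub_url("http://[::1/healthz"))  # unbalanced IPv6 bracket is rejected
```

Only the scheme, host, and path survive, which is why the observe plan can be pasted into progress notes without leaking credentials embedded in a health URL.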


@@ -186,7 +186,7 @@ requires_approval = false
enabled = true
namespace = "${APP_ID}"
release = "${APP_ID}-canary"
-commands = ["bin/railiance deploy --stage 2 ${APP_ID}", "bin/railiance observe ${APP_ID}"]
+commands = ["railiance deploy --stage 2 . --plan", "railiance observe --stage 2 . --plan"]
checks = ["server-dry-run", "canary-ready", "cluster-health"]
evidence = ["release name", "pod readiness", "health 200", "State Hub progress id"]
requires_approval = true
@@ -741,13 +741,14 @@ This overlay follows the Railiance three-stage lifecycle.
- Stage 2 deploys an isolated canary by default.
- Stage 3 replaces the stable release only after Stage 2 acceptance.
-Run \`tests/stage2-template.sh\` before the first Stage 2 attempt. To use
-weighted Traefik routing, change \`railiance.traffic.mode\` to \`weighted\`, set
-\`provider: traefik\`, and choose explicit stable/canary weights in
-\`values/stage2-canary.yaml\`.
+Run \`tests/stage2-template.sh\` before the first Stage 2 attempt, then run
+\`railiance deploy --stage 2 . --plan\` and
+\`railiance observe --stage 2 . --plan\`. To use weighted Traefik routing,
+change \`railiance.traffic.mode\` to \`weighted\`, set \`provider: traefik\`,
+and choose explicit stable/canary weights in \`values/stage2-canary.yaml\`.
-Before Stage 2, fill in real image repositories, platform dependencies,
-observability endpoints, and rollback target details.
+Before Stage 2 apply, fill in real image repositories, platform dependencies,
+observability endpoints, rollback target details, and approval evidence.
EOF
cat > "${OUT_DIR}/.gitignore" <<'EOF'


@@ -194,7 +194,7 @@ environment.
```task
id: RAIL-BS-WP-0006-T06
-status: todo
+status: done
priority: medium
state_hub_task_id: "6a5c7422-fcb1-49d1-8153-e891bd1c27fa"
```
@@ -209,6 +209,15 @@ Expected behavior:
**Done when:** Stage 2 can be run and observed from a repeatable command path.
2026-06-27: Added `tools/cmd/railiance-stage2` and dispatcher entries for
`bin/railiance deploy` and `bin/railiance observe`. Deploy emits a
`railiance.stage2-deploy-result.v1` plan by default, can run Helm server dry-run
or apply when tools and cluster access are present, and fails closed when
required paths, Helm, or approval evidence are missing. Observe emits a
`railiance.stage2-observe-result.v1` target plan by default and runs live
kubectl rollout, pod, ingress, and metrics checks only with `--live`. Updated
generated overlays to declare the repeatable Stage 2 plan commands.
---
### T07 - railiance promote, rollback, and onboarding guide