diff --git a/docs/README.md b/docs/README.md index 415bf43..bb6a067 100644 --- a/docs/README.md +++ b/docs/README.md @@ -74,6 +74,7 @@ From two bare Linux servers, a Git repo, and valid credentials, you can rebuild ## Operations - [Deployment lifecycle](deployment-lifecycle.md) +- [Railiance app.toml contract](app-toml-contract.md) ## 👥 Contributing diff --git a/docs/app-toml-contract.md b/docs/app-toml-contract.md new file mode 100644 index 0000000..82d72e4 --- /dev/null +++ b/docs/app-toml-contract.md @@ -0,0 +1,233 @@ +# Railiance app.toml Contract + +This document defines the repository-local `railiance/app.toml` contract used by +Railiance staged promotion tooling. The file tells Railiance how a workload +moves through Stage 1 local validation, Stage 2 production canary, and Stage 3 +production promotion without relying on bespoke operator notes. + +The contract is intentionally declarative. Commands, health checks, platform +dependencies, and secret references are described by stable names. Plaintext +secrets, bearer tokens, kubeconfigs, and private key material must never appear +in `railiance/app.toml`. + +The machine-readable schema lives at `schemas/railiance-app.schema.json`. A +minimal example lives at `examples/railiance/app.toml`. + +## File Location + +Participating workload repositories declare the contract at: + +```text +railiance/app.toml +``` + +Overlay repositories for third-party applications use the same path in the +overlay repo, not in the upstream source repository. + +## Versioning + +Every file must include: + +```toml +schema_version = "railiance.app.v1" +``` + +Breaking contract changes require a new schema version. Tooling must fail closed +when it sees an unsupported `schema_version`. + +## Top-Level Sections + +### app + +Identifies the workload and its ownership boundary. + +Required fields: + +- `id`: stable lowercase id using letters, numbers, and hyphens. +- `name`: human-readable workload name. 
+- `repo`: owning source or overlay repository slug. +- `owner`: owning team, domain, or operator group. +- `criticality`: one of `low`, `medium`, `high`, or `critical`. +- `description`: short purpose statement. + +Production-critical workloads include source forge, identity, State Hub, +Inter-Hub, databases, object stores, backup systems, ingress, and cluster-wide +policy controllers. For those workloads, `criticality = "critical"` requires +explicit human approval before Stage 2 traffic exposure and Stage 3 promotion. + +### source + +Identifies the candidate under promotion. + +Required fields: + +- `revision`: commit id, tag, or immutable source revision expression. +- `artifact`: artifact kind, one of `image`, `helm-chart`, `bundle`, `manifest`, or `other`. +- `digest_policy`: one of `required`, `preferred`, or `not-applicable`. + +If an image is promoted, Stage 2 and Stage 3 tooling should prefer immutable +image digests over mutable tags. + +### platform.dependencies + +Declares platform services required before canary or production promotion. + +Each dependency has: + +- `name`: stable service name. +- `kind`: dependency kind, one of `postgres`, `redis`, `object-store`, + `identity`, `state-hub`, `inter-hub`, `network`, or `other`. +- `required`: boolean. +- `stage`: earliest stage that needs it, one of `stage1`, `stage2`, `stage3`. +- `evidence`: non-secret evidence expected before promotion, such as a health + endpoint result, Kubernetes Ready condition, or State Hub progress id. + +### secrets.references + +Declares required secret references without secret values. + +Each reference has: + +- `name`: workload-local secret name. +- `route`: approved credential route id, for example `openbao-api-key`, + `key-cape-oidc-login`, or `activity-core-issue-sink`. +- `target`: non-secret target reference such as a Kubernetes Secret name, + ExternalSecret name, OpenBao path, or environment variable name. +- `stage`: earliest stage that needs the secret. +- `required`: boolean.
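The reference shape listed above can be checked mechanically. The following is a minimal sketch, not the Railiance tooling itself: the function name `secret_reference_problems` and the message strings are hypothetical, and it assumes the TOML has already been parsed into a plain dictionary. It treats any key outside the five listed fields as a problem, which mirrors the `additionalProperties: false` setting in the schema.

```python
# Hypothetical helper: field names come from the contract, everything else is illustrative.
ALLOWED_FIELDS = {"name": str, "route": str, "target": str, "stage": str, "required": bool}
STAGE_NAMES = {"stage1", "stage2", "stage3"}


def secret_reference_problems(ref: dict) -> list[str]:
    """Return human-readable problems for one parsed [[secrets.references]] entry."""
    problems = []
    # Any key outside the contract (for example `value` or `token`) is rejected.
    for key in ref:
        if key not in ALLOWED_FIELDS:
            problems.append(f"unexpected field: {key}")
    # Every listed field must be present, non-empty, and of the expected type.
    for key, expected_type in ALLOWED_FIELDS.items():
        if key not in ref:
            problems.append(f"missing field: {key}")
        elif not isinstance(ref[key], expected_type) or ref[key] == "":
            problems.append(f"invalid field: {key}")
    # `stage` must name one of the three lifecycle stages.
    stage = ref.get("stage")
    if isinstance(stage, str) and stage not in STAGE_NAMES:
        problems.append(f"invalid stage: {stage}")
    return problems
```

An empty list means the entry matches the field list; a caller would normally fail the whole validation run on the first non-empty result.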
+ +Forbidden fields include plaintext values, tokens, passwords, kubeconfigs, or +private keys. Tooling must reject suspicious field names such as `value`, +`token`, `password`, `secret`, `private_key`, or `kubeconfig` inside secret +reference objects; such words may appear only within the non-secret `target` text. + +### observability + +Defines how promotion tooling proves the workload is alive and observable. + +Fields: + +- `health_endpoints`: required; one or more HTTP health endpoint declarations. +- `metrics`: optional metrics endpoint or query references. +- `logs`: optional log selectors or query references. + +Health endpoint declarations include `name`, `url`, `stage`, and +`expected_status`. URLs may be internal service URLs for Stage 2/3; they must not +embed credentials. + +### rollback + +Defines how the workload returns to a previous stable state. + +Required fields: + +- `strategy`: one of `helm-revision`, `image-digest`, `traffic-shift`, + `manual-runbook`, or `none`. +- `command`: command name or runbook path. This may be a placeholder before + T07 implements automation, but it must tell the operator where rollback lives. +- `verification`: non-secret check to confirm rollback succeeded. + +`strategy = "none"` is allowed only for Stage 1-only workloads and must not be +used for production-critical workloads. + +## Stage Sections + +The contract has one table for each stage: + +```toml +[stages.stage1] +[stages.stage2] +[stages.stage3] +``` + +Each stage includes: + +- `enabled`: boolean. +- `namespace`: target Kubernetes namespace, or a local namespace for Stage 1. +- `release`: release identity. +- `commands`: ordered command aliases or shell commands that tooling may run. +- `checks`: ordered check ids to evaluate. +- `evidence`: expected non-secret evidence outputs. +- `requires_approval`: boolean.
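The common stage fields above lend themselves to a simple presence and type check. This is a rough sketch under stated assumptions: the helper name `stage_field_problems` is hypothetical, the input is the already-parsed `stages` table as a dictionary, and the list types for `commands`, `checks`, and `evidence` follow the schema in this change rather than any existing CLI behaviour.

```python
# Illustrative only: the field names are from the contract, the helper is an assumption.
COMMON_STAGE_FIELDS = {
    "enabled": bool,
    "namespace": str,
    "release": str,
    "commands": list,
    "checks": list,
    "evidence": list,
    "requires_approval": bool,
}
STAGE_NAMES = ("stage1", "stage2", "stage3")


def stage_field_problems(stages: dict) -> list[str]:
    """Report missing stage tables and missing or mistyped common fields."""
    problems = []
    for stage_name in STAGE_NAMES:
        table = stages.get(stage_name)
        if not isinstance(table, dict):
            problems.append(f"{stage_name}: table is missing")
            continue
        for key, expected_type in COMMON_STAGE_FIELDS.items():
            if key not in table:
                problems.append(f"{stage_name}.{key}: missing")
            elif not isinstance(table[key], expected_type):
                problems.append(f"{stage_name}.{key}: expected {expected_type.__name__}")
    return problems
```

The sketch deliberately covers only the fields shared by all three stages; stage-specific fields would need their own checks.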
+ +Stage 2 additionally includes `canary_mode`, one of `weighted`, `header`, +`path`, `shadow`, or `isolated`, plus `observation_minutes` and optional +`traffic_percent` when weighted routing is used. + +Stage 3 additionally includes `promotion_mode`, one of `traffic-shift`, +`release-replace`, `selector-switch`, or `workflow`, plus `previous_stable`. + +## Check Definitions + +Checks live under `[[checks]]` entries and are referenced by stage `checks`. + +Required fields: + +- `id`: stable check id. +- `type`: one of `command`, `http`, `kubernetes`, `helm`, `metric`, `log`, or + `manual`. +- `stage`: earliest stage that may run the check. +- `description`: human-readable purpose. +- `required`: boolean. + +Type-specific fields: + +- `command`: `run` command string and optional `timeout_seconds`. +- `http`: `url`, `expected_status`, and optional `timeout_seconds`. +- `kubernetes`: `namespace`, `resource`, and `condition`. +- `helm`: `chart`, `values`, and `mode` such as `template` or + `server-dry-run`. +- `metric`: `query`, `window_minutes`, and `threshold`. +- `log`: `selector`, `window_minutes`, and `forbidden_patterns`. +- `manual`: `evidence_required` text. + +Checks must not print secrets. If a check needs secret-backed access, the result +records only the route, target object, and pass/fail state. + +## Command Semantics + +Commands in `app.toml` are declarations for future tooling. Until T04-T07 +implement the CLI, they may point to existing scripts or runbook commands. + +Expected mapping: + +- Stage 1 commands are consumed by `bin/railiance run `. +- Stage 2 commands are consumed by `bin/railiance deploy --stage 2 ` and + `bin/railiance observe `. +- Stage 3 commands are consumed by `bin/railiance promote ` and + `bin/railiance rollback `. + +Tooling must emit machine-readable results with workload identity, candidate +revision, checks run, pass/fail status, non-secret evidence, rollback target, +and approval state. 
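As an illustration of the machine-readable result described above, the sketch below models one stage result and serializes it to JSON. The class names, field names, and the `status` key are assumptions made for this example; the contract only lists what the result must contain, not its exact format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CheckResult:
    """Outcome of one check; carries only non-secret evidence."""
    check_id: str
    passed: bool
    evidence: str = ""


@dataclass
class StageResult:
    """Hypothetical result record; not the format emitted by bin/railiance."""
    workload_id: str
    candidate_revision: str
    stage: str
    checks: list[CheckResult] = field(default_factory=list)
    rollback_target: str = ""
    approval_state: str = "not-required"

    @property
    def passed(self) -> bool:
        # Fail closed: a stage with no recorded checks does not pass.
        return bool(self.checks) and all(c.passed for c in self.checks)

    def to_json(self) -> str:
        # asdict() converts nested CheckResult entries into plain dictionaries.
        payload = asdict(self)
        payload["status"] = "pass" if self.passed else "fail"
        return json.dumps(payload, sort_keys=True)
```

Treating an empty check list as a failure is a design choice in this sketch, made to match the fail-closed stance the contract takes on unsupported schema versions.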
+ +## Minimal Example + +See `examples/railiance/app.toml`. It declares a critical internal service with: + +- immutable image digest requirement; +- Stage 1 local validation; +- Stage 2 isolated canary; +- Stage 3 release replacement; +- OpenBao-routed secret references without values; +- HTTP, Helm, Kubernetes, and manual approval checks. + +## Adoption Rules + +A workload can enter Stage 1 when `app.toml` passes schema validation and all +Stage 1 required checks are declared. + +A workload can enter Stage 2 only when: + +- Stage 1 passed for the same candidate artifact; +- Stage 2 namespace, release, canary mode, health checks, dependencies, and + rollback target are declared; +- secret references use approved routes and contain no values; +- production-critical workloads have explicit approval. + +A workload can enter Stage 3 only when: + +- Stage 2 acceptance gates passed for the same candidate artifact; +- `previous_stable` and rollback verification are recorded; +- backup/restore posture is current for stateful workloads; +- production-critical workloads have explicit human approval. diff --git a/docs/deployment-lifecycle.md b/docs/deployment-lifecycle.md index c61a404..3e38d09 100644 --- a/docs/deployment-lifecycle.md +++ b/docs/deployment-lifecycle.md @@ -54,9 +54,10 @@ Each stage emits a machine-readable result with: ## Workload Declaration Each participating workload should declare its promotion contract in a -repository-local `railiance/app.toml`. The full schema is defined by the next -workplan task, but this lifecycle expects every workload declaration to provide -at least: +repository-local `railiance/app.toml`. The contract is defined in +`docs/app-toml-contract.md`, with a machine-readable schema at +`schemas/railiance-app.schema.json`. 
This lifecycle expects every workload +declaration to provide at least: - stable workload name and owning repo; - source revision, image tag, or image digest policy; diff --git a/examples/railiance/app.toml b/examples/railiance/app.toml new file mode 100644 index 0000000..d97f37b --- /dev/null +++ b/examples/railiance/app.toml @@ -0,0 +1,176 @@ +schema_version = "railiance.app.v1" + +[app] +id = "example-service" +name = "Example Service" +repo = "railiance-apps/example-service" +owner = "platform" +criticality = "critical" +description = "Reference declaration for the Railiance staged promotion lifecycle." + +[source] +revision = "git:main" +artifact = "image" +digest_policy = "required" + +[rollback] +strategy = "helm-revision" +command = "bin/railiance rollback example-service" +verification = "GET /health returns 200 on the restored stable release." + +[[platform.dependencies]] +name = "state-hub" +kind = "state-hub" +required = true +stage = "stage2" +evidence = "State Hub /healthz returns ok from the cluster path." + +[[platform.dependencies]] +name = "postgres" +kind = "postgres" +required = true +stage = "stage2" +evidence = "Target database reports Ready and backup posture is current." 
+ +[[secrets.references]] +name = "runtime-api-key" +route = "openbao-api-key" +target = "ExternalSecret/example-service-runtime" +stage = "stage2" +required = true + +[[observability.health_endpoints]] +name = "local-health" +url = "http://127.0.0.1:8080/health" +stage = "stage1" +expected_status = 200 + +[[observability.health_endpoints]] +name = "cluster-health" +url = "http://example-service.example-service.svc.cluster.local:8080/health" +stage = "stage2" +expected_status = 200 + +[[observability.metrics]] +name = "request-errors" +reference = 'promql:rate(http_requests_total{status=~"5.."}[5m])' +stage = "stage2" + +[[observability.logs]] +name = "secret-leak-scan" +reference = "kubectl logs -n example-service deploy/example-service-canary" +stage = "stage2" + +[stages.stage1] +enabled = true +namespace = "local" +release = "example-service-local" +commands = ["make test", "helm template charts/example-service"] +checks = ["unit-tests", "helm-template", "local-health"] +evidence = ["pytest output", "helm template success", "local health 200"] +requires_approval = false + +[stages.stage2] +enabled = true +namespace = "example-service" +release = "example-service-canary" +commands = ["bin/railiance deploy --stage 2 example-service", "bin/railiance observe example-service"] +checks = ["server-dry-run", "canary-ready", "cluster-health", "operator-approval"] +evidence = ["release name", "pod readiness", "health 200", "State Hub progress id"] +requires_approval = true +canary_mode = "isolated" +observation_minutes = 60 + +[stages.stage3] +enabled = true +namespace = "example-service" +release = "example-service" +commands = ["bin/railiance promote example-service", "bin/railiance observe example-service"] +checks = ["stage2-accepted", "rollback-target", "cluster-health", "operator-approval"] +evidence = ["promotion command id", "new stable digest", "post-promotion smoke"] +requires_approval = true +promotion_mode = "release-replace" +previous_stable = 
"helm:example-service:previous" + +[[checks]] +id = "unit-tests" +type = "command" +stage = "stage1" +description = "Run repository unit tests." +required = true +run = "make test" +timeout_seconds = 600 + +[[checks]] +id = "helm-template" +type = "helm" +stage = "stage1" +description = "Render Helm templates locally." +required = true +chart = "charts/example-service" +values = "values/local.yaml" +mode = "template" + +[[checks]] +id = "local-health" +type = "http" +stage = "stage1" +description = "Confirm local service health." +required = true +url = "http://127.0.0.1:8080/health" +expected_status = 200 +timeout_seconds = 10 + +[[checks]] +id = "server-dry-run" +type = "helm" +stage = "stage2" +description = "Render and submit a server-side dry run before canary." +required = true +chart = "charts/example-service" +values = "values/canary.yaml" +mode = "server-dry-run" + +[[checks]] +id = "canary-ready" +type = "kubernetes" +stage = "stage2" +description = "Canary deployment reaches Available." +required = true +namespace = "example-service" +resource = "deploy/example-service-canary" +condition = "Available" + +[[checks]] +id = "cluster-health" +type = "http" +stage = "stage2" +description = "Cluster health endpoint returns 200." +required = true +url = "http://example-service.example-service.svc.cluster.local:8080/health" +expected_status = 200 +timeout_seconds = 10 + +[[checks]] +id = "operator-approval" +type = "manual" +stage = "stage2" +description = "Human approval is recorded before production-critical traffic changes." +required = true +evidence_required = "State Hub approval note id, candidate digest, rollback target." + +[[checks]] +id = "stage2-accepted" +type = "manual" +stage = "stage3" +description = "Stage 2 gates passed for the same candidate artifact." +required = true +evidence_required = "State Hub Stage 2 acceptance progress id." 
+ +[[checks]] +id = "rollback-target" +type = "manual" +stage = "stage3" +description = "Previous stable release is recorded before promotion." +required = true +evidence_required = "Previous Helm revision or image digest." diff --git a/schemas/railiance-app.schema.json b/schemas/railiance-app.schema.json new file mode 100644 index 0000000..2ff92a5 --- /dev/null +++ b/schemas/railiance-app.schema.json @@ -0,0 +1,596 @@ +{ + "$schema": "https://json-schema.org/draft/2020-12/schema", + "$id": "https://railiance.local/schemas/railiance-app.schema.json", + "title": "Railiance app.toml contract", + "type": "object", + "additionalProperties": false, + "required": [ + "schema_version", + "app", + "source", + "platform", + "secrets", + "observability", + "rollback", + "stages", + "checks" + ], + "properties": { + "schema_version": { + "const": "railiance.app.v1" + }, + "app": { + "type": "object", + "additionalProperties": false, + "required": [ + "id", + "name", + "repo", + "owner", + "criticality", + "description" + ], + "properties": { + "id": { + "type": "string", + "pattern": "^[a-z0-9][a-z0-9-]*$" + }, + "name": { + "type": "string", + "minLength": 1 + }, + "repo": { + "type": "string", + "minLength": 1 + }, + "owner": { + "type": "string", + "minLength": 1 + }, + "criticality": { + "enum": [ + "low", + "medium", + "high", + "critical" + ] + }, + "description": { + "type": "string", + "minLength": 1 + } + } + }, + "source": { + "type": "object", + "additionalProperties": false, + "required": [ + "revision", + "artifact", + "digest_policy" + ], + "properties": { + "revision": { + "type": "string", + "minLength": 1 + }, + "artifact": { + "enum": [ + "image", + "helm-chart", + "bundle", + "manifest", + "other" + ] + }, + "digest_policy": { + "enum": [ + "required", + "preferred", + "not-applicable" + ] + } + } + }, + "platform": { + "type": "object", + "additionalProperties": false, + "required": [ + "dependencies" + ], + "properties": { + "dependencies": { + "type": 
"array", + "items": { + "$ref": "#/$defs/dependency" + } + } + } + }, + "secrets": { + "type": "object", + "additionalProperties": false, + "required": [ + "references" + ], + "properties": { + "references": { + "type": "array", + "items": { + "$ref": "#/$defs/secretReference" + } + } + } + }, + "observability": { + "type": "object", + "additionalProperties": false, + "required": [ + "health_endpoints" + ], + "properties": { + "health_endpoints": { + "type": "array", + "minItems": 1, + "items": { + "$ref": "#/$defs/healthEndpoint" + } + }, + "metrics": { + "type": "array", + "default": [], + "items": { + "$ref": "#/$defs/observationReference" + } + }, + "logs": { + "type": "array", + "default": [], + "items": { + "$ref": "#/$defs/observationReference" + } + } + } + }, + "rollback": { + "type": "object", + "additionalProperties": false, + "required": [ + "strategy", + "command", + "verification" + ], + "properties": { + "strategy": { + "enum": [ + "helm-revision", + "image-digest", + "traffic-shift", + "manual-runbook", + "none" + ] + }, + "command": { + "type": "string", + "minLength": 1 + }, + "verification": { + "type": "string", + "minLength": 1 + } + } + }, + "stages": { + "type": "object", + "additionalProperties": false, + "required": [ + "stage1", + "stage2", + "stage3" + ], + "properties": { + "stage1": { + "$ref": "#/$defs/stage1" + }, + "stage2": { + "$ref": "#/$defs/stage2" + }, + "stage3": { + "$ref": "#/$defs/stage3" + } + } + }, + "checks": { + "type": "array", + "minItems": 1, + "items": { + "$ref": "#/$defs/check" + } + } + }, + "$defs": { + "stageName": { + "enum": [ + "stage1", + "stage2", + "stage3" + ] + }, + "dependency": { + "type": "object", + "additionalProperties": false, + "required": [ + "name", + "kind", + "required", + "stage", + "evidence" + ], + "properties": { + "name": { + "type": "string", + "minLength": 1 + }, + "kind": { + "enum": [ + "postgres", + "redis", + "object-store", + "identity", + "state-hub", + "inter-hub", + 
"network", + "other" + ] + }, + "required": { + "type": "boolean" + }, + "stage": { + "$ref": "#/$defs/stageName" + }, + "evidence": { + "type": "string", + "minLength": 1 + } + } + }, + "secretReference": { + "type": "object", + "additionalProperties": false, + "required": [ + "name", + "route", + "target", + "stage", + "required" + ], + "properties": { + "name": { + "type": "string", + "minLength": 1 + }, + "route": { + "type": "string", + "minLength": 1 + }, + "target": { + "type": "string", + "minLength": 1 + }, + "stage": { + "$ref": "#/$defs/stageName" + }, + "required": { + "type": "boolean" + } + }, + "not": { + "anyOf": [ + { + "required": [ + "value" + ] + }, + { + "required": [ + "token" + ] + }, + { + "required": [ + "password" + ] + }, + { + "required": [ + "secret" + ] + }, + { + "required": [ + "private_key" + ] + }, + { + "required": [ + "kubeconfig" + ] + } + ] + } + }, + "healthEndpoint": { + "type": "object", + "additionalProperties": false, + "required": [ + "name", + "url", + "stage", + "expected_status" + ], + "properties": { + "name": { + "type": "string", + "minLength": 1 + }, + "url": { + "type": "string", + "minLength": 1 + }, + "stage": { + "$ref": "#/$defs/stageName" + }, + "expected_status": { + "type": "integer", + "minimum": 100, + "maximum": 599 + } + } + }, + "observationReference": { + "type": "object", + "additionalProperties": false, + "required": [ + "name", + "reference", + "stage" + ], + "properties": { + "name": { + "type": "string", + "minLength": 1 + }, + "reference": { + "type": "string", + "minLength": 1 + }, + "stage": { + "$ref": "#/$defs/stageName" + } + } + }, + "check": { + "type": "object", + "additionalProperties": true, + "required": [ + "id", + "type", + "stage", + "description", + "required" + ], + "properties": { + "id": { + "type": "string", + "minLength": 1 + }, + "type": { + "enum": [ + "command", + "http", + "kubernetes", + "helm", + "metric", + "log", + "manual" + ] + }, + "stage": { + "$ref": 
"#/$defs/stageName" + }, + "description": { + "type": "string", + "minLength": 1 + }, + "required": { + "type": "boolean" + } + } + }, + "stage1": { + "type": "object", + "additionalProperties": false, + "required": [ + "enabled", + "namespace", + "release", + "commands", + "checks", + "evidence", + "requires_approval" + ], + "properties": { + "enabled": { + "type": "boolean" + }, + "namespace": { + "type": "string", + "minLength": 1 + }, + "release": { + "type": "string", + "minLength": 1 + }, + "commands": { + "type": "array", + "items": { + "type": "string", + "minLength": 1 + } + }, + "checks": { + "type": "array", + "items": { + "type": "string", + "minLength": 1 + } + }, + "evidence": { + "type": "array", + "items": { + "type": "string", + "minLength": 1 + } + }, + "requires_approval": { + "type": "boolean" + } + } + }, + "stage2": { + "type": "object", + "additionalProperties": false, + "required": [ + "enabled", + "namespace", + "release", + "commands", + "checks", + "evidence", + "requires_approval", + "canary_mode", + "observation_minutes" + ], + "properties": { + "enabled": { + "type": "boolean" + }, + "namespace": { + "type": "string", + "minLength": 1 + }, + "release": { + "type": "string", + "minLength": 1 + }, + "commands": { + "type": "array", + "items": { + "type": "string", + "minLength": 1 + } + }, + "checks": { + "type": "array", + "items": { + "type": "string", + "minLength": 1 + } + }, + "evidence": { + "type": "array", + "items": { + "type": "string", + "minLength": 1 + } + }, + "requires_approval": { + "type": "boolean" + }, + "canary_mode": { + "enum": [ + "weighted", + "header", + "path", + "shadow", + "isolated" + ] + }, + "observation_minutes": { + "type": "integer", + "minimum": 1 + }, + "traffic_percent": { + "type": "integer", + "minimum": 0, + "maximum": 100 + } + } + }, + "stage3": { + "type": "object", + "additionalProperties": false, + "required": [ + "enabled", + "namespace", + "release", + "commands", + "checks", + "evidence", + 
"requires_approval", + "promotion_mode", + "previous_stable" + ], + "properties": { + "enabled": { + "type": "boolean" + }, + "namespace": { + "type": "string", + "minLength": 1 + }, + "release": { + "type": "string", + "minLength": 1 + }, + "commands": { + "type": "array", + "items": { + "type": "string", + "minLength": 1 + } + }, + "checks": { + "type": "array", + "items": { + "type": "string", + "minLength": 1 + } + }, + "evidence": { + "type": "array", + "items": { + "type": "string", + "minLength": 1 + } + }, + "requires_approval": { + "type": "boolean" + }, + "promotion_mode": { + "enum": [ + "traffic-shift", + "release-replace", + "selector-switch", + "workflow" + ] + }, + "previous_stable": { + "type": "string", + "minLength": 1 + } + } + } + } +} diff --git a/workplans/RAIL-BS-WP-0006-staged-promotion-lifecycle.md b/workplans/RAIL-BS-WP-0006-staged-promotion-lifecycle.md index 3db3826..26e6d8b 100644 --- a/workplans/RAIL-BS-WP-0006-staged-promotion-lifecycle.md +++ b/workplans/RAIL-BS-WP-0006-staged-promotion-lifecycle.md @@ -10,7 +10,7 @@ topic_slug: railiance repo_goal_id: "6ea441f7-7fe3-4598-922b-38baf20c0580" state_hub_workstream_id: "cb72d3ba-1863-43c2-a2a5-49ac75fc2603" created: "2026-02-24" -updated: "2026-06-16" +updated: "2026-06-27" --- # Staged Promotion Lifecycle @@ -85,7 +85,7 @@ answered before cutover. ```task id: RAIL-BS-WP-0006-T02 -status: todo +status: done priority: high state_hub_task_id: "523cf928-bb0e-4109-a172-abf029c62885" ``` @@ -105,6 +105,8 @@ Minimum contract: **Done when:** a repo can declare how it moves through the Railiance promotion lifecycle without bespoke instructions. +2026-06-27: Added `docs/app-toml-contract.md`, `schemas/railiance-app.schema.json`, and `examples/railiance/app.toml`. 
The v1 contract covers app identity, ownership, source/artifact policy, platform dependencies, secret references without plaintext values, health and observability endpoints, stage commands/checks/evidence, canary and promotion modes, rollback strategy, and human approval gates. + --- ### T03 - Overlay repo pattern and creation script