coulomb/railiance-cluster

tegwick 9a463e0749

Add Railiance Stage 2 deploy observe tooling

2026-06-27 16:51:02 +02:00

Railiance app.toml Contract

This document defines the repository-local railiance/app.toml contract used by Railiance staged promotion tooling. The file tells Railiance how a workload moves through Stage 1 local validation, Stage 2 production canary, and Stage 3 production promotion without relying on bespoke operator notes.

The contract is intentionally declarative. Commands, health checks, platform dependencies, and secret references are described by stable names. Plaintext secrets, bearer tokens, kubeconfigs, and private key material must never appear in railiance/app.toml.

The machine-readable schema lives at schemas/railiance-app.schema.json. A minimal example lives at examples/railiance/app.toml.

File Location

Participating workload repositories declare the contract at:

railiance/app.toml

Overlay repositories for third-party applications use the same path in the overlay repo, not in the upstream source repository.

Versioning

Every file must include:

schema_version = "railiance.app.v1"

Breaking contract changes require a new schema version. Tooling must fail closed when it sees an unsupported schema_version.
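The fail-closed rule above could be sketched as follows. This is a minimal illustration, not the Railiance implementation: the function name, the exception class, and the assumption that the contract has already been parsed into a dictionary are all hypothetical.

```python
# Hypothetical sketch: only versions listed here are accepted.
SUPPORTED_SCHEMA_VERSIONS = frozenset({"railiance.app.v1"})


class UnsupportedSchemaVersion(ValueError):
    """Raised when app.toml declares a schema version this tooling does not know."""


def require_supported_schema(contract: dict) -> str:
    """Return the declared schema version, or raise if it is missing or unsupported."""
    version = contract.get("schema_version")
    if not isinstance(version, str) or version not in SUPPORTED_SCHEMA_VERSIONS:
        # Fail closed: never guess at the meaning of an unknown contract.
        raise UnsupportedSchemaVersion(f"unsupported schema_version: {version!r}")
    return version
```

A missing key is treated the same way as an unknown version, so a file that omits schema_version is rejected rather than assumed to be v1.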

Top-Level Sections

app

Identifies the workload and its ownership boundary.

Required fields:

id: stable lowercase id using letters, numbers, and hyphens.
name: human-readable workload name.
repo: owning source or overlay repository slug.
owner: owning team, domain, or operator group.
criticality: one of low, medium, high, or critical.
description: short purpose statement.

Production-critical workloads include the source forge, identity, State Hub, Inter-Hub, databases, object stores, backup systems, ingress, and cluster-wide policy controllers. For those workloads, criticality = "critical" means that explicit human approval is required before Stage 2 traffic exposure and before Stage 3 promotion.
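One way to check the app table is sketched below. The regular expression is an assumption derived from the prose description of id; the machine-readable schema may be stricter. The function names are hypothetical.

```python
import re

# Assumption: "lowercase letters, numbers, and hyphens" read literally.
APP_ID_PATTERN = re.compile(r"[a-z0-9-]+")
CRITICALITY_LEVELS = ("low", "medium", "high", "critical")
REQUIRED_APP_FIELDS = ("id", "name", "repo", "owner", "criticality", "description")


def validate_app(app: dict) -> list[str]:
    """Return a list of human-readable problems found in the [app] table."""
    errors = [f"app.{name} is required" for name in REQUIRED_APP_FIELDS if not app.get(name)]
    app_id = app.get("id")
    if isinstance(app_id, str) and app_id and not APP_ID_PATTERN.fullmatch(app_id):
        errors.append(f"app.id {app_id!r} must use lowercase letters, numbers, and hyphens")
    criticality = app.get("criticality")
    if criticality and criticality not in CRITICALITY_LEVELS:
        errors.append(f"app.criticality {criticality!r} is not one of {CRITICALITY_LEVELS}")
    return errors


def needs_human_approval(app: dict) -> bool:
    """Critical workloads need explicit human approval before Stage 2 and Stage 3."""
    return app.get("criticality") == "critical"
```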

source

Identifies the candidate under promotion.

Required fields:

revision: commit id, tag, or immutable source revision expression.
artifact: artifact kind, normally image, helm-chart, or bundle.
digest_policy: one of required, preferred, or not-applicable.

If an image is promoted, Stage 2 and Stage 3 tooling should prefer immutable image digests over mutable tags.
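The digest preference can be illustrated with a small check. The contract does not spell out how required, preferred, and not-applicable are enforced, so the mapping to fail, warn, and skip below is an assumption, as is the image_ref argument, which is not a field named by this document.

```python
import re

# An OCI image reference pinned by digest ends in "@sha256:" plus 64 hex characters.
DIGEST_SUFFIX = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """True when the reference names an immutable sha256 digest rather than a tag."""
    return DIGEST_SUFFIX.search(image_ref) is not None


def check_digest_policy(digest_policy: str, image_ref: str) -> tuple[str, str]:
    """Return (status, message); the status vocabulary is hypothetical."""
    if digest_policy == "not-applicable":
        return ("skip", "digest policy does not apply to this artifact")
    if digest_policy not in ("required", "preferred"):
        return ("fail", f"unknown digest_policy {digest_policy!r}")
    if is_digest_pinned(image_ref):
        return ("pass", "image is pinned by digest")
    if digest_policy == "required":
        return ("fail", "digest_policy is required but the image uses a mutable tag")
    return ("warn", "image uses a mutable tag; a digest is preferred")
```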

platform.dependencies

Declares platform services required before canary or production promotion.

Each dependency has:

name: stable service name.
kind: dependency kind such as postgres, redis, object-store, identity, state-hub, inter-hub, network, or other.
required: boolean.
stage: earliest stage that needs it, one of stage1, stage2, or stage3.
evidence: non-secret evidence expected before promotion, such as a health endpoint result, Kubernetes Ready condition, or State Hub progress id.
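Because stage names the earliest stage that needs a dependency, a dependency declared for stage1 is also due at stage2 and stage3. The sketch below selects the required dependencies that must show evidence before a given stage; the function name and the sample data are hypothetical.

```python
STAGE_ORDER = {"stage1": 1, "stage2": 2, "stage3": 3}


def dependencies_due(dependencies: list[dict], target_stage: str) -> list[dict]:
    """Required dependencies whose earliest stage is at or before target_stage."""
    limit = STAGE_ORDER[target_stage]
    return [
        dep for dep in dependencies
        if dep["required"] and STAGE_ORDER[dep["stage"]] <= limit
    ]


# Hypothetical entries, shaped like parsed platform.dependencies items.
deps = [
    {"name": "app-db", "kind": "postgres", "required": True, "stage": "stage2",
     "evidence": "Kubernetes Ready condition"},
    {"name": "cache", "kind": "redis", "required": False, "stage": "stage1",
     "evidence": "health endpoint result"},
    {"name": "sso", "kind": "identity", "required": True, "stage": "stage3",
     "evidence": "health endpoint result"},
]
due_for_canary = [dep["name"] for dep in dependencies_due(deps, "stage2")]
```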

secrets.references

Declares required secret references without secret values.

Each reference has:

name: workload-local secret name.
route: approved credential route id, for example openbao-api-key, key-cape-oidc-login, or activity-core-issue-sink.
target: non-secret target reference such as a Kubernetes Secret name, ExternalSecret name, OpenBao path, or environment variable name.
stage: earliest stage that needs the secret.
required: boolean.

Secret reference objects must not contain plaintext values, tokens, passwords, kubeconfigs, or private keys. Tooling must reject suspicious field names such as value, token, password, secret, private_key, or kubeconfig inside secret reference objects. The same words are acceptable when they appear only as part of the approved non-secret target text.
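A minimal sketch of the field-name rejection follows. It inspects keys only, so a target such as a Kubernetes Secret name that happens to contain the word "token" is still accepted. The function name is hypothetical, and real tooling may apply broader heuristics.

```python
SUSPICIOUS_FIELDS = frozenset(
    {"value", "token", "password", "secret", "private_key", "kubeconfig"}
)


def suspicious_fields(reference: dict) -> list[str]:
    """Return the suspicious field names present in one secret reference object."""
    # Only field names are inspected; the text of `target` is deliberately ignored.
    return sorted(key for key in reference if key.lower() in SUSPICIOUS_FIELDS)


# Hypothetical references, shaped like parsed secrets.references items.
accepted = {
    "name": "forge-api",
    "route": "openbao-api-key",
    "target": "Secret/forge-api-token",
    "stage": "stage2",
    "required": True,
}
rejected = dict(accepted, Token="not-a-real-token", value="placeholder")
```

Here suspicious_fields(accepted) is empty, while suspicious_fields(rejected) reports both offending keys.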

observability

Defines how promotion tooling proves the workload is alive and observable.

Fields:

health_endpoints: required; one or more HTTP health endpoint declarations.
metrics: optional metrics endpoint or query references.
logs: optional log selectors or query references.

Health endpoint declarations include name, url, stage, and expected status code. URLs may be internal service URLs for Stage 2/3; they must not embed credentials.
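The rule that URLs must not embed credentials can be checked with the standard library, as sketched below. The function name is hypothetical, and the check covers only the userinfo part of the URL; tokens passed in query strings would need a separate rule.

```python
from urllib.parse import urlsplit


def embeds_credentials(url: str) -> bool:
    """True when the URL carries userinfo such as user:password@host."""
    parts = urlsplit(url)
    return parts.username is not None or parts.password is not None
```

An internal service URL such as http://app.example-canary.svc:8080/healthz passes, while http://admin:pw@app.example-canary.svc/healthz is rejected.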

rollback

Defines how the workload returns to a previous stable state.

Required fields:

strategy: one of helm-revision, image-digest, traffic-shift, manual-runbook, or none.
command: command name or runbook path. This may be a placeholder before T07 implements automation, but it must tell the operator where rollback lives.
verification: non-secret check to confirm rollback succeeded.

strategy = "none" is allowed only for Stage 1-only workloads and must not be used for production-critical workloads.
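The rollback rules above might be validated as in the sketch below. It treats criticality = "critical" as the marker for a production-critical workload and treats a workload as Stage 1-only when neither stage2 nor stage3 is enabled; both readings are assumptions, and the function name is hypothetical.

```python
ROLLBACK_STRATEGIES = ("helm-revision", "image-digest", "traffic-shift", "manual-runbook", "none")


def validate_rollback(rollback: dict, stages: dict, criticality: str) -> list[str]:
    """Return problems with the [rollback] table for the given stages and criticality."""
    errors = []
    strategy = rollback.get("strategy")
    if strategy not in ROLLBACK_STRATEGIES:
        errors.append(f"rollback.strategy {strategy!r} is not a known strategy")
    for name in ("command", "verification"):
        if not rollback.get(name):
            errors.append(f"rollback.{name} is required")
    if strategy == "none":
        beyond_stage1 = [
            stage for stage in ("stage2", "stage3")
            if stages.get(stage, {}).get("enabled")
        ]
        if beyond_stage1:
            errors.append(f"strategy none is not allowed with {', '.join(beyond_stage1)} enabled")
        if criticality == "critical":
            errors.append("strategy none is not allowed for production-critical workloads")
    return errors
```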

Stage Sections

The contract has one table for each stage:

[stages.stage1]
[stages.stage2]
[stages.stage3]

Each stage includes:

enabled: boolean.
namespace: target Kubernetes namespace, or a local namespace for Stage 1.
release: release identity.
commands: ordered command aliases or shell commands that tooling may run.
checks: ordered check ids to evaluate.
evidence: expected non-secret evidence outputs.
requires_approval: boolean.

Stage 2 additionally includes canary_mode, one of weighted, header, path, shadow, or isolated, plus observation_minutes and optional traffic_percent when weighted routing is used.

Stage 3 additionally includes promotion_mode, one of traffic-shift, release-replace, selector-switch, or workflow, plus previous_stable.

Check Definitions

Checks live under [[checks]] entries and are referenced by stage checks.

Required fields:

id: stable check id.
type: one of command, http, kubernetes, helm, metric, log, or manual.
stage: earliest stage that may run the check.
description: human-readable purpose.
required: boolean.

Type-specific fields:

command: run command string and optional timeout_seconds.
http: url, expected_status, and optional timeout_seconds.
kubernetes: namespace, resource, and condition.
helm: chart, values, and mode such as template or server-dry-run.
metric: query, window_minutes, and threshold.
log: selector, window_minutes, and forbidden_patterns.
manual: evidence_required text.

Checks must not print secrets. If a check needs secret-backed access, the result records only the route, target object, and pass/fail state.
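The check definitions and their references from stage tables could be validated as sketched below. Function names are hypothetical. The name of the field that holds the command string for command checks is not unambiguous in this text, so that type is left without type-specific validation here.

```python
STAGE_ORDER = {"stage1": 1, "stage2": 2, "stage3": 3}
COMMON_FIELDS = ("id", "type", "stage", "description", "required")
TYPE_FIELDS = {
    "command": (),  # command-string field name left unchecked; see the note above
    "http": ("url", "expected_status"),
    "kubernetes": ("namespace", "resource", "condition"),
    "helm": ("chart", "values", "mode"),
    "metric": ("query", "window_minutes", "threshold"),
    "log": ("selector", "window_minutes", "forbidden_patterns"),
    "manual": ("evidence_required",),
}


def validate_check(check: dict) -> list[str]:
    """Return problems with one [[checks]] entry."""
    label = check.get("id", "<missing id>")
    # Presence is tested with `in` so that false and 0 count as declared values.
    errors = [f"{label}: missing {name}" for name in COMMON_FIELDS if name not in check]
    kind = check.get("type")
    if not isinstance(kind, str) or kind not in TYPE_FIELDS:
        errors.append(f"{label}: unknown type {kind!r}")
        return errors
    errors += [f"{label}: type {kind} needs {name}" for name in TYPE_FIELDS[kind] if name not in check]
    return errors


def validate_stage_refs(stages: dict, checks: list[dict]) -> list[str]:
    """Every check id a stage references must exist and be allowed to run that early."""
    by_id = {check["id"]: check for check in checks if "id" in check}
    errors = []
    for stage_name, stage in stages.items():
        for ref in stage.get("checks", []):
            check = by_id.get(ref)
            if check is None:
                errors.append(f"{stage_name}: unknown check id {ref!r}")
            elif STAGE_ORDER.get(check.get("stage"), 99) > STAGE_ORDER.get(stage_name, 0):
                errors.append(f"{stage_name}: check {ref!r} may not run before {check.get('stage')}")
    return errors
```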

Command Semantics

Commands in app.toml are declarations for Railiance tooling. Stage 1 and Stage 2 commands now have local CLI support; Stage 3 commands may still point to existing scripts or runbook commands until T07 lands.

Expected mapping:

Stage 1 commands are consumed by bin/railiance run <overlay-dir>.
Stage 2 commands are consumed by bin/railiance deploy --stage 2 <overlay-dir> and bin/railiance observe --stage 2 <overlay-dir>.
Stage 3 commands will be consumed by the planned bin/railiance promote <overlay-dir> and bin/railiance rollback <overlay-dir> commands.

Tooling must emit machine-readable results with workload identity, candidate revision, checks run, pass/fail status, non-secret evidence, rollback target, and approval state.
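A machine-readable result carrying the items listed above might look like the sketch below. The class names, field names, and JSON layout are hypothetical; the contract names what a result must contain, not its exact shape.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CheckResult:
    id: str
    passed: bool
    evidence: str  # non-secret evidence only, for example a status code or a condition


@dataclass
class StageResult:
    workload_id: str
    stage: str
    candidate_revision: str
    checks: list[CheckResult]
    rollback_target: str
    approval_state: str

    @property
    def passed(self) -> bool:
        # Fail closed: a run with no checks does not count as a pass.
        return bool(self.checks) and all(check.passed for check in self.checks)

    def to_json(self) -> str:
        payload = asdict(self)
        payload["status"] = "pass" if self.passed else "fail"
        return json.dumps(payload, sort_keys=True)


result = StageResult(
    workload_id="example-service",
    stage="stage2",
    candidate_revision="abc123",
    checks=[CheckResult(id="canary-health", passed=True, evidence="HTTP 200")],
    rollback_target="previous release revision",
    approval_state="approved",
)
document = json.loads(result.to_json())
```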

Minimal Example

See examples/railiance/app.toml. It declares a critical internal service with:

immutable image digest requirement;
Stage 1 local validation;
Stage 2 isolated canary;
Stage 3 release replacement;
OpenBao-routed secret references without values;
HTTP, Helm, Kubernetes, and manual approval checks.

Adoption Rules

A workload can enter Stage 1 when app.toml passes schema validation and all Stage 1 required checks are declared.

A workload can enter Stage 2 only when:

Stage 1 passed for the same candidate artifact;
Stage 2 namespace, release, canary mode, health checks, dependencies, and rollback target are declared;
secret references use approved routes and contain no values;
production-critical workloads have explicit human approval.

A workload can enter Stage 3 only when:

Stage 2 acceptance gates passed for the same candidate artifact;
previous_stable and rollback verification are recorded;
backup/restore posture is current for stateful workloads;
production-critical workloads have explicit human approval.
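The Stage 2 entry conditions can be read as a gate that returns the reasons a workload is blocked. The sketch below covers a subset of those conditions under stated assumptions: the candidate is compared by source.revision, criticality = "critical" marks a production-critical workload, and the approved route set is filled with the example route ids from this document. The function name and argument names are hypothetical.

```python
from typing import Optional

# Assumption: the example route ids from this document stand in for the approved set.
APPROVED_ROUTES = frozenset({"openbao-api-key", "key-cape-oidc-login", "activity-core-issue-sink"})


def stage2_blockers(contract: dict, stage1_passed_revision: Optional[str], approved: bool) -> list[str]:
    """Return the reasons a workload may not enter Stage 2; an empty list means it may."""
    blockers = []
    revision = contract.get("source", {}).get("revision")
    if revision is None or stage1_passed_revision != revision:
        blockers.append("Stage 1 has not passed for this candidate")
    stage2 = contract.get("stages", {}).get("stage2", {})
    for name in ("namespace", "release", "canary_mode"):
        if not stage2.get(name):
            blockers.append(f"stages.stage2.{name} is not declared")
    if not contract.get("observability", {}).get("health_endpoints"):
        blockers.append("no health endpoints are declared")
    if not contract.get("rollback", {}).get("command"):
        blockers.append("no rollback target is declared")
    for ref in contract.get("secrets", {}).get("references", []):
        if ref.get("route") not in APPROVED_ROUTES:
            blockers.append(f"secret {ref.get('name')!r} uses an unapproved route")
    if contract.get("app", {}).get("criticality") == "critical" and not approved:
        blockers.append("explicit human approval is missing")
    return blockers
```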
