tegwick 45e57b0a11 Add staged deployment lifecycle spec

2026-06-16 08:01:25 +02:00

Railiance Deployment Lifecycle

This document defines the Railiance three-stage promotion lifecycle for workloads that run on the Railiance Kubernetes substrate.

The lifecycle exists so that production workloads move through repeatable gates instead of relying on an individual operator's memory. It is intentionally conservative: every stage must leave evidence, every promotion must have a rollback path, and critical workloads require explicit human approval before production traffic is changed.

Scope

This specification is owned by railiance-cluster because it defines the cluster runtime contract for promotion gates, canary validation, production routing, and rollback expectations.

Repo boundaries:

railiance-cluster owns the lifecycle semantics, cluster prerequisites, ingress/routing expectations, and acceptance gates.
railiance-apps owns workload-specific Helm values, application release definitions, and production workload configuration.
railiance-platform owns shared platform services such as databases, caches, object storage, and backup targets.
railiance-enablement owns developer-facing templates, CI workflows, and local ergonomics.
railiance-infra owns host provisioning, OS hardening, SSH, firewall, and node bootstrap below Kubernetes.

Lifecycle Overview

Railiance promotes a workload through three stages:

Stage 1: local validation.
Stage 2: production canary.
Stage 3: production promotion.

The stages are sequential. A workload may return to an earlier stage at any time, but it must not skip a stage when moving toward production unless an operator records an emergency exception in State Hub.
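
The ordering rule above can be sketched as a small transition check. This is a minimal illustration, not part of the specification: the `Stage` enum, the function name, and the `emergency_exception_recorded` flag are hypothetical, and treating a re-run of the current stage as allowed is an assumption. The check covers ordering only; it does not verify that the previous stage actually passed.

```python
from enum import IntEnum


class Stage(IntEnum):
    LOCAL_VALIDATION = 1
    PRODUCTION_CANARY = 2
    PRODUCTION_PROMOTION = 3


def transition_allowed(current: Stage, target: Stage,
                       emergency_exception_recorded: bool = False) -> bool:
    """Return True if moving from `current` to `target` respects the stage order."""
    if target <= current:
        # Returning to an earlier stage is always allowed.
        # Re-running the current stage is treated the same way (assumption).
        return True
    if target == current + 1:
        # Moving forward one stage at a time is the normal path.
        return True
    # Skipping a stage requires an emergency exception recorded in State Hub.
    return emergency_exception_recorded
```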

Each stage emits a machine-readable result with:

workload identity;
source revision or image digest;
target environment;
checks run;
pass/fail status;
non-secret evidence references;
rollback target, when applicable;
approving human or explicit "not required" decision.
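
As an illustration of what such a result could look like, the sketch below models the listed fields as a Python dataclass and serializes it to JSON. The class name, field names, and the JSON encoding are assumptions made for this example; the specification does not fix a format.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class StageResult:
    """Hypothetical shape of a stage result; field names are illustrative."""
    workload: str
    artifact: str                       # source revision or image digest
    environment: str
    checks_run: list[str]
    passed: bool
    evidence_refs: list[str] = field(default_factory=list)  # non-secret references only
    rollback_target: Optional[str] = None                   # set when applicable
    approval: str = "not required"      # approving human, or the explicit "not required"

    def to_json(self) -> str:
        # sort_keys keeps the output stable so results can be compared or diffed.
        return json.dumps(asdict(self), sort_keys=True)


result = StageResult(
    workload="demo-service",
    artifact="sha256:" + "a" * 64,
    environment="local",
    checks_run=["helm-template", "smoke"],
    passed=True,
)
print(result.to_json())
```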

Workload Declaration

Each participating workload should declare its promotion contract in a repository-local railiance/app.toml. The full schema will be defined by the next workplan task, but this lifecycle expects every workload declaration to provide at least:

stable workload name and owning repo;
source revision, image tag, or image digest policy;
stage-specific namespaces or release names;
health checks and observability endpoints;
ingress or routing targets;
platform dependencies;
rollback command or previous-stable reference;
secret references by name or path, never plaintext secret values.

If a workload cannot provide a machine-readable declaration yet, it may still use this lifecycle through a written operator runbook, but that is a temporary compatibility path. The runbook must identify the missing declaration fields.

Stage 1: Local Validation

Stage 1 proves that the workload can be built, configured, and checked outside production traffic.

Typical Stage 1 targets:

local container runtime;
local Kubernetes such as k3d, kind, or a disposable namespace;
dry-run Helm rendering;
unit, integration, migration, and smoke checks that do not require production credentials.

Required Stage 1 checks:

source revision is cleanly identified;
build or artifact selection is deterministic;
Helm templates or manifests render without invalid Kubernetes objects;
local health checks pass;
required secrets are referenced by name only and are not printed;
database migrations, if any, are classified as reversible, forward-only, or requiring human approval;
a Stage 2 candidate artifact is named by immutable digest or equivalent immutable revision.
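
The last check can be partly automated. The sketch below accepts only image references pinned by a sha256 digest; the function name and the example registry host are hypothetical, and it does not cover the "equivalent immutable revision" case, which would need its own rule.

```python
import re

# A reference pinned by digest, e.g. "registry.example/app@sha256:<64 hex chars>".
_DIGEST_REF = re.compile(r"[^@\s]+@sha256:[0-9a-f]{64}")


def is_pinned_by_digest(image_ref: str) -> bool:
    """Return True only when the whole reference ends in a full sha256 digest."""
    return _DIGEST_REF.fullmatch(image_ref) is not None


print(is_pinned_by_digest("registry.example/demo@sha256:" + "0" * 64))  # True
print(is_pinned_by_digest("registry.example/demo:latest"))              # False
```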

Stage 1 fails closed when:

local checks are skipped without an approved reason;
generated manifests contain plaintext secrets;
the artifact cannot be traced to source;
the workload cannot state how it will be observed in Stage 2.
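
"Fails closed" means that anything other than a positive result blocks the stage. A minimal sketch of that evaluation follows, assuming each check reports one of three status strings; the `CheckOutcome` type and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CheckOutcome:
    name: str
    status: str                        # "passed", "failed", or "skipped"
    skip_reason_approved: bool = False


def stage1_passes(outcomes: list[CheckOutcome]) -> bool:
    """Pass only when every check passed or was skipped with an approved reason."""
    if not outcomes:
        return False  # no recorded checks means no evidence, so the stage fails
    for outcome in outcomes:
        if outcome.status == "passed":
            continue
        if outcome.status == "skipped" and outcome.skip_reason_approved:
            continue
        # Failed, skipped without approval, or an unknown status all fail closed.
        return False
    return True
```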

Stage 1 completion does not authorize production traffic. It only makes a workload eligible for Stage 2 review.

Stage 2: Production Canary

Stage 2 deploys the candidate to production infrastructure with limited or isolated exposure. The goal is to observe the candidate against real platform dependencies while keeping blast radius small.

Acceptable canary forms:

weighted ingress split between stable and canary;
header-based or path-based routing for operator traffic only;
shadow deployment receiving replicated non-mutating traffic;
isolated production namespace with manually triggered probes.

The selected canary form must be declared before deployment. If weighted routing is unavailable, the fallback must preserve the same safety property: the candidate can be observed without silently replacing stable production.

Required Stage 2 prechecks:

Stage 1 result passed for the same candidate artifact;
cluster connectivity and namespace readiness are verified;
target image digest or immutable tag exists in the registry;
Helm server-side dry-run succeeds;
ingress, certificate, and DNS prerequisites are present where applicable;
platform dependencies are healthy or explicitly degraded with operator approval;
rollback target is known before the canary is applied;
monitoring and log queries are available for the canary release.

Required Stage 2 evidence:

rendered release identity;
applied namespace and release name;
pod readiness and restart status;
ingress or routing state;
key health endpoint result;
relevant metrics window or explicit "metrics unavailable" note;
State Hub progress note with non-secret evidence;
operator approval when the workload is production-critical.

Canary acceptance gates:

canary pods remain ready for the configured observation window;
no crash loops, repeated restarts, or pending pods remain unexplained;
health checks pass from inside and outside the cluster when both are applicable;
error rate, latency, and saturation do not regress beyond the workload's declared threshold;
no unexpected schema, storage, or queue side effects are observed;
logs show no secret leakage and no repeated authorization failures;
rollback has been tested previously or is a single documented command with a known previous-stable target.
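
The regression gate can be sketched as a comparison between stable and canary metrics. The metric names, the example numbers, and the reading of "declared threshold" as an absolute allowed increase are all assumptions for illustration; a workload could just as well declare relative thresholds.

```python
def regressions(stable: dict[str, float], canary: dict[str, float],
                max_regression: dict[str, float]) -> list[str]:
    """Return metrics where the canary is worse than stable by more than allowed.

    All metrics are treated as "lower is better".
    """
    failed = []
    for metric, allowed in max_regression.items():
        if metric not in stable or metric not in canary:
            failed.append(metric)  # missing data fails closed
            continue
        if canary[metric] - stable[metric] > allowed:
            failed.append(metric)
    return failed


stable = {"error_rate": 0.010, "p95_latency_ms": 120.0, "cpu_saturation": 0.50}
canary = {"error_rate": 0.012, "p95_latency_ms": 180.0, "cpu_saturation": 0.55}
limits = {"error_rate": 0.005, "p95_latency_ms": 25.0, "cpu_saturation": 0.10}

print(regressions(stable, canary, limits))  # ['p95_latency_ms']
```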

Default observation windows:

non-critical internal service: 15 minutes;
user-facing or shared platform service: 30 minutes;
production-critical infrastructure such as Forgejo, identity, registry, or State Hub: operator-defined window, minimum 60 minutes unless explicitly waived.
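
These defaults could be encoded as follows. The workload class names and the function signature are hypothetical, and operator overrides for non-critical classes are deliberately not modeled.

```python
from typing import Optional

DEFAULT_WINDOW_MINUTES = {
    "non_critical_internal": 15,
    "user_facing_or_shared_platform": 30,
}
CRITICAL_MINIMUM_MINUTES = 60


def observation_window_minutes(workload_class: str,
                               operator_window: Optional[int] = None,
                               minimum_waived: bool = False) -> int:
    """Return the observation window in minutes for a workload class."""
    if workload_class == "production_critical":
        if operator_window is None:
            raise ValueError("production-critical workloads need an operator-defined window")
        if operator_window < CRITICAL_MINIMUM_MINUTES and not minimum_waived:
            raise ValueError("a window below 60 minutes requires an explicit waiver")
        return operator_window
    # An unknown class raises KeyError rather than falling back to a short window.
    return DEFAULT_WINDOW_MINUTES[workload_class]
```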

Stage 2 fails closed when:

the canary cannot be distinguished from stable production;
production routing changes more traffic than intended;
any required evidence is missing and no operator waiver is recorded;
rollback target is unknown;
the candidate needs a secret, credential, or platform dependency that was not declared before the canary.
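
The missing-evidence rule can be sketched as a gap check over the required Stage 2 evidence. The evidence keys are hypothetical labels for the items listed earlier in this section. Treating operator approval as something a waiver cannot cover is an assumption of this sketch, not a rule stated by the lifecycle.

```python
REQUIRED_STAGE2_EVIDENCE = (
    "release_identity",
    "namespace_and_release",
    "pod_status",
    "routing_state",
    "health_endpoint",
    "metrics_window",   # may hold an explicit "metrics unavailable" note
    "state_hub_note",
)


def unresolved_evidence_gaps(evidence: dict[str, str], waivers: set[str],
                             production_critical: bool) -> list[str]:
    """Return required evidence that is neither present nor covered by a waiver."""
    gaps = [key for key in REQUIRED_STAGE2_EVIDENCE
            if not evidence.get(key) and key not in waivers]
    # Assumption: approval for production-critical workloads cannot be waived.
    if production_critical and not evidence.get("operator_approval"):
        gaps.append("operator_approval")
    return gaps
```

Stage 2 would fail closed whenever the returned list is non-empty.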

Stage 3: Production Promotion

Stage 3 promotes the accepted candidate to the stable production path.

Promotion may mean:

shifting weighted traffic to the canary release;
replacing the stable Helm release with the accepted candidate;
changing an ingress selector or service target;
activating an operator-approved rollout workflow.

Required Stage 3 prechecks:

Stage 2 acceptance gates passed for the same candidate artifact;
the previous stable version is recorded;
backup and restore posture is current for stateful workloads;
migrations are approved and sequenced;
production-critical workloads have explicit human approval;
a rollback command and rollback verification check are available.
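
A sketch of these prechecks as a single blocker list follows. The `Stage3Request` type and its fields are hypothetical. A workload without migrations would set `migrations_approved` to true. The first check compares artifacts so that a Stage 2 pass for a different build does not count.

```python
from dataclasses import dataclass


@dataclass
class Stage3Request:
    candidate_artifact: str
    stage2_artifact: str
    stage2_gates_passed: bool
    previous_stable: str
    stateful: bool
    backup_current: bool
    migrations_approved: bool
    production_critical: bool
    human_approved: bool
    rollback_command: str
    rollback_check: str


def stage3_blockers(req: Stage3Request) -> list[str]:
    """Return the reasons promotion must not start; an empty list means proceed."""
    blockers = []
    if not req.stage2_gates_passed or req.stage2_artifact != req.candidate_artifact:
        blockers.append("Stage 2 did not pass for this exact artifact")
    if not req.previous_stable:
        blockers.append("previous stable version not recorded")
    if req.stateful and not req.backup_current:
        blockers.append("backup and restore posture not current")
    if not req.migrations_approved:
        blockers.append("migrations not approved and sequenced")
    if req.production_critical and not req.human_approved:
        blockers.append("missing human approval")
    if not (req.rollback_command and req.rollback_check):
        blockers.append("rollback command or verification check missing")
    return blockers
```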

Required Stage 3 evidence:

promotion command or workflow id;
previous stable version;
new stable version;
production routing state after promotion;
smoke result after promotion;
rollback target retained;
State Hub progress note with non-secret evidence.

Stage 3 is complete only after the post-promotion smoke check passes and the workload's stable routing points at the promoted candidate.

Rollback Expectations

Rollback is part of every promotion, not an afterthought.

Every Stage 2 and Stage 3 action must identify one of:

previous stable Helm release revision;
previous image digest and values file;
previous ingress/routing configuration;
documented manual recovery path when automation is not yet safe.

Rollback must be immediate when:

production availability is degraded;
canary traffic escapes the declared blast radius;
the workload emits repeated authorization or secret-handling errors;
data integrity is at risk;
an operator revokes approval during the observation window.

Rollback may be deferred only when the rollback itself is more dangerous than the incident state. That decision requires a State Hub note and human approval.
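
The trigger and deferral rules can be sketched as one decision function. The trigger labels and parameter names are hypothetical. Deferral is returned only when the rollback is judged riskier than the incident and the required State Hub note and human approval both exist.

```python
IMMEDIATE_ROLLBACK_TRIGGERS = frozenset({
    "availability_degraded",
    "blast_radius_exceeded",
    "repeated_auth_or_secret_errors",
    "data_integrity_at_risk",
    "approval_revoked",
})


def rollback_decision(observed: set[str],
                      rollback_riskier_than_incident: bool = False,
                      deferral_approved_and_noted: bool = False) -> str:
    """Return "continue", "rollback", or "defer" for the observed symptoms."""
    if not (observed & IMMEDIATE_ROLLBACK_TRIGGERS):
        return "continue"
    if rollback_riskier_than_incident and deferral_approved_and_noted:
        return "defer"
    # A risk claim without the recorded note and approval still rolls back.
    return "rollback"
```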

After rollback, record:

triggering symptom;
rollback action;
final stable version;
remaining cleanup;
whether the failed candidate is blocked, abandoned, or returned to Stage 1.

Human Approval Gates

Human approval is required before production traffic changes for production-critical workloads.

Production-critical workloads include:

source forge and package registry workloads such as Forgejo or Gitea;
identity, MFA, SSO, or authorization systems;
State Hub, Inter-Hub, and operator coordination services;
databases, object stores, and backup systems;
ingress, certificate, or cluster-wide policy controllers;
any workload whose failure blocks multiple repos or domains.

Approval must be recorded as a non-secret State Hub note or task comment. The approval record should name:

approving operator;
candidate artifact;
stage being approved;
observation window;
rollback target;
any waived gates and why.

Emergency approval can be retrospective only when delaying the action would increase production risk. Retrospective approval must be recorded immediately after stabilization.

Evidence And Secret Handling

Lifecycle evidence must be useful without being sensitive.

Allowed evidence:

commit ids;
image tags and digests;
workflow ids;
Kubernetes object names;
pod status summaries;
HTTP status codes;
timestamps;
State Hub progress ids;
pass/fail summaries.

Forbidden evidence:

plaintext secrets;
bearer tokens;
static API keys;
kubeconfigs;
private key material;
full environment dumps;
logs that contain credentials or user private data.

When a check needs secret-backed access, record only the access path and result, for example: "OpenBao path configured, token exchange returned 200".
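
A pre-submission scan can catch some forbidden content before evidence is recorded. The patterns below are illustrative only and far from complete. They are assumptions for this sketch, not a vetted rule set, and a clean result does not prove that the text is free of secrets.

```python
import re

# Illustrative patterns only; a real scanner needs a broader, reviewed rule set.
_FORBIDDEN_PATTERNS = {
    "bearer token": re.compile(r"(?i)\bbearer\s+[a-z0-9._~+/-]+=*"),
    "private key material": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "kubeconfig": re.compile(r"(?m)^\s*client-key-data:"),
    "credential assignment": re.compile(
        r"(?i)\b(password|secret|token|api[_-]?key)\s*[=:]\s*\S+"
    ),
}


def forbidden_evidence(text: str) -> list[str]:
    """Return the labels of every forbidden pattern found in the evidence text."""
    return [label for label, pattern in _FORBIDDEN_PATTERNS.items()
            if pattern.search(text)]


print(forbidden_evidence("OpenBao path configured, token exchange returned 200"))  # []
print(forbidden_evidence("Authorization: Bearer abc.def.ghi"))  # ['bearer token']
```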

Forgejo Readiness Interpretation

This lifecycle is clear enough for Forgejo when a future Forgejo workplan can answer these questions before production cutover:

What source revision and image digest are being promoted?
What local checks prove the candidate is viable?
How is the production canary isolated or traffic-limited?
Which health, registry, SSH, web, Actions, and email recovery checks define acceptance?
Who approves the Stage 3 traffic switch?
What is the previous stable target?
How is repository data protected before and after promotion?
How will rollback be verified without losing package or repository state?

If any answer is missing, Forgejo remains in Stage 1 or Stage 2 preparation and must not cut over to Stage 3.

Minimum Command Contract

Future CLI tasks should make these lifecycle operations repeatable:

bin/railiance run <app>              # Stage 1 local validation
bin/railiance deploy --stage 2 <app> # Stage 2 canary deployment
bin/railiance observe <app>          # Stage 2/3 evidence collection
bin/railiance promote <app>          # Stage 3 production promotion
bin/railiance rollback <app>         # rollback to previous stable

The exact command names may change as implementation lands, but the behavior must preserve the stage gates and evidence requirements in this document.
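
To show how the command surface above might be parsed, here is a minimal argparse sketch. It is not the implementation. The subcommand names and the `--stage 2` flag come from the listing above, and everything else, including restricting `--stage` to the value 2, is an assumption. No stage logic is attached.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="railiance")
    sub = parser.add_subparsers(dest="command", required=True)
    # Commands that take only the workload name.
    for name in ("run", "observe", "promote", "rollback"):
        sub.add_parser(name).add_argument("app")
    # Canary deployment names its stage explicitly; only stage 2 is accepted here.
    deploy = sub.add_parser("deploy")
    deploy.add_argument("--stage", type=int, choices=[2], required=True)
    deploy.add_argument("app")
    return parser


args = build_parser().parse_args(["deploy", "--stage", "2", "demo-service"])
print(args.command, args.stage, args.app)  # deploy 2 demo-service
```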
