# Railiance Deployment Lifecycle

This document defines the Railiance three-stage promotion lifecycle for
workloads that run on the Railiance Kubernetes substrate.

The lifecycle exists so that production workloads move through repeatable
gates rather than relying on what an individual operator remembers. It is
intentionally conservative: every stage must leave evidence, every promotion
must have a rollback path, and critical workloads require explicit human
approval before production traffic is changed.

## Scope

This specification is owned by `railiance-cluster` because it defines the
cluster runtime contract for promotion gates, canary validation, production
routing, and rollback expectations.

Repo boundaries:

- `railiance-cluster` owns the lifecycle semantics, cluster prerequisites,
  ingress/routing expectations, and acceptance gates.
- `railiance-apps` owns workload-specific Helm values, application release
  definitions, and production workload configuration.
- `railiance-platform` owns shared platform services such as databases,
  caches, object storage, and backup targets.
- `railiance-enablement` owns developer-facing templates, CI workflows, and
  local ergonomics.
- `railiance-infra` owns host provisioning, OS hardening, SSH, firewall, and
  node bootstrap below Kubernetes.

## Lifecycle Overview

Railiance promotes a workload through three stages:

1. Stage 1: local validation.
2. Stage 2: production canary.
3. Stage 3: production promotion.

The stages are sequential. A workload may return to an earlier stage at any
time, but it must not skip a stage when moving toward production unless an
operator records an emergency exception in State Hub.
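
The sequencing rule can be sketched as a small check. This is only an illustration: the `Stage` enum, the `transition_allowed` function, and the `emergency_exception_recorded` flag are hypothetical names, and the check covers stage ordering only, not the gates each stage must pass.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Hypothetical identifiers for the three lifecycle stages."""
    LOCAL_VALIDATION = 1
    PRODUCTION_CANARY = 2
    PRODUCTION_PROMOTION = 3


def transition_allowed(current: Stage, target: Stage,
                       emergency_exception_recorded: bool = False) -> bool:
    """Check stage ordering only; stage gates are evaluated elsewhere."""
    if target <= current:
        # Returning to an earlier stage (or re-running one) is always allowed.
        return True
    if target == current + 1:
        # Moving forward by exactly one stage never skips a stage.
        return True
    # Skipping a stage needs an emergency exception recorded in State Hub.
    return emergency_exception_recorded
```

For example, `transition_allowed(Stage.LOCAL_VALIDATION, Stage.PRODUCTION_PROMOTION)` is `False` unless the exception flag is set.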

Each stage emits a machine-readable result with:

- workload identity;
- source revision or image digest;
- target environment;
- checks run;
- pass/fail status;
- non-secret evidence references;
- rollback target, when applicable;
- approving human or explicit "not required" decision.
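
One possible shape for such a result is sketched below. The class name and field names are assumptions made for illustration; this document does not fix a wire format, and JSON is used here only because it is easy to inspect.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class StageResult:
    """Hypothetical record mirroring the fields listed above."""
    workload: str
    artifact: str                      # source revision or image digest
    target_environment: str
    checks_run: list[str]
    passed: bool
    evidence_refs: list[str] = field(default_factory=list)  # non-secret only
    rollback_target: Optional[str] = None                    # when applicable
    approval: str = "not required"     # approving human, or explicit decision

    def to_json(self) -> str:
        # Sorted keys keep the output stable across runs for diffing.
        return json.dumps(asdict(self), sort_keys=True)
```

Defaulting `approval` to the literal string `"not required"` keeps the decision explicit in the emitted record rather than leaving the field absent.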

## Workload Declaration

Each participating workload should declare its promotion contract in a
repository-local `railiance/app.toml`. The full schema is defined by the next
workplan task, but this lifecycle expects every workload declaration to provide
at least:

- stable workload name and owning repo;
- source revision, image tag, or image digest policy;
- stage-specific namespaces or release names;
- health checks and observability endpoints;
- ingress or routing targets;
- platform dependencies;
- rollback command or previous-stable reference;
- secret references by name or path, never plaintext secret values.

If a workload cannot provide a machine-readable declaration yet, it may still
use this lifecycle through a written operator runbook, but that is a temporary
compatibility path. The runbook must identify the missing declaration fields.

## Stage 1: Local Validation

Stage 1 proves that the workload can be built, configured, and checked outside
production traffic.

Typical Stage 1 targets:

- local container runtime;
- local Kubernetes such as k3d, kind, or a disposable namespace;
- dry-run Helm rendering;
- unit, integration, migration, and smoke checks that do not require production
  credentials.

Required Stage 1 checks:

- source revision is cleanly identified;
- build or artifact selection is deterministic;
- Helm templates or manifests render without invalid Kubernetes objects;
- local health checks pass;
- required secrets are referenced by name only and are not printed;
- database migrations, if any, are classified as reversible, forward-only, or
  requiring human approval;
- a Stage 2 candidate artifact is named by immutable digest or equivalent
  immutable revision.
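
As an illustration of the last check, the sketch below accepts only image references pinned by a SHA-256 digest in the `name@sha256:<64 hex characters>` form used by OCI registries. The function name is hypothetical, and other "equivalent immutable revision" forms would need their own checks.

```python
import re

# A digest-pinned reference ends with "@sha256:" followed by 64 lowercase hex digits.
_DIGEST_SUFFIX = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True when the image reference names an immutable sha256 digest."""
    return _DIGEST_SUFFIX.search(image_ref) is not None
```

A mutable tag such as `registry.example/app:latest` is rejected, because the same tag can point at different content over time.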

Stage 1 fails closed when:

- local checks are skipped without an approved reason;
- generated manifests contain plaintext secrets;
- the artifact cannot be traced to source;
- the workload cannot state how it will be observed in Stage 2.

Stage 1 completion does not authorize production traffic. It only makes a
workload eligible for Stage 2 review.

## Stage 2: Production Canary

Stage 2 deploys the candidate to production infrastructure with limited or
isolated exposure. The goal is to observe the candidate against real platform
dependencies while keeping blast radius small.

Acceptable canary forms:

- weighted ingress split between stable and canary;
- header-based or path-based routing for operator traffic only;
- shadow deployment receiving replicated non-mutating traffic;
- isolated production namespace with manually triggered probes.

The selected canary form must be declared before deployment. If weighted
routing is unavailable, the fallback must preserve the same safety property:
the candidate can be observed without silently replacing stable production.

Required Stage 2 prechecks:

- Stage 1 result passed for the same candidate artifact;
- cluster connectivity and namespace readiness are verified;
- target image digest or immutable tag exists in the registry;
- Helm server-side dry-run succeeds;
- ingress, certificate, and DNS prerequisites are present where applicable;
- platform dependencies are healthy or explicitly degraded with operator
  approval;
- rollback target is known before the canary is applied;
- monitoring and log queries are available for the canary release.

Required Stage 2 evidence:

- rendered release identity;
- applied namespace and release name;
- pod readiness and restart status;
- ingress or routing state;
- key health endpoint result;
- relevant metrics window or explicit "metrics unavailable" note;
- State Hub progress note with non-secret evidence;
- operator approval when the workload is production-critical.

Canary acceptance gates:

- canary pods remain ready for the configured observation window;
- no crash loops, repeated restarts, or pending pods remain unexplained;
- health checks pass from inside and outside the cluster when both are
  applicable;
- error rate, latency, and saturation do not regress beyond the workload's
  declared threshold;
- no unexpected schema, storage, or queue side effects are observed;
- logs show no secret leakage and no repeated authorization failures;
- rollback has been tested previously or is a single documented command with a
  known previous-stable target.
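
The metric regression gate could be evaluated roughly as follows. This sketch assumes that a workload's declared threshold is a maximum relative increase over the stable release, which is one possible interpretation; the function names and metric keys are hypothetical.

```python
def within_threshold(stable: float, canary: float,
                     max_relative_increase: float) -> bool:
    """True when the canary value is at most (1 + limit) times the stable value."""
    # With a stable value of 0, any non-zero canary value fails the gate.
    return canary <= stable * (1.0 + max_relative_increase)


def regressed_metrics(stable: dict[str, float], canary: dict[str, float],
                      thresholds: dict[str, float]) -> list[str]:
    """Return the names of metrics that regressed beyond their declared limit."""
    return [
        name
        for name, limit in thresholds.items()
        if not within_threshold(stable[name], canary[name], limit)
    ]
```

An empty list means this gate passes. The other gates, such as pod readiness and log review, still have to be checked separately.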

Default observation windows:

- non-critical internal service: 15 minutes;
- user-facing or shared platform service: 30 minutes;
- production-critical infrastructure such as Forgejo, identity, registry, or
  State Hub: operator-defined window, minimum 60 minutes unless explicitly
  waived.
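
The defaults above can be encoded as a lookup. In this sketch the workload class labels and the function signature are assumptions made for illustration; only the durations come from this document.

```python
from datetime import timedelta
from typing import Optional

# Class labels are hypothetical; durations are the defaults listed above.
DEFAULT_WINDOWS = {
    "non-critical-internal": timedelta(minutes=15),
    "user-facing-or-shared": timedelta(minutes=30),
}
CRITICAL_MINIMUM = timedelta(minutes=60)


def observation_window(workload_class: str,
                       operator_window: Optional[timedelta] = None,
                       minimum_waived: bool = False) -> timedelta:
    """Resolve the canary observation window for a workload class."""
    if workload_class == "production-critical":
        if operator_window is None:
            raise ValueError("production-critical workloads need an operator-defined window")
        if operator_window < CRITICAL_MINIMUM and not minimum_waived:
            raise ValueError("window is below the 60 minute minimum and no waiver is recorded")
        return operator_window
    return DEFAULT_WINDOWS[workload_class]
```

Raising an error instead of silently falling back to a default keeps the critical path fail-closed.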

Stage 2 fails closed when:

- the canary cannot be distinguished from stable production;
- production routing changes more traffic than intended;
- any required evidence is missing and no operator waiver is recorded;
- rollback target is unknown;
- the candidate needs a secret, credential, or platform dependency that was not
  declared before the canary.

## Stage 3: Production Promotion

Stage 3 promotes the accepted candidate to the stable production path.

Promotion may mean:

- shifting weighted traffic to the canary release;
- replacing the stable Helm release with the accepted candidate;
- changing an ingress selector or service target;
- activating an operator-approved rollout workflow.

Required Stage 3 prechecks:

- Stage 2 acceptance gates passed for the same candidate artifact;
- the previous stable version is recorded;
- backup and restore posture is current for stateful workloads;
- migrations are approved and sequenced;
- production-critical workloads have explicit human approval;
- a rollback command and rollback verification check are available.

Required Stage 3 evidence:

- promotion command or workflow id;
- previous stable version;
- new stable version;
- production routing state after promotion;
- smoke result after promotion;
- rollback target retained;
- State Hub progress note with non-secret evidence.

Stage 3 is complete only after the post-promotion smoke check passes and the
workload's stable routing points at the promoted candidate.

## Rollback Expectations

Rollback is part of every promotion, not an afterthought.

Every Stage 2 and Stage 3 action must identify one of:

- previous stable Helm release revision;
- previous image digest and values file;
- previous ingress/routing configuration;
- documented manual recovery path when automation is not yet safe.

Rollback must be immediate when:

- production availability is degraded;
- canary traffic escapes the declared blast radius;
- the workload emits repeated authorization or secret-handling errors;
- data integrity is at risk;
- an operator revokes approval during the observation window.

Rollback may be deferred only when the rollback itself is more dangerous than
the incident state. That decision requires a State Hub note and human approval.
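
The immediate-versus-deferred rule can be sketched as a decision function. The trigger labels, the enum, and the parameter names are hypothetical; the logic follows the two paragraphs above, so a deferral is honored only when both the State Hub note and the human approval exist.

```python
from enum import Enum


class RollbackDecision(Enum):
    NOT_NEEDED = "not-needed"
    IMMEDIATE = "immediate"
    DEFERRED = "deferred"


# Hypothetical labels for the five immediate-rollback conditions.
IMMEDIATE_TRIGGERS = frozenset({
    "availability-degraded",
    "blast-radius-escaped",
    "auth-or-secret-errors",
    "data-integrity-risk",
    "approval-revoked",
})


def rollback_decision(observed: set[str],
                      rollback_riskier_than_incident: bool = False,
                      state_hub_note: bool = False,
                      human_approved: bool = False) -> RollbackDecision:
    """Decide whether rollback is needed now, may be deferred, or is not needed."""
    if not observed & IMMEDIATE_TRIGGERS:
        return RollbackDecision.NOT_NEEDED
    if rollback_riskier_than_incident and state_hub_note and human_approved:
        return RollbackDecision.DEFERRED
    # Default to immediate rollback when any deferral condition is missing.
    return RollbackDecision.IMMEDIATE
```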

After rollback, record:

- triggering symptom;
- rollback action;
- final stable version;
- remaining cleanup;
- whether the failed candidate is blocked, abandoned, or returned to Stage 1.

## Human Approval Gates

Human approval is required before production traffic changes for
production-critical workloads.

Production-critical workloads include:

- source forge and package registry workloads such as Forgejo or Gitea;
- identity, MFA, SSO, or authorization systems;
- State Hub, Inter-Hub, and operator coordination services;
- databases, object stores, and backup systems;
- ingress, certificate, or cluster-wide policy controllers;
- any workload whose failure blocks multiple repos or domains.

Approval must be recorded as a non-secret State Hub note or task comment. The
approval record should name:

- approving operator;
- candidate artifact;
- stage being approved;
- observation window;
- rollback target;
- any waived gates and why.

Emergency approval can be retrospective only when delaying the action would
increase production risk. Retrospective approval must be recorded immediately
after stabilization.

## Evidence And Secret Handling

Lifecycle evidence must be useful without being sensitive.

Allowed evidence:

- commit ids;
- image tags and digests;
- workflow ids;
- Kubernetes object names;
- pod status summaries;
- HTTP status codes;
- timestamps;
- State Hub progress ids;
- pass/fail summaries.

Forbidden evidence:

- plaintext secrets;
- bearer tokens;
- static API keys;
- kubeconfigs;
- private key material;
- full environment dumps;
- logs that contain credentials or user private data.

When a check needs secret-backed access, record only the access path and result,
for example: "OpenBao path configured, token exchange returned 200".
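
A pre-publication check on evidence text might look like the sketch below. The patterns are heuristic assumptions that catch a few obvious forms of forbidden material (bearer tokens, PEM private keys, kubeconfig credential fields); they are not a complete secret scanner, and the function name is hypothetical.

```python
import re

# Heuristic patterns only; a match means "do not record this evidence as-is".
FORBIDDEN_PATTERNS = {
    "bearer token": re.compile(r"\bBearer\s+[A-Za-z0-9._~+/-]{20,}=*"),
    "private key material": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "kubeconfig": re.compile(r"client-key-data\s*:|client-certificate-data\s*:"),
}


def forbidden_findings(evidence: str) -> list[str]:
    """Return labels of forbidden material detected in an evidence string."""
    return [
        label
        for label, pattern in FORBIDDEN_PATTERNS.items()
        if pattern.search(evidence)
    ]
```

The example note from this section, "OpenBao path configured, token exchange returned 200", produces no findings because it records only the access path and the result.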

## Forgejo Readiness Interpretation

This lifecycle is sufficiently specified for Forgejo if a future Forgejo
workplan can answer these questions before production cutover:

- What source revision and image digest are being promoted?
- What local checks prove the candidate is viable?
- How is the production canary isolated or traffic-limited?
- Which health, registry, SSH, web, Actions, and email recovery checks define
  acceptance?
- Who approves the Stage 3 traffic switch?
- What is the previous stable target?
- How is repository data protected before and after promotion?
- How will rollback be verified without losing package or repository state?

If any answer is missing, Forgejo remains in Stage 1 or Stage 2 preparation and
must not cut over to Stage 3.

## Minimum Command Contract

Future CLI tasks should make these lifecycle operations repeatable:

```text
bin/railiance run <app>              # Stage 1 local validation
bin/railiance deploy --stage 2 <app> # Stage 2 canary deployment
bin/railiance observe <app>          # Stage 2/3 evidence collection
bin/railiance promote <app>          # Stage 3 production promotion
bin/railiance rollback <app>         # rollback to previous stable
```

The exact command names may change as implementation lands, but the behavior
must preserve the stage gates and evidence requirements in this document.
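
As a rough sketch of how the command surface above could be parsed, the following uses Python's standard `argparse`. It mirrors the provisional command names listed in this section and is not the implementation of `bin/railiance`; the `build_parser` function and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the provisional lifecycle commands."""
    parser = argparse.ArgumentParser(prog="railiance")
    commands = parser.add_subparsers(dest="command", required=True)

    simple = {
        "run": "Stage 1 local validation",
        "observe": "Stage 2/3 evidence collection",
        "promote": "Stage 3 production promotion",
        "rollback": "rollback to previous stable",
    }
    for name, summary in simple.items():
        commands.add_parser(name, help=summary).add_argument("app")

    deploy = commands.add_parser("deploy", help="Stage 2 canary deployment")
    # Only Stage 2 is deployable through this command in the contract above.
    deploy.add_argument("--stage", type=int, choices=[2], required=True)
    deploy.add_argument("app")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```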