# Railiance Deployment Lifecycle

This document defines the Railiance three-stage promotion lifecycle for workloads that run on the Railiance Kubernetes substrate. The lifecycle exists so production workloads move through repeatable gates instead of one-off operator memory. It is intentionally conservative: every stage must leave evidence, every promotion must have a rollback path, and critical workloads require explicit human approval before production traffic is changed.

## Scope

This specification is owned by `railiance-cluster` because it defines the cluster runtime contract for promotion gates, canary validation, production routing, and rollback expectations.

Repo boundaries:

- `railiance-cluster` owns the lifecycle semantics, cluster prerequisites, ingress/routing expectations, and acceptance gates.
- `railiance-apps` owns workload-specific Helm values, application release definitions, and production workload configuration.
- `railiance-platform` owns shared platform services such as databases, caches, object storage, and backup targets.
- `railiance-enablement` owns developer-facing templates, CI workflows, and local ergonomics.
- `railiance-infra` owns host provisioning, OS hardening, SSH, firewall, and node bootstrap below Kubernetes.

## Lifecycle Overview

Railiance promotes a workload through three stages:

1. Stage 1: local validation.
2. Stage 2: production canary.
3. Stage 3: production promotion.

The stages are sequential. A workload may return to an earlier stage at any time, but it must not skip a stage when moving toward production unless an operator records an emergency exception in State Hub.

Each stage emits a machine-readable result with:

- workload identity;
- source revision or image digest;
- target environment;
- checks run;
- pass/fail status;
- non-secret evidence references;
- rollback target, when applicable;
- approving human or explicit "not required" decision.
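
The result fields listed above could be carried in a small record type. The following is a minimal sketch, not a defined schema: the class name `StageResult`, its field names, and the JSON layout are all hypothetical, and treating an empty check set as a failure is an assumption made here to match the fail-closed intent of the lifecycle.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass(frozen=True)
class StageResult:
    """Hypothetical record for one stage run; all field names are illustrative."""

    workload: str                       # workload identity
    artifact: str                       # source revision or image digest
    environment: str                    # target environment
    checks: dict[str, bool]             # check name -> whether it passed
    evidence_refs: list[str] = field(default_factory=list)  # non-secret references only
    rollback_target: Optional[str] = None                   # set when applicable
    approval: str = "not required"      # approving human, or the explicit decision

    @property
    def passed(self) -> bool:
        # Assumption: a result with no recorded checks counts as a failure.
        return bool(self.checks) and all(self.checks.values())

    def to_json(self) -> str:
        # Serialize with a derived pass/fail status for machine consumers.
        payload = asdict(self)
        payload["status"] = "pass" if self.passed else "fail"
        return json.dumps(payload, sort_keys=True)


result = StageResult(
    workload="example-service",
    artifact="sha256:0000",  # placeholder digest, not a real image
    environment="local",
    checks={"helm-render": True, "health": True},
)
print(result.to_json())
```

Deriving `status` from the individual checks, rather than storing it separately, keeps the pass/fail summary from drifting away from the checks that were actually run.
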

## Workload Declaration

Each participating workload should declare its promotion contract in a repository-local `railiance/app.toml`. The full schema is defined by the next workplan task, but this lifecycle expects every workload declaration to provide at least:

- stable workload name and owning repo;
- source revision, image tag, or image digest policy;
- stage-specific namespaces or release names;
- health checks and observability endpoints;
- ingress or routing targets;
- platform dependencies;
- rollback command or previous-stable reference;
- secret references by name or path, never plaintext secret values.

If a workload cannot provide a machine-readable declaration yet, it may still use this lifecycle through a written operator runbook, but that is a temporary compatibility path. The runbook must identify the missing declaration fields.

## Stage 1: Local Validation

Stage 1 proves that the workload can be built, configured, and checked outside production traffic.

Typical Stage 1 targets:

- local container runtime;
- local Kubernetes such as k3d, kind, or a disposable namespace;
- dry-run Helm rendering;
- unit, integration, migration, and smoke checks that do not require production credentials.

Required Stage 1 checks:

- source revision is cleanly identified;
- build or artifact selection is deterministic;
- Helm templates or manifests render without invalid Kubernetes objects;
- local health checks pass;
- required secrets are referenced by name only and are not printed;
- database migrations, if any, are classified as reversible, forward-only, or requiring human approval;
- a Stage 2 candidate artifact is named by immutable digest or equivalent immutable revision.

Stage 1 fails closed when:

- local checks are skipped without an approved reason;
- generated manifests contain plaintext secrets;
- the artifact cannot be traced to source;
- the workload cannot state how it will be observed in Stage 2.

Stage 1 completion does not authorize production traffic.
It only makes a workload eligible for Stage 2 review.

## Stage 2: Production Canary

Stage 2 deploys the candidate to production infrastructure with limited or isolated exposure. The goal is to observe the candidate against real platform dependencies while keeping blast radius small.

Acceptable canary forms:

- weighted ingress split between stable and canary;
- header-based or path-based routing for operator traffic only;
- shadow deployment receiving replicated non-mutating traffic;
- isolated production namespace with manually triggered probes.

The selected canary form must be declared before deployment. If weighted routing is unavailable, the fallback must preserve the same safety property: the candidate can be observed without silently replacing stable production.

Required Stage 2 prechecks:

- Stage 1 result passed for the same candidate artifact;
- cluster connectivity and namespace readiness are verified;
- target image digest or immutable tag exists in the registry;
- Helm server-side dry-run succeeds;
- ingress, certificate, and DNS prerequisites are present where applicable;
- platform dependencies are healthy or explicitly degraded with operator approval;
- rollback target is known before the canary is applied;
- monitoring and log queries are available for the canary release.

Required Stage 2 evidence:

- rendered release identity;
- applied namespace and release name;
- pod readiness and restart status;
- ingress or routing state;
- key health endpoint result;
- relevant metrics window or explicit "metrics unavailable" note;
- State Hub progress note with non-secret evidence;
- operator approval when the workload is production-critical.

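
The Stage 2 evidence list above lends itself to a mechanical completeness check. The sketch below is illustrative only: the evidence keys, the `missing_evidence` function, and the idea of passing waivers as a set of key names are assumptions, not part of any existing Railiance tooling. It follows the lifecycle's rule that missing evidence blocks the stage unless an operator waiver is recorded.

```python
# Hypothetical key names for the required Stage 2 evidence items.
REQUIRED_STAGE2_EVIDENCE = (
    "release_identity",
    "namespace_and_release",
    "pod_status",
    "routing_state",
    "health_endpoint",
    "metrics_window",  # may hold an explicit "metrics unavailable" note
    "state_hub_note",
)


def missing_evidence(
    evidence: dict[str, str],
    production_critical: bool,
    waived: frozenset[str] = frozenset(),
) -> list[str]:
    """Return the required evidence keys that are absent, blank, and not waived."""
    required = list(REQUIRED_STAGE2_EVIDENCE)
    if production_critical:
        # Operator approval is only required for production-critical workloads.
        required.append("operator_approval")
    return [
        key
        for key in required
        if key not in waived and not evidence.get(key, "").strip()
    ]


complete = {key: "recorded" for key in REQUIRED_STAGE2_EVIDENCE}
print(missing_evidence(complete, production_critical=True))
# → ['operator_approval']
```

An empty return value means the evidence set is complete for that workload class; any other value names exactly what still has to be collected or waived.
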

Canary acceptance gates:

- canary pods remain ready for the configured observation window;
- no crash loops, repeated restarts, or pending pods remain unexplained;
- health checks pass from inside and outside the cluster when both are applicable;
- error rate, latency, and saturation do not regress beyond the workload's declared threshold;
- no unexpected schema, storage, or queue side effects are observed;
- logs show no secret leakage and no repeated authorization failures;
- rollback has been tested previously or is a single documented command with a known previous-stable target.

Default observation windows:

- non-critical internal service: 15 minutes;
- user-facing or shared platform service: 30 minutes;
- production-critical infrastructure such as Forgejo, identity, registry, or State Hub: operator-defined window, minimum 60 minutes unless explicitly waived.

Stage 2 fails closed when:

- the canary cannot be distinguished from stable production;
- production routing changes more traffic than intended;
- any required evidence is missing and no operator waiver is recorded;
- rollback target is unknown;
- the candidate needs a secret, credential, or platform dependency that was not declared before the canary.

## Stage 3: Production Promotion

Stage 3 promotes the accepted candidate to the stable production path.

Promotion may mean:

- shifting weighted traffic to the canary release;
- replacing the stable Helm release with the accepted candidate;
- changing an ingress selector or service target;
- activating an operator-approved rollout workflow.

Required Stage 3 prechecks:

- Stage 2 acceptance gates passed for the same candidate artifact;
- the previous stable version is recorded;
- backup and restore posture is current for stateful workloads;
- migrations are approved and sequenced;
- production-critical workloads have explicit human approval;
- a rollback command and rollback verification check are available.

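
The Stage 3 prechecks above can be expressed as a function that returns every blocking reason rather than stopping at the first one. This is a sketch under stated assumptions: `PromotionRequest`, its fields, and `stage3_blockers` are hypothetical names, and the inputs are presumed to come from earlier stage results.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class PromotionRequest:
    """Hypothetical inputs to a Stage 3 precheck; all names are illustrative."""

    candidate_digest: str
    stage2_accepted_digest: Optional[str]  # artifact that passed the canary gates
    previous_stable: Optional[str]
    stateful: bool
    backup_current: bool
    migrations_approved: bool
    production_critical: bool
    approver: Optional[str]
    rollback_command: Optional[str]
    rollback_check: Optional[str]


def stage3_blockers(req: PromotionRequest) -> list[str]:
    """Return human-readable reasons that block promotion; empty means proceed."""
    blockers = []
    if req.stage2_accepted_digest != req.candidate_digest:
        blockers.append("Stage 2 acceptance does not cover this candidate artifact")
    if not req.previous_stable:
        blockers.append("previous stable version is not recorded")
    if req.stateful and not req.backup_current:
        blockers.append("backup and restore posture is not current")
    if not req.migrations_approved:
        blockers.append("migrations are not approved and sequenced")
    if req.production_critical and not req.approver:
        blockers.append("explicit human approval is missing")
    if not (req.rollback_command and req.rollback_check):
        blockers.append("rollback command or verification check is missing")
    return blockers


ready = PromotionRequest(
    candidate_digest="sha256:0000",  # placeholder digest
    stage2_accepted_digest="sha256:0000",
    previous_stable="sha256:1111",   # placeholder digest
    stateful=True,
    backup_current=True,
    migrations_approved=True,
    production_critical=True,
    approver="operator-name",
    rollback_command="documented rollback command",
    rollback_check="post-rollback smoke",
)
print(stage3_blockers(ready))                           # → []
print(stage3_blockers(replace(ready, approver=None)))   # → ['explicit human approval is missing']
```

Collecting all blockers in one pass gives the operator a complete list to resolve before the next promotion attempt.
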

Required Stage 3 evidence:

- promotion command or workflow id;
- previous stable version;
- new stable version;
- production routing state after promotion;
- smoke result after promotion;
- rollback target retained;
- State Hub progress note with non-secret evidence.

Stage 3 is complete only after the post-promotion smoke passes and the workload's stable routing points at the promoted candidate.

## Rollback Expectations

Rollback is part of every promotion, not an afterthought.

Every Stage 2 and Stage 3 action must identify one of:

- previous stable Helm release revision;
- previous image digest and values file;
- previous ingress/routing configuration;
- documented manual recovery path when automation is not yet safe.

Rollback must be immediate when:

- production availability is degraded;
- canary traffic escapes the declared blast radius;
- the workload emits repeated authorization or secret-handling errors;
- data integrity is at risk;
- an operator revokes approval during the observation window.

Rollback may be deferred only when the rollback itself is more dangerous than the incident state. That decision requires a State Hub note and human approval.

After rollback, record:

- triggering symptom;
- rollback action;
- final stable version;
- remaining cleanup;
- whether the failed candidate is blocked, abandoned, or returned to Stage 1.

## Human Approval Gates

Human approval is required before production traffic changes for production-critical workloads.

Production-critical workloads include:

- source forge and package registry workloads such as Forgejo or Gitea;
- identity, MFA, SSO, or authorization systems;
- State Hub, Inter-Hub, and operator coordination services;
- databases, object stores, and backup systems;
- ingress, certificate, or cluster-wide policy controllers;
- any workload whose failure blocks multiple repos or domains.

Approval must be recorded as a non-secret State Hub note or task comment.
The approval record should name:

- approving operator;
- candidate artifact;
- stage being approved;
- observation window;
- rollback target;
- any waived gates and why.

Emergency approval can be retrospective only when delaying the action would increase production risk. Retrospective approval must be recorded immediately after stabilization.

## Evidence And Secret Handling

Lifecycle evidence must be useful without being sensitive.

Allowed evidence:

- commit ids;
- image tags and digests;
- workflow ids;
- Kubernetes object names;
- pod status summaries;
- HTTP status codes;
- timestamps;
- State Hub progress ids;
- pass/fail summaries.

Forbidden evidence:

- plaintext secrets;
- bearer tokens;
- static API keys;
- kubeconfigs;
- private key material;
- full environment dumps;
- logs that contain credentials or user private data.

When a check needs secret-backed access, record only the access path and result, for example: "OpenBao path configured, token exchange returned 200".

## Forgejo Readiness Interpretation

This lifecycle is clear enough for Forgejo when a future Forgejo workplan can answer these questions before production cutover:

- What source revision and image digest are being promoted?
- What local checks prove the candidate is viable?
- How is the production canary isolated or traffic-limited?
- Which health, registry, SSH, web, Actions, and email recovery checks define acceptance?
- Who approves the Stage 3 traffic switch?
- What is the previous stable target?
- How is repository data protected before and after promotion?
- How will rollback be verified without losing package or repository state?

If any answer is missing, Forgejo remains in Stage 1 or Stage 2 preparation and must not cut over to Stage 3.
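
The readiness rule above (any missing answer blocks the Stage 3 cutover) can be sketched as a simple lookup. The short keys, `unanswered`, and `may_cut_over` are hypothetical names chosen for illustration; only the question texts come from this document.

```python
# Question texts are taken from the list above; the short keys are hypothetical.
READINESS_QUESTIONS = {
    "artifact": "What source revision and image digest are being promoted?",
    "local_checks": "What local checks prove the candidate is viable?",
    "canary_isolation": "How is the production canary isolated or traffic-limited?",
    "acceptance_checks": "Which health, registry, SSH, web, Actions, and email recovery checks define acceptance?",
    "approver": "Who approves the Stage 3 traffic switch?",
    "previous_stable": "What is the previous stable target?",
    "data_protection": "How is repository data protected before and after promotion?",
    "rollback_verification": "How will rollback be verified without losing package or repository state?",
}


def unanswered(answers: dict[str, str]) -> list[str]:
    """Return the questions that have no non-blank answer yet."""
    return [
        question
        for key, question in READINESS_QUESTIONS.items()
        if not answers.get(key, "").strip()
    ]


def may_cut_over(answers: dict[str, str]) -> bool:
    """Stage 3 cutover is allowed only when every question has an answer."""
    return not unanswered(answers)


draft = {key: "answered in workplan" for key in READINESS_QUESTIONS}
draft["approver"] = ""  # still open
print(may_cut_over(draft))  # → False
print(unanswered(draft))    # → ['Who approves the Stage 3 traffic switch?']
```
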

## Minimum Command Contract

Future CLI tasks should make these lifecycle operations repeatable:

```text
bin/railiance run                # Stage 1 local validation
bin/railiance deploy --stage 2   # Stage 2 canary deployment
bin/railiance observe            # Stage 2/3 evidence collection
bin/railiance promote            # Stage 3 production promotion
bin/railiance rollback           # rollback to previous stable
```

The exact command names may change as implementation lands, but the behavior must preserve the stage gates and evidence requirements in this document.
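
Whatever the final command names turn out to be, the stage ordering from the Lifecycle Overview reduces to one rule that a CLI could check before acting. The sketch below is an assumption about how that check might look; `transition_allowed` and its parameters are hypothetical and do not describe an existing `bin/railiance` interface.

```python
def transition_allowed(
    highest_passed: int,
    target_stage: int,
    emergency_exception: bool = False,
) -> bool:
    """Decide whether a candidate may enter target_stage.

    highest_passed is the highest stage the same candidate artifact has
    passed so far (0 if none). emergency_exception stands for an operator
    exception that has already been recorded in State Hub.
    """
    if highest_passed not in (0, 1, 2, 3):
        raise ValueError(f"unknown stage state: {highest_passed}")
    if target_stage not in (1, 2, 3):
        raise ValueError(f"unknown target stage: {target_stage}")
    # Returning to an earlier stage, or rerunning the current one, is always allowed.
    if target_stage <= highest_passed:
        return True
    # Moving forward is allowed one stage at a time.
    if target_stage == highest_passed + 1:
        return True
    # Skipping a stage requires a recorded emergency exception.
    return emergency_exception


print(transition_allowed(1, 2))  # → True
print(transition_allowed(1, 3))  # → False
print(transition_allowed(1, 3, emergency_exception=True))  # → True
```

Keying the check on the candidate artifact rather than on the workload means a new image digest starts again from Stage 1, which matches the "same candidate artifact" wording in the Stage 2 and Stage 3 prechecks.
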