Add staged deployment lifecycle spec

2026-06-16 08:01:25 +02:00
parent f30d901758
commit 45e57b0a11
3 changed files with 344 additions and 2 deletions
--- a/docs/README.md
+++ b/docs/README.md
@@ -71,6 +71,10 @@ From two bare Linux servers, a Git repo, and valid credentials, you can rebuild

 - [2026-03-10 — pgpool CrashLoopBackOff on PostgreSQL HA failover](incidents/2026-03-10-pgpool-missing-secret.md)

+## Operations
+
+- [Deployment lifecycle](deployment-lifecycle.md)
+
 ## 👥 Contributing

 See CONTRIBUTING.md for rules, coding style, and workflow.
--- a/docs/deployment-lifecycle.md
+++ b/docs/deployment-lifecycle.md
@@ -0,0 +1,331 @@
+# Railiance Deployment Lifecycle
+
+This document defines the Railiance three-stage promotion lifecycle for
+workloads that run on the Railiance Kubernetes substrate.
+
+The lifecycle exists so production workloads move through repeatable gates
+instead of relying on individual operators' memory. It is intentionally
+conservative: every stage must leave evidence, every promotion must have a
+rollback path, and critical workloads require explicit human approval before
+production traffic is changed.
+
+## Scope
+
+This specification is owned by `railiance-cluster` because it defines the
+cluster runtime contract for promotion gates, canary validation, production
+routing, and rollback expectations.
+
+Repo boundaries:
+
+- `railiance-cluster` owns the lifecycle semantics, cluster prerequisites,
+  ingress/routing expectations, and acceptance gates.
+- `railiance-apps` owns workload-specific Helm values, application release
+  definitions, and production workload configuration.
+- `railiance-platform` owns shared platform services such as databases,
+  caches, object storage, and backup targets.
+- `railiance-enablement` owns developer-facing templates, CI workflows, and
+  local ergonomics.
+- `railiance-infra` owns host provisioning, OS hardening, SSH, firewall, and
+  node bootstrap below Kubernetes.
+
+## Lifecycle Overview
+
+Railiance promotes a workload through three stages:
+
+1. Stage 1: local validation.
+2. Stage 2: production canary.
+3. Stage 3: production promotion.
+
+The stages are sequential. A workload may return to an earlier stage at any
+time, but it must not skip a stage when moving toward production unless an
+operator records an emergency exception in State Hub.
+
+Each stage emits a machine-readable result with:
+
+- workload identity;
+- source revision or image digest;
+- target environment;
+- checks run;
+- pass/fail status;
+- non-secret evidence references;
+- rollback target, when applicable;
+- approving human or explicit "not required" decision.
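+
+As a minimal sketch, the result fields above could be carried in a small
+record that serializes to JSON. The class name, field names, and example
+values are illustrative assumptions, not a schema defined by this document.
+
+```python
+import json
+from dataclasses import asdict, dataclass, field
+from typing import Optional
+
+
+@dataclass
+class StageResult:
+    """Hypothetical stage result; field names are assumptions."""
+
+    workload: str
+    artifact: str                  # source revision or image digest
+    target_environment: str
+    stage: int                     # 1, 2, or 3
+    checks_run: list[str]
+    passed: bool
+    evidence_refs: list[str] = field(default_factory=list)  # non-secret only
+    rollback_target: Optional[str] = None  # set when applicable
+    approval: str = "not required"  # approving human or explicit decision
+
+    def to_json(self) -> str:
+        # Sorted keys keep the output stable for diffing and archiving.
+        return json.dumps(asdict(self), sort_keys=True)
+
+
+# Example values are placeholders, not real workloads or digests.
+result = StageResult(
+    workload="example-app",
+    artifact="sha256:0123abcd",
+    target_environment="local",
+    stage=1,
+    checks_run=["helm-render", "smoke"],
+    passed=True,
+)
+print(result.to_json())
+```
+
+Keeping `approval` as a required-by-default string means a missing approval
+is never confused with an explicit "not required" decision.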
+
+## Workload Declaration
+
+Each participating workload should declare its promotion contract in a
+repository-local `railiance/app.toml`. The full schema will be defined by task
+T02 of workplan RAIL-BS-WP-0006, but this lifecycle expects every workload
+declaration to provide at least:
+
+- stable workload name and owning repo;
+- source revision, image tag, or image digest policy;
+- stage-specific namespaces or release names;
+- health checks and observability endpoints;
+- ingress or routing targets;
+- platform dependencies;
+- rollback command or previous-stable reference;
+- secret references by name or path, never plaintext secret values.
+
+If a workload cannot provide a machine-readable declaration yet, it may still
+use this lifecycle through a written operator runbook, but that is a temporary
+compatibility path. The runbook must identify the missing declaration fields.
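+
+One way to identify missing declaration fields is a small check over the
+parsed TOML. This is a sketch only: the key names below are hypothetical
+stand-ins for the minimum fields listed above, since the real `app.toml`
+schema is not defined yet.
+
+```python
+import tomllib  # standard library from Python 3.11
+
+# Hypothetical key names; the real schema is left to a later workplan task.
+REQUIRED_KEYS = (
+    "name",
+    "owning_repo",
+    "artifact_policy",
+    "stages",
+    "health_checks",
+    "routing",
+    "dependencies",
+    "rollback",
+    "secret_refs",
+)
+
+
+def missing_declaration_fields(toml_text: str) -> list[str]:
+    """Return the required top-level keys absent from a declaration."""
+    declaration = tomllib.loads(toml_text)
+    return [key for key in REQUIRED_KEYS if key not in declaration]
+
+
+# Placeholder declaration that omits several required keys.
+EXAMPLE = """
+name = "example-app"
+owning_repo = "railiance-apps"
+artifact_policy = "digest"
+health_checks = ["/healthz"]
+secret_refs = ["example-app/db-password"]
+"""
+
+print(missing_declaration_fields(EXAMPLE))
+```
+
+For the example declaration this reports `stages`, `routing`,
+`dependencies`, and `rollback` as missing, which is the list a temporary
+runbook would need to carry.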
+
+## Stage 1: Local Validation
+
+Stage 1 proves that the workload can be built, configured, and checked outside
+production traffic.
+
+Typical Stage 1 targets:
+
+- local container runtime;
+- local Kubernetes such as k3d, kind, or a disposable namespace;
+- dry-run Helm rendering;
+- unit, integration, migration, and smoke checks that do not require production
+  credentials.
+
+Required Stage 1 checks:
+
+- source revision is cleanly identified;
+- build or artifact selection is deterministic;
+- Helm templates or manifests render without invalid Kubernetes objects;
+- local health checks pass;
+- required secrets are referenced by name only and are not printed;
+- database migrations, if any, are classified as reversible, forward-only, or
+  requiring human approval;
+- a Stage 2 candidate artifact is named by immutable digest or equivalent
+  immutable revision.
+
+Stage 1 fails closed when:
+
+- local checks are skipped without an approved reason;
+- generated manifests contain plaintext secrets;
+- the artifact cannot be traced to source;
+- the workload cannot state how it will be observed in Stage 2.
+
+Stage 1 completion does not authorize production traffic. It only makes a
+workload eligible for Stage 2 review.
+
+## Stage 2: Production Canary
+
+Stage 2 deploys the candidate to production infrastructure with limited or
+isolated exposure. The goal is to observe the candidate against real platform
+dependencies while keeping blast radius small.
+
+Acceptable canary forms:
+
+- weighted ingress split between stable and canary;
+- header-based or path-based routing for operator traffic only;
+- shadow deployment receiving replicated non-mutating traffic;
+- isolated production namespace with manually triggered probes.
+
+The selected canary form must be declared before deployment. If weighted
+routing is unavailable, the fallback must preserve the same safety property:
+the candidate can be observed without silently replacing stable production.
+
+Required Stage 2 prechecks:
+
+- Stage 1 result passed for the same candidate artifact;
+- cluster connectivity and namespace readiness are verified;
+- target image digest or immutable tag exists in the registry;
+- Helm server-side dry-run succeeds;
+- ingress, certificate, and DNS prerequisites are present where applicable;
+- platform dependencies are healthy or explicitly degraded with operator
+  approval;
+- rollback target is known before the canary is applied;
+- monitoring and log queries are available for the canary release.
+
+Required Stage 2 evidence:
+
+- rendered release identity;
+- applied namespace and release name;
+- pod readiness and restart status;
+- ingress or routing state;
+- key health endpoint result;
+- relevant metrics window or explicit "metrics unavailable" note;
+- State Hub progress note with non-secret evidence;
+- operator approval when the workload is production-critical.
+
+Canary acceptance gates:
+
+- canary pods remain ready for the configured observation window;
+- no crash loops, repeated restarts, or pending pods remain unexplained;
+- health checks pass from inside and outside the cluster when both are
+  applicable;
+- error rate, latency, and saturation do not regress beyond the workload's
+  declared threshold;
+- no unexpected schema, storage, or queue side effects are observed;
+- logs show no secret leakage and no repeated authorization failures;
+- rollback has been tested previously or is a single documented command with a
+  known previous-stable target.
+
+Default observation windows:
+
+- non-critical internal service: 15 minutes;
+- user-facing or shared platform service: 30 minutes;
+- production-critical infrastructure such as Forgejo, identity, registry, or
+  State Hub: operator-defined window, minimum 60 minutes unless explicitly
+  waived.
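+
+The default windows can be sketched as a lookup with a floor for
+production-critical workloads. The class labels, function name, and the
+choice to let operators override non-critical defaults freely are assumptions
+made for illustration.
+
+```python
+from datetime import timedelta
+from typing import Optional
+
+# Hypothetical class labels mapped to the defaults listed above.
+DEFAULT_WINDOWS = {
+    "internal": timedelta(minutes=15),
+    "user-facing": timedelta(minutes=30),
+    "critical": timedelta(minutes=60),  # minimum, not a fixed value
+}
+
+
+def observation_window(
+    workload_class: str,
+    operator_window: Optional[timedelta] = None,
+    waived: bool = False,
+) -> timedelta:
+    """Pick the observation window, enforcing the critical minimum."""
+    default = DEFAULT_WINDOWS[workload_class]
+    if operator_window is None:
+        return default
+    if workload_class == "critical" and operator_window < default and not waived:
+        # Going below the minimum requires an explicit, recorded waiver.
+        raise ValueError("critical workloads need at least 60 minutes unless waived")
+    return operator_window
+
+
+print(observation_window("internal"))
+print(observation_window("critical", timedelta(minutes=90)))
+```
+
+Raising instead of silently clamping keeps the gate fail-closed: a shorter
+window for a critical workload only passes when the waiver is stated.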
+
+Stage 2 fails closed when:
+
+- the canary cannot be distinguished from stable production;
+- production routing changes more traffic than intended;
+- any required evidence is missing and no operator waiver is recorded;
+- rollback target is unknown;
+- the candidate needs a secret, credential, or platform dependency that was not
+  declared before the canary.
+
+## Stage 3: Production Promotion
+
+Stage 3 promotes the accepted candidate to the stable production path.
+
+Promotion may mean:
+
+- shifting weighted traffic to the canary release;
+- replacing the stable Helm release with the accepted candidate;
+- changing an ingress selector or service target;
+- activating an operator-approved rollout workflow.
+
+Required Stage 3 prechecks:
+
+- Stage 2 acceptance gates passed for the same candidate artifact;
+- the previous stable version is recorded;
+- backup and restore posture is current for stateful workloads;
+- migrations are approved and sequenced;
+- production-critical workloads have explicit human approval;
+- a rollback command and rollback verification check are available.
+
+Required Stage 3 evidence:
+
+- promotion command or workflow id;
+- previous stable version;
+- new stable version;
+- production routing state after promotion;
+- smoke result after promotion;
+- rollback target retained;
+- State Hub progress note with non-secret evidence.
+
+Stage 3 is complete only after the post-promotion smoke passes and the
+workload's stable routing points at the promoted candidate.
+
+## Rollback Expectations
+
+Rollback is part of every promotion, not an afterthought.
+
+Every Stage 2 and Stage 3 action must identify one of:
+
+- previous stable Helm release revision;
+- previous image digest and values file;
+- previous ingress/routing configuration;
+- documented manual recovery path when automation is not yet safe.
+
+Rollback must be immediate when:
+
+- production availability is degraded;
+- canary traffic escapes the declared blast radius;
+- the workload emits repeated authorization or secret-handling errors;
+- data integrity is at risk;
+- an operator revokes approval during the observation window.
+
+Rollback may be deferred only when the rollback itself is more dangerous than
+the incident state. That decision requires a State Hub note and human approval.
+
+After rollback, record:
+
+- triggering symptom;
+- rollback action;
+- final stable version;
+- remaining cleanup;
+- whether the failed candidate is blocked, abandoned, or returned to Stage 1.
+
+## Human Approval Gates
+
+Human approval is required before production traffic changes for
+production-critical workloads.
+
+Production-critical workloads include:
+
+- source forge and package registry workloads such as Forgejo or Gitea;
+- identity, MFA, SSO, or authorization systems;
+- State Hub, Inter-Hub, and operator coordination services;
+- databases, object stores, and backup systems;
+- ingress, certificate, or cluster-wide policy controllers;
+- any workload whose failure blocks multiple repos or domains.
+
+Approval must be recorded as a non-secret State Hub note or task comment. The
+approval record should name:
+
+- approving operator;
+- candidate artifact;
+- stage being approved;
+- observation window;
+- rollback target;
+- any waived gates and why.
+
+Emergency approval can be retrospective only when delaying the action would
+increase production risk. Retrospective approval must be recorded immediately
+after stabilization.
+
+## Evidence And Secret Handling
+
+Lifecycle evidence must be useful without being sensitive.
+
+Allowed evidence:
+
+- commit ids;
+- image tags and digests;
+- workflow ids;
+- Kubernetes object names;
+- pod status summaries;
+- HTTP status codes;
+- timestamps;
+- State Hub progress ids;
+- pass/fail summaries.
+
+Forbidden evidence:
+
+- plaintext secrets;
+- bearer tokens;
+- static API keys;
+- kubeconfigs;
+- private key material;
+- full environment dumps;
+- logs that contain credentials or user private data.
+
+When a check needs secret-backed access, record only the access path and result,
+for example: "OpenBao path configured, token exchange returned 200".
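+
+A pre-submission screen for evidence text might look like the sketch below.
+The patterns are illustrative assumptions covering only a few of the
+forbidden categories; they are not a complete or vetted secret scanner.
+
+```python
+import re
+
+# Illustrative patterns only; real screening would need a vetted ruleset.
+FORBIDDEN_PATTERNS = {
+    "bearer token": re.compile(r"\bBearer\s+[A-Za-z0-9._~+/=-]{8,}"),
+    "private key material": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
+    "kubeconfig": re.compile(r"^\s*client-key-data:", re.MULTILINE),
+}
+
+
+def forbidden_findings(evidence: str) -> list[str]:
+    """Return labels of forbidden categories detected in evidence text."""
+    return [
+        label
+        for label, pattern in FORBIDDEN_PATTERNS.items()
+        if pattern.search(evidence)
+    ]
+
+
+# The allowed example from this section produces no findings.
+print(forbidden_findings("OpenBao path configured, token exchange returned 200"))
+# A raw Authorization header value is flagged.
+print(forbidden_findings("Authorization: Bearer abcdefgh12345678"))
+```
+
+A screen like this only reports the category that matched, so the finding
+itself can be recorded without repeating the sensitive value.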
+
+## Forgejo Readiness Interpretation
+
+This lifecycle is clear enough to apply to Forgejo once a future Forgejo
+workplan can answer these questions before production cutover:
+
+- What source revision and image digest are being promoted?
+- What local checks prove the candidate is viable?
+- How is the production canary isolated or traffic-limited?
+- Which health, registry, SSH, web, Actions, and email recovery checks define
+  acceptance?
+- Who approves the Stage 3 traffic switch?
+- What is the previous stable target?
+- How is repository data protected before and after promotion?
+- How will rollback be verified without losing package or repository state?
+
+If any answer is missing, Forgejo remains in Stage 1 or Stage 2 preparation and
+must not cut over to Stage 3.
+
+## Minimum Command Contract
+
+Future CLI tasks should make these lifecycle operations repeatable:
+
+```text
+bin/railiance run <app>              # Stage 1 local validation
+bin/railiance deploy --stage 2 <app> # Stage 2 canary deployment
+bin/railiance observe <app>          # Stage 2/3 evidence collection
+bin/railiance promote <app>          # Stage 3 production promotion
+bin/railiance rollback <app>         # rollback to previous stable
+```
+
+The exact command names may change as implementation lands, but the behavior
+must preserve the stage gates and evidence requirements in this document.
+
--- a/workplans/RAIL-BS-WP-0006-staged-promotion-lifecycle.md
+++ b/workplans/RAIL-BS-WP-0006-staged-promotion-lifecycle.md
@@ -10,7 +10,7 @@ topic_slug: railiance
 repo_goal_id: "6ea441f7-7fe3-4598-922b-38baf20c0580"
 state_hub_workstream_id: "cb72d3ba-1863-43c2-a2a5-49ac75fc2603"
 created: "2026-02-24"
-updated: "2026-05-03"
+updated: "2026-06-16"
 ---

 # Staged Promotion Lifecycle
@@ -54,7 +54,7 @@ Expected cross-repo handoffs:

 ```task
 id: RAIL-BS-WP-0006-T01
-status: todo
+status: done
 priority: high
 state_hub_task_id: "fbfc341f-8ccb-4950-a85d-3e59c4f5b87f"
 ```
@@ -72,6 +72,13 @@ The spec should define:
 **Done when:** the lifecycle is clear enough to apply to Forgejo as a later
 production workload.

+2026-06-16: Added `docs/deployment-lifecycle.md` and linked it from
+`docs/README.md`. The specification defines Stage 1 local validation, Stage 2
+production canary, Stage 3 production promotion, required checks and evidence,
+canary acceptance gates, rollback expectations, human approval gates for
+production-critical workloads, and the Forgejo readiness questions that must be
+answered before cutover.
+
 ---

 ### T02 - Define railiance directory schema and app.toml contract