diff --git a/docs/README.md b/docs/README.md
index 4f1483d..415bf43 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -71,6 +71,10 @@ From two bare Linux servers, a Git repo, and valid credentials, you can rebuild
 
 - [2026-03-10 — pgpool CrashLoopBackOff on PostgreSQL HA failover](incidents/2026-03-10-pgpool-missing-secret.md)
 
+## Operations
+
+- [Deployment lifecycle](deployment-lifecycle.md)
+
 ## 👥 Contributing
 
 See CONTRIBUTING.md for rules, coding style, and workflow.
diff --git a/docs/deployment-lifecycle.md b/docs/deployment-lifecycle.md
new file mode 100644
index 0000000..c61a404
--- /dev/null
+++ b/docs/deployment-lifecycle.md
@@ -0,0 +1,331 @@
+# Railiance Deployment Lifecycle
+
+This document defines the Railiance three-stage promotion lifecycle for
+workloads that run on the Railiance Kubernetes substrate.
+
+The lifecycle exists so production workloads move through repeatable gates
+instead of one-off operator memory. It is intentionally conservative: every
+stage must leave evidence, every promotion must have a rollback path, and
+critical workloads require explicit human approval before production traffic is
+changed.
+
+## Scope
+
+This specification is owned by `railiance-cluster` because it defines the
+cluster runtime contract for promotion gates, canary validation, production
+routing, and rollback expectations.
+
+Repo boundaries:
+
+- `railiance-cluster` owns the lifecycle semantics, cluster prerequisites,
+  ingress/routing expectations, and acceptance gates.
+- `railiance-apps` owns workload-specific Helm values, application release
+  definitions, and production workload configuration.
+- `railiance-platform` owns shared platform services such as databases,
+  caches, object storage, and backup targets.
+- `railiance-enablement` owns developer-facing templates, CI workflows, and
+  local ergonomics.
+- `railiance-infra` owns host provisioning, OS hardening, SSH, firewall, and
+  node bootstrap below Kubernetes.
+
+## Lifecycle Overview
+
+Railiance promotes a workload through three stages:
+
+1. Stage 1: local validation.
+2. Stage 2: production canary.
+3. Stage 3: production promotion.
+
+The stages are sequential. A workload may return to an earlier stage at any
+time, but it must not skip a stage when moving toward production unless an
+operator records an emergency exception in State Hub.
+
+Each stage emits a machine-readable result with:
+
+- workload identity;
+- source revision or image digest;
+- target environment;
+- checks run;
+- pass/fail status;
+- non-secret evidence references;
+- rollback target, when applicable;
+- approving human or explicit "not required" decision.
+
+## Workload Declaration
+
+Each participating workload should declare its promotion contract in a
+repository-local `railiance/app.toml`. The full schema is defined by the next
+workplan task, but this lifecycle expects every workload declaration to provide
+at least:
+
+- stable workload name and owning repo;
+- source revision, image tag, or image digest policy;
+- stage-specific namespaces or release names;
+- health checks and observability endpoints;
+- ingress or routing targets;
+- platform dependencies;
+- rollback command or previous-stable reference;
+- secret references by name or path, never plaintext secret values.
+
+If a workload cannot provide a machine-readable declaration yet, it may still
+use this lifecycle through a written operator runbook, but that is a temporary
+compatibility path. The runbook must identify the missing declaration fields.
+
+## Stage 1: Local Validation
+
+Stage 1 proves that the workload can be built, configured, and checked outside
+production traffic.
+
+Typical Stage 1 targets:
+
+- local container runtime;
+- local Kubernetes such as k3d, kind, or a disposable namespace;
+- dry-run Helm rendering;
+- unit, integration, migration, and smoke checks that do not require production
+  credentials.
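
The machine-readable stage result listed under "Lifecycle Overview" could be modelled roughly as follows. This is only a sketch: the diff does not define a result schema, so the class name `StageResult` and every field name here are assumptions chosen to mirror the bullet list, not the project's actual format.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class StageResult:
    """Hypothetical record for one lifecycle stage (field names are assumed)."""

    workload: str                      # workload identity
    artifact: str                      # source revision or image digest
    environment: str                   # target environment
    stage: int                         # 1, 2, or 3
    checks: dict = field(default_factory=dict)         # check name -> passed?
    evidence_refs: list = field(default_factory=list)  # non-secret references
    rollback_target: Optional[str] = None               # when applicable
    approval: str = "not required"     # approving human or explicit decision

    @property
    def passed(self) -> bool:
        # Fail closed: a result with no recorded checks does not pass.
        return bool(self.checks) and all(self.checks.values())

    def to_json(self) -> str:
        payload = asdict(self)
        payload["status"] = "pass" if self.passed else "fail"
        return json.dumps(payload, sort_keys=True)
```

Treating an empty check set as a failure matches the document's "fails closed" language; a real implementation might instead distinguish "skipped with approved reason" from "not run".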
+
+Required Stage 1 checks:
+
+- source revision is cleanly identified;
+- build or artifact selection is deterministic;
+- Helm templates or manifests render without invalid Kubernetes objects;
+- local health checks pass;
+- required secrets are referenced by name only and are not printed;
+- database migrations, if any, are classified as reversible, forward-only, or
+  requiring human approval;
+- a Stage 2 candidate artifact is named by immutable digest or equivalent
+  immutable revision.
+
+Stage 1 fails closed when:
+
+- local checks are skipped without an approved reason;
+- generated manifests contain plaintext secrets;
+- the artifact cannot be traced to source;
+- the workload cannot state how it will be observed in Stage 2.
+
+Stage 1 completion does not authorize production traffic. It only makes a
+workload eligible for Stage 2 review.
+
+## Stage 2: Production Canary
+
+Stage 2 deploys the candidate to production infrastructure with limited or
+isolated exposure. The goal is to observe the candidate against real platform
+dependencies while keeping blast radius small.
+
+Acceptable canary forms:
+
+- weighted ingress split between stable and canary;
+- header-based or path-based routing for operator traffic only;
+- shadow deployment receiving replicated non-mutating traffic;
+- isolated production namespace with manually triggered probes.
+
+The selected canary form must be declared before deployment. If weighted
+routing is unavailable, the fallback must preserve the same safety property:
+the candidate can be observed without silently replacing stable production.
+
+Required Stage 2 prechecks:
+
+- Stage 1 result passed for the same candidate artifact;
+- cluster connectivity and namespace readiness are verified;
+- target image digest or immutable tag exists in the registry;
+- Helm server-side dry-run succeeds;
+- ingress, certificate, and DNS prerequisites are present where applicable;
+- platform dependencies are healthy or explicitly degraded with operator
+  approval;
+- rollback target is known before the canary is applied;
+- monitoring and log queries are available for the canary release.
+
+Required Stage 2 evidence:
+
+- rendered release identity;
+- applied namespace and release name;
+- pod readiness and restart status;
+- ingress or routing state;
+- key health endpoint result;
+- relevant metrics window or explicit "metrics unavailable" note;
+- State Hub progress note with non-secret evidence;
+- operator approval when the workload is production-critical.
+
+Canary acceptance gates:
+
+- canary pods remain ready for the configured observation window;
+- no crash loops, repeated restarts, or pending pods remain unexplained;
+- health checks pass from inside and outside the cluster when both are
+  applicable;
+- error rate, latency, and saturation do not regress beyond the workload's
+  declared threshold;
+- no unexpected schema, storage, or queue side effects are observed;
+- logs show no secret leakage and no repeated authorization failures;
+- rollback has been tested previously or is a single documented command with a
+  known previous-stable target.
+
+Default observation windows:
+
+- non-critical internal service: 15 minutes;
+- user-facing or shared platform service: 30 minutes;
+- production-critical infrastructure such as Forgejo, identity, registry, or
+  State Hub: operator-defined window, minimum 60 minutes unless explicitly
+  waived.
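
The default observation windows above can be expressed as a small lookup with a guard on the production-critical minimum. This is a sketch under stated assumptions: the class labels (`internal`, `user-facing`, `production-critical`) and the function name are hypothetical, and falling back to 60 minutes when the operator gives no window is an interpretation, not something the specification states.

```python
from datetime import timedelta
from typing import Optional

# Defaults taken from the list above; the dictionary keys are assumed labels.
DEFAULT_WINDOWS = {
    "internal": timedelta(minutes=15),
    "user-facing": timedelta(minutes=30),
    "production-critical": timedelta(minutes=60),
}


def observation_window(
    workload_class: str,
    requested: Optional[timedelta] = None,
    waived: bool = False,
) -> timedelta:
    """Return the canary observation window for a workload class."""
    default = DEFAULT_WINDOWS[workload_class]  # KeyError on unknown classes
    if requested is None:
        return default
    # Only production-critical workloads have a hard minimum in the spec.
    if workload_class == "production-critical" and requested < default and not waived:
        raise ValueError(
            "production-critical window below 60 minutes requires an explicit waiver"
        )
    return requested
```

Unknown workload classes raise rather than silently receiving a short window, which keeps the helper consistent with the fail-closed posture of Stage 2.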
+
+Stage 2 fails closed when:
+
+- the canary cannot be distinguished from stable production;
+- production routing changes more traffic than intended;
+- any required evidence is missing and no operator waiver is recorded;
+- rollback target is unknown;
+- the candidate needs a secret, credential, or platform dependency that was not
+  declared before the canary.
+
+## Stage 3: Production Promotion
+
+Stage 3 promotes the accepted candidate to the stable production path.
+
+Promotion may mean:
+
+- shifting weighted traffic to the canary release;
+- replacing the stable Helm release with the accepted candidate;
+- changing an ingress selector or service target;
+- activating an operator-approved rollout workflow.
+
+Required Stage 3 prechecks:
+
+- Stage 2 acceptance gates passed for the same candidate artifact;
+- the previous stable version is recorded;
+- backup and restore posture is current for stateful workloads;
+- migrations are approved and sequenced;
+- production-critical workloads have explicit human approval;
+- a rollback command and rollback verification check are available.
+
+Required Stage 3 evidence:
+
+- promotion command or workflow id;
+- previous stable version;
+- new stable version;
+- production routing state after promotion;
+- smoke result after promotion;
+- rollback target retained;
+- State Hub progress note with non-secret evidence.
+
+Stage 3 is complete only after the post-promotion smoke passes and the
+workload's stable routing points at the promoted candidate.
+
+## Rollback Expectations
+
+Rollback is part of every promotion, not an afterthought.
+
+Every Stage 2 and Stage 3 action must identify one of:
+
+- previous stable Helm release revision;
+- previous image digest and values file;
+- previous ingress/routing configuration;
+- documented manual recovery path when automation is not yet safe.
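
The "must identify one of" rule for rollback targets lends itself to a simple pre-flight check. The sketch below is illustrative only: `RollbackTarget`, its fields, and the returned labels are hypothetical names, and the real declaration format is left to the `app.toml` schema task.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RollbackTarget:
    """Hypothetical container for the four rollback options listed above."""

    helm_revision: Optional[int] = None    # previous stable Helm release revision
    image_digest: Optional[str] = None     # previous image digest ...
    values_file: Optional[str] = None      # ... together with its values file
    routing_config: Optional[str] = None   # previous ingress/routing configuration
    manual_runbook: Optional[str] = None   # documented manual recovery path

    def kinds(self) -> List[str]:
        found = []
        if self.helm_revision is not None:
            found.append("helm-revision")
        # A digest without its values file is not a complete target.
        if self.image_digest and self.values_file:
            found.append("image-and-values")
        if self.routing_config:
            found.append("routing")
        if self.manual_runbook:
            found.append("manual")
        return found


def require_rollback_target(target: RollbackTarget) -> List[str]:
    """Fail closed when no usable rollback target has been identified."""
    kinds = target.kinds()
    if not kinds:
        raise ValueError("rollback target is unknown; refusing to proceed")
    return kinds
```

A Stage 2 or Stage 3 action would call `require_rollback_target` before applying anything, so an unknown target stops the action instead of being discovered during an incident.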
+
+Rollback must be immediate when:
+
+- production availability is degraded;
+- canary traffic escapes the declared blast radius;
+- the workload emits repeated authorization or secret-handling errors;
+- data integrity is at risk;
+- an operator revokes approval during the observation window.
+
+Rollback may be deferred only when the rollback itself is more dangerous than
+the incident state. That decision requires a State Hub note and human approval.
+
+After rollback, record:
+
+- triggering symptom;
+- rollback action;
+- final stable version;
+- remaining cleanup;
+- whether the failed candidate is blocked, abandoned, or returned to Stage 1.
+
+## Human Approval Gates
+
+Human approval is required before production traffic changes for
+production-critical workloads.
+
+Production-critical workloads include:
+
+- source forge and package registry workloads such as Forgejo or Gitea;
+- identity, MFA, SSO, or authorization systems;
+- State Hub, Inter-Hub, and operator coordination services;
+- databases, object stores, and backup systems;
+- ingress, certificate, or cluster-wide policy controllers;
+- any workload whose failure blocks multiple repos or domains.
+
+Approval must be recorded as a non-secret State Hub note or task comment. The
+approval record should name:
+
+- approving operator;
+- candidate artifact;
+- stage being approved;
+- observation window;
+- rollback target;
+- any waived gates and why.
+
+Emergency approval can be retrospective only when delaying the action would
+increase production risk. Retrospective approval must be recorded immediately
+after stabilization.
+
+## Evidence And Secret Handling
+
+Lifecycle evidence must be useful without being sensitive.
+
+Allowed evidence:
+
+- commit ids;
+- image tags and digests;
+- workflow ids;
+- Kubernetes object names;
+- pod status summaries;
+- HTTP status codes;
+- timestamps;
+- State Hub progress ids;
+- pass/fail summaries.
+
+Forbidden evidence:
+
+- plaintext secrets;
+- bearer tokens;
+- static API keys;
+- kubeconfigs;
+- private key material;
+- full environment dumps;
+- logs that contain credentials or user private data.
+
+When a check needs secret-backed access, record only the access path and result,
+for example: "OpenBao path configured, token exchange returned 200".
+
+## Forgejo Readiness Interpretation
+
+This lifecycle is clear enough for Forgejo when a future Forgejo workplan can
+answer these questions before production cutover:
+
+- What source revision and image digest are being promoted?
+- What local checks prove the candidate is viable?
+- How is the production canary isolated or traffic-limited?
+- Which health, registry, SSH, web, Actions, and email recovery checks define
+  acceptance?
+- Who approves the Stage 3 traffic switch?
+- What is the previous stable target?
+- How is repository data protected before and after promotion?
+- How will rollback be verified without losing package or repository state?
+
+If any answer is missing, Forgejo remains in Stage 1 or Stage 2 preparation and
+must not cut over to Stage 3.
+
+## Minimum Command Contract
+
+Future CLI tasks should make these lifecycle operations repeatable:
+
+```text
+bin/railiance run               # Stage 1 local validation
+bin/railiance deploy --stage 2  # Stage 2 canary deployment
+bin/railiance observe           # Stage 2/3 evidence collection
+bin/railiance promote           # Stage 3 production promotion
+bin/railiance rollback          # rollback to previous stable
+```
+
+The exact command names may change as implementation lands, but the behavior
+must preserve the stage gates and evidence requirements in this document.
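
Whatever the final command names turn out to be, the gate behavior they must preserve can be sketched independently of the CLI. The example below assumes stage results are plain dictionaries with `stage`, `artifact`, and `status` keys; those keys and both function names are hypothetical, and requiring approval from Stage 2 onward for production-critical workloads is one reading of the approval rules in this document.

```python
from typing import Dict, List, Optional


def next_allowed_stage(results: List[Dict], candidate: str) -> int:
    """Highest stage the candidate may enter, given its recorded passes."""
    passed = {
        r["stage"]
        for r in results
        if r["artifact"] == candidate and r["status"] == "pass"
    }
    stage = 1
    # Walk upward only through consecutively passed stages (no skipping).
    while stage in passed and stage < 3:
        stage += 1
    return stage


def check_promotion(
    results: List[Dict],
    candidate: str,
    target_stage: int,
    critical: bool = False,
    approved_by: Optional[str] = None,
    emergency_exception: bool = False,
) -> bool:
    """Raise PermissionError unless the stage gates allow this move."""
    allowed = next_allowed_stage(results, candidate)
    if target_stage > allowed and not emergency_exception:
        raise PermissionError(
            f"stage {target_stage} requested but only stage {allowed} is allowed"
        )
    # Production-critical workloads need a named approver before any
    # production traffic changes (assumed here to mean Stage 2 and Stage 3).
    if critical and target_stage >= 2 and not approved_by:
        raise PermissionError("production-critical workload requires human approval")
    return True
```

Because passes are matched on the artifact, a Stage 1 result for one digest never unlocks Stage 2 for a different digest, which is the "same candidate artifact" requirement in the prechecks.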
+
diff --git a/workplans/RAIL-BS-WP-0006-staged-promotion-lifecycle.md b/workplans/RAIL-BS-WP-0006-staged-promotion-lifecycle.md
index 46d2d5e..40760f9 100644
--- a/workplans/RAIL-BS-WP-0006-staged-promotion-lifecycle.md
+++ b/workplans/RAIL-BS-WP-0006-staged-promotion-lifecycle.md
@@ -10,7 +10,7 @@ topic_slug: railiance
 repo_goal_id: "6ea441f7-7fe3-4598-922b-38baf20c0580"
 state_hub_workstream_id: "cb72d3ba-1863-43c2-a2a5-49ac75fc2603"
 created: "2026-02-24"
-updated: "2026-05-03"
+updated: "2026-06-16"
 ---
 
 # Staged Promotion Lifecycle
@@ -54,7 +54,7 @@ Expected cross-repo handoffs:
 
 ```task
 id: RAIL-BS-WP-0006-T01
-status: todo
+status: done
 priority: high
 state_hub_task_id: "fbfc341f-8ccb-4950-a85d-3e59c4f5b87f"
 ```
@@ -72,6 +72,13 @@ The spec should define:
 **Done when:** the lifecycle is clear enough to apply to Forgejo as a later
 production workload.
 
+2026-06-16: Added `docs/deployment-lifecycle.md` and linked it from
+`docs/README.md`. The specification defines Stage 1 local validation, Stage 2
+production canary, Stage 3 production promotion, required checks and evidence,
+canary acceptance gates, rollback expectations, human approval gates for
+production-critical workloads, and the Forgejo readiness questions that must be
+answered before cutover.
+
 ---
 
 ### T02 - Define railiance directory schema and app.toml contract