Add staged deployment lifecycle spec
@@ -71,6 +71,10 @@ From two bare Linux servers, a Git repo, and valid credentials, you can rebuild
- [2026-03-10 — pgpool CrashLoopBackOff on PostgreSQL HA failover](incidents/2026-03-10-pgpool-missing-secret.md)

## Operations

- [Deployment lifecycle](deployment-lifecycle.md)

## 👥 Contributing

See CONTRIBUTING.md for rules, coding style, and workflow.

331
docs/deployment-lifecycle.md
Normal file
@@ -0,0 +1,331 @@
# Railiance Deployment Lifecycle

This document defines the Railiance three-stage promotion lifecycle for
workloads that run on the Railiance Kubernetes substrate.

The lifecycle exists so production workloads move through repeatable gates
instead of one-off operator memory. It is intentionally conservative: every
stage must leave evidence, every promotion must have a rollback path, and
critical workloads require explicit human approval before production traffic is
changed.

## Scope

This specification is owned by `railiance-cluster` because it defines the
cluster runtime contract for promotion gates, canary validation, production
routing, and rollback expectations.

Repo boundaries:

- `railiance-cluster` owns the lifecycle semantics, cluster prerequisites,
  ingress/routing expectations, and acceptance gates.
- `railiance-apps` owns workload-specific Helm values, application release
  definitions, and production workload configuration.
- `railiance-platform` owns shared platform services such as databases,
  caches, object storage, and backup targets.
- `railiance-enablement` owns developer-facing templates, CI workflows, and
  local ergonomics.
- `railiance-infra` owns host provisioning, OS hardening, SSH, firewall, and
  node bootstrap below Kubernetes.

## Lifecycle Overview

Railiance promotes a workload through three stages:

1. Stage 1: local validation.
2. Stage 2: production canary.
3. Stage 3: production promotion.

The stages are sequential. A workload may return to an earlier stage at any
time, but it must not skip a stage when moving toward production unless an
operator records an emergency exception in State Hub.

Each stage emits a machine-readable result with:

- workload identity;
- source revision or image digest;
- target environment;
- checks run;
- pass/fail status;
- non-secret evidence references;
- rollback target, when applicable;
- approving human or explicit "not required" decision.

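The result fields listed above could be captured in a small record type. The following is a minimal Python sketch, not the Railiance schema: the class name `StageResult`, every field name, and the JSON layout are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class StageResult:
    """Hypothetical machine-readable result for one lifecycle stage."""

    workload: str            # workload identity
    artifact: str            # source revision or image digest
    environment: str         # target environment
    checks: dict[str, bool]  # check name -> whether it passed
    evidence_refs: list[str] = field(default_factory=list)  # non-secret references only
    rollback_target: Optional[str] = None                    # when applicable
    approved_by: str = "not required"  # approving human or explicit "not required"

    @property
    def passed(self) -> bool:
        # Fail closed: a result with no recorded checks is not a pass.
        return bool(self.checks) and all(self.checks.values())

    def to_json(self) -> str:
        # The derived pass/fail status is added next to the stored fields.
        payload = asdict(self)
        payload["status"] = "pass" if self.passed else "fail"
        return json.dumps(payload, sort_keys=True)
```

Deriving the status from the recorded checks, rather than storing it separately, keeps a result from claiming a pass that its own checks do not support.
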
## Workload Declaration

Each participating workload should declare its promotion contract in a
repository-local `railiance/app.toml`. The full schema is defined by the next
workplan task, but this lifecycle expects every workload declaration to provide
at least:

- stable workload name and owning repo;
- source revision, image tag, or image digest policy;
- stage-specific namespaces or release names;
- health checks and observability endpoints;
- ingress or routing targets;
- platform dependencies;
- rollback command or previous-stable reference;
- secret references by name or path, never plaintext secret values.

If a workload cannot provide a machine-readable declaration yet, it may still
use this lifecycle through a written operator runbook, but that is a temporary
compatibility path. The runbook must identify the missing declaration fields.

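Identifying the missing declaration fields can be mechanical once a declaration has been parsed into a dictionary. The sketch below assumes hypothetical key names, one per item in the list above. The real `railiance/app.toml` schema is defined by a later workplan task and may differ.

```python
# Hypothetical key names; the real app.toml schema is not defined in this document.
REQUIRED_FIELDS = (
    "name",             # stable workload name
    "owning_repo",      # owning repo
    "artifact_policy",  # source revision, image tag, or image digest policy
    "stages",           # stage-specific namespaces or release names
    "health_checks",    # health checks and observability endpoints
    "routing",          # ingress or routing targets
    "dependencies",     # platform dependencies
    "rollback",         # rollback command or previous-stable reference
    "secret_refs",      # secret references by name or path only
)


def missing_declaration_fields(declaration: dict) -> list[str]:
    """Return required keys that are absent or None, in REQUIRED_FIELDS order.

    An empty list or empty string counts as declared: a workload may
    legitimately have no platform dependencies.
    """
    return [key for key in REQUIRED_FIELDS if declaration.get(key) is None]
```

A runbook for a workload on the compatibility path could embed the returned list verbatim as its record of what is still undeclared.
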
## Stage 1: Local Validation

Stage 1 proves that the workload can be built, configured, and checked outside
production traffic.

Typical Stage 1 targets:

- local container runtime;
- local Kubernetes such as k3d, kind, or a disposable namespace;
- dry-run Helm rendering;
- unit, integration, migration, and smoke checks that do not require production
  credentials.

Required Stage 1 checks:

- source revision is cleanly identified;
- build or artifact selection is deterministic;
- Helm templates or manifests render without invalid Kubernetes objects;
- local health checks pass;
- required secrets are referenced by name only and are not printed;
- database migrations, if any, are classified as reversible, forward-only, or
  requiring human approval;
- a Stage 2 candidate artifact is named by immutable digest or equivalent
  immutable revision.

Stage 1 fails closed when:

- local checks are skipped without an approved reason;
- generated manifests contain plaintext secrets;
- the artifact cannot be traced to source;
- the workload cannot state how it will be observed in Stage 2.

Stage 1 completion does not authorize production traffic. It only makes a
workload eligible for Stage 2 review.

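One of the required Stage 1 checks is that the Stage 2 candidate is named by an immutable digest. As an illustration, the following Python sketch tests whether an image reference is pinned by a `sha256` digest. The function name is hypothetical, and the check covers only the digest form. It does not recognise an "equivalent immutable revision" such as a Git commit id.

```python
import re

# A digest-pinned reference has the form "<name>[:tag]@sha256:<64 lowercase hex chars>".
_DIGEST_REF = re.compile(r"[^@\s]+@sha256:[0-9a-f]{64}")


def is_digest_pinned(image_ref: str) -> bool:
    """True when the reference is pinned by sha256 digest rather than by tag alone."""
    return _DIGEST_REF.fullmatch(image_ref) is not None
```

A tag such as `forgejo:1.2.3` can be re-pointed at a different image later, which is why a tag-only reference is rejected here.
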
## Stage 2: Production Canary

Stage 2 deploys the candidate to production infrastructure with limited or
isolated exposure. The goal is to observe the candidate against real platform
dependencies while keeping blast radius small.

Acceptable canary forms:

- weighted ingress split between stable and canary;
- header-based or path-based routing for operator traffic only;
- shadow deployment receiving replicated non-mutating traffic;
- isolated production namespace with manually triggered probes.

The selected canary form must be declared before deployment. If weighted
routing is unavailable, the fallback must preserve the same safety property:
the candidate can be observed without silently replacing stable production.

Required Stage 2 prechecks:

- Stage 1 result passed for the same candidate artifact;
- cluster connectivity and namespace readiness are verified;
- target image digest or immutable tag exists in the registry;
- Helm server-side dry-run succeeds;
- ingress, certificate, and DNS prerequisites are present where applicable;
- platform dependencies are healthy or explicitly degraded with operator
  approval;
- rollback target is known before the canary is applied;
- monitoring and log queries are available for the canary release.

Required Stage 2 evidence:

- rendered release identity;
- applied namespace and release name;
- pod readiness and restart status;
- ingress or routing state;
- key health endpoint result;
- relevant metrics window or explicit "metrics unavailable" note;
- State Hub progress note with non-secret evidence;
- operator approval when the workload is production-critical.

Canary acceptance gates:

- canary pods remain ready for the configured observation window;
- no crash loops, repeated restarts, or pending pods remain unexplained;
- health checks pass from inside and outside the cluster when both are
  applicable;
- error rate, latency, and saturation do not regress beyond the workload's
  declared threshold;
- no unexpected schema, storage, or queue side effects are observed;
- logs show no secret leakage and no repeated authorization failures;
- rollback has been tested previously or is a single documented command with a
  known previous-stable target.

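The regression gate for error rate, latency, and saturation can be read as a comparison between stable and canary samples. The Python sketch below is one possible reading. The metric names, the use of a relative increase as the threshold, and the choice to treat a missing sample as a regression are all assumptions, not part of this specification.

```python
def regressed_metrics(stable: dict[str, float],
                      canary: dict[str, float],
                      max_relative_increase: dict[str, float]) -> list[str]:
    """Return metrics where the canary exceeds stable by more than the declared threshold.

    A threshold of 0.25 allows the canary value to be up to 25% above stable.
    A metric with no stable or canary sample is reported as regressed (fail closed).
    """
    failed = []
    for name, limit in max_relative_increase.items():
        if name not in stable or name not in canary:
            failed.append(name)
            continue
        allowed = stable[name] * (1.0 + limit)
        if canary[name] > allowed:
            failed.append(name)
    return failed
```

With a stable baseline of zero, for example an error rate of 0, any non-zero canary value is reported. A workload that needs more tolerance there would have to declare an absolute threshold instead.
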
Default observation windows:

- non-critical internal service: 15 minutes;
- user-facing or shared platform service: 30 minutes;
- production-critical infrastructure such as Forgejo, identity, registry, or
  State Hub: operator-defined window, minimum 60 minutes unless explicitly
  waived.

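The default windows above translate into a small lookup. In this Python sketch the class labels (`internal`, `user-facing`, `critical`) and the function name are hypothetical. It also assumes that a production-critical workload with no operator-defined window falls back to the 60-minute minimum.

```python
from datetime import timedelta
from typing import Optional

# Durations come from the list above; the class labels are hypothetical.
DEFAULT_WINDOWS = {
    "internal": timedelta(minutes=15),
    "user-facing": timedelta(minutes=30),
    "critical": timedelta(minutes=60),
}


def observation_window(workload_class: str,
                       requested: Optional[timedelta] = None,
                       waived: bool = False) -> timedelta:
    """Return the window to observe for a canary of the given class.

    Unknown classes raise KeyError so that a typo cannot shorten a window.
    """
    default = DEFAULT_WINDOWS[workload_class]
    if requested is None:
        return default
    if workload_class == "critical" and requested < default and not waived:
        raise ValueError("critical workloads need at least 60 minutes unless waived")
    return requested
```
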
Stage 2 fails closed when:

- the canary cannot be distinguished from stable production;
- production routing changes more traffic than intended;
- any required evidence is missing and no operator waiver is recorded;
- rollback target is unknown;
- the candidate needs a secret, credential, or platform dependency that was not
  declared before the canary.

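Two of the fail-closed conditions above, missing evidence without a waiver and an unknown rollback target, lend themselves to an automated check. The following Python sketch shows the idea. The evidence keys are hypothetical names for items in the required Stage 2 evidence list (operator approval is left out because it applies only to production-critical workloads), and the remaining conditions are not modelled here.

```python
from typing import Optional

# Hypothetical keys standing in for the required Stage 2 evidence items.
REQUIRED_STAGE2_EVIDENCE = (
    "release_identity",
    "namespace_and_release",
    "pod_status",
    "routing_state",
    "health_result",
    "metrics_window",
    "state_hub_note",
)


def stage2_blockers(evidence: dict[str, str],
                    waivers: set[str],
                    rollback_target: Optional[str]) -> list[str]:
    """Return the reasons Stage 2 must fail closed; an empty list means none were found."""
    blockers = []
    if not rollback_target:
        blockers.append("rollback target is unknown")
    for item in REQUIRED_STAGE2_EVIDENCE:
        # Evidence may be absent only when an operator waiver names that item.
        if not evidence.get(item) and item not in waivers:
            blockers.append(f"missing evidence: {item}")
    return blockers
```
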
## Stage 3: Production Promotion

Stage 3 promotes the accepted candidate to the stable production path.

Promotion may mean:

- shifting weighted traffic to the canary release;
- replacing the stable Helm release with the accepted candidate;
- changing an ingress selector or service target;
- activating an operator-approved rollout workflow.

Required Stage 3 prechecks:

- Stage 2 acceptance gates passed for the same candidate artifact;
- the previous stable version is recorded;
- backup and restore posture is current for stateful workloads;
- migrations are approved and sequenced;
- production-critical workloads have explicit human approval;
- a rollback command and rollback verification check are available.

Required Stage 3 evidence:

- promotion command or workflow id;
- previous stable version;
- new stable version;
- production routing state after promotion;
- smoke result after promotion;
- rollback target retained;
- State Hub progress note with non-secret evidence.

Stage 3 is complete only after the post-promotion smoke passes and the
workload's stable routing points at the promoted candidate.

## Rollback Expectations

Rollback is part of every promotion, not an afterthought.

Every Stage 2 and Stage 3 action must identify one of:

- previous stable Helm release revision;
- previous image digest and values file;
- previous ingress/routing configuration;
- documented manual recovery path when automation is not yet safe.

Rollback must be immediate when:

- production availability is degraded;
- canary traffic escapes the declared blast radius;
- the workload emits repeated authorization or secret-handling errors;
- data integrity is at risk;
- an operator revokes approval during the observation window.

Rollback may be deferred only when the rollback itself is more dangerous than
the incident state. That decision requires a State Hub note and human approval.

After rollback, record:

- triggering symptom;
- rollback action;
- final stable version;
- remaining cleanup;
- whether the failed candidate is blocked, abandoned, or returned to Stage 1.

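The immediate-rollback triggers and the single deferral exception form a small decision rule. The Python sketch below expresses it with hypothetical names. The `deferral_recorded_and_approved` flag stands for both the State Hub note and the human approval being in place.

```python
from enum import Enum


class Trigger(Enum):
    """Conditions under which rollback must be immediate."""

    AVAILABILITY_DEGRADED = "production availability is degraded"
    BLAST_RADIUS_ESCAPED = "canary traffic escapes the declared blast radius"
    AUTH_OR_SECRET_ERRORS = "repeated authorization or secret-handling errors"
    DATA_INTEGRITY_AT_RISK = "data integrity is at risk"
    APPROVAL_REVOKED = "operator revoked approval during the observation window"


def rollback_decision(triggers: set,
                      rollback_more_dangerous: bool = False,
                      deferral_recorded_and_approved: bool = False) -> str:
    """Return "continue", "rollback", or "defer" for the observed triggers."""
    if not triggers:
        return "continue"
    # Deferral needs both conditions; either one alone still means rollback.
    if rollback_more_dangerous and deferral_recorded_and_approved:
        return "defer"
    return "rollback"
```

The default for any trigger is rollback: deferral has to be argued for and recorded, never assumed.
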
## Human Approval Gates

Human approval is required before production traffic changes for
production-critical workloads.

Production-critical workloads include:

- source forge and package registry workloads such as Forgejo or Gitea;
- identity, MFA, SSO, or authorization systems;
- State Hub, Inter-Hub, and operator coordination services;
- databases, object stores, and backup systems;
- ingress, certificate, or cluster-wide policy controllers;
- any workload whose failure blocks multiple repos or domains.

Approval must be recorded as a non-secret State Hub note or task comment. The
approval record should name:

- approving operator;
- candidate artifact;
- stage being approved;
- observation window;
- rollback target;
- any waived gates and why.

Emergency approval can be retrospective only when delaying the action would
increase production risk. Retrospective approval must be recorded immediately
after stabilization.

## Evidence And Secret Handling

Lifecycle evidence must be useful without being sensitive.

Allowed evidence:

- commit ids;
- image tags and digests;
- workflow ids;
- Kubernetes object names;
- pod status summaries;
- HTTP status codes;
- timestamps;
- State Hub progress ids;
- pass/fail summaries.

Forbidden evidence:

- plaintext secrets;
- bearer tokens;
- static API keys;
- kubeconfigs;
- private key material;
- full environment dumps;
- logs that contain credentials or user private data.

When a check needs secret-backed access, record only the access path and result,
for example: "OpenBao path configured, token exchange returned 200".

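Evidence text can be screened for some of the forbidden categories before it is recorded. The Python sketch below is a heuristic only: the patterns and labels are assumptions, they cover three of the forbidden categories, and a clean result does not prove that the text is free of secrets.

```python
import re

# Heuristic patterns for a few forbidden evidence categories (illustrative, not exhaustive).
FORBIDDEN_PATTERNS = {
    "bearer token": re.compile(r"\bBearer\s+[A-Za-z0-9._~+/-]{16,}=*"),
    "private key material": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "kubeconfig credential": re.compile(
        r"\b(client-key-data|client-certificate-data|token)\s*:\s*\S+"
    ),
}


def forbidden_findings(evidence_text: str) -> list[str]:
    """Return the labels of every forbidden pattern found in the evidence text."""
    return [
        label
        for label, pattern in FORBIDDEN_PATTERNS.items()
        if pattern.search(evidence_text)
    ]
```

The example note from this section, "OpenBao path configured, token exchange returned 200", passes the screen because it names an access path and a result without carrying a credential.
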
## Forgejo Readiness Interpretation

This lifecycle is clear enough for Forgejo when a future Forgejo workplan can
answer these questions before production cutover:

- What source revision and image digest are being promoted?
- What local checks prove the candidate is viable?
- How is the production canary isolated or traffic-limited?
- Which health, registry, SSH, web, Actions, and email recovery checks define
  acceptance?
- Who approves the Stage 3 traffic switch?
- What is the previous stable target?
- How is repository data protected before and after promotion?
- How will rollback be verified without losing package or repository state?

If any answer is missing, Forgejo remains in Stage 1 or Stage 2 preparation and
must not cut over to Stage 3.

## Minimum Command Contract

Future CLI tasks should make these lifecycle operations repeatable:

```text
bin/railiance run <app>               # Stage 1 local validation
bin/railiance deploy --stage 2 <app>  # Stage 2 canary deployment
bin/railiance observe <app>           # Stage 2/3 evidence collection
bin/railiance promote <app>           # Stage 3 production promotion
bin/railiance rollback <app>          # rollback to previous stable
```

The exact command names may change as implementation lands, but the behavior
must preserve the stage gates and evidence requirements in this document.

@@ -10,7 +10,7 @@ topic_slug: railiance
repo_goal_id: "6ea441f7-7fe3-4598-922b-38baf20c0580"
state_hub_workstream_id: "cb72d3ba-1863-43c2-a2a5-49ac75fc2603"
created: "2026-02-24"
updated: "2026-05-03"
updated: "2026-06-16"
---

# Staged Promotion Lifecycle
@@ -54,7 +54,7 @@ Expected cross-repo handoffs:

```task
id: RAIL-BS-WP-0006-T01
status: todo
status: done
priority: high
state_hub_task_id: "fbfc341f-8ccb-4950-a85d-3e59c4f5b87f"
```
@@ -72,6 +72,13 @@ The spec should define:
**Done when:** the lifecycle is clear enough to apply to Forgejo as a later
production workload.

2026-06-16: Added `docs/deployment-lifecycle.md` and linked it from
`docs/README.md`. The specification defines Stage 1 local validation, Stage 2
production canary, Stage 3 production promotion, required checks and evidence,
canary acceptance gates, rollback expectations, human approval gates for
production-critical workloads, and the Forgejo readiness questions that must be
answered before cutover.

---

### T02 - Define railiance directory schema and app.toml contract