Repo hygiene + new workplans (RAIL-BS-WP-0008/0009)
Some checks failed
railiance-tests / smoke (push) Has been cancelled
- Add RAIL-BS-WP-0008 (activity-core WP-0016 deploy) and RAIL-BS-WP-0009 (admin-sync smoke) from inbox asks 87952ff1 / aa8b7986
- Archive finished workplans to workplans/archived/ per ADR-001 convention; normalize frontmatter statuses (completed/done -> finished)
- Fill stack-and-commands.md, complete repo-boundary.md, refresh SCOPE Current State, add docs/operator-runbook.md for production-touching targets

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@@ -0,0 +1,135 @@
|
||||
---
|
||||
id: RAIL-BS-WP-0001
|
||||
type: workplan
|
||||
title: "Dependency Management — Add lockfile for Ansible control-node deps"
|
||||
domain: financials
|
||||
repo: railiance-cluster
|
||||
status: finished
|
||||
owner: railiance
|
||||
topic_slug: railiance
|
||||
state_hub_workstream_id: 59155efb-b461-4caa-ad7b-b3fce348db84
|
||||
state_hub_task_id: 5f8cade5-119c-42e8-ba93-e9d0478650e4
|
||||
created: "2026-03-01"
|
||||
updated: "2026-03-01"
|
||||
completed: "2026-03-01"
|
||||
---
|
||||
|
||||
# Dependency Management — Add Ansible control-node lockfile
|
||||
|
||||
## Problem
|
||||
|
||||
This repo drives all Ansible automation but carries no pinned, machine-readable
|
||||
inventory of its own runtime dependencies.
|
||||
|
||||
The Ansible version (and all pip packages it depends on) are whatever is
|
||||
installed on the control node at any given time. This means:
|
||||
|
||||
- Behaviour is not reproducible across machines or over time
|
||||
- The Custodian State Hub SBOM scanner finds nothing to ingest (`last_sbom_at = null`)
|
||||
- Licence and vulnerability auditing of the actual dependencies in use is impossible
|
||||
- The `railiance-cluster` repo appears as a gap in the SBOM coverage map
|
||||
|
||||
## Root cause
|
||||
|
||||
No `pyproject.toml` (or `requirements.txt`) declares the control-node pip
|
||||
dependencies. No `ansible/requirements.yml` exists for Galaxy collections
|
||||
(correct if none are used; but it should be explicit).
|
||||
|
||||
## Expected state after this task
|
||||
|
||||
- `pyproject.toml` at repo root declares `ansible` as a dependency (and any
|
||||
other pip packages used by playbooks or the `bin/` commands)
|
||||
- `uv.lock` is generated and committed — pins Ansible + full transitive pip tree
|
||||
- If Galaxy collections are used: `ansible/requirements.yml` lists them
|
||||
- SBOM is ingested: `last_sbom_at` is not null in the State Hub
|
||||
- The SBOM dashboard shows `railiance-cluster` in the railiance domain row
|
||||
with a package count
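
The file-level part of the expected state can be spot-checked from a checkout. The following is a minimal sketch, assuming only the file names listed above; the function name `check_dependency_files` is hypothetical and not part of this repo.

```bash
# Hypothetical helper: report which dependency manifests exist in a checkout.
# Returns 0 only when both pyproject.toml and uv.lock are present.
check_dependency_files() {
    repo_dir="$1"
    status=0
    for f in pyproject.toml uv.lock; do
        if [ -f "$repo_dir/$f" ]; then
            echo "ok: $f"
        else
            echo "missing: $f"
            status=1
        fi
    done
    # ansible/requirements.yml is only required when Galaxy collections are
    # used, so its absence is reported but does not fail the check.
    if [ -f "$repo_dir/ansible/requirements.yml" ]; then
        echo "ok: ansible/requirements.yml"
    else
        echo "note: ansible/requirements.yml absent"
    fi
    return "$status"
}
```

Usage: `check_dependency_files /path/to/railiance-cluster`. The SBOM-related items still have to be verified in the State Hub itself.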
|
||||
|
||||
## Tasks
|
||||
|
||||
### T1 — Audit control-node pip dependencies
|
||||
|
||||
```task
|
||||
id: RAIL-BS-WP-0001-T01
|
||||
state_hub_task_id: 5f8cade5-119c-42e8-ba93-e9d0478650e4
|
||||
status: done
|
||||
priority: medium
|
||||
completed: "2026-03-01"
|
||||
```
|
||||
|
||||
Review `bin/` commands, Ansible playbooks, and any Python scripts in the repo.
|
||||
List all pip packages that must be present on the control node:
|
||||
- `ansible` (minimum version)
|
||||
- Any collections-related tools (ansible-core, ansible-lint, etc.)
|
||||
- Any other pip deps called from scripts (e.g. `paramiko`, `netaddr`, `jinja2`)
|
||||
|
||||
### T2 — Create pyproject.toml and generate uv.lock
|
||||
|
||||
```task
|
||||
id: RAIL-BS-WP-0001-T02
|
||||
status: done
|
||||
priority: medium
|
||||
completed: "2026-03-01"
|
||||
state_hub_task_id: "8aa8a9d3-6560-4176-b933-72a21e6d43d4"
|
||||
```
|
||||
|
||||
1. Create `pyproject.toml`:
|
||||
```toml
[project]
name = "railiance-cluster"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "ansible>=10",  # adjust version as appropriate
    # add other deps found in T1
]
```
|
||||
2. Run `uv lock` to generate `uv.lock`
|
||||
3. Commit both files
|
||||
|
||||
### T3 — Ingest SBOM into State Hub
|
||||
|
||||
```task
|
||||
id: RAIL-BS-WP-0001-T03
|
||||
status: done
|
||||
priority: medium
|
||||
completed: "2026-03-01"
|
||||
state_hub_task_id: "4fb477e9-dbac-4e43-84d0-5202c68f4705"
|
||||
```
|
||||
|
||||
From `~/the-custodian/state-hub/`:
|
||||
|
||||
```bash
|
||||
make ingest-sbom REPO=railiance-cluster SCAN=1 REPO_PATH=/home/worsch/railiance-cluster
|
||||
```
|
||||
|
||||
Verify in the SBOM dashboard: railiance domain should show `railiance-cluster`
|
||||
with a package count and no gap warning.
|
||||
|
||||
### T4 — Create ansible/requirements.yml (even if empty)
|
||||
|
||||
```task
|
||||
id: RAIL-BS-WP-0001-T04
|
||||
status: done
|
||||
priority: low
|
||||
completed: "2026-03-01"
|
||||
state_hub_task_id: "d0eb1c96-e7c2-4f6b-b934-a3f295e4db72"
|
||||
```
|
||||
|
||||
Create `ansible/requirements.yml`. If no Galaxy roles or collections are used,
|
||||
create it empty with a comment. This makes the absence of collections explicit:
|
||||
|
||||
```yaml
|
||||
---
|
||||
# No external Ansible Galaxy roles or collections required.
|
||||
# Add roles/collections here as needed:
|
||||
# roles: []
|
||||
# collections: []
|
||||
```
|
||||
|
||||
## References
|
||||
|
||||
- Custodian SBOM Convention: `canon/standards/sbom-convention_v0.1.md`
|
||||
- SBOM dashboard: http://127.0.0.1:3000/sbom
|
||||
- Repos coverage page: http://127.0.0.1:3000/repos
|
||||
- State Hub task: `5f8cade5-119c-42e8-ba93-e9d0478650e4`
|
||||
175
workplans/archived/260622-RAIL-BS-WP-0002-k3s-baseline.md
Normal file
@@ -0,0 +1,175 @@
|
||||
---
|
||||
id: RAIL-BS-WP-0002
|
||||
type: workplan
|
||||
title: "k3s and Kubernetes Platform Baseline"
|
||||
domain: financials
|
||||
repo: railiance-cluster
|
||||
status: finished
|
||||
owner: railiance
|
||||
topic_slug: railiance
|
||||
repo_goal_id: "70ab2379-fb9d-4fec-a09d-b2a717e4ace8"
|
||||
state_hub_workstream_id: "4c63dfc6-9eac-4e79-9f77-8f644ad7147d"
|
||||
created: "2026-03-09"
|
||||
updated: "2026-03-10"
|
||||
completed: "2026-03-10"
|
||||
---
|
||||
|
||||
# k3s and Kubernetes Platform Baseline
|
||||
|
||||
## Goal
|
||||
|
||||
Install k3s, Helm, and the baseline Kubernetes services on the converged
|
||||
HostEurope node. This workplan picks up exactly where `railiance-hosts`
|
||||
leaves off: a hardened, verified OS node that is ready for Kubernetes.
|
||||
|
||||
## Pre-condition
|
||||
|
||||
`railiance-infra` converge + Goss verify must pass before any task here
|
||||
is executed:
|
||||
|
||||
```bash
|
||||
# In railiance-infra/
|
||||
make converge
|
||||
make verify # must exit 0
|
||||
```
|
||||
|
||||
## Boundary
|
||||
|
||||
This repo owns everything from k3s upward. It must not re-configure items
|
||||
defined in `railiance-infra/spec/server-baseline.yaml`. See ADR-003:
|
||||
`railiance-infra/docs/adr/ADR-003-railiance-5repo-stack-architecture.md`.
|
||||
|
||||
**Out of scope here:** platform services (PostgreSQL, storage, identity)
|
||||
→ `railiance-platform`. Application deployments (Gitea, coulomb services)
|
||||
→ `railiance-apps`.
|
||||
|
||||
---
|
||||
|
||||
## Tasks
|
||||
|
||||
### T01 — Ansible playbook: install k3s (server mode)
|
||||
|
||||
```task
|
||||
id: T01
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "3f042630-eab0-4c6a-9167-e2b28ff20e40"
|
||||
completed: "2026-03-10"
|
||||
```
|
||||
|
||||
Harden `ansible/bootstrap.yml` to a production-ready k3s install:
|
||||
|
||||
- Use the official k3s install script pinned to a specific version
|
||||
(`INSTALL_K3S_VERSION=v1.35.1+k3s1`)
|
||||
- `INSTALL_K3S_EXEC="server --cluster-init --write-kubeconfig-mode=644"`
|
||||
(cluster-init enables embedded etcd for future HA expansion)
|
||||
- Wait for node `Ready` before proceeding:
|
||||
```bash
|
||||
k3s kubectl wait node --all --for=condition=Ready --timeout=120s
|
||||
```
|
||||
- Fetch kubeconfig to the control node as `~/.kube/config-hosteurope`
|
||||
|
||||
**Done when:** `k3s kubectl get nodes` returns `Ready` from both the server
|
||||
and the control node (via kubeconfig).
|
||||
|
||||
---
|
||||
|
||||
### T02 — Helm installation
|
||||
|
||||
```task
|
||||
id: T02
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "e8510646-46ed-4697-a345-f3d3009eea78"
|
||||
completed: "2026-03-10"
|
||||
```
|
||||
|
||||
Add a task (or a role `roles/helm/`) that:
|
||||
|
||||
1. Downloads the Helm binary (pinned version) to `/usr/local/bin/helm`
|
||||
2. Verifies the checksum
|
||||
3. Confirms `helm version` succeeds
|
||||
|
||||
**Done when:** `helm version` succeeds on the HostEurope node.
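
Step 2 (checksum verification) can be sketched as a small shell function. This is an illustration only, assuming GNU coreutils `sha256sum` is present on the node; the name `verify_sha256` is hypothetical, and the expected digest must come from the Helm release page for the pinned version.

```bash
# Hypothetical helper: compare a file's SHA-256 digest with an expected value.
# Assumes GNU coreutils `sha256sum` is available.
verify_sha256() {
    file="$1"
    expected="$2"
    actual=$(sha256sum "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "checksum ok: $file"
        return 0
    fi
    echo "checksum mismatch: $file" >&2
    return 1
}
```

In an Ansible role the same check is usually delegated to the `checksum` parameter of `ansible.builtin.get_url`, which avoids a separate shell step.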
|
||||
|
||||
---
|
||||
|
||||
### T03 — Smoke test: k3s + Helm
|
||||
|
||||
```task
|
||||
id: T03
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "dab2c07f-8aa0-4635-8df6-857e87e93fc5"
|
||||
completed: "2026-03-10"
|
||||
```
|
||||
|
||||
Extend `tests/smoke_kube.sh` to assert:
|
||||
|
||||
- `k3s kubectl get nodes` → node in Ready state
|
||||
- `helm version` exits 0
|
||||
- CoreDNS pod running in `kube-system`
|
||||
- Traefik ingress controller pod running (default in k3s)
|
||||
|
||||
Run via:
|
||||
```bash
|
||||
ansible-playbook -i ansible/hosts.ini ansible/smoke.yml
|
||||
```
|
||||
or directly over SSH if the kubeconfig is available locally.
|
||||
|
||||
**Done when:** all assertions pass and the script exits 0.
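
The node-readiness assertion could be written as a filter over `kubectl` output. This is a sketch under assumptions: the function name `all_nodes_ready` is hypothetical, and it expects the column layout of `kubectl get nodes --no-headers` (NAME, STATUS, ...).

```bash
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin.
# Succeeds only if at least one node is listed and every node's STATUS column
# is exactly "Ready" (so "NotReady" or "Ready,SchedulingDisabled" fails).
all_nodes_ready() {
    awk '
        { total++ }
        $2 == "Ready" { ready++ }
        END {
            if (total > 0 && ready == total) exit 0
            exit 1
        }
    '
}
```

Usage inside the smoke script: `k3s kubectl get nodes --no-headers | all_nodes_ready || exit 1`.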
|
||||
|
||||
---
|
||||
|
||||
### T04 — Commit kubeconfig management notes
|
||||
|
||||
```task
|
||||
id: T04
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "5c3d40e4-239b-488e-9519-6f7a38d2325f"
|
||||
completed: "2026-03-10"
|
||||
```
|
||||
|
||||
Document in `docs/kubeconfig.md`:
|
||||
|
||||
- Where the kubeconfig is fetched to (`~/.kube/config-hosteurope`)
|
||||
- How to merge it into `~/.kube/config`
|
||||
- How to switch context: `kubectl config use-context default`
|
||||
- Security note: kubeconfig is gitignored (contains cluster CA + client cert)
|
||||
|
||||
**Done when:** doc written and committed.
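
For the merge step, one common approach relies on `kubectl` treating `KUBECONFIG` as a colon-separated list of files and on `kubectl config view --flatten` printing the merged result. The sketch below assumes that approach; the function name `merge_kubeconfig` is hypothetical.

```bash
# Hypothetical helper: merge a fetched kubeconfig into an existing one.
# The merged output goes to a temporary file first, so a failed merge never
# truncates the existing config.
merge_kubeconfig() {
    fetched="$1"                          # e.g. ~/.kube/config-hosteurope
    target="${2:-$HOME/.kube/config}"
    tmp="$target.merged.$$"
    # --flatten inlines certificate data so the output is self-contained.
    if KUBECONFIG="$target:$fetched" kubectl config view --flatten > "$tmp"; then
        mv "$tmp" "$target"
        chmod 600 "$target"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Usage: `merge_kubeconfig ~/.kube/config-hosteurope`. Back up the existing `~/.kube/config` before running it.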
|
||||
|
||||
---
|
||||
|
||||
### T05 — Add `make k3s-install` and `make smoke` targets
|
||||
|
||||
```task
|
||||
id: T05
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "7f9e0e58-a130-467a-a2d0-b3f2564e496f"
|
||||
completed: "2026-03-10"
|
||||
```
|
||||
|
||||
Add to Makefile (create one if none exists):
|
||||
|
||||
```makefile
k3s-install: ## Install k3s and Helm on all inventory hosts
	ansible-playbook -i ansible/hosts.ini ansible/bootstrap.yml

smoke: ## Run Kubernetes smoke tests
	bash tests/smoke_kube.sh
```
|
||||
|
||||
**Done when:** both targets work and are listed in `make help`.
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- Repo goal: `70ab2379-fb9d-4fec-a09d-b2a717e4ace8` (Install k3s and Kubernetes Baseline)
|
||||
- Domain goal: `6f96c712-60e6-4ea9-ab06-168878eafbce` (Three-Phoenix Secure Kubernetes Infrastructure)
|
||||
- Pre-condition: railiance-infra WP-0001 (Secure Single-Server Bootstrap) — completed 2026-03-09
|
||||
- Boundary ADR: `railiance-infra/docs/adr/ADR-003-railiance-5repo-stack-architecture.md`
|
||||
- k3s releases: https://github.com/k3s-io/k3s/releases
|
||||
@@ -0,0 +1,194 @@
|
||||
---
|
||||
id: RAIL-BS-WP-0003
|
||||
type: bug-report
|
||||
title: "pgpool CrashLoopBackOff on PostgreSQL HA failover — missing secret key"
|
||||
domain: financials
|
||||
repo: railiance-cluster
|
||||
status: finished
|
||||
owner: tegwick
|
||||
created: "2026-03-10"
|
||||
updated: "2026-03-10"
|
||||
state_hub_workstream_id: "7ee9ee22-1fae-4567-9194-8d70a9e0f45b"
|
||||
---
|
||||
|
||||
# Bug Report: pgpool CrashLoopBackOff on PostgreSQL HA failover
|
||||
|
||||
## Summary
|
||||
|
||||
On 2026-03-10 a PostgreSQL HA failover caused all three postgresql pods to
|
||||
restart. pgpool — the connection pooler between Gitea and PostgreSQL — then
|
||||
entered CrashLoopBackOff and produced no logs. As a result Gitea's login
|
||||
and all write operations hung indefinitely. The root page was still served
|
||||
(from Valkey cache) which masked the failure.
|
||||
|
||||
The fix was to patch a missing key in a Kubernetes secret. The root cause is
|
||||
that the `gitea-12.2.0` Helm chart (postgresql-ha subchart v16.2.2) does not
|
||||
populate the `pgpool-password` key in the `gitea-postgresql-ha-postgresql`
|
||||
secret, even though the pgpool pod requires it at startup.
|
||||
|
||||
---
|
||||
|
||||
## Timeline
|
||||
|
||||
| Time (UTC) | Event |
|---|---|
| ~09:45 | postgresql-0, postgresql-2 pods restarted (repmgr failover) |
| ~09:45 | pgpool pod restarted and entered CrashLoopBackOff |
| ~11:00 | User noticed Gitea login hanging; home page still loading |
| ~13:00 | Root cause identified: missing `pgpool-password` secret key |
| ~13:10 | Secret patched; pgpool pod deleted and restarted cleanly |
| ~13:15 | Gitea fully operational |
|
||||
|
||||
---
|
||||
|
||||
## Root Cause
|
||||
|
||||
The Bitnami `pgpool` container startup script reads the file
|
||||
`/opt/bitnami/pgpool/secrets/pgpool-password`, which is mounted from the
|
||||
`gitea-postgresql-ha-postgresql` Kubernetes Secret via a `subPath` volume
|
||||
mount. That secret key was never created by the Helm chart, so the file did
|
||||
not exist. The container exited immediately with no logs.
|
||||
|
||||
The pod had been running for 20 days without a restart, so this gap was
|
||||
never discovered during initial deployment.
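
A gap like this can be caught before a restart by checking the Secret for the keys the pod mounts. The following is a minimal sketch, assuming the two-space indentation that `kubectl get secret -o yaml` uses for entries under `data:`; the function name `missing_secret_keys` is hypothetical.

```bash
# Hypothetical helper: read a Secret manifest (as printed by
# `kubectl get secret <name> -o yaml`) on stdin and report any required key
# that is missing. Text-level check only: it matches "  <key>:" at the start
# of a line, so it would also match a same-named field under metadata:.
missing_secret_keys() {
    manifest=$(cat)
    missing=""
    for key in "$@"; do
        if ! printf '%s\n' "$manifest" | grep -q "^  $key:"; then
            missing="$missing $key"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing keys:$missing"
        return 1
    fi
    return 0
}
```

Usage: `sudo k3s kubectl get secret -n default gitea-postgresql-ha-postgresql -o yaml | missing_secret_keys password repmgr-password pgpool-password`.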
|
||||
|
||||
---
|
||||
|
||||
## Evidence
|
||||
|
||||
```bash
|
||||
# Secret was missing the pgpool-password key
|
||||
sudo k3s kubectl get secret -n default gitea-postgresql-ha-postgresql -o yaml
|
||||
# data: keys were password, postgres-password, repmgr-password only
|
||||
# pgpool-password was absent
|
||||
|
||||
# pgpool pod describe showed 824 back-off restarts over 173 minutes
|
||||
# No logs in either current or --previous output
|
||||
sudo k3s kubectl logs -n default <pgpool-pod> --previous
|
||||
# (empty)
|
||||
|
||||
# Gitea process had zero TCP connections to PostgreSQL port 5432
|
||||
# but many connections to Valkey port 6379
|
||||
cat /proc/<gitea-pid>/net/tcp | grep 1538 # 1538 = 5432 hex — no results
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Immediate Fix Applied
|
||||
|
||||
```bash
|
||||
# Add the missing key (value = sr-check-password = changeme4 = base64: Y2hhbmdlbWU0)
|
||||
sudo k3s kubectl patch secret -n default gitea-postgresql-ha-postgresql \
|
||||
--type='json' \
|
||||
-p='[{"op":"add","path":"/data/pgpool-password","value":"Y2hhbmdlbWU0"}]'
|
||||
|
||||
# Restart pgpool
|
||||
sudo k3s kubectl delete pod -n default <pgpool-pod-name>
|
||||
```
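
Values under a Secret's `data:` field must be base64-encoded, which is where the `Y2hhbmdlbWU0` string in the patch comes from. A small sketch of the encoding step follows; the function name `encode_secret_value` is hypothetical.

```bash
# Hypothetical helper: base64-encode a secret value for a JSON patch.
# printf is used instead of echo so that no trailing newline is encoded
# into the value.
encode_secret_value() {
    printf '%s' "$1" | base64
}

encode_secret_value changeme4   # prints Y2hhbmdlbWU0
```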
|
||||
|
||||
---
|
||||
|
||||
## Risk: Fix Will Be Lost on helm upgrade
|
||||
|
||||
The patched secret is managed by Helm (annotation:
|
||||
`meta.helm.sh/release-name: gitea`). A `helm upgrade` will regenerate the
|
||||
secret from the chart template, which does not include `pgpool-password`,
|
||||
and the bug will recur.
|
||||
|
||||
---
|
||||
|
||||
## Tasks
|
||||
|
||||
### T01 — Add pgpool-password to Helm values
|
||||
|
||||
```task
|
||||
id: T01
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "6841c93a-f146-47eb-9f7c-8fa0e02c1bbc"
|
||||
```
|
||||
|
||||
Create or update `helm/gitea-values.yaml` (or equivalent) to permanently
|
||||
include the pgpool-password so it survives `helm upgrade`:
|
||||
|
||||
```yaml
postgresql-ha:
  postgresql:
    pgpoolPassword: <value matching sr-check-password>
```
|
||||
|
||||
**Done when:** `helm upgrade gitea` completes and pgpool starts cleanly
|
||||
without manual secret patching.
|
||||
|
||||
---
|
||||
|
||||
### T02 — Add pgpool health check to smoke test
|
||||
|
||||
```task
|
||||
id: T02
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "ab166073-30a7-4702-a037-4091e8706e20"
|
||||
```
|
||||
|
||||
Extend `tests/smoke_kube.sh` to assert:
|
||||
|
||||
```bash
# All postgresql-ha pods Running (fail if any pod line lacks "Running")
if kubectl get pods -n default | grep gitea-postgresql-ha | grep -v Running; then
  exit 1
fi

# pgpool specifically not in CrashLoopBackOff
# (match case-insensitively: the reported reason is "CrashLoopBackOff")
if kubectl get pod -n default -l app.kubernetes.io/component=pgpool \
  -o jsonpath='{.items[0].status.containerStatuses[0].state}' | grep -qi crashloopbackoff; then
  exit 1
fi
```
|
||||
|
||||
**Done when:** the smoke test catches a pgpool failure within 5 minutes.
|
||||
|
||||
---
|
||||
|
||||
### T03 — Add HA failover test
|
||||
|
||||
```task
|
||||
id: T03
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "140da396-8e30-4f4d-b88c-c42c0cd46c01"
|
||||
```
|
||||
|
||||
Create `tests/test_ha_failover.sh` that:
|
||||
|
||||
1. Records Gitea login response time (baseline)
|
||||
2. Kills the primary PostgreSQL pod: `kubectl delete pod gitea-postgresql-ha-postgresql-0 -n default`
|
||||
3. Waits for repmgr to promote a replica (max 60s)
|
||||
4. Asserts Gitea login POST still succeeds within 10s
|
||||
5. Asserts pgpool pod is Running (not CrashLoopBackOff)
|
||||
6. Asserts all postgresql pods return to Running
|
||||
|
||||
This test must pass before any PostgreSQL HA deployment is considered done.
|
||||
|
||||
**Done when:** script exits 0 against a live cluster.
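
Steps 3, 5 and 6 all amount to polling a condition with a deadline. One way to sketch that is a generic retry helper; the name `wait_until` and the one-second polling interval are assumptions, not part of the existing test suite.

```bash
# Hypothetical helper: retry a command once per second until it succeeds or
# the timeout (in seconds) expires. Returns 0 on success, 1 on timeout.
wait_until() {
    timeout="$1"
    shift
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if "$@" > /dev/null 2>&1; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

Usage: `wait_until 60 pgpool_is_running || exit 1`, where `pgpool_is_running` stands for a check function the test script would define itself.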
|
||||
|
||||
---
|
||||
|
||||
### T04 — Document the incident in docs/
|
||||
|
||||
```task
|
||||
id: T04
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "d8a3ba40-fda0-4c1f-a9f1-ffcd621a5b3d"
|
||||
```
|
||||
|
||||
Add `docs/incidents/2026-03-10-pgpool-missing-secret.md` with the full
|
||||
timeline, root cause, and fix, so future operators understand what happened
|
||||
and how to recover.
|
||||
|
||||
**Done when:** doc committed and linked from `docs/README.md`.
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- Bitnami postgresql-ha chart v16.2.2
|
||||
- Gitea Helm chart v12.2.0
|
||||
- Related decision: D3 (HA testing policy) in `DECISIONS.md`
|
||||
273
workplans/archived/260622-RAIL-BS-WP-0004-safety-net.md
Normal file
@@ -0,0 +1,273 @@
|
||||
---
|
||||
id: RAIL-BS-WP-0004
|
||||
type: workplan
|
||||
title: "Integrated Backup — S2 Kubernetes Runtime Layer"
|
||||
domain: financials
|
||||
repo: railiance-cluster
|
||||
status: finished
|
||||
owner: tegwick
|
||||
topic_slug: railiance
|
||||
state_hub_workstream_id: "7e8b0c20-51eb-40c9-9e3b-85dd380d7625"
|
||||
created: "2026-02-25"
|
||||
updated: "2026-03-26"
|
||||
---
|
||||
|
||||
# Integrated Backup — S2 Kubernetes Runtime Layer
|
||||
|
||||
## Goal
|
||||
|
||||
Implement the Q3 (Operability & Resilience) integrated backup for
|
||||
railiance-cluster (S2). Backs up what S2 owns — the Kubernetes runtime state —
|
||||
encrypted with age, written to a local directory on the server. No external
|
||||
dependencies required.
|
||||
|
||||
## Architecture (Decision D4)
|
||||
|
||||
Each railiance repo implements its own backup for what it owns. No central
|
||||
backup service. See `DECISIONS.md` D4 for full rationale.
|
||||
|
||||
**Standard interface every railiance repo must provide:**
|
||||
|
||||
```bash
|
||||
make backup # encrypt + write to /opt/backup/railiance/<layer>/
|
||||
make restore # restore from most recent local backup
|
||||
```
|
||||
|
||||
Encryption: age, same key pair as SOPS secrets (`.sops.yaml` public key).
|
||||
Output: `/opt/backup/railiance/cluster/` on the server.
|
||||
|
||||
## What S2 (railiance-cluster) owns and must back up
|
||||
|
||||
| Asset | Why it matters |
|---|---|
| k3s etcd snapshots | Full cluster state — all workloads, configs, secrets |
| Helm release values | Runtime values not in git (any manually applied overrides) |
| kubeconfig | Admin access to the cluster |
|
||||
|
||||
**Not S2's responsibility:**
|
||||
- Custodian State Hub DB → the-custodian owns this
|
||||
- Operator workstation config (`.claude/`, `.gitconfig`) → operator's own concern
|
||||
- Application data (Gitea repos, uploads) → S5 (railiance-apps) owns this
|
||||
- PostgreSQL data volumes → S3 (railiance-platform) owns this
|
||||
|
||||
## Encryption
|
||||
|
||||
Reuse the age public key from `.sops.yaml`:
|
||||
|
||||
```bash
# Take the first age recipient (age1...) found in .sops.yaml, whatever the YAML layout
AGE_PUBLIC_KEY=$(grep -oE 'age1[0-9a-z]+' .sops.yaml | head -n 1)
tar -czf - <assets> | age -r "${AGE_PUBLIC_KEY}" -o backup.tar.gz.age
```
|
||||
|
||||
Decryption requires the private key at `~/.config/sops/age/keys.txt`
|
||||
(same key used for `sops -d`). No additional key management needed.
|
||||
|
||||
## Extension Point EP-RAIL-005
|
||||
|
||||
Once all five OAS layers implement this interface, the custodian can
|
||||
orchestrate a full-stack backup with:
|
||||
|
||||
```bash
for repo in railiance-infra railiance-cluster railiance-platform \
            railiance-enablement railiance-apps; do
  make -C ~/"$repo" backup
done
```
|
||||
|
||||
No special protocol needed — just the standard interface.
|
||||
|
||||
---
|
||||
|
||||
## Tasks
|
||||
|
||||
### T01 — Define backup directory and encryption wrapper
|
||||
|
||||
```task
|
||||
id: T01
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "4526a842-ea31-4874-9231-92ab556cfe7b"
|
||||
```
|
||||
|
||||
Create `tools/cmd/railiance-backup-s2` (replacing the old `railiance-backup`):
|
||||
|
||||
- Backup dir: `/opt/backup/railiance/cluster/` (create with `mkdir -p`)
|
||||
- Encrypt each artifact with age using public key from `.sops.yaml`
|
||||
- Write timestamp-named files: `etcd-<ts>.snap.age`, `helm-values-<ts>.tar.gz.age`, `kubeconfig-<ts>.yaml.age`
|
||||
- Keep last 7 of each type
|
||||
- Write `.last-backup` stamp
|
||||
- Exit 0 on success, non-zero on any failure
|
||||
- No network required
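
The "keep last 7 of each type" rule could be sketched as below. This is illustrative only: the function name `prune_old_backups` is hypothetical, and it assumes the `<ts>` part of each file name is a UTC timestamp that sorts lexicographically in chronological order (for example `20260301T020000Z`).

```bash
# Hypothetical helper: keep the newest N files matching <dir>/<prefix>-*.age
# and delete the rest. Relies on the timestamp in the file name sorting
# lexicographically in chronological order.
prune_old_backups() {
    dir="$1"
    prefix="$2"
    keep="$3"
    ls "$dir"/"$prefix"-*.age 2>/dev/null | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
        rm -f -- "$old"
    done
}
```

Usage: `prune_old_backups /opt/backup/railiance/cluster etcd 7`, repeated for `helm-values` and `kubeconfig`.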
|
||||
|
||||
Also remove the old `tools/cmd/railiance-backup` (backed up Docker-based
|
||||
custodian DB — wrong scope, not applicable to this server).
|
||||
|
||||
**Done when:** `make backup` runs on COULOMBCORE without error and files
|
||||
appear in `/opt/backup/railiance/cluster/`.
|
||||
|
||||
---
|
||||
|
||||
### T02 — Back up k3s state (SQLite hot backup)
|
||||
|
||||
```task
|
||||
id: T02
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "a6313e06-1976-46a7-8e31-df4eb2eca880"
|
||||
```
|
||||
|
||||
k3s has built-in etcd snapshot support:
|
||||
|
||||
```bash
|
||||
sudo k3s etcd-snapshot save --name railiance-$(date -u +%Y%m%dT%H%M%SZ)
|
||||
# Default location: /var/lib/rancher/k3s/server/db/snapshots/
|
||||
```
|
||||
|
||||
Add to the backup script: take a fresh snapshot, encrypt with age,
|
||||
copy to `/opt/backup/railiance/cluster/`.
|
||||
|
||||
> **Note — verify etcd is in use before implementing:**
|
||||
> `k3s etcd-snapshot` only works if k3s was started with `--cluster-init`.
|
||||
> Without it, k3s uses SQLite and this command will fail.
|
||||
> Verify first: `sudo k3s etcd-snapshot ls 2>&1`
|
||||
|
||||
> **Note — sudo required:** etcd snapshot requires root. See T06 for how
|
||||
> this is resolved (backup runs under root's crontab).
|
||||
|
||||
**Done when:** backup includes a current etcd snapshot.
|
||||
|
||||
---
|
||||
|
||||
### T03 — Back up Helm release values
|
||||
|
||||
```task
|
||||
id: T03
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "05d42a55-921f-4aa7-bb76-e8af9c7e0ac3"
|
||||
```
|
||||
|
||||
Capture current runtime Helm values for all releases:
|
||||
|
||||
```bash
# List every release as "<name> <namespace>", then dump its user-supplied values
KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm list -A -o json | \
  jq -r '.[] | .name + " " + .namespace' | \
  while read -r name ns; do
    KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm get values "$name" -n "$ns" -o yaml
  done
```
|
||||
|
||||
Tar and age-encrypt into `helm-values-<ts>.tar.gz.age`.
|
||||
|
||||
> **Note — kubeconfig permissions:** `/etc/rancher/k3s/k3s.yaml` is root-readable
|
||||
> only by default. The backup script must either run as root (see T06) or k3s
|
||||
> must be configured with `--write-kubeconfig-mode=644`. Running as root
|
||||
> (via root crontab) is the chosen approach — no config change needed.
|
||||
|
||||
**Done when:** backup includes a snapshot of all Helm release values.
|
||||
|
||||
---
|
||||
|
||||
### T04 — Back up kubeconfig
|
||||
|
||||
```task
|
||||
id: T04
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "08233868-d522-4117-bc4e-6c0f52545665"
|
||||
```
|
||||
|
||||
Age-encrypt `~/.kube/config-hosteurope` (or `/etc/rancher/k3s/k3s.yaml`)
|
||||
into `kubeconfig-<ts>.yaml.age` in the backup directory.
|
||||
|
||||
**Done when:** backup includes the encrypted kubeconfig.
|
||||
|
||||
---
|
||||
|
||||
### T05 — make restore target
|
||||
|
||||
```task
|
||||
id: T05
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "2d5acff7-4a4e-4ddd-ad06-08237ad3dac8"
|
||||
```
|
||||
|
||||
Add `tools/cmd/railiance-restore-s2` that decrypts and lists available
|
||||
backups, with guided restore for the etcd snapshot case.
|
||||
|
||||
Restore of etcd from snapshot:
|
||||
```bash
|
||||
sudo k3s server --cluster-reset \
|
||||
--cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/<name>
|
||||
```
|
||||
|
||||
**Done when:** `make restore` prints available backups and a restore guide.
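
The "prints available backups" part might look like the sketch below. The function name `show_latest_backups` is hypothetical; the three prefixes are the artifact names defined in T01.

```bash
# Hypothetical helper: print the newest encrypted backup of each artifact type
# and fail if any type has no backup at all.
show_latest_backups() {
    dir="$1"
    rc=0
    for prefix in etcd helm-values kubeconfig; do
        latest=$(ls "$dir"/"$prefix"-*.age 2>/dev/null | sort -r | head -n 1)
        if [ -n "$latest" ]; then
            echo "$prefix: $latest"
        else
            echo "$prefix: no backup found"
            rc=1
        fi
    done
    return "$rc"
}
```

Usage: `show_latest_backups /opt/backup/railiance/cluster`. A non-zero exit status signals that at least one artifact type has never been backed up.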
|
||||
|
||||
---
|
||||
|
||||
### T06 — Install cron job and run restore drill
|
||||
|
||||
```task
|
||||
id: T06
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "f8e4a094-c367-40eb-b895-da17bc144b07"
|
||||
```
|
||||
|
||||
#### Solving the sudo problem
|
||||
|
||||
The backup script needs root for two reasons:
|
||||
- `k3s etcd-snapshot save` requires root
|
||||
- `/etc/rancher/k3s/k3s.yaml` (kubeconfig) is root-readable only
|
||||
|
||||
**Solution: run the cron under root's crontab.**
|
||||
|
||||
This is the correct pattern for system-level backup jobs. It avoids a
|
||||
proliferating sudoers whitelist (one entry per command, brittle to maintain)
|
||||
and matches how tools like `rsnapshot`, `bacula`, and `borgbackup` work in
|
||||
production. The backup writes to `/opt/backup/` which is root-owned anyway.
|
||||
|
||||
Install the cron as root:
|
||||
|
||||
```bash
|
||||
sudo crontab -e
|
||||
# Add:
|
||||
0 2 * * * make -C /home/tegwick/railiance-cluster backup >> /opt/backup/railiance/cluster/backup.log 2>&1
|
||||
```
|
||||
|
||||
Note: use the absolute path to the repo — `~` does not expand reliably in
|
||||
root's crontab unless HOME is set.
|
||||
|
||||
Verify it is installed:
|
||||
```bash
|
||||
sudo crontab -l | grep railiance
|
||||
```
|
||||
|
||||
#### Restore drill
|
||||
|
||||
Once T01–T04 are done, run a decrypt-and-verify drill:
|
||||
|
||||
```bash
# Decrypt the newest etcd snapshot and verify it is a valid snapshot file
latest=$(ls /opt/backup/railiance/cluster/etcd-*.snap.age | sort -r | head -1)
sudo age -d -i ~/.config/sops/age/keys.txt "$latest" | file -

# Record the drill (tee under sudo, because the backup directory is root-owned)
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) restore drill OK" \
  | sudo tee -a /opt/backup/railiance/cluster/restore-drill.log > /dev/null
```
|
||||
|
||||
**Done when:** cron installed under root, drill completes without error,
|
||||
log entry written.
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- Decision D4: Integrated backup per capability (`DECISIONS.md`)
|
||||
- Decision D2: Nextcloud as optional offsite extension (still valid, not a requirement)
|
||||
- OAS Q3: Operability & Resilience
|
||||
- Extension point EP-RAIL-005: Custodian full-stack backup orchestration
|
||||
- k3s etcd snapshots: https://docs.k3s.io/datastore/backup-restore
|
||||
143
workplans/archived/260622-RAIL-BS-WP-0005-kubeconfig-delivery.md
Normal file
@@ -0,0 +1,143 @@
|
||||
---
|
||||
id: RAIL-BS-WP-0005
|
||||
type: workplan
|
||||
title: "Kubeconfig delivery for netkingdom SSO/MFA stack apply"
|
||||
domain: financials
|
||||
repo: railiance-cluster
|
||||
status: finished
|
||||
owner: railiance-worker
|
||||
topic_slug: railiance
|
||||
capability_request_id: "34b97d89-e80a-42ae-a623-a9185e5b17f5"
|
||||
created: "2026-03-20"
|
||||
updated: "2026-03-20"
|
||||
state_hub_workstream_id: "b236de41-2f33-4ebc-bb84-5fcedb2982f8"
|
||||
---
|
||||
|
||||
# RAIL-BS-WP-0005 — Kubeconfig delivery for netkingdom SSO/MFA stack apply
|
||||
|
||||
**Scope:** Fulfil capability request 34b97d89 — deliver a working local kubeconfig so
|
||||
the netkingdom SSO/MFA workstream (NK-WP-0001) can apply manifests (T02–T08) against
|
||||
the existing K3s cluster on HostEurope (92.205.130.254).
|
||||
|
||||
**Context:**
|
||||
- Cluster is healthy: one node `Ready`, k3s v1.30.3, 200 days uptime.
|
||||
- K3s API listens on `*:6443` (all interfaces); UFW is inactive — direct public access works.
|
||||
- The in-cluster kubeconfig uses `server: https://127.0.0.1:6443`; must be rewritten
|
||||
to `https://92.205.130.254:6443` for off-server use.
|
||||
- No ops-bridge tunnel needed for kubectl (API is directly reachable).
|
||||
- Wrong catalog entry was filed (PostgreSQL HA instead of k3s provisioning) — noted,
|
||||
no API endpoint to correct it retroactively; document here.
|
||||
|
||||
**Depends on:** RAIL-BS-WP-0002 (k3s-kubernetes-baseline) ✓ completed
|
||||
**Unblocks:** NK-WP-0001 T02–T08 (SSO/MFA stack apply)
|
||||
|
||||
---
|
||||
|
||||
## Task: Extract kubeconfig from HostEurope server
|
||||
|
||||
```task
|
||||
id: RAIL-BS-WP-0005-T01
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "c59a8e0c-e1fd-4cfd-aa5e-7cbb895609f0"
|
||||
```
|
||||
|
||||
```bash
|
||||
ssh -i ~/.ssh/id_ops tegwick@92.205.130.254 \
|
||||
"sudo cat /etc/rancher/k3s/k3s.yaml" > /tmp/k3s-raw.yaml
|
||||
```
|
||||
|
||||
Verify file is non-empty and contains a valid YAML kubeconfig.
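
The verification can be sketched as a cheap text-level check. It is worth doing because a failed `sudo cat` over SSH can leave an empty file or an error message in `/tmp/k3s-raw.yaml`. The function name `looks_like_kubeconfig` is hypothetical, and the check assumes the usual top-level fields of a kubeconfig (`kind: Config`, `clusters:`, `users:`, `contexts:`).

```bash
# Hypothetical helper: cheap sanity check that a file looks like a kubeconfig.
# Text-level check only; it does not parse YAML or contact the API server.
looks_like_kubeconfig() {
    file="$1"
    [ -s "$file" ] || { echo "empty or missing: $file" >&2; return 1; }
    for field in 'kind: Config' 'clusters:' 'users:' 'contexts:'; do
        if ! grep -q "^$field" "$file"; then
            echo "missing field: $field" >&2
            return 1
        fi
    done
    return 0
}
```

Usage: `looks_like_kubeconfig /tmp/k3s-raw.yaml || exit 1` before rewriting the server address.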
|
||||
|
||||
---
|
||||
|
||||
## Task: Rewrite server address and install kubeconfig
|
||||
|
||||
```task
|
||||
id: RAIL-BS-WP-0005-T02
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "93d61bc6-47e7-442f-8611-97f5f2f208c4"
|
||||
```
|
||||
|
||||
Replace `127.0.0.1` with `92.205.130.254` in the kubeconfig; place at
|
||||
`~/.kube/config` (create `~/.kube/` if absent). Back up any existing config first.
|
||||
|
||||
```bash
|
||||
mkdir -p ~/.kube
|
||||
# back up existing if present
|
||||
[ -f ~/.kube/config ] && cp ~/.kube/config ~/.kube/config.bak.$(date +%Y%m%d)
|
||||
# rewrite server and install
|
||||
sed 's|https://127.0.0.1:6443|https://92.205.130.254:6443|g' /tmp/k3s-raw.yaml \
|
||||
> ~/.kube/config
|
||||
chmod 600 ~/.kube/config
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task: Smoke-test kubectl from local machine
|
||||
|
||||
```task
|
||||
id: RAIL-BS-WP-0005-T03
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "f15626c2-73a0-443f-8aae-5515806ae0fa"
|
||||
```
|
||||
|
||||
```bash
|
||||
kubectl get nodes
|
||||
kubectl get pods -A
|
||||
```
|
||||
|
||||
Expected: node `254.130.205.92.host.secureserver.net` in `Ready` state.
|
||||
If unreachable, check firewall on server: `ssh -i ~/.ssh/id_ops tegwick@92.205.130.254 "sudo ufw status"`.
|
||||
|
||||
---
|
||||
|
||||
## Task: Resolve capability request
|
||||
|
||||
```task
id: RAIL-BS-WP-0005-T04
status: done
priority: high
state_hub_task_id: "8109450c-95df-4d01-96fd-8847c88beb34"
```

Patch capability request 34b97d89 to `completed` with a resolution note:

```bash
curl -s -X PATCH "http://127.0.0.1:8000/capability-requests/34b97d89-e80a-42ae-a623-a9185e5b17f5/status" \
  -H "Content-Type: application/json" \
  -d '{
    "status": "completed",
    "note": "Kubeconfig delivered to ~/.kube/config (server: 92.205.130.254:6443). kubectl smoke-test passed. NK-WP-0001 T02-T08 can proceed. Note: wrong catalog_entry_id filed (PostgreSQL HA eca6e5cc instead of k3s provisioning 9520cc98) — no retroactive API to correct."
  }'
```

---

## Task: Register UFW-inactive finding as technical debt

```task
id: RAIL-BS-WP-0005-T05
status: done
priority: medium
state_hub_task_id: "ea120464-fdeb-4259-99e1-e6743cd86797"
```

UFW is inactive on 92.205.130.254, so K3s API port 6443 is exposed to the
internet, protected only by TLS mutual auth. Register this as a TD item in
state-hub so it gets addressed in a future railiance-cluster security hardening
workplan.

```bash
curl -s -X POST "http://127.0.0.1:8000/technical-debt/" \
  -H "Content-Type: application/json" \
  -d '{
    "domain": "railiance",
    "debt_type": "security",
    "severity": "medium",
    "title": "UFW inactive on HostEurope K3s node — API port 6443 exposed to internet",
    "description": "UFW is inactive on 92.205.130.254. K3s API (port 6443) is reachable from anywhere, protected only by TLS client certificates. Should be restricted to known IPs or tunnelled. Discovered 2026-03-20 during kubeconfig delivery workplan.",
    "status": "open"
  }'
```
@@ -0,0 +1,110 @@
---
id: RAILIANCE-WP-0012
type: workplan
title: "activity-core cluster-owned deploy/verify"
domain: financials
repo: railiance-cluster
status: finished
owner: codex
topic_slug: railiance
created: "2026-06-15"
updated: "2026-06-16"
state_hub_workstream_id: "6434f7cb-e13c-4c05-839b-197bb239d5cd"
---

# activity-core cluster-owned deploy/verify

## Context

activity-core `ACTIVITY-WP-0007-T06` needs live Railiance cluster evidence for
the disabled ops inventory probe. That live verification should be owned by the
cluster/operator layer, not by arbitrary activity-core sessions with local
`kubectl` assumptions.

This workplan creates a cluster-owned path that keeps credentials in
operator-owned locations while returning only non-secret evidence to State Hub.

## Implement cluster-owned verifier

```task
id: RAILIANCE-WP-0012-T01
status: done
priority: high
state_hub_task_id: "3769fdfb-b4f1-431b-a55a-672d93b3ea55"
```

Add a repeatable command that:

- reconciles the activity-core Railiance runtime bundle;
- reruns `actcore-sync`;
- checks the `ops-service-inventory-probes` ActivityDefinition exists and is
  still disabled;
- triggers the disabled definition manually through the in-cluster API path;
- verifies a fresh `ops_inventory_probe` progress event exists in State Hub;
- posts a non-secret State Hub evidence note for activity-core to cite.

Implemented as `tools/cmd/railiance-verify-activity-core` with Makefile target
`verify-activity-core`. The script defaults to the `railiance01` SSH executor;
use `ACTIVITY_CORE_CLUSTER_HOST=local` only for an explicitly selected local
`kubectl` context.

## Run live verification and publish evidence

```task
id: RAILIANCE-WP-0012-T02
status: done
priority: high
state_hub_task_id: "6d7f87c3-a533-4de1-84de-9ca65f2e2779"
```

Run `make verify-activity-core` against the Railiance cluster. On success, cite
the State Hub evidence note id in this task and in activity-core
`ACTIVITY-WP-0007-T06`.

If a gate fails, the verifier must still post a non-secret State Hub note with
the failing gate and last completed evidence fields.

2026-06-15: Completed against Railiance01 after refreshing the same-tag
`activity-core:railiance01-prod` image from activity-core commit `ab17378`,
importing digest `sha256:cff43c72455b9fc4fc11a0a997b4671a38987bb4583a600245dd961965af0e40`
into k3s containerd, syncing the current runtime bundle to
`/home/tegwick/activity-core/k8s/railiance`, and restarting the activity-core
runtime deployments. The verifier reconciled the runtime bundle, completed
`actcore-sync`, confirmed `ops-service-inventory-probes` exists and remains
disabled, triggered it manually, verified State Hub progress
`4c82360d-33e7-455b-8ab4-33facd4a3f8e`, and posted evidence note
`baeeaeac-aa6d-4406-ae64-e54577f21386`.

An intermediate verifier invocation accidentally targeted the local
CoulombCore `kubectl` context. It created only `actcore-*` runtime resources in
the existing `activity-core` namespace; those resources were removed with the
runtime manifest cleanup, and the pre-existing `llm-connect` deployment remains
running.

Operational cleanup note: the successful Railiance01 verifier run used
`ACTIVITY_CORE_RESTART_DEPLOYMENTS=1` after importing the same-tag image. The
script was corrected afterward to restart only `actcore-api`,
`actcore-worker`, and `actcore-event-router`, because
`actcore-state-hub-bridge` uses host networking and a rolling restart leaves a
new bridge pod pending behind the host-bound running pod. A 2026-06-16 cleanup
check showed the bridge rollout had settled on Railiance01: the host-bound
bridge pod was running and the replacement ReplicaSet was scaled to zero, so no
manual live cleanup was needed.

## Handoff closure to activity-core

```task
id: RAILIANCE-WP-0012-T03
status: done
priority: medium
state_hub_task_id: "43f652c6-fcc4-49fa-90cc-4122eb6d5321"
```

After live evidence exists, update activity-core `ACTIVITY-WP-0007-T06` to cite
the Railiance evidence and close it if Inter-Hub submission is active or
explicitly deferred with the clean State Hub fallback result.

2026-06-15: Updated activity-core `ACTIVITY-WP-0007-T06` to cite Railiance
evidence note `baeeaeac-aa6d-4406-ae64-e54577f21386` and close the task with
Inter-Hub submission explicitly deferred while the State Hub fallback evidence
path is verified.
@@ -0,0 +1,120 @@
---
id: RAILIANCE-WP-0013
type: workplan
title: "activity-core verifier evidence hardening"
domain: financials
repo: railiance-cluster
status: finished
owner: codex
topic_slug: railiance
created: "2026-06-16"
updated: "2026-06-16"
state_hub_workstream_id: "a3abb83a-2d42-40f9-a5f6-1dbc36903436"
---

# activity-core verifier evidence hardening

## Context

`RAILIANCE-WP-0012` moved activity-core live deploy/verify ownership into
`railiance-cluster` and produced State Hub evidence
`baeeaeac-aa6d-4406-ae64-e54577f21386`, with `ops_inventory_probe` progress
`4c82360d-33e7-455b-8ab4-33facd4a3f8e`.

A follow-up review found hardening work that matters for routine verifier use:
the verifier should prove the State Hub progress event belongs to the specific
manual trigger it launched, evidence should include an immutable runtime
identity, and local `kubectl` mode should require an explicit double opt-in.

This is a hardening follow-up only; it does not reopen activity-core
`ACTIVITY-WP-0007-T06`.

## Correlate State Hub progress to the manual trigger

```task
id: RAILIANCE-WP-0013-T01
status: done
priority: high
state_hub_task_id: "d013a4a9-77fc-4cf0-babf-528d71acc0a1"
```

Update `tools/cmd/railiance-verify-activity-core` so after
`POST /activity-definitions/<id>/trigger` it parses `trigger_key`, derives the
expected activity-core manual `run_id`, and polls State Hub until it finds
`ops_inventory_probe` where:

- `detail.activity_id == DEFINITION_ID`;
- `detail.activity_core_run_id == expected_run_id`.

The verifier must not pass on merely any event created after `STARTED_AT`.
Include the expected run id and matched progress id in the evidence note.

2026-06-16: Implemented exact correlation. The verifier now derives the
expected UUIDv5 `activity_core_run_id` from `<DEFINITION_ID>:<trigger_key>` and
requires State Hub `ops_inventory_probe` detail to match both `activity_id` and
`activity_core_run_id`.
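To illustrate the derivation step, here is a minimal sketch of computing a name-based UUIDv5 (RFC 4122) in plain shell with `sha1sum`. The workplan does not state which namespace UUID activity-core uses, so `NAMESPACE_UUID` below is a placeholder assumption (the RFC 4122 URL namespace), and `uuid5` is a hypothetical helper rather than the verifier's actual code.

```bash
# ASSUMPTION: placeholder namespace (RFC 4122 URL namespace). The real
# namespace used by activity-core is not given in this workplan.
NAMESPACE_UUID="${NAMESPACE_UUID:-6ba7b811-9dad-11d1-80b4-00c04fd430c8}"

# uuid5 NAMESPACE NAME: print the version-5 UUID for NAME in NAMESPACE.
uuid5() {
  local ns_hex name hash i byte v variant
  ns_hex="$(printf '%s' "$1" | tr -d '-')"
  name="$2"
  hash="$(
    {
      # Emit the 16 raw namespace bytes, then the name, and hash the stream.
      i=1
      while [ "$i" -le 31 ]; do
        byte="$(printf '%s' "$ns_hex" | cut -c"$i"-"$((i + 1))")"
        printf "\\$(printf '%03o' "0x$byte")"
        i=$((i + 2))
      done
      printf '%s' "$name"
    } | sha1sum | cut -c1-32
  )"
  # Variant bits 10xx go in the high nibble of octet 8; version nibble is 5.
  v="$(printf '%s' "$hash" | cut -c17)"
  variant="$(printf '%x' $(( (0x$v & 3) | 8 )))"
  printf '%s-%s-5%s-%s%s-%s\n' \
    "$(printf '%s' "$hash" | cut -c1-8)" \
    "$(printf '%s' "$hash" | cut -c9-12)" \
    "$(printf '%s' "$hash" | cut -c14-16)" \
    "$variant" \
    "$(printf '%s' "$hash" | cut -c18-20)" \
    "$(printf '%s' "$hash" | cut -c21-32)"
}
```

With the real namespace substituted, the expected run id would be `uuid5 "$NAMESPACE_UUID" "${DEFINITION_ID}:${trigger_key}"`, and the poll loop would accept only a progress event whose `detail.activity_core_run_id` equals that value.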

## Record immutable runtime evidence

```task
id: RAILIANCE-WP-0013-T02
status: done
priority: medium
state_hub_task_id: "c5780ec1-9a74-401e-b60e-a0fdf2b7e5d2"
```

Ensure successful evidence includes either `activity_core_revision` or an
immutable Kubernetes image ID/digest. When the remote repo revision is
unavailable, fall back to the live `actcore-api` pod container `imageID`.

2026-06-16: Implemented `api_image_id` capture from the live `actcore-api` pod
container status and added a guard so passed evidence must include either the
remote repo revision or the immutable image ID.
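The fallback and guard described above might look roughly like this. It is a sketch under assumptions: the label selector `app=actcore-api` is a guess, and both function names are hypothetical. Only the `activity-core` namespace and the `imageID` field come from this workplan.

```bash
# ASSUMPTION: the pod is selectable with app=actcore-api; adjust as needed.
get_api_image_id() {
  kubectl -n activity-core get pods -l app=actcore-api \
    -o jsonpath='{.items[0].status.containerStatuses[0].imageID}'
}

# Passed evidence must carry a repo revision or an immutable image ID.
require_immutable_identity() {
  local revision="$1" image_id="$2"
  if [ -n "$revision" ] || [ -n "$image_id" ]; then
    return 0
  fi
  echo "evidence lacks both activity_core_revision and api_image_id" >&2
  return 1
}
```

A verifier could call `require_immutable_identity "$activity_core_revision" "$(get_api_image_id)"` just before posting a passed note, so a run with neither identity fails rather than publishing evidence that cannot be tied to a specific build.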

## Guard explicit local kubectl override

```task
id: RAILIANCE-WP-0013-T03
status: done
priority: medium
state_hub_task_id: "0d60809f-3f1d-4ea9-a96f-af074911acc0"
```

Keep `railiance01`/SSH as the default executor. If
`ACTIVITY_CORE_CLUSTER_HOST=local` is selected, require an additional explicit
opt-in such as `ACTIVITY_CORE_ALLOW_LOCAL_KUBECTL=1` and print the current
`kubectl` context before continuing.

2026-06-16: Implemented the double opt-in. `ACTIVITY_CORE_CLUSTER_HOST=local`
now exits before cluster access unless `ACTIVITY_CORE_ALLOW_LOCAL_KUBECTL=1` is
also set, and accepted local mode prints the current `kubectl` context.
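As an illustration of the double opt-in, a guard along these lines would refuse local mode unless both variables are set. The two environment variable names are taken from this workplan. The function name `guard_cluster_host` and the exact messages are assumptions, not the script's real implementation.

```bash
# Hypothetical guard: local kubectl mode needs two explicit opt-ins.
guard_cluster_host() {
  local host="${ACTIVITY_CORE_CLUSTER_HOST:-railiance01}"
  if [ "$host" != "local" ]; then
    return 0   # default SSH executor path; nothing to confirm
  fi
  if [ "${ACTIVITY_CORE_ALLOW_LOCAL_KUBECTL:-0}" != "1" ]; then
    echo "refusing local kubectl: also set ACTIVITY_CORE_ALLOW_LOCAL_KUBECTL=1" >&2
    return 1
  fi
  # Show which context is about to be used before any cluster access.
  if command -v kubectl >/dev/null 2>&1; then
    echo "local kubectl context: $(kubectl config current-context 2>/dev/null)" >&2
  fi
  return 0
}
```

Calling `guard_cluster_host || exit 1` at the top of the script, before any `kubectl` or SSH call, keeps an accidental `ACTIVITY_CORE_CLUSTER_HOST=local` from reaching whatever cluster the local context happens to point at.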

## Verify and publish hardening evidence

```task
id: RAILIANCE-WP-0013-T04
status: done
priority: medium
state_hub_task_id: "150e4fa3-800c-4997-baaa-da696f5a0fc0"
```

Run `bash -n tools/cmd/railiance-verify-activity-core`, run
`make verify-activity-core` against Railiance01, confirm the evidence note
matched the manual trigger run id, and post a non-secret State Hub note citing
the new evidence.

2026-06-16: Verified with `bash -n tools/cmd/railiance-verify-activity-core`
and a live Railiance01 `make verify-activity-core` run. The verifier posted
State Hub evidence note `60256e9a-9d1b-44db-8999-738cf03bca2e`, matched manual
run id `90e3b112-d1e3-51af-8fb2-cb61f26add17`, matched
`ops_inventory_probe` progress `db408146-0310-4ac3-ac77-f73c5a41e070`, and
included `api_image_id`
`sha256:5ff92a8217c450ae06075d00862b6e2a92a83ca09eea18b5a5e96b5d2d728b35`.

Done when:

- the verifier rejects unrelated fresh `ops_inventory_probe` events;
- evidence includes a non-null revision or image digest;
- local `kubectl` mode requires explicit double opt-in;
- the Railiance01 verifier run posts a passed evidence note with matched run id;
- `make fix-consistency REPO=railiance-cluster` has synced the workplan.
@@ -0,0 +1,258 @@
---
id: RAIL-BS-WP-0006
type: workplan
title: "Staged Promotion Lifecycle"
domain: financials
repo: railiance-cluster
status: finished
owner: railiance
topic_slug: railiance
repo_goal_id: "6ea441f7-7fe3-4598-922b-38baf20c0580"
state_hub_workstream_id: "cb72d3ba-1863-43c2-a2a5-49ac75fc2603"
created: "2026-02-24"
updated: "2026-06-27"
---

# Staged Promotion Lifecycle

## Goal

Design and implement the three-stage deployment lifecycle as the core
Railiance application promotion pattern:

1. Stage 1: local development and validation.
2. Stage 2: canary on production infrastructure.
3. Stage 3: full production promotion with rollback.

This lifecycle should become the repeatable path for native Railiance apps and
third-party upstream applications wrapped by a Railiance overlay repo.

## Why This Belongs Before Forgejo

Forgejo will become critical production infrastructure. Before moving the
source forge itself, Railiance needs a well-defined promotion lifecycle so the
Forgejo deployment, Actions runners, package registry, and future upgrades can
move through the same staged gates as every other important workload.

## Boundary

This workplan lives in `railiance-cluster` because it defines cluster runtime
promotion mechanics and the canonical handoff between local validation,
canary deployment, and production routing.

Expected cross-repo handoffs:

- `railiance-enablement`: developer-facing CLI templates and CI workflow
  conventions.
- `railiance-platform`: shared platform dependencies used by canaries.
- `railiance-apps`: application Helm values and workload-specific promotion
  definitions.

## Tasks

### T01 - Write deployment lifecycle specification

```task
id: RAIL-BS-WP-0006-T01
status: done
priority: high
state_hub_task_id: "fbfc341f-8ccb-4950-a85d-3e59c4f5b87f"
```

Write `docs/deployment-lifecycle.md`.

The spec should define:

- Stage 1, Stage 2, and Stage 3 semantics.
- Required checks before each stage.
- Canary acceptance gates.
- Rollback expectations.
- Human approval gates for production-critical workloads.

**Done when:** the lifecycle is clear enough to apply to Forgejo as a later
production workload.

2026-06-16: Added `docs/deployment-lifecycle.md` and linked it from
`docs/README.md`. The specification defines Stage 1 local validation, Stage 2
production canary, Stage 3 production promotion, required checks and evidence,
canary acceptance gates, rollback expectations, human approval gates for
production-critical workloads, and the Forgejo readiness questions that must be
answered before cutover.

---

### T02 - Define railiance directory schema and app.toml contract

```task
id: RAIL-BS-WP-0006-T02
status: done
priority: high
state_hub_task_id: "523cf928-bb0e-4109-a172-abf029c62885"
```

Define the repository-local `railiance/` directory schema and `app.toml`
contract for native and third-party applications.

Minimum contract:

- App identity and ownership.
- Stage definitions.
- Required platform dependencies.
- Health checks and observability endpoints.
- Promotion and rollback commands.
- Secret references without plaintext secret values.

**Done when:** a repo can declare how it moves through the Railiance promotion
lifecycle without bespoke instructions.

2026-06-27: Added `docs/app-toml-contract.md`,
`schemas/railiance-app.schema.json`, and `examples/railiance/app.toml`. The v1
contract covers app identity, ownership, source/artifact policy, platform
dependencies, secret references without plaintext values, health and
observability endpoints, stage commands/checks/evidence, canary and promotion
modes, rollback strategy, and human approval gates.

---

### T03 - Overlay repo pattern and creation script

```task
id: RAIL-BS-WP-0006-T03
status: done
priority: medium
state_hub_task_id: "7cd378f2-0319-407a-9ce7-2c6d1a6d6d24"
```

Design the overlay repo pattern for third-party upstream applications and add
`create_railiance_overlay_repo.sh` or equivalent tooling.

The pattern should keep upstream code and Railiance deployment concerns cleanly
separated while still allowing reproducible promotion.

**Done when:** a third-party app can be wrapped without forking deployment
logic into the upstream repository.

2026-06-27: Added `docs/overlay-repo-pattern.md` and
`tools/create_railiance_overlay_repo.sh`, plus the
`bin/railiance create-overlay` dispatcher entry. The scaffold records upstream
identity in `railiance/upstream.toml`, generates a schema-valid
`railiance/app.toml`, stage values, a thin Helm chart, Stage 1 test script,
rollback runbook, and promotion notes without vendoring upstream code or
touching secrets.

---

### T04 - railiance run command

```task
id: RAIL-BS-WP-0006-T04
status: done
priority: high
state_hub_task_id: "95c3311b-04bb-4c83-bda3-47958217b665"
```

Implement the Stage 1 `railiance run` command for local development and
validation.

Expected behavior:

- Read `railiance/app.toml`.
- Start or validate the local development target.
- Run defined local health checks.
- Emit a machine-readable result suitable for later promotion gates.

**Done when:** at least one representative app can complete Stage 1 locally.

2026-06-27: Added `tools/cmd/railiance-run`, the `bin/railiance run`
dispatcher entry, and `docs/railiance-run-command.md`. The command reads
`railiance/app.toml`, runs Stage 1 commands and local checks, and emits
`railiance.run-result.v1` JSON without command logs or secret values. Updated
the overlay generator so a generated Forgejo overlay completes Stage 1 locally
in this environment; Helm rendering is optional when Helm is unavailable.

---

### T05 - Canary Helm chart template

```task
id: RAIL-BS-WP-0006-T05
status: done
priority: high
state_hub_task_id: "47b8cd47-99c7-4f31-a147-ea16afde7217"
```

Create the Stage 2 canary Helm chart template.

Minimum requirements:

- Stable and canary release identities.
- Weighted routing or equivalent traffic split through the chosen ingress
  path.
- Prometheus-compatible annotations.
- Resource limits appropriate for single-node and future ThreePhoenix use.
- Rollback-safe values layout.

**Done when:** a canary deployment can be created without hand-editing cluster
resources.

2026-06-27: Updated generated overlay charts for Stage 2 canaries. The
scaffold now emits stable/canary release identities, isolated canary ingress by
default, optional Traefik weighted routing, Prometheus-compatible annotations,
HTTP probes, conservative single-node resource limits, rollback labels,
separate Stage 2/Stage 3 values, and `tests/stage2-template.sh`. Verified a
fresh Forgejo overlay with schema validation, Stage 1 run, and Stage 2 scaffold
checks; Helm rendering was skipped because Helm is unavailable in this
environment.

---

### T06 - railiance deploy --stage 2 and observation tooling

```task
id: RAIL-BS-WP-0006-T06
status: done
priority: medium
state_hub_task_id: "6a5c7422-fcb1-49d1-8153-e891bd1c27fa"
```

Implement Stage 2 deployment and observation commands.

Expected behavior:

- Deploy the canary from declared app metadata.
- Show rollout state, pod health, ingress/routing state, and key metrics.
- Fail closed when prerequisites or health gates are missing.

**Done when:** Stage 2 can be run and observed from a repeatable command path.

2026-06-27: Added `tools/cmd/railiance-stage2` and dispatcher entries for
`bin/railiance deploy` and `bin/railiance observe`. Deploy emits a
`railiance.stage2-deploy-result.v1` plan by default, can run Helm server dry-run
or apply when tools and cluster access are present, and fails closed when
required paths, Helm, or approval evidence are missing. Observe emits a
`railiance.stage2-observe-result.v1` target plan by default and runs live
kubectl rollout, pod, ingress, and metrics checks only with `--live`. Updated
generated overlays to declare the repeatable Stage 2 plan commands.
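The fail-closed behaviour can be illustrated with a small preflight sketch. Everything here is hypothetical: `stage2_preflight`, its two arguments, and the JSON it prints are illustrative only and are not the `railiance.stage2-deploy-result.v1` schema.

```bash
# Hypothetical preflight: block unless every prerequisite is present.
stage2_preflight() {
  local chart_dir="$1" approval_file="$2" missing=""
  [ -d "$chart_dir" ] || missing="$missing chart_dir"
  command -v helm >/dev/null 2>&1 || missing="$missing helm"
  # -s: approval evidence must exist and be non-empty.
  [ -s "$approval_file" ] || missing="$missing approval_evidence"
  if [ -n "$missing" ]; then
    printf '{"status":"blocked","missing":"%s"}\n' "${missing# }"
    return 1
  fi
  printf '{"status":"ready"}\n'
}
```

The design choice is that the absence of any prerequisite yields a non-zero exit plus a machine-readable reason, so a wrapper can stop before any Helm or cluster call is attempted.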

---

### T07 - railiance promote, rollback, and onboarding guide

```task
id: RAIL-BS-WP-0006-T07
status: done
priority: medium
state_hub_task_id: "476198f6-0049-4ac4-9593-6723c86c9602"
```

Implement Stage 3 promotion and rollback commands, then write the reference
onboarding guide.

Expected output:

- `railiance promote` for controlled production promotion.
- `railiance rollback` for reverting to the previous stable version.
- A guide showing how a representative app adopts the lifecycle.
- Explicit human approval points for critical infrastructure workloads.

**Done when:** a representative app can move Stage 1 -> Stage 2 -> Stage 3 and
back through rollback using documented commands.

2026-06-27: Added `tools/cmd/railiance-stage3` and dispatcher entries for
`bin/railiance promote` and `bin/railiance rollback`. Both commands default to
non-mutating JSON plans, apply modes require approval evidence and Helm, and
rollback apply also requires a Helm revision for `helm-revision` strategy.
Added `docs/promote-rollback-onboarding.md` with the representative Stage 1 ->
Stage 2 -> Stage 3 -> rollback path and explicit human approval points for
critical workloads. Updated generated overlays to declare promote/rollback plan
commands.

## Dependencies

This workplan should be done before the Forgejo production cutover. It can run
in parallel with preparatory ThreePhoenix design, but its Stage 2/3 behavior
should be validated against the intended ThreePhoenix cluster model.
@@ -0,0 +1,106 @@
---
id: RAILIANCE-WP-0014
type: workplan
title: "activity-core llm-connect live reconcile"
domain: financials
repo: railiance-cluster
status: finished
owner: codex
topic_slug: railiance
created: "2026-06-18"
updated: "2026-07-01"
state_hub_workstream_id: "a152ddda-d60a-4a65-9b9c-59e2db9ff2b7"
---

# activity-core llm-connect live reconcile

## Context

activity-core has updated its Railiance runtime manifest so
`actcore-runtime-config` points at the verified in-cluster llm-connect URL:

```text
LLM_CONNECT_URL=http://llm-connect.activity-core.svc.cluster.local:8080
LLM_CONNECT_TIMEOUT_SECONDS=300
```

The remaining live gate belongs at the cluster/operator layer. Provider
credentials must stay outside Git and State Hub, and the fixture smoke should
record only non-secret evidence.

## Add cluster-owned reconcile/check command

```task
id: RAILIANCE-WP-0014-T01
status: done
priority: high
state_hub_task_id: "49288db7-8102-4ad5-af08-1fe6ab3f1d37"
```

Add a repeatable Railiance command that:

- reconciles the non-secret activity-core runtime config keys;
- checks the provider Secret by key count only;
- applies the llm-connect overlay only after the provider Secret exists;
- runs the in-namespace fixture smoke only after deployment readiness;
- posts a non-secret State Hub evidence note.

2026-06-18: Added `tools/cmd/railiance-reconcile-activity-core-llm-connect`
and Makefile target `reconcile-activity-core-llm-connect`.

## Reconcile live non-secret runtime config

```task
id: RAILIANCE-WP-0014-T02
status: done
priority: high
state_hub_task_id: "61df5bad-535f-4ad1-ac7a-f46ff278c388"
```

Patch the live `activity-core/actcore-runtime-config` ConfigMap so it consumes
the verified llm-connect service URL and timeout. Do not touch Secret values.

2026-06-18: The reconcile command patches only `LLM_CONNECT_URL` and
`LLM_CONNECT_TIMEOUT_SECONDS`, then re-reads the live ConfigMap to verify the
values. Live evidence note `c72c514a-399e-4c54-8d5b-d36405932360` confirms
`LLM_CONNECT_URL=http://llm-connect.activity-core.svc.cluster.local:8080` and
`LLM_CONNECT_TIMEOUT_SECONDS=300`.
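A patch-then-re-read step of this kind could be sketched as follows. The ConfigMap name, namespace, and keys come from this workplan. The helper names `build_llm_connect_patch` and `patch_and_verify` are hypothetical, and the real reconcile tool may be structured differently.

```bash
# Hypothetical helper: build a merge patch touching only the two keys.
build_llm_connect_patch() {
  local url="$1" timeout="$2"
  printf '{"data":{"LLM_CONNECT_URL":"%s","LLM_CONNECT_TIMEOUT_SECONDS":"%s"}}' \
    "$url" "$timeout"
}

# Hypothetical helper: apply the patch, then print the live values back.
patch_and_verify() {
  local patch
  patch="$(build_llm_connect_patch "$1" "$2")"
  kubectl -n activity-core patch configmap actcore-runtime-config \
    --type merge -p "$patch" &&
    kubectl -n activity-core get configmap actcore-runtime-config \
      -o jsonpath='{.data.LLM_CONNECT_URL}{"\n"}{.data.LLM_CONNECT_TIMEOUT_SECONDS}{"\n"}'
}
```

A merge patch leaves every other key in the ConfigMap as it was, which matches the "patches only" behaviour recorded above. The timeout is quoted because ConfigMap `data` values must be strings.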

## Complete provider Secret, deployment, and smoke gate

```task
id: RAILIANCE-WP-0014-T03
status: done
priority: high
state_hub_task_id: "ae8af00a-c14f-4b76-933c-46d06cd360ae"
```

After an operator stores provider credentials in
`activity-core/llm-connect-provider-secrets`, rerun:

```bash
make reconcile-activity-core-llm-connect
```

The command will apply the llm-connect overlay, wait for deployment readiness,
run the in-namespace fixture smoke with `imagePullPolicy=Never`, and post
non-secret evidence: provider Secret key count, deployment readiness,
pass/fail, latency/recommendation summary or sanitized failure.

2026-07-01: Gate closed. Provider Secret `activity-core/llm-connect-provider-secrets`
present (key count 1, no values inspected), overlay applied (no drift),
deployment `llm-connect` ready 1/1, in-namespace fixture smoke passed
(`health=ok latency_seconds=2.084 recommendations=1`). Evidence note
`bddbf5d2-6cbe-4d97-9de6-689147d61be1`. The first rerun failed with
`Connection refused` because the `llm-connect-activity-core-only`
NetworkPolicy (added 2026-06-19) allowlist had not yet propagated the fresh
smoke-pod IP; the reconcile tool now retries the smoke up to 6× with a 5s
warm-up inside the pod.
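The retry behaviour noted in the 2026-07-01 entry can be approximated with a bounded retry loop. This is a sketch, not the reconcile tool's code: `retry_smoke` is a hypothetical name, and it sleeps between attempts, whereas the actual tool is described as using a 5s warm-up inside the pod.

```bash
# Hypothetical helper: run a command up to N times, pausing between tries.
retry_smoke() {
  local attempts="$1" delay="$2" n=1
  shift 2
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "smoke attempt $n/$attempts failed" >&2
    # Only sleep when another attempt will follow.
    if [ "$n" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    n=$((n + 1))
  done
  return 1
}
```

For example, `retry_smoke 6 5 run_fixture_smoke` would give a freshly created smoke pod up to six chances while the NetworkPolicy allowlist catches up, where `run_fixture_smoke` stands in for whatever command performs the in-namespace smoke.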

Historical live gate on 2026-06-18: provider Secret
`activity-core/llm-connect-provider-secrets` is missing, so deployment and
smoke are intentionally blocked until operator/OpenBao-to-Kubernetes Secret
custody is complete. Evidence note
`c72c514a-399e-4c54-8d5b-d36405932360` records provider Secret status
`missing`, key count `0`, deployment status `not checked; provider Secret gate
not satisfied`, and smoke status `blocked`.