Repo hygiene + new workplans (RAIL-BS-WP-0008/0009)

- Add RAIL-BS-WP-0008 (activity-core WP-0016 deploy) and RAIL-BS-WP-0009
  (admin-sync smoke) from inbox asks 87952ff1 / aa8b7986
- Archive finished workplans to workplans/archived/ per ADR-001 convention;
  normalize frontmatter statuses (completed/done -> finished)
- Fill stack-and-commands.md, complete repo-boundary.md, refresh SCOPE
  Current State, add docs/operator-runbook.md for production-touching targets

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 00:02:36 +02:00
parent eefa6c1b2a
commit b3b0c3e3ff
15 changed files with 206 additions and 24 deletions


@@ -0,0 +1,135 @@
---
id: RAIL-BS-WP-0001
type: workplan
title: "Dependency Management — Add lockfile for Ansible control-node deps"
domain: financials
repo: railiance-cluster
status: finished
owner: railiance
topic_slug: railiance
state_hub_workstream_id: 59155efb-b461-4caa-ad7b-b3fce348db84
state_hub_task_id: 5f8cade5-119c-42e8-ba93-e9d0478650e4
created: "2026-03-01"
updated: "2026-03-01"
completed: "2026-03-01"
---
# Dependency Management — Add Ansible control-node lockfile
## Problem
This repo drives all Ansible automation but carries no pinned, machine-readable
inventory of its own runtime dependencies.
The Ansible version, and every pip package it depends on, is whatever happens
to be installed on the control node at any given time. This means:
- Behaviour is not reproducible across machines or over time
- The Custodian State Hub SBOM scanner finds nothing to ingest (`last_sbom_at = null`)
- Licence and vulnerability auditing of the actual dependencies in use is impossible
- The `railiance-cluster` repo appears as a gap in the SBOM coverage map
## Root cause
No `pyproject.toml` (or `requirements.txt`) declares the control-node pip
dependencies. No `ansible/requirements.yml` exists for Galaxy collections
(acceptable if none are used, but their absence should be stated explicitly).
## Expected state after this task
- `pyproject.toml` at repo root declares `ansible` as a dependency (and any
other pip packages used by playbooks or the `bin/` commands)
- `uv.lock` is generated and committed — pins Ansible + full transitive pip tree
- If Galaxy collections are used: `ansible/requirements.yml` lists them
- SBOM is ingested: `last_sbom_at` is not null in the State Hub
- The SBOM dashboard shows `railiance-cluster` in the railiance domain row
with a package count
## Tasks
### T1 — Audit control-node pip dependencies
```task
id: RAIL-BS-WP-0001-T01
state_hub_task_id: 5f8cade5-119c-42e8-ba93-e9d0478650e4
status: done
priority: medium
completed: "2026-03-01"
```
Review `bin/` commands, Ansible playbooks, and any Python scripts in the repo.
List all pip packages that must be present on the control node:
- `ansible` (minimum version)
- Any collections-related tools (ansible-core, ansible-lint, etc.)
- Any other pip deps called from scripts (e.g. `paramiko`, `netaddr`, `jinja2`)
### T2 — Create pyproject.toml and generate uv.lock
```task
id: RAIL-BS-WP-0001-T02
status: done
priority: medium
completed: "2026-03-01"
state_hub_task_id: "8aa8a9d3-6560-4176-b933-72a21e6d43d4"
```
1. Create `pyproject.toml`:
```toml
[project]
name = "railiance-cluster"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
"ansible>=10", # adjust version as appropriate
# add other deps found in T1
]
```
2. Run `uv lock` to generate `uv.lock`
3. Commit both files
### T3 — Ingest SBOM into State Hub
```task
id: RAIL-BS-WP-0001-T03
status: done
priority: medium
completed: "2026-03-01"
state_hub_task_id: "4fb477e9-dbac-4e43-84d0-5202c68f4705"
```
From `~/the-custodian/state-hub/`:
```bash
make ingest-sbom REPO=railiance-cluster SCAN=1 REPO_PATH=/home/worsch/railiance-cluster
```
Verify in the SBOM dashboard: railiance domain should show `railiance-cluster`
with a package count and no gap warning.
### T4 — Create ansible/requirements.yml (even if empty)
```task
id: RAIL-BS-WP-0001-T04
status: done
priority: low
completed: "2026-03-01"
state_hub_task_id: "d0eb1c96-e7c2-4f6b-b934-a3f295e4db72"
```
Create `ansible/requirements.yml`. If no Galaxy roles or collections are used,
create it empty with a comment. This makes the absence of collections explicit:
```yaml
---
# No external Ansible Galaxy roles or collections required.
# Add roles/collections here as needed:
# roles: []
# collections: []
```
## References
- Custodian SBOM Convention: `canon/standards/sbom-convention_v0.1.md`
- SBOM dashboard: http://127.0.0.1:3000/sbom
- Repos coverage page: http://127.0.0.1:3000/repos
- State Hub task: `5f8cade5-119c-42e8-ba93-e9d0478650e4`


@@ -0,0 +1,175 @@
---
id: RAIL-BS-WP-0002
type: workplan
title: "k3s and Kubernetes Platform Baseline"
domain: financials
repo: railiance-cluster
status: finished
owner: railiance
topic_slug: railiance
repo_goal_id: "70ab2379-fb9d-4fec-a09d-b2a717e4ace8"
state_hub_workstream_id: "4c63dfc6-9eac-4e79-9f77-8f644ad7147d"
created: "2026-03-09"
updated: "2026-03-10"
completed: "2026-03-10"
---
# k3s and Kubernetes Platform Baseline
## Goal
Install k3s, Helm, and the baseline Kubernetes services on the converged
HostEurope node. This workplan picks up exactly where `railiance-hosts`
leaves off: a hardened, verified OS node that is ready for Kubernetes.
## Pre-condition
`railiance-infra` converge + Goss verify must pass before any task here
is executed:
```bash
# In railiance-infra/
make converge
make verify # must exit 0
```
## Boundary
This repo owns everything from k3s upward. It must not re-configure items
defined in `railiance-infra/spec/server-baseline.yaml`. See ADR-003:
`railiance-infra/docs/adr/ADR-003-railiance-5repo-stack-architecture.md`.
**Out of scope here:** platform services (PostgreSQL, storage, identity) →
`railiance-platform`. Application deployments (Gitea, coulomb services) →
`railiance-apps`.
---
## Tasks
### T01 — Ansible playbook: install k3s (server mode)
```task
id: T01
status: done
priority: high
state_hub_task_id: "3f042630-eab0-4c6a-9167-e2b28ff20e40"
completed: "2026-03-10"
```
Harden `ansible/bootstrap.yml` to a production-ready k3s install:
- Use the official k3s install script pinned to a specific version
(`INSTALL_K3S_VERSION=v1.35.1+k3s1`)
- `INSTALL_K3S_EXEC="server --cluster-init --write-kubeconfig-mode=644"`
(cluster-init enables embedded etcd for future HA expansion)
- Wait for node `Ready` before proceeding:
```bash
k3s kubectl wait node --all --for=condition=Ready --timeout=120s
```
- Fetch kubeconfig to the control node as `~/.kube/config-hosteurope`
**Done when:** `k3s kubectl get nodes` returns `Ready` from both the server
and the control node (via kubeconfig).
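The version pin in this task can be guarded before the install script runs. The following is a minimal sketch, not the playbook's actual logic: the helper name `is_pinned_k3s_version` is hypothetical, and the pattern only checks the `vX.Y.Z+k3sN` shape of the tag quoted above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): succeed only if the argument
# has the shape of a fully pinned k3s release tag such as v1.35.1+k3s1.
# Channel names like "stable" or partial versions like "v1.35" are rejected.
is_pinned_k3s_version() {
  [[ "$1" =~ ^v[0-9]+\.[0-9]+\.[0-9]+\+k3s[0-9]+$ ]]
}

# Example use (commented out so the sketch has no side effects):
# is_pinned_k3s_version "$INSTALL_K3S_VERSION" || { echo "unpinned version" >&2; exit 1; }
```

Failing early on an unpinned value keeps a typo from silently installing whatever the default channel currently serves.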
---
### T02 — Helm installation
```task
id: T02
status: done
priority: high
state_hub_task_id: "e8510646-46ed-4697-a345-f3d3009eea78"
completed: "2026-03-10"
```
Add a task (or a role `roles/helm/`) that:
1. Downloads the Helm binary (pinned version) to `/usr/local/bin/helm`
2. Verifies the checksum
3. Confirms `helm version` succeeds
**Done when:** `helm version` succeeds on the HostEurope node.
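Step 2 (checksum verification) could look roughly like this. It is a sketch under assumptions: `verify_sha256` is a hypothetical helper name, and the expected digest is assumed to come from the checksum published alongside the pinned Helm release.

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare a file's SHA-256 with an expected lowercase
# hex digest. Returns non-zero if the file is missing or the digest differs.
verify_sha256() {
  local file="$1" expected="$2" actual
  actual="$(sha256sum "$file" 2>/dev/null | awk '{print $1}')"
  [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Example use (placeholder names, commented out):
# verify_sha256 /tmp/helm.tar.gz "$EXPECTED_SHA256" || exit 1
```

Only after this check passes should the binary be moved to `/usr/local/bin/helm`.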
---
### T03 — Smoke test: k3s + Helm
```task
id: T03
status: done
priority: high
state_hub_task_id: "dab2c07f-8aa0-4635-8df6-857e87e93fc5"
completed: "2026-03-10"
```
Extend `tests/smoke_kube.sh` to assert:
- `k3s kubectl get nodes` → node in Ready state
- `helm version` exits 0
- CoreDNS pod running in `kube-system`
- Traefik ingress controller pod running (default in k3s)
Run via:
```bash
ansible-playbook -i ansible/hosts.ini ansible/smoke.yml
```
or directly over SSH if the kubeconfig is available locally.
**Done when:** all assertions pass and the script exits 0.
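One way to structure the assertions so that a single failure does not hide the others is a small counting helper. This is an illustrative sketch only; `check` and `failures` are hypothetical names and may not match what `tests/smoke_kube.sh` actually uses.

```bash
#!/usr/bin/env bash
failures=0

# Run a command quietly, print PASS/FAIL with a label, and count failures.
check() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $label"
  else
    echo "FAIL: $label"
    failures=$((failures + 1))
  fi
}

# Example wiring (commented out so the sketch has no side effects):
# check "node Ready"   k3s kubectl wait node --all --for=condition=Ready --timeout=120s
# check "helm version" helm version
# exit "$((failures > 0))"
```

Exiting with `$((failures > 0))` yields 0 only when every assertion passed, which matches the "script exits 0" criterion.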
---
### T04 — Commit kubeconfig management notes
```task
id: T04
status: done
priority: medium
state_hub_task_id: "5c3d40e4-239b-488e-9519-6f7a38d2325f"
completed: "2026-03-10"
```
Document in `docs/kubeconfig.md`:
- Where the kubeconfig is fetched to (`~/.kube/config-hosteurope`)
- How to merge it into `~/.kube/config`
- How to switch context: `kubectl config use-context default`
- Security note: kubeconfig is gitignored (contains cluster CA + client cert)
**Done when:** doc written and committed.
---
### T05 — Add `make k3s-install` and `make smoke` targets
```task
id: T05
status: done
priority: medium
state_hub_task_id: "7f9e0e58-a130-467a-a2d0-b3f2564e496f"
completed: "2026-03-10"
```
Add to Makefile (create one if none exists):
```makefile
k3s-install: ## Install k3s and Helm on all inventory hosts
	ansible-playbook -i ansible/hosts.ini ansible/bootstrap.yml
smoke: ## Run Kubernetes smoke tests
	bash tests/smoke_kube.sh
```
**Done when:** both targets work and are listed in `make help`.
---
## References
- Repo goal: `70ab2379-fb9d-4fec-a09d-b2a717e4ace8` (Install k3s and Kubernetes Baseline)
- Domain goal: `6f96c712-60e6-4ea9-ab06-168878eafbce` (Three-Phoenix Secure Kubernetes Infrastructure)
- Pre-condition: railiance-infra WP-0001 (Secure Single-Server Bootstrap) — completed 2026-03-09
- Boundary ADR: `railiance-infra/docs/adr/ADR-003-railiance-5repo-stack-architecture.md`
- k3s releases: https://github.com/k3s-io/k3s/releases


@@ -0,0 +1,194 @@
---
id: RAIL-BS-WP-0003
type: bug-report
title: "pgpool CrashLoopBackOff on PostgreSQL HA failover — missing secret key"
domain: financials
repo: railiance-cluster
status: finished
owner: tegwick
created: "2026-03-10"
updated: "2026-03-10"
state_hub_workstream_id: "7ee9ee22-1fae-4567-9194-8d70a9e0f45b"
---
# Bug Report: pgpool CrashLoopBackOff on PostgreSQL HA failover
## Summary
On 2026-03-10 a PostgreSQL HA failover caused all three postgresql pods to
restart. pgpool — the connection pooler between Gitea and PostgreSQL — then
entered CrashLoopBackOff and produced no logs. As a result Gitea's login
and all write operations hung indefinitely. The root page was still served
(from Valkey cache) which masked the failure.
The fix was to patch a missing key in a Kubernetes secret. The root cause is
that the `gitea-12.2.0` Helm chart (postgresql-ha subchart v16.2.2) does not
populate the `pgpool-password` key in the `gitea-postgresql-ha-postgresql`
secret, even though the pgpool pod requires it at startup.
---
## Timeline
| Time (UTC) | Event |
|---|---|
| ~09:45 | postgresql-0, postgresql-2 pods restarted (repmgr failover) |
| ~09:45 | pgpool pod restarted and entered CrashLoopBackOff |
| ~11:00 | User noticed Gitea login hanging; home page still loading |
| ~13:00 | Root cause identified: missing `pgpool-password` secret key |
| ~13:10 | Secret patched; pgpool pod deleted and restarted cleanly |
| ~13:15 | Gitea fully operational |
---
## Root Cause
The Bitnami `pgpool` container startup script reads the file
`/opt/bitnami/pgpool/secrets/pgpool-password`, which is mounted from the
`gitea-postgresql-ha-postgresql` Kubernetes Secret via a `subPath` volume
mount. That secret key was never created by the Helm chart, so the file did
not exist. The container exited immediately with no logs.
The pod had been running for 20 days without a restart, so this gap was
never discovered during initial deployment.
---
## Evidence
```bash
# Secret was missing the pgpool-password key
sudo k3s kubectl get secret -n default gitea-postgresql-ha-postgresql -o yaml
# data: keys were password, postgres-password, repmgr-password only
# pgpool-password was absent
# pgpool pod describe showed 824 back-off restarts over 173 minutes
# No logs in either current or --previous output
sudo k3s kubectl logs -n default <pgpool-pod> --previous
# (empty)
# Gitea process had zero TCP connections to PostgreSQL port 5432
# but many connections to Valkey port 6379
cat /proc/<gitea-pid>/net/tcp | grep 1538 # 1538 = 5432 hex — no results
```
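The `1538` in the last command is port 5432 written as the 4-digit uppercase hex that `/proc/<pid>/net/tcp` uses. A small sketch of that conversion follows; `port_to_proc_hex` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Print a TCP port in the 4-digit uppercase hex form used in /proc/<pid>/net/tcp.
port_to_proc_hex() {
  printf '%04X\n' "$1"
}

port_to_proc_hex 5432   # PostgreSQL → 1538
port_to_proc_hex 6379   # Valkey     → 18EB
```

Grepping for `:1538` (with the leading colon) narrows the match to port fields and avoids hits inside hex-encoded addresses.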
---
## Immediate Fix Applied
```bash
# Add the missing key (value = sr-check-password = changeme4 = base64: Y2hhbmdlbWU0)
sudo k3s kubectl patch secret -n default gitea-postgresql-ha-postgresql \
--type='json' \
-p='[{"op":"add","path":"/data/pgpool-password","value":"Y2hhbmdlbWU0"}]'
# Restart pgpool
sudo k3s kubectl delete pod -n default <pgpool-pod-name>
```
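Values under a Secret's `data` field must be base64-encoded without a trailing newline. A minimal sketch of producing the value used in the patch above; `b64_secret_value` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Encode a value for a Kubernetes Secret `data` field.
# printf avoids the newline that `echo` would append before encoding;
# tr removes the line wrapping some base64 implementations add to long input.
b64_secret_value() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

b64_secret_value "changeme4"   # → Y2hhbmdlbWU0
```

Encoding with `echo` instead would embed a newline in the password and produce a different, non-matching value.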
---
## Risk: Fix Will Be Lost on helm upgrade
The patched secret is managed by Helm (annotation:
`meta.helm.sh/release-name: gitea`). A `helm upgrade` will regenerate the
secret from the chart template, which does not include `pgpool-password`,
and the bug will recur.
---
## Tasks
### T01 — Add pgpool-password to Helm values
```task
id: T01
status: done
priority: high
state_hub_task_id: "6841c93a-f146-47eb-9f7c-8fa0e02c1bbc"
```
Create or update `helm/gitea-values.yaml` (or equivalent) to permanently
include the pgpool-password so it survives `helm upgrade`:
```yaml
postgresql-ha:
  postgresql:
    pgpoolPassword: <value matching sr-check-password>
```
**Done when:** `helm upgrade gitea` completes and pgpool starts cleanly
without manual secret patching.
---
### T02 — Add pgpool health check to smoke test
```task
id: T02
status: done
priority: high
state_hub_task_id: "ab166073-30a7-4702-a037-4091e8706e20"
```
Extend `tests/smoke_kube.sh` to assert:
```bash
# All postgresql-ha pods Running
kubectl get pods -n default | grep gitea-postgresql-ha | grep -v Running && exit 1
# pgpool specifically not in CrashLoopBackOff
kubectl get pod -n default -l app.kubernetes.io/component=pgpool \
  -o jsonpath='{.items[0].status.containerStatuses[0].state}' \
  | grep -q CrashLoopBackOff && exit 1
```
**Done when:** the smoke test catches a pgpool failure within 5 minutes.
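The pgpool assertion can also be expressed as a positive check that passes only when the container is actually running. This is a sketch under assumptions: `pgpool_state_ok` is a hypothetical helper, and its argument is assumed to be the container state exactly as printed by the jsonpath query in this task.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the container state text shows a
# running container. CrashLoopBackOff, other waiting/terminated states, and
# an empty result (no pod found) all count as failures.
pgpool_state_ok() {
  case "$1" in
    *CrashLoopBackOff*) return 1 ;;
    *running*)          return 0 ;;
    *)                  return 1 ;;
  esac
}

# Example use (commented out so the sketch has no side effects):
# state="$(kubectl get pod -n default -l app.kubernetes.io/component=pgpool \
#   -o jsonpath='{.items[0].status.containerStatuses[0].state}')"
# pgpool_state_ok "$state" || exit 1
```

Treating an empty state as a failure matters here, because the incident itself produced a pod with no logs and no useful output.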
---
### T03 — Add HA failover test
```task
id: T03
status: done
priority: high
state_hub_task_id: "140da396-8e30-4f4d-b88c-c42c0cd46c01"
```
Create `tests/test_ha_failover.sh` that:
1. Records Gitea login response time (baseline)
2. Kills the primary PostgreSQL pod: `kubectl delete pod gitea-postgresql-ha-postgresql-0 -n default`
3. Waits for repmgr to promote a replica (max 60s)
4. Asserts Gitea login POST still succeeds within 10s
5. Asserts pgpool pod is Running (not CrashLoopBackOff)
6. Asserts all postgresql pods return to Running
This test must pass before any PostgreSQL HA deployment is considered done.
**Done when:** script exits 0 against a live cluster.
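Steps 3 to 6 all amount to "poll until a condition holds or a deadline passes". A minimal retry helper for that pattern might look like this; `wait_until` is a hypothetical name, and the script in the repo may be structured differently.

```bash
#!/usr/bin/env bash
# Retry a command until it succeeds or the timeout (in seconds) elapses.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$((SECONDS + timeout))
  until "$@"; do
    if [ "$SECONDS" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}

# Example use with a placeholder check function (commented out):
# wait_until 60 5 replica_promoted || { echo "no promotion within 60s" >&2; exit 1; }
```

Bounding every wait keeps the failover test from hanging the way the Gitea login did during the incident.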
---
### T04 — Document the incident in docs/
```task
id: T04
status: done
priority: medium
state_hub_task_id: "d8a3ba40-fda0-4c1f-a9f1-ffcd621a5b3d"
```
Add `docs/incidents/2026-03-10-pgpool-missing-secret.md` with the full
timeline, root cause, and fix, so future operators understand what happened
and how to recover.
**Done when:** doc committed and linked from `docs/README.md`.
---
## References
- Bitnami postgresql-ha chart v16.2.2
- Gitea Helm chart v12.2.0
- Related decision: D3 (HA testing policy) in `DECISIONS.md`


@@ -0,0 +1,273 @@
---
id: RAIL-BS-WP-0004
type: workplan
title: "Integrated Backup — S2 Kubernetes Runtime Layer"
domain: financials
repo: railiance-cluster
status: finished
owner: tegwick
topic_slug: railiance
state_hub_workstream_id: "7e8b0c20-51eb-40c9-9e3b-85dd380d7625"
created: "2026-02-25"
updated: "2026-03-26"
---
# Integrated Backup — S2 Kubernetes Runtime Layer
## Goal
Implement the Q3 (Operability & Resilience) integrated backup for
railiance-cluster (S2). Backs up what S2 owns — the Kubernetes runtime state —
encrypted with age, written to a local directory on the server. No external
dependencies required.
## Architecture (Decision D4)
Each railiance repo implements its own backup for what it owns. No central
backup service. See `DECISIONS.md` D4 for full rationale.
**Standard interface every railiance repo must provide:**
```bash
make backup # encrypt + write to /opt/backup/railiance/<layer>/
make restore # restore from most recent local backup
```
Encryption: age, same key pair as SOPS secrets (`.sops.yaml` public key).
Output: `/opt/backup/railiance/cluster/` on the server.
## What S2 (railiance-cluster) owns and must back up
| Asset | Why it matters |
|---|---|
| k3s etcd snapshots | Full cluster state — all workloads, configs, secrets |
| Helm release values | Runtime values not in git (any manually applied overrides) |
| kubeconfig | Admin access to the cluster |
**Not S2's responsibility:**
- Custodian State Hub DB → the-custodian owns this
- Operator workstation config (`.claude/`, `.gitconfig`) → operator's own concern
- Application data (Gitea repos, uploads) → S5 (railiance-apps) owns this
- PostgreSQL data volumes → S3 (railiance-platform) owns this
## Encryption
Reuse the age public key from `.sops.yaml`:
```bash
AGE_PUBLIC_KEY=$(grep 'age:' .sops.yaml | awk '{print $2}')
tar -czf - <assets> | age -r "${AGE_PUBLIC_KEY}" -o backup.tar.gz.age
```
Decryption requires the private key at `~/.config/sops/age/keys.txt`
(same key used for `sops -d`). No additional key management needed.
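The `grep | awk` extraction above depends on how the `age:` line is laid out: if the key sits on a list-item line (`- age: age1...`), `$2` is the literal `age:` rather than the key. A layout-independent sketch, assuming a single recipient and the standard `age1` key prefix; `sops_age_recipient` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Print the first age recipient found in a SOPS config file
# (defaults to .sops.yaml). Matching the key itself sidesteps differences
# in YAML layout. Prints nothing if no recipient is present.
sops_age_recipient() {
  grep -oE 'age1[0-9a-z]+' "${1:-.sops.yaml}" | head -n 1
}

# Example use (commented out so the sketch has no side effects):
# AGE_PUBLIC_KEY="$(sops_age_recipient .sops.yaml)"
# [ -n "$AGE_PUBLIC_KEY" ] || { echo "no age recipient found" >&2; exit 1; }
```

The explicit non-empty check matters because `age -r ""` should fail loudly before any backup artifact is written.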
## Extension Point EP-RAIL-005
Once all five OAS layers implement this interface, the custodian can
orchestrate a full-stack backup with:
```bash
for repo in railiance-infra railiance-cluster railiance-platform \
railiance-enablement railiance-apps; do
make -C ~/$repo backup
done
```
No special protocol needed — just the standard interface.
---
## Tasks
### T01 — Define backup directory and encryption wrapper
```task
id: T01
status: done
priority: high
state_hub_task_id: "4526a842-ea31-4874-9231-92ab556cfe7b"
```
Create `tools/cmd/railiance-backup-s2` (replacing the old `railiance-backup`):
- Backup dir: `/opt/backup/railiance/cluster/` (create with `mkdir -p`)
- Encrypt each artifact with age using public key from `.sops.yaml`
- Write timestamp-named files: `etcd-<ts>.snap.age`, `helm-values-<ts>.tar.gz.age`, `kubeconfig-<ts>.yaml.age`
- Keep last 7 of each type
- Write `.last-backup` stamp
- Exit 0 on success, non-zero on any failure
- No network required
Also remove the old `tools/cmd/railiance-backup` (backed up Docker-based
custodian DB — wrong scope, not applicable to this server).
**Done when:** `make backup` runs on COULOMBCORE without error and files
appear in `/opt/backup/railiance/cluster/`.
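The "keep last 7 of each type" rule could be sketched as follows. This is an illustration, not the script's actual code: `prune_backups` is a hypothetical helper, and it assumes the `<ts>` in each file name sorts lexicographically (as a UTC `YYYYmmddTHHMMSSZ` stamp does).

```bash
#!/usr/bin/env bash
# Keep the newest N files whose names start with a prefix; delete the rest.
# Relies on timestamps in the file names sorting lexicographically.
# Usage: prune_backups <dir> <prefix> [keep=7]
prune_backups() {
  local dir="$1" prefix="$2" keep="${3:-7}"
  find "$dir" -maxdepth 1 -type f -name "${prefix}*" | sort -r |
    tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
      rm -f -- "$old"
    done
}

# Example use (commented out so the sketch has no side effects):
# prune_backups /opt/backup/railiance/cluster etcd- 7
```

Calling it once per prefix (`etcd-`, `helm-values-`, `kubeconfig-`) keeps the retention of each artifact type independent.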
---
### T02 — Back up k3s state (SQLite hot backup)
```task
id: T02
status: done
priority: high
state_hub_task_id: "a6313e06-1976-46a7-8e31-df4eb2eca880"
```
k3s has built-in etcd snapshot support:
```bash
sudo k3s etcd-snapshot save --name railiance-$(date -u +%Y%m%dT%H%M%SZ)
# Default location: /var/lib/rancher/k3s/server/db/snapshots/
```
Add to the backup script: take a fresh snapshot, encrypt with age,
copy to `/opt/backup/railiance/cluster/`.
> **Note — verify etcd is in use before implementing:**
> `k3s etcd-snapshot` only works if k3s was started with `--cluster-init`.
> Without it, k3s uses SQLite and this command will fail.
> Verify first: `sudo k3s etcd-snapshot ls 2>&1`
> **Note — sudo required:** etcd snapshot requires root. See T06 for how
> this is resolved (backup runs under root's crontab).
**Done when:** backup includes a current etcd snapshot.
---
### T03 — Back up Helm release values
```task
id: T03
status: done
priority: medium
state_hub_task_id: "05d42a55-921f-4aa7-bb76-e8af9c7e0ac3"
```
Capture current runtime Helm values for all releases:
```bash
KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm list -A -o json | \
  jq -r '.[] | .name + " " + .namespace' | \
  while read -r name ns; do
    KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm get values "$name" -n "$ns" -o yaml
  done
```
Tar and age-encrypt into `helm-values-<ts>.tar.gz.age`.
> **Note — kubeconfig permissions:** `/etc/rancher/k3s/k3s.yaml` is root-readable
> only by default. The backup script must either run as root (see T06) or k3s
> must be configured with `--write-kubeconfig-mode=644`. Running as root
> (via root crontab) is the chosen approach — no config change needed.
**Done when:** backup includes a snapshot of all Helm release values.
---
### T04 — Back up kubeconfig
```task
id: T04
status: done
priority: medium
state_hub_task_id: "08233868-d522-4117-bc4e-6c0f52545665"
```
Age-encrypt `~/.kube/config-hosteurope` (or `/etc/rancher/k3s/k3s.yaml`)
into `kubeconfig-<ts>.yaml.age` in the backup directory.
**Done when:** backup includes the encrypted kubeconfig.
---
### T05 — make restore target
```task
id: T05
status: done
priority: medium
state_hub_task_id: "2d5acff7-4a4e-4ddd-ad06-08237ad3dac8"
```
Add `tools/cmd/railiance-restore-s2` that decrypts and lists available
backups, with guided restore for the etcd snapshot case.
Restore of etcd from snapshot:
```bash
sudo k3s server --cluster-reset \
--cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/<name>
```
**Done when:** `make restore` prints available backups and a restore guide.
---
### T06 — Install cron job and run restore drill
```task
id: T06
status: done
priority: medium
state_hub_task_id: "f8e4a094-c367-40eb-b895-da17bc144b07"
```
#### Solving the sudo problem
The backup script needs root for two reasons:
- `k3s etcd-snapshot save` requires root
- `/etc/rancher/k3s/k3s.yaml` (kubeconfig) is root-readable only
**Solution: run the cron under root's crontab.**
This is the correct pattern for system-level backup jobs. It avoids a
proliferating sudoers whitelist (one entry per command, brittle to maintain)
and matches how tools like `rsnapshot`, `bacula`, and `borgbackup` work in
production. The backup writes to `/opt/backup/` which is root-owned anyway.
Install the cron as root:
```bash
sudo crontab -e
# Add:
0 2 * * * make -C /home/tegwick/railiance-cluster backup >> /opt/backup/railiance/cluster/backup.log 2>&1
```
Note: use the absolute path to the repo — `~` does not expand reliably in
root's crontab unless HOME is set.
Verify it is installed:
```bash
sudo crontab -l | grep railiance
```
#### Restore drill
Once T01-T04 are done, run a decrypt-and-verify drill:
```bash
# Decrypt the etcd snapshot and verify it is a valid snapshot file
sudo age -d -i ~/.config/sops/age/keys.txt \
  "$(ls /opt/backup/railiance/cluster/etcd-*.snap.age | sort -r | head -1)" \
  | file -
# Record the drill
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) restore drill OK" \
>> /opt/backup/railiance/cluster/restore-drill.log
```
**Done when:** cron installed under root, drill completes without error,
log entry written.
---
## References
- Decision D4: Integrated backup per capability (`DECISIONS.md`)
- Decision D2: Nextcloud as optional offsite extension (still valid, not a requirement)
- OAS Q3: Operability & Resilience
- Extension point EP-RAIL-005: Custodian full-stack backup orchestration
- k3s etcd snapshots: https://docs.k3s.io/datastore/backup-restore


@@ -0,0 +1,143 @@
---
id: RAIL-BS-WP-0005
type: workplan
title: "Kubeconfig delivery for netkingdom SSO/MFA stack apply"
domain: financials
repo: railiance-cluster
status: finished
owner: railiance-worker
topic_slug: railiance
capability_request_id: "34b97d89-e80a-42ae-a623-a9185e5b17f5"
created: "2026-03-20"
updated: "2026-03-20"
state_hub_workstream_id: "b236de41-2f33-4ebc-bb84-5fcedb2982f8"
---
# RAIL-BS-WP-0005 — Kubeconfig delivery for netkingdom SSO/MFA stack apply
**Scope:** Fulfil capability request 34b97d89 — deliver a working local kubeconfig so
the netkingdom SSO/MFA workstream (NK-WP-0001) can apply manifests (T02-T08) against
the existing K3s cluster on HostEurope (92.205.130.254).
**Context:**
- Cluster is healthy: one node `Ready`, k3s v1.30.3, 200 days uptime.
- K3s API listens on `*:6443` (all interfaces); UFW is inactive — direct public access works.
- The in-cluster kubeconfig uses `server: https://127.0.0.1:6443`; must be rewritten
to `https://92.205.130.254:6443` for off-server use.
- No ops-bridge tunnel needed for kubectl (API is directly reachable).
- The capability request was filed against the wrong catalog entry (PostgreSQL HA
  instead of k3s provisioning). There is no API endpoint to correct it
  retroactively, so the discrepancy is documented here.
**Depends on:** RAIL-BS-WP-0002 (k3s-kubernetes-baseline) ✓ completed
**Unblocks:** NK-WP-0001 T02-T08 (SSO/MFA stack apply)
---
## Task: Extract kubeconfig from HostEurope server
```task
id: RAIL-BS-WP-0005-T01
status: done
priority: high
state_hub_task_id: "c59a8e0c-e1fd-4cfd-aa5e-7cbb895609f0"
```
```bash
ssh -i ~/.ssh/id_ops tegwick@92.205.130.254 \
"sudo cat /etc/rancher/k3s/k3s.yaml" > /tmp/k3s-raw.yaml
```
Verify file is non-empty and contains a valid YAML kubeconfig.
---
## Task: Rewrite server address and install kubeconfig
```task
id: RAIL-BS-WP-0005-T02
status: done
priority: high
state_hub_task_id: "93d61bc6-47e7-442f-8611-97f5f2f208c4"
```
Replace `127.0.0.1` with `92.205.130.254` in the kubeconfig; place at
`~/.kube/config` (create `~/.kube/` if absent). Back up any existing config first.
```bash
mkdir -p ~/.kube
# back up existing if present
[ -f ~/.kube/config ] && cp ~/.kube/config ~/.kube/config.bak.$(date +%Y%m%d)
# rewrite server and install
sed 's|https://127.0.0.1:6443|https://92.205.130.254:6443|g' /tmp/k3s-raw.yaml \
> ~/.kube/config
chmod 600 ~/.kube/config
```
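The rewrite step can be wrapped so the destination is created with restrictive permissions from the start rather than tightened afterwards. This is a sketch under assumptions: `rewrite_kubeconfig_server` is a hypothetical helper, and port 6443 is taken from the context above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: copy a kubeconfig while pointing the loopback
# `server:` URL at another address.
# Usage: rewrite_kubeconfig_server <src> <dest> <address>
rewrite_kubeconfig_server() {
  local src="$1" dest="$2" addr="$3"
  (
    umask 077   # a newly created destination gets mode 600
    sed "s|https://127\.0\.0\.1:6443|https://${addr}:6443|g" "$src" > "$dest"
  )
}

# Example use (commented out so the sketch has no side effects):
# rewrite_kubeconfig_server /tmp/k3s-raw.yaml ~/.kube/config 92.205.130.254
```

The `umask` only affects files created by the redirect; an existing destination keeps its current mode, so the `chmod 600` above is still worth running.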
---
## Task: Smoke-test kubectl from local machine
```task
id: RAIL-BS-WP-0005-T03
status: done
priority: high
state_hub_task_id: "f15626c2-73a0-443f-8aae-5515806ae0fa"
```
```bash
kubectl get nodes
kubectl get pods -A
```
Expected: node `254.130.205.92.host.secureserver.net` in `Ready` state.
If unreachable, check firewall on server: `ssh -i ~/.ssh/id_ops tegwick@92.205.130.254 "sudo ufw status"`.
---
## Task: Resolve capability request
```task
id: RAIL-BS-WP-0005-T04
status: done
priority: high
state_hub_task_id: "8109450c-95df-4d01-96fd-8847c88beb34"
```
Patch capability request 34b97d89 to `completed` with a resolution note:
```bash
curl -s -X PATCH "http://127.0.0.1:8000/capability-requests/34b97d89-e80a-42ae-a623-a9185e5b17f5/status" \
-H "Content-Type: application/json" \
-d '{
"status": "completed",
"note": "Kubeconfig delivered to ~/.kube/config (server: 92.205.130.254:6443). kubectl smoke-test passed. NK-WP-0001 T02-T08 can proceed. Note: wrong catalog_entry_id filed (PostgreSQL HA eca6e5cc instead of k3s provisioning 9520cc98) — no retroactive API to correct."
}'
```
---
## Task: Register UFW-inactive finding as technical debt
```task
id: RAIL-BS-WP-0005-T05
status: done
priority: medium
state_hub_task_id: "ea120464-fdeb-4259-99e1-e6743cd86797"
```
UFW is inactive on 92.205.130.254 — K3s API port 6443 is exposed to the internet,
protected only by TLS mutual auth. Register as TD item in state-hub so it gets
addressed in a future railiance-cluster security hardening workplan.
```bash
curl -s -X POST "http://127.0.0.1:8000/technical-debt/" \
-H "Content-Type: application/json" \
-d '{
"domain": "railiance",
"debt_type": "security",
"severity": "medium",
"title": "UFW inactive on HostEurope K3s node — API port 6443 exposed to internet",
"description": "UFW is inactive on 92.205.130.254. K3s API (port 6443) is reachable from anywhere, protected only by TLS client certificates. Should be restricted to known IPs or tunnelled. Discovered 2026-03-20 during kubeconfig delivery workplan.",
"status": "open"
}'
```


@@ -0,0 +1,110 @@
---
id: RAILIANCE-WP-0012
type: workplan
title: "activity-core cluster-owned deploy/verify"
domain: financials
repo: railiance-cluster
status: finished
owner: codex
topic_slug: railiance
created: "2026-06-15"
updated: "2026-06-16"
state_hub_workstream_id: "6434f7cb-e13c-4c05-839b-197bb239d5cd"
---
# activity-core cluster-owned deploy/verify
## Context
activity-core `ACTIVITY-WP-0007-T06` needs live Railiance cluster evidence for
the disabled ops inventory probe. That live verification should be owned by the
cluster/operator layer, not by arbitrary activity-core sessions with local
`kubectl` assumptions.
This workplan creates a cluster-owned path that keeps credentials in
operator-owned locations while returning only non-secret evidence to State Hub.
## Implement cluster-owned verifier
```task
id: RAILIANCE-WP-0012-T01
status: done
priority: high
state_hub_task_id: "3769fdfb-b4f1-431b-a55a-672d93b3ea55"
```
Add a repeatable command that:
- reconciles the activity-core Railiance runtime bundle;
- reruns `actcore-sync`;
- checks the `ops-service-inventory-probes` ActivityDefinition exists and is
still disabled;
- triggers the disabled definition manually through the in-cluster API path;
- verifies a fresh `ops_inventory_probe` progress event exists in State Hub;
- posts a non-secret State Hub evidence note for activity-core to cite.
Implemented as `tools/cmd/railiance-verify-activity-core` with Makefile target
`verify-activity-core`. The script defaults to the `railiance01` SSH executor;
use `ACTIVITY_CORE_CLUSTER_HOST=local` only for an explicitly selected local
`kubectl` context.
## Run live verification and publish evidence
```task
id: RAILIANCE-WP-0012-T02
status: done
priority: high
state_hub_task_id: "6d7f87c3-a533-4de1-84de-9ca65f2e2779"
```
Run `make verify-activity-core` against the Railiance cluster. On success, cite
the State Hub evidence note id in this task and in activity-core
`ACTIVITY-WP-0007-T06`.
If a gate fails, the verifier must still post a non-secret State Hub note with
the failing gate and last completed evidence fields.
2026-06-15: Completed against Railiance01 after refreshing the same-tag
`activity-core:railiance01-prod` image from activity-core commit `ab17378`,
importing digest `sha256:cff43c72455b9fc4fc11a0a997b4671a38987bb4583a600245dd961965af0e40`
into k3s containerd, syncing the current runtime bundle to
`/home/tegwick/activity-core/k8s/railiance`, and restarting the activity-core
runtime deployments. The verifier reconciled the runtime bundle, completed
`actcore-sync`, confirmed `ops-service-inventory-probes` exists and remains
disabled, triggered it manually, verified State Hub progress
`4c82360d-33e7-455b-8ab4-33facd4a3f8e`, and posted evidence note
`baeeaeac-aa6d-4406-ae64-e54577f21386`.
An intermediate verifier invocation accidentally targeted the local
CoulombCore `kubectl` context. It created only `actcore-*` runtime resources in
the existing `activity-core` namespace; those resources were removed with the
runtime manifest cleanup, and the pre-existing `llm-connect` deployment remains
running.
Operational cleanup note: the successful Railiance01 verifier run used
`ACTIVITY_CORE_RESTART_DEPLOYMENTS=1` after importing the same-tag image. The
script was corrected afterward to restart only `actcore-api`,
`actcore-worker`, and `actcore-event-router`, because
`actcore-state-hub-bridge` uses host networking and a rolling restart leaves a
new bridge pod pending behind the host-bound running pod. A 2026-06-16 cleanup
check showed the bridge rollout had settled on Railiance01: the host-bound
bridge pod was running and the replacement ReplicaSet was scaled to zero, so no
manual live cleanup was needed.
## Handoff closure to activity-core
```task
id: RAILIANCE-WP-0012-T03
status: done
priority: medium
state_hub_task_id: "43f652c6-fcc4-49fa-90cc-4122eb6d5321"
```
After live evidence exists, update activity-core `ACTIVITY-WP-0007-T06` to cite
the Railiance evidence and close it if Inter-Hub submission is active or
explicitly deferred with the clean State Hub fallback result.
2026-06-15: Updated activity-core `ACTIVITY-WP-0007-T06` to cite Railiance
evidence note `baeeaeac-aa6d-4406-ae64-e54577f21386` and close the task with
Inter-Hub submission explicitly deferred while the State Hub fallback evidence
path is verified.


@@ -0,0 +1,120 @@
---
id: RAILIANCE-WP-0013
type: workplan
title: "activity-core verifier evidence hardening"
domain: financials
repo: railiance-cluster
status: finished
owner: codex
topic_slug: railiance
created: "2026-06-16"
updated: "2026-06-16"
state_hub_workstream_id: "a3abb83a-2d42-40f9-a5f6-1dbc36903436"
---
# activity-core verifier evidence hardening
## Context
`RAILIANCE-WP-0012` moved activity-core live deploy/verify ownership into
`railiance-cluster` and produced State Hub evidence
`baeeaeac-aa6d-4406-ae64-e54577f21386`, with `ops_inventory_probe` progress
`4c82360d-33e7-455b-8ab4-33facd4a3f8e`.
A follow-up review found hardening work that matters for routine verifier use:
the verifier should prove the State Hub progress event belongs to the specific
manual trigger it launched, evidence should include an immutable runtime
identity, and local `kubectl` mode should require an explicit double opt-in.
This is a hardening follow-up only; it does not reopen activity-core
`ACTIVITY-WP-0007-T06`.
## Correlate State Hub progress to the manual trigger
```task
id: RAILIANCE-WP-0013-T01
status: done
priority: high
state_hub_task_id: "d013a4a9-77fc-4cf0-babf-528d71acc0a1"
```
Update `tools/cmd/railiance-verify-activity-core` so after
`POST /activity-definitions/<id>/trigger` it parses `trigger_key`, derives the
expected activity-core manual `run_id`, and polls State Hub until it finds
`ops_inventory_probe` where:
- `detail.activity_id == DEFINITION_ID`;
- `detail.activity_core_run_id == expected_run_id`.
The verifier must not pass on merely any event created after `STARTED_AT`.
Include the expected run id and matched progress id in the evidence note.
2026-06-16: Implemented exact correlation. The verifier now derives the
expected UUIDv5 `activity_core_run_id` from `<DEFINITION_ID>:<trigger_key>` and
requires State Hub `ops_inventory_probe` detail to match both `activity_id` and
`activity_core_run_id`.
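The correlation rule above can be sketched roughly as follows. This is an illustration in Python, not the verifier's actual code. The UUIDv5 namespace is not stated in this workplan, so `ASSUMED_NAMESPACE` is a stand-in, and the event shape (`id`, `event_type`, `detail`) is a hypothetical rendering of a State Hub progress record.

```python
import uuid

# Assumption: the namespace UUID used by activity-core is not given in this
# workplan; NAMESPACE_URL is only a placeholder for illustration.
ASSUMED_NAMESPACE = uuid.NAMESPACE_URL


def expected_run_id(definition_id: str, trigger_key: str) -> str:
    """Derive a UUIDv5 run id from '<DEFINITION_ID>:<trigger_key>'."""
    return str(uuid.uuid5(ASSUMED_NAMESPACE, f"{definition_id}:{trigger_key}"))


def find_matching_progress(events, definition_id, run_id):
    """Return the first ops_inventory_probe event matching both ids, else None."""
    for event in events:
        detail = event.get("detail") or {}
        if (
            event.get("event_type") == "ops_inventory_probe"
            and detail.get("activity_id") == definition_id
            and detail.get("activity_core_run_id") == run_id
        ):
            return event
    return None
```

Because the match requires both ids, a fresh but unrelated `ops_inventory_probe` event created after `STARTED_AT` does not satisfy the check.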
## Record immutable runtime evidence
```task
id: RAILIANCE-WP-0013-T02
status: done
priority: medium
state_hub_task_id: "c5780ec1-9a74-401e-b60e-a0fdf2b7e5d2"
```
Ensure successful evidence includes either `activity_core_revision` or an
immutable Kubernetes image ID/digest. When the remote repo revision is
unavailable, fall back to the live `actcore-api` pod container `imageID`.
2026-06-16: Implemented `api_image_id` capture from the live `actcore-api` pod
container status and added a guard so passed evidence must include either the
remote repo revision or the immutable image ID.
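A minimal sketch of that guard, in Python for illustration only. The field names `activity_core_revision` and `api_image_id` come from this workplan; the `status` field and the function names are assumptions about how an evidence record might look.

```python
def has_immutable_identity(evidence: dict) -> bool:
    """True when evidence carries a repo revision or an image ID."""
    return bool(
        evidence.get("activity_core_revision") or evidence.get("api_image_id")
    )


def check_passed_evidence(evidence: dict) -> dict:
    """Refuse to accept 'passed' evidence that lacks an immutable identity."""
    # 'status' is a hypothetical field name used only for this sketch.
    if evidence.get("status") == "passed" and not has_immutable_identity(evidence):
        raise ValueError(
            "passed evidence needs activity_core_revision or api_image_id"
        )
    return evidence
```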
## Guard explicit local kubectl override
```task
id: RAILIANCE-WP-0013-T03
status: done
priority: medium
state_hub_task_id: "0d60809f-3f1d-4ea9-a96f-af074911acc0"
```
Keep `railiance01`/SSH as the default executor. If
`ACTIVITY_CORE_CLUSTER_HOST=local` is selected, require an additional explicit
opt-in such as `ACTIVITY_CORE_ALLOW_LOCAL_KUBECTL=1` and print the current
`kubectl` context before continuing.
2026-06-16: Implemented the double opt-in. `ACTIVITY_CORE_CLUSTER_HOST=local`
now exits before cluster access unless `ACTIVITY_CORE_ALLOW_LOCAL_KUBECTL=1` is
also set, and accepted local mode prints the current `kubectl` context.
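The double opt-in can be sketched as below, again in Python and purely as an illustration. The two environment variable names are taken from this workplan; the function name and the returned strings are hypothetical.

```python
def resolve_executor(env: dict) -> str:
    """Pick the executor, refusing local kubectl without the second opt-in."""
    host = env.get("ACTIVITY_CORE_CLUSTER_HOST", "railiance01")
    if host != "local":
        # Default path: run cluster commands over SSH on the named host.
        return f"ssh:{host}"
    if env.get("ACTIVITY_CORE_ALLOW_LOCAL_KUBECTL") != "1":
        # Exit before any cluster access happens.
        raise PermissionError(
            "local kubectl requires ACTIVITY_CORE_ALLOW_LOCAL_KUBECTL=1"
        )
    # An accepted local run would print the current context at this point,
    # for example via `kubectl config current-context`.
    return "local"
```

Requiring two independent variables means a single stale export in an operator shell cannot redirect the verifier to whatever cluster the local `kubectl` context happens to point at.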
## Verify and publish hardening evidence
```task
id: RAILIANCE-WP-0013-T04
status: done
priority: medium
state_hub_task_id: "150e4fa3-800c-4997-baaa-da696f5a0fc0"
```
Run `bash -n tools/cmd/railiance-verify-activity-core`, run
`make verify-activity-core` against Railiance01, confirm the evidence note
matched the manual trigger run id, and post a non-secret State Hub note citing
the new evidence.
2026-06-16: Verified with `bash -n tools/cmd/railiance-verify-activity-core`
and a live Railiance01 `make verify-activity-core` run. The verifier posted
State Hub evidence note `60256e9a-9d1b-44db-8999-738cf03bca2e`, matched manual
run id `90e3b112-d1e3-51af-8fb2-cb61f26add17`, matched
`ops_inventory_probe` progress `db408146-0310-4ac3-ac77-f73c5a41e070`, and
included `api_image_id`
`sha256:5ff92a8217c450ae06075d00862b6e2a92a83ca09eea18b5a5e96b5d2d728b35`.
Done when:
- the verifier rejects unrelated fresh `ops_inventory_probe` events;
- evidence includes a non-null revision or image digest;
- local `kubectl` mode requires explicit double opt-in;
- the Railiance01 verifier run posts a passed evidence note with matched run id;
- `make fix-consistency REPO=railiance-cluster` has synced the workplan.

@@ -0,0 +1,258 @@
---
id: RAIL-BS-WP-0006
type: workplan
title: "Staged Promotion Lifecycle"
domain: financials
repo: railiance-cluster
status: finished
owner: railiance
topic_slug: railiance
repo_goal_id: "6ea441f7-7fe3-4598-922b-38baf20c0580"
state_hub_workstream_id: "cb72d3ba-1863-43c2-a2a5-49ac75fc2603"
created: "2026-02-24"
updated: "2026-06-27"
---
# Staged Promotion Lifecycle
## Goal
Design and implement the three-stage deployment lifecycle as the core
Railiance application promotion pattern:
1. Stage 1: local development and validation.
2. Stage 2: canary on production infrastructure.
3. Stage 3: full production promotion with rollback.
This lifecycle should become the repeatable path for native Railiance apps and
third-party upstream applications wrapped by a Railiance overlay repo.
## Why This Belongs Before Forgejo
Forgejo will become critical production infrastructure. Before moving the
source forge itself, Railiance needs a well-defined promotion lifecycle so the
Forgejo deployment, Actions runners, package registry, and future upgrades can
move through the same staged gates as every other important workload.
## Boundary
This workplan lives in `railiance-cluster` because it defines cluster runtime
promotion mechanics and the canonical handoff between local validation,
canary deployment, and production routing.
Expected cross-repo handoffs:
- `railiance-enablement`: developer-facing CLI templates and CI workflow
conventions.
- `railiance-platform`: shared platform dependencies used by canaries.
- `railiance-apps`: application Helm values and workload-specific promotion
definitions.
## Tasks
### T01 - Write deployment lifecycle specification
```task
id: RAIL-BS-WP-0006-T01
status: done
priority: high
state_hub_task_id: "fbfc341f-8ccb-4950-a85d-3e59c4f5b87f"
```
Write `docs/deployment-lifecycle.md`.
The spec should define:
- Stage 1, Stage 2, and Stage 3 semantics.
- Required checks before each stage.
- Canary acceptance gates.
- Rollback expectations.
- Human approval gates for production-critical workloads.
**Done when:** the lifecycle is clear enough to apply to Forgejo as a later
production workload.
2026-06-16: Added `docs/deployment-lifecycle.md` and linked it from
`docs/README.md`. The specification defines Stage 1 local validation, Stage 2
production canary, Stage 3 production promotion, required checks and evidence,
canary acceptance gates, rollback expectations, human approval gates for
production-critical workloads, and the Forgejo readiness questions that must be
answered before cutover.
---
### T02 - Define railiance directory schema and app.toml contract
```task
id: RAIL-BS-WP-0006-T02
status: done
priority: high
state_hub_task_id: "523cf928-bb0e-4109-a172-abf029c62885"
```
Define the repository-local `railiance/` directory schema and `app.toml`
contract for native and third-party applications.
Minimum contract:
- App identity and ownership.
- Stage definitions.
- Required platform dependencies.
- Health checks and observability endpoints.
- Promotion and rollback commands.
- Secret references without plaintext secret values.
**Done when:** a repo can declare how it moves through the Railiance promotion
lifecycle without bespoke instructions.
2026-06-27: Added `docs/app-toml-contract.md`,
`schemas/railiance-app.schema.json`, and `examples/railiance/app.toml`. The v1
contract covers app identity, ownership, source/artifact policy, platform
dependencies, secret references without plaintext values, health and
observability endpoints, stage commands/checks/evidence, canary and promotion
modes, rollback strategy, and human approval gates.
---
### T03 - Overlay repo pattern and creation script
```task
id: RAIL-BS-WP-0006-T03
status: done
priority: medium
state_hub_task_id: "7cd378f2-0319-407a-9ce7-2c6d1a6d6d24"
```
Design the overlay repo pattern for third-party upstream applications and add
`create_railiance_overlay_repo.sh` or equivalent tooling.
The pattern should keep upstream code and Railiance deployment concerns cleanly
separated while still allowing reproducible promotion.
**Done when:** a third-party app can be wrapped without forking deployment
logic into the upstream repository.
2026-06-27: Added `docs/overlay-repo-pattern.md` and
`tools/create_railiance_overlay_repo.sh`, plus the
`bin/railiance create-overlay` dispatcher entry. The scaffold records upstream
identity in `railiance/upstream.toml`, generates a schema-valid
`railiance/app.toml`, stage values, a thin Helm chart, Stage 1 test script,
rollback runbook, and promotion notes without vendoring upstream code or
touching secrets.
---
### T04 - railiance run command
```task
id: RAIL-BS-WP-0006-T04
status: done
priority: high
state_hub_task_id: "95c3311b-04bb-4c83-bda3-47958217b665"
```
Implement the Stage 1 `railiance run` command for local development and
validation.
Expected behavior:
- Read `railiance/app.toml`.
- Start or validate the local development target.
- Run defined local health checks.
- Emit a machine-readable result suitable for later promotion gates.
**Done when:** at least one representative app can complete Stage 1 locally.
2026-06-27: Added `tools/cmd/railiance-run`, the `bin/railiance run` dispatcher
entry, and `docs/railiance-run-command.md`. The command reads
`railiance/app.toml`, runs Stage 1 commands and local checks, and emits
`railiance.run-result.v1` JSON without command logs or secret values. Updated
the overlay generator so a generated Forgejo overlay completes Stage 1 locally
in this environment; Helm rendering is optional when Helm is unavailable.
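As a rough illustration of a result that carries outcomes but no command logs, the Python sketch below runs named checks and discards their output. Only the `railiance.run-result.v1` identifier comes from this workplan; every other field name (`schema`, `stage`, `passed`, `checks`) is an assumption and may not match the real result format.

```python
import json
import subprocess
import sys


def run_checks(checks: dict) -> dict:
    """Run each check command and record only its name and pass/fail."""
    results = []
    for name, argv in checks.items():
        # Output is discarded so logs and secret values never reach the result.
        proc = subprocess.run(
            argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        results.append({"name": name, "passed": proc.returncode == 0})
    return {
        "schema": "railiance.run-result.v1",  # identifier named in the workplan
        "stage": 1,  # hypothetical field
        "passed": all(r["passed"] for r in results),
        "checks": results,
    }


if __name__ == "__main__":
    demo = {"python-available": [sys.executable, "-c", "pass"]}
    print(json.dumps(run_checks(demo), indent=2))
```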
---
### T05 - Canary Helm chart template
```task
id: RAIL-BS-WP-0006-T05
status: done
priority: high
state_hub_task_id: "47b8cd47-99c7-4f31-a147-ea16afde7217"
```
Create the Stage 2 canary Helm chart template.
Minimum requirements:
- Stable and canary release identities.
- Weighted routing or equivalent traffic split through the chosen ingress
path.
- Prometheus-compatible annotations.
- Resource limits appropriate for single-node and future ThreePhoenix use.
- Rollback-safe values layout.
**Done when:** a canary deployment can be created without hand-editing cluster
resources.
2026-06-27: Updated generated overlay charts for Stage 2 canaries. The
scaffold now emits stable/canary release identities, isolated canary ingress by
default, optional Traefik weighted routing, Prometheus-compatible annotations,
HTTP probes, conservative single-node resource limits, rollback labels,
separate Stage 2/Stage 3 values, and `tests/stage2-template.sh`. Verified a
fresh Forgejo overlay with schema validation, Stage 1 run, and Stage 2 scaffold
checks; Helm rendering was skipped because Helm is unavailable in this
environment.
---
### T06 - railiance deploy --stage 2 and observation tooling
```task
id: RAIL-BS-WP-0006-T06
status: done
priority: medium
state_hub_task_id: "6a5c7422-fcb1-49d1-8153-e891bd1c27fa"
```
Implement Stage 2 deployment and observation commands.
Expected behavior:
- Deploy the canary from declared app metadata.
- Show rollout state, pod health, ingress/routing state, and key metrics.
- Fail closed when prerequisites or health gates are missing.
**Done when:** Stage 2 can be run and observed from a repeatable command path.
2026-06-27: Added `tools/cmd/railiance-stage2` and dispatcher entries for
`bin/railiance deploy` and `bin/railiance observe`. Deploy emits a
`railiance.stage2-deploy-result.v1` plan by default, can run Helm server dry-run
or apply when tools and cluster access are present, and fails closed when
required paths, Helm, or approval evidence are missing. Observe emits a
`railiance.stage2-observe-result.v1` target plan by default and runs live
kubectl rollout, pod, ingress, and metrics checks only with `--live`. Updated
generated overlays to declare the repeatable Stage 2 plan commands.
---
### T07 - railiance promote, rollback, and onboarding guide
```task
id: RAIL-BS-WP-0006-T07
status: done
priority: medium
state_hub_task_id: "476198f6-0049-4ac4-9593-6723c86c9602"
```
Implement Stage 3 promotion and rollback commands, then write the reference
onboarding guide.
Expected output:
- `railiance promote` for controlled production promotion.
- `railiance rollback` for reverting to the previous stable version.
- A guide showing how a representative app adopts the lifecycle.
- Explicit human approval points for critical infrastructure workloads.
**Done when:** a representative app can move Stage 1 -> Stage 2 -> Stage 3 and
back through rollback using documented commands.
2026-06-27: Added `tools/cmd/railiance-stage3` and dispatcher entries for
`bin/railiance promote` and `bin/railiance rollback`. Both commands default to
non-mutating JSON plans, apply modes require approval evidence and Helm, and
rollback apply also requires a Helm revision for `helm-revision` strategy.
Added `docs/promote-rollback-onboarding.md` with the representative Stage 1 ->
Stage 2 -> Stage 3 -> rollback path and explicit human approval points for
critical workloads. Updated generated overlays to declare promote/rollback plan
commands.
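The rollback gating described above can be sketched as a small decision function. This is an illustrative Python sketch, not the contents of `tools/cmd/railiance-stage3`; the function name, parameters, and returned fields are hypothetical, while the rules (plan by default, apply needs approval evidence and Helm, `helm-revision` strategy needs a revision) follow the note above.

```python
from typing import Optional


def gate_rollback(
    *,
    apply: bool,
    approval_evidence: Optional[str],
    helm_available: bool,
    strategy: str,
    helm_revision: Optional[int],
) -> dict:
    """Return a plan by default; refuse apply mode unless every gate is met."""
    if not apply:
        return {"mode": "plan", "mutating": False}

    missing = []
    if not approval_evidence:
        missing.append("approval evidence")
    if not helm_available:
        missing.append("helm")
    if strategy == "helm-revision" and helm_revision is None:
        missing.append("helm revision")
    if missing:
        raise PermissionError(
            "rollback apply refused; missing: " + ", ".join(missing)
        )
    return {"mode": "apply", "mutating": True, "helm_revision": helm_revision}
```

Collecting all missing prerequisites before failing lets an operator fix every gap in one pass instead of discovering them one refusal at a time.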
## Dependencies
This workplan should be done before the Forgejo production cutover. It can run
in parallel with preparatory ThreePhoenix design, but its Stage 2/3 behavior
should be validated against the intended ThreePhoenix cluster model.

@@ -0,0 +1,106 @@
---
id: RAILIANCE-WP-0014
type: workplan
title: "activity-core llm-connect live reconcile"
domain: financials
repo: railiance-cluster
status: finished
owner: codex
topic_slug: railiance
created: "2026-06-18"
updated: "2026-07-01"
state_hub_workstream_id: "a152ddda-d60a-4a65-9b9c-59e2db9ff2b7"
---
# activity-core llm-connect live reconcile
## Context
activity-core has updated its Railiance runtime manifest so
`actcore-runtime-config` points at the verified in-cluster llm-connect URL:
```text
LLM_CONNECT_URL=http://llm-connect.activity-core.svc.cluster.local:8080
LLM_CONNECT_TIMEOUT_SECONDS=300
```
The remaining live gate belongs at the cluster/operator layer. Provider
credentials must stay outside Git and State Hub, and the fixture smoke should
record only non-secret evidence.
## Add cluster-owned reconcile/check command
```task
id: RAILIANCE-WP-0014-T01
status: done
priority: high
state_hub_task_id: "49288db7-8102-4ad5-af08-1fe6ab3f1d37"
```
Add a repeatable Railiance command that:
- reconciles the non-secret activity-core runtime config keys;
- checks the provider Secret by key count only;
- applies the llm-connect overlay only after the provider Secret exists;
- runs the in-namespace fixture smoke only after deployment readiness;
- posts a non-secret State Hub evidence note.
2026-06-18: Added `tools/cmd/railiance-reconcile-activity-core-llm-connect`
and Makefile target `reconcile-activity-core-llm-connect`.
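The gate ordering in the list above can be sketched as follows, in Python for illustration only. It assumes the Secret is available as the parsed JSON object that Kubernetes returns (with a `data` map of encoded values); the step names and result fields are hypothetical labels, not the command's real output.

```python
from typing import Optional


def secret_key_count(secret_obj: Optional[dict]) -> int:
    """Count keys in a Secret's data map without decoding any value."""
    if secret_obj is None:
        return 0
    return len(secret_obj.get("data") or {})


def reconcile_plan(secret_obj: Optional[dict], deployment_ready: bool) -> dict:
    """Decide which steps may run, gating overlay and smoke in order."""
    steps = ["reconcile-runtime-config"]  # always safe: non-secret keys only
    if secret_obj is None:
        smoke = "blocked"  # provider Secret gate not satisfied
    else:
        steps.append("apply-llm-connect-overlay")
        if deployment_ready:
            steps.append("fixture-smoke")
            smoke = "ready"
        else:
            smoke = "waiting"
    steps.append("post-evidence-note")
    return {
        "provider_secret_key_count": secret_key_count(secret_obj),
        "steps": steps,
        "smoke": smoke,
    }
```

Only the number of keys leaves the function, so the evidence note can report Secret presence without any value being read or logged.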
## Reconcile live non-secret runtime config
```task
id: RAILIANCE-WP-0014-T02
status: done
priority: high
state_hub_task_id: "61df5bad-535f-4ad1-ac7a-f46ff278c388"
```
Patch the live `activity-core/actcore-runtime-config` ConfigMap so it consumes
the verified llm-connect service URL and timeout. Do not touch Secret values.
2026-06-18: The reconcile command patches only `LLM_CONNECT_URL` and
`LLM_CONNECT_TIMEOUT_SECONDS`, then re-reads the live ConfigMap to verify the
values. Live evidence note `c72c514a-399e-4c54-8d5b-d36405932360` confirms
`LLM_CONNECT_URL=http://llm-connect.activity-core.svc.cluster.local:8080` and
`LLM_CONNECT_TIMEOUT_SECONDS=300`.
## Complete provider Secret, deployment, and smoke gate
```task
id: RAILIANCE-WP-0014-T03
status: done
priority: high
state_hub_task_id: "ae8af00a-c14f-4b76-933c-46d06cd360ae"
```
After an operator stores provider credentials in
`activity-core/llm-connect-provider-secrets`, rerun:
```bash
make reconcile-activity-core-llm-connect
```
The command will apply the llm-connect overlay, wait for deployment readiness,
run the in-namespace fixture smoke with `imagePullPolicy=Never`, and post
non-secret evidence: provider Secret key count, deployment readiness,
pass/fail, and either a latency/recommendation summary or a sanitized failure.
2026-07-01: Gate closed. Provider Secret `activity-core/llm-connect-provider-secrets`
present (key count 1, no values inspected), overlay applied (no drift),
deployment `llm-connect` ready 1/1, in-namespace fixture smoke passed
(`health=ok latency_seconds=2.084 recommendations=1`). Evidence note
`bddbf5d2-6cbe-4d97-9de6-689147d61be1`. The first rerun failed with
`Connection refused` because the allowlist of the
`llm-connect-activity-core-only` NetworkPolicy (added 2026-06-19) had not yet
picked up the fresh smoke-pod IP; the reconcile tool now retries the smoke up
to 6× with a 5s warm-up inside the pod.
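A retry loop of that shape might look like the Python sketch below. It is an illustration, not the reconcile tool's code, and it reads "6× with a 5s warm-up" as a pause before each attempt, which is an assumption. The `smoke` callable and the injectable `sleep` are hypothetical.

```python
import time


def retry_smoke(smoke, attempts: int = 6, warmup_seconds: float = 5.0,
                sleep=time.sleep):
    """Call smoke() up to `attempts` times, pausing before each attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        # Give the NetworkPolicy allowlist time to include the new pod IP.
        sleep(warmup_seconds)
        try:
            return smoke()
        except ConnectionRefusedError as exc:
            last_error = exc
    raise last_error
```

Retrying only on `ConnectionRefusedError` keeps genuine smoke failures, such as an unhealthy response, from being masked by the warm-up logic.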
Historical live gate on 2026-06-18: provider Secret
`activity-core/llm-connect-provider-secrets` was missing, so deployment and
smoke were intentionally blocked until operator/OpenBao-to-Kubernetes Secret
custody was complete. Evidence note
`c72c514a-399e-4c54-8d5b-d36405932360` records provider Secret status
`missing`, key count `0`, deployment status `not checked; provider Secret gate
not satisfied`, and smoke status `blocked`.