Repo hygiene + new workplans (RAIL-BS-WP-0008/0009)
Some checks failed
railiance-tests / smoke (push) Has been cancelled

- Add RAIL-BS-WP-0008 (activity-core WP-0016 deploy) and RAIL-BS-WP-0009
  (admin-sync smoke) from inbox asks 87952ff1 / aa8b7986
- Archive finished workplans to workplans/archived/ per ADR-001 convention;
  normalize frontmatter statuses (completed/done -> finished)
- Fill stack-and-commands.md, complete repo-boundary.md, refresh SCOPE
  Current State, add docs/operator-runbook.md for production-touching targets

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 00:02:36 +02:00
parent eefa6c1b2a
commit b3b0c3e3ff
15 changed files with 206 additions and 24 deletions

@@ -1,8 +1,17 @@
## Repo boundary
This repo owns **railiance-cluster** only. It does not own:
This repo owns **railiance-cluster** only (OAS S2 — cluster runtime). It does not own:
<!-- TODO: List what belongs in adjacent repos, e.g.:
- SSH key management → railiance-infra/
- State hub code → state-hub/
-->
- OS hardening, SSH, firewall, Terraform/cloud-init provisioning → `railiance-infra` (S1)
- Platform services: PostgreSQL HA, Valkey, OpenBao, object storage → `railiance-platform` (S3)
- CI/CD templates, developer portal, SDKs → `railiance-enablement` (S4)
- Application Helm releases and workload manifests (incl. Gitea values) → `railiance-apps` (S5)
- Forge/registry infrastructure (Gitea/Forgejo operation) → `railiance-forge`
- Ecosystem graph/registry model → `railiance-fabric`
- Identity/SSO/MFA (Keycloak, IAM profiles) → `net-kingdom`
- State Hub code → `state-hub` / `the-custodian`
S2 *does* deploy cluster-scoped operators and addons (cert-manager, cnpg
operator, ArgoCD, nginx ingress) and owns kubeconfig custody, plus
cluster-owned deploy/verify gates for workloads whose repos have no cluster
access (e.g. the activity-core and llm-connect reconcile commands).

@@ -1,19 +1,22 @@
## Stack
<!-- TODO: Fill in language, frameworks, and key dependencies -->
- **Language:**
- **Key deps:**
- **Language:** Bash tooling (`tools/cmd/`) orchestrating kubectl/Helm over SSH
- **Key deps:** k3s on railiance01 (COULOMBCORE), Helm, SOPS/age for secrets, State Hub REST for evidence notes
- **Execution model:** commands run from the workstation; cluster access is `ssh railiance01` (most `tools/cmd/*` accept a `CLUSTER_HOST` override)
## Dev Commands
```bash
# TODO: Fill in the standard commands for this repo
# Install dependencies
# Run tests
# Lint / type check
# Build / package (if applicable)
make help # list all targets
make preflight # pre-migration safety gate — run before cluster work
make smoke # Kubernetes smoke tests
make test-ha-failover # HA failover test (kills primary PG pod, asserts recovery)
sudo make backup # age-encrypted backup: k3s state + Helm values + kubeconfig
make restore # list backups + restore guide
make verify-activity-core # reconcile activity-core runtime + probe evidence
make reconcile-activity-core-llm-connect # llm-connect reconcile + non-secret gate checks
```
Production-touching targets (deploy/reconcile/backup) need operator approval —
see `docs/operator-runbook.md`. There is no test suite or linter in this repo;
validation is the preflight + smoke targets against the live cluster.
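
The `CLUSTER_HOST` override mentioned in the execution model most likely follows the usual shell default-expansion pattern. A minimal sketch, assuming a hypothetical helper `cluster_ssh_cmd` (not a script from `tools/cmd/`) and a made-up alternative host name `other-host`; it only prints the command it would run:

```bash
#!/usr/bin/env bash
# Sketch only: cluster_ssh_cmd is a hypothetical helper, not part of tools/cmd/.
# It prints the ssh invocation instead of executing it, so it is safe anywhere.

cluster_ssh_cmd() {
  # Fall back to railiance01 when CLUSTER_HOST is unset or empty.
  local host="${CLUSTER_HOST:-railiance01}"
  printf 'ssh %s %s\n' "$host" "$*"
}

# Default host.
cluster_ssh_cmd kubectl get nodes

# Per-invocation override (other-host is an illustrative name).
CLUSTER_HOST=other-host cluster_ssh_cmd kubectl get nodes
```

Printing instead of executing keeps the sketch clear of the production-touching path described in `docs/operator-runbook.md`.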

@@ -60,8 +60,8 @@ Railiance is structured as five independent repos per OAS Stack layer. This repo
## Current State
- Status: active / stable
- Implementation: k3s baseline complete (RAIL-BS-WP-0002 done); pgpool HA failover fix complete (RAIL-BS-WP-0003 done); integrated backup complete (RAIL-BS-WP-0004 done — age-encrypted local backup, daily cron under root)
- Stability: high — no active open workplans
- Implementation: k3s baseline, pgpool HA failover fix, age-encrypted backup, kubeconfig delivery, staged promotion lifecycle, and activity-core/llm-connect reconcile gates all finished (RAIL-BS-WP-0002…0006, RAILIANCE-WP-0012…0014)
- Open work: RAIL-BS-WP-0007 ThreePhoenix HA cluster (active, 0/7); RAIL-BS-WP-0008 activity-core WP-0016 deploy (ready); RAIL-BS-WP-0009 admin-sync smoke (ready)
- Usage: core Kubernetes runtime for all Railiance deployments; runs on COULOMBCORE (92.205.130.254)
- Also deployed at cluster level: cert-manager, ArgoCD, CloudNative PG operator (cnpg), nginx ingress, SSO stack (mfa + sso namespaces via net-kingdom)

docs/operator-runbook.md Normal file
@@ -0,0 +1,35 @@
# Operator runbook — production-touching commands
All targets below change state on the production k3s cluster (railiance01 /
COULOMBCORE, 92.205.130.254) or its backups. Agent sessions running in auto
mode are denied these by the permission classifier — that is intentional.
## How to run a production-touching target
- **Interactively in a Claude Code session:** type `! <command>` so the
command runs under the operator's authority and the output lands in the
conversation for the agent to act on.
- **Directly:** run from this repo root on the workstation; cluster access is
`ssh railiance01` (key-based, configured in `~/.ssh/config`).
## Production-touching targets
| Target | Effect |
|---|---|
| `sudo make backup` | writes age-encrypted backup to `/opt/backup/railiance/cluster/` |
| `make k3s-install` | (re)installs k3s baseline — destructive, preflight first |
| `make test-ha-failover` | kills the primary PG pod to assert recovery |
| `make verify-activity-core` | reconciles activity-core runtime on railiance01 |
| `make reconcile-activity-core-llm-connect` | patches ConfigMap, applies llm-connect overlay, runs smoke pod |
## Read-only / safe targets
`make help`, `make preflight`, `make smoke`, `make restore` (prints guide
only). These are safe to allowlist for agent sessions.
## Evidence convention
Reconcile/verify targets post non-secret evidence notes to the State Hub
(`STATE_HUB_EVIDENCE_WORKSTREAM_ID` / `STATE_HUB_EVIDENCE_TASK_ID` env vars
attach them to a workstream/task). Never record Secret values — key counts
and readiness states only.
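
The "key counts and readiness states only" rule can be illustrated with a small sketch. It assumes a hypothetical helper `evidence_line` and an env-file style input (`KEY=VALUE` per line); neither is taken from this repo's tooling, and the sketch does not post anything to the State Hub.

```bash
#!/usr/bin/env bash
# Sketch only: evidence_line is a hypothetical helper, not part of this repo.
# It reads KEY=VALUE lines on stdin and reports how many keys exist.
# Values are never echoed, only counted.

evidence_line() {
  local name="$1" ready="$2" count
  # grep -c prints the number of matching lines; "|| true" covers the
  # zero-match case, where grep exits non-zero but still prints 0.
  count="$(grep -c -E '^[A-Za-z_][A-Za-z0-9_]*=' || true)"
  printf 'secret=%s key_count=%s ready=%s\n' "$name" "$count" "$ready"
}

# Illustrative input; the values stay out of the output.
printf 'DB_PASSWORD=example\nAPI_TOKEN=example\n' | evidence_line demo-secret true
```

The resulting line carries a name, a count, and a readiness flag, which is the kind of content the convention allows in an evidence note.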

@@ -0,0 +1,89 @@
---
id: RAIL-BS-WP-0008
type: workplan
title: "activity-core WP-0016 triage-output robustness deploy"
domain: financials
repo: railiance-cluster
status: ready
owner: railiance-cluster
topic_slug: railiance
created: "2026-07-01"
updated: "2026-07-01"
---
# activity-core WP-0016 triage-output robustness deploy
## Context
Inbox message `87952ff1` (activity-core, 2026-06-26): the scheduled daily WSJF
triage run on 2026-06-26 failed schema validation and the whole run was
discarded, resetting the WP-0006-T03 three-clean-run streak. ACTIVITY-WP-0016
hardened the instruction-executor output contract in-repo (commits
`5eb33bd..bf877b7` on activity-core main, 220 tests passed). The remaining
work is operator/cluster-owned on railiance01.
**Deploy coupling constraint:** `schemas/daily-triage-report.json` is now
strict per-item and is consumed by both the llm-connect hint and the
whole-doc validator. It MUST ship together with the new `executor.py`
(T03 per-item quarantine parser). Never deploy the schema ahead of the code.
## Deploy activity-core with coupled schema and executor
```task
id: RAIL-BS-WP-0008-T01
status: todo
priority: high
```
Rebuild/import the activity-core image from main (`bf877b7` or later) into
the railiance01 k3s runtime and reconcile the activity-core deployment so the
new executor and the strict per-item schema ship together.
## Update daily-statehub-wsjf-triage runtime-bundle Instruction
```task
id: RAIL-BS-WP-0008-T02
status: todo
priority: high
```
In the runtime projection (not the activity-core repo), update the
`daily-statehub-wsjf-triage` Instruction:
- raise `max_tokens` (currently ~1200; give clear headroom above the
~1300–1500-token 16-workstream list);
- prompt: bounded top-N (≤7) ranked recommendations, "if uncertain emit fewer
well-formed items rather than more";
- prompt: per-item NDJSON framing (leading summary object, then one
recommendation JSON object per line) so the T03 parser recovers items
independently.
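
Per-line framing helps recovery because a response cut off at `max_tokens` damages only its final line. The sketch below is a crude illustration, not the T03 parser in `executor.py`. It assumes a hypothetical function `split_ndjson` and illustrative field names, and it treats a line as well-formed only if it starts with `{` and ends with `}`, which is a structural check and not JSON validation.

```bash
#!/usr/bin/env bash
# Sketch only: split_ndjson is hypothetical and does NOT validate JSON.
# Kept lines go to stdout; a one-line summary goes to stderr.

split_ndjson() {
  local kept=0 quarantined=0 line
  # "|| [ -n "$line" ]" also processes a final line with no trailing newline,
  # which is what a truncated response looks like.
  while IFS= read -r line || [ -n "$line" ]; do
    [ -z "$line" ] && continue
    case "$line" in
      '{'*'}') kept=$((kept + 1)); printf '%s\n' "$line" ;;
      *)       quarantined=$((quarantined + 1)) ;;
    esac
  done
  printf 'kept=%d quarantined_count=%d\n' "$kept" "$quarantined" >&2
}

# Summary object, one complete item, one item cut off mid-field.
printf '%s\n' '{"summary":"2 items"}' '{"rank":1}' '{"rank":2,"titl' | split_ndjson
```

With this input the first two lines are kept and the truncated third line is counted as quarantined, so the run degrades instead of being discarded as a whole.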
## Pull raw llm-connect response for the 2026-06-26 run
```task
id: RAIL-BS-WP-0008-T03
status: todo
priority: medium
```
From the llm-connect pod logs / response store on railiance01, capture the
full raw response and `finish_reason` for the 2026-06-26 05:20:57Z run
(activity-core retained only a 4000-char preview; the JSON break is at char
5268). Send to activity-core to close ACTIVITY-WP-0016-T01. Logs only, no
secrets.
## Acceptance smoke
```task
id: RAIL-BS-WP-0008-T04
status: todo
priority: high
```
Trigger one daily-triage run against the reconciled runtime and confirm it
either (i) returns a clean schema-valid report, or (ii) degrades gracefully
(valid recommendations with `output_validated=true`, `partial=true`,
`quarantined_count>0`) instead of discarding the run. Confirm the State Hub
shows a matching `daily_triage` progress event. Closes ACTIVITY-WP-0016-T05
and unblocks the three-clean-run streak for ACTIVITY-WP-0010-T04 /
WP-0006-T03.

@@ -0,0 +1,46 @@
---
id: RAIL-BS-WP-0009
type: workplan
title: "activity-core no-restart admin-sync smoke (ACTIVITY-WP-0012-T05)"
domain: financials
repo: railiance-cluster
status: ready
owner: railiance-cluster
topic_slug: railiance
created: "2026-07-01"
updated: "2026-07-01"
---
# activity-core no-restart admin-sync smoke (ACTIVITY-WP-0012-T05)
## Context
Inbox message `aa8b7986` (activity-core, 2026-06-18): activity-core commit
`3e93567` implements ACTIVITY-WP-0012 T01–T04 (shared sync_service,
`POST /admin/sync`, explicit schedule upsert/pause/orphan-delete counts,
worker startup reuse, runbook docs; 192 tests passed). T05 is the
cluster-owned smoke: prove admin sync works **without** worker
SIGTERM/pod restart.
The deploy precondition is covered by RAIL-BS-WP-0008-T01 (main at
`bf877b7` includes `3e93567`), so run this after that reconcile.
## Run the no-restart admin-sync smoke
```task
id: RAIL-BS-WP-0009-T01
status: wait
priority: medium
```
After RAIL-BS-WP-0008-T01 is deployed, without restarting the worker:
1. Change or use a customer ActivityDefinition enabled-flip/rename fixture.
2. Call `POST /admin/sync?definitions=true&schedules=true` from the operator
path.
3. Confirm the new Temporal schedule is active and the retired/disabled
schedule is paused or deleted per sync semantics.
4. Confirm event-triggered definitions still fire normally.
5. Record non-secret evidence in the State Hub. Response JSON should include
`definitions.synced`, `schedules.upserted`, `schedules.paused`,
`schedules.deleted_orphans`, and `errors[]`.
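
The field list in step 5 can be checked mechanically before evidence is recorded. A minimal sketch, assuming the response nests the counts as `{"definitions": {...}, "schedules": {...}, "errors": [...]}` (the exact shape is an assumption) and using a hypothetical helper `missing_sync_fields` that looks for key names without parsing JSON:

```bash
#!/usr/bin/env bash
# Sketch only: missing_sync_fields is hypothetical and matches key names
# textually; it does not parse JSON or check nesting.
# In practice the input might come from something like:
#   curl -sS -X POST "$BASE_URL/admin/sync?definitions=true&schedules=true"
# where BASE_URL is an assumed variable, not one defined by this repo.

missing_sync_fields() {
  local body key missing=""
  body="$(cat)"
  for key in definitions synced schedules upserted paused deleted_orphans errors; do
    case "$body" in
      *"\"$key\""*) ;;                      # quoted key name found
      *) missing="$missing $key" ;;
    esac
  done
  if [ -n "$missing" ]; then
    printf 'missing:%s\n' "$missing"
    return 1
  fi
  printf 'all expected fields present\n'
}

# Illustrative response body with assumed nesting.
printf '%s' '{"definitions":{"synced":3},"schedules":{"upserted":1,"paused":1,"deleted_orphans":0},"errors":[]}' \
  | missing_sync_fields
```

A non-zero exit status lets an operator script stop before posting an incomplete evidence note.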

@@ -4,7 +4,7 @@ type: workplan
title: "Dependency Management — Add lockfile for Ansible control-node deps"
domain: financials
repo: railiance-cluster
status: completed
status: finished
owner: railiance
topic_slug: railiance
state_hub_workstream_id: 59155efb-b461-4caa-ad7b-b3fce348db84

@@ -4,7 +4,7 @@ type: workplan
title: "k3s and Kubernetes Platform Baseline"
domain: financials
repo: railiance-cluster
status: completed
status: finished
owner: railiance
topic_slug: railiance
repo_goal_id: "70ab2379-fb9d-4fec-a09d-b2a717e4ace8"

@@ -4,7 +4,7 @@ type: bug-report
title: "pgpool CrashLoopBackOff on PostgreSQL HA failover — missing secret key"
domain: financials
repo: railiance-cluster
status: completed
status: finished
owner: tegwick
created: "2026-03-10"
updated: "2026-03-10"

@@ -4,7 +4,7 @@ type: workplan
title: "Integrated Backup — S2 Kubernetes Runtime Layer"
domain: financials
repo: railiance-cluster
status: done
status: finished
owner: tegwick
topic_slug: railiance
state_hub_workstream_id: "7e8b0c20-51eb-40c9-9e3b-85dd380d7625"

@@ -4,7 +4,7 @@ type: workplan
title: "Kubeconfig delivery for netkingdom SSO/MFA stack apply"
domain: financials
repo: railiance-cluster
status: done
status: finished
owner: railiance-worker
topic_slug: railiance
capability_request_id: "34b97d89-e80a-42ae-a623-a9185e5b17f5"
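
The status normalization applied across these frontmatter hunks (completed/done -> finished) could be done mechanically. A minimal sketch, assuming a hypothetical filter `normalize_status` that rewrites only a whole-line `status:` field; it is not necessarily how this commit was produced:

```bash
#!/usr/bin/env bash
# Sketch only: normalize_status is a hypothetical stdin-to-stdout filter.

normalize_status() {
  # Anchored at both ends so only exact "status: completed" or
  # "status: done" lines change; quoted or indented text is left alone.
  sed -E 's/^status: (completed|done)$/status: finished/'
}

printf 'status: done\nowner: tegwick\n' | normalize_status
```

Because the filter sees the whole stream, a `status: done` line inside a task block would be rewritten too; restricting it to the frontmatter would need an extra range condition.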