Adapt RAIL-HO-WP-0005 for production Forgejo and staged repo ladder

Reflects live railiance01 deploy, cancels isolated probe T03 in favor of
in-production pilots, marks T08/T10 progress (forgejo-actions-probe,
glas-harness), and documents tier 0-3 migration sequencing before state-hub.
2026-07-04 01:02:42 +02:00
parent 6b0ededee2
commit 67b259f6dc
2 changed files with 157 additions and 82 deletions


@@ -204,3 +204,25 @@ lost or left with an untracked remote.
This first pass satisfies the public and infrastructure metadata part of T01.
T01 should remain open until the authenticated admin inventory and missing repo
classification are complete.
## Addendum (2026-07-04) — migration ladder and new repos
`RAIL-HO-WP-0005` now uses a **staged per-repo ladder** instead of an isolated
probe namespace (T03 cancelled). Repos to add or re-classify on next inventory
refresh:
| Repo | On Gitea (2026-06) | On Forgejo (2026-07-04) | Tier | Notes |
| --- | --- | --- | ---: | --- |
| `forgejo-actions-probe` | — | yes | 0 | Disposable runner/OCI probe |
| `glas-harness` | yes (not in table above) | yes (canonical) | 1 | Git+SSH+CI pilot; see `the-custodian/docs/forgejo-repo-migration-pilot-glas-harness.md` |
**Tier definitions** (for per-repo `migration tier` column in a future refresh):
| Tier | Criteria | Examples |
| ---: | --- | --- |
| 0 | Disposable integration probes | `forgejo-actions-probe` |
| 1 | Non-production; git+CI only | `glas-harness` |
| 2 | Non-production with container image + registry pull | TBD (`key-cape` candidate) |
| 3 | Production drain wave / sweep registration | `state-hub`, `issue-core`, … |
Production repos stay on Gitea until tier 0-2 gates and T09 backup drill pass.
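
As a hedged illustration of how the tier definitions could be applied during an inventory refresh, the sketch below derives a tier from three per-repo flags. The `RepoFacts` type and the flag names (`disposable`, `production`, `builds_image`) are assumptions made for this example; they are not part of the existing inventory format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoFacts:
    """Hypothetical per-repo facts gathered during an inventory refresh."""
    name: str
    disposable: bool = False    # throwaway integration probe
    production: bool = False    # part of the production drain wave
    builds_image: bool = False  # publishes a container image


def migration_tier(repo: RepoFacts) -> int:
    """Map repo facts onto the tier 0-3 criteria from the table above."""
    if repo.production:
        return 3  # production drain wave / sweep registration
    if repo.disposable:
        return 0  # disposable probes stay tier 0 even if they build images
    if repo.builds_image:
        return 2  # non-production with container image + registry pull
    return 1      # non-production; git+CI only


if __name__ == "__main__":
    for facts in (
        RepoFacts("forgejo-actions-probe", disposable=True, builds_image=True),
        RepoFacts("glas-harness"),
        RepoFacts("state-hub", production=True, builds_image=True),
    ):
        print(facts.name, migration_tier(facts))
```

Checking `production` first keeps any production repo at tier 3 regardless of its other properties, which matches the rule that production repos move last.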


@@ -8,7 +8,7 @@ status: active
owner: railiance
topic_slug: railiance
created: "2026-05-03"
updated: "2026-06-04"
updated: "2026-07-04"
state_hub_workstream_id: "84e17675-0d15-4268-a8bd-540124d37018"
---
@@ -24,6 +24,13 @@ Forgejo will become the heart of Railiance infrastructure. The work must be
fully automated, backup-backed, recovery-drilled, and suitable for long-lived
operation on railiance01 before any production cutover happens.
**Sequencing update (2026-07-04):** Production Forgejo is live on railiance01
with Gitea still canonical per the safety contract. Repo cutover proceeds
**staged per-repo** using a migration ladder (disposable probes → non-production
pilots → image-capable pilots → production repos). `state-hub` is last. See
`CUST-WP-0054-T04` and
`the-custodian/docs/forgejo-repo-migration-pilot-glas-harness.md`.
## Placement in the Railiance Tooling Set
This workplan lives in `railiance-infra` because it is the cross-layer
@@ -48,7 +55,7 @@ change is made there.
1. ~~Public/private hostname for Forgejo~~ **DECIDED 2026-07-03:**
`forgejo.coulomb.social` → railiance01 (`92.205.62.239`). DNS active;
Traefik edge live; Forgejo workload not deployed yet (404). Gitea remains
Traefik edge live; Forgejo workload deployed and serving HTTPS. Gitea remains
canonical until migration drills pass. Record:
`the-custodian/docs/forgejo-production-decisions.md`.
2. Mail delivery path for password reset and account recovery
@@ -60,8 +67,9 @@ change is made there.
host runner retired after cutover.
5. Backup destination and retention target for database, repositories,
attachments, LFS, Actions artifacts/logs, and package data.
6. Cutover mode: freeze-and-migrate all repos in one window, or staged
project-by-project transition.
6. Cutover mode: ~~freeze-all vs staged~~ **LEANING staged per-repo (2026-07-04)**
based on `glas-harness` pilot; operator confirmation still needed. Freeze-all
remains fallback for final production wave if drift risk is unacceptable.
## Safety Contract
@@ -80,23 +88,30 @@ change is made there.
repo. No plaintext SMTP passwords, admin tokens, runner tokens, or registry
credentials in Git.
## Probe Strategy
## Probe and pilot strategy (revised 2026-07-04)
A `forgejo-railiance-probe` is reasonable and should be treated as a disposable
S5/S4 integration probe, not as the production install.
Original T03 planned a **disposable isolated-namespace probe** before any
production install. That path was **superseded**: production Forgejo deployed on
railiance01 under the safety contract (Gitea remains canonical; no Gitea deletes).
The probe should prove:
Integration evidence now comes from **in-production probes and repo pilots**:
- Helm values and cnpg database wiring converge cleanly.
- Initial admin bootstrap is automated and repeatable.
- SMTP/password reset works end-to-end.
- Package registry endpoints work for the package types Railiance needs first.
- Forgejo Actions can run a minimal workflow and publish a test package.
- Backup and restore works in an isolated namespace.
- Migration from a sample Gitea repo preserves git history, issues, releases,
wiki, LFS or attachments where applicable.
| Tier | Repo | Purpose | Status |
| --- | --- | --- | --- |
| 0 | `coulomb/forgejo-actions-probe` | Runner scheduling, DinD, OCI image-build | **done** |
| 1 | `coulomb/glas-harness` | Non-production git+SSH+CI routing drill | **done** |
| 2 | TBD (small lib with image, e.g. `key-cape`) | Image-build workflow + registry pull on railiance01 | **next** |
| 3 | Production set (`state-hub`, `issue-core`, …) | Canonical remotes, sweep paths, deploy loops | **gated** |
The probe is destroyed or explicitly archived after production Forgejo is live.
Each tier must pass before the next. T03 (isolated probe namespace) is cancelled;
acceptance criteria below are tracked across T05, T07, T08, and T10 instead.
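
The "each tier must pass before the next" rule can be sketched as a small gate check. This is only an illustration: the function names and the idea of tracking passed tiers as a set are assumptions, not something the workplan tooling is known to implement.

```python
from typing import Optional, Set

LADDER = (0, 1, 2, 3)  # tier numbers from the table above


def may_start(tier: int, passed: Set[int]) -> bool:
    """A tier may start only when every lower tier has passed."""
    return all(lower in passed for lower in LADDER if lower < tier)


def next_open_tier(passed: Set[int]) -> Optional[int]:
    """Return the lowest tier that has not passed yet, or None if all passed."""
    for tier in LADDER:
        if tier not in passed:
            return tier
    return None


if __name__ == "__main__":
    passed = {0, 1}  # tiers marked "done" in the table above
    print(next_open_tier(passed))  # lowest tier still open
    print(may_start(3, passed))    # production stays gated until tier 2 passes
```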
Still to prove before T11:
- SMTP/password reset end-to-end (T06).
- Backup and restore in isolated namespace (T09).
- Issues/releases/wiki/LFS per inventory classification (T10 matrix).
- Operator SSH identity on Forgejo beyond interim `forgejo_admin` keys (T02/T10).
## Target Architecture
@@ -141,6 +156,10 @@ Minimum inventory:
Forgejo before cutover and classifies each migration item as automatic,
manual, unsupported, or explicitly out of scope.
**Gap (2026-07-04):** first-pass inventory predates repos created after
2026-06-04 (e.g. `glas-harness`, `forgejo-actions-probe`). Refresh org repo
list and add a **migration tier** column (0-3) per repo before T11.
---
### T02 — Resolve Forgejo production design decisions
@@ -155,8 +174,10 @@ state_hub_task_id: "f88115bf-4f99-49ef-a415-0b23750141b3"
Decide the production choices listed in "Key Decisions to Confirm".
**Partial (2026-07-03):** hostname and in-cluster runner model decided (`ADR-004`).
Remaining: SMTP, package scope, backup, cutover mode. See
**Partial (2026-07-04):** hostname, exposure, deployment pattern, live deploy,
and in-cluster runner model decided (`ADR-004`). Cutover mode **leaning** staged
per-repo (glas-harness pilot). Remaining operator decisions: SMTP, package scope
beyond OCI, backup target, final cutover confirmation. See
`the-custodian/docs/forgejo-production-decisions.md`.
Expected output:
@@ -174,36 +195,21 @@ choices.
---
### T03 — Build forgejo-railiance-probe
### T03 — Build forgejo-railiance-probe (isolated namespace)
```task
id: RAIL-HO-WP-0005-T03
status: todo
status: cancel
priority: high
state_hub_task_id: "b516018a-415e-4a58-8c62-07c14ece9353"
```
Create a disposable probe environment for Forgejo before touching production.
Expected repo ownership:
- `railiance-platform`: probe cnpg database and storage dependencies.
- `railiance-apps`: probe Forgejo Helm values and namespace.
- `railiance-enablement`: probe Actions runner template and workflows.
Probe acceptance:
- `make forgejo-probe-deploy` or equivalent converges from a clean cluster
state.
- Admin bootstrap is automated.
- A test user can reset a password via email.
- A test repository can be created, cloned, pushed, and protected.
- A test package can be published and pulled.
- A test Forgejo Actions workflow runs successfully.
- A probe backup restores into an isolated namespace.
**Done when:** the probe demonstrates the whole lifecycle without manual
cluster surgery.
**Cancelled 2026-07-04:** superseded by production Forgejo on railiance01 (T05)
plus in-production integration probes (`forgejo-actions-probe`, `glas-harness`).
Isolated-namespace probe added latency without reducing risk given the safety
contract (Gitea canonical, no deletes). Remaining T03 acceptance items map to:
T05 (deploy), T06 (mail), T07 (packages), T08 (Actions), T09 (backup restore),
T10 (repo migration drill).
---
@@ -227,6 +233,11 @@ Minimum scope:
packages, Actions artifacts, and logs.
- Restore runbook for database and blob/package data.
**Partial (2026-07-04):** `forgejo-db` CNPG cluster healthy on railiance01
(`make forgejo-db-status` → Cluster in healthy state). SOPS secret path and
network policies in `railiance-platform`. Remaining: backup/WAL archiving to
approved target, blob/package storage restore drill (feeds T09).
**Done when:** platform dependencies can be deployed and restored without the
Forgejo app running.
@@ -252,9 +263,11 @@ Minimum scope:
- Health/status targets in the Makefile.
- Migration-safe configuration for coexistence with Gitea during the cutover.
**Partial (2026-07-03):** `railiance-apps` deploy live — HTTPS smoke pass, Actions
enabled, `coulomb` org + probe workflow success. Remaining: SOPS secrets,
SMTP, Docker on runner host for image builds, migration drills.
**Partial (2026-07-04):** `railiance-apps` deploy live — HTTPS smoke pass,
ingress + TLS, SSH NodePort `30022`, Actions enabled, `coulomb` org,
`railiance01-build-01` runner (ADR-004). Git push/pull via HTTPS and
`forgejo-remote` SSH proven. Remaining: SOPS hardening for all secrets,
SMTP (T06), operator user accounts beyond `forgejo_admin`.
**Done when:** Forgejo runs on railiance01 against production platform
services and can serve login, git clone/push, package registry, and admin
@@ -312,8 +325,13 @@ Acceptance:
- Retention and cleanup expectations are documented.
- Package data is included in backup and restore drills.
**Done when:** `state-hub` or a probe image can be published to Forgejo and
pulled by railiance01.
**Partial (2026-07-04):** OCI registry live (`/v2/` auth challenge). Probe image
`forgejo.coulomb.social/coulomb/forgejo-actions-probe` built and pushed via
Actions. Remaining: publish and pull a **tier-2 pilot** app image (not yet
`state-hub`); document retention; include packages in backup drill (T09).
**Done when:** a tier-2 pilot image (or `state-hub` after explicit approval) can
be published to Forgejo and pulled by railiance01 k3s.
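
For the `/v2/` auth-challenge check mentioned above, a minimal sketch using only the Python standard library might look as follows. It assumes that an unauthenticated request to a protected OCI registry is answered with HTTP 401 and a `WWW-Authenticate` header; the function names are hypothetical and the base URL is passed in by the caller.

```python
import urllib.error
import urllib.request
from typing import Mapping


def looks_like_auth_challenge(status: int, headers: Mapping[str, str]) -> bool:
    """True when a response is a 401 that carries a WWW-Authenticate header."""
    lowered = {key.lower() for key in headers}
    return status == 401 and "www-authenticate" in lowered


def probe_registry(base_url: str, timeout: float = 10.0) -> bool:
    """Request <base_url>/v2/ without credentials and classify the answer."""
    url = base_url.rstrip("/") + "/v2/"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # A 2xx answer means no challenge was issued (anonymous access).
            return looks_like_auth_challenge(resp.status, dict(resp.headers))
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; the 401 case lands here.
        return looks_like_auth_challenge(err.code, dict(err.headers))
```

Calling `probe_registry("https://forgejo.coulomb.social")` would perform the live check; the classification helper is kept separate so it can be exercised without network access.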
---
@@ -321,7 +339,7 @@ pulled by railiance01.
```task
id: RAIL-HO-WP-0005-T08
status: todo
status: progress
priority: high
state_hub_task_id: "f45f98c9-2f02-4224-bbfd-c2e1ec38581e"
```
@@ -337,8 +355,16 @@ Minimum scope:
- Secret handling policy for Actions.
- Resource limits to avoid repeating previous single-node overload patterns.
**Done when:** a representative repository can run Forgejo Actions and publish
a test artifact without privileged cluster-wide credentials.
**Partial (2026-07-04):** in-cluster runner live (`railiance-apps/manifests/
forgejo-runner.yaml`, ADR-004). Proven workflows: `forgejo-actions-probe`
(image-build), `glas-harness` (host+container CI smoke). Org secrets
`REGISTRY_USER`/`REGISTRY_TOKEN` set. Documented constraints: host runner is
non-root (static docker-cli, no `apk add`); `actions/checkout@v4` fails — use
`git clone` in job. Remaining: reusable workflow templates in
`railiance-enablement` (S4); resource limits review; no cluster-admin on runner.
**Done when:** tier-2 pilot repo runs Forgejo Actions end-to-end and publishes
a pullable image without privileged cluster-wide credentials.
---
@@ -376,29 +402,38 @@ with repository, package, and user recovery checks passing.
---
### T10 — Drill Gitea to Forgejo migration
### T10 — Drill Gitea to Forgejo migration (staged ladder)
```task
id: RAIL-HO-WP-0005-T10
status: todo
status: progress
priority: high
state_hub_task_id: "6befde73-00bc-4643-be0b-a7ce7944e75f"
```
Run a non-production migration drill from Gitea to Forgejo.
Run staged migration drills from Gitea to Forgejo before production repos move.
Minimum checks:
**Tier 1 complete (2026-07-04):** `glas-harness` — git history preserved,
`origin` on Forgejo, `gitea` legacy remote retained, SSH+HTTPS push, CI smoke
green. Result matrix:
`the-custodian/docs/forgejo-repo-migration-pilot-glas-harness.md`.
Minimum checks (per tier):
- Git history and default branches preserved.
- Issues, labels, milestones, releases, wiki, and attachments handled per
inventory classification.
- SSH/HTTPS clone and push paths work.
- Existing local remotes can be transformed predictably.
- State Hub registered repo remotes can be updated safely.
- Rollback plan is rehearsed.
inventory classification (N/A for tier-1 git-only repos).
- SSH/HTTPS clone and push paths work (`forgejo-remote` in `~/.ssh/config`).
- Existing local remotes can be transformed predictably (`origin`/`gitea` split).
- State Hub registered repo remotes can be updated safely (deferred for tier-1).
- Rollback plan is rehearsed (Gitea copy unchanged).
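
The `origin`/`gitea` remote split and its rollback can be sketched as plain command lists. This is an illustrative assumption about how the transformation could be scripted: the function names are hypothetical, and the example URL form (`forgejo-remote:coulomb/glas-harness.git`, relying on the `forgejo-remote` SSH host alias) is a guess at the layout rather than a documented value.

```python
from typing import List


def remote_split_commands(forgejo_url: str) -> List[List[str]]:
    """Commands that keep Gitea as `gitea` and make Forgejo the new `origin`."""
    return [
        ["git", "remote", "rename", "origin", "gitea"],   # retain legacy remote
        ["git", "remote", "add", "origin", forgejo_url],  # Forgejo becomes origin
        ["git", "fetch", "origin"],                       # confirm it is reachable
    ]


def remote_rollback_commands() -> List[List[str]]:
    """Commands that undo the split and point `origin` back at Gitea."""
    return [
        ["git", "remote", "remove", "origin"],
        ["git", "remote", "rename", "gitea", "origin"],
    ]


if __name__ == "__main__":
    # Print the plan instead of executing it, so it can be reviewed first.
    for argv in remote_split_commands("forgejo-remote:coulomb/glas-harness.git"):
        print(" ".join(argv))
```

Returning argument lists instead of running them keeps the transformation reviewable; each list can be handed to `subprocess.run(argv, check=True, cwd=repo_path)` once the plan looks right.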
**Done when:** a sample migration has a written result matrix and no unknown
critical migration gaps remain.
**Next:** tier-2 repo with container image + `.gitea/workflows` port to
`.forgejo/workflows`. **Not ready:** `state-hub` until hub-core build context
template and sweep `remote_url` playbook exist.
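
A possible shape for the `.gitea/workflows` to `.forgejo/workflows` port is sketched below. It copies workflow files and reports any that reference `actions/checkout`, since that action is recorded as failing on the current runner. The function name and the copy-then-flag approach are assumptions for illustration; the workflow contents themselves still need manual review.

```python
import shutil
from pathlib import Path
from typing import List


def port_workflows(repo_root: Path) -> List[str]:
    """Copy .gitea/workflows/*.y(a)ml into .forgejo/workflows.

    Returns the names of workflow files that mention `actions/checkout`
    and therefore need a manual `git clone` replacement step.
    """
    src = repo_root / ".gitea" / "workflows"
    dst = repo_root / ".forgejo" / "workflows"
    flagged: List[str] = []
    if not src.is_dir():
        return flagged  # nothing to port
    dst.mkdir(parents=True, exist_ok=True)
    for workflow in sorted(src.iterdir()):
        if workflow.suffix not in {".yml", ".yaml"}:
            continue
        shutil.copy2(workflow, dst / workflow.name)  # source files stay in place
        if "actions/checkout" in workflow.read_text(encoding="utf-8"):
            flagged.append(workflow.name)
    return flagged
```

The originals under `.gitea/workflows` are left untouched, so the Gitea copy of the repo keeps working while the pilot is evaluated.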
**Done when:** tiers 0-2 pass with written result matrices and no unknown
critical migration gaps remain for production repos.
---
@@ -412,19 +447,21 @@ needs_human: true
state_hub_task_id: "b1b66687-ca33-4971-b312-743c8e059c5e"
```
Execute the production migration only after the probe, backup restore, package
registry, email recovery, and Actions gates pass.
Execute production migration only after T06, T07, T08, T09, and T10 tier 0-2
gates pass. `state-hub` and other Wave-1 production repos require explicit
operator approval per `CUST-WP-0054` drain sequence.
Cutover sequence:
**Preferred cutover (staged per-repo):**
1. Announce freeze window.
2. Take final Gitea backup and verify it exists.
3. Freeze Gitea writes.
4. Migrate repositories and metadata to Forgejo.
5. Validate critical repositories and package pulls.
6. Update State Hub repo remotes and host paths as needed.
7. Update local and railiance01 remotes.
8. Keep Gitea read-only as rollback until the stabilization window passes.
1. Per repo: Gitea backup snapshot (or org-wide before each wave).
2. Mirror git to Forgejo; switch workstation `origin` to `forgejo-remote`.
3. Port/verify Actions workflows on Forgejo runner.
4. Update State Hub `remote_url` and railiance01 sweep checkouts when promoted.
5. Mark Gitea repo read-only (org policy); do not delete.
6. Repeat until production set complete.
**Freeze-all fallback:** single window if staged drift is unacceptable — same
steps but all repos in one maintenance period.
**Done when:** all Railiance/Custodian repos use Forgejo as primary, Gitea is
read-only fallback, and rollback instructions are documented.
@@ -458,19 +495,28 @@ legacy Gitea either archived or intentionally retained as documented fallback.
## Phasing and Dependencies
```
T01 inventory ──► T02 decisions ─┬─► T03 probe ─┬─► T04 platform
│ ├─► T05 app
├─► T06 mail recovery
│ ├─► T07 packages
│ ├─► T08 actions
└─► T09 backups
└────────────────────────────────────► T10 migration drill
T01 inventory ──► T02 decisions ──┬──► T04 platform (forgejo-db ✓ partial)
├──► T05 app (live ✓ partial)
─► T06 mail recovery
├──► T07 packages (OCI probe ✓ partial)
├──► T08 actions (runner ✓ partial)
─► T09 backups
T03-T10 all pass ─► T11 production cutover ─► T12 legacy Gitea retirement
T05+T08 ──► T10 migration ladder ──► T11 production cutover ─► T12 Gitea retire
tier0 probe ✓
tier1 glas-harness ✓
tier2 image repo (next)
tier3 production (gated)
T03 isolated probe: CANCELLED (superseded by T05 + in-production pilots)
```
Recommended first slice: T01, T02, T03. Do not start T11 until T06, T07, T08,
T09, and T10 are complete.
**Current focus (2026-07-04):** T10 tier-2 image pilot; parallel T09 backup
drill and T02 open decisions (SMTP, backup target). Do not start T11
`state-hub` until T09 complete and `CUST-WP-0054` Wave-1 gates satisfied.
**Absorbed by `CUST-WP-0054-T04`:** forge + CI on railiance01; workstation
build retirement; staged repo promotion before State Hub primary move (T05).
## railiance-bootstrap Note
@@ -490,7 +536,14 @@ purpose is identified.
- `RAIL-HO-WP-0004-production-readiness.md`
- `RAIL-HO-WP-0003-5repo-stack-restructure.md`
- `CUST-WP-0054-workstation-independence-and-fleet-realignment.md` (T04 forge+CI)
- `CUST-WP-0014-repo-sync-automation.md`
- `CUST-WP-0021-multi-host-repo-paths.md`
- `docs/adr/ADR-004-forgejo-in-cluster-actions-runner.md`
- `docs/forgejo-migration-inventory.md`
- `the-custodian/docs/forgejo-production-decisions.md`
- `the-custodian/docs/forgejo-repo-migration-pilot-glas-harness.md`
- `railiance-apps/docs/forgejo-on-railiance01.md`
- `railiance-forge/docs/forgejo-actions-runner-substrate.md`
- `ops/incidents/2026-03-25-gitea-pgpool-crashloop.md`
- `ops/incidents/2026-03-26-coulombcore-runaway-agent-overload.md`