diff --git a/workplans/RAIL-HO-WP-0005-forgejo-production-migration.md b/workplans/RAIL-HO-WP-0005-forgejo-production-migration.md
new file mode 100644
index 0000000..15f2d12
--- /dev/null
+++ b/workplans/RAIL-HO-WP-0005-forgejo-production-migration.md
@@ -0,0 +1,483 @@
+---
+id: RAIL-HO-WP-0005
+type: workplan
+title: "Forgejo Production Migration on railiance01"
+domain: railiance
+repo: railiance-infra
+status: active
+owner: railiance
+topic_slug: railiance
+created: "2026-05-03"
+updated: "2026-05-03"
+state_hub_workstream_id: "84e17675-0d15-4268-a8bd-540124d37018"
+---
+
+# Forgejo Production Migration on railiance01
+
+## Goal
+
+Establish Forgejo as the production-grade source forge and package base for
+Railiance, then migrate all repositories and workflows currently relying on
+Gitea to the new Forgejo installation.
+
+Forgejo will become the heart of Railiance infrastructure. The work must be
+fully automated, backup-backed, recovery-drilled, and suitable for long-lived
+operation on railiance01 before any production cutover happens.
+
+## Placement in the Railiance Tooling Set
+
+This workplan lives in `railiance-infra` because it is the cross-layer
+production infrastructure coordination plan and belongs next to
+`RAIL-HO-WP-0004-production-readiness.md`.
+
+Implementation must respect the OAS repo boundaries:
+
+| Concern | Repo | Layer |
+|---------|------|-------|
+| Server prerequisites, inventory, OS packages, SSH/system users | `railiance-infra` | S1 |
+| k3s runtime prerequisites, namespaces, ingress class, cluster backup hooks | `railiance-cluster` | S2 |
+| PostgreSQL, object storage, backup targets, registry storage dependencies | `railiance-platform` | S3 |
+| Forgejo Actions runner templates, CI conventions, migration automation | `railiance-enablement` | S4 |
+| Forgejo Helm release, app config, mail config, package registry, app backups | `railiance-apps` | S5 |
+
+This file is the umbrella plan.
+If an implementation step requires files in a
+different repo, that repo should receive its own workplan or task before the
+change is made there.
+
+## Key Decisions to Confirm
+
+1. Public/private hostname for Forgejo and whether Gitea remains reachable
+   during the transition.
+2. Mail delivery path for password reset and account recovery
+   (SMTP relay, sender domain, SPF/DKIM/DMARC expectations).
+3. Package registry scope: container images only at first, or also generic,
+   npm, PyPI, Go, Maven, and Helm packages.
+4. Actions runner model: in-cluster ephemeral runners, long-lived runner pod,
+   or isolated host runner.
+5. Backup destination and retention target for database, repositories,
+   attachments, LFS, Actions artifacts/logs, and package data.
+6. Cutover mode: freeze-and-migrate all repos in one window, or staged
+   project-by-project transition.
+
+## Safety Contract
+
+- Gitea remains the production source of truth until Forgejo restore and
+  migration drills pass.
+- No repository is deleted from Gitea during this workplan.
+- A fresh Gitea backup must be taken before every migration drill and before
+  final cutover.
+- Forgejo backups must be restored into an isolated namespace before accepting
+  production use.
+- Password reset and email recovery must be verified with a real controlled
+  account before onboarding users.
+- Forgejo Actions may not receive broad cluster credentials by default; runner
+  permissions must be least-privilege and repo-scoped where practical.
+- Secrets stay in SOPS/age or Kubernetes Secrets managed by the appropriate
+  repo. No plaintext SMTP passwords, admin tokens, runner tokens, or registry
+  credentials in Git.
+
+## Probe Strategy
+
+A `forgejo-railiance-probe` is reasonable and should be treated as a disposable
+S5/S4 integration probe, not as the production install.
+
+The probe should prove:
+
+- Helm values and cnpg database wiring converge cleanly.
+- Initial admin bootstrap is automated and repeatable.
+- SMTP/password reset works end-to-end.
+- Package registry endpoints work for the package types Railiance needs first.
+- Forgejo Actions can run a minimal workflow and publish a test package.
+- Backup and restore work in an isolated namespace.
+- Migration from a sample Gitea repo preserves git history, issues, releases,
+  wiki, and LFS or attachments where applicable.
+
+The probe is destroyed or explicitly archived after production Forgejo is live.
+
+## Target Architecture
+
+```
+operator / agents / developers
+  -> private HTTPS endpoint
+  -> railiance01 ingress
+  -> forgejo Service in forgejo namespace
+  -> Forgejo Deployment/StatefulSet
+  -> forgejo-db CloudNativePG Cluster in databases namespace
+  -> Valkey/cache if required
+  -> persistent storage for repositories, attachments, LFS, packages
+  -> Actions runner(s) with restricted execution scope
+  -> backup jobs to the approved backup target
+```
+
+## Tasks
+
+### T01 — Inventory current Gitea functionality and migration requirements
+
+```task
+id: RAIL-HO-WP-0005-T01
+status: todo
+priority: high
+state_hub_task_id: "cf59d171-5629-45c9-9d44-8d6499827ffc"
+```
+
+Create a source-of-truth inventory of current Gitea usage.
+
+Minimum inventory:
+
+- All repositories in the `coulomb` organization.
+- Registered vs unregistered State Hub repos.
+- Users, organizations, teams, deploy keys, SSH keys, access tokens.
+- Issues, labels, milestones, releases, wiki, packages, LFS, attachments.
+- Existing webhook usage and automation assumptions.
+- Current Gitea package registry status and the missing `[packages]` config
+  that is blocking container image publication.
+
+**Done when:** the inventory identifies every feature that must work in
+Forgejo before cutover and classifies each migration item as automatic,
+manual, unsupported, or explicitly out of scope.
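
The T01 "Done when" condition can be made checkable with a small data model. The following is a minimal sketch, not part of the workplan tooling: the `Migration` enum mirrors the four classifications named above, while `InventoryItem`, `unclassified`, `cutover_risks`, and `inventory_done` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Migration(Enum):
    """The four classifications named in the T01 done-when condition."""
    AUTOMATIC = "automatic"
    MANUAL = "manual"
    UNSUPPORTED = "unsupported"
    OUT_OF_SCOPE = "out of scope"


@dataclass
class InventoryItem:
    """One inventoried Gitea feature (hypothetical structure)."""
    name: str                                   # e.g. "issues" or "deploy keys"
    required_before_cutover: bool               # must it work in Forgejo first?
    classification: Optional[Migration] = None  # None means not yet classified


def unclassified(items: List[InventoryItem]) -> List[str]:
    """Names of items that still lack a migration classification."""
    return [i.name for i in items if i.classification is None]


def cutover_risks(items: List[InventoryItem]) -> List[str]:
    """Required items that are classified as unsupported."""
    return [
        i.name
        for i in items
        if i.required_before_cutover and i.classification is Migration.UNSUPPORTED
    ]


def inventory_done(items: List[InventoryItem]) -> bool:
    """T01 is complete only when every item has a classification."""
    return not unclassified(items)
```

Any classification values fed into such a check would come from the real inventory; none are implied here. An item that is both required and unsupported is exactly the kind of gap that T10 is meant to surface before cutover.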
+
+---
+
+### T02 — Resolve Forgejo production design decisions
+
+```task
+id: RAIL-HO-WP-0005-T02
+status: todo
+priority: high
+needs_human: true
+state_hub_task_id: "f88115bf-4f99-49ef-a415-0b23750141b3"
+```
+
+Decide the production choices listed in "Key Decisions to Confirm".
+
+Expected output:
+
+- A short decision record in this workplan or a dedicated ADR.
+- Hostname and exposure model.
+- SMTP provider and sender identity.
+- Package registry scope.
+- Actions runner isolation model.
+- Backup target, retention, encryption, and restore cadence.
+- Cutover strategy and rollback window.
+
+**Done when:** implementation tasks are no longer blocked by open production
+choices.
+
+---
+
+### T03 — Build forgejo-railiance-probe
+
+```task
+id: RAIL-HO-WP-0005-T03
+status: todo
+priority: high
+state_hub_task_id: "b516018a-415e-4a58-8c62-07c14ece9353"
+```
+
+Create a disposable probe environment for Forgejo before touching production.
+
+Expected repo ownership:
+
+- `railiance-platform`: probe cnpg database and storage dependencies.
+- `railiance-apps`: probe Forgejo Helm values and namespace.
+- `railiance-enablement`: probe Actions runner template and workflows.
+
+Probe acceptance:
+
+- `make forgejo-probe-deploy` or equivalent converges from a clean cluster
+  state.
+- Admin bootstrap is automated.
+- A test user can reset a password via email.
+- A test repository can be created, cloned, pushed, and protected.
+- A test package can be published and pulled.
+- A test Forgejo Actions workflow runs successfully.
+- A probe backup restores into an isolated namespace.
+
+**Done when:** the probe demonstrates the whole lifecycle without manual
+cluster surgery.
+
+---
+
+### T04 — Define Forgejo platform services
+
+```task
+id: RAIL-HO-WP-0005-T04
+status: todo
+priority: high
+state_hub_task_id: "28b351fe-bfbe-4a8b-bbfa-1b148e69f8e0"
+```
+
+In `railiance-platform`, define production platform services for Forgejo.
+
+Minimum scope:
+
+- `forgejo-db` CloudNativePG cluster.
+- Database credentials via SOPS-managed Secret or approved secret flow.
+- Backup configuration for database base backups and WAL archiving.
+- Object storage or persistent volume plan for repositories, attachments, LFS,
+  packages, Actions artifacts, and logs.
+- Restore runbook for database and blob/package data.
+
+**Done when:** platform dependencies can be deployed and restored without the
+Forgejo app running.
+
+---
+
+### T05 — Define production Forgejo application deployment
+
+```task
+id: RAIL-HO-WP-0005-T05
+status: todo
+priority: high
+state_hub_task_id: "11540ba4-d31c-4f64-836b-c6de69107aa4"
+```
+
+In `railiance-apps`, create the production Forgejo deployment.
+
+Minimum scope:
+
+- Forgejo Helm release or manifests in the S5 boundary.
+- App configuration for database, SSH, HTTPS, mailer, packages, LFS, and
+  security settings.
+- Initial admin/user bootstrap that is automated but does not commit secrets.
+- Health/status targets in the Makefile.
+- Migration-safe configuration for coexistence with Gitea during the cutover.
+
+**Done when:** Forgejo runs on railiance01 against production platform
+services and can serve login, git clone/push, package registry, and admin
+operations.
+
+---
+
+### T06 — Implement usable email recovery cycle
+
+```task
+id: RAIL-HO-WP-0005-T06
+status: todo
+priority: high
+needs_human: true
+state_hub_task_id: "417faa4d-eab8-4247-9485-4f80e5d5b7ff"
+```
+
+Configure and test mail delivery for account recovery.
+
+Minimum scope:
+
+- SMTP credentials stored through the approved secret path.
+- Sender address and domain alignment documented.
+- Password reset email works for a controlled non-admin account.
+- Account recovery runbook covers lost password, lost MFA, disabled account,
+  and emergency admin access.
+- Mail failure is observable through logs or a health check.
+
+**Done when:** a user can complete password recovery without operator database
+edits, and the operator has a documented emergency path.
+
+---
+
+### T07 — Enable and harden package registry base
+
+```task
+id: RAIL-HO-WP-0005-T07
+status: todo
+priority: high
+state_hub_task_id: "9578f672-e2b8-43a3-8419-5f86f8871326"
+```
+
+Enable Forgejo packages for Railiance's near-term build and deployment needs.
+
+Initial package types:
+
+- Container registry for State Hub and future app images.
+- Generic packages for release artifacts.
+- Additional package types only after the inventory proves they are needed.
+
+Acceptance:
+
+- Authenticated push and pull work from operator workstation and railiance01.
+- Container image pull works from k3s deployments.
+- Retention and cleanup expectations are documented.
+- Package data is included in backup and restore drills.
+
+**Done when:** `state-hub` or a probe image can be published to Forgejo and
+pulled by railiance01.
+
+---
+
+### T08 — Enable Forgejo Actions
+
+```task
+id: RAIL-HO-WP-0005-T08
+status: todo
+priority: high
+state_hub_task_id: "f45f98c9-2f02-4224-bbfd-c2e1ec38581e"
+```
+
+Enable Forgejo Actions with a least-privilege runner model.
+
+Minimum scope:
+
+- Runner registration automated without committing runner tokens.
+- Runner isolation model documented.
+- Minimal workflows for lint/test/build on representative repositories.
+- Workflow to build and publish a probe container image to Forgejo packages.
+- Secret handling policy for Actions.
+- Resource limits to avoid repeating previous single-node overload patterns.
+
+**Done when:** a representative repository can run Forgejo Actions and publish
+a test artifact without privileged cluster-wide credentials.
+
+---
+
+### T09 — Implement Forgejo backup and restore automation
+
+```task
+id: RAIL-HO-WP-0005-T09
+status: todo
+priority: high
+state_hub_task_id: "25892007-36ca-4bd9-8adf-84d505465d7d"
+```
+
+Create backup automation for all Forgejo state.
+
+Must cover:
+
+- PostgreSQL database.
+- Git repositories.
+- Attachments.
+- LFS.
+- Packages.
+- Avatars and app data.
+- Actions logs/artifacts if retained.
+- App configuration required for restore.
+
+Acceptance:
+
+- Scheduled backups run without manual intervention.
+- Backups are encrypted or stored in an approved protected target.
+- Restore into an isolated namespace is drilled and documented.
+- RPO/RTO expectations are recorded.
+
+**Done when:** a fresh backup restores to a working isolated Forgejo instance
+with repository, package, and user recovery checks passing.
+
+---
+
+### T10 — Drill Gitea to Forgejo migration
+
+```task
+id: RAIL-HO-WP-0005-T10
+status: todo
+priority: high
+state_hub_task_id: "6befde73-00bc-4643-be0b-a7ce7944e75f"
+```
+
+Run a non-production migration drill from Gitea to Forgejo.
+
+Minimum checks:
+
+- Git history and default branches preserved.
+- Issues, labels, milestones, releases, wiki, and attachments handled per
+  inventory classification.
+- SSH/HTTPS clone and push paths work.
+- Existing local remotes can be transformed predictably.
+- State Hub registered repo remotes can be updated safely.
+- Rollback plan is rehearsed.
+
+**Done when:** a sample migration has a written result matrix and no unknown
+critical migration gaps remain.
+
+---
+
+### T11 — Production cutover from Gitea to Forgejo
+
+```task
+id: RAIL-HO-WP-0005-T11
+status: todo
+priority: high
+needs_human: true
+state_hub_task_id: "b1b66687-ca33-4971-b312-743c8e059c5e"
+```
+
+Execute the production migration only after the probe, backup restore, package
+registry, email recovery, and Actions gates pass.
+
+Cutover sequence:
+
+1. Announce freeze window.
+2.
+   Take final Gitea backup and verify it exists.
+3. Freeze Gitea writes.
+4. Migrate repositories and metadata to Forgejo.
+5. Validate critical repositories and package pulls.
+6. Update State Hub repo remotes and host paths as needed.
+7. Update local and railiance01 remotes.
+8. Keep Gitea read-only as rollback until the stabilization window passes.
+
+**Done when:** all Railiance/Custodian repos use Forgejo as primary, Gitea is
+a read-only fallback, and rollback instructions are documented.
+
+---
+
+### T12 — Retire or archive legacy Gitea
+
+```task
+id: RAIL-HO-WP-0005-T12
+status: todo
+priority: medium
+needs_human: true
+state_hub_task_id: "a63147b0-31d5-4705-89ea-40c10faf779f"
+```
+
+Retire legacy Gitea only after a stabilization period and explicit approval.
+
+Minimum scope:
+
+- Confirm no active remotes, webhooks, packages, or dashboards depend on Gitea.
+- Preserve final Gitea backup.
+- Update runbooks and dashboards from Gitea to Forgejo.
+- Remove or archive Gitea Helm release according to the rollback decision.
+- Close stale State Hub references to `railiance-bootstrap` if confirmed as
+  an alias rather than a real repo.
+
+**Done when:** Forgejo is the only active source forge and package base, with
+legacy Gitea either archived or intentionally retained as documented fallback.
+
+## Phasing and Dependencies
+
+```
+T01 inventory ─┬─► T02 decisions ─┬─► T03 probe ─┬─► T04 platform
+               │                  │              ├─► T05 app
+               │                  │              ├─► T06 mail recovery
+               │                  │              ├─► T07 packages
+               │                  │              ├─► T08 actions
+               │                  │              └─► T09 backups
+               └────────────────────────────────────► T10 migration drill
+
+T03-T10 all pass ─► T11 production cutover ─► T12 legacy Gitea retirement
+```
+
+Recommended first slice: T01, T02, T03. Do not start T11 until T06, T07, T08,
+T09, and T10 are complete.
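
The phasing rules above can be expressed as a small dependency check. This is a minimal sketch based on one reading of the diagram (T01 feeds T02 and T10, T03 feeds T04 through T09, and T11 waits for T03 through T10). The `DEPS` table and the helper names `ready_tasks` and `blockers` are assumptions made for illustration, not an export from State Hub.

```python
# Dependency edges as read from the phasing diagram (an interpretation,
# not an authoritative source). Keys are task ids, values are prerequisites.
DEPS = {
    "T01": set(),
    "T02": {"T01"},
    "T03": {"T02"},
    "T04": {"T03"},
    "T05": {"T03"},
    "T06": {"T03"},
    "T07": {"T03"},
    "T08": {"T03"},
    "T09": {"T03"},
    "T10": {"T01"},
    "T11": {f"T{n:02d}" for n in range(3, 11)},  # T03 through T10 must pass
    "T12": {"T11"},
}


def ready_tasks(done):
    """Return open tasks whose prerequisites are all in `done`, sorted by id."""
    return sorted(
        task for task, deps in DEPS.items() if task not in done and deps <= done
    )


def blockers(task, done):
    """Return the prerequisites of `task` that are still open, sorted by id."""
    return sorted(DEPS[task] - done)
```

For example, `blockers("T11", done)` lists which gates still stand between the current state and the production cutover, which is the question the freeze-window announcement depends on.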
+
+## railiance-bootstrap Note
+
+State Hub currently registers both `railiance-bootstrap` and
+`railiance-cluster`, but they point to the same local path
+(`/home/worsch/railiance-cluster`) and the same git fingerprint. The
+`railiance-bootstrap` entry has no remote URL. The earlier restructure workplan
+(`RAIL-HO-WP-0003-T03`) says `railiance-bootstrap` was renamed to
+`railiance-cluster`.
+
+Working assumption: `railiance-bootstrap` is a stale logical alias or leftover
+repo goal, not a separate Gitea repository. This workplan should not create a
+new Forgejo repository named `railiance-bootstrap` unless a concrete remaining
+purpose is identified.
+
+## References
+
+- `RAIL-HO-WP-0004-production-readiness.md`
+- `RAIL-HO-WP-0003-5repo-stack-restructure.md`
+- `CUST-WP-0014-repo-sync-automation.md`
+- `CUST-WP-0021-multi-host-repo-paths.md`
+- `ops/incidents/2026-03-25-gitea-pgpool-crashloop.md`
+- `ops/incidents/2026-03-26-coulombcore-runaway-agent-overload.md`