diff --git a/Makefile b/Makefile
index 93c1d79..c0a7975 100644
--- a/Makefile
+++ b/Makefile
@@ -10,6 +10,7 @@ GITEA_INGRESS ?= manifests/gitea-ingress.yaml
 GITEA_DB_CLUSTER ?= gitea-db
 GITEA_DB_NAMESPACE ?= databases
 REGISTRY_DOCS ?= docs/gitea-container-registry.md docs/gitea-package-registry.md
+EVIDENCE_DOCS ?= docs/observability-operating-evidence.md docs/ci-runner-actions-gitops-ownership.md docs/backup-restore-secret-handoff.md
 SOPS_SENTINEL ?= $(GITEA_VALUES)
 
 ##@ Operator checks
@@ -40,6 +41,12 @@ registry-docs: ## Print canonical registry docs
 		sed -n '1,220p' "$$doc"; \
 	done
 
+evidence-docs: ## Print forge evidence and handoff contracts
+	@for doc in $(EVIDENCE_DOCS); do \
+		printf '\n## %s\n\n' "$$doc"; \
+		sed -n '1,260p' "$$doc"; \
+	done
+
 ##@ Current Gitea
 
 gitea-deploy: ## Deploy / upgrade current Gitea forge runtime
@@ -70,4 +77,4 @@ help: ## Show this help
 	/^[a-zA-Z0-9_-]+:.*?##/ { printf " \033[36m%-20s\033[0m %s\n", $$1, $$2 } \
 	/^##@/ { printf "\n\033[1m%s\033[0m\n", substr($$0, 5) }' $(MAKEFILE_LIST)
 
-.PHONY: check-tools check-sops registry-docs gitea-deploy gitea-ingress-deploy gitea-status help
+.PHONY: check-tools check-sops registry-docs evidence-docs gitea-deploy gitea-ingress-deploy gitea-status help
diff --git a/README.md b/README.md
index 960e51c..7d9ebb1 100644
--- a/README.md
+++ b/README.md
@@ -24,6 +24,7 @@ Key contracts:
 - `docs/initial-operating-contracts.md`
 - `docs/ci-runner-actions-gitops-ownership.md`
 - `docs/backup-restore-secret-handoff.md`
+- `docs/observability-operating-evidence.md`
 - `docs/gitea-container-registry.md`
 - `docs/gitea-package-registry.md`
 
@@ -31,6 +32,7 @@ Useful entry points:
 
 ```bash
 make registry-docs
+make evidence-docs
 make check-tools
 make gitea-status
 make gitea-deploy
diff --git a/SCOPE.md b/SCOPE.md
index bfff64c..6897fc5 100644
--- a/SCOPE.md
+++ b/SCOPE.md
@@ -36,6 +36,8 @@ The runner, Actions, and GitOps ownership contract lives in
 `docs/ci-runner-actions-gitops-ownership.md`.
 The backup, restore, and secret custody handoff contract lives in
 `docs/backup-restore-secret-handoff.md`.
+The observability and operating evidence contract lives in
+`docs/observability-operating-evidence.md`.
 
 ---
 
@@ -182,7 +184,9 @@ Known starting point:
    `docs/ci-runner-actions-gitops-ownership.md`.
 7. For backup, restore, and secret custody handoffs, read
    `docs/backup-restore-secret-handoff.md`.
-8. For migration context, read
+8. For observability and release-readiness evidence, read
+   `docs/observability-operating-evidence.md`.
+9. For migration context, read
    `/home/worsch/railiance-apps/workplans/RAILIANCE-WP-0006-railiance-forge-extraction.md`.
 
 ---
diff --git a/docs/backup-restore-secret-handoff.md b/docs/backup-restore-secret-handoff.md
index 2573da8..2ea7e02 100644
--- a/docs/backup-restore-secret-handoff.md
+++ b/docs/backup-restore-secret-handoff.md
@@ -202,8 +202,8 @@ follow-up against the owning layer before relying on the artifact.
 
 ## Follow-Ups
 
-- WP-0006-T08 should turn backup, restore, storage growth, and runner status
-  evidence into inspectable operating signals.
+- `docs/observability-operating-evidence.md` defines the inspectable storage
+  growth, restore-evidence, and runner-status signals for this contract.
 - WP-0006-T09 should model forge backup/restore and secret-delivery edges in
   Railiance Fabric.
 - `RAILIANCE-WP-0005-T04` should use this contract when documenting S5 app data
diff --git a/docs/ci-runner-actions-gitops-ownership.md b/docs/ci-runner-actions-gitops-ownership.md
index f32c0b3..b25f229 100644
--- a/docs/ci-runner-actions-gitops-ownership.md
+++ b/docs/ci-runner-actions-gitops-ownership.md
@@ -163,8 +163,8 @@ should provide:
 
 ## Open Follow-Ups
 
-- WP-0006-T08 should turn runner health and artifact evidence into explicit
-  observability requirements.
+- `docs/observability-operating-evidence.md` defines the runner health and
+  artifact evidence signals that consumers may cite.
 - WP-0006-T09 should declare runner substrate, label contracts, and evidence
   edges in Railiance Fabric.
 - `RAILIANCE-WP-0005-T05` should document app-side dry-run behavior once forge
diff --git a/docs/initial-operating-contracts.md b/docs/initial-operating-contracts.md
index 8202ccd..e819699 100644
--- a/docs/initial-operating-contracts.md
+++ b/docs/initial-operating-contracts.md
@@ -93,3 +93,5 @@ leaving live deploy and secret custody changes behind separate review gates.
   than repeating registry implementation details.
 - Future monitoring should turn the manual status checks into durable signals
   once the Railiance observability layer is ready.
+- The detailed observability and operating evidence contract lives in
+  `docs/observability-operating-evidence.md`.
diff --git a/docs/observability-operating-evidence.md b/docs/observability-operating-evidence.md
new file mode 100644
index 0000000..0b59852
--- /dev/null
+++ b/docs/observability-operating-evidence.md
@@ -0,0 +1,206 @@
+# Forge Observability And Operating Evidence
+
+Last reviewed: 2026-06-05
+
+Status: contract v1. This document defines checks, evidence, and future
+monitoring expectations. It does not authorize a live monitoring deployment,
+alert route, dashboard rollout, credential change, or forge cutover.
+
+## Purpose
+
+Forge availability affects source hosting, artifact publication, package
+installation, and downstream app releases. Operators should be able to inspect
+the current forge and produce release-readiness evidence without reconstructing
+the system from historical workplans.
+
+This contract defines:
+
+- endpoint health checks for current Gitea and future Forgejo;
+- log, dashboard, storage, and runner evidence expectations;
+- manual thresholds until centralized monitoring exists;
+- what S5 application releases can cite as forge readiness evidence;
+- where future centralized observability should live.
+
+## Signal Ownership
+
+| Signal | Owner | Consumer |
+| --- | --- | --- |
+| Web and API endpoint health | `railiance-forge` defines checks; `railiance-cluster` provides ingress/DNS/TLS primitives | Operators, source repos, S5 release checks |
+| Git SSH reachability | `railiance-forge` defines checks; `railiance-cluster` provides published Service/ingress path when present | Source repos, automation |
+| Container registry health | `railiance-forge` defines `/v2/` checks and package evidence | S5 app image consumers |
+| Python package registry health | `railiance-forge` defines PyPI endpoint checks and package evidence | Source repos and app build pipelines |
+| Actions/runner health | `railiance-forge` owns runner substrate signals | S4 templates and source/app workflows |
+| Gitea database status | `railiance-forge` checks consumer health; `railiance-platform` owns CNPG backup/restore mechanisms | Forge operators |
+| Package/blob storage growth | `railiance-forge` tracks growth and thresholds; lower layers own durable storage/backup mechanisms | Forge operators and S5 release gates |
+| Logs and audit trails | `railiance-forge` defines useful slices; platform/future observability owns durable aggregation | Operators and incident review |
+| Release-readiness evidence | `railiance-forge` defines the evidence bundle | `railiance-apps` and source repos cite it |
+
+## Read-Only Health Checks
+
+Run `make gitea-status` first. It checks the Gitea pod, Service, Ingress, and
+CNPG-backed `gitea-db` status when the operator has a kubeconfig pointed at the
+Railiance cluster.
+
+Additional checks should stay read-only:
+
+```bash
+# Web/API health: expect HTTP 200/3xx for the web route, not 5xx.
+curl -fsSI https://gitea.coulomb.social/
+curl -fsS https://gitea.coulomb.social/api/v1/version
+
+# Container registry health: expect an OCI auth challenge, normally HTTP 401,
+# with Docker-Distribution-Api-Version: registry/2.0.
+curl -i https://gitea.coulomb.social/v2/
+
+# Python package registry health: expect reachable endpoint behavior. Depending
+# on package visibility this may be 200, 401, or 404; 5xx is not acceptable.
+curl -i https://gitea.coulomb.social/api/packages/coulomb/pypi/simple/
+```
+
+Git SSH:
+
+- If a Git SSH endpoint is published, verify it with a read-only `git ls-remote`
+  against a known non-secret repository or with an SSH banner check.
+- If no SSH endpoint is intentionally exposed, record `not exposed` rather than
+  silently skipping the signal.
+- Do not paste private keys, tokens, or signed SSH command output containing
+  secret material into evidence.
+
+Actions and runners:
+
+- Record runner inventory by semantic label, trust level, and last successful
+  sample job.
+- For privileged labels such as `package-publish`, `registry-publish`,
+  `cluster-dry-run`, or `s5-release-check`, record a recent non-production
+  sample job or release job reference.
+- If no runner currently provides a required label, mark the dependent workflow
+  as blocked on runner prerequisites instead of weakening the workflow.
+
+## Storage Growth Checks
+
+Current package blobs live under `/data/packages` on the
+`default/gitea-shared-storage` PVC. The known baseline was about 798.5 MiB on
+2026-05-19 against a 10 GiB `local-path` PVC.
+
+Read-only inspection:
+
+```bash
+kubectl get pvc gitea-shared-storage -n default
+
+pod="$(kubectl get pod -n default \
+  -l app.kubernetes.io/instance=gitea \
+  -o jsonpath='{.items[0].metadata.name}')"
+
+kubectl exec -n default "$pod" -- du -sh /data/packages
+kubectl exec -n default "$pod" -- find /data/packages -maxdepth 2 -type d | wc -l
+```
+
+Manual thresholds until centralized metrics exist:
+
+- Warning: package/blob usage reaches 70% of the PVC or grows by more than 2 GiB
+  since the previous recorded check.
+- Action required: usage reaches 85%, package restore has not been drilled for
+  production-critical artifacts, or smoke-test tags are accumulating without a
+  cleanup owner.
+- Block production reliance: usage reaches 90%, package/blob restore evidence is
+  missing for a production-critical artifact, or registry pulls/install checks
+  fail with 5xx/server errors.
+
+Growth evidence should record date, operator, PVC size, package directory size,
+largest known package family if inspected, and whether cleanup or backup
+follow-up is required.
+
+## Logs And Dashboards
+
+Minimum manual log checks:
+
+```bash
+kubectl logs -n default -l app.kubernetes.io/instance=gitea --tail=200
+kubectl describe pod -n default -l app.kubernetes.io/instance=gitea
+kubectl get events -n default --sort-by=.lastTimestamp
+```
+
+What to look for:
+
+- repeated 5xx errors;
+- failed package uploads or downloads;
+- registry authentication loops;
+- database connection errors;
+- PVC mount, quota, or disk-pressure warnings;
+- runner registration failures or stuck jobs;
+- TLS/cert renewal failures at the ingress boundary.
+
+Dashboard expectations:
+
+- Current state: manual checks and `make gitea-status` are the authoritative
+  operator path.
+- Next state: forge should publish signal definitions that a future dashboard
+  can render without changing ownership boundaries.
+- A useful dashboard should show web/API, SSH, registry, PyPI, runner, database,
+  storage, and recent publish evidence in one view.
+- Dashboard absence is not a reason to skip evidence; keep recording manual
+  evidence until a durable view exists.
+
+## Release-Readiness Evidence
+
+Before an S5 app release cites a forge artifact as ready, forge evidence should
+include:
+
+- date and operator or automation id;
+- forge repo commit containing the active operating contract;
+- Gitea/Forgejo version or endpoint version response;
+- web/API check result;
+- Git SSH result or explicit `not exposed`;
+- container registry `/v2/` challenge result;
+- Python package endpoint result;
+- runner label and sample job result when automation produced the artifact;
+- source repo, commit SHA, package/image identity, and version/tag/digest;
+- package/blob storage usage check if the artifact is production-critical;
+- backup/restore evidence reference if production reliance depends on the
+  artifact being recoverable;
+- log review result for the relevant window;
+- known risks or missing signals.
+
+S5 may cite this evidence from an app runbook or workplan. S5 should not repeat
+forge-internal backup procedures, package tokens, runner tokens, or registry
+write credentials.
+
+## Alert And Intervention Rules
+
+Until centralized alerting exists, record a State Hub note or human intervention
+when any of these occur:
+
+- Gitea web/API endpoint returns 5xx or is unreachable.
+- `/v2/` no longer returns an OCI registry response.
+- PyPI endpoint returns 5xx or package install checks fail for a published
+  release package.
+- `make gitea-status` shows unavailable pods, missing Service/Ingress, or an
+  unhealthy `gitea-db`.
+- Package/blob usage crosses the warning or action thresholds above.
+- Runner labels required by S4 templates or S5 checks disappear.
+- A privileged runner label runs without a recorded trust/credential purpose.
+- Logs show repeated database, storage, registry, or runner failures.
+- Restore evidence is missing for a production-critical package/image/source
+  dependency.
+
+Use `needs_human=true` on the relevant State Hub task when the intervention
+requires secret custody, credential minting, production restore decisions, or
+live infrastructure changes.
+
+## Future Centralized Observability
+
+The stable split should be:
+
+- `railiance-forge` owns forge signal definitions, evidence requirements, and
+  runbook interpretation.
+- `railiance-platform` should own shared metrics/log storage, durable retention,
+  and platform service dashboards if observability remains an S3 capability.
+- `railiance-enablement` may own reusable dashboard templates, workflow evidence
+  templates, and developer-facing self-service views.
+- A future dedicated observability repo may own cross-domain dashboards, alert
+  routing, and log pipelines if Railiance chooses to separate that scope.
+
+Moving collection, dashboards, or alert routing out of this repo must not move
+the meaning of forge signals. Forge remains the source of truth for what counts
+as healthy source hosting, registry service, package service, runner substrate,
+and release artifact evidence.
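
The numeric part of the manual storage thresholds in `docs/observability-operating-evidence.md` (70% or more than 2 GiB growth for a warning, 85% for action required, 90% for blocking production reliance) could be evaluated with a small helper along the following lines. This is only a sketch and is not part of the patch: the function name `classify_storage`, the integer MiB inputs, and the result labels are assumptions made for illustration. The non-numeric conditions in the contract (restore drills, smoke-test tag cleanup, 5xx registry errors) still need a human check.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not defined anywhere in the forge repo.
# Usage: classify_storage USED_MIB CAPACITY_MIB [PREVIOUS_USED_MIB]
# All arguments are whole MiB values; 2 GiB of growth is taken as 2048 MiB.
classify_storage() {
  local used_mib="$1"
  local capacity_mib="$2"
  # Without a previous reading, growth is treated as zero.
  local previous_mib="${3:-$1}"

  # Integer percentage, truncated, so 89.9% still counts as below 90%.
  local pct=$(( used_mib * 100 / capacity_mib ))
  local growth_mib=$(( used_mib - previous_mib ))

  if (( pct >= 90 )); then
    echo "block"
  elif (( pct >= 85 )); then
    echo "action-required"
  elif (( pct >= 70 || growth_mib > 2048 )); then
    echo "warning"
  else
    echo "ok"
  fi
}
```

With the recorded baseline rounded to whole MiB, `classify_storage 799 10240` prints `ok`. The `du` and `kubectl get pvc` output from the read-only inspection would need to be converted to MiB by the operator before being passed in.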