Forge Observability And Operating Evidence
Last reviewed: 2026-06-05
Status: contract v1. This document defines checks, evidence, and future monitoring expectations. It does not authorize a live monitoring deployment, alert route, dashboard rollout, credential change, or forge cutover.
Purpose
Forge availability affects source hosting, artifact publication, package installation, and downstream app releases. Operators should be able to inspect the current forge and produce release-readiness evidence without reconstructing the system from historical workplans.
This contract defines:
- endpoint health checks for current Gitea and future Forgejo;
- log, dashboard, storage, and runner evidence expectations;
- manual thresholds until centralized monitoring exists;
- what S5 application releases can cite as forge readiness evidence;
- where future centralized observability should live.
Signal Ownership
| Signal | Owner | Consumer |
|---|---|---|
| Web and API endpoint health | railiance-forge defines checks; railiance-cluster provides ingress/DNS/TLS primitives | Operators, source repos, S5 release checks |
| Git SSH reachability | railiance-forge defines checks; railiance-cluster provides published Service/ingress path when present | Source repos, automation |
| Container registry health | railiance-forge defines /v2/ checks and package evidence | S5 app image consumers |
| Python package registry health | railiance-forge defines PyPI endpoint checks and package evidence | Source repos and app build pipelines |
| Actions/runner health | railiance-forge owns runner substrate signals | S4 templates and source/app workflows |
| Gitea database status | railiance-forge checks consumer health; railiance-platform owns CNPG backup/restore mechanisms | Forge operators |
| Package/blob storage growth | railiance-forge tracks growth and thresholds; lower layers own durable storage/backup mechanisms | Forge operators and S5 release gates |
| Logs and audit trails | railiance-forge defines useful slices; platform/future observability owns durable aggregation | Operators and incident review |
| Release-readiness evidence | railiance-forge defines the evidence bundle | railiance-apps and source repos cite it |
Read-Only Health Checks
Run make gitea-status first. It checks the Gitea pod, Service, Ingress, and
CNPG-backed gitea-db status when the operator has a kubeconfig pointed at the
Railiance cluster.
Additional checks should stay read-only:
# Web/API health: expect HTTP 200/3xx for the web route, not 5xx.
curl -fsSI https://gitea.coulomb.social/
curl -fsS https://gitea.coulomb.social/api/v1/version
# Container registry health: expect an OCI auth challenge, normally HTTP 401,
# with Docker-Distribution-Api-Version: registry/2.0.
curl -i https://gitea.coulomb.social/v2/
# Python package registry health: expect reachable endpoint behavior. Depending
# on package visibility this may be 200, 401, or 404; 5xx is not acceptable.
curl -i https://gitea.coulomb.social/api/packages/coulomb/pypi/simple/
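The expectations in the comments above can be turned into a small classifier, so that a recorded status code is always interpreted the same way. This is a minimal sketch, not part of the forge repo: the function name `classify_status` and the endpoint kinds `web`, `api`, `registry`, and `pypi` are illustrative assumptions, and the sketch does not inspect the `Docker-Distribution-Api-Version` header.

```shell
# Hypothetical helper: classify an HTTP status code for a forge endpoint.
# $1 = endpoint kind (web, api, registry, pypi; illustrative labels)
# $2 = HTTP status code as printed by curl -w '%{http_code}'
classify_status() {
  case "$2" in
    # 5xx, curl's 000 (no response), or an empty code is never acceptable.
    5??|000|"") echo "fail" ;;
    *)
      case "$1:$2" in
        web:2??|web:3??)            echo "ok" ;;  # web route: 200/3xx
        api:200)                    echo "ok" ;;  # version endpoint answered
        registry:401)               echo "ok" ;;  # normal OCI auth challenge
        pypi:200|pypi:401|pypi:404) echo "ok" ;;  # depends on package visibility
        *)                          echo "review" ;;  # unexpected, needs a human look
      esac ;;
  esac
}

# Example (performs a read-only network request):
#   classify_status registry \
#     "$(curl -s -o /dev/null -w '%{http_code}' https://gitea.coulomb.social/v2/)"
```

A `review` result is deliberately distinct from `fail`: it marks a response the contract does not describe, which an operator should note in evidence rather than treat as healthy.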
Git SSH:
- If a Git SSH endpoint is published, verify it with a read-only `git ls-remote` against a known non-secret repository or with an SSH banner check.
- If no SSH endpoint is intentionally exposed, record `not exposed` rather than silently skipping the signal.
- Do not paste private keys, tokens, or signed SSH command output containing secret material into evidence.
Actions and runners:
- Record runner inventory by semantic label, trust level, and last successful sample job.
- For privileged labels such as `package-publish`, `registry-publish`, `cluster-dry-run`, or `s5-release-check`, record a recent non-production sample job or release job reference.
- If no runner currently provides a required label, mark the dependent workflow as blocked on runner prerequisites instead of weakening the workflow.
Storage Growth Checks
Current package blobs live under /data/packages on the
default/gitea-shared-storage PVC. The known baseline was about 798.5 MiB on
2026-05-19 against a 10 GiB local-path PVC.
Read-only inspection:
kubectl get pvc gitea-shared-storage -n default
pod="$(kubectl get pod -n default \
  -l app.kubernetes.io/instance=gitea \
  -o jsonpath='{.items[0].metadata.name}')"
kubectl exec -n default "$pod" -- du -sh /data/packages
kubectl exec -n default "$pod" -- find /data/packages -maxdepth 2 -type d | wc -l
Manual thresholds until centralized metrics exist:
- Warning: package/blob usage reaches 70% of the PVC or grows by more than 2 GiB since the previous recorded check.
- Action required: usage reaches 85%, package restore has not been drilled for production-critical artifacts, or smoke-test tags are accumulating without a cleanup owner.
- Block production reliance: usage reaches 90%, package/blob restore evidence is missing for a production-critical artifact, or registry pulls/install checks fail with 5xx/server errors.
Growth evidence should record date, operator, PVC size, package directory size, largest known package family if inspected, and whether cleanup or backup follow-up is required.
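The usage-based part of the thresholds above can be sketched as a small function. This is an illustration under stated assumptions: the name `storage_level` and the MiB-based arguments are hypothetical, percentages use integer division (so 69.9% counts as 69%), and the non-numeric conditions (restore drills, smoke-test tag cleanup, failing pulls) still need manual judgment.

```shell
# Hypothetical helper: map package/blob usage to the manual threshold levels.
# $1 = used MiB, $2 = PVC capacity MiB, $3 = used MiB at the previous recorded
# check (optional; defaults to $1, meaning no growth).
# The body runs in a subshell so its variables do not leak into the caller.
storage_level() (
  used="$1"
  capacity="$2"
  previous="${3:-$1}"
  pct=$(( $used * 100 / $capacity ))   # integer percent, rounded down
  growth=$(( $used - $previous ))      # MiB since the previous check
  if [ "$pct" -ge 90 ]; then
    echo "block"
  elif [ "$pct" -ge 85 ]; then
    echo "action"
  elif [ "$pct" -ge 70 ] || [ "$growth" -gt 2048 ]; then
    echo "warning"   # 70% of the PVC, or more than 2 GiB of growth
  else
    echo "ok"
  fi
)
```

For the recorded baseline of about 798.5 MiB on a 10 GiB (10240 MiB) PVC, `storage_level 798 10240` prints `ok`.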
Logs And Dashboards
Minimum manual log checks:
kubectl logs -n default -l app.kubernetes.io/instance=gitea --tail=200
kubectl describe pod -n default -l app.kubernetes.io/instance=gitea
kubectl get events -n default --sort-by=.lastTimestamp
What to look for:
- repeated 5xx errors;
- failed package uploads or downloads;
- registry authentication loops;
- database connection errors;
- PVC mount, quota, or disk-pressure warnings;
- runner registration failures or stuck jobs;
- TLS/cert renewal failures at the ingress boundary.
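A first pass over the log output can be narrowed with a filter before reading it in full. The sketch below is only a starting point: the function name `forge_log_slice` is hypothetical, and the patterns are illustrative guesses at how the failure classes above might appear, not a description of the actual Gitea log format. Adjust them after looking at real log lines.

```shell
# Hypothetical helper: keep log lines that hint at the failure classes above.
# Reads log text on stdin. All patterns are assumptions, not Gitea's log schema.
forge_log_slice() {
  grep -Ei \
    -e ' 5[0-9]{2} ' \
    -e 'connection refused' \
    -e 'database.*(error|timeout)' \
    -e 'no space left' \
    -e 'runner.*(fail|error)' \
    -e 'x509|certificate.*(expired|invalid)'
}

# Example (read-only, requires cluster access):
#   kubectl logs -n default -l app.kubernetes.io/instance=gitea --tail=200 \
#     | forge_log_slice
```

An empty result from this filter does not prove the logs are clean; it only means none of the assumed patterns matched in the sampled window.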
Dashboard expectations:
- Current state: manual checks and `make gitea-status` are the authoritative operator path.
- Next state: forge should publish signal definitions that a future dashboard can render without changing ownership boundaries.
- A useful dashboard should show web/API, SSH, registry, PyPI, runner, database, storage, and recent publish evidence in one view.
- Dashboard absence is not a reason to skip evidence; keep recording manual evidence until a durable view exists.
Release-Readiness Evidence
Before an S5 app release cites a forge artifact as ready, forge evidence should include:
- date and operator or automation id;
- forge repo commit containing the active operating contract;
- Gitea/Forgejo version or endpoint version response;
- web/API check result;
- Git SSH result or explicit `not exposed`;
- container registry `/v2/` challenge result;
- Python package endpoint result;
- runner label and sample job result when automation produced the artifact;
- source repo, commit SHA, package/image identity, and version/tag/digest;
- package/blob storage usage check if the artifact is production-critical;
- backup/restore evidence reference if production reliance depends on the artifact being recoverable;
- log review result for the relevant window;
- known risks or missing signals.
S5 may cite this evidence from an app runbook or workplan. S5 should not repeat forge-internal backup procedures, package tokens, runner tokens, or registry write credentials.
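One way to keep evidence records complete is to print every expected field and mark the ones that were not supplied. The sketch below assumes a flat `key: value` record; the function name `emit_evidence` and all field names are hypothetical and are not a schema defined by the forge repo. Values must never contain tokens, keys, or other secret material.

```shell
# Hypothetical helper: print an evidence record from key=value arguments.
# Fields that were not supplied are printed as MISSING so gaps stay visible.
# The body runs in a subshell so its variables do not leak into the caller.
emit_evidence() (
  fields="date operator contract_commit forge_version web_api git_ssh
          registry_v2 pypi runner artifact storage_check restore_ref
          log_review known_risks"
  for field in $fields; do
    value="MISSING"
    for pair in "$@"; do
      case "$pair" in
        "$field="*) value="${pair#*=}" ;;   # strip the "key=" prefix
      esac
    done
    printf '%s: %s\n' "$field" "$value"
  done
)

# Example:
#   emit_evidence date=2026-06-05 git_ssh="not exposed" registry_v2="401 challenge"
```

Printing `MISSING` rather than omitting a field follows the same principle as recording `not exposed` for Git SSH: an absent signal is stated explicitly instead of being skipped.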
Alert And Intervention Rules
Until centralized alerting exists, record a State Hub note or human intervention when any of these occur:
- Gitea web/API endpoint returns 5xx or is unreachable.
- `/v2/` no longer returns an OCI registry response.
- PyPI endpoint returns 5xx or package install checks fail for a published release package.
- `make gitea-status` shows unavailable pods, missing Service/Ingress, or an unhealthy `gitea-db`.
- Package/blob usage crosses the warning or action thresholds above.
- Runner labels required by S4 templates or S5 checks disappear.
- A privileged runner label runs without a recorded trust/credential purpose.
- Logs show repeated database, storage, registry, or runner failures.
- Restore evidence is missing for a production-critical package/image/source dependency.
Use needs_human=true on the relevant State Hub task when the intervention
requires secret custody, credential minting, production restore decisions, or
live infrastructure changes.
Future Centralized Observability
The stable split should be:
- `railiance-forge` owns forge signal definitions, evidence requirements, and runbook interpretation.
- `railiance-platform` should own shared metrics/log storage, durable retention, and platform service dashboards if observability remains an S3 capability.
- `railiance-enablement` may own reusable dashboard templates, workflow evidence templates, and developer-facing self-service views.
- A future dedicated observability repo may own cross-domain dashboards, alert routing, and log pipelines if Railiance chooses to separate that scope.
Moving collection, dashboards, or alert routing out of this repo must not move the meaning of forge signals. Forge remains the source of truth for what counts as healthy source hosting, registry service, package service, runner substrate, and release artifact evidence.