Forge Observability And Operating Evidence

Last reviewed: 2026-06-05

Status: contract v1. This document defines checks, evidence, and future monitoring expectations. It does not authorize a live monitoring deployment, alert route, dashboard rollout, credential change, or forge cutover.

Purpose

Forge availability affects source hosting, artifact publication, package installation, and downstream app releases. Operators should be able to inspect the current forge and produce release-readiness evidence without reconstructing the system from historical workplans.

This contract defines:

  • endpoint health checks for current Gitea and future Forgejo;
  • log, dashboard, storage, and runner evidence expectations;
  • manual thresholds until centralized monitoring exists;
  • what S5 application releases can cite as forge readiness evidence;
  • where future centralized observability should live.

Signal Ownership

| Signal | Owner | Consumer |
| --- | --- | --- |
| Web and API endpoint health | railiance-forge defines checks; railiance-cluster provides ingress/DNS/TLS primitives | Operators, source repos, S5 release checks |
| Git SSH reachability | railiance-forge defines checks; railiance-cluster provides published Service/ingress path when present | Source repos, automation |
| Container registry health | railiance-forge defines /v2/ checks and package evidence | S5 app image consumers |
| Python package registry health | railiance-forge defines PyPI endpoint checks and package evidence | Source repos and app build pipelines |
| Actions/runner health | railiance-forge owns runner substrate signals | S4 templates and source/app workflows |
| Gitea database status | railiance-forge checks consumer health; railiance-platform owns CNPG backup/restore mechanisms | Forge operators |
| Package/blob storage growth | railiance-forge tracks growth and thresholds; lower layers own durable storage/backup mechanisms | Forge operators and S5 release gates |
| Logs and audit trails | railiance-forge defines useful slices; platform/future observability owns durable aggregation | Operators and incident review |
| Release-readiness evidence | railiance-forge defines the evidence bundle | railiance-apps and source repos cite it |

Read-Only Health Checks

Run make gitea-status first. It checks the Gitea pod, Service, Ingress, and CNPG-backed gitea-db status when the operator has a kubeconfig pointed at the Railiance cluster.

Additional checks should stay read-only:

# Web/API health: expect HTTP 200/3xx for the web route, not 5xx.
curl -fsSI https://gitea.coulomb.social/
curl -fsS https://gitea.coulomb.social/api/v1/version

# Container registry health: expect an OCI auth challenge, normally HTTP 401,
# with Docker-Distribution-Api-Version: registry/2.0.
curl -i https://gitea.coulomb.social/v2/

# Python package registry health: expect reachable endpoint behavior. Depending
# on package visibility this may be 200, 401, or 404; 5xx is not acceptable.
curl -i https://gitea.coulomb.social/api/packages/coulomb/pypi/simple/
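
The expectations in the comments above can be turned into a small classifier so that evidence records a consistent verdict per endpoint. This is a minimal sketch, not part of make gitea-status: the function name classify_http and the ok/review/fail vocabulary are assumptions introduced here for illustration.

```shell
# classify_http KIND STATUS
# KIND is one of: web, registry, pypi. STATUS is a three-digit HTTP code.
# Prints ok, review, or fail. (Hypothetical helper, not an existing make target.)
classify_http() {
  local kind="$1" status="$2"

  # 5xx is never acceptable; curl reports 000 when no response was received.
  case "$status" in
    5??|000|'') echo "fail"; return 0 ;;
  esac

  case "$kind" in
    web)
      # Web route: expect 200 or a redirect.
      case "$status" in 2??|3??) echo "ok" ;; *) echo "review" ;; esac ;;
    registry)
      # /v2/ normally answers with an OCI auth challenge (401).
      case "$status" in 401) echo "ok" ;; *) echo "review" ;; esac ;;
    pypi)
      # Depending on package visibility: 200, 401, or 404.
      case "$status" in 200|401|404) echo "ok" ;; *) echo "review" ;; esac ;;
    *)
      echo "review" ;;
  esac
}
```

One possible use is to feed it the status code that curl prints with -w '%{http_code}', for example: classify_http registry "$(curl -s -o /dev/null -w '%{http_code}' https://gitea.coulomb.social/v2/)". A review verdict means the response was not a server error but did not match the expected shape, so an operator should look at it before citing it as evidence.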

Git SSH:

  • If a Git SSH endpoint is published, verify it with a read-only git ls-remote against a known non-secret repository or with an SSH banner check.
  • If no SSH endpoint is intentionally exposed, record "not exposed" rather than silently skipping the signal.
  • Do not paste private keys, tokens, or signed SSH command output containing secret material into evidence.
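
The three rules above could be sketched as one helper that always emits a result line, including the explicit "not exposed" case. The function name git_ssh_evidence and the output wording are assumptions for illustration; the repository URL is supplied by the operator and is not defined by this contract.

```shell
# git_ssh_evidence [REPO_URL]
# Prints one evidence line. With no URL, records that SSH is not exposed.
# (Hypothetical helper; REPO_URL must point at a known non-secret repository.)
git_ssh_evidence() {
  local url="${1:-}"

  if [ -z "$url" ]; then
    echo "git-ssh: not exposed"
    return 0
  fi

  # BatchMode avoids interactive prompts; all command output is discarded so
  # that no key or token material can leak into the evidence record.
  if GIT_SSH_COMMAND='ssh -o BatchMode=yes -o ConnectTimeout=10' \
      git ls-remote --heads "$url" >/dev/null 2>&1; then
    echo "git-ssh: reachable"
  else
    echo "git-ssh: unreachable"
    return 1
  fi
}
```

Only the single summary line is meant to be pasted into evidence; the raw ls-remote output stays out of the record.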

Actions and runners:

  • Record runner inventory by semantic label, trust level, and last successful sample job.
  • For privileged labels such as package-publish, registry-publish, cluster-dry-run, or s5-release-check, record a recent non-production sample job or release job reference.
  • If no runner currently provides a required label, mark the dependent workflow as blocked on runner prerequisites instead of weakening the workflow.
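
A simple way to apply the last rule is to compare the labels a workflow requires against the labels in the recorded runner inventory. The sketch below assumes both are available as space-separated lists; how the inventory is collected is left open, and missing_labels is a hypothetical name.

```shell
# missing_labels "REQUIRED LABELS" "AVAILABLE LABELS"
# Prints each required label that no runner in the inventory provides,
# one per line. Empty output means no runner prerequisite is missing.
missing_labels() {
  local required="$1" available="$2" label

  for label in $required; do
    case " $available " in
      *" $label "*) ;;          # label is provided by some runner
      *) echo "$label" ;;       # dependent workflow is blocked on this label
    esac
  done
}
```

For example, missing_labels "package-publish registry-publish" "registry-publish cluster-dry-run" prints package-publish, which would mark the publishing workflow as blocked on runner prerequisites.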

Storage Growth Checks

Current package blobs live under /data/packages on the default/gitea-shared-storage PVC. The known baseline was about 798.5 MiB on 2026-05-19 against a 10 GiB local-path PVC.

Read-only inspection:

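# PVC capacity and binding status.
kubectl get pvc gitea-shared-storage -n default

# Resolve the first Gitea pod by its instance label.
pod="$(kubectl get pod -n default \
  -l app.kubernetes.io/instance=gitea \
  -o jsonpath='{.items[0].metadata.name}')"

# Total package blob size, then a rough count of package directories.
kubectl exec -n default "$pod" -- du -sh /data/packages
kubectl exec -n default "$pod" -- find /data/packages -maxdepth 2 -type d | wc -l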

Manual thresholds until centralized metrics exist:

  • Warning: package/blob usage reaches 70% of the PVC or grows by more than 2 GiB since the previous recorded check.
  • Action required: usage reaches 85%, package restore has not been drilled for production-critical artifacts, or smoke-test tags are accumulating without a cleanup owner.
  • Block production reliance: usage reaches 90%, package/blob restore evidence is missing for a production-critical artifact, or registry pulls/install checks fail with 5xx/server errors.
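
For a 10 GiB PVC (10240 MiB) the percentage thresholds work out to 7168 MiB (70%), 8704 MiB (85%), and 9216 MiB (90%); the 2026-05-19 baseline of about 798.5 MiB is roughly 7.8%. The numeric part of the thresholds can be sketched as below. The function name storage_level and the integer-MiB inputs are assumptions, and the non-numeric conditions (restore drills, smoke-test tag cleanup, failing pulls) still need an operator's judgment.

```shell
# storage_level USED_MIB CAPACITY_MIB [PREVIOUS_USED_MIB]
# Prints ok, warning, action, or block based only on the numeric thresholds.
# All arguments are whole MiB values.
storage_level() {
  local used="$1" cap="$2" prev="${3:-$1}"

  if [ $((used * 100)) -ge $((cap * 90)) ]; then
    echo "block"
  elif [ $((used * 100)) -ge $((cap * 85)) ]; then
    echo "action"
  elif [ $((used * 100)) -ge $((cap * 70)) ] || [ $((used - prev)) -gt 2048 ]; then
    # 70% of the PVC, or more than 2 GiB of growth since the previous check.
    echo "warning"
  else
    echo "ok"
  fi
}
```

With the recorded baseline, storage_level 798 10240 reports ok, while storage_level 3000 10240 798 reports warning because growth since the previous check exceeds 2 GiB even though usage is under 70%.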

Growth evidence should record date, operator, PVC size, package directory size, largest known package family if inspected, and whether cleanup or backup follow-up is required.

Logs And Dashboards

Minimum manual log checks:

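# Recent Gitea container logs.
kubectl logs -n default -l app.kubernetes.io/instance=gitea --tail=200
# Pod conditions, restarts, mounts, and pod-level events.
kubectl describe pod -n default -l app.kubernetes.io/instance=gitea
# Namespace events, most recent last.
kubectl get events -n default --sort-by=.lastTimestamp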

What to look for:

  • repeated 5xx errors;
  • failed package uploads or downloads;
  • registry authentication loops;
  • database connection errors;
  • PVC mount, quota, or disk-pressure warnings;
  • runner registration failures or stuck jobs;
  • TLS/cert renewal failures at the ingress boundary.
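
A first pass over the log output can be done with a coarse pattern count before reading the lines in detail. The sketch below is illustrative only: the regular expressions are assumptions, not taken from the actual Gitea log format, and they will produce false positives and misses until tuned against real output. The names count_matches and scan_forge_log are hypothetical.

```shell
# count_matches REGEX
# Prints the number of stdin lines matching REGEX (case-insensitive, extended).
count_matches() {
  grep -Eic -- "$1" || true   # grep exits 1 on zero matches but still prints 0
}

# scan_forge_log
# Reads log text on stdin and prints one key=count line per signal family.
# Patterns are illustrative assumptions and need tuning for real logs.
scan_forge_log() {
  local log
  log="$(cat)"

  printf 'server_errors=%s\n' \
    "$(printf '%s\n' "$log" | count_matches ' 5[0-9][0-9] ')"
  printf 'database_errors=%s\n' \
    "$(printf '%s\n' "$log" | count_matches 'database|connection refused')"
  printf 'storage_warnings=%s\n' \
    "$(printf '%s\n' "$log" | count_matches 'no space left|disk pressure|quota')"
  printf 'tls_failures=%s\n' \
    "$(printf '%s\n' "$log" | count_matches 'x509|certificate')"
}
```

It could be fed from the log command above, for example by piping the kubectl logs output into scan_forge_log. Non-zero counts are a prompt to read the matching lines, not a verdict on their own.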

Dashboard expectations:

  • Current state: manual checks and make gitea-status are the authoritative operator path.
  • Next state: forge should publish signal definitions that a future dashboard can render without changing ownership boundaries.
  • A useful dashboard should show web/API, SSH, registry, PyPI, runner, database, storage, and recent publish evidence in one view.
  • Dashboard absence is not a reason to skip evidence; keep recording manual evidence until a durable view exists.

Release-Readiness Evidence

Before an S5 app release cites a forge artifact as ready, forge evidence should include:

  • date and operator or automation id;
  • forge repo commit containing the active operating contract;
  • Gitea/Forgejo version or endpoint version response;
  • web/API check result;
  • Git SSH result or an explicit "not exposed";
  • container registry /v2/ challenge result;
  • Python package endpoint result;
  • runner label and sample job result when automation produced the artifact;
  • source repo, commit SHA, package/image identity, and version/tag/digest;
  • package/blob storage usage check if the artifact is production-critical;
  • backup/restore evidence reference if production reliance depends on the artifact being recoverable;
  • log review result for the relevant window;
  • known risks or missing signals.

S5 may cite this evidence from an app runbook or workplan. S5 should not repeat forge-internal backup procedures, package tokens, runner tokens, or registry write credentials.
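
One way to keep evidence bundles complete is to store them as key=value lines and check for the unconditional fields before a release cites them. This is a sketch under assumptions: the key names below and the key=value file layout are hypothetical and not defined by this contract, and the conditional items (runner sample job, storage check, backup/restore reference) are left to the operator.

```shell
# missing_evidence_fields FILE
# Prints each unconditional evidence key that is absent or empty in FILE,
# one per line. Key names are hypothetical placeholders for the list above.
missing_evidence_fields() {
  local file="$1" key

  for key in date operator forge_commit forge_version web_api git_ssh \
             registry_v2 pypi artifact log_review known_risks; do
    # A field counts as present only if it has a non-empty value.
    grep -q "^${key}=..*" "$file" || echo "$key"
  done
}
```

Empty output means every unconditional field has a value. The bundle itself should hold results and references only, never package tokens, runner tokens, or registry write credentials.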

Alert And Intervention Rules

Until centralized alerting exists, record a State Hub note or request human intervention when any of these occur:

  • Gitea web/API endpoint returns 5xx or is unreachable.
  • /v2/ no longer returns an OCI registry response.
  • PyPI endpoint returns 5xx or package install checks fail for a published release package.
  • make gitea-status shows unavailable pods, missing Service/Ingress, or an unhealthy gitea-db.
  • Package/blob usage crosses the warning, action-required, or block thresholds above.
  • Runner labels required by S4 templates or S5 checks disappear.
  • A privileged runner label runs without a recorded trust/credential purpose.
  • Logs show repeated database, storage, registry, or runner failures.
  • Restore evidence is missing for a production-critical package/image/source dependency.

Use needs_human=true on the relevant State Hub task when the intervention requires secret custody, credential minting, production restore decisions, or live infrastructure changes.

Future Centralized Observability

The stable split should be:

  • railiance-forge owns forge signal definitions, evidence requirements, and runbook interpretation.
  • railiance-platform should own shared metrics/log storage, durable retention, and platform service dashboards if observability remains an S3 capability.
  • railiance-enablement may own reusable dashboard templates, workflow evidence templates, and developer-facing self-service views.
  • A future dedicated observability repo may own cross-domain dashboards, alert routing, and log pipelines if Railiance chooses to separate that scope.

Moving collection, dashboards, or alert routing out of this repo must not move the meaning of forge signals. Forge remains the source of truth for what counts as healthy source hosting, registry service, package service, runner substrate, and release artifact evidence.