railiance-forge/docs/observability-operating-evidence.md

# Forge Observability And Operating Evidence

Last reviewed: 2026-06-13

Status: contract v1. This document defines checks, evidence, and future
monitoring expectations. It does not authorize a live monitoring deployment,
alert route, dashboard rollout, credential change, or forge cutover.

## Purpose

Forge availability affects source hosting, artifact publication, package
installation, and downstream app releases. Operators should be able to inspect
the current forge and produce release-readiness evidence without reconstructing
the system from historical workplans.

This contract defines:

- endpoint health checks for current Gitea and future Forgejo;
- log, dashboard, storage, and runner evidence expectations;
- manual thresholds until centralized monitoring exists;
- what S5 application releases can cite as forge readiness evidence;
- where future centralized observability should live.

## Signal Ownership

| Signal | Owner | Consumer |
| --- | --- | --- |
| Web and API endpoint health | `railiance-forge` defines checks; `railiance-cluster` provides ingress/DNS/TLS primitives | Operators, source repos, S5 release checks |
| Git SSH reachability | `railiance-forge` defines checks; `railiance-cluster` provides published Service/ingress path when present | Source repos, automation |
| Container registry health | `railiance-forge` defines `/v2/` checks and package evidence | S5 app image consumers |
| Python package registry health | `railiance-forge` defines PyPI endpoint checks and package evidence | Source repos and app build pipelines |
| Actions/runner health | `railiance-forge` owns runner substrate signals | S4 templates and source/app workflows |
| Gitea database status | `railiance-forge` checks consumer health; `railiance-platform` owns CNPG backup/restore mechanisms | Forge operators |
| Package/blob storage growth | `railiance-forge` tracks growth and thresholds; lower layers own durable storage/backup mechanisms | Forge operators and S5 release gates |
| Logs and audit trails | `railiance-forge` defines useful slices; platform/future observability owns durable aggregation | Operators and incident review |
| Release-readiness evidence | `railiance-forge` defines the evidence bundle | `railiance-apps` and source repos cite it |

## Read-Only Health Checks

Run `make gitea-status` first. It checks the Gitea pod, Service, Ingress, and
CNPG-backed `gitea-db` status when the operator has a kubeconfig pointed at the
Railiance cluster.

Additional checks should stay read-only:

```bash
# Web/API health: expect HTTP 200/3xx for the web route, not 404/5xx.
curl -fsSI https://gitea.coulomb.social/
curl -fsS https://gitea.coulomb.social/api/v1/version

# Container registry health: expect an OCI auth challenge, normally HTTP 401,
# with Docker-Distribution-Api-Version: registry/2.0.
curl -i https://gitea.coulomb.social/v2/

# Python package registry health: any non-5xx response shows the endpoint is
# reachable. Depending on package visibility this may be 200, 401, or 404.
curl -i https://gitea.coulomb.social/api/packages/coulomb/pypi/simple/
```
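
The expectations in the comments above can be encoded as a small classifier so that recorded evidence uses consistent wording. This is a minimal sketch, not an existing Makefile target: the function names `classify_status` and `probe` and the kind labels `web`, `api`, `registry`, and `pypi` are hypothetical. It assumes only HTTP 401 counts as healthy for `/v2/`, which is stricter than "normally HTTP 401", and it does not inspect the `Docker-Distribution-Api-Version` header.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map an endpoint kind and HTTP status code to ok/fail,
# following the expectations stated in this section.
classify_status() {
  local kind="$1" code="$2"
  case "$kind:$code" in
    web:2??|web:3??)            echo ok ;;   # web route: 200/3xx expected
    api:200)                    echo ok ;;   # version endpoint answers
    registry:401)               echo ok ;;   # OCI auth challenge (assumed strict)
    pypi:200|pypi:401|pypi:404) echo ok ;;   # visibility-dependent, never 5xx
    *)                          echo fail ;; # includes 5xx and 000 (unreachable)
  esac
}

# Read-only GET that keeps only the status code; no credentials are sent.
# curl reports 000 when the connection itself fails.
probe() {
  local kind="$1" url="$2" code
  code="$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$url")"
  echo "$kind $url $code $(classify_status "$kind" "$code")"
}
```

For example, `probe registry https://gitea.coulomb.social/v2/` would print one line ending in `ok` or `fail` that can be pasted into evidence.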

The HTTP NodePort on raw node IPs is intentionally not part of the public
health surface. Treat any reachable `http://<node-ip>:<gitea-nodeport>/` web
route as a regression to close, not as an alternate supported endpoint.

Git SSH:

- If a Git SSH endpoint is published, verify it with a read-only `git ls-remote`
  against a known non-secret repository or with an SSH banner check.
- If no SSH endpoint is intentionally exposed, record `not exposed` rather than
  silently skipping the signal.
- Do not paste private keys, tokens, or signed SSH command output containing
  secret material into evidence.
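
The Git SSH rules above could be sketched as a single function that always emits a result, including the explicit `not exposed` case. The function name `git_ssh_evidence` and the output wording are hypothetical, and the remote URL is supplied by the operator; no SSH endpoint is assumed here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: emit one evidence line for the Git SSH signal.
# $1 is the SSH remote URL of a known non-secret repository, or empty when
# no SSH endpoint is intentionally exposed.
git_ssh_evidence() {
  local remote="${1:-}"
  if [ -z "$remote" ]; then
    # Record the signal explicitly instead of skipping it.
    echo "git-ssh: not exposed"
    return 0
  fi
  # ls-remote only lists refs; it does not fetch objects or write anything.
  # BatchMode stops ssh from prompting, and all output is discarded so no
  # key or host details end up in the evidence.
  if GIT_SSH_COMMAND="ssh -o BatchMode=yes -o ConnectTimeout=10" \
     git ls-remote --heads "$remote" >/dev/null 2>&1; then
    echo "git-ssh: ok"
  else
    echo "git-ssh: fail"
  fi
}
```

Calling `git_ssh_evidence ""` records `git-ssh: not exposed`; calling it with a remote URL records `ok` or `fail` without printing any command output.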

Actions and runners:

- Run `make runner-status` for the current read-only runner, public endpoint,
  and inter-hub registry probes. The target degrades when optional tools such as
  `skopeo` or `act_runner` are unavailable.
- Record runner inventory by semantic label, trust level, and last successful
  sample job.
- For privileged labels such as `package-publish`, `registry-publish`,
  `cluster-dry-run`, or `s5-release-check`, record a recent non-production
  sample job or release job reference.
- If no runner currently provides a required label, mark the dependent workflow
  as blocked on runner prerequisites instead of weakening the workflow.
- The current runner evidence log lives in
  `docs/gitea-actions-runner-evidence.md`.
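
One way to apply the "blocked on runner prerequisites" rule is to compare the labels a workflow requires with the labels recorded in the runner inventory. The sketch below is illustrative only: `missing_labels` is a hypothetical name, and both label lists are passed in by the operator rather than queried from Gitea.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list required runner labels that no runner provides.
# $1: space-separated required labels
# $2: space-separated labels from the recorded runner inventory
missing_labels() {
  local required="$1" available=" $2 " label missing=""
  for label in $required; do
    case "$available" in
      *" $label "*) ;;                               # a runner provides it
      *) missing="${missing:+$missing }$label" ;;    # collect the gap
    esac
  done
  echo "$missing"
}
```

An empty result means every required label is provided. A non-empty result lists the labels to cite when marking the dependent workflow as blocked.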

## Storage Growth Checks

Current package blobs live under `/data/packages` on the
`default/gitea-shared-storage` PVC. The known baseline was about 798.5 MiB
(roughly 7.8% of capacity) on 2026-05-19 against a 10 GiB `local-path` PVC.

Read-only inspection:

```bash
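# PVC capacity and binding status.
kubectl get pvc gitea-shared-storage -n default

# Name of the first pod that matches the Gitea instance label.
pod="$(kubectl get pod -n default \
  -l app.kubernetes.io/instance=gitea \
  -o jsonpath='{.items[0].metadata.name}')"

# Total package blob size, then a directory count as a rough growth signal.
# The pipe to wc runs locally, on the output of kubectl exec.
kubectl exec -n default "$pod" -- du -sh /data/packages
kubectl exec -n default "$pod" -- find /data/packages -maxdepth 2 -type d | wc -l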
```

Manual thresholds until centralized metrics exist:

- Warning: package/blob usage reaches 70% of the PVC or grows by more than 2 GiB
  since the previous recorded check.
- Action required: usage reaches 85%, package restore has not been drilled for
  production-critical artifacts, or smoke-test tags are accumulating without a
  cleanup owner.
- Block production reliance: usage reaches 90%, package/blob restore evidence is
  missing for a production-critical artifact, or registry pulls/install checks
  fail with 5xx/server errors.
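
The size-based parts of these thresholds reduce to integer arithmetic, sketched below. The function name `classify_usage` and the result labels are hypothetical, and sizes are assumed to be whole MiB. The non-size conditions (restore drills, smoke-test tag cleanup, failing pulls) still need a human check and are not modeled.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify package/blob usage against the manual
# size thresholds. All sizes are integers in MiB.
# $1: used now, $2: PVC capacity, $3: used at the previous recorded check
classify_usage() {
  local used="$1" capacity="$2" previous="$3"
  local pct=$(( used * 100 / capacity ))   # floor of the usage percentage
  local growth=$(( used - previous ))      # growth since the previous check
  if   [ "$pct" -ge 90 ]; then echo "block"
  elif [ "$pct" -ge 85 ]; then echo "action-required"
  elif [ "$pct" -ge 70 ] || [ "$growth" -gt 2048 ]; then echo "warning"
  else echo "ok"
  fi
}
```

With the recorded baseline rounded up to 799 MiB on a 10240 MiB PVC, `classify_usage 799 10240 799` yields `ok`.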

Growth evidence should record date, operator, PVC size, package directory size,
largest known package family if inspected, and whether cleanup or backup
follow-up is required.
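
A growth record with these fields might be emitted as plain `key: value` lines, as in the sketch below. The function name `growth_record` and the key names are hypothetical; this document does not define a schema for them.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one growth evidence record as key: value lines.
# $1: operator, $2: PVC size, $3: package directory size,
# $4: largest known package family (optional), $5: follow-up (optional)
growth_record() {
  printf 'date: %s\n' "$(date -u +%F)"     # UTC date of the check
  printf 'operator: %s\n' "$1"
  printf 'pvc_size: %s\n' "$2"
  printf 'package_dir_size: %s\n' "$3"
  printf 'largest_package_family: %s\n' "${4:-not inspected}"
  printf 'follow_up: %s\n' "${5:-none}"
}
```

Omitting the fourth argument records `not inspected` rather than leaving the field blank, which matches the "if inspected" wording above.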

## Logs And Dashboards

Minimum manual log checks:

```bash
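# Recent Gitea container logs.
kubectl logs -n default -l app.kubernetes.io/instance=gitea --tail=200
# Pod state, restart counts, and volume mount conditions.
kubectl describe pod -n default -l app.kubernetes.io/instance=gitea
# Namespace events, oldest first.
kubectl get events -n default --sort-by=.lastTimestamp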
```

What to look for:

- repeated 5xx errors;
- failed package uploads or downloads;
- registry authentication loops;
- database connection errors;
- PVC mount, quota, or disk-pressure warnings;
- runner registration failures or stuck jobs;
- TLS/cert renewal failures at the ingress boundary.
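
A first pass over some of these failure classes can be a coarse pattern count on the log stream. The sketch below is hypothetical: `scan_log` is not an existing target, and the patterns are illustrative guesses rather than Gitea's actual log format, so they should be tuned against real output before anyone relies on the counts.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count stdin lines that match coarse patterns for
# three of the failure classes listed above. Patterns are assumptions.
scan_log() {
  awk '
    # A space-delimited 5xx token, e.g. an access-log status field.
    / 5[0-9][0-9] / { http5xx++ }
    # Database trouble: a database/connection hint plus an error word.
    tolower($0) ~ /database|connection refused/ &&
      tolower($0) ~ /error|fail/ { db++ }
    # Storage trouble: full disk, node disk pressure, or mount failures.
    tolower($0) ~ /no space left|disk pressure|failedmount/ { storage++ }
    END { printf "http_5xx=%d db=%d storage=%d\n", http5xx, db, storage }
  '
}
```

It reads from a pipe, for example `kubectl logs -n default -l app.kubernetes.io/instance=gitea --tail=200 | scan_log`. Non-zero counts are a prompt to read the matching lines, not a verdict.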

Dashboard expectations:

- Current state: manual checks and `make gitea-status` are the authoritative
  operator path.
- Next state: forge should publish signal definitions that a future dashboard
  can render without changing ownership boundaries.
- A useful dashboard should show web/API, SSH, registry, PyPI, runner, database,
  storage, and recent publish evidence in one view.
- Dashboard absence is not a reason to skip evidence; keep recording manual
  evidence until a durable view exists.

## Release-Readiness Evidence

Before an S5 app release cites a forge artifact as ready, forge evidence should
include:

- date and operator or automation id;
- forge repo commit containing the active operating contract;
- Gitea/Forgejo version or endpoint version response;
- web/API check result;
- Git SSH result or explicit `not exposed`;
- container registry `/v2/` challenge result;
- Python package endpoint result;
- runner label and sample job result when automation produced the artifact;
- source repo, commit SHA, package/image identity, and version/tag/digest;
- package/blob storage usage check if the artifact is production-critical;
- backup/restore evidence reference if production reliance depends on the
  artifact being recoverable;
- log review result for the relevant window;
- known risks or missing signals.

S5 may cite this evidence from an app runbook or workplan. S5 should not repeat
forge-internal backup procedures, package tokens, runner tokens, or registry
write credentials.
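
A completeness check over an evidence file could look like the sketch below. It assumes evidence is stored as `key: value` lines, which this contract does not mandate, and the function name `missing_evidence` and the key names are hypothetical. Only the unconditional items are checked; runner, storage, and backup/restore fields depend on the artifact and are left to the reviewer.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report which always-required keys are absent from a
# key: value evidence file. Key names are illustrative, not a fixed schema.
# $1: path to the evidence file
missing_evidence() {
  local file="$1" key missing=""
  for key in date operator contract_commit forge_version web_api git_ssh \
             registry_v2 pypi artifact_identity log_review known_risks; do
    # A key counts as present when a line starts with "<key>:".
    grep -q "^${key}:" "$file" || missing="${missing:+$missing }$key"
  done
  echo "$missing"
}
```

An empty result means the unconditional fields are all present. It says nothing about whether the recorded values are acceptable.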

## Alert And Intervention Rules

Until centralized alerting exists, record a State Hub note or request human
intervention when any of these occur:

- Gitea web/API endpoint returns 5xx or is unreachable.
- `/v2/` no longer returns an OCI registry response.
- PyPI endpoint returns 5xx or package install checks fail for a published
  release package.
- `make gitea-status` shows unavailable pods, missing Service/Ingress, or an
  unhealthy `gitea-db`.
- Package/blob usage crosses the warning or action thresholds above.
- Runner labels required by S4 templates or S5 checks disappear.
- A privileged runner label runs without a recorded trust/credential purpose.
- Logs show repeated database, storage, registry, or runner failures.
- Restore evidence is missing for a production-critical package/image/source
  dependency.

Use `needs_human=true` on the relevant State Hub task when the intervention
requires secret custody, credential minting, production restore decisions, or
live infrastructure changes.
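
The `needs_human` rule can be expressed as a simple lookup, sketched below. The function name `needs_human_flag` and the intervention labels are hypothetical, and defaulting every other case to `false` is an assumption; this contract only states when the flag should be `true`. How the flag is written to a State Hub task is not shown because that interface is not described here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: derive the needs_human flag from an intervention kind.
# The kind names are illustrative labels for the four cases listed above.
needs_human_flag() {
  case "$1" in
    secret-custody|credential-minting|production-restore|live-infra-change)
      echo "needs_human=true" ;;
    *)
      # Assumption: anything else can proceed without a human gate.
      echo "needs_human=false" ;;
  esac
}
```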

## Future Centralized Observability

The stable split should be:

- `railiance-forge` owns forge signal definitions, evidence requirements, and
  runbook interpretation.
- `railiance-platform` should own shared metrics/log storage, durable retention,
  and platform service dashboards if observability remains an S3 capability.
- `railiance-enablement` may own reusable dashboard templates, workflow evidence
  templates, and developer-facing self-service views.
- A future dedicated observability repo may own cross-domain dashboards, alert
  routing, and log pipelines if Railiance chooses to separate that scope.

Moving collection, dashboards, or alert routing out of this repo must not move
the meaning of forge signals. Forge remains the source of truth for what counts
as healthy source hosting, registry service, package service, runner substrate,
and release artifact evidence.