# Forge Observability And Operating Evidence

Last reviewed: 2026-06-13

Status: contract v1. This document defines checks, evidence, and future
monitoring expectations. It does not authorize a live monitoring deployment,
alert route, dashboard rollout, credential change, or forge cutover.

## Purpose

Forge availability affects source hosting, artifact publication, package
installation, and downstream app releases. Operators should be able to inspect
the current forge and produce release-readiness evidence without reconstructing
the system from historical workplans.

This contract defines:

- endpoint health checks for current Gitea and future Forgejo;
- log, dashboard, storage, and runner evidence expectations;
- manual thresholds until centralized monitoring exists;
- what S5 application releases can cite as forge readiness evidence;
- where future centralized observability should live.

## Signal Ownership

| Signal | Owner | Consumer |
| --- | --- | --- |
| Web and API endpoint health | `railiance-forge` defines checks; `railiance-cluster` provides ingress/DNS/TLS primitives | Operators, source repos, S5 release checks |
| Git SSH reachability | `railiance-forge` defines checks; `railiance-cluster` provides published Service/ingress path when present | Source repos, automation |
| Container registry health | `railiance-forge` defines `/v2/` checks and package evidence | S5 app image consumers |
| Python package registry health | `railiance-forge` defines PyPI endpoint checks and package evidence | Source repos and app build pipelines |
| Actions/runner health | `railiance-forge` owns runner substrate signals | S4 templates and source/app workflows |
| Gitea database status | `railiance-forge` checks consumer health; `railiance-platform` owns CNPG backup/restore mechanisms | Forge operators |
| Package/blob storage growth | `railiance-forge` tracks growth and thresholds; lower layers own durable storage/backup mechanisms | Forge operators and S5 release gates |
| Logs and audit trails | `railiance-forge` defines useful slices; platform/future observability owns durable aggregation | Operators and incident review |
| Release-readiness evidence | `railiance-forge` defines the evidence bundle | `railiance-apps` and source repos cite it |

## Read-Only Health Checks

Run `make gitea-status` first. It checks the Gitea pod, Service, Ingress, and
CNPG-backed `gitea-db` status when the operator has a kubeconfig pointed at the
Railiance cluster.

Additional checks should stay read-only:

```bash
# Web/API health: expect HTTP 200/3xx for the web route, not 404/5xx.
curl -fsSI https://gitea.coulomb.social/
curl -fsS https://gitea.coulomb.social/api/v1/version

# Container registry health: expect an OCI auth challenge, normally HTTP 401,
# with Docker-Distribution-Api-Version: registry/2.0.
curl -i https://gitea.coulomb.social/v2/

# Python package registry health: expect reachable endpoint behavior. Depending
# on package visibility this may be 200, 401, or 404; 5xx is not acceptable.
curl -i https://gitea.coulomb.social/api/packages/coulomb/pypi/simple/
```

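The status expectations for the web, registry, and PyPI endpoints can be turned into a small classifier so that recorded evidence uses consistent wording. This is a minimal sketch, not something the repo ships: the function name `classify_status` and the `ok`/`review`/`fail` labels are hypothetical, accepting HTTP 200 from `/v2/` is an assumption for registries that allow anonymous access, and `review` is used for PyPI codes the contract does not mention.

```bash
# Hypothetical helper: map an HTTP status code to ok/review/fail for one
# endpoint kind (web, registry, or pypi).
classify_status() {
  local kind="$1" code="$2"
  case "$kind:$code" in
    web:200|web:3??)             echo ok ;;      # web route: 200 or any 3xx
    registry:401|registry:200)   echo ok ;;      # 401 challenge expected; 200 is an assumption
    pypi:200|pypi:401|pypi:404)  echo ok ;;      # depends on package visibility
    pypi:5??)                    echo fail ;;    # 5xx is not acceptable
    pypi:*)                      echo review ;;  # not covered by the contract
    *)                           echo fail ;;    # everything else, including 000 (no response)
  esac
}

# Usage (performs a network request, so it is left commented out):
# code="$(curl -s -o /dev/null -w '%{http_code}' https://gitea.coulomb.social/v2/)"
# classify_status registry "$code"
```

The classifier only looks at the status code. The `Docker-Distribution-Api-Version` header still needs to be confirmed from the `curl -i` output.
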
The raw node IP HTTP NodePort is intentionally not part of the public health
surface. Treat any reachable `http://<node-ip>:<gitea-nodeport>/` web route as
a regression to close, not as an alternate supported endpoint.

Git SSH:

- If a Git SSH endpoint is published, verify it with a read-only `git ls-remote`
  against a known non-secret repository or with an SSH banner check.
- If no SSH endpoint is intentionally exposed, record `not exposed` rather than
  silently skipping the signal.
- Do not paste private keys, tokens, or signed SSH command output containing
  secret material into evidence.

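The Git SSH rules above could be wrapped so that the signal always produces exactly one evidence line, including the `not exposed` case. The sketch below is illustrative only: `git_ssh_signal` and its output strings are hypothetical names, and the repository URL must be supplied by the operator because this document does not define one.

```bash
# Hypothetical helper: print one evidence line for the Git SSH signal.
# $1 is a read-only repository URL, or empty when no SSH endpoint is published.
git_ssh_signal() {
  local repo_url="${1:-}"
  if [ -z "$repo_url" ]; then
    echo "git-ssh: not exposed"
    return 0
  fi
  # BatchMode stops ssh from prompting for input. All command output is
  # discarded so that nothing sensitive ends up in the evidence.
  if GIT_SSH_COMMAND='ssh -o BatchMode=yes -o ConnectTimeout=10' \
       git ls-remote --heads "$repo_url" >/dev/null 2>&1; then
    echo "git-ssh: reachable"
  else
    echo "git-ssh: unreachable"
  fi
}

# Usage (placeholder URL, left commented out):
# git_ssh_signal "<ssh-url-of-a-known-non-secret-repo>"
```

Recording only the one-line verdict keeps keys, tokens, and raw SSH output out of the evidence bundle.
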
Actions and runners:

- Run `make runner-status` for the current read-only runner, public endpoint,
  and inter-hub registry probes. The target degrades when optional tools such as
  `skopeo` or `act_runner` are unavailable.
- Record runner inventory by semantic label, trust level, and last successful
  sample job.
- For privileged labels such as `package-publish`, `registry-publish`,
  `cluster-dry-run`, or `s5-release-check`, record a recent non-production
  sample job or release job reference.
- If no runner currently provides a required label, mark the dependent workflow
  as blocked on runner prerequisites instead of weakening the workflow.
- The current runner evidence log lives in
  `docs/gitea-actions-runner-evidence.md`.

## Storage Growth Checks

Current package blobs live under `/data/packages` on the
`default/gitea-shared-storage` PVC. The known baseline was about 798.5 MiB on
2026-05-19 against a 10 GiB `local-path` PVC.

Read-only inspection:

```bash
kubectl get pvc gitea-shared-storage -n default

pod="$(kubectl get pod -n default \
  -l app.kubernetes.io/instance=gitea \
  -o jsonpath='{.items[0].metadata.name}')"

kubectl exec -n default "$pod" -- du -sh /data/packages
kubectl exec -n default "$pod" -- find /data/packages -maxdepth 2 -type d | wc -l
```

Manual thresholds until centralized metrics exist:

- Warning: package/blob usage reaches 70% of the PVC or grows by more than 2 GiB
  since the previous recorded check.
- Action required: usage reaches 85%, package restore has not been drilled for
  production-critical artifacts, or smoke-test tags are accumulating without a
  cleanup owner.
- Block production reliance: usage reaches 90%, package/blob restore evidence is
  missing for a production-critical artifact, or registry pulls/install checks
  fail with 5xx/server errors.

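The numeric parts of these thresholds are simple arithmetic and can be checked mechanically. The following is a sketch under stated assumptions: `classify_storage` and its `ok`/`warning`/`action`/`block` labels are hypothetical, sizes are passed in KiB (as `du -sk` reports them), and only the usage and growth conditions are covered. The restore-drill, cleanup-owner, and registry-error conditions still need an operator's judgment.

```bash
# Hypothetical helper: classify package/blob usage against the manual thresholds.
# $1 = current usage in KiB, $2 = PVC size in KiB,
# $3 = usage at the previous recorded check in KiB (defaults to $1, i.e. no growth).
classify_storage() {
  local used_kib="$1" pvc_kib="$2" prev_kib="${3:-$1}"
  local growth_kib=$(( used_kib - prev_kib ))
  local two_gib_kib=$(( 2 * 1024 * 1024 ))

  # Compare used*100 against pct*size to avoid integer-division rounding.
  if (( used_kib * 100 >= 90 * pvc_kib )); then
    echo block
  elif (( used_kib * 100 >= 85 * pvc_kib )); then
    echo action
  elif (( used_kib * 100 >= 70 * pvc_kib || growth_kib > two_gib_kib )); then
    echo warning
  else
    echo ok
  fi
}

# Usage (needs cluster access, so it is left commented out):
# used_kib="$(kubectl exec -n default "$pod" -- du -sk /data/packages | cut -f1)"
# classify_storage "$used_kib" 10485760 817664
```

With the recorded baseline of about 798.5 MiB (817664 KiB) on a 10 GiB PVC (10485760 KiB), usage is roughly 7.8%, well below the 70% warning level.
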
Growth evidence should record date, operator, PVC size, package directory size,
largest known package family if inspected, and whether cleanup or backup
follow-up is required.

## Logs And Dashboards

Minimum manual log checks:

```bash
kubectl logs -n default -l app.kubernetes.io/instance=gitea --tail=200
kubectl describe pod -n default -l app.kubernetes.io/instance=gitea
kubectl get events -n default --sort-by=.lastTimestamp
```

What to look for:

- repeated 5xx errors;
- failed package uploads or downloads;
- registry authentication loops;
- database connection errors;
- PVC mount, quota, or disk-pressure warnings;
- runner registration failures or stuck jobs;
- TLS/cert renewal failures at the ingress boundary.

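A rough first pass over the log tail can be done with `grep` before reading it by eye. This is only a sketch: `count_suspect_lines` is a hypothetical name, and the patterns are generic error strings chosen for illustration, not Gitea's documented log format. A count of zero does not replace a manual read of the relevant window.

```bash
# Hypothetical helper: count lines on stdin that match a few generic error
# strings. The pattern list is an illustrative guess; adjust it to the logs
# actually observed.
count_suspect_lines() {
  # grep -c prints the count and exits non-zero when the count is 0,
  # so "|| true" keeps the helper usable under "set -e".
  grep -Eic 'connection refused|no space left on device|x509|certificate has expired|internal server error' || true
}

# Usage (needs cluster access, so it is left commented out):
# kubectl logs -n default -l app.kubernetes.io/instance=gitea --tail=200 | count_suspect_lines
```
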
Dashboard expectations:

- Current state: manual checks and `make gitea-status` are the authoritative
  operator path.
- Next state: forge should publish signal definitions that a future dashboard
  can render without changing ownership boundaries.
- A useful dashboard should show web/API, SSH, registry, PyPI, runner, database,
  storage, and recent publish evidence in one view.
- Dashboard absence is not a reason to skip evidence; keep recording manual
  evidence until a durable view exists.

## Release-Readiness Evidence

Before an S5 app release cites a forge artifact as ready, forge evidence should
include:

- date and operator or automation id;
- forge repo commit containing the active operating contract;
- Gitea/Forgejo version or endpoint version response;
- web/API check result;
- Git SSH result or explicit `not exposed`;
- container registry `/v2/` challenge result;
- Python package endpoint result;
- runner label and sample job result when automation produced the artifact;
- source repo, commit SHA, package/image identity, and version/tag/digest;
- package/blob storage usage check if the artifact is production-critical;
- backup/restore evidence reference if production reliance depends on the
  artifact being recoverable;
- log review result for the relevant window;
- known risks or missing signals.

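One way to keep evidence bundles comparable is to start each one from the same stub. The sketch below is an assumption about how that might look, not a format this contract defines: `evidence_stub` and every key name are hypothetical, and fields that are not supplied are written as `MISSING` so that gaps stay visible instead of being silently dropped.

```bash
# Hypothetical helper: print an evidence stub as "key: value" lines.
# $1 = operator or automation id, $2 = forge repo commit, $3 = Git SSH result.
evidence_stub() {
  local operator="${1:-MISSING}" forge_commit="${2:-MISSING}" git_ssh="${3:-MISSING}"
  printf 'date: %s\n' "$(date -u +%Y-%m-%d)"
  printf 'operator: %s\n' "$operator"
  printf 'forge_commit: %s\n' "$forge_commit"
  printf 'git_ssh: %s\n' "$git_ssh"
  # Remaining fields from the list above, to be filled in by hand.
  local field
  for field in forge_version web_api registry_v2 pypi_endpoint runner_sample \
               artifact_identity storage_usage restore_evidence log_review known_risks; do
    printf '%s: MISSING\n' "$field"
  done
}

# Usage:
# evidence_stub "operator-id" "<forge-commit-sha>" "not exposed"
```

The stub holds only check results and references, never tokens or credentials.
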
S5 may cite this evidence from an app runbook or workplan. S5 should not repeat
forge-internal backup procedures, package tokens, runner tokens, or registry
write credentials.

## Alert And Intervention Rules

Until centralized alerting exists, record a State Hub note or human intervention
when any of these occur:

- Gitea web/API endpoint returns 5xx or is unreachable.
- `/v2/` no longer returns an OCI registry response.
- PyPI endpoint returns 5xx or package install checks fail for a published
  release package.
- `make gitea-status` shows unavailable pods, missing Service/Ingress, or an
  unhealthy `gitea-db`.
- Package/blob usage crosses the warning or action thresholds above.
- Runner labels required by S4 templates or S5 checks disappear.
- A privileged runner label runs without a recorded trust/credential purpose.
- Logs show repeated database, storage, registry, or runner failures.
- Restore evidence is missing for a production-critical package/image/source
  dependency.

Use `needs_human=true` on the relevant State Hub task when the intervention
requires secret custody, credential minting, production restore decisions, or
live infrastructure changes.

## Future Centralized Observability

The stable split should be:

- `railiance-forge` owns forge signal definitions, evidence requirements, and
  runbook interpretation.
- `railiance-platform` should own shared metrics/log storage, durable retention,
  and platform service dashboards if observability remains an S3 capability.
- `railiance-enablement` may own reusable dashboard templates, workflow evidence
  templates, and developer-facing self-service views.
- A future dedicated observability repo may own cross-domain dashboards, alert
  routing, and log pipelines if Railiance chooses to separate that scope.

Moving collection, dashboards, or alert routing out of this repo must not move
the meaning of forge signals. Forge remains the source of truth for what counts
as healthy source hosting, registry service, package service, runner substrate,
and release artifact evidence.