generated from coulomb/repo-seed
Define forge observability evidence
9
Makefile
@@ -10,6 +10,7 @@ GITEA_INGRESS ?= manifests/gitea-ingress.yaml
GITEA_DB_CLUSTER ?= gitea-db
GITEA_DB_NAMESPACE ?= databases
REGISTRY_DOCS ?= docs/gitea-container-registry.md docs/gitea-package-registry.md
EVIDENCE_DOCS ?= docs/observability-operating-evidence.md docs/ci-runner-actions-gitops-ownership.md docs/backup-restore-secret-handoff.md
SOPS_SENTINEL ?= $(GITEA_VALUES)

##@ Operator checks
@@ -40,6 +41,12 @@ registry-docs: ## Print canonical registry docs
sed -n '1,220p' "$$doc"; \
done

evidence-docs: ## Print forge evidence and handoff contracts
@for doc in $(EVIDENCE_DOCS); do \
printf '\n## %s\n\n' "$$doc"; \
sed -n '1,260p' "$$doc"; \
done

##@ Current Gitea

gitea-deploy: ## Deploy / upgrade current Gitea forge runtime
@@ -70,4 +77,4 @@ help: ## Show this help
/^[a-zA-Z0-9_-]+:.*?##/ { printf " \033[36m%-20s\033[0m %s\n", $$1, $$2 } \
/^##@/ { printf "\n\033[1m%s\033[0m\n", substr($$0, 5) }' $(MAKEFILE_LIST)

.PHONY: check-tools check-sops registry-docs gitea-deploy gitea-ingress-deploy gitea-status help
.PHONY: check-tools check-sops registry-docs evidence-docs gitea-deploy gitea-ingress-deploy gitea-status help
@@ -24,6 +24,7 @@ Key contracts:
- `docs/initial-operating-contracts.md`
- `docs/ci-runner-actions-gitops-ownership.md`
- `docs/backup-restore-secret-handoff.md`
- `docs/observability-operating-evidence.md`
- `docs/gitea-container-registry.md`
- `docs/gitea-package-registry.md`

@@ -31,6 +32,7 @@ Useful entry points:

```bash
make registry-docs
make evidence-docs
make check-tools
make gitea-status
make gitea-deploy
6
SCOPE.md
@@ -36,6 +36,8 @@ The runner, Actions, and GitOps ownership contract lives in
`docs/ci-runner-actions-gitops-ownership.md`.
The backup, restore, and secret custody handoff contract lives in
`docs/backup-restore-secret-handoff.md`.
The observability and operating evidence contract lives in
`docs/observability-operating-evidence.md`.

---

@@ -182,7 +184,9 @@ Known starting point:
`docs/ci-runner-actions-gitops-ownership.md`.
7. For backup, restore, and secret custody handoffs, read
`docs/backup-restore-secret-handoff.md`.
8. For migration context, read
8. For observability and release-readiness evidence, read
`docs/observability-operating-evidence.md`.
9. For migration context, read
`/home/worsch/railiance-apps/workplans/RAILIANCE-WP-0006-railiance-forge-extraction.md`.

---

@@ -202,8 +202,8 @@ follow-up against the owning layer before relying on the artifact.

## Follow-Ups

- WP-0006-T08 should turn backup, restore, storage growth, and runner status
evidence into inspectable operating signals.
- `docs/observability-operating-evidence.md` defines the inspectable storage
growth, restore-evidence, and runner-status signals for this contract.
- WP-0006-T09 should model forge backup/restore and secret-delivery edges in
Railiance Fabric.
- `RAILIANCE-WP-0005-T04` should use this contract when documenting S5 app data
@@ -163,8 +163,8 @@ should provide:

## Open Follow-Ups

- WP-0006-T08 should turn runner health and artifact evidence into explicit
observability requirements.
- `docs/observability-operating-evidence.md` defines the runner health and
artifact evidence signals that consumers may cite.
- WP-0006-T09 should declare runner substrate, label contracts, and evidence
edges in Railiance Fabric.
- `RAILIANCE-WP-0005-T05` should document app-side dry-run behavior once forge
@@ -93,3 +93,5 @@ leaving live deploy and secret custody changes behind separate review gates.
than repeating registry implementation details.
- Future monitoring should turn the manual status checks into durable signals
once the Railiance observability layer is ready.
- The detailed observability and operating evidence contract lives in
`docs/observability-operating-evidence.md`.
206
docs/observability-operating-evidence.md
Normal file
@@ -0,0 +1,206 @@
# Forge Observability And Operating Evidence

Last reviewed: 2026-06-05

Status: contract v1. This document defines checks, evidence, and future
monitoring expectations. It does not authorize a live monitoring deployment,
alert route, dashboard rollout, credential change, or forge cutover.

## Purpose

Forge availability affects source hosting, artifact publication, package
installation, and downstream app releases. Operators should be able to inspect
the current forge and produce release-readiness evidence without reconstructing
the system from historical workplans.

This contract defines:

- endpoint health checks for current Gitea and future Forgejo;
- log, dashboard, storage, and runner evidence expectations;
- manual thresholds until centralized monitoring exists;
- what S5 application releases can cite as forge readiness evidence;
- where future centralized observability should live.

## Signal Ownership

| Signal | Owner | Consumer |
| --- | --- | --- |
| Web and API endpoint health | `railiance-forge` defines checks; `railiance-cluster` provides ingress/DNS/TLS primitives | Operators, source repos, S5 release checks |
| Git SSH reachability | `railiance-forge` defines checks; `railiance-cluster` provides published Service/ingress path when present | Source repos, automation |
| Container registry health | `railiance-forge` defines `/v2/` checks and package evidence | S5 app image consumers |
| Python package registry health | `railiance-forge` defines PyPI endpoint checks and package evidence | Source repos and app build pipelines |
| Actions/runner health | `railiance-forge` owns runner substrate signals | S4 templates and source/app workflows |
| Gitea database status | `railiance-forge` checks consumer health; `railiance-platform` owns CNPG backup/restore mechanisms | Forge operators |
| Package/blob storage growth | `railiance-forge` tracks growth and thresholds; lower layers own durable storage/backup mechanisms | Forge operators and S5 release gates |
| Logs and audit trails | `railiance-forge` defines useful slices; platform/future observability owns durable aggregation | Operators and incident review |
| Release-readiness evidence | `railiance-forge` defines the evidence bundle | `railiance-apps` and source repos cite it |

## Read-Only Health Checks

Run `make gitea-status` first. It checks the Gitea pod, Service, Ingress, and
CNPG-backed `gitea-db` status when the operator has a kubeconfig pointed at the
Railiance cluster.

Additional checks should stay read-only:

```bash
# Web/API health: expect HTTP 200/3xx for the web route, not 5xx.
curl -fsSI https://gitea.coulomb.social/
curl -fsS https://gitea.coulomb.social/api/v1/version

# Container registry health: expect an OCI auth challenge, normally HTTP 401,
# with Docker-Distribution-Api-Version: registry/2.0.
curl -i https://gitea.coulomb.social/v2/

# Python package registry health: expect reachable endpoint behavior. Depending
# on package visibility this may be 200, 401, or 404; 5xx is not acceptable.
curl -i https://gitea.coulomb.social/api/packages/coulomb/pypi/simple/
```
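
One way to turn the expectations in the comments above into a recorded result is to classify each status code per signal. The sketch below is illustrative only: the function names `classify_status` and `probe`, the signal names, and the `ok` / `review` / `fail` wording are assumptions, not part of this contract.

```shell
# Sketch: map an HTTP status code to a coarse result for one forge signal.
# Signal names (web, api, registry, pypi) are hypothetical labels.
classify_status() {
  local signal="$1" code="$2"
  case "$signal:$code" in
    web:2??|web:3??)            echo "ok" ;;  # web route: 200/3xx expected
    api:200)                    echo "ok" ;;  # version endpoint answers
    registry:401)               echo "ok" ;;  # normal OCI auth challenge
    pypi:200|pypi:401|pypi:404) echo "ok" ;;  # depends on package visibility
    *:000)                      echo "fail: unreachable" ;;
    *:5??)                      echo "fail: server error" ;;
    *)                          echo "review: unexpected $code" ;;
  esac
}

# Fetch only the status code. curl prints 000 when no HTTP response arrives.
probe() {
  local signal="$1" url="$2" code
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")" || true
  printf '%s %s %s\n' "$signal" "$code" "$(classify_status "$signal" "$code")"
}

# Example calls (they perform network requests, so they stay commented out):
#   probe web      https://gitea.coulomb.social/
#   probe registry https://gitea.coulomb.social/v2/
```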

Git SSH:

- If a Git SSH endpoint is published, verify it with a read-only `git ls-remote`
against a known non-secret repository or with an SSH banner check.
- If no SSH endpoint is intentionally exposed, record `not exposed` rather than
silently skipping the signal.
- Do not paste private keys, tokens, or signed SSH command output containing
secret material into evidence.
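
The Git SSH rules above can be sketched as a helper that always yields an explicit value for the signal. The function name `ssh_signal` and the `reachable` / `unreachable` strings are assumptions, and the remote in the comment is a placeholder rather than a real repository.

```shell
# Sketch: produce the Git SSH evidence value for one remote.
# Pass an empty argument when no SSH endpoint is intentionally exposed.
ssh_signal() {
  local repo="${1:-}"
  if [ -z "$repo" ]; then
    echo "not exposed"
    return 0
  fi
  # BatchMode stops ssh from prompting. All command output is discarded so
  # that no host or key material ends up in the evidence record.
  if GIT_SSH_COMMAND='ssh -o BatchMode=yes -o ConnectTimeout=10' \
       git ls-remote --heads "$repo" >/dev/null 2>&1; then
    echo "reachable"
  else
    echo "unreachable"
  fi
}

# Example with a placeholder remote (network access, so commented out):
#   ssh_signal "git@forge.example:owner/public-repo.git"
```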

Actions and runners:

- Record runner inventory by semantic label, trust level, and last successful
sample job.
- For privileged labels such as `package-publish`, `registry-publish`,
`cluster-dry-run`, or `s5-release-check`, record a recent non-production
sample job or release job reference.
- If no runner currently provides a required label, mark the dependent workflow
as blocked on runner prerequisites instead of weakening the workflow.

## Storage Growth Checks

Current package blobs live under `/data/packages` on the
`default/gitea-shared-storage` PVC. The known baseline was about 798.5 MiB on
2026-05-19 against a 10 GiB `local-path` PVC.

Read-only inspection:

```bash
kubectl get pvc gitea-shared-storage -n default

pod="$(kubectl get pod -n default \
  -l app.kubernetes.io/instance=gitea \
  -o jsonpath='{.items[0].metadata.name}')"

kubectl exec -n default "$pod" -- du -sh /data/packages
kubectl exec -n default "$pod" -- find /data/packages -maxdepth 2 -type d | wc -l
```

Manual thresholds until centralized metrics exist:

- Warning: package/blob usage reaches 70% of the PVC or grows by more than 2 GiB
since the previous recorded check.
- Action required: usage reaches 85%, package restore has not been drilled for
production-critical artifacts, or smoke-test tags are accumulating without a
cleanup owner.
- Block production reliance: usage reaches 90%, package/blob restore evidence is
missing for a production-critical artifact, or registry pulls/install checks
fail with 5xx/server errors.

Growth evidence should record date, operator, PVC size, package directory size,
largest known package family if inspected, and whether cleanup or backup
follow-up is required.
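
The numeric part of these thresholds can be checked mechanically. The sketch below covers only the usage percentages and the 2 GiB growth rule; the restore-drill, cleanup-owner, and 5xx conditions still need a human judgment. The function name `storage_level` and its MiB-based arguments are assumptions made for illustration.

```shell
# Sketch: classify package/blob usage against the manual thresholds.
# Arguments are whole MiB: used, PVC capacity, previously recorded usage.
# If no previous value is given, growth is treated as zero.
storage_level() {
  local used="$1" cap="$2" prev="${3:-$1}"
  local pct=$(( used * 100 / cap ))   # integer percent, rounded down
  local growth=$(( used - prev ))     # MiB since the previous check
  if   [ "$pct" -ge 90 ]; then echo "block"
  elif [ "$pct" -ge 85 ]; then echo "action"
  elif [ "$pct" -ge 70 ] || [ "$growth" -gt 2048 ]; then echo "warning"
  else echo "ok"
  fi
}
```

With the recorded baseline rounded to 799 MiB on a 10240 MiB PVC, `storage_level 799 10240 799` gives `ok`: usage is about 7.8% and growth is zero. The first warning boundary on this PVC is 7168 MiB.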

## Logs And Dashboards

Minimum manual log checks:

```bash
kubectl logs -n default -l app.kubernetes.io/instance=gitea --tail=200
kubectl describe pod -n default -l app.kubernetes.io/instance=gitea
kubectl get events -n default --sort-by=.lastTimestamp
```

What to look for:

- repeated 5xx errors;
- failed package uploads or downloads;
- registry authentication loops;
- database connection errors;
- PVC mount, quota, or disk-pressure warnings;
- runner registration failures or stuck jobs;
- TLS/cert renewal failures at the ingress boundary.
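
A first pass over the log tail can be automated with a coarse filter before reading the output by hand. In the sketch below the function name `scan_forge_log` and every pattern are illustrative guesses; they are not strings that Gitea or the ingress is guaranteed to emit, and a count of zero does not replace the manual review.

```shell
# Sketch: count log lines on stdin that match coarse failure patterns.
# Patterns are assumptions: an access-log style 5xx status, a refused
# connection, a full disk, and an x509 certificate error.
scan_forge_log() {
  grep -Eic 'HTTP/[0-9.]+" 5[0-9]{2}|connection refused|no space left on device|x509' || true
}

# Example (needs cluster access, so commented out):
#   kubectl logs -n default -l app.kubernetes.io/instance=gitea --tail=200 | scan_forge_log
```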

Dashboard expectations:

- Current state: manual checks and `make gitea-status` are the authoritative
operator path.
- Next state: forge should publish signal definitions that a future dashboard
can render without changing ownership boundaries.
- A useful dashboard should show web/API, SSH, registry, PyPI, runner, database,
storage, and recent publish evidence in one view.
- Dashboard absence is not a reason to skip evidence; keep recording manual
evidence until a durable view exists.

## Release-Readiness Evidence

Before an S5 app release cites a forge artifact as ready, forge evidence should
include:

- date and operator or automation id;
- forge repo commit containing the active operating contract;
- Gitea/Forgejo version or endpoint version response;
- web/API check result;
- Git SSH result or explicit `not exposed`;
- container registry `/v2/` challenge result;
- Python package endpoint result;
- runner label and sample job result when automation produced the artifact;
- source repo, commit SHA, package/image identity, and version/tag/digest;
- package/blob storage usage check if the artifact is production-critical;
- backup/restore evidence reference if production reliance depends on the
artifact being recoverable;
- log review result for the relevant window;
- known risks or missing signals.

S5 may cite this evidence from an app runbook or workplan. S5 should not repeat
forge-internal backup procedures, package tokens, runner tokens, or registry
write credentials.
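
A plain-text record is one possible shape for this bundle. The sketch below is an assumption throughout: the function name `evidence_record`, the field names, and the `FORGE_*` variables are hypothetical, and the contract does not prescribe a format. Unset fields are written as `missing` so that gaps stay visible instead of being silently dropped.

```shell
# Sketch: print a release-readiness record as "key: value" lines.
# Each field is read from a hypothetical FORGE_<FIELD> environment variable.
evidence_record() {
  local key var val
  printf 'date: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  for key in operator forge_commit forge_version web_api git_ssh \
             registry_v2 pypi runner_job artifact storage restore_ref \
             log_review known_risks; do
    var="FORGE_$(printf '%s' "$key" | tr '[:lower:]' '[:upper:]')"
    # The variable name comes from the fixed list above, so eval is safe here.
    eval "val=\${$var:-missing}"
    printf '%s: %s\n' "$key" "$val"
  done
}
```

For example, `FORGE_GIT_SSH='not exposed' evidence_record` prints a `git_ssh: not exposed` line and marks every other unset field as `missing`. Values should be results and references only, never tokens or credentials.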

## Alert And Intervention Rules

Until centralized alerting exists, record a State Hub note or human intervention
when any of these occur:

- Gitea web/API endpoint returns 5xx or is unreachable.
- `/v2/` no longer returns an OCI registry response.
- PyPI endpoint returns 5xx or package install checks fail for a published
release package.
- `make gitea-status` shows unavailable pods, missing Service/Ingress, or an
unhealthy `gitea-db`.
- Package/blob usage crosses the warning or action thresholds above.
- Runner labels required by S4 templates or S5 checks disappear.
- A privileged runner label runs without a recorded trust/credential purpose.
- Logs show repeated database, storage, registry, or runner failures.
- Restore evidence is missing for a production-critical package/image/source
dependency.

Use `needs_human=true` on the relevant State Hub task when the intervention
requires secret custody, credential minting, production restore decisions, or
live infrastructure changes.

## Future Centralized Observability

The stable split should be:

- `railiance-forge` owns forge signal definitions, evidence requirements, and
runbook interpretation.
- `railiance-platform` should own shared metrics/log storage, durable retention,
and platform service dashboards if observability remains an S3 capability.
- `railiance-enablement` may own reusable dashboard templates, workflow evidence
templates, and developer-facing self-service views.
- A future dedicated observability repo may own cross-domain dashboards, alert
routing, and log pipelines if Railiance chooses to separate that scope.

Moving collection, dashboards, or alert routing out of this repo must not move
the meaning of forge signals. Forge remains the source of truth for what counts
as healthy source hosting, registry service, package service, runner substrate,
and release artifact evidence.