generated from coulomb/repo-seed
Define forge observability evidence
9
Makefile
@@ -10,6 +10,7 @@ GITEA_INGRESS ?= manifests/gitea-ingress.yaml
 GITEA_DB_CLUSTER ?= gitea-db
 GITEA_DB_NAMESPACE ?= databases
 REGISTRY_DOCS ?= docs/gitea-container-registry.md docs/gitea-package-registry.md
+EVIDENCE_DOCS ?= docs/observability-operating-evidence.md docs/ci-runner-actions-gitops-ownership.md docs/backup-restore-secret-handoff.md
 SOPS_SENTINEL ?= $(GITEA_VALUES)
 
 ##@ Operator checks
@@ -40,6 +41,12 @@ registry-docs: ## Print canonical registry docs
 		sed -n '1,220p' "$$doc"; \
 	done
 
+evidence-docs: ## Print forge evidence and handoff contracts
+	@for doc in $(EVIDENCE_DOCS); do \
+		printf '\n## %s\n\n' "$$doc"; \
+		sed -n '1,260p' "$$doc"; \
+	done
+
 ##@ Current Gitea
 
 gitea-deploy: ## Deploy / upgrade current Gitea forge runtime
@@ -70,4 +77,4 @@ help: ## Show this help
 	/^[a-zA-Z0-9_-]+:.*?##/ { printf " \033[36m%-20s\033[0m %s\n", $$1, $$2 } \
 	/^##@/ { printf "\n\033[1m%s\033[0m\n", substr($$0, 5) }' $(MAKEFILE_LIST)
 
-.PHONY: check-tools check-sops registry-docs gitea-deploy gitea-ingress-deploy gitea-status help
+.PHONY: check-tools check-sops registry-docs evidence-docs gitea-deploy gitea-ingress-deploy gitea-status help
@@ -24,6 +24,7 @@ Key contracts:
 - `docs/initial-operating-contracts.md`
 - `docs/ci-runner-actions-gitops-ownership.md`
 - `docs/backup-restore-secret-handoff.md`
+- `docs/observability-operating-evidence.md`
 - `docs/gitea-container-registry.md`
 - `docs/gitea-package-registry.md`
 
@@ -31,6 +32,7 @@ Useful entry points:
 
 ```bash
 make registry-docs
+make evidence-docs
 make check-tools
 make gitea-status
 make gitea-deploy
6
SCOPE.md
@@ -36,6 +36,8 @@ The runner, Actions, and GitOps ownership contract lives in
 `docs/ci-runner-actions-gitops-ownership.md`.
 The backup, restore, and secret custody handoff contract lives in
 `docs/backup-restore-secret-handoff.md`.
+The observability and operating evidence contract lives in
+`docs/observability-operating-evidence.md`.
 
 ---
 
@@ -182,7 +184,9 @@ Known starting point:
    `docs/ci-runner-actions-gitops-ownership.md`.
 7. For backup, restore, and secret custody handoffs, read
    `docs/backup-restore-secret-handoff.md`.
-8. For migration context, read
+8. For observability and release-readiness evidence, read
+   `docs/observability-operating-evidence.md`.
+9. For migration context, read
    `/home/worsch/railiance-apps/workplans/RAILIANCE-WP-0006-railiance-forge-extraction.md`.
 
 ---
@@ -202,8 +202,8 @@ follow-up against the owning layer before relying on the artifact.
 
 ## Follow-Ups
 
-- WP-0006-T08 should turn backup, restore, storage growth, and runner status
-  evidence into inspectable operating signals.
+- `docs/observability-operating-evidence.md` defines the inspectable storage
+  growth, restore-evidence, and runner-status signals for this contract.
 - WP-0006-T09 should model forge backup/restore and secret-delivery edges in
   Railiance Fabric.
 - `RAILIANCE-WP-0005-T04` should use this contract when documenting S5 app data
@@ -163,8 +163,8 @@ should provide:
 
 ## Open Follow-Ups
 
-- WP-0006-T08 should turn runner health and artifact evidence into explicit
-  observability requirements.
+- `docs/observability-operating-evidence.md` defines the runner health and
+  artifact evidence signals that consumers may cite.
 - WP-0006-T09 should declare runner substrate, label contracts, and evidence
   edges in Railiance Fabric.
 - `RAILIANCE-WP-0005-T05` should document app-side dry-run behavior once forge
@@ -93,3 +93,5 @@ leaving live deploy and secret custody changes behind separate review gates.
   than repeating registry implementation details.
 - Future monitoring should turn the manual status checks into durable signals
   once the Railiance observability layer is ready.
+- The detailed observability and operating evidence contract lives in
+  `docs/observability-operating-evidence.md`.
206
docs/observability-operating-evidence.md
Normal file
@@ -0,0 +1,206 @@
+# Forge Observability And Operating Evidence
+
+Last reviewed: 2026-06-05
+
+Status: contract v1. This document defines checks, evidence, and future
+monitoring expectations. It does not authorize a live monitoring deployment,
+alert route, dashboard rollout, credential change, or forge cutover.
+
+## Purpose
+
+Forge availability affects source hosting, artifact publication, package
+installation, and downstream app releases. Operators should be able to inspect
+the current forge and produce release-readiness evidence without reconstructing
+the system from historical workplans.
+
+This contract defines:
+
+- endpoint health checks for current Gitea and future Forgejo;
+- log, dashboard, storage, and runner evidence expectations;
+- manual thresholds until centralized monitoring exists;
+- what S5 application releases can cite as forge readiness evidence;
+- where future centralized observability should live.
+
+## Signal Ownership
+
+| Signal | Owner | Consumer |
+| --- | --- | --- |
+| Web and API endpoint health | `railiance-forge` defines checks; `railiance-cluster` provides ingress/DNS/TLS primitives | Operators, source repos, S5 release checks |
+| Git SSH reachability | `railiance-forge` defines checks; `railiance-cluster` provides published Service/ingress path when present | Source repos, automation |
+| Container registry health | `railiance-forge` defines `/v2/` checks and package evidence | S5 app image consumers |
+| Python package registry health | `railiance-forge` defines PyPI endpoint checks and package evidence | Source repos and app build pipelines |
+| Actions/runner health | `railiance-forge` owns runner substrate signals | S4 templates and source/app workflows |
+| Gitea database status | `railiance-forge` checks consumer health; `railiance-platform` owns CNPG backup/restore mechanisms | Forge operators |
+| Package/blob storage growth | `railiance-forge` tracks growth and thresholds; lower layers own durable storage/backup mechanisms | Forge operators and S5 release gates |
+| Logs and audit trails | `railiance-forge` defines useful slices; platform/future observability owns durable aggregation | Operators and incident review |
+| Release-readiness evidence | `railiance-forge` defines the evidence bundle | `railiance-apps` and source repos cite it |
+
+## Read-Only Health Checks
+
+Run `make gitea-status` first. It checks the Gitea pod, Service, Ingress, and
+CNPG-backed `gitea-db` status when the operator has a kubeconfig pointed at the
+Railiance cluster.
+
+Additional checks should stay read-only:
+
+```bash
+# Web/API health: expect HTTP 200/3xx for the web route, not 5xx.
+curl -fsSI https://gitea.coulomb.social/
+curl -fsS https://gitea.coulomb.social/api/v1/version
+
+# Container registry health: expect an OCI auth challenge, normally HTTP 401,
+# with Docker-Distribution-Api-Version: registry/2.0.
+curl -i https://gitea.coulomb.social/v2/
+
+# Python package registry health: expect reachable endpoint behavior. Depending
+# on package visibility this may be 200, 401, or 404; 5xx is not acceptable.
+curl -i https://gitea.coulomb.social/api/packages/coulomb/pypi/simple/
+```
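The status expectations stated in the comments of the checks above can be turned into a small verdict helper. The following is a minimal sketch, not part of the commit: the function name `classify_status`, the endpoint kinds `web`, `registry`, and `pypi`, and the verdict strings are all assumptions made for illustration. It treats any 5xx (or curl's `000` for no response) as a failure and anything reachable but unexpected as needing manual review.

```shell
# Hypothetical helper (not in the repo): map an HTTP status code to a verdict.
# Usage: classify_status KIND CODE, where KIND is web, registry, or pypi.
classify_status() {
  kind=$1
  code=$2
  case "$code" in
    # 5xx is never acceptable; curl reports 000 when no response was received.
    5??|000) echo "fail"; return 0 ;;
  esac
  case "$kind:$code" in
    web:200|web:3??)            echo "ok" ;;      # web route: 200 or a redirect
    registry:401)               echo "ok" ;;      # expected OCI auth challenge
    pypi:200|pypi:401|pypi:404) echo "ok" ;;      # depends on package visibility
    *)                          echo "review" ;;  # reachable but unexpected
  esac
}

# Example (read-only, performs a network request):
# code="$(curl -s -o /dev/null -w '%{http_code}' https://gitea.coulomb.social/v2/)"
# classify_status registry "$code"
```

A `review` verdict is deliberately separate from `fail`: a registry that answers 200 instead of 401, for example, is reachable but may indicate a changed access policy that an operator should look at.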
+
+Git SSH:
+
+- If a Git SSH endpoint is published, verify it with a read-only `git ls-remote`
+  against a known non-secret repository or with an SSH banner check.
+- If no SSH endpoint is intentionally exposed, record `not exposed` rather than
+  silently skipping the signal.
+- Do not paste private keys, tokens, or signed SSH command output containing
+  secret material into evidence.
+
+Actions and runners:
+
+- Record runner inventory by semantic label, trust level, and last successful
+  sample job.
+- For privileged labels such as `package-publish`, `registry-publish`,
+  `cluster-dry-run`, or `s5-release-check`, record a recent non-production
+  sample job or release job reference.
+- If no runner currently provides a required label, mark the dependent workflow
+  as blocked on runner prerequisites instead of weakening the workflow.
+
+## Storage Growth Checks
+
+Current package blobs live under `/data/packages` on the
+`default/gitea-shared-storage` PVC. The known baseline was about 798.5 MiB on
+2026-05-19 against a 10 GiB `local-path` PVC.
+
+Read-only inspection:
+
+```bash
+kubectl get pvc gitea-shared-storage -n default
+
+pod="$(kubectl get pod -n default \
+  -l app.kubernetes.io/instance=gitea \
+  -o jsonpath='{.items[0].metadata.name}')"
+
+kubectl exec -n default "$pod" -- du -sh /data/packages
+kubectl exec -n default "$pod" -- find /data/packages -maxdepth 2 -type d | wc -l
+```
+
+Manual thresholds until centralized metrics exist:
+
+- Warning: package/blob usage reaches 70% of the PVC or grows by more than 2 GiB
+  since the previous recorded check.
+- Action required: usage reaches 85%, package restore has not been drilled for
+  production-critical artifacts, or smoke-test tags are accumulating without a
+  cleanup owner.
+- Block production reliance: usage reaches 90%, package/blob restore evidence is
+  missing for a production-critical artifact, or registry pulls/install checks
+  fail with 5xx/server errors.
+
+Growth evidence should record date, operator, PVC size, package directory size,
+largest known package family if inspected, and whether cleanup or backup
+follow-up is required.
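The numeric part of the manual thresholds above can be sketched as a small classifier. This is an illustration under stated assumptions, not part of the commit: the function name `classify_storage` and its verdict strings are hypothetical, sizes are passed as whole MiB, and percentages are rounded down. It only covers the usage and growth conditions; the restore-drill, cleanup-owner, and registry-error conditions still need a human judgment.

```shell
# Hypothetical helper (not in the repo).
# Usage: classify_storage USED_MIB PVC_SIZE_MIB PREVIOUS_USED_MIB
classify_storage() {
  used=$1
  size=$2
  prev=$3
  pct=$(( used * 100 / size ))   # integer percentage, rounded down
  growth=$(( used - prev ))      # MiB since the previous recorded check
  if [ "$pct" -ge 90 ]; then
    echo "block"                 # block production reliance
  elif [ "$pct" -ge 85 ]; then
    echo "action"                # action required
  elif [ "$pct" -ge 70 ] || [ "$growth" -gt 2048 ]; then
    echo "warning"               # 70% of the PVC or more than 2 GiB growth
  else
    echo "ok"
  fi
}

# Example with the recorded baseline (about 799 MiB of a 10 GiB PVC):
# classify_storage 799 10240 799
# The used value could come from `du -sm /data/packages`, assuming the du in
# the Gitea container supports -m.
```

Keeping the previous reading as an explicit argument means the growth check relies on the recorded evidence from the last inspection rather than on any state stored in the cluster.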
+
+## Logs And Dashboards
+
+Minimum manual log checks:
+
+```bash
+kubectl logs -n default -l app.kubernetes.io/instance=gitea --tail=200
+kubectl describe pod -n default -l app.kubernetes.io/instance=gitea
+kubectl get events -n default --sort-by=.lastTimestamp
+```
+
+What to look for:
+
+- repeated 5xx errors;
+- failed package uploads or downloads;
+- registry authentication loops;
+- database connection errors;
+- PVC mount, quota, or disk-pressure warnings;
+- runner registration failures or stuck jobs;
+- TLS/cert renewal failures at the ingress boundary.
+
+Dashboard expectations:
+
+- Current state: manual checks and `make gitea-status` are the authoritative
+  operator path.
+- Next state: forge should publish signal definitions that a future dashboard
+  can render without changing ownership boundaries.
+- A useful dashboard should show web/API, SSH, registry, PyPI, runner, database,
+  storage, and recent publish evidence in one view.
+- Dashboard absence is not a reason to skip evidence; keep recording manual
+  evidence until a durable view exists.
+
+## Release-Readiness Evidence
+
+Before an S5 app release cites a forge artifact as ready, forge evidence should
+include:
+
+- date and operator or automation id;
+- forge repo commit containing the active operating contract;
+- Gitea/Forgejo version or endpoint version response;
+- web/API check result;
+- Git SSH result or explicit `not exposed`;
+- container registry `/v2/` challenge result;
+- Python package endpoint result;
+- runner label and sample job result when automation produced the artifact;
+- source repo, commit SHA, package/image identity, and version/tag/digest;
+- package/blob storage usage check if the artifact is production-critical;
+- backup/restore evidence reference if production reliance depends on the
+  artifact being recoverable;
+- log review result for the relevant window;
+- known risks or missing signals.
+
+S5 may cite this evidence from an app runbook or workplan. S5 should not repeat
+forge-internal backup procedures, package tokens, runner tokens, or registry
+write credentials.
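One way to keep an evidence bundle honest is to check a recorded note for the unconditional fields listed above before citing it. The sketch below is hypothetical and not part of the commit: the contract does not define a file format, so the `key: value` layout, the key names in `REQUIRED_KEYS`, and the function `missing_evidence_keys` are assumptions for illustration. Conditional fields (runner job, storage check, backup/restore reference) are left out because they only apply to some artifacts.

```shell
# Hypothetical key names for the always-required evidence fields.
REQUIRED_KEYS="date operator forge_commit forge_version web_api git_ssh registry_v2 pypi artifact log_review known_risks"

# Usage: missing_evidence_keys FILE
# Prints one line per required key that has no non-empty "key: value" entry.
missing_evidence_keys() {
  for key in $REQUIRED_KEYS; do
    grep -Eq "^${key}:[[:space:]]*[^[:space:]]" "$1" || echo "$key"
  done
}

# Example:
# missing_evidence_keys evidence-2026-06-05.txt
# An entry such as "git_ssh: not exposed" counts as recorded, which matches the
# rule that an unexposed endpoint is stated explicitly rather than skipped.
```

The record should contain check results and references only; tokens, keys, and credentials stay out of it, as the contract requires.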
+
+## Alert And Intervention Rules
+
+Until centralized alerting exists, record a State Hub note or human intervention
+when any of these occur:
+
+- Gitea web/API endpoint returns 5xx or is unreachable.
+- `/v2/` no longer returns an OCI registry response.
+- PyPI endpoint returns 5xx or package install checks fail for a published
+  release package.
+- `make gitea-status` shows unavailable pods, missing Service/Ingress, or an
+  unhealthy `gitea-db`.
+- Package/blob usage crosses the warning or action thresholds above.
+- Runner labels required by S4 templates or S5 checks disappear.
+- A privileged runner label runs without a recorded trust/credential purpose.
+- Logs show repeated database, storage, registry, or runner failures.
+- Restore evidence is missing for a production-critical package/image/source
+  dependency.
+
+Use `needs_human=true` on the relevant State Hub task when the intervention
+requires secret custody, credential minting, production restore decisions, or
+live infrastructure changes.
+
+## Future Centralized Observability
+
+The stable split should be:
+
+- `railiance-forge` owns forge signal definitions, evidence requirements, and
+  runbook interpretation.
+- `railiance-platform` should own shared metrics/log storage, durable retention,
+  and platform service dashboards if observability remains an S3 capability.
+- `railiance-enablement` may own reusable dashboard templates, workflow evidence
+  templates, and developer-facing self-service views.
+- A future dedicated observability repo may own cross-domain dashboards, alert
+  routing, and log pipelines if Railiance chooses to separate that scope.
+
+Moving collection, dashboards, or alert routing out of this repo must not move
+the meaning of forge signals. Forge remains the source of truth for what counts
+as healthy source hosting, registry service, package service, runner substrate,
+and release artifact evidence.