# Operator Readiness Runbook

Updated: 2026-07-02

This runbook covers the operational path for `phase-memory` without requiring
credentials in the default test suite.

## Modes

| Mode | Purpose | Credentials | Network |
| --- | --- | --- | --- |
| Local fixture | Default deterministic runtime and tests. | No | No |
| Live-shaped | Adapter manifests and behavior that model live services locally. | No | No |
| Credentialed live drill | Operator-provided smoke drill for real endpoints. | Yes, via env only | Optional |

Credentialed drills require:

- `PHASE_MEMORY_MARKITECT_URL`
- `PHASE_MEMORY_MARKITECT_TOKEN`
- `PHASE_MEMORY_KONTEXTUAL_URL`
- `PHASE_MEMORY_KONTEXTUAL_TOKEN`

Obtain credentials through ops-warden routing; ops-warden itself does not vend
secret values:

```bash
warden route find "phase-memory markitect kontextual api token" --json
warden access "phase-memory markitect kontextual api token" --json
```

Export the returned values into the drill shell only. Do not store those values
in Git, workplans, progress logs, or release notes.

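Before starting a drill it can help to confirm that all four variables are present without ever echoing their values. The following is a minimal standalone sketch, not the `phase_memory` implementation; the helper name `missing_credential_vars` is hypothetical, and only the variable names come from the list above.

```python
import os

# Variable names are taken from the list above; everything else is illustrative.
REQUIRED_VARS = (
    "PHASE_MEMORY_MARKITECT_URL",
    "PHASE_MEMORY_MARKITECT_TOKEN",
    "PHASE_MEMORY_KONTEXTUAL_URL",
    "PHASE_MEMORY_KONTEXTUAL_TOKEN",
)


def missing_credential_vars(environ=None):
    """Return the names (never the values) of unset or blank variables."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

An empty result means the drill shell is fully populated. Reporting names only keeps the check safe to paste into progress logs.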
## Service Startup

The deployable stdlib entrypoint is `phase-memory-service`.

Readiness check without listening:

```bash
phase-memory-service --check --store .phase-memory-local
```

Start the stdlib WSGI service:

```bash
phase-memory-service --host 127.0.0.1 --port 8080 --store .phase-memory-local
```

Routes:

- `GET /health`
- `GET /ready`
- `GET /contracts`
- `POST /operations/{operation}`
- `POST /operations` with `{"operation": "...", "payload": {...}}`

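To illustrate the body shape accepted by `POST /operations`, here is a small sketch that builds the request with the standard library and leaves sending it to the operator. The helper name, the `Content-Type` header, and the use of `audit.query` with an empty payload are assumptions for illustration; check `GET /contracts` for the operations and payloads the service actually accepts.

```python
import json
import urllib.request


def build_operation_request(base_url, operation, payload):
    """Build (but do not send) a POST /operations request."""
    body = json.dumps({"operation": operation, "payload": payload}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/operations",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed, not confirmed
        method="POST",
    )


# Host and port match the startup example above; the payload is a placeholder.
request = build_operation_request("http://127.0.0.1:8080", "audit.query", {})

# To send it against a running service:
#     with urllib.request.urlopen(request, timeout=5) as response:
#         print(response.status, response.read().decode("utf-8"))
```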
## Readiness Checks

Before accepting traffic:

1. Run `phase-memory-service --check`.
2. Verify `/ready` reports `ok: true`.
3. Verify `unsupported_operations` is empty.
4. Verify adapter diagnostics have no `error` severity.
5. Verify the public API snapshot test passes after any operation/export change.

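Steps 2 to 4 above can be folded into one check over the decoded `/ready` response. This is a sketch under assumptions: the `ok` and `unsupported_operations` keys are named in this runbook, while the `diagnostics` list with `severity` and `code` fields is a guessed shape that should be adjusted to the real payload.

```python
def readiness_problems(ready_payload):
    """Return reasons not to accept traffic; an empty list means go."""
    problems = []
    if ready_payload.get("ok") is not True:
        problems.append("/ready did not report ok: true")
    unsupported = ready_payload.get("unsupported_operations") or []
    if unsupported:
        problems.append("unsupported operations: " + ", ".join(unsupported))
    # Assumed shape: a list of dicts, each with "severity" and "code" keys.
    for diagnostic in ready_payload.get("diagnostics") or []:
        if diagnostic.get("severity") == "error":
            problems.append("error diagnostic: " + str(diagnostic.get("code")))
    return problems
```

Returning a list of reasons rather than a boolean lets the operator see every failing check in one pass.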
## Migration Apply

Plan and apply local-store metadata migrations through the runtime:

```python
from phase_memory import RuntimeConfig, runtime_from_config

config = RuntimeConfig(local_store_path=".phase-memory-local")
runtime = runtime_from_config(config)
plan = runtime.plan_store_migration(source_ref=config.local_store_path)
result = runtime.apply_store_migration(
    plan["data"]["migration_plan"],
    actor="operator",
    source_ref=config.local_store_path,
)
```

Expected:

- no `error` diagnostics in the plan;
- `result["valid"] is True`;
- metadata is updated atomically;
- `audit.query` can find the `store.migration.apply` event.

Rollback:

- stop the service;
- restore the previous local store directory from backup;
- rerun `phase-memory-service --check`;
- rerun `runtime.repair_diagnostics()`.

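The rollback steps assume a backup of the store directory exists. A minimal sketch of taking and restoring such a backup with the standard library follows; the function names and the directory-copy approach are assumptions, and any site-specific backup tooling takes precedence.

```python
import shutil
from pathlib import Path


def backup_store(store_dir, backup_dir):
    """Copy the store directory to a fresh backup location before migrating."""
    # copytree raises FileExistsError if backup_dir already exists, so an
    # earlier backup is never silently overwritten.
    shutil.copytree(store_dir, backup_dir)
    return Path(backup_dir)


def restore_store(store_dir, backup_dir):
    """Replace the store directory with the backup (stop the service first)."""
    store = Path(store_dir)
    if store.exists():
        shutil.rmtree(store)
    shutil.copytree(backup_dir, store)
    return store
```

After `restore_store`, continue with `phase-memory-service --check` and `runtime.repair_diagnostics()` as listed above.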
## Audit Export And Retention

Plan retention:

```python
plan = runtime.audit_retention_plan(retention_days=30)
```

Apply retention:

```python
result = runtime.apply_audit_retention(plan["plan"])
```

Expected:

- eligible operation ids are pruned;
- `audit.retention.apply` is recorded after pruning;
- no retention apply happens when the sink reports unsupported behavior.

Export a trace batch:

```python
export = runtime.export_audit_events({"operation": "package.compile"})
```

Use export batches for operator review, not as a credential or secret store.

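One way to back up the rule that export batches must not carry secrets is to scan a batch for the current token values before sharing it. The sketch below is an illustrative guard, not part of `phase_memory`; the helper name is hypothetical, and it assumes the batch is JSON-serializable.

```python
import json
import os

# Only the token variables hold secrets; the URL variables are left out.
SECRET_VARS = ("PHASE_MEMORY_MARKITECT_TOKEN", "PHASE_MEMORY_KONTEXTUAL_TOKEN")


def leaked_secret_names(export_batch, environ=None):
    """Return names of token variables whose values appear in the batch."""
    env = os.environ if environ is None else environ
    # Limitation: tokens containing quotes or backslashes are escaped by
    # json.dumps and would not match this plain substring search.
    serialized = json.dumps(export_batch, default=str)
    return [
        name
        for name in SECRET_VARS
        if env.get(name) and env[name] in serialized
    ]
```

A non-empty result means the batch should not leave the operator environment. The function reports variable names only, so its own output is safe to log.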
## Credentialed Drill

Resolve credential routing before running live drills:

```python
from phase_memory import resolve_credentialed_environ, warden_credential_routing_advisory

advisory = warden_credential_routing_advisory()
status = resolve_credentialed_environ()
```

Run the credentialed smoke test only from an operator environment:

```bash
PHASE_MEMORY_MARKITECT_URL=... \
PHASE_MEMORY_MARKITECT_TOKEN=... \
PHASE_MEMORY_KONTEXTUAL_URL=... \
PHASE_MEMORY_KONTEXTUAL_TOKEN=... \
python3 -m pytest tests/test_credentialed_drills.py
```

The report redacts tokens and uses a credential fingerprint rather than
persisting secrets.

Persist a redacted operator report from the same environment:

```python
from phase_memory import write_credentialed_operator_report

write_credentialed_operator_report("reports/credentialed-operator-report.json")
```

Run the credentialed telemetry retention drill when either an operator has
approved the local fixture path or the required credentials are present:

```python
from phase_memory import credentialed_telemetry_retention_drill

report = credentialed_telemetry_retention_drill(operator_approved_fixture=True)
```

The drill records old and new audit events, plans retention, applies pruning,
and reports retained/pruned operation ids without storing credential values.

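The reports in this section record a credential fingerprint instead of the token itself. The sketch below shows one common way to do that, a truncated SHA-256 digest. This is an assumption for illustration: the runbook does not say which scheme `phase_memory` uses, and both function names are hypothetical.

```python
import hashlib


def credential_fingerprint(secret, length=12):
    """Return a short, non-reversible identifier for a secret value."""
    # An unsalted hash is only appropriate for high-entropy tokens; a
    # guessable secret could be recovered by brute force.
    digest = hashlib.sha256(secret.encode("utf-8")).hexdigest()
    return "sha256:" + digest[:length]


def redact_environ(environ, secret_suffix="_TOKEN"):
    """Copy an environment mapping, replacing token values with fingerprints."""
    return {
        name: credential_fingerprint(value) if name.endswith(secret_suffix) else value
        for name, value in environ.items()
    }
```

Two reports produced with the same token share a fingerprint, which lets an operator confirm that the same credential was used without seeing it.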
## Live Pilot Evidence

Collect credential-safe pilot artifacts for operator review:

```python
import os

from phase_memory import write_live_pilot_evidence

write_live_pilot_evidence("reports/live-pilot", environ=os.environ)
```

Artifacts include:

- `live-pilot-report.json`: aggregate pilot status and `live_evidence` flags
- `credentialed-operator-report.json`: redacted smoke report
- `managed-deployment-pilot.json`: manifest validation and probe results
- `telemetry-retention-evidence.json`: retention apply audit trace
- `evaluation-trend-history.json`: persisted trend artifacts
- `evaluation-regression-gate.json`: operator regression gate
- `credential-routing-advisory.json`: ops-warden routing without secrets

## Managed Deployment Manifest

Build and validate a deployment manifest before handing it to platform-specific
packaging:

```python
from phase_memory import managed_deployment_manifest, validate_managed_deployment_manifest
from phase_memory import ServiceAppConfig

manifest = managed_deployment_manifest(
    ServiceAppConfig(host="0.0.0.0", port=8080, local_store_path="/var/lib/phase-memory")
)
validation = validate_managed_deployment_manifest(manifest)
```

Required manifest features:

- `phase-memory-service` command entrypoint;
- `/health` liveness probe;
- `/ready` readiness probe;
- writable local-store mount;
- rollback checks that include `phase-memory-service --check` and
  `runtime.repair_diagnostics`.

## Evaluation Trend History

Persist trend artifacts into a history file after evaluation runs:

```python
from phase_memory import write_evaluation_trend_history

# `trend` is the trend artifact produced by the preceding evaluation run.
history = write_evaluation_trend_history("reports/evaluation-trend-history.json", trend)
```

Repeated writes of the same trend id do not duplicate the run.

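The deduplication behavior can be pictured as an append keyed by run id. The following sketch is not the `write_evaluation_trend_history` implementation; the `{"runs": [...]}` file layout, the `id` key, and the function name `append_trend` are assumptions chosen to make the idea concrete.

```python
import json
from pathlib import Path


def append_trend(history_path, trend):
    """Append a trend run to a JSON history file unless its id is present."""
    path = Path(history_path)
    # Assumed layout: {"runs": [{"id": ..., ...}, ...]}
    history = json.loads(path.read_text()) if path.exists() else {"runs": []}
    known_ids = {run["id"] for run in history["runs"]}
    if trend["id"] not in known_ids:
        history["runs"].append(trend)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(history, indent=2))
    return history
```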
Gate promotion on evaluation regressions:

```python
from phase_memory import evaluation_trend_regression_gate, load_evaluation_trend_history

history = load_evaluation_trend_history("reports/evaluation-trend-history.json")
gate = evaluation_trend_regression_gate(history)
```

Compare the latest artifact metrics in `evaluation-trend-history.json` against
the previous run id. Block promotion when `metric_regressions` or
`threshold_failures` are non-empty.

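The comparison described above can be sketched as follows. The result keys `metric_regressions` and `threshold_failures` come from this runbook; the rest is assumed for illustration, in particular that metrics are plain name-to-number mappings, that higher values are better, and that thresholds are minimums. The real `evaluation_trend_regression_gate` may differ.

```python
def regression_gate(previous_metrics, latest_metrics, thresholds=None):
    """Compare two runs, treating higher metric values as better."""
    thresholds = thresholds or {}
    # A metric regresses when it exists in both runs and its value dropped.
    metric_regressions = sorted(
        name
        for name, value in latest_metrics.items()
        if name in previous_metrics and value < previous_metrics[name]
    )
    # A missing metric counts as failing its minimum threshold.
    threshold_failures = sorted(
        name
        for name, minimum in thresholds.items()
        if latest_metrics.get(name, float("-inf")) < minimum
    )
    return {
        "metric_regressions": metric_regressions,
        "threshold_failures": threshold_failures,
        "promote": not metric_regressions and not threshold_failures,
    }
```

Keeping the two lists separate lets an operator tell a drop relative to the previous run apart from a failure against a fixed floor.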
## Troubleshooting Matrix

| Category | Diagnostic | Operator action |
| --- | --- | --- |
| Credentials | `credential_env_missing` | Set the four credential environment variables in the drill shell; do not write them to files. |
| Readiness | `unsupported_operation` | Run service contract and public API snapshot tests, then update dispatch or release notes. |
| Migrations | `store_migration_unsupported` | Use a file-backed local store or run repair diagnostics before accepting traffic. |
| Audit retention | `audit_retention_apply_unsupported` | Switch to a JSONL or telemetry audit sink with retention support, then rerun the retention drill. |
| Adapter manifest | `adapter_pack_manifest_invalid` | Regenerate and validate the adapter pack manifest before using the pack. |
| Credential routing | `warden_cli_unavailable` | Install warden from ops-warden, then run `warden route find` before exporting `PHASE_MEMORY_*` variables. |
| Deployment | `managed_deployment_probe_failed` | Run `phase-memory-service --check` and validate managed deployment manifest probes before promotion. |
| Evaluation | `evaluation_metric_regressed` | Compare latest and previous trend artifacts; inspect scenario diagnostics before release. |
| Pilot | `pilot_credentialed_env_missing` | Obtain credentials through ops-warden routing and rerun `write_live_pilot_evidence`. |

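For scripted triage, the matrix can be carried as a simple lookup from diagnostic code to operator action. This is a convenience sketch only: the codes are copied from the table above, the action strings are condensed from it, and the helper name `actions_for` is hypothetical.

```python
# Codes come from the troubleshooting matrix; actions are shortened summaries.
OPERATOR_ACTIONS = {
    "credential_env_missing": "Set the four credential variables in the drill shell.",
    "unsupported_operation": "Run service contract and public API snapshot tests.",
    "store_migration_unsupported": "Use a file-backed local store or run repair diagnostics.",
    "audit_retention_apply_unsupported": "Switch to an audit sink with retention support.",
    "adapter_pack_manifest_invalid": "Regenerate and validate the adapter pack manifest.",
    "warden_cli_unavailable": "Install warden, then run `warden route find`.",
    "managed_deployment_probe_failed": "Run `phase-memory-service --check` and validate probes.",
    "evaluation_metric_regressed": "Compare latest and previous trend artifacts.",
    "pilot_credentialed_env_missing": "Obtain credentials via ops-warden routing and rerun.",
}


def actions_for(diagnostic_codes):
    """Map diagnostic codes to actions; unknown codes are flagged, not dropped."""
    return {
        code: OPERATOR_ACTIONS.get(code, "No entry in the troubleshooting matrix.")
        for code in diagnostic_codes
    }
```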
## Compatibility Release Discipline

When public exports or service operations change:

1. Update `tests/fixtures/public-api-snapshot.json`.
2. Fill in `docs/release-note-template.md`.
3. Call out changed exports, changed service operations, migration needs, and
   operator action.
4. Link the workplan or decision that authorized the change.