Files
phase-memory/docs/operator-readiness-runbook.md
tegwick feb60c1d96 feat(PMEM-WP-0017): federated store management CLI and activity reports
Add phase_memory.management for cross-store discovery and windowed
activity reporting. Extend the phase-memory CLI with stores list and
report commands, plus make report-7d for the default weekly operator view.
2026-07-03 01:24:16 +02:00

277 lines
8.7 KiB
Markdown

# Operator Readiness Runbook
Updated: 2026-07-02
This runbook covers the operational path for `phase-memory` without requiring
credentials in the default test suite.
## Modes
| Mode | Purpose | Credentials | Network |
| --- | --- | --- | --- |
| Local fixture | Default deterministic runtime and tests. | No | No |
| Live-shaped | Adapter manifests and behavior that model live services locally. | No | No |
| Credentialed live drill | Operator-provided smoke drill for real endpoints. | Yes, via env only | Optional |
Credentialed drills require:
- `PHASE_MEMORY_MARKITECT_URL`
- `PHASE_MEMORY_MARKITECT_TOKEN`
- `PHASE_MEMORY_KONTEXTUAL_URL`
- `PHASE_MEMORY_KONTEXTUAL_TOKEN`
Obtain credentials through ops-warden routing — ops-warden does not vend
secret values:
```bash
warden route find "phase-memory markitect kontextual api token" --json
warden access "phase-memory markitect kontextual api token" --json
```
Export the returned values into the drill shell only. Do not store those values
in Git, workplans, progress logs, or release notes.
## Service Startup
The deployable stdlib entrypoint is `phase-memory-service`.
Readiness check without listening:
```bash
phase-memory-service --check --store .phase-memory-local
```
Start the stdlib WSGI service:
```bash
phase-memory-service --host 127.0.0.1 --port 8080 --store .phase-memory-local
```
Routes:
- `GET /health`
- `GET /ready`
- `GET /contracts`
- `POST /operations/{operation}`
- `POST /operations` with `{"operation": "...", "payload": {...}}`
## Cross-Store Activity
Use the phase-memory CLI as the canonical cross-store activity check:
```bash
phase-memory stores list --format summary
phase-memory report --days 7 --format summary
make report-7d
```
Federated reports aggregate ops-warden coordination episodes and local graph
store events/audit lines for the requested window. Drill into one store with
`phase-memory report --store <store_id-or-path> --days N`.
## Readiness Checks
Before accepting traffic:
1. Run `phase-memory-service --check`.
2. Verify `/ready` reports `ok: true`.
3. Verify `unsupported_operations` is empty.
4. Verify adapter diagnostics have no `error` severity.
5. Verify the public API snapshot test passes after any operation/export change.
## Migration Apply
Plan and apply local-store metadata migrations through the runtime:
```python
from phase_memory import RuntimeConfig, runtime_from_config
config = RuntimeConfig(local_store_path=".phase-memory-local")
runtime = runtime_from_config(config)
plan = runtime.plan_store_migration(source_ref=config.local_store_path)
result = runtime.apply_store_migration(
plan["data"]["migration_plan"],
actor="operator",
source_ref=config.local_store_path,
)
```
Expected:
- no `error` diagnostics in the plan;
- `result["valid"] is True`;
- metadata is updated atomically;
- `audit.query` can find the `store.migration.apply` event.
Rollback:
- stop the service;
- restore the previous local store directory from backup;
- rerun `phase-memory-service --check`;
- rerun `runtime.repair_diagnostics()`.
## Audit Export And Retention
Plan retention:
```python
plan = runtime.audit_retention_plan(retention_days=30)
```
Apply retention:
```python
result = runtime.apply_audit_retention(plan["plan"])
```
Expected:
- eligible operation ids are pruned;
- `audit.retention.apply` is recorded after pruning;
- no retention apply happens when the sink reports unsupported behavior.
Export a trace batch:
```python
export = runtime.export_audit_events({"operation": "package.compile"})
```
Use export batches for operator review, not as a credential or secret store.
## Credentialed Drill
Resolve credential routing before running live drills:
```python
from phase_memory import resolve_credentialed_environ, warden_credential_routing_advisory
advisory = warden_credential_routing_advisory()
status = resolve_credentialed_environ()
```
Run the credentialed smoke test only from an operator environment:
```bash
PHASE_MEMORY_MARKITECT_URL=... \
PHASE_MEMORY_MARKITECT_TOKEN=... \
PHASE_MEMORY_KONTEXTUAL_URL=... \
PHASE_MEMORY_KONTEXTUAL_TOKEN=... \
python3 -m pytest tests/test_credentialed_drills.py
```
The report redacts tokens and uses a credential fingerprint rather than
persisting secrets.
Persist a redacted operator report from the same environment:
```python
from phase_memory import write_credentialed_operator_report
write_credentialed_operator_report("reports/credentialed-operator-report.json")
```
Run the credentialed telemetry retention drill when an operator has approved
using the local fixture path or the required credentials are present:
```python
from phase_memory import credentialed_telemetry_retention_drill
report = credentialed_telemetry_retention_drill(operator_approved_fixture=True)
```
The drill records old and new audit events, plans retention, applies pruning,
and reports retained/pruned operation ids without storing credential values.
## Live Pilot Evidence
Collect credential-safe pilot artifacts for operator review:
```python
from phase_memory import write_live_pilot_evidence
write_live_pilot_evidence("reports/live-pilot", environ=os.environ)
```
Artifacts include:
- `live-pilot-report.json` — aggregate pilot status and live_evidence flags
- `credentialed-operator-report.json` — redacted smoke report
- `managed-deployment-pilot.json` — manifest validation and probe results
- `telemetry-retention-evidence.json` — retention apply audit trace
- `evaluation-trend-history.json` — persisted trend artifacts
- `evaluation-regression-gate.json` — operator regression gate
- `credential-routing-advisory.json` — ops-warden routing without secrets
## Managed Deployment Manifest
Build and validate a deployment manifest before handing it to platform-specific
packaging:
```python
from phase_memory import managed_deployment_manifest, validate_managed_deployment_manifest
from phase_memory import ServiceAppConfig
manifest = managed_deployment_manifest(
ServiceAppConfig(host="0.0.0.0", port=8080, local_store_path="/var/lib/phase-memory")
)
validation = validate_managed_deployment_manifest(manifest)
```
Required manifest features:
- `phase-memory-service` command entrypoint;
- `/health` liveness probe;
- `/ready` readiness probe;
- writable local-store mount;
- rollback checks that include `phase-memory-service --check` and
`runtime.repair_diagnostics`.
## Evaluation Trend History
Persist trend artifacts into a history file after evaluation runs:
```python
from phase_memory import write_evaluation_trend_history
history = write_evaluation_trend_history("reports/evaluation-trend-history.json", trend)
```
Repeated writes of the same trend id do not duplicate the run.
Gate promotion on evaluation regressions:
```python
from phase_memory import evaluation_trend_regression_gate, load_evaluation_trend_history
history = load_evaluation_trend_history("reports/evaluation-trend-history.json")
gate = evaluation_trend_regression_gate(history)
```
Compare the latest artifact metrics in `evaluation-trend-history.json` against
the previous run id. Block promotion when `metric_regressions` or
`threshold_failures` are non-empty.
## Troubleshooting Matrix
| Category | Diagnostic | Operator action |
| --- | --- | --- |
| Credentials | `credential_env_missing` | Set the four credential environment variables in the drill shell; do not write them to files. |
| Readiness | `unsupported_operation` | Run service contract and public API snapshot tests, then update dispatch or release notes. |
| Migrations | `store_migration_unsupported` | Use a file-backed local store or run repair diagnostics before accepting traffic. |
| Audit retention | `audit_retention_apply_unsupported` | Switch to a JSONL or telemetry audit sink with retention support, then rerun the retention drill. |
| Adapter manifest | `adapter_pack_manifest_invalid` | Regenerate and validate the adapter pack manifest before using the pack. |
| Credential routing | `warden_cli_unavailable` | Install warden from ops-warden, then run `warden route find` before exporting PHASE_MEMORY_* variables. |
| Deployment | `managed_deployment_probe_failed` | Run `phase-memory-service --check` and validate managed deployment manifest probes before promotion. |
| Evaluation | `evaluation_metric_regressed` | Compare latest and previous trend artifacts; inspect scenario diagnostics before release. |
| Pilot | `pilot_credentialed_env_missing` | Obtain credentials through ops-warden routing and rerun `write_live_pilot_evidence`. |
## Compatibility Release Discipline
When public exports or service operations change:
1. Update `tests/fixtures/public-api-snapshot.json`.
2. Fill in `docs/release-note-template.md`.
3. Call out changed exports, changed service operations, migration needs, and
operator action.
4. Link the workplan or decision that authorized the change.