
tegwick 29f893b905 Implement PMEM-WP-0015 credentialed live pilot with ops-warden routing.

Add credential routing advisories via warden route/access, live pilot evidence
helpers, managed deployment pilot probes, evaluation trend regression gates,
and expanded troubleshooting. Update operator runbook and maturity scorecard.

2026-07-02 23:24:35 +02:00

8.3 KiB


Operator Readiness Runbook

Updated: 2026-07-02

This runbook covers the operational path for phase-memory. The default test suite runs without credentials; credentials are needed only for the operator-run live drills described below.

Modes

Mode	Purpose	Credentials	Network
Local fixture	Default deterministic runtime and tests.	No	No
Live-shaped	Adapter manifests and behavior that model live services locally.	No	No
Credentialed live drill	Operator-provided smoke drill for real endpoints.	Yes, via env only	Optional

Credentialed drills require:

PHASE_MEMORY_MARKITECT_URL
PHASE_MEMORY_MARKITECT_TOKEN
PHASE_MEMORY_KONTEXTUAL_URL
PHASE_MEMORY_KONTEXTUAL_TOKEN

Obtain credentials through ops-warden routing (ops-warden itself does not vend secret values):

warden route find "phase-memory markitect kontextual api token" --json
warden access "phase-memory markitect kontextual api token" --json

Export the returned values into the drill shell only. Do not store those values in Git, workplans, progress logs, or release notes.
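Before starting a drill, it can help to confirm that all four variables are present without echoing their values. The following is a minimal sketch and not part of phase-memory; the helper name `missing_credential_vars` is hypothetical.

```python
import os

# The four variables named in this runbook.
REQUIRED_VARS = (
    "PHASE_MEMORY_MARKITECT_URL",
    "PHASE_MEMORY_MARKITECT_TOKEN",
    "PHASE_MEMORY_KONTEXTUAL_URL",
    "PHASE_MEMORY_KONTEXTUAL_TOKEN",
)


def missing_credential_vars(environ=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper: only names are returned, never values.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_credential_vars()` in the drill shell returns an empty list when everything is exported. Because it reports names only, its output is safe to paste into a progress log.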

Service Startup

The deployable stdlib entrypoint is phase-memory-service.

Readiness check without listening:

phase-memory-service --check --store .phase-memory-local

Start the stdlib WSGI service:

phase-memory-service --host 127.0.0.1 --port 8080 --store .phase-memory-local

Routes:

GET /health
GET /ready
GET /contracts
POST /operations/{operation}
POST /operations with {"operation": "...", "payload": {...}}
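As an illustration of the body-style route, the sketch below builds (but does not send) a `POST /operations` request with the standard library. The JSON content type and the use of `package.compile` as a service operation name are assumptions; the runbook only specifies the `{"operation": ..., "payload": ...}` body shape.

```python
import json
import urllib.request


def build_operation_request(base_url, operation, payload):
    """Build a POST /operations request carrying the operation name in the body."""
    body = json.dumps({"operation": operation, "payload": payload}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/operations",
        data=body,
        # Assumption: the service expects a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Build a request against the local service started above; nothing is sent here.
request = build_operation_request("http://127.0.0.1:8080", "package.compile", {})
# To send it: urllib.request.urlopen(request, timeout=5)
```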

Readiness Checks

Before accepting traffic:

Run phase-memory-service --check.
Verify /ready reports ok: true.
Verify unsupported_operations is empty.
Verify adapter diagnostics have no error severity.
Verify the public API snapshot test passes after any operation/export change.
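The `/ready` checks above could be automated roughly as follows. This is a sketch under an assumed payload shape: `ok` and `unsupported_operations` are named in this runbook, while the `diagnostics` list with a `severity` field is an assumption about how adapter diagnostics are reported.

```python
def readiness_problems(ready_payload):
    """Return a list of human-readable problems; an empty list means ready.

    Assumed payload shape (not confirmed by the runbook):
    {"ok": bool, "unsupported_operations": [...], "diagnostics": [{"severity": ...}]}
    """
    problems = []
    if ready_payload.get("ok") is not True:
        problems.append("/ready did not report ok: true")
    if ready_payload.get("unsupported_operations"):
        problems.append("unsupported_operations is not empty")
    errors = [
        diagnostic
        for diagnostic in ready_payload.get("diagnostics", [])
        if diagnostic.get("severity") == "error"
    ]
    if errors:
        problems.append(f"{len(errors)} diagnostic(s) with error severity")
    return problems
```

Returning a list of problems rather than a single boolean lets the operator see every failed check in one pass.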

Migration Apply

Plan and apply local-store metadata migrations through the runtime:

from phase_memory import RuntimeConfig, runtime_from_config

config = RuntimeConfig(local_store_path=".phase-memory-local")
runtime = runtime_from_config(config)
plan = runtime.plan_store_migration(source_ref=config.local_store_path)
result = runtime.apply_store_migration(
    plan["data"]["migration_plan"],
    actor="operator",
    source_ref=config.local_store_path,
)

Expected:

no error diagnostics in the plan;
result["valid"] is True;
metadata is updated atomically;
audit.query can find the store.migration.apply event.

Rollback:

stop the service;
restore the previous local store directory from backup;
rerun phase-memory-service --check;
rerun runtime.repair_diagnostics().
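The rollback steps depend on a backup of the store directory taken before the migration. One way to take and restore such a backup, sketched with the standard library only (the function names are hypothetical and phase-memory may ship its own tooling):

```python
import shutil
from pathlib import Path


def backup_store(store_dir, backup_dir):
    """Copy the store directory to backup_dir; raises if backup_dir already exists."""
    shutil.copytree(store_dir, backup_dir)


def restore_store(store_dir, backup_dir):
    """Replace the store directory with the backup copy.

    Stop the service before calling this, as the rollback steps require.
    """
    store = Path(store_dir)
    if store.exists():
        shutil.rmtree(store)
    shutil.copytree(backup_dir, store)
```

For example, `backup_store(".phase-memory-local", ".phase-memory-local.bak")` before applying the migration, and `restore_store(".phase-memory-local", ".phase-memory-local.bak")` during rollback. The backup path is illustrative.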

Audit Export And Retention

Plan retention:

# Reuses the `runtime` object built in the Migration Apply section.
plan = runtime.audit_retention_plan(retention_days=30)

Apply retention:

result = runtime.apply_audit_retention(plan["plan"])

Expected:

eligible operation ids are pruned;
audit.retention.apply is recorded after pruning;
no retention apply happens when the sink reports unsupported behavior.
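To make "eligible operation ids" concrete, the sketch below shows one plausible eligibility rule: an operation is eligible once its most recent event is older than the retention window. The event shape (`operation_id`, ISO 8601 `timestamp`) and the rule itself are assumptions for illustration, not the runtime's confirmed behavior.

```python
from datetime import datetime, timedelta, timezone


def eligible_operation_ids(events, retention_days, now=None):
    """Return sorted operation ids whose newest event predates the cutoff.

    Assumed event shape: {"operation_id": str, "timestamp": ISO 8601 with offset}.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    latest = {}
    for event in events:
        stamp = datetime.fromisoformat(event["timestamp"])
        op_id = event["operation_id"]
        # Track only the most recent event per operation.
        if op_id not in latest or stamp > latest[op_id]:
            latest[op_id] = stamp
    return sorted(op_id for op_id, stamp in latest.items() if stamp < cutoff)
```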

Export a trace batch:

export = runtime.export_audit_events({"operation": "package.compile"})

Use export batches for operator review, not as a credential or secret store.

Credentialed Drill

Resolve credential routing before running live drills:

from phase_memory import resolve_credentialed_environ, warden_credential_routing_advisory

advisory = warden_credential_routing_advisory()
status = resolve_credentialed_environ()

Run the credentialed smoke test only from an operator environment:

PHASE_MEMORY_MARKITECT_URL=... \
PHASE_MEMORY_MARKITECT_TOKEN=... \
PHASE_MEMORY_KONTEXTUAL_URL=... \
PHASE_MEMORY_KONTEXTUAL_TOKEN=... \
python3 -m pytest tests/test_credentialed_drills.py

The report redacts tokens and uses a credential fingerprint rather than persisting secrets.
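The runbook does not specify how the fingerprint is computed. As an illustration of the general idea only, a truncated SHA-256 digest identifies which token was used without making it recoverable; the function name and the 12-character length are hypothetical.

```python
import hashlib


def credential_fingerprint(token, length=12):
    """Return a short SHA-256 hex prefix that identifies a token without revealing it.

    Illustrative only: the scheme used by phase-memory is not documented here.
    """
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return digest[:length]
```

Two reports produced with the same token share a fingerprint, so an operator can confirm which credential a drill used without the secret ever being written to disk.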

Persist a redacted operator report from the same environment:

from phase_memory import write_credentialed_operator_report

write_credentialed_operator_report("reports/credentialed-operator-report.json")

Run the credentialed telemetry retention drill only when the required credentials are present or an operator has approved use of the local fixture path:

from phase_memory import credentialed_telemetry_retention_drill

report = credentialed_telemetry_retention_drill(operator_approved_fixture=True)

The drill records old and new audit events, plans retention, applies pruning, and reports retained/pruned operation ids without storing credential values.

Live Pilot Evidence

Collect credential-safe pilot artifacts for operator review:

import os

from phase_memory import write_live_pilot_evidence

write_live_pilot_evidence("reports/live-pilot", environ=os.environ)

Artifacts include:

live-pilot-report.json — aggregate pilot status and live_evidence flags
credentialed-operator-report.json — redacted smoke report
managed-deployment-pilot.json — manifest validation and probe results
telemetry-retention-evidence.json — retention apply audit trace
evaluation-trend-history.json — persisted trend artifacts
evaluation-regression-gate.json — operator regression gate
credential-routing-advisory.json — ops-warden routing without secrets
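A quick completeness check on the output directory might look like the sketch below. It assumes the artifacts are written directly into the directory passed to `write_live_pilot_evidence`; the helper name is hypothetical.

```python
from pathlib import Path

# Artifact file names as listed in this runbook.
EXPECTED_ARTIFACTS = (
    "live-pilot-report.json",
    "credentialed-operator-report.json",
    "managed-deployment-pilot.json",
    "telemetry-retention-evidence.json",
    "evaluation-trend-history.json",
    "evaluation-regression-gate.json",
    "credential-routing-advisory.json",
)


def missing_pilot_artifacts(report_dir):
    """Return the expected artifact names that are not present as files."""
    root = Path(report_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).is_file()]
```

For example, `missing_pilot_artifacts("reports/live-pilot")` returns an empty list when every artifact is present.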

Managed Deployment Manifest

Build and validate a deployment manifest before handing it to platform-specific packaging:

from phase_memory import (
    ServiceAppConfig,
    managed_deployment_manifest,
    validate_managed_deployment_manifest,
)

manifest = managed_deployment_manifest(
    ServiceAppConfig(host="0.0.0.0", port=8080, local_store_path="/var/lib/phase-memory")
)
validation = validate_managed_deployment_manifest(manifest)

Required manifest features:

phase-memory-service command entrypoint;
/health liveness probe;
/ready readiness probe;
writable local-store mount;
rollback checks that include phase-memory-service --check and runtime.repair_diagnostics.

Evaluation Trend History

Persist trend artifacts into a history file after evaluation runs:

from phase_memory import write_evaluation_trend_history

# `trend` is the trend artifact produced by the preceding evaluation run.
history = write_evaluation_trend_history("reports/evaluation-trend-history.json", trend)

Repeated writes of the same trend id do not duplicate the run.
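The idempotent-write behavior can be pictured with the following sketch. The history layout (`runs` list, `trend_id` key) is a hypothetical stand-in, since this runbook does not document the actual file format.

```python
def append_trend(history, trend):
    """Return a history with `trend` added unless its id is already recorded.

    Hypothetical layout: {"runs": [{"trend_id": ..., ...}, ...]}.
    """
    runs = list(history.get("runs", []))
    if all(run["trend_id"] != trend["trend_id"] for run in runs):
        runs.append(trend)
    return {**history, "runs": runs}
```

Keying on the trend id means a retried evaluation job can write its result again without inflating the run count that later comparisons rely on.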

Gate promotion on evaluation regressions:

from phase_memory import evaluation_trend_regression_gate, load_evaluation_trend_history

history = load_evaluation_trend_history("reports/evaluation-trend-history.json")
gate = evaluation_trend_regression_gate(history)

Compare the latest artifact metrics in evaluation-trend-history.json against those of the previous run. Block promotion when metric_regressions or threshold_failures are non-empty.
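The blocking rule can be expressed as a small predicate. This sketch assumes the gate result is a dict exposing `metric_regressions` and `threshold_failures` at the top level; the real return shape of `evaluation_trend_regression_gate` may nest them differently.

```python
def promotion_blocked(gate):
    """Return True when the gate reports any regression or threshold failure.

    Assumption: both keys sit at the top level of the gate result.
    """
    return bool(gate.get("metric_regressions")) or bool(gate.get("threshold_failures"))
```

A deployment script could call `promotion_blocked(gate)` and exit non-zero when it returns `True`.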

Troubleshooting Matrix

Category	Diagnostic	Operator action
Credentials	`credential_env_missing`	Set the four credential environment variables in the drill shell; do not write them to files.
Readiness	`unsupported_operation`	Run service contract and public API snapshot tests, then update dispatch or release notes.
Migrations	`store_migration_unsupported`	Use a file-backed local store or run repair diagnostics before accepting traffic.
Audit retention	`audit_retention_apply_unsupported`	Switch to a JSONL or telemetry audit sink with retention support, then rerun the retention drill.
Adapter manifest	`adapter_pack_manifest_invalid`	Regenerate and validate the adapter pack manifest before using the pack.
Credential routing	`warden_cli_unavailable`	Install warden from ops-warden, then run `warden route find` before exporting PHASE_MEMORY_* variables.
Deployment	`managed_deployment_probe_failed`	Run `phase-memory-service --check` and validate managed deployment manifest probes before promotion.
Evaluation	`evaluation_metric_regressed`	Compare latest and previous trend artifacts; inspect scenario diagnostics before release.
Pilot	`pilot_credentialed_env_missing`	Obtain credentials through ops-warden routing and rerun `write_live_pilot_evidence`.

Compatibility Release Discipline

When public exports or service operations change:

Update tests/fixtures/public-api-snapshot.json.
Fill in docs/release-note-template.md.
Call out changed exports, changed service operations, migration needs, and operator action.
Link the workplan or decision that authorized the change.
