Compare commits

2 Commits

Author SHA1 Message Date
f3ca5b9c3a Normalize agent instructions and workplan frontmatter (STATE-WP-0067)
- Align agent files with on-disk workplan prefixes (infer from workplan ids)
- Set workplan domain to registered domain_slug; add topic_slug where applicable
- Repair frontmatter delimiter formatting; migrate legacy task status literals
- Regenerate AGENTS.md, CLAUDE.md, and .claude/rules from State Hub templates
2026-06-22 23:16:28 +02:00
129a229e38 Seeded intent and wiki pages 2026-06-22 19:09:24 +02:00
18 changed files with 1113 additions and 21 deletions

20
.claude/rules/agents.md Normal file
@@ -0,0 +1,20 @@
## Kaizen Agents
Specialized agent personas available on demand via the state-hub MCP.
**Discover:** `list_kaizen_agents()` — returns all agents with name, description, category
**Load:** `get_kaizen_agent("tdd-workflow")` — returns full instructions; read and follow them
Common agents:
| Agent | Category | When to use |
|-------|----------|-------------|
| `tdd-workflow` | testing | Step-by-step TDD8 workflow for any feature |
| `code-refactoring` | quality | Code quality analysis and safe refactoring |
| `test-maintenance` | testing | Diagnose and fix failing tests |
| `requirements-engineering` | process | Prevent interface/mock mismatches upfront |
| `keepaTodofile` | process | Maintain TODO.md during work |
| `project-management` | process | Track status, determine next steps |
| `datamodel-optimization` | quality | Optimize dataclasses and data structures |
All 17 agents: call `list_kaizen_agents()` for the full list.

@@ -0,0 +1,8 @@
## Architecture
<!-- TODO: Describe the key design decisions and component structure.
Key modules, data flows, external integrations, state machines, etc. -->
## Quick Reference
`~/state-hub/mcp_server/TOOLS.md` — MCP tool reference

@@ -0,0 +1,50 @@
# Credential and access routing
**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
for inference. Run this check **before** requesting secrets, API keys, SSH access,
login tokens, or database passwords — in any repo, not only `ops-warden`.
ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
other credential need belongs to another subsystem. **Do not** message
`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.
### Lookup (do this first)
```bash
warden route find "<describe your need>" --json
warden route show <catalog-id> --json
```
Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).
| Agent runtime | How to orient |
| --- | --- |
| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=tele-mcp` is for coordination, not secret vending |
| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |
### Quick routing table
| I need… | Owner | ops-warden executes? |
| --- | --- | --- |
| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
| Authorization decision | flex-auth | No — route only |
| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |
### Anti-patterns (do not do these)
- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
- Pasting secrets into Git, State Hub, workplans, logs, or chat
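As a hedged illustration of the second anti-pattern, the sketch below checks a command line against the only `warden` subcommands named in this file (`sign`, `route find`, `route show`). The helper name `is_documented_warden_call` and the allow-list are assumptions made for illustration; the real CLI may expose other subcommands, so treat its own help output as the authority.

```python
import shlex

# Assumption: only the subcommands mentioned in this file are listed here.
DOCUMENTED_SUBCOMMANDS = {("sign",), ("route", "find"), ("route", "show")}


def is_documented_warden_call(command: str) -> bool:
    """Return True if the command starts with a subcommand named in this file."""
    parts = shlex.split(command)
    if not parts or parts[0] != "warden":
        return False
    # Ignore flags such as --json when matching the subcommand path.
    args = [part for part in parts[1:] if not part.startswith("-")]
    return any(tuple(args[: len(sub)]) == sub for sub in DOCUMENTED_SUBCOMMANDS)


ok = is_documented_warden_call('warden route find "ssh cert for agent" --json')
invented = is_documented_warden_call("warden secret get OPENROUTER_API_KEY")
```

A guard like this only catches invented subcommands; it does not replace the `warden route` lookup itself.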
### Other capabilities (reuse-surface)
Non-credential capabilities are usually discovered through **reuse-surface** federation
(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
every repo's agent instructions because it is high-frequency, high-risk, and easy to
get wrong.
**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`

@@ -0,0 +1,38 @@
## First Session Protocol
Triggered when `get_domain_summary("infotech")` shows **no workstreams**.
The project is registered but work has not yet been structured.
**Step 1 — Read, don't write**
- `~/the-custodian/canon/projects/infotech/project_charter_v0.1.md` — purpose, scope
- `~/the-custodian/canon/projects/infotech/roadmap_v0.1.md` — planned phases
- Scan repo root: README, directory structure, existing code or docs
**Step 2 — Survey in-progress work**
Look for TODOs, open branches, and half-finished files. Note which work is done and which is started but incomplete.
**Step 3 — Propose workstreams to Bernd**
Propose 1–3 workstreams, each a coherent strand lasting weeks to months and anchored to a
roadmap phase. **Wait for approval before creating.**
**Step 4 — Create workplan file first, then DB record (ADR-001)**
```
workplans/TELE-WP-NNNN-<slug>.md ← write this first
```
Then register in the hub:
```
create_workstream(topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a", title="...", owner="...", description="...")
create_task(workstream_id="<id>", title="...", priority="high|medium|low")
```
**Step 5 — Record the setup**
```
add_progress_event(
summary="First session: structured infotech into N workstreams, M tasks",
event_type="milestone",
topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a",
detail={"workstreams": [...], "tasks_created": M}
)
```
<!-- Delete or archive this file once past first session -->

@@ -0,0 +1,8 @@
## Repo boundary
This repo owns **tele-mcp** only. It does not own:
<!-- TODO: List what belongs in adjacent repos, e.g.:
- SSH key management → railiance-infra/
- State hub code → state-hub/
-->

@@ -0,0 +1,5 @@
**Purpose:** **Mission control for Kubernetes hosts, exposed to LLM agents through MCP.**
**Domain:** infotech
**Repo slug:** tele-mcp
**Topic ID:** cee7bedf-2b48-46ef-8601-006474f2ad7a

@@ -0,0 +1,85 @@
## Session Protocol
Dev Hub (State Hub API): http://127.0.0.1:8000
MCP server name in `~/.claude.json`: `dev-hub`
**Step 1 — Orient**
Read the offline-safe brief first — it works without a live hub connection:
```bash
cat .custodian-brief.md
```
Then call the MCP tool for richer cross-domain context when MCP tools are exposed:
```
get_domain_summary("infotech")
```
If MCP tools are unavailable in the current agent session, use the REST API:
```bash
curl -s "http://127.0.0.1:8000/state/summary" | python3 -m json.tool
```
If the hub is offline: `cd ~/state-hub && make api`
**Step 2 — Check inbox**
With MCP tools:
```
get_messages(to_agent="tele-mcp", unread_only=True)
```
Mark read with `mark_message_read(message_id)`. Reply or act on coordination
requests before proceeding.
Without MCP tools:
```bash
curl -s "http://127.0.0.1:8000/messages/?to_agent=tele-mcp&unread_only=true" \
| python3 -m json.tool
curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
-H "Content-Type: application/json" -d '{}'
```
**Step 3 — Scan workplans**
```bash
ls workplans/
```
For each file with `status: ready`, `active`, or `blocked`, note pending
`wait`/`todo`/`progress` tasks.
**Step 4 — Present brief**
1. **Active workstreams** for `infotech` — title, task counts, blocking decisions
2. **Pending tasks** from `workplans/` + any `[repo:tele-mcp]` hub tasks
3. **Goal guidance** — if `goal_guidance` in summary:
- `needs_workplan`: surface as top action — *"Repo goal '{title}' has no workplan yet"*
- `alignment_warnings`: flag if active work is not aligned with current goal
4. **Suggested next action** — highest-priority open item
5. **SBOM status** — flag if `last_sbom_at` is unset for this repo
If no workstreams: follow First Session Protocol (`first-session.md`).
**During work:** `record_decision()` · `add_progress_event()` · `resolve_decision()`
> State Hub is a *read model*. Bootstrap tools (`create_workstream`, `create_task`)
> are First Session Protocol only. Work structure belongs in repo files (ADR-001).
**Session close:**
With MCP tools:
```
add_progress_event(summary="...", topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a", workstream_id="<uuid>")
```
Without MCP tools:
```bash
curl -s -X POST http://127.0.0.1:8000/progress/ \
-H "Content-Type: application/json" \
-d '{"topic_id":"cee7bedf-2b48-46ef-8601-006474f2ad7a","workstream_id":"<uuid>","event_type":"note","summary":"what changed","author":"codex"}'
```
If workplan files were modified, ensure the local copy is up to date first:
```bash
git -C <repo_path> pull --ff-only
cd ~/state-hub && make fix-consistency REPO=tele-mcp
```
For repos where implementation runs on a remote machine (e.g. CoulombCore),
use the combined target which pulls before fixing:
```bash
cd ~/state-hub && make fix-consistency-remote REPO=tele-mcp
```
**C-15** (DB task ahead of file) is normal in multi-machine workflows — writeback
will sync the file to match DB. **C-16** (repo behind remote) blocks all writes
until you pull — intentional to prevent clobbering remote progress.

@@ -0,0 +1,19 @@
## Stack
<!-- TODO: Fill in language, frameworks, and key dependencies -->
- **Language:**
- **Key deps:**
## Dev Commands
```bash
# TODO: Fill in the standard commands for this repo
# Install dependencies
# Run tests
# Lint / type check
# Build / package (if applicable)
```

@@ -0,0 +1,40 @@
## Workplan Convention (ADR-001)
File location: `workplans/TELE-WP-NNNN-<slug>.md`
ID prefix: `TELE-WP-`
Work items originate as files in this repo **before** being registered in the hub.
Canonical workplan/workstream frontmatter statuses are:
`proposed`, `ready`, `active`, `blocked`, `backlog`, `finished`, `archived`.
Use `proposed` for a newly drafted plan, `ready` after review against current
repo state, and `finished` when implementation is complete. `stalled` and
`needs_review` are derived health labels, not stored statuses.
Closed workplans may be moved to `workplans/archived/` with a completion-date
prefix: `YYMMDD-TELE-WP-NNNN-<slug>.md`. The frontmatter id remains
unchanged; the prefix is only for quick visual reference.
Small opportunistic tasks discovered during another session use **Ad Hoc Tasks**:
`workplans/ADHOC-YYYY-MM-DD.md`, workstream slug `adhoc-YYYY-MM-DD`, and task ids
`ADHOC-YYYY-MM-DD-T01`, `T02`, etc. Use adhocs only for low-risk work completed
directly. Promote anything requiring analysis, design, approval, dependencies, or
multiple planned phases into a normal workplan.
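The Ad Hoc naming rules above can be sketched as a small helper. This is a minimal illustration, assuming the date is the session date; `adhoc_names` is a hypothetical function, not something State Hub provides.

```python
from datetime import date


def adhoc_names(day: date, task_count: int) -> dict:
    """Derive the Ad Hoc file path, workstream slug, and task ids for one day."""
    stamp = day.isoformat()  # YYYY-MM-DD
    return {
        "file": f"workplans/ADHOC-{stamp}.md",
        "workstream_slug": f"adhoc-{stamp}",
        # Task ids are numbered T01, T02, ... with two digits.
        "task_ids": [f"ADHOC-{stamp}-T{n:02d}" for n in range(1, task_count + 1)],
    }


names = adhoc_names(date(2026, 6, 22), 2)
```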
Ecosystem todos from other agents arrive as `[repo:tele-mcp]` hub tasks —
visible at session start. Pick one up by creating the workplan file, then registering
the workstream.
Task blocks use this shape:
```task
id: TELE-WP-NNNN-T01
status: wait | todo | progress | done | cancel
priority: high | medium | low
state_hub_task_id: "<uuid>" # written by fix-consistency — do not edit
```
Status progression is `todo` → `progress` → `done`; use `wait` for waiting or
blocked work and `cancel` for stopped work.
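A minimal sketch of how a task block in this shape could be checked, assuming the block body is plain `key: value` lines with optional trailing `#` comments. `parse_task_block` and the two allow-lists are hypothetical names for illustration; they are not part of State Hub or `fix-consistency`.

```python
# Allowed values copied from the task block shape above.
ALLOWED_STATUSES = {"wait", "todo", "progress", "done", "cancel"}
ALLOWED_PRIORITIES = {"high", "medium", "low"}


def parse_task_block(body: str) -> dict:
    """Parse the key/value lines found inside a task fence and validate them."""
    fields = {}
    for raw in body.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip().strip('"')
    if fields.get("status") not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {fields.get('status')!r}")
    if fields.get("priority") not in ALLOWED_PRIORITIES:
        raise ValueError(f"unknown priority: {fields.get('priority')!r}")
    return fields


example = """\
id: TELE-WP-0001-T01
status: todo
priority: high
"""
task = parse_task_block(example)
```

A derived label such as `stalled` would be rejected here, which matches the rule that health labels are never stored as statuses.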
<!-- Ralph Loop rules and HEUREKA sequence: ~/.claude/CLAUDE.md — do not duplicate here -->

27
.custodian-brief.md Normal file
@@ -0,0 +1,27 @@
<!-- custodian-brief: generated by statehub register; fix-consistency may replace this file -->
# Custodian Brief - tele-mcp
**Project:** tele-mcp
**Domain:** infotech
**State Hub:** http://127.0.0.1:8000
**Topic ID:** `cee7bedf-2b48-46ef-8601-006474f2ad7a`
## Open Workplans
### Bootstrap State Hub integration
Workplan file: `workplans/TELE-WP-0001-statehub-bootstrap.md`
Open tasks:
- T01 - Review generated integration files
- T02 - Verify local developer workflow
- T03 - Seed first real workplan
## Session Start
1. Read `INTENT.md`, `SCOPE.md`, and `AGENTS.md`.
2. Check inbox: `GET /messages/?to_agent=tele-mcp&unread_only=true`.
3. Scan `workplans/`.
4. Update task statuses in workplan files as work progresses.
Last generated: 2026-06-22

219
AGENTS.md Normal file
@@ -0,0 +1,219 @@
# tele-mcp — Agent Instructions
## Repo Identity
**Purpose:** **Mission control for Kubernetes hosts, exposed to LLM agents through MCP.**
**Domain:** infotech
**Repo slug:** tele-mcp
**Topic ID:** `cee7bedf-2b48-46ef-8601-006474f2ad7a`
**Workplan prefix:** `TELE-WP-`
---
## State Hub Integration
The Custodian State Hub tracks work across all domains. Interact via HTTP REST —
there is no MCP server for Codex agents.
| Context | URL |
|---------|-----|
| Local workstation | `http://127.0.0.1:8000` |
| Remote via tunnel | `http://127.0.0.1:18000` |
### Orient at session start
```bash
# Offline brief — works without hub connection
cat .custodian-brief.md
# Active workstreams for this domain
curl -s "http://127.0.0.1:8000/workstreams/?topic_id=cee7bedf-2b48-46ef-8601-006474f2ad7a&status=active" \
| python3 -m json.tool
# Check inbox
curl -s "http://127.0.0.1:8000/messages/?to_agent=tele-mcp&unread_only=true" \
| python3 -m json.tool
```
Mark a message read:
```bash
curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
-H "Content-Type: application/json" -d '{}'
```
### Log progress (required at session close)
```bash
curl -s -X POST http://127.0.0.1:8000/progress/ \
-H "Content-Type: application/json" \
-d '{
"summary": "what was done",
"event_type": "note",
"author": "codex",
"workstream_id": "<uuid>",
"task_id": "<uuid>"
}'
```
Omit `workstream_id` / `task_id` when not applicable.
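For agents that prefer Python over `curl`, the same request might be built as sketched below. This is an illustration only: `progress_request` is a hypothetical helper, the field names are taken from the `curl` example above, and the request is constructed but not sent so the sketch runs without a live hub.

```python
import json
import urllib.request

HUB_URL = "http://127.0.0.1:8000"  # local workstation URL from the table above


def progress_request(summary, author="codex", event_type="note",
                     workstream_id=None, task_id=None):
    """Build (but do not send) a POST /progress/ request."""
    payload = {"summary": summary, "event_type": event_type, "author": author}
    # Optional ids are left out entirely when not applicable.
    if workstream_id:
        payload["workstream_id"] = workstream_id
    if task_id:
        payload["task_id"] = task_id
    return urllib.request.Request(
        f"{HUB_URL}/progress/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = progress_request("what was done")
# urllib.request.urlopen(req) would send it; omitted so the sketch works offline.
```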
### Update task status
```bash
curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
-H "Content-Type: application/json" \
-d '{"status": "progress"}'
# values: wait | todo | progress | done | cancel
```
### Flag a task for human review
```bash
curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
-H "Content-Type: application/json" \
-d '{"needs_human": true, "intervention_note": "reason"}'
```
---
## Session Protocol
**Start:**
1. `cat .custodian-brief.md` — domain goal and open workstreams (offline-safe)
2. Check inbox: `GET /messages/?to_agent=tele-mcp&unread_only=true`; mark read
3. Scan workplans: `ls workplans/` — note `status: ready`, `active`, or `blocked` files and open tasks
4. Check human-needed tasks: `GET /tasks/?needs_human=true`
**During work:**
- Update task statuses in workplan files as tasks progress
- Record significant decisions via `POST /decisions/`
**Close:**
1. Update workplan file task statuses to reflect progress
2. Log: `POST /progress/` with a summary of what changed
3. Note for the custodian operator: after workplan file changes, run from
`~/state-hub`:
```bash
make fix-consistency REPO=tele-mcp
```
This syncs task status from files into the hub DB.
---
## Credential and access routing
**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
for inference. Run this check **before** requesting secrets, API keys, SSH access,
login tokens, or database passwords — in any repo, not only `ops-warden`.
ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
other credential need belongs to another subsystem. **Do not** message
`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.
### Lookup (do this first)
```bash
warden route find "<describe your need>" --json
warden route show <catalog-id> --json
```
Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).
| Agent runtime | How to orient |
| --- | --- |
| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=tele-mcp` is for coordination, not secret vending |
| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |
### Quick routing table
| I need… | Owner | ops-warden executes? |
| --- | --- | --- |
| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
| Authorization decision | flex-auth | No — route only |
| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |
### Anti-patterns (do not do these)
- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
- Pasting secrets into Git, State Hub, workplans, logs, or chat
### Other capabilities (reuse-surface)
Non-credential capabilities are usually discovered through **reuse-surface** federation
(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
every repo's agent instructions because it is high-frequency, high-risk, and easy to
get wrong.
**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`
<!-- REPO-AGENTS-EXTENSIONS -->
<!-- Append repo-specific agent instructions below this marker.
The state-hub template sync preserves content after this line. -->
---
## Workplan Convention (ADR-001)
Work items originate as files in this repo — not in the hub. The hub is a
read/cache/index layer that rebuilds from files.
**File location:** `workplans/TELE-WP-NNNN-<slug>.md`
**Archived location:** finished workplans may move to
`workplans/archived/YYMMDD-TELE-WP-NNNN-<slug>.md`. The `YYMMDD` prefix is
the completion/archive date; the frontmatter `id` does not change.
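The archive rename can be sketched as follows, assuming the completion date is known when the file is moved. `archived_path` is a hypothetical helper; it only computes the target path and does not move anything.

```python
from datetime import date
from pathlib import PurePosixPath


def archived_path(workplan: str, completed: date) -> str:
    """Return the archive path with a YYMMDD completion-date prefix."""
    name = PurePosixPath(workplan).name  # keep the original file name intact
    return f"workplans/archived/{completed.strftime('%y%m%d')}-{name}"


target = archived_path("workplans/TELE-WP-0001-statehub-bootstrap.md", date(2026, 6, 22))
```

Only the file name changes; the frontmatter `id` inside the file stays `TELE-WP-0001`.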
**Ad Hoc Tasks:** small opportunistic fixes discovered during a session use
`workplans/ADHOC-YYYY-MM-DD.md` with task ids `ADHOC-YYYY-MM-DD-T01`, etc. Use
this only for low-risk work completed directly; create a normal workplan for
anything needing analysis, design, approval, dependencies, or multiple phases.
**Frontmatter:**
```yaml
---
id: TELE-WP-NNNN
type: workplan
title: "..."
domain: infotech
repo: tele-mcp
status: proposed | ready | active | blocked | backlog | finished | archived
owner: codex
topic_slug: ...
created: "YYYY-MM-DD"
updated: "YYYY-MM-DD"
state_hub_workstream_id: "<uuid>" # written by fix-consistency — do not edit
---
```
Use `proposed` for a new draft, `ready` after review against current repo
state, and `finished` after implementation. `stalled` and `needs_review` are
derived health labels, not frontmatter statuses.
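A hedged sketch of a frontmatter check follows. It assumes flat `key: value` frontmatter between two `---` delimiters and uses only the standard library rather than a YAML parser; `read_frontmatter` and `check_status` are hypothetical names, not State Hub functions.

```python
CANONICAL_STATUSES = {
    "proposed", "ready", "active", "blocked", "backlog", "finished", "archived",
}
DERIVED_LABELS = {"stalled", "needs_review"}  # health labels, never stored


def read_frontmatter(text: str) -> dict:
    """Read flat key/value pairs between the opening and closing '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening frontmatter delimiter")
    fields = {}
    for raw in lines[1:]:
        if raw.strip() == "---":
            return fields
        line = raw.split(" #", 1)[0]  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    raise ValueError("missing closing frontmatter delimiter")


def check_status(fields: dict) -> str:
    """Return the status if it is canonical, otherwise raise ValueError."""
    status = fields.get("status")
    if status in DERIVED_LABELS:
        raise ValueError(f"{status!r} is a derived health label, not a status")
    if status not in CANONICAL_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return status


doc = '---\nid: TELE-WP-0001\nstatus: ready\ntitle: "Bootstrap"\n---\n# body\n'
status = check_status(read_frontmatter(doc))
```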
**Task block format** (one per `##` section):
```
## Task Title
` ` `task
id: TELE-WP-NNNN-T01
status: wait | todo | progress | done | cancel
priority: high | medium | low
state_hub_task_id: "<uuid>" # written by fix-consistency — do not edit
` ` `
Task description text.
```
Status progression: `todo` → `progress` → `done`; use `wait` for waiting/blocked work and `cancel` for stopped work.
To create a new workplan:
1. Write the file following the format above
2. Notify the custodian operator to run `make fix-consistency REPO=tele-mcp`
(or send a message to the hub agent via `POST /messages/`)

12
CLAUDE.md Normal file
@@ -0,0 +1,12 @@
# tele-mcp — Claude Code Instructions
@SCOPE.md
@.claude/rules/repo-identity.md
@.claude/rules/session-protocol.md
@.claude/rules/first-session.md
@.claude/rules/workplan-convention.md
@.claude/rules/stack-and-commands.md
@.claude/rules/architecture.md
@.claude/rules/repo-boundary.md
@.claude/rules/credential-routing.md
@.claude/rules/agents.md

171
INTENT.md Normal file
@@ -0,0 +1,171 @@
# TeleMcp — Project Intent
> **Mission control for Kubernetes hosts, exposed to LLM agents through MCP.**
TeleMcp is a self-contained observability stack that deploys onto a Linux Kubernetes host and surfaces metrics, logs, and cluster state through a single **Model Context Protocol (MCP)** bridge. The goal is to let an autonomous agent — or a human with an agent — **bootstrap, monitor, triage, and operate** a box without bespoke integrations or constant human supervision.
This document anchors what we are building, why, and what is in scope. When in doubt, prefer the simplest path that gives an agent reliable, read-only situational awareness.
---
## Problem
Operating a Kubernetes host means juggling many signals across many systems: node health, workload status, logs, alerts, certificates, systemd units, and more. Humans use Grafana dashboards, `kubectl`, and ad-hoc PromQL/LogQL. LLM agents need the same information, but through a **standardized, safe interface** — not raw shell access.
TeleMcp closes that gap by:
1. **Collecting** telemetry with proven CNCF/Grafana stack components.
2. **Deploying** the stack repeatably via Ansible + Helm.
3. **Bridging** everything to agents through one MCP server with resources, tools, and prompts.
---
## Vision
A single `ansible-playbook` (or equivalent) turns a bare k3s/kubeadm host into a monitored, agent-ready environment. An LLM agent connects to the MCP bridge and can answer questions like:
- *What is unhealthy right now?*
- *Which pods are crash-looping and why?*
- *Is disk or memory pressure building?*
- *What changed in the cluster since yesterday?*
The agent reasons in **PromQL** and **LogQL** — industry-standard query languages — and calls parameterized tools rather than scraping raw APIs itself.
---
## Design Principles
| Principle | What it means |
|-----------|---------------|
| **Read-only by default** | The MCP bridge and its ServiceAccount only `get`/`list`/`watch`. No cluster mutations through this path. |
| **Standard stack** | Prometheus, Loki, kube-state-metrics, node-exporter — not custom collectors unless necessary. |
| **MCP as the interface** | One bridge, one contract. Agents do not talk to Prometheus/Loki/K8s APIs directly. |
| **Deployable in one shot** | Ansible playbook + Helm charts; no manual chart-by-chart assembly. |
| **Least privilege** | RBAC scoped to observation; NetworkPolicy limits egress; consider mTLS/OIDC for external exposure. |
| **Agent-first ergonomics** | Pre-built resources (saved queries), tools (parameterized calls), and prompts (triage playbooks). |
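The read-only principle lends itself to a simple automated check. The sketch below scans a ClusterRole manifest, already loaded as a dictionary, for any verb outside `get`/`list`/`watch`. The function name `mutating_verbs` and the example resources are assumptions for illustration; the bridge chart's actual rules may differ.

```python
READ_ONLY_VERBS = {"get", "list", "watch"}


def mutating_verbs(cluster_role: dict) -> set:
    """Return every verb in the role that is not get/list/watch."""
    found = set()
    for rule in cluster_role.get("rules", []):
        found.update(set(rule.get("verbs", [])) - READ_ONLY_VERBS)
    return found


# Illustrative manifest; the resource lists are hypothetical.
role = {
    "kind": "ClusterRole",
    "rules": [
        {"apiGroups": [""], "resources": ["pods", "nodes", "events"],
         "verbs": ["get", "list", "watch"]},
        {"apiGroups": ["apps"], "resources": ["deployments"],
         "verbs": ["get", "list", "watch"]},
    ],
}
violations = mutating_verbs(role)
```

A wildcard verb (`*`) is reported as a violation too, since it is not one of the three read-only verbs.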
---
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ LLM Agent (MCP client) │
└──────────────────────────┬──────────────────────────────────┘
│ MCP (resources / tools / prompts)
┌──────────────────────────▼──────────────────────────────────┐
│ mcp-telemetry-bridge (FastAPI, namespace: mcp) │
│ Read-only proxy to Prometheus, Loki, Kubernetes API │
└──────┬─────────────────┬────────────────────┬───────────────┘
│ │ │
┌──────▼──────┐ ┌───────▼───────┐ ┌────────▼────────┐
│ Prometheus │ │ Loki │ │ Kubernetes API │
│ Alertmanager│ │ Promtail │ │ (in-cluster SA) │
│ Grafana │ │ │ │ │
│ KSM │ │ │ │ │
│ node-export │ │ │ │ │
└─────────────┘ └───────────────┘ └─────────────────┘
monitoring namespace logging namespace
```
**Optional:** OpenTelemetry Collector for OTLP fan-out to Prometheus remote-write and Loki.
**Future:** Host-level DaemonSet sidecar for systemd status, package/cert checks, and other node facts not available through K8s metrics alone.
---
## What We Capture
### Minimum viable (current target)
**Kubernetes**
- Cluster & node status, conditions, taints
- Namespaces, Deployments, StatefulSets, DaemonSets, Pods (phase, restarts, images)
- Services, Events (especially Warning/Error)
- Resource usage via Prometheus/cAdvisor/kube-state-metrics
**Logs & alerts**
- Pod and node logs via Loki/Promtail
- Default alert rules: node not ready, CrashLoopBackOff, API/etcd degradation, job failures
**Bridge surface**
- Tools: `promql.query`, `loki.query`, `k8s.get`, `k8s.events`, `inventory.snapshot`
- Resources: saved PromQL/LogQL queries, cluster inventory snapshots
- Prompts: triage and operational playbooks
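To make the `promql.query` tool concrete, here is a hedged sketch of the upstream call a read-only proxy could issue. It uses the Prometheus HTTP API instant-query endpoint (`/api/v1/query`) and a kube-state-metrics series for crash-looping containers. The service address in `PROMETHEUS_URL` and the helper name `instant_query_url` are assumptions; this is not the bridge's actual implementation.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed in-cluster address; the real service name depends on the Helm release.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"

# kube-state-metrics series that is 1 while a container waits in CrashLoopBackOff.
CRASH_LOOPING = 'kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1'


def instant_query_url(expr: str, base: str = PROMETHEUS_URL) -> str:
    """Build the URL for a Prometheus instant query."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


url = instant_query_url(CRASH_LOOPING)
# Round-trip check: the encoded query decodes back to the original expression.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

Keeping the expression as a parameter is what lets the agent reason in PromQL while the bridge stays the only component with network access to Prometheus.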
### Stretch (explicitly deferred)
- Host OS depth: systemd units, package updates, cert expiry, firewall summary, NTP drift
- `systemd.status`, `tail.pod_logs` tools
- Alertmanager API integration for active-alert summaries
- Full MCP transport (stdio/SSE) vs. current HTTP schema approximation
- Multi-cluster federation
- Write/mutate operations (out of scope unless a separate, gated path is designed)
---
## Repository Layout
| Path | Role |
|------|------|
| `ansible/` | Bootstrap: install Helm, deploy all charts |
| `helm/values/` | Opinionated values for kube-prometheus-stack, Loki, OTel |
| `helm/mcp-telemetry-bridge/` | Bridge chart: Deployment, RBAC, Service, NetworkPolicy |
| `mcp-telemetry-bridge/` | FastAPI application implementing the MCP surface |
| `environments/` | Per-environment overrides and notes |
| `wiki/` | Extended design notes and blueprint |
---
## Current State (as of initial scaffold)
**Done**
- Ansible playbook with `k8s_host` + `telemetry_stack` roles
- Helm values for monitoring, logging, optional OTel collector
- MCP bridge service with core tools and saved-query resources
- Read-only ClusterRole/Binding for the bridge ServiceAccount
- NetworkPolicy skeleton for the bridge
- Health check and `/mcp/schema` discovery endpoint
**Not yet done / known gaps**
- Bridge image is a placeholder (`ghcr.io/example/telemcp-bridge`); needs CI build and publish
- MCP interface is HTTP REST-shaped, not full MCP protocol transport
- Prompts: only `Triage-Now` stub; missing `Capacity-Check`, `CrashLoop-Playbook`
- No Alertmanager integration in the bridge
- No metrics-server chart (useful for `kubectl top` semantics)
- No host-level DaemonSet sidecar for systemd/OS signals
- NetworkPolicy egress may need K8s API (443) allowance
- Wiki and README aligned to INTENT; keep them updated when scope shifts
---
## Success Criteria
We know TeleMcp is working when:
1. `ansible-playbook` brings up monitoring, logging, and bridge namespaces with healthy pods.
2. `curl /mcp/schema` returns resources, tools, and prompts.
3. An MCP-capable agent can query PromQL, run LogQL, list cluster objects, and pull an inventory snapshot **without direct API credentials**.
4. Default alert rules fire on induced failures (node pressure, crash loop) and the agent can triage them via bridge tools.
5. The entire stack redeploys cleanly on a fresh Ubuntu 24.04 + k3s/kubeadm host.
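Criterion 2 could be automated along these lines. The sketch assumes the `/mcp/schema` response is a JSON object with top-level `resources`, `tools`, and `prompts` keys and that each tool entry carries a `name` field; both are assumptions about the response shape, and `missing_from_schema` is a hypothetical helper.

```python
REQUIRED_SECTIONS = ("resources", "tools", "prompts")  # assumed top-level keys
EXPECTED_TOOLS = {
    "promql.query", "loki.query", "k8s.get", "k8s.events", "inventory.snapshot",
}


def missing_from_schema(schema: dict) -> list:
    """List missing sections and tools; an empty list means the check passes."""
    problems = [
        f"missing section: {name}" for name in REQUIRED_SECTIONS if name not in schema
    ]
    tool_names = {tool.get("name") for tool in schema.get("tools", [])}
    problems += [f"missing tool: {name}" for name in sorted(EXPECTED_TOOLS - tool_names)]
    return problems


# Hand-written sample payload standing in for a real /mcp/schema response.
sample = {
    "resources": [],
    "tools": [{"name": name} for name in EXPECTED_TOOLS],
    "prompts": [],
}
problems = missing_from_schema(sample)
```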
---
## Non-Goals
- Replacing Grafana or building a custom metrics database
- Providing arbitrary shell/exec access to the cluster
- Mutating cluster state (deploy, scale, delete) through the bridge
- Supporting non-Linux or non-Kubernetes targets in v1
- Vendor-specific APM (Datadog, New Relic, etc.) — OTel fan-out is the extension point
---
## How to Use This Document
- **Prioritize work** against the "Current State" gaps and "Minimum viable" capture list.
- **Reject scope creep** that does not serve agent observability or repeatable deployment.
- **Update this file** when intent shifts — e.g., adding write paths, new environments, or MCP transport changes.
For operational quick-start, see [README.md](README.md).
For detailed component rationale, see [wiki/TeleMcpBlueprint.md](wiki/TeleMcpBlueprint.md).

@@ -1,55 +1,103 @@
# TeleMcp
Telemetry + MCP bridge that auto-deploys on a Linux-based Kubernetes host via **Ansible + Helm**.
It exposes read-only metrics, logs, and k8s object state through an **MCP server** so an LLM agent can bootstrap, monitor, and operate the host.
**Mission control for Kubernetes hosts, exposed to LLM agents through MCP.**
TeleMcp deploys a standard observability stack onto a Linux Kubernetes host via **Ansible + Helm**, then surfaces metrics, logs, and cluster state through a read-only **MCP bridge** so an LLM agent can bootstrap, monitor, triage, and operate the box.
> For project goals, scope, and design principles, see **[INTENT.md](INTENT.md)**.
## Components
- **kube-prometheus-stack** (Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics)
- **Loki + Promtail** (logs)
- **OpenTelemetry Collector** (optional fan-out)
- **mcp-telemetry-bridge** (FastAPI service exposing MCP resources/tools/prompts)
| Component | Namespace | Role |
|-----------|-----------|------|
| **kube-prometheus-stack** | `monitoring` | Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics |
| **Loki + Promtail** | `logging` | Log aggregation and shipping |
| **OpenTelemetry Collector** | `observability` | Optional OTLP fan-out to Prometheus and Loki |
| **mcp-telemetry-bridge** | `mcp` | FastAPI service exposing MCP resources, tools, and prompts |
## Quick Start
### 0) Prereqs
- Ubuntu 24.04 host with k8s (k3s or kubeadm) reachable and `kubectl` context configured
- Ansible 2.15+ on your control machine
- Helm 3 on the host (Ansible role installs if missing)
### 1) Run Ansible
```bash
cd ansible
ansible-playbook -i inventories/local.ini playbook.yml
```
### 2) Smoke tests (from any machine with kubectl context)
### 2) Smoke tests
From any machine with a `kubectl` context:
```bash
kubectl get pods -n monitoring
kubectl get pods -n logging
kubectl get pods -n mcp
kubectl port-forward -n mcp svc/mcp-telemetry-bridge 8080:80
curl http://localhost:8080/mcp/schema | jq .
curl http://localhost:8080/healthz
```
### 3) Point your LLM Agent
Configure your agent's MCP client to the service endpoint (ClusterIP/Ingress).
Use tools:
- `promql.query`
- `loki.query`
- `k8s.get`
- `k8s.events`
- `inventory.snapshot`
### 3) Point your LLM agent
Configure your agent's MCP client to the bridge endpoint (ClusterIP, Ingress, or port-forward).
**Implemented tools:**
| Tool | Description |
|------|-------------|
| `promql.query` | Run a PromQL expression against Prometheus |
| `loki.query` | Run a LogQL query against Loki |
| `k8s.get` | Fetch Kubernetes objects (pods, nodes, deployments, etc.) |
| `k8s.events` | List cluster or namespace events |
| `inventory.snapshot` | JSON snapshot of nodes, namespaces, and workloads |
**Saved resources** (via `/mcp/resource?uri=...`):
- `res://dashboards/top-pods-by-cpu.promql`
- `res://dashboards/pod-restarts.promql`
- `res://dashboards/warn-events.logql`
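Because a `res://` URI contains characters that must be percent-encoded in a query string, a small helper can build the request URL safely. This is a sketch that assumes the port-forward address from the smoke tests; `resource_url` is a hypothetical name.

```python
from urllib.parse import urlencode

BRIDGE_URL = "http://localhost:8080"  # port-forward address from the smoke tests


def resource_url(uri: str, base: str = BRIDGE_URL) -> str:
    """Build the /mcp/resource URL with the resource URI percent-encoded."""
    return f"{base}/mcp/resource?{urlencode({'uri': uri})}"


url = resource_url("res://dashboards/top-pods-by-cpu.promql")
# url == "http://localhost:8080/mcp/resource?uri=res%3A%2F%2Fdashboards%2Ftop-pods-by-cpu.promql"
```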
> The bridge currently exposes an HTTP schema approximation (`/mcp/schema`, `/tools/...`). Full MCP transport (stdio/SSE) is planned — see [INTENT.md](INTENT.md).
## Repo layout
```
tele-mcp/
ansible/
INTENT.md # Project north star — goals, scope, current state
ansible/ # Bootstrap playbook and roles
helm/
mcp-telemetry-bridge/
environments/
values/ # Chart values for monitoring, logging, OTel
mcp-telemetry-bridge/ # Bridge Helm chart
mcp-telemetry-bridge/ # FastAPI bridge application
environments/ # Per-environment overrides
wiki/ # Extended project and design docs
```
## Documentation
| Document | Purpose |
|----------|---------|
| [INTENT.md](INTENT.md) | Goals, principles, scope, success criteria |
| [wiki/TeleMcpProject.md](wiki/TeleMcpProject.md) | Project overview and audience |
| [wiki/TeleMcpBlueprint.md](wiki/TeleMcpBlueprint.md) | Component rationale and bridge design |
| [environments/dev/README.md](environments/dev/README.md) | Dev environment notes |
## Security
- MCP bridge ServiceAccount is read-only (RBAC get/list/watch)
- Optional NetworkPolicy limits egress/ingress
- Consider mTLS/OIDC if exposing outside the cluster
- MCP bridge ServiceAccount is read-only (`get` / `list` / `watch` only)
- NetworkPolicy limits bridge egress to Prometheus and Loki
- Consider mTLS or OIDC if exposing the bridge outside the cluster
## Current limitations
See [INTENT.md — Current State](INTENT.md#current-state-as-of-initial-scaffold) for the full list. Notable gaps:
- Bridge container image is a placeholder (`ghcr.io/example/telemcp-bridge`)
- No Alertmanager integration in the bridge yet
- Host-level signals (systemd, certs, firewall) are deferred to a future DaemonSet sidecar

32
SCOPE.md Normal file
View File

@@ -0,0 +1,32 @@
# SCOPE
> This file was generated by `statehub register`. Refine it as the repository
> boundaries become clearer.
## One-liner
**Mission control for Kubernetes hosts, exposed to LLM agents through MCP.**
## Core Idea
tele-mcp exists to provide the capability described in INTENT.md.
## In Scope
- Maintain the repository's primary implementation.
- Keep docs, tests, and operational metadata current.
## Out of Scope
- Own unrelated adjacent systems.
- Make irreversible operational decisions without human approval.
## Current State
- Status: active; implementation and stability should be verified by the repo agent.
## Getting Oriented
- Start with: INTENT.md
- Agent instructions: AGENTS.md
- Workplans: workplans/

183
wiki/TeleMcpBlueprint.md Normal file
View File

@@ -0,0 +1,183 @@
# TeleMcp Blueprint
*Building a Kubernetes telemetry MCP bridge*
> **Source:** [Original design conversation](https://chatgpt.com/share/68bdf06d-8c2c-8009-90c5-466f9f531d9a)
> **Authority:** Scope and priorities are governed by [INTENT.md](../INTENT.md). This document explains *why* each component exists and *how* the bridge is shaped.
## Overview
Blueprint for a telemetry service + MCP bridge that auto-deploys on a Linux-based Kubernetes host (k3s or standard k8s) via Ansible + Helm, and exposes everything an LLM agent needs to bootstrap, monitor, and operate the box.
MCP acts as the standardized "USB-C" between the LLM agent and your telemetry — see the [Model Context Protocol spec](https://modelcontextprotocol.io).
---
## What we capture
### Minimum viable (current target)
**Kubernetes (control + workloads)**
- Cluster and node status, taints, conditions, kubelet health
- Namespaces, Deployments, StatefulSets, DaemonSets, Pods (phase, restarts, images, age)
- Services, Events (warning/error)
- Resource usage per pod/node/namespace via Prometheus, cAdvisor, and kube-state-metrics
**Logs and alerts**
- Pod and node logs via Loki/Promtail
- Default alert rules: node not ready, API/etcd degradation, CrashLoopBackOff, job failures
**Bridge surface**
- Tools: `promql.query`, `loki.query`, `k8s.get`, `k8s.events`, `inventory.snapshot`
- Resources: saved PromQL/LogQL queries, cluster inventory snapshots
- Prompts: triage and operational playbooks (`Triage-Now` is a stub; others planned)
### Stretch (deferred)
**Host (Linux / node)**
- CPU, memory, disk, inode, filesystem, network, NIC errors *(partially covered by node-exporter)*
- Distro/kernel/version, packages/updates
- Systemd unit status for key services (container runtime, kubelet, nginx, etc.)
- Certificates (expiry), time sync status (chrony/ntp)
- Firewall/ports (nftables/ufw summary)
**Additional Kubernetes signals**
- Ingress, Jobs/CronJobs, HPA/VPA
- Throttling and OOM kill detail beyond default metrics
**Additional bridge capabilities**
- `systemd.status`, `tail.pod_logs` tools
- Alertmanager API for active-alert summaries
- Full MCP transport (stdio/SSE) vs. current HTTP schema approximation
---
## Reference architecture
```
┌─────────────────────────────────────────────────────────────┐
│ LLM Agent (MCP client) │
└──────────────────────────┬──────────────────────────────────┘
│ MCP (resources / tools / prompts)
┌──────────────────────────▼──────────────────────────────────┐
│ mcp-telemetry-bridge (FastAPI, namespace: mcp) │
│ Read-only proxy to Prometheus, Loki, Kubernetes API │
└──────┬─────────────────┬────────────────────┬───────────────┘
│ │ │
┌──────▼──────┐ ┌───────▼───────┐ ┌────────▼────────┐
│ Prometheus │ │ Loki │ │ Kubernetes API │
│ Alertmanager│ │ Promtail │ │ (in-cluster SA) │
│ Grafana │ │ │ │ │
│ KSM │ │ │ │ │
│ node-export │ │ │ │ │
└─────────────┘ └───────────────┘ └─────────────────┘
```
### On the cluster
| Component | Status | Role |
|-----------|--------|------|
| [kube-prometheus-stack](https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack) | **Deployed** | Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics, default rules |
| [Loki](https://grafana.com/docs/loki/latest/) + [Promtail](https://grafana.com/docs/loki/latest/send-data/promtail/) | **Deployed** | Log aggregation and shipping |
| [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) | **Deployed (optional)** | OTLP in → Prometheus remote-write / Loki out |
| [metrics-server](https://github.com/kubernetes-sigs/metrics-server) | Planned | Live resource metrics (`kubectl top` semantics) |
| Host DaemonSet sidecar | Planned | systemd, cert, and OS-level facts |
We use standard CNCF pieces so agents reason in **PromQL** and **LogQL** and call a single MCP server for answers.
---
## Why these charts?
| Chart | Rationale |
|-------|-----------|
| **kube-prometheus-stack** | One Helm install for Prometheus Operator, Alertmanager, Grafana, node-exporter, KSM, dashboards, and alert rules |
| **Loki + Promtail** | Cheap, scalable log storage without bolting logs into Prometheus |
| **OTel Collector** | Vendor-agnostic OTLP ingress; fan-out to existing backends without re-architecting |
Ansible copies opinionated values from `helm/values/` and runs `helm upgrade --install` for each chart. See `ansible/roles/telemetry_stack/tasks/main.yml`.
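
The per-chart install step can be sketched as a command builder. This is an illustration under assumptions, not the playbook's actual task list: the release names, the Loki chart reference, and the values file names under `helm/values/` are hypothetical, while `helm upgrade --install`, `--namespace`, `--create-namespace`, and `-f` are standard Helm flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChartRelease:
    release: str      # Helm release name (assumed)
    chart: str        # chart reference in "repo/chart" form (assumed)
    namespace: str    # target namespace
    values_file: str  # values file under helm/values/ (file names are assumed)


def helm_command(r: ChartRelease) -> list[str]:
    """Return the argv for an idempotent install-or-upgrade of one chart."""
    return [
        "helm", "upgrade", "--install", r.release, r.chart,
        "--namespace", r.namespace, "--create-namespace",
        "-f", r.values_file,
    ]


RELEASES = [
    ChartRelease("kube-prometheus-stack",
                 "prometheus-community/kube-prometheus-stack",
                 "monitoring", "helm/values/kube-prometheus-stack.yaml"),
    ChartRelease("loki", "grafana/loki", "logging", "helm/values/loki.yaml"),
]

for release in RELEASES:
    print(" ".join(helm_command(release)))
```

Because `helm upgrade --install` installs a missing release and upgrades an existing one, rerunning the same command on an already-provisioned host is safe.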
---
## MCP Telemetry Bridge
The bridge (`mcp-telemetry-bridge/`) is the key piece: a small FastAPI service that exposes the MCP surface (resources, tools, prompts) as an HTTP approximation until full MCP transport lands.
### Implementation status
| Capability | Status |
|------------|--------|
| FastAPI service with health check | Done |
| `/mcp/schema` discovery endpoint | Done |
| `promql.query` | Done |
| `loki.query` | Done |
| `k8s.get` | Done |
| `k8s.events` | Done |
| `inventory.snapshot` | Done |
| Saved PromQL/LogQL resources | Done (3 queries) |
| `Triage-Now` prompt | Stub |
| `Capacity-Check`, `CrashLoop-Playbook` prompts | Planned |
| `systemd.status` | Planned (requires DaemonSet sidecar) |
| `tail.pod_logs` | Planned |
| Alertmanager API | Planned |
| Full MCP protocol transport | Planned |
### Read-only backends
The bridge talks read-only to:
- **Prometheus HTTP API** — instant and range queries
- **Loki HTTP API** — LogQL queries
- **Kubernetes API** — ServiceAccount with RBAC `get`/`list`/`watch`
- **Alertmanager API** — planned for active-alert summaries
- **Node sidecar HTTP** — planned for host-level facts
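
As a rough illustration of the first two backends, the sketch below builds the read-only query URLs the bridge would need. The in-cluster service hostnames and the function names are assumptions; the paths `/api/v1/query`, `/api/v1/query_range`, and `/loki/api/v1/query_range` are the standard Prometheus and Loki HTTP API endpoints.

```python
from urllib.parse import urlencode

# Assumed in-cluster service addresses; actual names depend on the Helm releases.
PROMETHEUS = "http://prometheus.monitoring.svc:9090"
LOKI = "http://loki.logging.svc:3100"


def prom_instant_url(expr: str) -> str:
    """Instant query: evaluate a PromQL expression at the current time."""
    return f"{PROMETHEUS}/api/v1/query?{urlencode({'query': expr})}"


def prom_range_url(expr: str, start: int, end: int, step: str) -> str:
    """Range query: evaluate a PromQL expression over [start, end] at a fixed step."""
    params = {"query": expr, "start": start, "end": end, "step": step}
    return f"{PROMETHEUS}/api/v1/query_range?{urlencode(params)}"


def loki_range_url(logql: str, limit: int = 100) -> str:
    """LogQL range query, capped at `limit` returned entries."""
    return f"{LOKI}/loki/api/v1/query_range?{urlencode({'query': logql, 'limit': limit})}"


print(prom_instant_url("up"))
print(prom_range_url("up", 0, 60, "15s"))
print(loki_range_url('{namespace="mcp"}', limit=10))
```

All three are plain GET requests, which matches the read-only posture described above.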
### Tools (target API)
```
promql.query(expr, range?)
loki.query(logql, limit?, since?)
k8s.get(kind, namespace?, name?)
k8s.events(namespace?, since?)
inventory.snapshot() → JSON
systemd.status(unit) # planned
```
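
One way to read the signatures above is as a table of required and optional arguments per tool. The sketch below encodes that table and validates a call against it. The `TOOL_PARAMS` structure and `validate_call` helper are hypothetical and not part of the bridge's code; `systemd.status` is left out because it is still planned.

```python
# (required, optional) argument names per tool, taken from the signatures above.
TOOL_PARAMS = {
    "promql.query": ({"expr"}, {"range"}),
    "loki.query": ({"logql"}, {"limit", "since"}),
    "k8s.get": ({"kind"}, {"namespace", "name"}),
    "k8s.events": (set(), {"namespace", "since"}),
    "inventory.snapshot": (set(), set()),
}


def validate_call(tool: str, args: dict) -> list[str]:
    """Return a list of problems with a tool call; an empty list means it is valid."""
    spec = TOOL_PARAMS.get(tool)
    if spec is None:
        return [f"unknown tool: {tool}"]
    required, optional = spec
    names = set(args)
    problems = [f"missing argument: {n}" for n in sorted(required - names)]
    problems += [f"unexpected argument: {n}" for n in sorted(names - required - optional)]
    return problems


print(validate_call("promql.query", {"expr": "up"}))
print(validate_call("k8s.get", {"namespace": "mcp"}))
```

Rejecting unknown tools and unexpected arguments before any backend call gives an agent a clear error instead of an opaque upstream failure.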
### Resources
```
res://dashboards/top-pods-by-cpu.promql # implemented
res://dashboards/pod-restarts.promql # implemented
res://dashboards/warn-events.logql # implemented
res://snapshots/cluster-inventory.json # planned (dynamic)
```
### Prompts
```
Triage-Now # stub — summarize alerts, top offenders, recent warnings
Capacity-Check # planned
CrashLoop-Playbook # planned
```
---
## Security model
- Bridge runs under a dedicated ServiceAccount with a ClusterRole limited to `get`/`list`/`watch`
- NetworkPolicy restricts egress to Prometheus (9090) and Loki (3100); because the bridge also calls the Kubernetes API, an additional egress allowance for the API server (443) may be needed
- External exposure should use mTLS or OIDC — the bridge is not authenticated in v1
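
The read-only constraint is easy to check mechanically. The sketch below, written under assumptions, scans a ClusterRole expressed as a Python dict and reports any verb outside `get`/`list`/`watch`. The role name and the resource list are illustrative, not the chart's actual manifest.

```python
READ_ONLY_VERBS = frozenset({"get", "list", "watch"})


def non_read_only_verbs(cluster_role: dict) -> set[str]:
    """Collect every verb in the role's rules that is not get, list, or watch."""
    found: set[str] = set()
    for rule in cluster_role.get("rules", []):
        found.update(set(rule.get("verbs", [])) - READ_ONLY_VERBS)
    return found


# Illustrative role; the name and resources are assumptions.
bridge_role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRole",
    "metadata": {"name": "mcp-telemetry-bridge-readonly"},
    "rules": [
        {
            "apiGroups": ["", "apps"],
            "resources": ["pods", "nodes", "namespaces", "events", "deployments"],
            "verbs": ["get", "list", "watch"],
        }
    ],
}

print(non_read_only_verbs(bridge_role))
```

A wildcard verb (`*`) is reported as a violation too, since it is not in the allowed set.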
---
## Related docs
- [INTENT.md](../INTENT.md) — goals, scope, success criteria, known gaps
- [README.md](../README.md) — quick start and smoke tests
- [TeleMcpProject.md](TeleMcpProject.md) — project overview and audience

73
wiki/TeleMcpProject.md Normal file
View File

@@ -0,0 +1,73 @@
# TeleMcp Project
*Telemetry for autonomous control*
## What is TeleMcp?
TeleMcp is **mission control for Kubernetes hosts**. It collects health, performance, and alert signals from a Linux k8s cluster and exposes them through a single **Model Context Protocol (MCP)** interface so intelligent assistants can understand what's happening, triage problems, and help keep systems running smoothly — without constant human supervision.
The project name reflects its two halves:
- **Tele** — telemetry: metrics, logs, events, and cluster inventory
- **MCP** — the standardized bridge between observability backends and LLM agents
## Who is it for?
- **Operators** who want repeatable, one-command observability on a k3s or kubeadm host
- **LLM agent builders** who need a safe, read-only API for cluster situational awareness
- **Developers** running local or edge Kubernetes who want agent-assisted monitoring without wiring up bespoke integrations
## What problem does it solve?
Running a Kubernetes host means tracking signals across many systems. Humans reach for Grafana, `kubectl`, and ad-hoc PromQL. Agents need the same information through a **standardized, safe contract** — not raw shell access or scattered API credentials.
TeleMcp solves this in three steps:
1. **Collect** — deploy Prometheus, Loki, and supporting exporters via Helm
2. **Deploy** — bootstrap everything with a single Ansible playbook
3. **Bridge** — expose resources, tools, and prompts through `mcp-telemetry-bridge`
## What can an agent do today?
With the current scaffold, an agent connected to the bridge can:
- Query Prometheus with `promql.query`
- Search logs with `loki.query`
- Inspect Kubernetes objects with `k8s.get` and `k8s.events`
- Pull a cluster inventory snapshot with `inventory.snapshot`
- Use pre-built PromQL/LogQL resources for common triage queries
## What is planned?
Stretch goals — explicitly deferred in v1 — include host-level signals (systemd status, cert expiry, firewall summary), Alertmanager integration, additional prompts (`Capacity-Check`, `CrashLoop-Playbook`), and full MCP protocol transport. See [INTENT.md](../INTENT.md) for the authoritative scope list.
## Design principles
| Principle | Summary |
|-----------|---------|
| Read-only by default | No cluster mutations through the bridge |
| Standard stack | CNCF/Grafana components, not custom collectors |
| MCP as the interface | One bridge, one contract for agents |
| Deployable in one shot | Ansible + Helm, no manual assembly |
| Least privilege | Scoped RBAC and NetworkPolicy |
## Repository map
| Path | Contents |
|------|----------|
| [INTENT.md](../INTENT.md) | North star — goals, scope, current state |
| [README.md](../README.md) | Quick start and operational guide |
| [TeleMcpBlueprint.md](TeleMcpBlueprint.md) | Architecture and component rationale |
| `ansible/` | Bootstrap playbook |
| `helm/` | Chart values and bridge chart |
| `mcp-telemetry-bridge/` | FastAPI bridge source |
## Success criteria
TeleMcp is working when:
1. `ansible-playbook` brings up healthy pods in `monitoring`, `logging`, and `mcp` namespaces
2. `/mcp/schema` returns resources, tools, and prompts
3. An agent can query metrics, logs, and cluster state without direct API credentials
4. Default alert rules fire on induced failures and the agent can triage them
5. The stack redeploys cleanly on a fresh Ubuntu 24.04 + k3s/kubeadm host
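
Criterion 2 lends itself to a small smoke check. The sketch below assumes the `/mcp/schema` response is a JSON object with top-level `resources`, `tools`, and `prompts` keys; that shape is an assumption, and the sample payload is illustrative rather than captured output.

```python
REQUIRED_SECTIONS = ("resources", "tools", "prompts")


def missing_sections(schema: dict) -> list[str]:
    """Return the required sections that are absent or empty in a schema payload."""
    return [key for key in REQUIRED_SECTIONS if not schema.get(key)]


# Illustrative payload only; the real schema layout may differ.
sample_schema = {
    "resources": ["res://dashboards/top-pods-by-cpu.promql"],
    "tools": ["promql.query", "loki.query", "k8s.get", "k8s.events", "inventory.snapshot"],
    "prompts": ["Triage-Now"],
}

print(missing_sections(sample_schema))
```

An empty result means all three sections are present and non-empty.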

View File

@@ -0,0 +1,54 @@
---
id: TELE-WP-0001
type: workplan
title: "Bootstrap State Hub integration"
domain: infotech
repo: tele-mcp
status: ready
owner: codex
topic_slug: custodian
created: "2026-06-22"
updated: "2026-06-22"
---
# Bootstrap State Hub integration
**Mission control for Kubernetes hosts, exposed to LLM agents through MCP.**
## Review Generated Integration Files
```task
id: TELE-WP-0001-T01
status: todo
priority: high
```
Review `INTENT.md`, `SCOPE.md`, `AGENTS.md`, and `.custodian-brief.md`.
Replace generated placeholders with repo-specific facts where needed.
## Verify Local Developer Workflow
```task
id: TELE-WP-0001-T02
status: todo
priority: high
```
Identify the repo's install, test, lint, build, and run commands. Add or refine
those commands in the agent instructions so future coding sessions can verify
changes confidently.
## Seed First Real Workplan
```task
id: TELE-WP-0001-T03
status: todo
priority: medium
```
Create the first implementation workplan for the repository's most important
next change. After workplan file updates, run from `~/state-hub`:
```bash
make fix-consistency REPO=tele-mcp
```