tegwick 8f2584c1a0 Add MCP bridge local verification harness (TELE-WP-0002)
Introduce pytest smoke tests, run/verify scripts, and Makefile targets so
the bridge can be developed and validated without a full cluster deploy.
Document the local workflow and agent quickstart in README.
2026-06-24 18:18:00 +02:00

# TeleMcp
**Mission control for Kubernetes hosts, exposed to LLM agents through MCP.**

TeleMcp deploys a standard observability stack onto a Linux Kubernetes host via **Ansible + Helm**, then surfaces metrics, logs, and cluster state through a read-only **MCP bridge** so an LLM agent can bootstrap, monitor, triage, and operate the box.
> For project goals, scope, and design principles, see **[INTENT.md](INTENT.md)**.
## Components
| Component | Namespace | Role |
|-----------|-----------|------|
| **kube-prometheus-stack** | `monitoring` | Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics |
| **Loki + Promtail** | `logging` | Log aggregation and shipping |
| **OpenTelemetry Collector** | `observability` | Optional OTLP fan-out to Prometheus and Loki |
| **mcp-telemetry-bridge** | `mcp` | FastAPI service exposing MCP resources, tools, and prompts |
## Local development (no cluster)
Work on the MCP bridge without deploying the full observability stack.
### Install and verify
```bash
make bridge-install # venv + deps (once)
make bridge-test # pytest smoke: /healthz, /mcp/schema, /mcp/resource
```
Or from `mcp-telemetry-bridge/`:
```bash
./scripts/verify-local.sh
```
### Run locally
```bash
make bridge-run
# or: cd mcp-telemetry-bridge && ./scripts/run-local.sh
```
With the server up, optional live HTTP checks:
```bash
make bridge-smoke
# or: RUN_LIVE=1 ./mcp-telemetry-bridge/scripts/verify-local.sh
```
Manual curls:
```bash
curl http://127.0.0.1:8080/healthz
curl http://127.0.0.1:8080/mcp/schema | jq .
curl "http://127.0.0.1:8080/mcp/resource?uri=res://dashboards/top-pods-by-cpu.promql"
```
Tool calls use `POST /tools/<name>` with a JSON body. The Prometheus, Loki, and Kubernetes backends are only reachable in-cluster, so tool calls against a locally run bridge are not expected to succeed.
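A minimal sketch of such a tool call, assuming the bridge listens on `127.0.0.1:8080` as in the manual curls. `BRIDGE_URL`, `tool_url`, and `call_tool` are hypothetical names introduced here for illustration; they are not part of the repository.

```bash
# Hypothetical helpers: BRIDGE_URL, tool_url, and call_tool are illustrative
# names, not part of the repository.
BRIDGE_URL="${BRIDGE_URL:-http://127.0.0.1:8080}"

# Build the endpoint URL for a tool, tolerating a trailing slash on the base.
tool_url() {
  printf '%s/tools/%s' "${BRIDGE_URL%/}" "$1"
}

# POST a JSON body to the named tool and print the raw response.
call_tool() {
  curl -sS -X POST "$(tool_url "$1")" \
    -H 'Content-Type: application/json' \
    -d "$2"
}

# With the server running:
#   call_tool promql.query '{"expr":"up"}'
```

Keeping the URL construction in its own function makes it easy to point the same helper at a port-forwarded in-cluster bridge by exporting `BRIDGE_URL`.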
### Agent quickstart
When changing the bridge, agents should:
1. Run `make bridge-test` after edits — fast, no cluster needed.
2. Introspect `GET /mcp/schema` for the current tools, resources, and prompts.
3. Call tools via `POST /tools/<tool-name>` (e.g. `POST /tools/promql.query` with `{"expr":"up"}`).
4. Fetch saved queries via `GET /mcp/resource?uri=<uri>`.

Expected smoke-test surface:
| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/healthz` | GET | Liveness |
| `/mcp/schema` | GET | MCP catalog (tools, resources, prompts) |
| `/mcp/resource` | GET | Saved PromQL/LogQL query by URI |
| `/tools/*` | POST | Execute a tool (needs in-cluster backends) |
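
The GET rows of this table can be checked in one pass with a short loop. This is only a sketch: `smoke_paths` and `smoke_check` are hypothetical names, the base URL is assumed to be `127.0.0.1:8080`, and treating HTTP 200 as "healthy" is an assumption, since the expected status codes are not documented here. Call `smoke_check` with the server running; it prints one `<status> <path>` line per endpoint.

```bash
BRIDGE_URL="${BRIDGE_URL:-http://127.0.0.1:8080}"

# GET endpoints from the smoke-test table; the resource URI is one of the
# saved queries named elsewhere in this README.
smoke_paths() {
  printf '%s\n' \
    '/healthz' \
    '/mcp/schema' \
    '/mcp/resource?uri=res://dashboards/top-pods-by-cpu.promql'
}

# Print "<status> <path>" per endpoint; return non-zero if any is not 200.
# Assumption: a healthy endpoint answers HTTP 200.
smoke_check() {
  failed=0
  while IFS= read -r path; do
    status="$(curl -s -o /dev/null -w '%{http_code}' "${BRIDGE_URL%/}${path}")"
    printf '%s %s\n' "$status" "$path"
    [ "$status" = "200" ] || failed=1
  done <<EOF
$(smoke_paths)
EOF
  return "$failed"
}
```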
---
## Quick Start (full cluster deploy)
### 0) Prereqs
- Ubuntu 24.04 host with k8s (k3s or kubeadm) reachable and `kubectl` context configured
- Ansible 2.15+ on your control machine
- Helm 3 on the host (the Ansible role installs it if missing)
### 1) Run Ansible
```bash
cd ansible
ansible-playbook -i inventories/local.ini playbook.yml
```
### 2) Smoke tests
From any machine with a `kubectl` context:
```bash
kubectl get pods -n monitoring
kubectl get pods -n logging
kubectl get pods -n mcp
# port-forward blocks, so run it in a separate terminal (or background it with &)
kubectl port-forward -n mcp svc/mcp-telemetry-bridge 8080:80
curl http://localhost:8080/mcp/schema | jq .
curl http://localhost:8080/healthz
```
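Instead of eyeballing `kubectl get pods`, the readiness check can be scripted. The sketch below assumes the three namespaces used in the commands above and uses `kubectl wait`; `wait_for_pods`, `NAMESPACES`, and the overridable `KUBECTL` variable are hypothetical names added for illustration. Pods that have already completed (for example from one-off Jobs) never become Ready, so the wait may time out in a namespace that contains them.

```bash
# KUBECTL is overridable (e.g. KUBECTL=echo) so the loop can be dry-run.
KUBECTL="${KUBECTL:-kubectl}"
NAMESPACES="monitoring logging mcp"

# Block until every pod in each namespace is Ready, or the timeout expires.
# Usage: wait_for_pods [timeout]   (default 120s)
wait_for_pods() {
  timeout="${1:-120s}"
  for ns in $NAMESPACES; do
    $KUBECTL wait --for=condition=Ready pods --all -n "$ns" --timeout="$timeout" || return 1
  done
}
```

Running `KUBECTL=echo wait_for_pods` prints the commands without contacting a cluster, which is a cheap way to review what would be executed.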
### 3) Point your LLM agent
Point your agent's MCP client at the bridge endpoint (ClusterIP, Ingress, or port-forward).

**Implemented tools:**
| Tool | Description |
|------|-------------|
| `promql.query` | Run a PromQL expression against Prometheus |
| `loki.query` | Run a LogQL query against Loki |
| `k8s.get` | Fetch Kubernetes objects (pods, nodes, deployments, etc.) |
| `k8s.events` | List cluster or namespace events |
| `inventory.snapshot` | JSON snapshot of nodes, namespaces, and workloads |

**Saved resources** (via `/mcp/resource?uri=...`):
- `res://dashboards/top-pods-by-cpu.promql`
- `res://dashboards/pod-restarts.promql`
- `res://dashboards/warn-events.logql`
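
A sketch for fetching the saved resources listed above, assuming the bridge is reachable on `127.0.0.1:8080` (for example through the port-forward from the smoke tests). `saved_uris`, `fetch_resource`, and `fetch_all` are hypothetical helper names. `curl -G --data-urlencode` is used so the `res://` URI is percent-encoded in the query string rather than pasted in raw.

```bash
BRIDGE_URL="${BRIDGE_URL:-http://127.0.0.1:8080}"

# The three saved-resource URIs listed in this README.
saved_uris() {
  printf '%s\n' \
    'res://dashboards/top-pods-by-cpu.promql' \
    'res://dashboards/pod-restarts.promql' \
    'res://dashboards/warn-events.logql'
}

# Fetch one resource; -G with --data-urlencode sends the URI as an
# encoded query parameter on a GET request.
fetch_resource() {
  curl -sS -G "${BRIDGE_URL%/}/mcp/resource" --data-urlencode "uri=$1"
}

# Fetch all saved resources in turn (requires a running bridge).
fetch_all() {
  saved_uris | while IFS= read -r uri; do
    printf '== %s\n' "$uri"
    fetch_resource "$uri"
    printf '\n'
  done
}
```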
> The bridge currently exposes an HTTP schema approximation (`/mcp/schema`, `/tools/...`). Full MCP transport (stdio/SSE) is planned — see [INTENT.md](INTENT.md).
## Repo layout
```
tele-mcp/
  INTENT.md                 # Project north star: goals, scope, current state
  ansible/                  # Bootstrap playbook and roles
  helm/
    values/                 # Chart values for monitoring, logging, OTel
    mcp-telemetry-bridge/   # Bridge Helm chart
  mcp-telemetry-bridge/     # FastAPI bridge application
    scripts/                # run-local.sh, verify-local.sh
    tests/                  # pytest smoke tests
  environments/             # Per-environment overrides
  wiki/                     # Extended project and design docs
```
## Documentation
| Document | Purpose |
|----------|---------|
| [INTENT.md](INTENT.md) | Goals, principles, scope, success criteria |
| [wiki/TeleMcpProject.md](wiki/TeleMcpProject.md) | Project overview and audience |
| [wiki/TeleMcpBlueprint.md](wiki/TeleMcpBlueprint.md) | Component rationale and bridge design |
| [environments/dev/README.md](environments/dev/README.md) | Dev environment notes |
## Security
- MCP bridge ServiceAccount is read-only (`get` / `list` / `watch` only)
- NetworkPolicy limits bridge egress to Prometheus and Loki
- Consider mTLS or OIDC if exposing the bridge outside the cluster
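
The read-only claim can be spot-checked with `kubectl auth can-i` and impersonation. This is a sketch under assumptions: the ServiceAccount name is not stated in this README, so `mcp-telemetry-bridge` in the example is a guess to be checked against the Helm chart, and `sa_subject` and `check_read_only` are hypothetical helpers. The caller also needs permission to impersonate ServiceAccounts. If the bullets above hold, the read verbs should answer `yes` and the write verbs `no`.

```bash
# KUBECTL is overridable (e.g. KUBECTL=echo) so the loop can be dry-run.
KUBECTL="${KUBECTL:-kubectl}"

# Build the impersonation subject for a ServiceAccount.
sa_subject() {
  printf 'system:serviceaccount:%s:%s' "$1" "$2"
}

# Print the API server's yes/no answer for read and write verbs on pods.
# Usage: check_read_only <namespace> <serviceaccount>
check_read_only() {
  subject="$(sa_subject "$1" "$2")"
  for verb in get list watch create update patch delete; do
    printf '%-6s ' "$verb"
    # can-i exits non-zero on "no"; keep going so every verb is reported.
    $KUBECTL auth can-i "$verb" pods --as="$subject" -n "$1" || true
  done
}

# Example; the ServiceAccount name is an assumption, check the Helm chart:
#   check_read_only mcp mcp-telemetry-bridge
```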
## Current limitations
See [INTENT.md — Current State](INTENT.md#current-state-as-of-initial-scaffold) for the full list. Notable gaps:
- Bridge container image is a placeholder (`ghcr.io/example/telemcp-bridge`)
- No Alertmanager integration in the bridge yet
- Host-level signals (systemd, certs, firewall) are deferred to a future DaemonSet sidecar