
# TeleMcp Project

*Telemetry for autonomous control*

## What is TeleMcp?

TeleMcp is **mission control for Kubernetes hosts**. It collects health, performance, and alert signals from a Kubernetes cluster running on Linux and exposes them through a single **Model Context Protocol (MCP)** interface, so LLM agents can see what is happening, triage problems, and help keep systems running without constant human supervision.

The project name reflects its two halves:

- **Tele** — telemetry: metrics, logs, events, and cluster inventory
- **MCP** — the standardized bridge between observability backends and LLM agents

## Who is it for?

- **Operators** who want repeatable, one-command observability on a k3s or kubeadm host
- **LLM agent builders** who need a safe, read-only API for cluster situational awareness
- **Developers** running local or edge Kubernetes who want agent-assisted monitoring without wiring up bespoke integrations

## What problem does it solve?

Running a Kubernetes host means tracking signals across many systems. Humans reach for Grafana, `kubectl`, and ad-hoc PromQL. Agents need the same information through a **standardized, safe contract** — not raw shell access or scattered API credentials.

TeleMcp solves this in three steps:

1. **Collect** — deploy Prometheus, Loki, and supporting exporters via Helm
2. **Deploy** — bootstrap everything with a single Ansible playbook
3. **Bridge** — expose resources, tools, and prompts through `mcp-telemetry-bridge`

## What can an agent do today?

With the current scaffold, an agent connected to the bridge can:

- Query Prometheus with `promql.query`
- Search logs with `loki.query`
- Inspect Kubernetes objects with `k8s.get` and `k8s.events`
- Pull a cluster inventory snapshot with `inventory.snapshot`
- Use pre-built PromQL/LogQL resources for common triage queries
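
As a rough illustration of the agent side, the sketch below builds a request for one of the tools listed above and rejects names outside that list. Only the tool names come from this page. The `ToolCall` class, the `make_call` helper, the payload keys (`tool`, `arguments`), and the `query` argument are hypothetical and may not match the bridge's actual request format.

```python
from dataclasses import dataclass, field
from typing import Any

# Tool names taken from the list above; everything else is an assumption.
KNOWN_TOOLS = frozenset({
    "promql.query",
    "loki.query",
    "k8s.get",
    "k8s.events",
    "inventory.snapshot",
})


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical container for one tool invocation."""
    tool: str
    arguments: dict[str, Any] = field(default_factory=dict)

    def to_payload(self) -> dict[str, Any]:
        # Assumed wire shape; the real bridge may expect different keys.
        return {"tool": self.tool, "arguments": dict(self.arguments)}


def make_call(tool: str, **arguments: Any) -> ToolCall:
    """Build a call, refusing any tool that is not in the known set."""
    if tool not in KNOWN_TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")
    return ToolCall(tool=tool, arguments=arguments)


# Example: a PromQL instant query for the standard `up` metric.
payload = make_call("promql.query", query="up").to_payload()
```

Validating the tool name on the client is only a convenience; the bridge itself would still need to enforce which tools exist.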

## What is planned?

Stretch goals, explicitly deferred in v1, include host-level signals (systemd status, cert expiry, firewall summary), Alertmanager integration, additional prompts (`Capacity-Check`, `CrashLoop-Playbook`), and full MCP transport. See [INTENT.md](../INTENT.md) for the authoritative scope list.

## Design principles

| Principle | Summary |
|-----------|---------|
| Read-only by default | No cluster mutations through the bridge |
| Standard stack | CNCF/Grafana components, not custom collectors |
| MCP as the interface | One bridge, one contract for agents |
| Deployable in one shot | Ansible + Helm, no manual assembly |
| Least privilege | Scoped RBAC and NetworkPolicy |
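
One way to spot-check the read-only and least-privilege principles is to scan the bridge's RBAC rules for verbs other than `get`, `list`, and `watch`. The sketch below assumes the rules have already been loaded into dicts shaped like the `rules` list of a Kubernetes Role manifest. The function name and the sample rules are hypothetical and are not taken from the repository.

```python
from typing import Any, Iterable, Mapping

# Kubernetes RBAC verbs that only read state.
READ_ONLY_VERBS = frozenset({"get", "list", "watch"})


def mutating_verbs(rules: Iterable[Mapping[str, Any]]) -> list[str]:
    """Return the sorted verbs in `rules` that are not read-only.

    A wildcard verb ("*") counts as mutating because it grants everything.
    """
    found: set[str] = set()
    for rule in rules:
        found.update(v for v in rule.get("verbs", []) if v not in READ_ONLY_VERBS)
    return sorted(found)


# Hypothetical rule sets for illustration only.
read_only_rules = [
    {"apiGroups": [""], "resources": ["pods", "events"], "verbs": ["get", "list", "watch"]},
]
too_broad_rules = [
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "delete"]},
    {"apiGroups": ["apps"], "resources": ["deployments"], "verbs": ["*"]},
]
```

An empty result means every granted verb is read-only; anything else names the verbs that would need review.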

## Repository map

| Path | Contents |
|------|----------|
| [INTENT.md](../INTENT.md) | North star — goals, scope, current state |
| [README.md](../README.md) | Quick start and operational guide |
| [TeleMcpBlueprint.md](TeleMcpBlueprint.md) | Architecture and component rationale |
| `ansible/` | Bootstrap playbook |
| `helm/` | Chart values and bridge chart |
| `mcp-telemetry-bridge/` | FastAPI bridge source |

## Success criteria

TeleMcp is working when:

1. `ansible-playbook` brings up healthy pods in the `monitoring`, `logging`, and `mcp` namespaces
2. `/mcp/schema` returns resources, tools, and prompts
3. An agent can query metrics, logs, and cluster state without direct API credentials
4. Default alert rules fire on induced failures and the agent can triage them
5. The stack redeploys cleanly on a fresh Ubuntu 24.04 + k3s/kubeadm host
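
Criterion 2 lends itself to an automated check. The sketch below takes an already-parsed schema document and reports which of the three sections are missing or empty. It assumes `resources`, `tools`, and `prompts` appear as top-level keys in the `/mcp/schema` response, which this page does not confirm. The function name is hypothetical.

```python
from typing import Any, Mapping

# Section names follow criterion 2; treating them as top-level JSON keys
# is an assumption about the bridge's response shape.
REQUIRED_SECTIONS = ("resources", "tools", "prompts")


def missing_sections(schema: Mapping[str, Any]) -> list[str]:
    """Return the required sections that are absent or empty in `schema`."""
    return [name for name in REQUIRED_SECTIONS if not schema.get(name)]


# Hypothetical parsed response with an empty prompts section.
sample = {
    "resources": [{"name": "example-resource"}],
    "tools": [{"name": "promql.query"}],
    "prompts": [],
}
problems = missing_sections(sample)
```

In practice the input would come from decoding the JSON body returned by the bridge, and an empty result would indicate that the criterion is met.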