# TeleMcp — Project Intent

> **Mission control for Kubernetes hosts, exposed to LLM agents through MCP.**

TeleMcp is a self-contained observability stack that deploys onto a Linux Kubernetes host and surfaces metrics, logs, and cluster state through a single **Model Context Protocol (MCP)** bridge. The goal is to let an autonomous agent — or a human with an agent — **bootstrap, monitor, triage, and operate** a box without bespoke integrations or constant human supervision.

This document anchors what we are building, why, and what is in scope. When in doubt, prefer the simplest path that gives an agent reliable, read-only situational awareness.

---

## Problem

Operating a Kubernetes host means juggling many signals across many systems: node health, workload status, logs, alerts, certificates, systemd units, and more. Humans use Grafana dashboards, `kubectl`, and ad-hoc PromQL/LogQL. LLM agents need the same information, but through a **standardized, safe interface** — not raw shell access.

TeleMcp closes that gap by:

1. **Collecting** telemetry with proven CNCF/Grafana stack components.
2. **Deploying** the stack repeatably via Ansible + Helm.
3. **Bridging** everything to agents through one MCP server with resources, tools, and prompts.

---

## Vision

A single `ansible-playbook` (or equivalent) turns a bare k3s/kubeadm host into a monitored, agent-ready environment. An LLM agent connects to the MCP bridge and can answer questions like:

- *What is unhealthy right now?*
- *Which pods are crash-looping and why?*
- *Is disk or memory pressure building?*
- *What changed in the cluster since yesterday?*

The agent reasons in **PromQL** and **LogQL** — industry-standard query languages — and calls parameterized tools rather than scraping raw APIs itself.

---

## Design Principles

| Principle | What it means |
|-----------|---------------|
| **Read-only by default** | The MCP bridge and its ServiceAccount only `get`/`list`/`watch`. No cluster mutations through this path. |
| **Standard stack** | Prometheus, Loki, kube-state-metrics, node-exporter — not custom collectors unless necessary. |
| **MCP as the interface** | One bridge, one contract. Agents do not talk to Prometheus/Loki/K8s APIs directly. |
| **Deployable in one shot** | Ansible playbook + Helm charts; no manual chart-by-chart assembly. |
| **Least privilege** | RBAC scoped to observation; NetworkPolicy limits egress; consider mTLS/OIDC for external exposure. |
| **Agent-first ergonomics** | Pre-built resources (saved queries), tools (parameterized calls), and prompts (triage playbooks). |
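
The read-only principle can be checked mechanically. The following is a minimal sketch, assuming the bridge's ClusterRole rules have been loaded into Python dictionaries shaped like Kubernetes RBAC `rules` entries; the helper name `mutating_verbs` and the example rules are hypothetical and not taken from the repository.

```python
# Verbs the bridge ServiceAccount is allowed to hold, per the table above.
READ_ONLY_VERBS = frozenset({"get", "list", "watch"})


def mutating_verbs(rules):
    """Return the sorted verbs in `rules` that fall outside get/list/watch.

    A wildcard verb ("*") is reported as well, since it would grant writes.
    """
    found = set()
    for rule in rules:
        found.update(v for v in rule.get("verbs", []) if v not in READ_ONLY_VERBS)
    return sorted(found)


# Hypothetical rules, for illustration only.
example_rules = [
    {"apiGroups": [""], "resources": ["pods", "nodes", "events"],
     "verbs": ["get", "list", "watch"]},
    {"apiGroups": ["apps"], "resources": ["deployments"],
     "verbs": ["get", "list", "watch"]},
]

# An empty result means the rules are observation-only.
assert mutating_verbs(example_rules) == []
```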

---

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                   LLM Agent (MCP client)                    │
└──────────────────────────┬──────────────────────────────────┘
                           │ MCP (resources / tools / prompts)
┌──────────────────────────▼──────────────────────────────────┐
│       mcp-telemetry-bridge (FastAPI, namespace: mcp)        │
│     Read-only proxy to Prometheus, Loki, Kubernetes API     │
└──────┬─────────────────┬───────────────────┬────────────────┘
       │                 │                   │
┌──────▼──────┐  ┌───────▼───────┐  ┌────────▼────────┐
│ Prometheus  │  │ Loki          │  │ Kubernetes API  │
│ Alertmanager│  │ Promtail      │  │ (in-cluster SA) │
│ Grafana     │  │               │  │                 │
│ KSM         │  │               │  │                 │
│ node-export │  │               │  │                 │
└─────────────┘  └───────────────┘  └─────────────────┘
monitoring namespace   logging namespace
```
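
The proxy role shown in the diagram can be sketched as two URL builders, one per upstream. This is an illustration under stated assumptions, not the bridge's actual code: the service hostnames and function names are hypothetical. The paths `/api/v1/query` and `/loki/api/v1/query_range` are the standard Prometheus and Loki HTTP query endpoints.

```python
from urllib.parse import urlencode

# Hypothetical in-cluster service URLs; real names depend on the Helm releases.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"
LOKI_URL = "http://loki.logging.svc:3100"


def prometheus_query_url(promql, base=PROMETHEUS_URL):
    """Build a GET URL for a Prometheus instant query."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


def loki_query_range_url(logql, start_ns, end_ns, limit=100, base=LOKI_URL):
    """Build a GET URL for a Loki range query (timestamps in nanoseconds)."""
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base}/loki/api/v1/query_range?{urlencode(params)}"


# Both builders only produce GET requests, which keeps this path read-only.
print(prometheus_query_url("up == 0"))
print(loki_query_range_url('{namespace="mcp"}', 1, 2))
```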

**Optional:** OpenTelemetry Collector for OTLP fan-out to Prometheus remote-write and Loki.

**Future:** Host-level DaemonSet sidecar for systemd status, package/cert checks, and other node facts not available through K8s metrics alone.

---

## What We Capture

### Minimum viable (current target)

**Kubernetes**
- Cluster & node status, conditions, taints
- Namespaces, Deployments, StatefulSets, DaemonSets, Pods (phase, restarts, images)
- Services, Events (especially Warning/Error)
- Resource usage via Prometheus/cAdvisor/kube-state-metrics

**Logs & alerts**
- Pod and node logs via Loki/Promtail
- Default alert rules: node not ready, CrashLoopBackOff, API/etcd degradation, job failures

**Bridge surface**
- Tools: `promql.query`, `loki.query`, `k8s.get`, `k8s.events`, `inventory.snapshot`
- Resources: saved PromQL/LogQL queries, cluster inventory snapshots
- Prompts: triage and operational playbooks
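
A parameterized tool call can be validated before it reaches any backend. The sketch below uses the tool names from the list above; the required parameter names (`query`, `kind`) and the helper `validate_call` are assumptions made for illustration and may not match the bridge's real schema.

```python
# Required parameters per tool. Tool names come from the bridge surface list;
# the parameter names are illustrative assumptions.
TOOL_PARAMS = {
    "promql.query": {"query"},
    "loki.query": {"query"},
    "k8s.get": {"kind"},
    "k8s.events": set(),
    "inventory.snapshot": set(),
}


def validate_call(tool, arguments):
    """Return a list of problems with a tool call; an empty list means it is acceptable."""
    if tool not in TOOL_PARAMS:
        # Anything outside the fixed surface is rejected outright.
        return [f"unknown tool: {tool}"]
    missing = sorted(TOOL_PARAMS[tool] - set(arguments))
    return [f"missing parameter: {name}" for name in missing]


print(validate_call("promql.query", {"query": "up"}))
print(validate_call("k8s.get", {}))
```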

### Stretch (explicitly deferred)

- Host OS depth: systemd units, package updates, cert expiry, firewall summary, NTP drift
- `systemd.status`, `tail.pod_logs` tools
- Alertmanager API integration for active-alert summaries
- Full MCP transport (stdio/SSE) vs. current HTTP schema approximation
- Multi-cluster federation
- Write/mutate operations (out of scope unless a separate, gated path is designed)

---

## Repository Layout

| Path | Role |
|------|------|
| `ansible/` | Bootstrap: install Helm, deploy all charts |
| `helm/values/` | Opinionated values for kube-prometheus-stack, Loki, OTel |
| `helm/mcp-telemetry-bridge/` | Bridge chart: Deployment, RBAC, Service, NetworkPolicy |
| `mcp-telemetry-bridge/` | FastAPI application implementing the MCP surface |
| `environments/` | Per-environment overrides and notes |
| `wiki/` | Extended design notes and blueprint |

---

## Current State (as of initial scaffold)

**Done**
- Ansible playbook with `k8s_host` + `telemetry_stack` roles
- Helm values for monitoring, logging, optional OTel collector
- MCP bridge service with core tools and saved-query resources
- Read-only ClusterRole/Binding for the bridge ServiceAccount
- NetworkPolicy skeleton for the bridge
- Health check and `/mcp/schema` discovery endpoint

**Not yet done / known gaps**
- Bridge image is a placeholder (`ghcr.io/example/telemcp-bridge`); needs CI build and publish
- MCP interface is HTTP REST-shaped, not full MCP protocol transport
- Prompts: only `Triage-Now` stub; missing `Capacity-Check`, `CrashLoop-Playbook`
- No Alertmanager integration in the bridge
- No metrics-server chart (useful for `kubectl top` semantics)
- No host-level DaemonSet sidecar for systemd/OS signals
- NetworkPolicy egress may need an allowance for the Kubernetes API (port 443)
- Wiki and README are currently aligned to INTENT; keep them updated when scope shifts

---

## Success Criteria

We know TeleMcp is working when:

1. `ansible-playbook` brings up monitoring, logging, and bridge namespaces with healthy pods.
2. A `curl` request to the bridge's `/mcp/schema` endpoint returns resources, tools, and prompts.
3. An MCP-capable agent can query PromQL, run LogQL, list cluster objects, and pull an inventory snapshot **without direct API credentials**.
4. Default alert rules fire on induced failures (node pressure, crash loop) and the agent can triage them via bridge tools.
5. The entire stack redeploys cleanly on a fresh Ubuntu 24.04 + k3s/kubeadm host.
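
The schema criterion above lends itself to an automated smoke check. This is a minimal sketch, assuming the `/mcp/schema` response is a JSON object with top-level `resources`, `tools`, and `prompts` entries; that response shape and the helper name `missing_sections` are assumptions, not confirmed details of the bridge.

```python
import json

# Sections the discovery endpoint is expected to expose (assumed top-level keys).
REQUIRED_SECTIONS = ("resources", "tools", "prompts")


def missing_sections(schema_json):
    """Return the required sections that are absent or empty in the schema body."""
    schema = json.loads(schema_json)
    return [name for name in REQUIRED_SECTIONS if not schema.get(name)]


# Hypothetical response body, for illustration only.
sample = '{"resources": ["saved-queries"], "tools": ["promql.query"], "prompts": []}'
print(missing_sections(sample))
```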

---

## Non-Goals

- Replacing Grafana or building a custom metrics database
- Providing arbitrary shell/exec access to the cluster
- Mutating cluster state (deploy, scale, delete) through the bridge
- Supporting non-Linux or non-Kubernetes targets in v1
- Vendor-specific APM (Datadog, New Relic, etc.) — OTel fan-out is the extension point

---

## How to Use This Document

- **Prioritize work** against the "Current State" gaps and "Minimum viable" capture list.
- **Reject scope creep** that does not serve agent observability or repeatable deployment.
- **Update this file** when intent shifts — e.g., adding write paths, new environments, or MCP transport changes.

For operational quick-start, see [README.md](README.md).
For detailed component rationale, see [wiki/TeleMcpBlueprint.md](wiki/TeleMcpBlueprint.md).