Seeded intent and wiki pages

2026-06-22 19:09:24 +02:00
parent b181465564
commit 129a229e38
4 changed files with 496 additions and 21 deletions

171
INTENT.md Normal file

@@ -0,0 +1,171 @@
# TeleMcp — Project Intent
> **Mission control for Kubernetes hosts, exposed to LLM agents through MCP.**
TeleMcp is a self-contained observability stack that deploys onto a Linux Kubernetes host and surfaces metrics, logs, and cluster state through a single **Model Context Protocol (MCP)** bridge. The goal is to let an autonomous agent — or a human with an agent — **bootstrap, monitor, triage, and operate** a box without bespoke integrations or constant human supervision.
This document anchors what we are building, why, and what is in scope. When in doubt, prefer the simplest path that gives an agent reliable, read-only situational awareness.
---
## Problem
Operating a Kubernetes host means juggling many signals across many systems: node health, workload status, logs, alerts, certificates, systemd units, and more. Humans use Grafana dashboards, `kubectl`, and ad-hoc PromQL/LogQL. LLM agents need the same information, but through a **standardized, safe interface** — not raw shell access.
TeleMcp closes that gap by:
1. **Collecting** telemetry with proven CNCF/Grafana stack components.
2. **Deploying** the stack repeatably via Ansible + Helm.
3. **Bridging** everything to agents through one MCP server with resources, tools, and prompts.
---
## Vision
A single `ansible-playbook` run (or equivalent) turns a bare k3s/kubeadm host into a monitored, agent-ready environment. An LLM agent connects to the MCP bridge and can answer questions like:
- *What is unhealthy right now?*
- *Which pods are crash-looping and why?*
- *Is disk or memory pressure building?*
- *What changed in the cluster since yesterday?*
The agent reasons in **PromQL** and **LogQL** — industry-standard query languages — and calls parameterized tools rather than scraping raw APIs itself.
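
As a rough illustration of that mapping, the sketch below pairs three such questions with `promql.query` calls. The metric names come from kube-state-metrics and node-exporter; the `QUESTION_TO_EXPR` table, the `tool_call` helper, the 10% disk threshold, and the call envelope (`tool` / `arguments`) are assumptions made for this example, not the bridge's actual wire format.

```python
# Hypothetical mapping from operator questions to PromQL expressions.
QUESTION_TO_EXPR = {
    "crash-looping pods": 'kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0',
    "nodes not ready": 'kube_node_status_condition{condition="Ready",status="true"} == 0',
    "disk pressure": "node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10",
}


def tool_call(question):
    """Wrap the PromQL for a known question in a promql.query call (assumed envelope)."""
    return {"tool": "promql.query", "arguments": {"expr": QUESTION_TO_EXPR[question]}}


print(tool_call("crash-looping pods"))
```
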
---
## Design Principles
| Principle | What it means |
|-----------|---------------|
| **Read-only by default** | The MCP bridge and its ServiceAccount only `get`/`list`/`watch`. No cluster mutations through this path. |
| **Standard stack** | Prometheus, Loki, kube-state-metrics, node-exporter — not custom collectors unless necessary. |
| **MCP as the interface** | One bridge, one contract. Agents do not talk to Prometheus/Loki/K8s APIs directly. |
| **Deployable in one shot** | Ansible playbook + Helm charts; no manual chart-by-chart assembly. |
| **Least privilege** | RBAC scoped to observation; NetworkPolicy limits egress; consider mTLS/OIDC for external exposure. |
| **Agent-first ergonomics** | Pre-built resources (saved queries), tools (parameterized calls), and prompts (triage playbooks). |
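
The read-only principle can be checked mechanically. The following is a minimal sketch, assuming RBAC rules are available as plain dictionaries shaped like the `rules` list of a ClusterRole manifest; `mutating_verbs` and the sample `bridge_rules` are hypothetical and not taken from the repository's chart.

```python
READ_ONLY_VERBS = frozenset({"get", "list", "watch"})


def mutating_verbs(rules):
    """Return the sorted verbs in the given RBAC rules that fall outside get/list/watch."""
    found = set()
    for rule in rules:
        found.update(v for v in rule.get("verbs", []) if v not in READ_ONLY_VERBS)
    return sorted(found)


# Hypothetical rules for illustration only.
bridge_rules = [
    {"apiGroups": [""], "resources": ["pods", "nodes", "events"], "verbs": ["get", "list", "watch"]},
    {"apiGroups": ["apps"], "resources": ["deployments"], "verbs": ["get", "list", "watch"]},
]

# An empty list means the rules stay within the read-only contract.
print(mutating_verbs(bridge_rules))
```
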
---
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                    LLM Agent (MCP client)                   │
└──────────────────────────┬──────────────────────────────────┘
                           │ MCP (resources / tools / prompts)
┌──────────────────────────▼──────────────────────────────────┐
│       mcp-telemetry-bridge (FastAPI, namespace: mcp)        │
│     Read-only proxy to Prometheus, Loki, Kubernetes API     │
└──────┬─────────────────┬────────────────────┬───────────────┘
       │                 │                    │
┌──────▼──────┐  ┌───────▼───────┐   ┌────────▼────────┐
│ Prometheus  │  │ Loki          │   │ Kubernetes API  │
│ Alertmanager│  │ Promtail      │   │ (in-cluster SA) │
│ Grafana     │  │               │   │                 │
│ KSM         │  │               │   │                 │
│ node-export │  │               │   │                 │
└─────────────┘  └───────────────┘   └─────────────────┘
monitoring namespace   logging namespace
```
**Optional:** OpenTelemetry Collector for OTLP fan-out to Prometheus remote-write and Loki.
**Future:** Host-level DaemonSet sidecar for systemd status, package/cert checks, and other node facts not available through K8s metrics alone.
---
## What We Capture
### Minimum viable (current target)
**Kubernetes**
- Cluster & node status, conditions, taints
- Namespaces, Deployments, StatefulSets, DaemonSets, Pods (phase, restarts, images)
- Services, Events (especially Warning/Error)
- Resource usage via Prometheus/cAdvisor/kube-state-metrics
**Logs & alerts**
- Pod and node logs via Loki/Promtail
- Default alert rules: node not ready, CrashLoopBackOff, API/etcd degradation, job failures
**Bridge surface**
- Tools: `promql.query`, `loki.query`, `k8s.get`, `k8s.events`, `inventory.snapshot`
- Resources: saved PromQL/LogQL queries, cluster inventory snapshots
- Prompts: triage and operational playbooks
### Stretch (explicitly deferred)
- Host OS depth: systemd units, package updates, cert expiry, firewall summary, NTP drift
- `systemd.status`, `tail.pod_logs` tools
- Alertmanager API integration for active-alert summaries
- Full MCP transport (stdio/SSE) to replace the current HTTP schema approximation
- Multi-cluster federation
- Write/mutate operations (out of scope unless a separate, gated path is designed)
---
## Repository Layout
| Path | Role |
|------|------|
| `ansible/` | Bootstrap: install Helm, deploy all charts |
| `helm/values/` | Opinionated values for kube-prometheus-stack, Loki, OTel |
| `helm/mcp-telemetry-bridge/` | Bridge chart: Deployment, RBAC, Service, NetworkPolicy |
| `mcp-telemetry-bridge/` | FastAPI application implementing the MCP surface |
| `environments/` | Per-environment overrides and notes |
| `wiki/` | Extended design notes and blueprint |
---
## Current State (as of initial scaffold)
**Done**
- Ansible playbook with `k8s_host` + `telemetry_stack` roles
- Helm values for monitoring, logging, optional OTel collector
- MCP bridge service with core tools and saved-query resources
- Read-only ClusterRole/Binding for the bridge ServiceAccount
- NetworkPolicy skeleton for the bridge
- Health check and `/mcp/schema` discovery endpoint
**Not yet done / known gaps**
- Bridge image is a placeholder (`ghcr.io/example/telemcp-bridge`); needs CI build and publish
- MCP interface is HTTP REST-shaped, not full MCP protocol transport
- Prompts: only `Triage-Now` stub; missing `Capacity-Check`, `CrashLoop-Playbook`
- No Alertmanager integration in the bridge
- No metrics-server chart (useful for `kubectl top` semantics)
- No host-level DaemonSet sidecar for systemd/OS signals
- NetworkPolicy egress may need K8s API (443) allowance
- Wiki and README are aligned to INTENT as of this commit; keeping them updated when scope shifts is ongoing work
---
## Success Criteria
We know TeleMcp is working when:
1. `ansible-playbook` brings up monitoring, logging, and bridge namespaces with healthy pods.
2. `curl http://localhost:8080/mcp/schema` (through a port-forward to the bridge Service) returns resources, tools, and prompts.
3. An MCP-capable agent can query PromQL, run LogQL, list cluster objects, and pull an inventory snapshot **without direct API credentials**.
4. Default alert rules fire on induced failures (node pressure, crash loop) and the agent can triage them via bridge tools.
5. The entire stack redeploys cleanly on a fresh Ubuntu 24.04 + k3s/kubeadm host.
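
Criterion 2 lends itself to an automated check. The sketch below assumes the `/mcp/schema` response is a JSON object with `resources`, `tools`, and `prompts` lists, and that each tool entry carries a `name` field; that payload shape and the `check_schema` helper are assumptions, since the actual schema format is not shown here.

```python
EXPECTED_TOOLS = {"promql.query", "loki.query", "k8s.get", "k8s.events", "inventory.snapshot"}


def check_schema(schema):
    """Return a list of problems found in an already-decoded /mcp/schema payload."""
    problems = []
    for section in ("resources", "tools", "prompts"):
        if not schema.get(section):
            problems.append(f"missing or empty section: {section}")
    names = {tool.get("name") for tool in schema.get("tools", [])}
    for tool in sorted(EXPECTED_TOOLS - names):
        problems.append(f"missing tool: {tool}")
    return problems


# Hypothetical payload for illustration only.
sample_schema = {
    "resources": [{"uri": "res://dashboards/pod-restarts.promql"}],
    "tools": [{"name": name} for name in sorted(EXPECTED_TOOLS)],
    "prompts": [{"name": "Triage-Now"}],
}
print(check_schema(sample_schema))
```
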
---
## Non-Goals
- Replacing Grafana or building a custom metrics database
- Providing arbitrary shell/exec access to the cluster
- Mutating cluster state (deploy, scale, delete) through the bridge
- Supporting non-Linux or non-Kubernetes targets in v1
- Vendor-specific APM (Datadog, New Relic, etc.) — OTel fan-out is the extension point
---
## How to Use This Document
- **Prioritize work** against the "Current State" gaps and "Minimum viable" capture list.
- **Reject scope creep** that does not serve agent observability or repeatable deployment.
- **Update this file** when intent shifts — e.g., adding write paths, new environments, or MCP transport changes.
For operational quick-start, see [README.md](README.md).
For detailed component rationale, see [wiki/TeleMcpBlueprint.md](wiki/TeleMcpBlueprint.md).

README.md

@@ -1,55 +1,103 @@
# TeleMcp
Telemetry + MCP bridge that auto-deploys on a Linux-based Kubernetes host via **Ansible + Helm**.
It exposes read-only metrics, logs, and k8s object state through an **MCP server** so an LLM agent can bootstrap, monitor, and operate the host.
**Mission control for Kubernetes hosts, exposed to LLM agents through MCP.**
TeleMcp deploys a standard observability stack onto a Linux Kubernetes host via **Ansible + Helm**, then surfaces metrics, logs, and cluster state through a read-only **MCP bridge** so an LLM agent can bootstrap, monitor, triage, and operate the box.
> For project goals, scope, and design principles, see **[INTENT.md](INTENT.md)**.
## Components
- **kube-prometheus-stack** (Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics)
- **Loki + Promtail** (logs)
- **OpenTelemetry Collector** (optional fan-out)
- **mcp-telemetry-bridge** (FastAPI service exposing MCP resources/tools/prompts)
| Component | Namespace | Role |
|-----------|-----------|------|
| **kube-prometheus-stack** | `monitoring` | Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics |
| **Loki + Promtail** | `logging` | Log aggregation and shipping |
| **OpenTelemetry Collector** | `observability` | Optional OTLP fan-out to Prometheus and Loki |
| **mcp-telemetry-bridge** | `mcp` | FastAPI service exposing MCP resources, tools, and prompts |
## Quick Start
### 0) Prereqs
- Ubuntu 24.04 host with k8s (k3s or kubeadm) reachable and `kubectl` context configured
- Ansible 2.15+ on your control machine
- Helm 3 on the host (the Ansible role installs it if missing)
### 1) Run Ansible
```bash
cd ansible
ansible-playbook -i inventories/local.ini playbook.yml
```
### 2) Smoke tests (from any machine with kubectl context)
### 2) Smoke tests
From any machine with a `kubectl` context:
```bash
kubectl get pods -n monitoring
kubectl get pods -n logging
kubectl get pods -n mcp
kubectl port-forward -n mcp svc/mcp-telemetry-bridge 8080:80
curl http://localhost:8080/mcp/schema | jq .
curl http://localhost:8080/healthz
```
### 3) Point your LLM Agent
Configure your agent's MCP client to the service endpoint (ClusterIP/Ingress).
Use tools:
- `promql.query`
- `loki.query`
- `k8s.get`
- `k8s.events`
- `inventory.snapshot`
### 3) Point your LLM agent
Point your agent's MCP client at the bridge endpoint (ClusterIP, Ingress, or port-forward).
**Implemented tools:**
| Tool | Description |
|------|-------------|
| `promql.query` | Run a PromQL expression against Prometheus |
| `loki.query` | Run a LogQL query against Loki |
| `k8s.get` | Fetch Kubernetes objects (pods, nodes, deployments, etc.) |
| `k8s.events` | List cluster or namespace events |
| `inventory.snapshot` | JSON snapshot of nodes, namespaces, and workloads |
**Saved resources** (via `/mcp/resource?uri=...`):
- `res://dashboards/top-pods-by-cpu.promql`
- `res://dashboards/pod-restarts.promql`
- `res://dashboards/warn-events.logql`
> The bridge currently exposes an HTTP schema approximation (`/mcp/schema`, `/tools/...`). Full MCP transport (stdio/SSE) is planned — see [INTENT.md](INTENT.md).
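
Because a `res://` URI contains `:` and `/`, it should be percent-encoded when passed as the `uri` query parameter. A minimal sketch using only the standard library follows; the `resource_url` helper and the `localhost:8080` base (matching the port-forward above) are illustrative assumptions.

```python
from urllib.parse import urlencode


def resource_url(base_url, uri):
    """Build the /mcp/resource URL for a saved resource, percent-encoding the res:// URI."""
    return f"{base_url.rstrip('/')}/mcp/resource?{urlencode({'uri': uri})}"


print(resource_url("http://localhost:8080", "res://dashboards/pod-restarts.promql"))
```

The resulting URL can be passed to `curl` or any HTTP client in place of a hand-built query string.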
## Repo layout
```
tele-mcp/
ansible/
INTENT.md # Project north star — goals, scope, current state
ansible/ # Bootstrap playbook and roles
helm/
mcp-telemetry-bridge/
environments/
values/ # Chart values for monitoring, logging, OTel
mcp-telemetry-bridge/ # Bridge Helm chart
mcp-telemetry-bridge/ # FastAPI bridge application
environments/ # Per-environment overrides
wiki/ # Extended project and design docs
```
## Documentation
| Document | Purpose |
|----------|---------|
| [INTENT.md](INTENT.md) | Goals, principles, scope, success criteria |
| [wiki/TeleMcpProject.md](wiki/TeleMcpProject.md) | Project overview and audience |
| [wiki/TeleMcpBlueprint.md](wiki/TeleMcpBlueprint.md) | Component rationale and bridge design |
| [environments/dev/README.md](environments/dev/README.md) | Dev environment notes |
## Security
- MCP bridge ServiceAccount is read-only (RBAC get/list/watch)
- Optional NetworkPolicy limits egress/ingress
- Consider mTLS/OIDC if exposing outside the cluster
- MCP bridge ServiceAccount is read-only (`get` / `list` / `watch` only)
- NetworkPolicy limits bridge egress to Prometheus and Loki
- Consider mTLS or OIDC if exposing the bridge outside the cluster
## Current limitations
See [INTENT.md — Current State](INTENT.md#current-state-as-of-initial-scaffold) for the full list. Notable gaps:
- Bridge container image is a placeholder (`ghcr.io/example/telemcp-bridge`)
- No Alertmanager integration in the bridge yet
- Host-level signals (systemd, certs, firewall) are deferred to a future DaemonSet sidecar

183
wiki/TeleMcpBlueprint.md Normal file

@@ -0,0 +1,183 @@
# TeleMcp Blueprint
*Building a Kubernetes telemetry MCP bridge*
> **Source:** [Original design conversation](https://chatgpt.com/share/68bdf06d-8c2c-8009-90c5-466f9f531d9a)
> **Authority:** Scope and priorities are governed by [INTENT.md](../INTENT.md). This document explains *why* each component exists and *how* the bridge is shaped.
## Overview
Blueprint for a telemetry service + MCP bridge that auto-deploys on a Linux-based Kubernetes host (k3s or standard k8s) via Ansible + Helm, and exposes everything an LLM agent needs to bootstrap, monitor, and operate the box.
MCP acts as the standardized "USB-C" between the LLM agent and your telemetry — see the [Model Context Protocol spec](https://modelcontextprotocol.io).
---
## What we capture
### Minimum viable (current target)
**Kubernetes (control + workloads)**
- Cluster and node status, taints, conditions, kubelet health
- Namespaces, Deployments, StatefulSets, DaemonSets, Pods (phase, restarts, images, age)
- Services, Events (warning/error)
- Resource usage per pod/node/namespace via Prometheus, cAdvisor, and kube-state-metrics
**Logs and alerts**
- Pod and node logs via Loki/Promtail
- Default alert rules: node not ready, API/etcd degradation, CrashLoopBackOff, job failures
**Bridge surface**
- Tools: `promql.query`, `loki.query`, `k8s.get`, `k8s.events`, `inventory.snapshot`
- Resources: saved PromQL/LogQL queries, cluster inventory snapshots
- Prompts: triage and operational playbooks (`Triage-Now` implemented; others planned)
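
To make the inventory idea concrete, here is one possible way to condense node and pod objects into a snapshot. The input dictionaries follow the field names of the Kubernetes API (`metadata.name`, `status.phase`, `status.containerStatuses[].restartCount`); the output keys and the `inventory_snapshot` function are assumptions for illustration and may differ from what the bridge actually returns.

```python
from collections import Counter


def inventory_snapshot(nodes, pods):
    """Summarize node names, namespaces, pod phases, and pods with restarts."""
    restarts = {}
    for pod in pods:
        key = f'{pod["metadata"]["namespace"]}/{pod["metadata"]["name"]}'
        statuses = pod["status"].get("containerStatuses", [])
        restarts[key] = sum(c.get("restartCount", 0) for c in statuses)
    return {
        "nodes": sorted(n["metadata"]["name"] for n in nodes),
        "namespaces": sorted({p["metadata"]["namespace"] for p in pods}),
        "pod_phases": dict(Counter(p["status"]["phase"] for p in pods)),
        "restarting_pods": {k: v for k, v in restarts.items() if v > 0},
    }


# Hypothetical objects, trimmed to the fields the sketch reads.
sample_nodes = [{"metadata": {"name": "node-a"}}]
sample_pods = [
    {"metadata": {"namespace": "mcp", "name": "bridge-1"},
     "status": {"phase": "Running", "containerStatuses": [{"restartCount": 0}]}},
    {"metadata": {"namespace": "logging", "name": "loki-0"},
     "status": {"phase": "Running", "containerStatuses": [{"restartCount": 3}, {"restartCount": 1}]}},
    {"metadata": {"namespace": "mcp", "name": "job-x"}, "status": {"phase": "Pending"}},
]
print(inventory_snapshot(sample_nodes, sample_pods))
```

A compact summary like this keeps the agent's context small while still pointing at the pods worth a follow-up `k8s.get` or `loki.query`.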
### Stretch (deferred)
**Host (Linux / node)**
- CPU, memory, disk, inode, filesystem, network, NIC errors *(partially covered by node-exporter)*
- Distro/kernel/version, packages/updates
- Systemd unit status for key services (container runtime, kubelet, nginx, etc.)
- Certificates (expiry), time sync status (chrony/ntp)
- Firewall/ports (nftables/ufw summary)
**Additional Kubernetes signals**
- Ingress, Jobs/CronJobs, HPA/VPA
- Throttling and OOM kill detail beyond default metrics
**Additional bridge capabilities**
- `systemd.status`, `tail.pod_logs` tools
- Alertmanager API for active-alert summaries
- Full MCP transport (stdio/SSE) to replace the current HTTP schema approximation
---
## Reference architecture
```
┌─────────────────────────────────────────────────────────────┐
│                    LLM Agent (MCP client)                   │
└──────────────────────────┬──────────────────────────────────┘
                           │ MCP (resources / tools / prompts)
┌──────────────────────────▼──────────────────────────────────┐
│       mcp-telemetry-bridge (FastAPI, namespace: mcp)        │
│     Read-only proxy to Prometheus, Loki, Kubernetes API     │
└──────┬─────────────────┬────────────────────┬───────────────┘
       │                 │                    │
┌──────▼──────┐  ┌───────▼───────┐   ┌────────▼────────┐
│ Prometheus  │  │ Loki          │   │ Kubernetes API  │
│ Alertmanager│  │ Promtail      │   │ (in-cluster SA) │
│ Grafana     │  │               │   │                 │
│ KSM         │  │               │   │                 │
│ node-export │  │               │   │                 │
└─────────────┘  └───────────────┘   └─────────────────┘
```
### On the cluster
| Component | Status | Role |
|-----------|--------|------|
| [kube-prometheus-stack](https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack) | **Deployed** | Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics, default rules |
| [Loki](https://grafana.com/docs/loki/latest/) + [Promtail](https://grafana.com/docs/loki/latest/send-data/promtail/) | **Deployed** | Log aggregation and shipping |
| [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) | **Deployed (optional)** | OTLP in → Prometheus remote-write / Loki out |
| [metrics-server](https://github.com/kubernetes-sigs/metrics-server) | Planned | Live resource metrics (`kubectl top` semantics) |
| Host DaemonSet sidecar | Planned | systemd, cert, and OS-level facts |
We use standard CNCF and Grafana components so agents reason in **PromQL** and **LogQL** and call a single MCP server for answers.
---
## Why these charts?
| Chart | Rationale |
|-------|-----------|
| **kube-prometheus-stack** | One Helm install for Prometheus Operator, Alertmanager, Grafana, node-exporter, KSM, dashboards, and alert rules |
| **Loki + Promtail** | Cheap, scalable log storage that stays separate from the Prometheus metrics store |
| **OTel Collector** | Vendor-agnostic OTLP ingress; fan-out to existing backends without re-architecting |
Ansible copies opinionated values from `helm/values/` and runs `helm upgrade --install` for each chart. See `ansible/roles/telemetry_stack/tasks/main.yml`.
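
For readers who want to see what that amounts to on the command line, the sketch below assembles one `helm upgrade --install` invocation per chart. The release names, the `grafana/loki` chart reference, and the values file names under `helm/values/` are assumptions; the role's task file remains the authority on what is actually run.

```python
import shlex

# (release, chart, namespace, values file); all entries are illustrative assumptions.
CHARTS = [
    ("kube-prometheus-stack", "prometheus-community/kube-prometheus-stack",
     "monitoring", "helm/values/kube-prometheus-stack.yaml"),
    ("loki", "grafana/loki", "logging", "helm/values/loki.yaml"),
]


def helm_command(release, chart, namespace, values_file):
    """Return the argv list for an idempotent install-or-upgrade of one chart."""
    return [
        "helm", "upgrade", "--install", release, chart,
        "--namespace", namespace, "--create-namespace",
        "--values", values_file,
    ]


for entry in CHARTS:
    print(shlex.join(helm_command(*entry)))
```
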
---
## MCP Telemetry Bridge
The bridge (`mcp-telemetry-bridge/`) is the key piece — a small FastAPI service that implements the MCP surface (resources, tools, prompts).
### Implementation status
| Capability | Status |
|------------|--------|
| FastAPI service with health check | Done |
| `/mcp/schema` discovery endpoint | Done |
| `promql.query` | Done |
| `loki.query` | Done |
| `k8s.get` | Done |
| `k8s.events` | Done |
| `inventory.snapshot` | Done |
| Saved PromQL/LogQL resources | Done (3 queries) |
| `Triage-Now` prompt | Stub |
| `Capacity-Check`, `CrashLoop-Playbook` prompts | Planned |
| `systemd.status` | Planned (requires DaemonSet sidecar) |
| `tail.pod_logs` | Planned |
| Alertmanager API | Planned |
| Full MCP protocol transport | Planned |
### Read-only backends
The bridge talks read-only to:
- **Prometheus HTTP API** — instant and range queries
- **Loki HTTP API** — LogQL queries
- **Kubernetes API** — ServiceAccount with RBAC `get`/`list`/`watch`
- **Alertmanager API** — planned for active-alert summaries
- **Node sidecar HTTP** — planned for host-level facts
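
The first two backends are plain HTTP GET APIs. As a sketch of what the bridge has to produce, the helpers below build query URLs for the Prometheus `/api/v1/query` and `/api/v1/query_range` endpoints and the Loki `/loki/api/v1/query_range` endpoint. The function names and the in-cluster hostnames are hypothetical; only the endpoint paths and parameter names come from the upstream APIs.

```python
from urllib.parse import urlencode


def prometheus_instant_url(base, expr):
    """URL for a Prometheus instant query."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


def prometheus_range_url(base, expr, start, end, step):
    """URL for a Prometheus range query; start and end are Unix timestamps."""
    params = {"query": expr, "start": start, "end": end, "step": step}
    return f"{base}/api/v1/query_range?{urlencode(params)}"


def loki_range_url(base, logql, limit=100):
    """URL for a Loki range query with a cap on returned entries."""
    return f"{base}/loki/api/v1/query_range?{urlencode({'query': logql, 'limit': limit})}"


# Hypothetical service addresses for illustration only.
print(prometheus_instant_url("http://prometheus:9090", "up"))
print(loki_range_url("http://loki:3100", '{namespace="mcp"}'))
```

Keeping URL construction in small pure functions like these makes the read-only surface easy to audit: nothing here can issue a write.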
### Tools (target API)
```
promql.query(expr, range?)
loki.query(logql, limit?, since?)
k8s.get(kind, namespace?, name?)
k8s.events(namespace?, since?)
inventory.snapshot() → JSON
systemd.status(unit) # planned
```
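
One way to enforce these signatures before a call reaches a backend is a small argument check. The sketch below encodes the implemented tools from the listing above, treating parameters marked `?` as optional; the `TOOL_PARAMS` table and `validate_call` helper are hypothetical, not part of the bridge source.

```python
# Required and optional argument names per implemented tool, from the target API listing.
TOOL_PARAMS = {
    "promql.query": {"required": {"expr"}, "optional": {"range"}},
    "loki.query": {"required": {"logql"}, "optional": {"limit", "since"}},
    "k8s.get": {"required": {"kind"}, "optional": {"namespace", "name"}},
    "k8s.events": {"required": set(), "optional": {"namespace", "since"}},
    "inventory.snapshot": {"required": set(), "optional": set()},
}


def validate_call(tool, arguments):
    """Return a list of validation errors; an empty list means the call is well-formed."""
    spec = TOOL_PARAMS.get(tool)
    if spec is None:
        return [f"unknown tool: {tool}"]
    given = set(arguments)
    errors = [f"missing argument: {a}" for a in sorted(spec["required"] - given)]
    allowed = spec["required"] | spec["optional"]
    errors += [f"unexpected argument: {a}" for a in sorted(given - allowed)]
    return errors


print(validate_call("k8s.get", {"kind": "pods", "namespace": "mcp"}))
```

Planned tools such as `systemd.status` are deliberately absent from the table, so calls to them are rejected as unknown until they are implemented.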
### Resources
```
res://dashboards/top-pods-by-cpu.promql # implemented
res://dashboards/pod-restarts.promql # implemented
res://dashboards/warn-events.logql # implemented
res://snapshots/cluster-inventory.json # planned (dynamic)
```
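
The resource URIs follow a regular pattern: a collection, a name, and an extension naming the query language. A minimal parsing sketch with the standard library is shown below; `parse_resource_uri` and the keys of its result are illustrative assumptions rather than the bridge's own code.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def parse_resource_uri(uri):
    """Split a res:// URI into its collection, resource name, and query language."""
    parsed = urlparse(uri)
    if parsed.scheme != "res":
        raise ValueError(f"not a res:// URI: {uri}")
    path = PurePosixPath(parsed.path)
    return {
        "collection": parsed.netloc,
        "name": path.stem,
        "language": path.suffix.lstrip("."),
    }


print(parse_resource_uri("res://dashboards/top-pods-by-cpu.promql"))
```

The `language` field could then decide whether a saved query is sent to Prometheus (`promql`) or Loki (`logql`).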
### Prompts
```
Triage-Now # stub — summarize alerts, top offenders, recent warnings
Capacity-Check # planned
CrashLoop-Playbook # planned
```
---
## Security model
- Bridge runs under a dedicated ServiceAccount with a ClusterRole limited to `get`/`list`/`watch`
- NetworkPolicy restricts egress to Prometheus (9090) and Loki (3100); K8s API (443) allowance may be needed
- External exposure should use mTLS or OIDC — the bridge is not authenticated in v1
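
The egress rule described above can be sketched as a manifest built in code. The policy name, the `app: mcp-telemetry-bridge` pod label, and the `bridge_egress_policy` helper are assumptions; the chart's own NetworkPolicy template is the authority. This sketch restricts by port only, and a pod with restricted egress usually also needs DNS (port 53) to resolve Service names, which is left out here. Depending on the CNI, the API server's backing port rather than 443 may be the one that has to be allowed.

```python
def bridge_egress_policy(allow_k8s_api=True):
    """Build a NetworkPolicy dict allowing egress to Prometheus, Loki, and optionally the K8s API."""
    ports = [9090, 3100] + ([443] if allow_k8s_api else [])
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        # Hypothetical name and label selector for illustration only.
        "metadata": {"name": "mcp-telemetry-bridge-egress", "namespace": "mcp"},
        "spec": {
            "podSelector": {"matchLabels": {"app": "mcp-telemetry-bridge"}},
            "policyTypes": ["Egress"],
            "egress": [{"ports": [{"protocol": "TCP", "port": p} for p in ports]}],
        },
    }


policy = bridge_egress_policy()
print([p["port"] for p in policy["spec"]["egress"][0]["ports"]])
```
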
---
## Related docs
- [INTENT.md](../INTENT.md) — goals, scope, success criteria, known gaps
- [README.md](../README.md) — quick start and smoke tests
- [TeleMcpProject.md](TeleMcpProject.md) — project overview and audience

73
wiki/TeleMcpProject.md Normal file

@@ -0,0 +1,73 @@
# TeleMcp Project
*Telemetry for autonomous control*
## What is TeleMcp?
TeleMcp is **mission control for Kubernetes hosts**. It collects health, performance, and alert signals from a Linux k8s cluster and exposes them through a single **Model Context Protocol (MCP)** interface so intelligent assistants can understand what's happening, triage problems, and help keep systems running smoothly — without constant human supervision.
The project name reflects its two halves:
- **Tele** — telemetry: metrics, logs, events, and cluster inventory
- **MCP** — the standardized bridge between observability backends and LLM agents
## Who is it for?
- **Operators** who want repeatable, one-command observability on a k3s or kubeadm host
- **LLM agent builders** who need a safe, read-only API for cluster situational awareness
- **Developers** running local or edge Kubernetes who want agent-assisted monitoring without wiring up bespoke integrations
## What problem does it solve?
Running a Kubernetes host means tracking signals across many systems. Humans reach for Grafana, `kubectl`, and ad-hoc PromQL. Agents need the same information through a **standardized, safe contract** — not raw shell access or scattered API credentials.
TeleMcp solves this in three steps:
1. **Collect** — deploy Prometheus, Loki, and supporting exporters via Helm
2. **Deploy** — bootstrap everything with a single Ansible playbook
3. **Bridge** — expose resources, tools, and prompts through `mcp-telemetry-bridge`
## What can an agent do today?
With the current scaffold, an agent connected to the bridge can:
- Query Prometheus with `promql.query`
- Search logs with `loki.query`
- Inspect Kubernetes objects with `k8s.get` and `k8s.events`
- Pull a cluster inventory snapshot with `inventory.snapshot`
- Use pre-built PromQL/LogQL resources for common triage queries
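
As an illustration of what a client-side call might look like against the current HTTP approximation, the sketch below builds (but does not send) a request for a tool. The `/tools/<name>` path, the POST method, and the JSON body layout are assumptions; `/mcp/schema` should be consulted for the real contract.

```python
import json
from urllib.request import Request


def build_tool_request(base_url, tool, arguments):
    """Prepare an HTTP request for a bridge tool call (assumed path and body layout)."""
    body = json.dumps(arguments).encode("utf-8")
    return Request(
        url=f"{base_url.rstrip('/')}/tools/{tool}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# localhost:8080 assumes a port-forward to the bridge Service.
req = build_tool_request("http://localhost:8080", "promql.query", {"expr": "up"})
print(req.get_method(), req.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would perform the call once the bridge is reachable.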
## What is planned?
Stretch goals — explicitly deferred in v1 — include host-level signals (systemd status, cert expiry, firewall summary), Alertmanager integration, additional prompts (`Capacity-Check`, `CrashLoop-Playbook`), and full MCP protocol transport. See [INTENT.md](../INTENT.md) for the authoritative scope list.
## Design principles
| Principle | Summary |
|-----------|---------|
| Read-only by default | No cluster mutations through the bridge |
| Standard stack | CNCF/Grafana components, not custom collectors |
| MCP as the interface | One bridge, one contract for agents |
| Deployable in one shot | Ansible + Helm, no manual assembly |
| Least privilege | Scoped RBAC and NetworkPolicy |
## Repository map
| Path | Contents |
|------|----------|
| [INTENT.md](../INTENT.md) | North star — goals, scope, current state |
| [README.md](../README.md) | Quick start and operational guide |
| [TeleMcpBlueprint.md](TeleMcpBlueprint.md) | Architecture and component rationale |
| `ansible/` | Bootstrap playbook |
| `helm/` | Chart values and bridge chart |
| `mcp-telemetry-bridge/` | FastAPI bridge source |
## Success criteria
TeleMcp is working when:
1. `ansible-playbook` brings up healthy pods in `monitoring`, `logging`, and `mcp` namespaces
2. `/mcp/schema` returns resources, tools, and prompts
3. An agent can query metrics, logs, and cluster state without direct API credentials
4. Default alert rules fire on induced failures and the agent can triage them
5. The stack redeploys cleanly on a fresh Ubuntu 24.04 + k3s/kubeadm host