Seeded intent and wiki pages
171
INTENT.md
Normal file
@@ -0,0 +1,171 @@

# TeleMcp — Project Intent

> **Mission control for Kubernetes hosts, exposed to LLM agents through MCP.**

TeleMcp is a self-contained observability stack that deploys onto a Linux Kubernetes host and surfaces metrics, logs, and cluster state through a single **Model Context Protocol (MCP)** bridge. The goal is to let an autonomous agent — or a human with an agent — **bootstrap, monitor, triage, and operate** a box without bespoke integrations or constant human supervision.

This document anchors what we are building, why, and what is in scope. When in doubt, prefer the simplest path that gives an agent reliable, read-only situational awareness.


---

## Problem

Operating a Kubernetes host means juggling many signals across many systems: node health, workload status, logs, alerts, certificates, systemd units, and more. Humans use Grafana dashboards, `kubectl`, and ad-hoc PromQL/LogQL. LLM agents need the same information, but through a **standardized, safe interface** — not raw shell access.

TeleMcp closes that gap by:

1. **Collecting** telemetry with proven CNCF/Grafana stack components.
2. **Deploying** the stack repeatably via Ansible + Helm.
3. **Bridging** everything to agents through one MCP server with resources, tools, and prompts.


---

## Vision

A single `ansible-playbook` (or equivalent) turns a bare k3s/kubeadm host into a monitored, agent-ready environment. An LLM agent connects to the MCP bridge and can answer questions like:

- *What is unhealthy right now?*
- *Which pods are crash-looping and why?*
- *Is disk or memory pressure building?*
- *What changed in the cluster since yesterday?*

The agent reasons in **PromQL** and **LogQL** — industry-standard query languages — and calls parameterized tools rather than scraping raw APIs itself.

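
As a hedged illustration of "parameterized tools", the sketch below renders the arguments an agent might pass to `promql.query` for the crash-loop question. The helper names (`render_selector`, `crashloop_query`) and the `{"expr": ...}` argument shape are assumptions made for this example; `kube_pod_container_status_waiting_reason` is a metric exposed by kube-state-metrics.

```python
from typing import Dict, Optional


def render_selector(metric: str, labels: Dict[str, str]) -> str:
    """Render a PromQL instant-vector selector with exact-match labels."""
    def escape(value: str) -> str:
        # Escape backslashes and double quotes inside label values.
        return value.replace("\\", "\\\\").replace('"', '\\"')

    matchers = ",".join(
        f'{key}="{escape(value)}"' for key, value in sorted(labels.items())
    )
    return f"{metric}{{{matchers}}}" if matchers else metric


def crashloop_query(namespace: Optional[str] = None) -> Dict[str, str]:
    """Build hypothetical promql.query arguments listing crash-looping pods."""
    labels = {"reason": "CrashLoopBackOff"}
    if namespace:
        labels["namespace"] = namespace
    selector = render_selector("kube_pod_container_status_waiting_reason", labels)
    return {"expr": f"sum by (namespace, pod) ({selector}) > 0"}
```

Keeping the expression behind a small builder means the agent supplies only a namespace, and label values are escaped in one place.
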

---

## Design Principles

| Principle | What it means |
|-----------|---------------|
| **Read-only by default** | The MCP bridge and its ServiceAccount only `get`/`list`/`watch`. No cluster mutations through this path. |
| **Standard stack** | Prometheus, Loki, kube-state-metrics, node-exporter — not custom collectors unless necessary. |
| **MCP as the interface** | One bridge, one contract. Agents do not talk to Prometheus/Loki/K8s APIs directly. |
| **Deployable in one shot** | Ansible playbook + Helm charts; no manual chart-by-chart assembly. |
| **Least privilege** | RBAC scoped to observation; NetworkPolicy limits egress; consider mTLS/OIDC for external exposure. |
| **Agent-first ergonomics** | Pre-built resources (saved queries), tools (parameterized calls), and prompts (triage playbooks). |

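
The read-only principle can be checked mechanically. The following is a minimal sketch, assuming the bridge's ClusterRole rules have been loaded into Python dicts with the usual `apiGroups` / `resources` / `verbs` keys; the function name `non_read_only_verbs` is hypothetical and not part of the repository.

```python
from typing import Dict, Iterable, List

# The only verbs the design principles allow for the bridge ServiceAccount.
READ_ONLY_VERBS = frozenset({"get", "list", "watch"})


def non_read_only_verbs(rules: Iterable[Dict[str, List[str]]]) -> List[str]:
    """Return the sorted set of verbs that fall outside get/list/watch.

    A wildcard verb ("*") is reported too, because it grants mutations.
    """
    offending = set()
    for rule in rules:
        for verb in rule.get("verbs", []):
            if verb not in READ_ONLY_VERBS:
                offending.add(verb)
    return sorted(offending)
```

An empty result means the role is observation-only; anything else could fail a CI check before the chart is deployed.
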

---

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                    LLM Agent (MCP client)                   │
└──────────────────────────┬──────────────────────────────────┘
                           │ MCP (resources / tools / prompts)
┌──────────────────────────▼──────────────────────────────────┐
│  mcp-telemetry-bridge (FastAPI, namespace: mcp)             │
│  Read-only proxy to Prometheus, Loki, Kubernetes API        │
└──────┬─────────────────┬────────────────────┬───────────────┘
       │                 │                    │
┌──────▼──────┐  ┌───────▼───────┐   ┌────────▼────────┐
│ Prometheus  │  │ Loki          │   │ Kubernetes API  │
│ Alertmanager│  │ Promtail      │   │ (in-cluster SA) │
│ Grafana     │  │               │   │                 │
│ KSM         │  │               │   │                 │
│ node-export │  │               │   │                 │
└─────────────┘  └───────────────┘   └─────────────────┘
 monitoring namespace   logging namespace
```

**Optional:** OpenTelemetry Collector for OTLP fan-out to Prometheus remote-write and Loki.

**Future:** Host-level DaemonSet sidecar for systemd status, package/cert checks, and other node facts not available through K8s metrics alone.


---

## What We Capture

### Minimum viable (current target)

**Kubernetes**

- Cluster & node status, conditions, taints
- Namespaces, Deployments, StatefulSets, DaemonSets, Pods (phase, restarts, images)
- Services, Events (especially Warning/Error)
- Resource usage via Prometheus/cAdvisor/kube-state-metrics

**Logs & alerts**

- Pod and node logs via Loki/Promtail
- Default alert rules: node not ready, CrashLoopBackOff, API/etcd degradation, job failures

**Bridge surface**

- Tools: `promql.query`, `loki.query`, `k8s.get`, `k8s.events`, `inventory.snapshot`
- Resources: saved PromQL/LogQL queries, cluster inventory snapshots
- Prompts: triage and operational playbooks

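
One way to picture the bridge surface is a name-to-handler registry. The sketch below is illustrative only: the `tool` decorator, `call_tool`, and the placeholder payloads are assumptions for this example, not the repository's actual FastAPI code. Only the tool names come from the list above.

```python
from typing import Any, Callable, Dict

# Registry mapping tool names to handler callables.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that registers a handler under a tool name."""
    def register(handler: Callable[..., Any]) -> Callable[..., Any]:
        if name in TOOLS:
            raise ValueError(f"duplicate tool: {name}")
        TOOLS[name] = handler
        return handler
    return register


@tool("k8s.get")
def k8s_get(kind: str, namespace: str = "", name: str = "") -> dict:
    # Placeholder: a real handler would query the Kubernetes API read-only.
    return {"kind": kind, "namespace": namespace, "name": name, "items": []}


@tool("inventory.snapshot")
def inventory_snapshot() -> dict:
    # Placeholder payload; the field names are hypothetical.
    return {"nodes": [], "namespaces": [], "workloads": []}


def call_tool(name: str, **arguments: Any) -> Any:
    """Dispatch a call by tool name; unknown names are rejected."""
    if name not in TOOLS:
        raise LookupError(f"unknown tool: {name}")
    return TOOLS[name](**arguments)
```

Because every call goes through one dispatcher, that is also the natural place to enforce the read-only contract and to log what an agent asked for.
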

### Stretch (explicitly deferred)

- Host OS depth: systemd units, package updates, cert expiry, firewall summary, NTP drift
- `systemd.status`, `tail.pod_logs` tools
- Alertmanager API integration for active-alert summaries
- Full MCP transport (stdio/SSE) in place of the current HTTP schema approximation
- Multi-cluster federation
- Write/mutate operations (out of scope unless a separate, gated path is designed)


---

## Repository Layout

| Path | Role |
|------|------|
| `ansible/` | Bootstrap: install Helm, deploy all charts |
| `helm/values/` | Opinionated values for kube-prometheus-stack, Loki, OTel |
| `helm/mcp-telemetry-bridge/` | Bridge chart: Deployment, RBAC, Service, NetworkPolicy |
| `mcp-telemetry-bridge/` | FastAPI application implementing the MCP surface |
| `environments/` | Per-environment overrides and notes |
| `wiki/` | Extended design notes and blueprint |


---

## Current State (as of initial scaffold)

**Done**

- Ansible playbook with `k8s_host` + `telemetry_stack` roles
- Helm values for monitoring, logging, optional OTel collector
- MCP bridge service with core tools and saved-query resources
- Read-only ClusterRole/Binding for the bridge ServiceAccount
- NetworkPolicy skeleton for the bridge
- Health check and `/mcp/schema` discovery endpoint

**Not yet done / known gaps**

- Bridge image is a placeholder (`ghcr.io/example/telemcp-bridge`); needs CI build and publish
- MCP interface is HTTP REST-shaped, not full MCP protocol transport
- Prompts: only the `Triage-Now` stub exists; `Capacity-Check` and `CrashLoop-Playbook` are missing
- No Alertmanager integration in the bridge
- No metrics-server chart (useful for `kubectl top` semantics)
- No host-level DaemonSet sidecar for systemd/OS signals
- NetworkPolicy egress may need an allowance for the K8s API (443)
- Wiki and README are aligned to INTENT as of this scaffold; they need updating whenever scope shifts


---

## Success Criteria

We know TeleMcp is working when:

1. `ansible-playbook` brings up monitoring, logging, and bridge namespaces with healthy pods.
2. `curl http://localhost:8080/mcp/schema` (through the `kubectl port-forward` shown in the README) returns resources, tools, and prompts.
3. An MCP-capable agent can run PromQL and LogQL queries, list cluster objects, and pull an inventory snapshot **without direct API credentials**.
4. Default alert rules fire on induced failures (node pressure, crash loop) and the agent can triage them via bridge tools.
5. The entire stack redeploys cleanly on a fresh Ubuntu 24.04 + k3s/kubeadm host.

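
Criterion 2 lends itself to an automated check. The sketch below assumes the `/mcp/schema` response is a JSON object with top-level `resources`, `tools`, and `prompts` keys; that shape is a guess for illustration, and the function name is hypothetical.

```python
import json
from typing import List

REQUIRED_SECTIONS = ("resources", "tools", "prompts")


def missing_schema_sections(schema_json: str) -> List[str]:
    """Return the required sections that are absent or empty in the schema."""
    schema = json.loads(schema_json)
    return [section for section in REQUIRED_SECTIONS if not schema.get(section)]
```

Feeding it the body returned by the schema endpoint gives an empty list on success and names the missing sections otherwise.
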

---

## Non-Goals

- Replacing Grafana or building a custom metrics database
- Providing arbitrary shell/exec access to the cluster
- Mutating cluster state (deploy, scale, delete) through the bridge
- Supporting non-Linux or non-Kubernetes targets in v1
- Vendor-specific APM (Datadog, New Relic, etc.) — OTel fan-out is the extension point


---

## How to Use This Document

- **Prioritize work** against the "Current State" gaps and "Minimum viable" capture list.
- **Reject scope creep** that does not serve agent observability or repeatable deployment.
- **Update this file** when intent shifts — e.g., adding write paths, new environments, or MCP transport changes.

For operational quick-start, see [README.md](README.md).
For detailed component rationale, see [wiki/TeleMcpBlueprint.md](wiki/TeleMcpBlueprint.md).

90
README.md
@@ -1,55 +1,103 @@

# TeleMcp

**Mission control for Kubernetes hosts, exposed to LLM agents through MCP.**

TeleMcp deploys a standard observability stack onto a Linux Kubernetes host via **Ansible + Helm**, then surfaces metrics, logs, and cluster state through a read-only **MCP bridge** so an LLM agent can bootstrap, monitor, triage, and operate the box.

> For project goals, scope, and design principles, see **[INTENT.md](INTENT.md)**.


## Components

| Component | Namespace | Role |
|-----------|-----------|------|
| **kube-prometheus-stack** | `monitoring` | Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics |
| **Loki + Promtail** | `logging` | Log aggregation and shipping |
| **OpenTelemetry Collector** | `observability` | Optional OTLP fan-out to Prometheus and Loki |
| **mcp-telemetry-bridge** | `mcp` | FastAPI service exposing MCP resources, tools, and prompts |


## Quick Start

### 0) Prereqs

- Ubuntu 24.04 host running Kubernetes (k3s or kubeadm), reachable, with a `kubectl` context configured
- Ansible 2.15+ on your control machine
- Helm 3 on the host (the Ansible role installs it if missing)


### 1) Run Ansible

```bash
cd ansible
ansible-playbook -i inventories/local.ini playbook.yml
```


### 2) Smoke tests

From any machine with a `kubectl` context:

```bash
kubectl get pods -n monitoring
kubectl get pods -n logging
kubectl get pods -n mcp

# port-forward blocks, so run it in the background (or in a second terminal)
kubectl port-forward -n mcp svc/mcp-telemetry-bridge 8080:80 &
PF_PID=$!
sleep 2  # give the tunnel a moment to come up
curl -s http://localhost:8080/mcp/schema | jq .
curl -s http://localhost:8080/healthz
kill "$PF_PID"  # stop the background port-forward
```


### 3) Point your LLM agent

Configure your agent's MCP client to the bridge endpoint (ClusterIP, Ingress, or port-forward).

**Implemented tools:**

| Tool | Description |
|------|-------------|
| `promql.query` | Run a PromQL expression against Prometheus |
| `loki.query` | Run a LogQL query against Loki |
| `k8s.get` | Fetch Kubernetes objects (pods, nodes, deployments, etc.) |
| `k8s.events` | List cluster or namespace events |
| `inventory.snapshot` | JSON snapshot of nodes, namespaces, and workloads |


**Saved resources** (via `/mcp/resource?uri=...`):

- `res://dashboards/top-pods-by-cpu.promql`
- `res://dashboards/pod-restarts.promql`
- `res://dashboards/warn-events.logql`

> The bridge currently exposes an HTTP schema approximation (`/mcp/schema`, `/tools/...`). Full MCP transport (stdio/SSE) is planned — see [INTENT.md](INTENT.md).

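
Because a `res://` URI contains `:` and `/`, it needs percent-encoding when passed in the `uri` query parameter. A minimal sketch, assuming the bridge is reachable at a base URL such as the port-forward address above; the helper name `resource_url` is hypothetical.

```python
from urllib.parse import urlencode


def resource_url(base_url: str, uri: str) -> str:
    """Build the /mcp/resource URL for a saved resource URI."""
    query = urlencode({"uri": uri})  # percent-encodes ':' and '/'
    return f"{base_url.rstrip('/')}/mcp/resource?{query}"
```

For example, `resource_url("http://localhost:8080", "res://dashboards/pod-restarts.promql")` yields a URL that can be passed straight to `curl`.
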

## Repo layout

```
tele-mcp/
  INTENT.md                 # Project north star — goals, scope, current state
  ansible/                  # Bootstrap playbook and roles
  helm/
    values/                 # Chart values for monitoring, logging, OTel
    mcp-telemetry-bridge/   # Bridge Helm chart
  mcp-telemetry-bridge/     # FastAPI bridge application
  environments/             # Per-environment overrides
  wiki/                     # Extended project and design docs
```


## Documentation

| Document | Purpose |
|----------|---------|
| [INTENT.md](INTENT.md) | Goals, principles, scope, success criteria |
| [wiki/TeleMcpProject.md](wiki/TeleMcpProject.md) | Project overview and audience |
| [wiki/TeleMcpBlueprint.md](wiki/TeleMcpBlueprint.md) | Component rationale and bridge design |
| [environments/dev/README.md](environments/dev/README.md) | Dev environment notes |


## Security

- MCP bridge ServiceAccount is read-only (`get` / `list` / `watch` only)
- NetworkPolicy limits bridge egress to Prometheus and Loki
- Consider mTLS or OIDC if exposing the bridge outside the cluster


## Current limitations

See [INTENT.md — Current State](INTENT.md#current-state-as-of-initial-scaffold) for the full list. Notable gaps:

- Bridge container image is a placeholder (`ghcr.io/example/telemcp-bridge`)
- No Alertmanager integration in the bridge yet
- Host-level signals (systemd, certs, firewall) are deferred to a future DaemonSet sidecar

183
wiki/TeleMcpBlueprint.md
Normal file
@@ -0,0 +1,183 @@

# TeleMcp Blueprint

*Building a Kubernetes telemetry MCP bridge*

> **Source:** [Original design conversation](https://chatgpt.com/share/68bdf06d-8c2c-8009-90c5-466f9f531d9a)
> **Authority:** Scope and priorities are governed by [INTENT.md](../INTENT.md). This document explains *why* each component exists and *how* the bridge is shaped.

## Overview

Blueprint for a telemetry service + MCP bridge that auto-deploys on a Linux-based Kubernetes host (k3s or standard k8s) via Ansible + Helm, and exposes everything an LLM agent needs to bootstrap, monitor, and operate the box.

MCP acts as the standardized "USB-C" between the LLM agent and your telemetry — see the [Model Context Protocol spec](https://modelcontextprotocol.io).


---

## What we capture

### Minimum viable (current target)

**Kubernetes (control + workloads)**

- Cluster and node status, taints, conditions, kubelet health
- Namespaces, Deployments, StatefulSets, DaemonSets, Pods (phase, restarts, images, age)
- Services, Events (warning/error)
- Resource usage per pod/node/namespace via Prometheus, cAdvisor, and kube-state-metrics

**Logs and alerts**

- Pod and node logs via Loki/Promtail
- Default alert rules: node not ready, API/etcd degradation, CrashLoopBackOff, job failures

**Bridge surface**

- Tools: `promql.query`, `loki.query`, `k8s.get`, `k8s.events`, `inventory.snapshot`
- Resources: saved PromQL/LogQL queries, cluster inventory snapshots
- Prompts: triage and operational playbooks (`Triage-Now` implemented; others planned)


### Stretch (deferred)

**Host (Linux / node)**

- CPU, memory, disk, inode, filesystem, network, NIC errors *(partially covered by node-exporter)*
- Distro/kernel/version, packages/updates
- Systemd unit status for key services (container runtime, kubelet, nginx, etc.)
- Certificates (expiry), time sync status (chrony/ntp)
- Firewall/ports (nftables/ufw summary)

**Additional Kubernetes signals**

- Ingress, Jobs/CronJobs, HPA/VPA
- Throttling and OOM kill detail beyond default metrics

**Additional bridge capabilities**

- `systemd.status`, `tail.pod_logs` tools
- Alertmanager API for active-alert summaries
- Full MCP transport (stdio/SSE) in place of the current HTTP schema approximation


---

## Reference architecture

```
┌─────────────────────────────────────────────────────────────┐
│                    LLM Agent (MCP client)                   │
└──────────────────────────┬──────────────────────────────────┘
                           │ MCP (resources / tools / prompts)
┌──────────────────────────▼──────────────────────────────────┐
│  mcp-telemetry-bridge (FastAPI, namespace: mcp)             │
│  Read-only proxy to Prometheus, Loki, Kubernetes API        │
└──────┬─────────────────┬────────────────────┬───────────────┘
       │                 │                    │
┌──────▼──────┐  ┌───────▼───────┐   ┌────────▼────────┐
│ Prometheus  │  │ Loki          │   │ Kubernetes API  │
│ Alertmanager│  │ Promtail      │   │ (in-cluster SA) │
│ Grafana     │  │               │   │                 │
│ KSM         │  │               │   │                 │
│ node-export │  │               │   │                 │
└─────────────┘  └───────────────┘   └─────────────────┘
```


### On the cluster

| Component | Status | Role |
|-----------|--------|------|
| [kube-prometheus-stack](https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack) | **Deployed** | Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics, default rules |
| [Loki](https://grafana.com/docs/loki/latest/) + [Promtail](https://grafana.com/docs/loki/latest/send-data/promtail/) | **Deployed** | Log aggregation and shipping |
| [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) | **Deployed (optional)** | OTLP in → Prometheus remote-write / Loki out |
| [metrics-server](https://github.com/kubernetes-sigs/metrics-server) | Planned | Live resource metrics (`kubectl top` semantics) |
| Host DaemonSet sidecar | Planned | systemd, cert, and OS-level facts |

We use standard CNCF and Grafana pieces so agents reason in **PromQL** and **LogQL** and call a single MCP server for answers.


---

## Why these charts?

| Chart | Rationale |
|-------|-----------|
| **kube-prometheus-stack** | One Helm install for Prometheus Operator, Alertmanager, Grafana, node-exporter, KSM, dashboards, and alert rules |
| **Loki + Promtail** | Cheap, scalable log storage without bolting logs into Prometheus |
| **OTel Collector** | Vendor-agnostic OTLP ingress; fan-out to existing backends without re-architecting |

Ansible copies opinionated values from `helm/values/` and runs `helm upgrade --install` for each chart. See `ansible/roles/telemetry_stack/tasks/main.yml`.

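
To make the per-chart step concrete, here is a rough sketch of the command line such a task might assemble. The release name and values path in the usage note are placeholders, not taken from the role; the flags (`--install`, `--namespace`, `--create-namespace`, `--values`) are standard Helm 3 options.

```python
from typing import List


def helm_upgrade_install(
    release: str, chart: str, namespace: str, values_file: str
) -> List[str]:
    """Return the argv for an idempotent Helm install-or-upgrade."""
    return [
        "helm", "upgrade", "--install", release, chart,
        "--namespace", namespace,
        "--create-namespace",   # create the namespace on first install
        "--values", values_file,
    ]
```

Calling `helm_upgrade_install("kps", "prometheus-community/kube-prometheus-stack", "monitoring", "helm/values/kps.yaml")` gives a list suitable for `subprocess.run`; rerunning the same command is safe because `upgrade --install` converges on the desired release.
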

---

## MCP Telemetry Bridge

The bridge (`mcp-telemetry-bridge/`) is the key piece — a small FastAPI service that implements the MCP surface (resources, tools, prompts).

### Implementation status

| Capability | Status |
|------------|--------|
| FastAPI service with health check | Done |
| `/mcp/schema` discovery endpoint | Done |
| `promql.query` | Done |
| `loki.query` | Done |
| `k8s.get` | Done |
| `k8s.events` | Done |
| `inventory.snapshot` | Done |
| Saved PromQL/LogQL resources | Done (3 queries) |
| `Triage-Now` prompt | Stub |
| `Capacity-Check`, `CrashLoop-Playbook` prompts | Planned |
| `systemd.status` | Planned (requires DaemonSet sidecar) |
| `tail.pod_logs` | Planned |
| Alertmanager API | Planned |
| Full MCP protocol transport | Planned |


### Read-only backends

The bridge talks read-only to:

- **Prometheus HTTP API** — instant and range queries
- **Loki HTTP API** — LogQL queries
- **Kubernetes API** — ServiceAccount with RBAC `get`/`list`/`watch`
- **Alertmanager API** — planned for active-alert summaries
- **Node sidecar HTTP** — planned for host-level facts

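
The "instant and range queries" distinction maps onto two Prometheus HTTP API endpoints, `/api/v1/query` and `/api/v1/query_range`. The sketch below shows one way a bridge could choose between them; the function name, the seconds-based `range_seconds` argument, and the default step are assumptions for illustration, not the bridge's actual code.

```python
import time
from typing import Dict, Optional, Tuple


def prometheus_request(
    expr: str,
    range_seconds: Optional[int] = None,
    step_seconds: int = 60,
    now: Optional[float] = None,
) -> Tuple[str, Dict[str, str]]:
    """Return (path, query params) for an instant or range query."""
    now = time.time() if now is None else now
    if range_seconds is None:
        # Instant query: evaluate the expression at a single timestamp.
        return "/api/v1/query", {"query": expr, "time": f"{now:.3f}"}
    # Range query: evaluate over [now - range, now] at a fixed step.
    return "/api/v1/query_range", {
        "query": expr,
        "start": f"{now - range_seconds:.3f}",
        "end": f"{now:.3f}",
        "step": str(step_seconds),
    }
```

Both endpoints are read-only, so this mapping stays within the bridge's `get`-style contract.
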

### Tools (target API)

```
promql.query(expr, range?)
loki.query(logql, limit?, since?)
k8s.get(kind, namespace?, name?)
k8s.events(namespace?, since?)
inventory.snapshot() → JSON
systemd.status(unit) # planned
```

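
As an example of how the optional `limit` and `since` arguments could translate to a backend call, the sketch below builds parameters for Loki's `/loki/api/v1/query_range` endpoint, which takes `start` and `end` as nanosecond Unix timestamps. The `15m`-style duration syntax, the defaults, and the function names are assumptions; the document does not define the format of `since`.

```python
import re
import time
from typing import Dict, Optional, Tuple

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_since(since: str) -> int:
    """Convert a duration such as '30s', '15m', '2h', or '1d' to seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", since)
    if match is None:
        raise ValueError(f"unsupported duration: {since!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def loki_request(
    logql: str, limit: int = 100, since: str = "1h", now_ns: Optional[int] = None
) -> Tuple[str, Dict[str, str]]:
    """Return (path, query params) for a Loki range query."""
    end_ns = time.time_ns() if now_ns is None else now_ns
    start_ns = end_ns - parse_since(since) * 1_000_000_000
    return "/loki/api/v1/query_range", {
        "query": logql,
        "limit": str(limit),
        "start": str(start_ns),
        "end": str(end_ns),
    }
```

Validating `since` up front keeps malformed agent input from reaching Loki as an unbounded query.
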

### Resources

```
res://dashboards/top-pods-by-cpu.promql   # implemented
res://dashboards/pod-restarts.promql      # implemented
res://dashboards/warn-events.logql        # implemented
res://snapshots/cluster-inventory.json    # planned (dynamic)
```


### Prompts

```
Triage-Now          # stub — summarize alerts, top offenders, recent warnings
Capacity-Check      # planned
CrashLoop-Playbook  # planned
```


---

## Security model

- Bridge runs under a dedicated ServiceAccount with a ClusterRole limited to `get`/`list`/`watch`
- NetworkPolicy restricts egress to Prometheus (9090) and Loki (3100); K8s API (443) allowance may be needed
- External exposure should use mTLS or OIDC — the bridge is not authenticated in v1

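
The egress rule described above could be expressed as a `networking.k8s.io/v1` NetworkPolicy. The sketch below generates one as a plain dict; the policy name and the `app: mcp-telemetry-bridge` pod label are hypothetical, and the chart's real template may differ.

```python
from typing import Any, Dict


def bridge_egress_policy(allow_k8s_api: bool = False) -> Dict[str, Any]:
    """Build an egress-only NetworkPolicy for the bridge pods."""
    ports = [9090, 3100]      # Prometheus and Loki
    if allow_k8s_api:
        ports.append(443)     # Kubernetes API, noted above as possibly needed
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "mcp-telemetry-bridge-egress", "namespace": "mcp"},
        "spec": {
            # Hypothetical label; match whatever the Deployment actually sets.
            "podSelector": {"matchLabels": {"app": "mcp-telemetry-bridge"}},
            "policyTypes": ["Egress"],
            "egress": [
                {"ports": [{"protocol": "TCP", "port": port} for port in ports]}
            ],
        },
    }
```

This version restricts by port only. A stricter policy would also add `to` selectors per backend, and an egress-restricted pod usually needs DNS allowed as well if it resolves service names.
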

---

## Related docs

- [INTENT.md](../INTENT.md) — goals, scope, success criteria, known gaps
- [README.md](../README.md) — quick start and smoke tests
- [TeleMcpProject.md](TeleMcpProject.md) — project overview and audience

73
wiki/TeleMcpProject.md
Normal file
@@ -0,0 +1,73 @@

# TeleMcp Project

*Telemetry for autonomous control*

## What is TeleMcp?

TeleMcp is **mission control for Kubernetes hosts**. It collects health, performance, and alert signals from a Linux k8s cluster and exposes them through a single **Model Context Protocol (MCP)** interface so intelligent assistants can understand what's happening, triage problems, and help keep systems running smoothly — without constant human supervision.

The project name reflects its two halves:

- **Tele** — telemetry: metrics, logs, events, and cluster inventory
- **MCP** — the standardized bridge between observability backends and LLM agents


## Who is it for?

- **Operators** who want repeatable, one-command observability on a k3s or kubeadm host
- **LLM agent builders** who need a safe, read-only API for cluster situational awareness
- **Developers** running local or edge Kubernetes who want agent-assisted monitoring without wiring up bespoke integrations


## What problem does it solve?

Running a Kubernetes host means tracking signals across many systems. Humans reach for Grafana, `kubectl`, and ad-hoc PromQL. Agents need the same information through a **standardized, safe contract** — not raw shell access or scattered API credentials.

TeleMcp solves this in three steps:

1. **Collect** — deploy Prometheus, Loki, and supporting exporters via Helm
2. **Deploy** — bootstrap everything with a single Ansible playbook
3. **Bridge** — expose resources, tools, and prompts through `mcp-telemetry-bridge`


## What can an agent do today?

With the current scaffold, an agent connected to the bridge can:

- Query Prometheus with `promql.query`
- Search logs with `loki.query`
- Inspect Kubernetes objects with `k8s.get` and `k8s.events`
- Pull a cluster inventory snapshot with `inventory.snapshot`
- Use pre-built PromQL/LogQL resources for common triage queries

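
To give a feel for what an inventory snapshot might contain, here is a small sketch that condenses node, namespace, and pod listings into one JSON-friendly dict. The input and output field names (`phase`, `restarts`, `pods_by_phase`, `total_restarts`) are hypothetical; the real `inventory.snapshot` payload is defined by the bridge, not by this example.

```python
from collections import Counter
from typing import Any, Dict, List


def build_snapshot(
    nodes: List[str], namespaces: List[str], pods: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """Summarize cluster listings into a compact inventory snapshot."""
    phases = Counter(pod.get("phase", "Unknown") for pod in pods)
    return {
        "nodes": sorted(nodes),
        "namespaces": sorted(namespaces),
        "pods_by_phase": dict(sorted(phases.items())),
        "total_restarts": sum(pod.get("restarts", 0) for pod in pods),
    }
```

A compact summary like this lets an agent decide which follow-up `k8s.get` or `promql.query` call is worth making.
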

## What is planned?

Stretch goals — explicitly deferred in v1 — include host-level signals (systemd status, cert expiry, firewall summary), Alertmanager integration, additional prompts (`Capacity-Check`, `CrashLoop-Playbook`), and full MCP protocol transport. See [INTENT.md](../INTENT.md) for the authoritative scope list.


## Design principles

| Principle | Summary |
|-----------|---------|
| Read-only by default | No cluster mutations through the bridge |
| Standard stack | CNCF/Grafana components, not custom collectors |
| MCP as the interface | One bridge, one contract for agents |
| Deployable in one shot | Ansible + Helm, no manual assembly |
| Least privilege | Scoped RBAC and NetworkPolicy |


## Repository map

| Path | Contents |
|------|----------|
| [INTENT.md](../INTENT.md) | North star — goals, scope, current state |
| [README.md](../README.md) | Quick start and operational guide |
| [TeleMcpBlueprint.md](TeleMcpBlueprint.md) | Architecture and component rationale |
| `ansible/` | Bootstrap playbook |
| `helm/` | Chart values and bridge chart |
| `mcp-telemetry-bridge/` | FastAPI bridge source |


## Success criteria

TeleMcp is working when:

1. `ansible-playbook` brings up healthy pods in `monitoring`, `logging`, and `mcp` namespaces
2. `/mcp/schema` returns resources, tools, and prompts
3. An agent can query metrics, logs, and cluster state without direct API credentials
4. Default alert rules fire on induced failures and the agent can triage them
5. The stack redeploys cleanly on a fresh Ubuntu 24.04 + k3s/kubeadm host
