Most teams already own the data they need to operate the platform. Logs sit in Loki, OpenSearch, Elasticsearch, or VictoriaLogs. Metrics are in Prometheus or Mimir. Traces live in Tempo or Jaeger. Tickets are in Jira. Deploys are in GitHub. Pages are in PagerDuty. The cost is not data — it is the cognitive load of stitching it together at 03:14 when an alert fires.
This article describes a generic, vendor-portable reference architecture for an AI agent that runs inside Grafana, calls into a model layer like AWS Bedrock under the hood, and reaches every data source through the Model Context Protocol (MCP). The same pattern has been deployed for telco network operations, fintech transaction monitoring, IoT fleet health, and retail platform reliability. Nothing in the design is exotic — the value is in the assembly.
Why Grafana is the right surface
An AI agent is only useful if operators actually open it during an incident. Grafana already owns that real estate. SREs, NOC engineers, fraud analysts, fleet managers, and on-call developers spend their incident hours inside dashboards. Adding another tab, another portal, or another chat tool adds friction at the worst possible moment. Embedding the agent as a Grafana app plugin means the chat lives next to the panels that triggered the alert — same theme, same SSO, same RBAC, same datasource permissions.
This also collapses the governance story. Grafana already has organizations, teams, folders, and datasource-scoped access. An app plugin inherits all of it for free. There is no second identity model to maintain, no separate audit pipeline, no parallel permissions sprawl. The user the agent acts on behalf of is the same user Grafana already knows about.
The reference architecture
The full system is four layers, each with one job. Top to bottom: the Grafana app plugin, the Bedrock orchestration layer, the MCP gateway, and the data-source MCPs themselves. That ordering is the entire mental model.
Layer 1 — the Grafana app plugin
The plugin is a React app inside Grafana. It exposes four surfaces: a streaming chat panel for natural-language questions, an AI Apps tab that hosts purpose-built mini-apps (log explorers, cost insights, security posture), an admin console for operators of the platform, and a runbook library backed by DynamoDB.
Because it is a Grafana app plugin (not a standalone web app) it inherits SSO, team membership, datasource UIDs, theming, and accessibility for free. Streaming responses render alongside dashboards, and the agent can deep-link into specific panels at the right time range.
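For orientation, here is a sketch of what the plugin manifest could look like, using Grafana's standard app plugin.json format. The plugin id, page names, and paths are invented for illustration:

```json
{
  "type": "app",
  "name": "Ops AI Agent",
  "id": "acme-opsagent-app",
  "info": {
    "description": "AI agent chat, AI Apps launcher, and admin console",
    "version": "1.0.0"
  },
  "includes": [
    { "type": "page", "name": "Chat", "path": "/a/acme-opsagent-app/chat", "addToNav": true },
    { "type": "page", "name": "AI Apps", "path": "/a/acme-opsagent-app/apps", "addToNav": true },
    { "type": "page", "name": "Admin", "path": "/a/acme-opsagent-app/admin", "role": "Admin" }
  ],
  "dependencies": { "grafanaDependency": ">=10.4.0" }
}
```

The `role` field on the admin page is what gives the "admins see the admin console" behavior for free, with no second permission model.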
Layer 2 — Bedrock orchestration with hub-and-spoke Lambdas
Behind the plugin sits AWS Bedrock Agents. Bedrock handles model selection (Claude for reasoning, Nova for cheap classification, Llama for cost-sensitive answers), prompt orchestration, conversation memory, knowledge bases, and guardrails. The plugin never talks to a model directly — it talks to a Bedrock agent endpoint, and the agent decides which tools to call.
Tools are exposed to Bedrock as action groups, each backed by a small Lambda. One agent (the hub) fans out to many narrow Lambdas (the spokes). Each spoke owns one capability, holds short-lived credentials scoped to one purpose, does response shaping (truncating logs, summarizing metrics) before paying the token cost, and enforces the boring guardrails: query timeouts, max time ranges, max rows, dangerous-action confirmations.
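The shaping and guardrail logic inside a spoke is deliberately boring. A sketch of the two recurring helpers; the 6-hour window and 200-line cap are example values, not recommendations from this platform:

```python
from datetime import timedelta

MAX_RANGE = timedelta(hours=6)   # guardrail: cap query time windows
MAX_LINES = 200                  # guardrail: cap rows before paying token cost

def clamp_range(requested_hours: float) -> float:
    """Enforce the max-time-range guardrail before the query ever runs."""
    return min(requested_hours, MAX_RANGE.total_seconds() / 3600)

def shape_logs(lines: list[str], max_lines: int = MAX_LINES) -> dict:
    """Shape a raw log result before it reaches the model: truncate, and
    report what was dropped so the agent can tell the user honestly."""
    kept = lines[:max_lines]
    return {
        "lines": kept,
        "truncated": len(lines) > max_lines,
        "total_matched": len(lines),
    }

shaped = shape_logs([f"line {i}" for i in range(1000)])
```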
Layer 3 — the MCP gateway
The single most important architectural choice in the stack is putting an MCP gateway between the Lambda spokes and the actual data sources. The Model Context Protocol is an open standard for exposing tools and resources to AI agents. It is not specific to any model provider, so it gives you four properties that matter at scale:
- Vendor portability. Switching from Bedrock to Vertex, OpenAI, or Azure AI Foundry tomorrow does not change your MCP servers. Only the orchestration layer moves.
- Per-source isolation. Each MCP server is one Deployment, one ServiceAccount, one set of permissions. The Loki MCP cannot accidentally read your Jira tickets.
- Centralized policy. Rate limits, audit logs, and authorization checks live in the gateway, not duplicated across every spoke.
- Reusability. The same MCP servers can also be exposed to a Telegram bot, a Slack assistant, a CLI, or a future agent in another product, with zero rewrite.
Layer 4 — data and action MCPs
Below the gateway is whatever your environment runs. The platform is designed so that adding a new data source is a configuration change, not a code change.
Universal data-source coverage
Any system that has an API can become an MCP server. The list below is a starting menu — everything else (your in-house stores, your custom event bus, your industry-specific systems of record) plugs in the same way.
- Metrics
- Logs
- Traces & profiles
- Infrastructure & cloud
- Workflows & SaaS
- Creative, content, and the long tail
MCP catalog & custom JSON MCPs
The platform behaves like a Claude Desktop for the enterprise: an MCP catalog where admins toggle integrations on and off per organization, with a JSON editor for declaring custom MCPs that are not in the catalog yet.
Three integration patterns are supported:
- Built-in catalog MCPs. Pre-packaged servers for the most common data sources — Prometheus, Loki, OpenSearch, VictoriaLogs, Tempo, Kubernetes, Jira, GitHub, Slack, Canva, PagerDuty. One toggle, healthchecks already wired, dashboards included.
- Community / third-party MCPs. Any standards-compliant MCP server — including the growing ecosystem published on mcp.so and vendor-published servers — can be registered through the gateway. Drop in the connection details, scope its permissions, save.
- Custom JSON-defined MCPs. For internal tools where no MCP exists yet, admins paste a JSON definition directly in the admin console. The gateway provisions a sandbox, validates the schema, and exposes the new tools to the agent on save.
The custom-MCP JSON shape is intentionally familiar to anyone who has configured Claude Desktop or a similar agent host:
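A plausible example, reconstructed here rather than copied from the platform: the mcpServers / env core mirrors the Claude Desktop config format, while the transport, permissions, and healthcheck fields are assumptions about the gateway's own extensions:

```json
{
  "mcpServers": {
    "inventory-db": {
      "transport": "sse",
      "url": "https://mcp-inventory.internal.example.com/sse",
      "env": { "INVENTORY_API_TOKEN": "${secret:inventory-token}" },
      "permissions": { "mode": "read-only", "maxRowsPerQuery": 500 },
      "healthcheck": { "path": "/healthz", "intervalSeconds": 30 }
    }
  }
}
```

On save, the gateway validates this shape, provisions the sandbox, and the agent can call the new tools in its next session.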
Admin panel — ten built-in modules
The plugin ships with a complete admin console so platform owners do not need to build their own. Every module is role-gated and audited; every write goes through the datasource backend so AWS keys never reach the browser.
1. Knowledge Base — Sources
List every Bedrock Knowledge Base source and its backing S3 location. Source health (last sync, document count, READY / SYNCING / FAILED), on-demand re-ingestion, and full ingestion-job history with success/failure logs.
2. Knowledge Base — Upload
Drag-and-drop upload for Markdown, PDF, TXT, HTML, and JSON. Per-file metadata: title, tags, owner, runbook category. S3-direct streaming with progress, automatic Bedrock KB ingestion on save, and bulk delete or re-tag.
3. Quality Logs
Every conversation with user, prompt, final answer, tools called, latency, and token cost. Filter by user, date range, agent, "had-error", or "had-feedback". Drill into the full reasoning trace and export to CSV.
4. Usage Analytics
DAU / WAU / MAU. Prompts per user, team, and cluster. Token + cost by model and by agent. Tool-call frequency (which action group is hot). DevTools usage split out separately. Charts rendered as native Grafana panels.
5. Sessions
Active and historical sessions backed by DynamoDB. Per-user inspection, force-end / delete, and a session-replay view that rebuilds the chat exactly as the user saw it.
6. AI Apps Catalog
Manage which AI Apps are visible to which orgs and teams — log-search assistants, S3 explorers, cost-insight apps, security-posture apps. Per-app config: cluster allowlist, max queries per minute, allowed actions.
7. Agent Config
Read and edit live Bedrock agent settings without touching Terraform — agent ID / alias, default temperature, top-p, max tokens, system-prompt overlay layered above the GitOps-managed prompt, and tool / action-group enable flags. Saves are audited.
8. Runbooks
DynamoDB-backed runbook library. Full CRUD with versions and deprecation. Bind a runbook to a tool or scenario tag — the agent picks it up automatically. Preview against a sample query. Confluence import via MCP (planned).
9. Feedback Inbox
Thumbs-up / thumbs-down plus free-text feedback collected from end users. Aggregated per agent, per tool, per user. Triage status, link-to-Jira, and a CSAT trend chart over time.
10. Settings
Default model, streaming on/off, max session age, retention windows, audit-log toggle, per-user / per-team daily token caps, secret rotation status (AWS Secrets Manager / Azure Key Vault), and plugin diagnostics (build SHA, Bedrock connectivity).
Admin panel one-liner — built-in admin console for knowledge-base management, document upload, conversation quality review, usage analytics, session inspection, AI Apps catalog, live agent configuration, runbook CRUD, user feedback inbox, and platform settings — all inside Grafana, role-gated, and fully audited.
End-user UI — what operators actually see
The plugin is built around streaming. Every panel is incremental, every long operation is cancellable, every answer is annotated with cost and latency.
Main AI chat
- Conversational chat inside a Grafana app page — no separate UI, no separate login.
- Live streaming — tokens, tool traces, and reasoning steps render as the agent thinks.
- Multi-turn sessions with persistent history (localStorage + DynamoDB cross-device sync).
- Session sidebar — list, rename, search, pin, delete past chats; resume any conversation.
- Markdown + rich rendering — tables, code blocks, syntax highlighting, collapsible sections.
- Inline tool traces — every Bedrock action-group call shown as an expandable step (input → output → duration).
- Evidence panels — raw JSON or log lines surfaced alongside the natural-language answer.
- Inline charts — the agent can return time-series data and the UI renders it as a real Grafana panel.
- Stop / regenerate — cancel mid-stream, retry the last turn, branch a conversation.
- Copy / share — copy answer, copy full transcript, share-link a session.
- Feedback widget — thumbs and free-text after every answer, fed straight into the admin Feedback Inbox.
Quick-start UX
- Capabilities cards — one-click examples like "investigate trace ID X", "show error-rate spikes in the last hour", or "summarize today's deploys".
- Smart prompt suggestions — context-aware follow-ups generated from the current answer.
- Slash commands and quick actions — for example /runbook scale-down or /devtools cluster=prod-eu-1.
- Recent investigations widget on the landing page.
AI Apps tab
A multi-app launcher inside the same plugin. Each AI App is a focused mini-experience that shares the agent's brain but exposes a tighter UI.
- Log Explorer AI — cluster picker (allowlisted clusters only), plain-English queries that the agent translates into safe, read-only commands against OpenSearch / Elasticsearch / Loki / VictoriaLogs.
- S3 Historical Explorer — ask about archived data without writing Athena SQL.
- Cost Insights — "where did our model spend go this week" with per-team and per-tool drilldowns.
- Security Posture — quick checks against IAM, network policies, and configuration drift.
- Per-app entitlement so admins control which apps each org sees.
Runbooks experience
- Browse the runbook library (DynamoDB-backed, admin-curated).
- Search by tag, scenario, or service.
- "Run with AI" — execute a runbook conversationally; the agent fills variables from context.
- Pin frequently used runbooks to the sidebar.
Dashboards integration
- The agent can open, link, or generate Grafana dashboards in response to a question.
- "Add this panel to dashboard X" action on agent-generated charts.
- Cross-link from chat → dashboard → back to chat, preserving session state.
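The deep-linking behind these items is plain URL construction against Grafana's dashboard routes. A sketch, assuming the dashboard UID and panel id are already known; viewPanel is a standard Grafana query parameter, and the base URL is a placeholder:

```python
from urllib.parse import urlencode

def panel_link(base: str, dash_uid: str, panel_id: int,
               frm: str, to: str) -> str:
    """Build a deep link to one panel at a specific time range, so the
    agent's answer can jump straight to the evidence."""
    qs = urlencode({"viewPanel": panel_id, "from": frm, "to": to})
    return f"{base}/d/{dash_uid}?{qs}"

url = panel_link("https://grafana.example.com", "abc123", 4,
                 "2024-05-01T14:00:00Z", "2024-05-01T14:30:00Z")
```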
Investigations / case view
- Multi-step investigation timeline (alert → trace → log → conclusion).
- Saveable as a "case" with title, owner, and a Jira link.
- Export to Markdown or Confluence.
Personalization & DevEx
- Per-user preferences: default model behavior, density (compact / comfortable), code theme, streaming speed.
- Pinned prompts and prompt templates (private and team-shared).
- Notification badges for long-running investigations completing in the background.
- Keyboard shortcuts: new chat, focus input, switch app, stop generation.
- Mobile / narrow-viewport responsive layout.
- Light + dark theme that follows the active Grafana theme.
- Localization-ready strings end-to-end.
Trust, safety & visibility
- "Why did it answer this?" panel — full reasoning trace and the sources used.
- Source citations on KB-grounded answers (file name, section, link).
- Cost / latency badge on every answer (tokens used, time-to-first-token, total time).
- Friendly error cards with retry, report, and operator-action hints.
- Session privacy indicator showing whether the transcript is stored and who can see it.
End-user app one-liner — a streaming AI chat experience inside Grafana with full reasoning traces, multi-turn sessions, an AI Apps launcher (log search, cost insights, security posture, and more), runbook execution, dashboard integration, evidence panels, session sharing, and feedback — themed, SSO-protected, and rendered natively in Grafana.
Governance & access control
Governance was a first-class design constraint, not an afterthought:
- Role-based gating. Every surface respects Grafana's built-in roles — Admin, Editor, Viewer.
- Per-tab visibility. Viewers see read-only views; editors get write actions; admins see the admin console.
- Per-user identity carried through. The end-user identity flows from Grafana to the datasource backend to DynamoDB usage tables, so every prompt and every tool call is attributable.
- No browser-side credentials. Every privileged call goes through the datasource backend — AWS keys, Azure keys, and per-MCP secrets stay server-side.
- Read / write split. Each MCP exists in two flavours where it matters — a read-only server enabled in every environment, and a write server enabled only where the agent is allowed to mutate state.
- Streaming-aware everywhere. No "spinning forever" UIs, no long-poll surprises in the audit log.
- Multi-arch. amd64 and arm64 images, runs on EKS, AKS, GKE, or on-prem Grafana.
- CSI Secrets Store integration. Loaded via GF_PLUGINS_PREINSTALL from AWS Secrets Manager or Azure Key Vault — nothing baked into images.
Cost & operational shape
Three cost levers matter, and they all live in the orchestration layer, not the data layer:
- Model tiering. A small, cheap model handles intent classification and tool selection; the expensive reasoning model is reserved for the actual answer. A simple router cuts model cost by 60-80% on typical traffic.
- Response shaping in the spokes. Never let raw log output flow through the model. Pre-aggregate, sample, or summarize in the Lambda. Tokens are the dominant cost; bytes are not.
- Conversation memory bounds. Default session memory is unbounded; cap it to the last N turns and summarize older context server-side.
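The first and third levers can be sketched in a few lines. The model IDs and the eight-turn cap below are illustrative placeholders, not recommendations:

```python
# Illustrative model identifiers; substitute the Bedrock IDs your account uses.
CLASSIFIER = "amazon.nova-lite-v1:0"     # cheap tier: intent + tool selection
REASONER = "anthropic.claude-sonnet"     # expensive tier: the actual answer

SIMPLE_INTENTS = {"greeting", "status_lookup", "runbook_fetch"}

def route(intent: str) -> str:
    """Model tiering: only escalate to the reasoning model when the cheap
    classifier says the question needs multi-step correlation."""
    return CLASSIFIER if intent in SIMPLE_INTENTS else REASONER

def trim_memory(turns: list[dict], keep_last: int = 8) -> list[dict]:
    """Conversation memory bound: keep the last N turns verbatim and fold
    older context into a single summary turn (summary text elided here)."""
    if len(turns) <= keep_last:
        return turns
    summary = {"role": "system",
               "content": f"[summary of {len(turns) - keep_last} earlier turns]"}
    return [summary] + turns[-keep_last:]

history = trim_memory([{"role": "user", "content": str(i)} for i in range(20)])
```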
On the operations side, the MCP gateway is the only stateful piece worth a runbook. Everything else — plugin, Lambdas, MCP servers — is stateless and replaceable. A typical production deployment runs the gateway as an HA pair behind a Service, with a sidecar exporting Prometheus metrics about query rate, error rate, and per-source latency. The gateway's own dashboard goes into Grafana, of course.
Industry use cases
The same four-layer architecture, with different MCP servers wired in, covers very different operational worlds:
- Telco & network operations. MCPs for KPI metrics, network inventory, change-management, and trouble-ticketing. Operators ask why a region is degraded; the agent joins KPI windows with recent change events automatically.
- Fintech. MCPs for transaction stores, rules engines, and case management. Analysts ask why a merchant's approval rate dropped; the agent correlates rule changes with downstream metrics.
- IoT fleet operations. MCPs for device telemetry, firmware-deploy tooling, and field service. Fleet managers ask which device cohort is flagging a new error code; the agent groups by firmware version and geography.
- Retail platform reliability. MCPs for funnel metrics, feature-flag systems, and CDN logs. Engineers ask why conversions dipped in a region at 14:07; the agent correlates with a flag rollout twelve minutes earlier.
- Security operations. MCPs for SIEM, EDR, identity providers, and ticketing. Analysts ask whether a flagged pattern matches anything in the last 90 days; the agent pivots through evidence in one conversation.
- Healthcare & pharma operations. MCPs for the LIMS, integration layer, and audit trails. Operators ask which lab batch fell out of spec; the agent answers with the audit-evidenced trail.
What to build first
If you are starting from zero, build in this order. It produces a working agent fastest and keeps the architecture intact for everything that comes after.
- One MCP server for your most-used data source (almost always Prometheus or Loki). Run it locally first; deploy to Kubernetes once it works.
- One Lambda spoke that calls that MCP server. Wire it as a Bedrock agent action group. Test from the Bedrock console.
- A minimal Grafana app plugin with just the chat panel, posting to your Bedrock agent. No catalog yet, no per-team config — that comes later.
- Add the second and third MCP servers (Kubernetes, Jira). Now the agent can correlate, and value-per-question jumps.
- Build the admin panel modules in order of risk: Sessions and Quality Logs first (pure observation), then Usage Analytics, then KB management, then live Agent Config last.
- Add the AI Apps tab once the patterns repeat — if you find yourself shipping the same prompt template three times, it deserves an app tile.
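Of the steps above, wiring the Lambda spoke (step 2) is the only one with an AWS-specific contract. A minimal sketch of a function-style action-group handler, with the downstream MCP call stubbed out; the field names follow the Bedrock Agents Lambda event format, but verify the exact shape against the current AWS docs:

```python
import json

def handler(event, context=None):
    """Lambda spoke: Bedrock calls this with the action group, the function
    the agent chose, and its parameters; we call the MCP gateway and wrap
    the (shaped) result in the response envelope Bedrock expects."""
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}

    # Hypothetical downstream call; in production this goes through the
    # MCP gateway, e.g. query_loki(params["query"], time_range).
    result = {"matches": 3,
              "sample": ["err: timeout", "err: timeout", "err: refused"]}

    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "function": event["function"],
            "functionResponse": {
                "responseBody": {"TEXT": {"body": json.dumps(result)}}
            },
        },
    }

resp = handler({
    "actionGroup": "logs",
    "function": "search_logs",
    "parameters": [{"name": "query", "type": "string", "value": "level=error"}],
})
```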
Closing thought
The temptation with AI in operations is to build a clever monolith — one big chatbot that knows about everything. This pattern resists that. The reasoning model is one component. The integrations are another. The presentation surface is a third. The administrative plane is a fourth. The boundaries between them are open standards: MCP, the Grafana app plugin SDK, S3, DynamoDB. None of them lock you in. That separation is what makes the architecture survive the next model release, the next vendor pivot, and the next data source the team adopts.
If you are building this and want a second pair of hands — on the Grafana plugin, the Bedrock orchestration, the MCP gateway, or the data-source MCP servers themselves — this is exactly the work we do. See our Grafana plugin services or book a free architecture call.