← Back to works

Project

Incident Triage Platform (Agent + MCP + Airflow)

Evidence-driven incident triage platform where an agent calls a guarded MCP control plane, Airflow collects/normalizes evidence, and safe actions create tickets or send notifications with audit and idempotency.

2026-02-24MCPIncident ResponseAirflowLangGraphKubernetesDockerJiraSlackObservabilityRBACGitHub

Problem statement

Modern incident response often breaks down across disconnected systems: alerts fire in one tool, evidence lives in another, and ticketing/notifications happen manually. I wanted a platform where an AI agent could help triage incidents without being given direct access to infrastructure or production systems.

Architecture overview

The platform is built around a guarded MCP control plane:

  • Agent (LangGraph) decides what to do and calls MCP tools only
  • MCP server enforces RBAC, safe-action gating, audit logging, and idempotency
  • Airflow orchestrates evidence collection and produces a normalized EvidenceBundle v1
  • Ticketing / notifications (Jira, ServiceNow, Slack, Teams) are executed as controlled actions

Incident Triage Platform Architecture

What I built

  • MCP incident triage server with tool-based workflows for:
    • evidence retrieval / waiting for bundles
    • deterministic triage summaries
    • Jira draft + create ticket flows
    • Slack / Teams notifications
  • Safe actions layer with:
    • RBAC roles
    • confirm-token gating for live actions
    • audit logging
    • idempotent ticket creation (retry-safe)
  • Airflow integration for triggering incident evidence DAGs and reading artifacts
  • Standalone mode (no Airflow required) for local demos / testing
  • Networked agent mode so the LangGraph agent calls MCP over streamable-http
  • Kubernetes + Helm deployment path for MCP and one-shot agent Jobs

Technical decisions & tradeoffs

  • Thin MCP, heavy evidence pipeline in Airflow

    • Kept provider-specific evidence collection out of the MCP server
    • Tradeoff: more orchestration complexity, but cleaner control-plane boundaries
  • Bundle-first triage (EvidenceBundle v1)

    • Standardized evidence format for deterministic summaries and action payloads
    • Tradeoff: requires a normalization layer, but simplifies downstream tooling
  • Agent as Kubernetes Job (one incident = one run)

    • Strong isolation and clear retry semantics
    • Tradeoff: requires a dispatcher/webhook service for fully automated event ingestion
  • Workflow backend split from evidence backend

    • WORKFLOW_BACKEND=airflow|none
    • EVIDENCE_BACKEND=fs|s3|none
    • Tradeoff: slightly more configuration, but much cleaner local vs prod deployments

Reliability, governance, and security

  • Audit logging for tool requests and outcomes
  • Idempotency mapping for ticket creation retries
  • RBAC + confirm-token approvals for live actions
  • Bundle-only mode to disable direct provider fetch tools and rely on normalized evidence
  • Airflow 2/3 compatibility (including Airflow 3 API v2 + token auth)

Deployment modes

  • Local / standalone: fs evidence backend, mock Jira, stdio MCP
  • Local / k8s prod-sim: MCP + agent Job + Airflow on Kubernetes with shared PVC artifacts
  • Prod target: Airflow orchestration + object storage evidence backend + guarded MCP actions

What this project demonstrates

This project demonstrates how to combine agentic workflows with enterprise controls: models can assist with incident triage, but all actions are mediated by a policy-aware control plane with explicit guardrails and auditability.