Building an AI Operations Crew: How to Think About Agent Composition for Your Team

The Case Against Monolithic AI Platforms

There's a familiar pattern in enterprise software procurement: a vendor sells you a comprehensive platform with dozens of features, you spend months implementing it, and two years later you're using 15% of the functionality while paying for 100%. The shelf-ware problem is real, and it's particularly acute with AI-powered tools because the gap between "demo capability" and "production-ready capability" is enormous.

Agent-based SRE takes a fundamentally different approach. Instead of deploying a monolithic platform and hoping your team grows into it, you assemble a crew of specialized agents based on what your environment actually needs right now. You start small. You prove value. You expand deliberately.

This composable model isn't just a packaging strategy — it reflects how trust actually develops between operations teams and AI systems. Trust isn't built by flipping a switch. It's built by watching an AI agent consistently produce accurate results in a low-risk context, then gradually expanding its scope as confidence grows.

Start With the Core Three

If you're deploying an AI operations crew for the first time, start with three agent roles. These form the foundation of automated incident investigation and provide immediate, measurable value without requiring any write access to your infrastructure.

Incident Commander

This is the orchestration layer — the agent that ties everything together. The Incident Commander receives alerts from your monitoring platforms (Azure Monitor, CloudWatch, Datadog, PagerDuty, etc.), recognizes when multiple alerts are related to the same underlying issue, and coordinates the investigation workflow.

In practical terms, the Incident Commander does what a human incident commander does in a war room: it opens a communication channel (a Microsoft Teams channel or Slack thread), assigns investigation tasks to specialist agents, posts regular status updates, and produces the final root cause analysis report. It also creates and updates your ServiceNow incident records automatically, so your ITSM workflow stays intact.

The Incident Commander is always required — it's the base agent that every crew needs. Without it, you have individual investigators with no coordination.
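
As a rough illustration of how related alerts might be recognized, the sketch below groups alerts that hit the same service within a short time window. The `Alert` fields, the five-minute window, and the grouping rule are all assumptions made for this example; the text does not describe the actual correlation logic, which would likely also consider dependencies and alert content.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Alert:
    source: str         # monitoring platform that raised the alert
    service: str        # affected service name (hypothetical field)
    fired_at: datetime  # when the alert fired

def group_related(alerts, window=timedelta(minutes=5)):
    """Group alerts on the same service that fire within `window` of the
    previous alert in that group."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        for group in groups:
            last = group[-1]
            if last.service == alert.service and alert.fired_at - last.fired_at <= window:
                group.append(alert)
                break
        else:
            # No existing group matched, so this alert starts a new one.
            groups.append([alert])
    return groups
```

Each resulting group would map to one investigation rather than one investigation per alert.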

Log Analyst

This is where the deep technical investigation happens. The Log Analyst connects to every monitoring and observability platform in your environment and queries them simultaneously during an incident. Azure Log Analytics KQL queries, AWS CloudWatch Insights queries, Datadog metric and trace API calls, Splunk Cloud searches — all running in parallel.
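
The parallel fan-out described above can be sketched with Python's `asyncio`. This is a minimal illustration, not the product's implementation: `query_platform` is a hypothetical stand-in that sleeps instead of calling a real API, and the platform names are simply the ones mentioned in the text.

```python
import asyncio

async def query_platform(name: str, delay: float) -> dict:
    """Hypothetical stand-in for a real platform query; sleeps instead of calling an API."""
    await asyncio.sleep(delay)
    return {"platform": name, "status": "ok"}

async def investigate() -> list:
    platforms = ["Azure Log Analytics", "CloudWatch", "Datadog", "Splunk Cloud"]
    tasks = [query_platform(p, 0.01) for p in platforms]
    # return_exceptions=True returns a failure as a value, so one failing
    # platform does not discard the results from the others.
    return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(investigate())
```

Because the queries run concurrently, the total wait is roughly that of the slowest platform rather than the sum of all of them.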

The Log Analyst's key differentiator is cross-platform signal correlation. When it finds elevated error rates in Azure App Insights and corresponding latency spikes in AWS CloudWatch, it doesn't present these as two separate findings — it correlates them with shared dependencies, timing patterns, and causal chains. The output is a structured evidence package with timelines, metric screenshots, and direct links to the relevant dashboards.

For teams running multi-cloud environments (which, at the enterprise level, means most teams), the Log Analyst provides the single most impactful capability: the ability to see across all your observability data in one investigation.

Knowledge Keeper

The Knowledge Keeper is the institutional memory agent. It indexes every past investigation, every RCA document, every runbook in your wiki, and every knowledge article in ServiceNow. When a new incident occurs, the Knowledge Keeper searches for matching patterns — similar error signatures, similar affected services, similar timing patterns — and surfaces relevant historical context.
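
To make the idea of matching error signatures concrete, here is a deliberately simple sketch that scores token overlap (Jaccard similarity) between a new signature and past ones. The similarity measure, the 0.5 threshold, and the record fields are assumptions for illustration; a production system would more plausibly use embeddings or a search index.

```python
def tokens(text):
    """Lowercase the text and split it into a set of whitespace-separated tokens."""
    return set(text.lower().split())

def jaccard(a, b):
    """Share of tokens the two strings have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def best_match(signature, past_incidents, threshold=0.5):
    """Return the most similar past incident, or None if nothing clears the threshold."""
    scored = [(jaccard(signature, p["signature"]), p) for p in past_incidents]
    score, incident = max(scored, key=lambda s: s[0], default=(0.0, None))
    return incident if score >= threshold else None
```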

The immediate value is obvious: "We've seen this before" is one of the most powerful statements in incident response. When the Knowledge Keeper surfaces a past incident with an identical error signature and the resolution that worked, the current investigation jumps from hypothesis-generation to verification. The engineer's question shifts from "what could be wrong?" to "is this the same thing that happened in February?"

The compounding value is what makes the Knowledge Keeper uniquely important over time. Every investigation adds to the knowledge base. Every resolution adds to the pattern library. After six months of operation, the Knowledge Keeper has institutional memory that would take a new hire years to accumulate — and unlike a human's memory, it doesn't forget, it doesn't leave the company, and it doesn't go on vacation.

The Trust Progression

Once the core three agents are operating and your team has seen consistent, accurate investigation results, the natural question is: what's next? The answer depends on your specific pain points, but the progression follows a predictable trust curve.

Stage 2: Code-Level Correlation

Add the Code Inspector when your team's most common incident response question is "what changed?" The Code Inspector connects to your source control and CI/CD platforms — Azure DevOps Repos and Pipelines, GitHub, Bitbucket — and automatically correlates telemetry findings with recent code changes.

During an investigation, the Code Inspector checks: What was deployed in the last 4 hours? What pull requests were merged? What configuration files changed? When the Log Analyst identifies that the error pattern started at 02:43 UTC, the Code Inspector cross-references that timestamp against your deployment pipeline and surfaces: "Azure DevOps release REL-847 deployed at 02:41 UTC, containing 3 merged PRs — one of which modified Redis cache TTL configuration."
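
The timestamp cross-reference in that example can be sketched as a simple window filter. The function name, the record layout, and the second deployment are hypothetical; only the 02:43 and 02:41 UTC times, the release name, and the 4-hour lookback come from the text.

```python
from datetime import datetime, timedelta

def deployments_before(error_start, deployments, lookback=timedelta(hours=4)):
    """Return deployments inside the lookback window before the error began,
    ordered from closest to furthest."""
    candidates = [
        d for d in deployments
        if timedelta(0) <= error_start - d["deployed_at"] <= lookback
    ]
    return sorted(candidates, key=lambda d: error_start - d["deployed_at"])

# The date is arbitrary; only the times of day come from the example above.
error_start = datetime(2024, 1, 1, 2, 43)
deployments = [
    {"release": "REL-847", "deployed_at": datetime(2024, 1, 1, 2, 41)},
    # Hypothetical older deployment that falls outside the 4-hour window.
    {"release": "older-release", "deployed_at": datetime(2023, 12, 31, 20, 0)},
]
suspects = deployments_before(error_start, deployments)
```

A deployment two minutes before the error onset is only a lead, not proof of cause, which is why the finding is presented as evidence for an engineer to verify.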

This correlation is something human engineers do manually during every incident — scrolling through deployment history, checking git logs, comparing timestamps. The Code Inspector does it in seconds and presents the findings as structured evidence within the broader investigation.

Stage 3: Support Intelligence

Add the Support Liaison when user-reported issues are frequently the first signal of a production problem. In many enterprise environments, customer support tickets arrive before monitoring alerts fire — users notice degraded performance or errors before thresholds are breached.

The Support Liaison monitors your support channels (ServiceNow Customer Service Management, Zendesk, Jira Service Management) and detects patterns: multiple tickets reporting similar issues within a short timeframe. When it identifies a pattern that suggests a production problem, it escalates to the Incident Commander — potentially before any monitoring alert has fired.
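
One simple way to detect "multiple tickets within a short timeframe" is a sliding time window, sketched below. It assumes the tickets have already been grouped by similar issue, and the threshold of three tickets in 15 minutes is an arbitrary illustrative choice, not a documented default.

```python
from collections import deque
from datetime import datetime, timedelta

def detect_cluster(ticket_times, min_tickets=3, window=timedelta(minutes=15)):
    """Return True if at least `min_tickets` tickets arrive within any
    span of length `window`."""
    recent = deque()
    for t in sorted(ticket_times):
        recent.append(t)
        # Drop tickets that are too old to share a window with the newest one.
        while t - recent[0] > window:
            recent.popleft()
        if len(recent) >= min_tickets:
            return True
    return False
```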

After resolution, the Support Liaison updates affected support tickets with the resolution details and links to the RCA. This closes the loop between infrastructure incidents and customer impact in a way that most organizations handle manually (or don't handle at all).

Stage 4: FinOps and Cost Awareness

Add the Cloud Advisor when you want to understand the financial dimension of your operations. The Cloud Advisor uses the same cloud API access that the Log Analyst already has — AWS Cost Explorer, Azure Cost Management, GCP Cloud Billing — to analyze resource utilization and identify optimization opportunities.

The Cloud Advisor is particularly useful in the context of an AI operations crew because it connects cost data to operational events. It can answer questions that a standalone FinOps tool, which has no view of incident activity, typically cannot: "The auto-scaling event during last night's incident spun up 14 extra D4s_v3 instances that are still running at $6.72/hour. Should we scale them back down?" Or: "This service's average CPU utilization is 8% outside of incident spikes, so it's over-provisioned by approximately $2,400/month."
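
To show why forgotten scale-out capacity matters, the arithmetic behind the first example can be worked through as follows. This sketch assumes the $6.72/hour figure is the combined rate for all 14 instances (the text does not say whether it is per instance or total) and uses the common 730-hours-per-month approximation.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year divided by 12

def idle_cost(hourly_rate: float, hours: float) -> float:
    """Cost of leaving resources running for `hours` at `hourly_rate`."""
    return round(hourly_rate * hours, 2)

# Assumption: $6.72/hour is the combined rate for the leftover instances.
per_day = idle_cost(6.72, 24)
per_month = idle_cost(6.72, HOURS_PER_MONTH)
```

Under those assumptions, instances left running after an incident cost about $161 per day, or roughly $4,900 if nobody notices for a month.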

This is the agent role that gets CFOs and CIOs interested: it brings operational efficiency and cost optimization into the same crew.

Stage 5: SLA Management

Add the SLO Tracker when your organization needs to report reliability metrics to leadership or calculate SLA credit exposure for customers. The SLO Tracker monitors error budget burn rates in real time, calculates how close each service is to breaching its SLA, and generates reliability scorecards.

For MSPs, the SLO Tracker is particularly valuable — it calculates per-customer SLA metrics and credit exposure, providing the data needed for customer-facing reliability reports.
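
For readers unfamiliar with error budget burn rates, the standard calculation is the observed error rate divided by the error rate the SLO allows. The sketch below shows that formula; the function name and inputs are illustrative and not taken from the product.

```python
def burn_rate(slo_target: float, failed: int, total: int) -> float:
    """Observed error rate divided by the allowed error rate (1 - SLO).

    1.0 means the budget is being consumed exactly on schedule;
    higher values mean it will run out before the SLO period ends.
    """
    if total == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target
    return (failed / total) / allowed_error_rate

# A 99.9% SLO allows 1 failure per 1,000 requests; 5 failures burns 5x too fast.
rate = burn_rate(0.999, failed=5, total=1000)
```

At a sustained burn rate of about 5, a 30-day error budget would be exhausted in roughly 6 days.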

Stage 6: Active Remediation

Add the Remediation Engineer when — and only when — your team trusts the investigation results enough to let the AI execute fixes. The Remediation Engineer can perform approved remediation actions: rollbacks, service restarts, scaling changes, configuration updates. But it always requires explicit human approval. Every action is logged, auditable, and linked to the ServiceNow incident record.
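
The approval requirement can be pictured as a gate in front of every action. The following is a minimal sketch under assumed names (`RemediationRequest`, `execute`, and an in-memory `audit_log`); it illustrates the pattern of refusing unapproved actions and logging every attempt, not the product's actual mechanism.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional

@dataclass
class RemediationRequest:
    action: str                        # e.g. "rollback" or "restart"
    incident_id: str                   # ITSM record the action is linked to
    approved_by: Optional[str] = None  # set only after a human approves

audit_log: list = []  # hypothetical in-memory stand-in for a durable audit trail

def execute(request: RemediationRequest, run_action: Callable[[], str]) -> str:
    """Run the action only if a human approver is recorded; log every attempt."""
    approved = request.approved_by is not None
    audit_log.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "action": request.action,
        "incident_id": request.incident_id,
        "approved_by": request.approved_by,
        "approved": approved,
    })
    if not approved:
        raise PermissionError("remediation requires explicit human approval")
    return run_action()
```

Logging before the approval check means rejected attempts are auditable too.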

This is deliberately positioned as the last stage of the trust progression. By the time you enable the Remediation Engineer, your team has months of experience reviewing the crew's investigation accuracy. The trust has been earned through consistent, correct read-only investigations.

Agent Teams: Isolation and Scale

As your operations mature, you'll naturally want to separate concerns. A platform engineering team investigating Kubernetes issues has different integrations, different runbooks, and different historical patterns than a data engineering team investigating pipeline failures.

Agent teams provide this isolation. Each team has its own connected integrations and credentials, its own agent configuration, its own investigation history, and its own Knowledge Keeper memory. For enterprises, this enables departmental separation and internal chargeback. For MSPs, this enables per-customer isolation — each managed customer gets their own agent team with dedicated configuration and billing.
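
One way to picture this isolation is as a per-team record that owns its own integrations, agents, and history. The `AgentTeam` structure and its field names below are assumptions for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class AgentTeam:
    name: str
    integrations: dict = field(default_factory=dict)  # per-team credentials and endpoints
    agents: list = field(default_factory=list)        # agent roles enabled for this team
    history: list = field(default_factory=list)       # per-team investigation memory

def record_investigation(team: AgentTeam, summary: str) -> None:
    """Store an investigation in this team's history only."""
    team.history.append(summary)

platform = AgentTeam("platform-engineering", agents=["Incident Commander", "Log Analyst"])
data = AgentTeam("data-engineering", agents=["Incident Commander", "Knowledge Keeper"])
record_investigation(platform, "Kubernetes node pressure investigation")
```

Because each team gets its own `history` list, an investigation recorded for one team never appears in another team's memory.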

Volume discounts apply across all teams on your account. This means the economic incentive aligns with the operational benefit: the more teams and agents you deploy across your organization, the lower the per-agent cost.

The Framework: Start Small, Prove Value, Expand Deliberately

Here's the practical framework for building your AI operations crew:

Month 1-2: Deploy the core three (Incident Commander + Log Analyst + Knowledge Keeper). Connect your primary monitoring platforms and ITSM. Let the crew investigate real incidents alongside your existing process. Compare the AI investigation results with your engineers' manual findings. Measure investigation time savings.

Month 3-4: If the comparisons from months 1-2 show that investigation accuracy is consistently high, expand tool connections and start relying on the crew's briefings as the primary investigation source. Because the crew is read-only and evidence-based, each finding can be checked against the underlying data before you extend that reliance. Add the Code Inspector if deployment correlation is a frequent need.

Month 5-6: Evaluate secondary agents based on your team's specific pain points. Support Liaison if user tickets are leading indicators. Cloud Advisor if cost optimization is a priority. SLO Tracker if reliability reporting is needed.

Month 7+: Consider the Remediation Engineer for teams with high trust in investigation accuracy. Continue expanding Knowledge Keeper's index with runbooks, wiki content, and past postmortems.

The key principle: one team of three agents that saves 20 hours per month of investigation time is worth more than a full platform with ten agents that nobody trusts. Build trust incrementally. Expand when the data supports it. And deploy only what you need — the composable model is there precisely so you don't have to buy features you won't use.
