
Agent Infrastructure at Scale: The Control Plane


At scale, three structural problems will break your agentic systems. They're not optional concerns. They're infrastructure requirements.

The Three Core Problems

1. Tool Definition Fragmentation

Service owners publish their own MCP servers. Result: inconsistent implementations, definitions that drift from reality, silent failures.

The structural issue: Tool definitions live outside the deployment pipeline. Developers update endpoints. MCP server definitions lag. Agents call deprecated parameters. The LLM hallucinates a recovery that doesn't exist.

Fix: Make tool definitions derivative artifacts computed from source-of-truth schemas, not independent sources.

2. Security at Machine Velocity

Agents make and execute decisions at machine speed. Humans cannot review every execution. Your agents access production systems, sensitive data, and business-critical workflows.

Requirements:

  • Centralized authorization: Every tool call validated against RBAC policies
  • PII redaction: Response filtering before agent context window
  • Mutation gating: Read-only by default; writes require explicit approval
  • Traceability: Full audit trail of every agent execution

Standard API gateways are insufficient. You're securing an entire execution plan generated ad-hoc by an LLM.
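
The requirements above can be sketched as a single gateway-side check. This is a minimal illustration, not a real API: the policy shape, agent names, and tool names are all hypothetical.

```python
# Illustrative gateway check: RBAC allowlist plus read-only-by-default
# mutation gating. Policy structure and names are hypothetical.

READ_ONLY_DEFAULT = True

def authorize(agent_id, tool, is_mutation, policy):
    """Return (allowed, reason) for a single tool call."""
    agent_policy = policy.get(agent_id, {})
    if tool not in agent_policy.get("tools", set()):
        return False, "tool not in agent allowlist"
    if is_mutation and READ_ONLY_DEFAULT:
        # Writes are denied unless explicitly approved for this agent.
        if tool not in agent_policy.get("approved_mutations", set()):
            return False, "mutation requires explicit approval"
    return True, "ok"

policy = {
    "support-agent": {
        "tools": {"ticket_read", "ticket_update"},
        "approved_mutations": {"ticket_update"},
    }
}
```

Every call flows through one function like this, which is what makes centralized audit and traceability possible.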

3. Discovery and Quality

Thousands of available tools. The agent must solve: which tools exist, what do they do, which combination solves the task?

Bad descriptions → LLM hallucination. Agent sees "PaymentService.process()" and invokes with non-existent parameters. The model can't distinguish between documented APIs and hallucinated ones.

Requirements:

  • Schema-derived descriptions: Generated from actual service contracts, not human prose
  • Workflow-scoped availability: Not all agents need all tools
  • Parameter constraints: Prevent LLM parameter invention
  • Quality metrics: Success rate, parameter validity, latency, user satisfaction
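
Parameter constraints in particular are mechanical to enforce. A sketch, assuming a schema-derived contract per tool (the contract shape here is illustrative, not a standard):

```python
# Validate agent-supplied parameters against a schema-derived contract.
# Unknown parameters -- the common hallucination mode -- are rejected outright.

def validate_params(params, contract):
    """contract: {name: {"type": type, "required": bool}} -- illustrative shape."""
    errors = []
    for name in params:
        if name not in contract:
            errors.append(f"unknown parameter: {name}")  # invented by the LLM
    for name, spec in contract.items():
        if spec.get("required") and name not in params:
            errors.append(f"missing required parameter: {name}")
        elif name in params and not isinstance(params[name], spec["type"]):
            errors.append(f"wrong type for {name}")
    return errors

contract = {"amount": {"type": int, "required": True},
            "currency": {"type": str, "required": True}}
```

Rejecting unknown names at the gateway means a hallucinated parameter fails fast with a clear error instead of silently reaching the service.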

The Infrastructure Response

Ad-hoc tooling fails. You need a unified control plane.

Protocol-Driven Tool Generation

Stop requiring service owners to write MCP servers.

Instead:

  1. Introspect service schemas (Protobuf, Thrift, OpenAPI)
  2. Auto-generate MCP tool definitions from schema
  3. Generate natural-language descriptions via LLM tuned for agent understanding
  4. Version and distribute from central registry

Tool definitions stay synchronized with service contracts because they're computed from source of truth.

Constraint: Services must have formalized definitions. If documentation lives in Slack, this fails.
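
The generation step can be sketched for the OpenAPI case. The output fields follow the MCP tool shape (name, description, inputSchema); the input is a trimmed, hypothetical operation fragment:

```python
# Sketch: derive an MCP-style tool definition from an OpenAPI-like operation.
# Because the definition is computed, it cannot drift from the contract.

def tool_from_operation(path, method, op):
    params = {p["name"]: {"type": p["schema"]["type"]}
              for p in op.get("parameters", [])}
    required = [p["name"] for p in op.get("parameters", []) if p.get("required")]
    return {
        "name": op["operationId"],
        "description": op.get("summary", f"{method.upper()} {path}"),
        "inputSchema": {"type": "object",
                        "properties": params,
                        "required": required},
    }

op = {
    "operationId": "get_user",
    "summary": "Fetch a user by id",
    "parameters": [{"name": "user_id", "required": True,
                    "schema": {"type": "string"}}],
}
tool = tool_from_operation("/users/{user_id}", "get", op)
```

In practice this runs in CI on every contract change, with the LLM-polished description layered on top before the registry publishes a new version.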

Gateway-Based Control Plane

Single gateway service becomes the enforcement point:

  • All agent-to-service calls route through gateway
  • Authentication and authorization centralized
  • Response filtering (PII redaction) before agent sees data
  • Complete metrics and observability by default
  • Rate limiting and circuit breaking protect downstream systems

Gateway is config-driven. Policies live in version control, deployed like code.

Gateway policy requirements:

  • Tool allowlist per agent
  • Rate limits (calls/minute, calls/hour)
  • Parameter constraints (max_files, max_size, etc.)
  • Data redaction rules (api_keys, secrets, pii)
  • Approval gating for write operations

Tool Scoping and Refinement

Not all agents need all tools.

Support:

  • Workflow-specific tool sets: Agent builders select allowed tools explicitly
  • Parameter overrides: Constrain parameters for known workflows to prevent hallucination
  • Derived tools: Specialized tool definitions layered on base service tools

Automation gets you to 80%. Explicit scoping and overrides get you to production.
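
Derived tools and parameter overrides compose mechanically: start from the base definition, drop parameters the workflow never needs, and pin values server-side. A sketch with hypothetical tool names:

```python
# Sketch of a derived tool: a workflow-scoped definition layered on a base
# tool. Pinned parameters are fixed server-side and hidden from the LLM,
# so there is nothing left for it to hallucinate.

def derive_tool(base, name, pinned=None, drop=None):
    tool = {**base, "name": name,
            "inputSchema": {**base["inputSchema"],
                            "properties": dict(base["inputSchema"]["properties"])}}
    for p in (drop or []):               # parameters the workflow never needs
        tool["inputSchema"]["properties"].pop(p, None)
    tool["pinned"] = dict(pinned or {})  # injected by the gateway at call time
    return tool

base = {"name": "search_tickets",
        "inputSchema": {"type": "object",
                        "properties": {"query": {"type": "string"},
                                       "index": {"type": "string"},
                                       "limit": {"type": "integer"}}}}
support_search = derive_tool(base, "search_support_tickets",
                             pinned={"index": "support"}, drop=["index"])
```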

Quality Metrics

Tools have SLAs. Track:

  • Success rate: % of calls that work vs. fail
  • Parameter validity: % of calls with invalid parameters
  • Latency distribution: p50, p99, p99.9
  • User satisfaction: Correctness of results from agent's perspective

Actions for underperforming tools:

  1. Refine (better descriptions, tighter constraints)
  2. Deprecate (remove from agent access)
  3. Tier by SLA (restrict to high-priority agents)

SLA tracking:

tool_metrics:
  github_create_pull_request:
    success_rate: 0.98
    parameter_validity: 0.95
    p99_latency_ms: 1200
    user_satisfaction: 4.2/5.0

  payment_process:
    success_rate: 0.99
    parameter_validity: 0.98
    p99_latency_ms: 2500
    user_satisfaction: 4.7/5.0

Establish minimum thresholds. Publish to internal SLA dashboard. Action underperformers monthly.
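
Computing these metrics from raw call records and flagging underperformers is straightforward. A sketch, with example record fields and example thresholds:

```python
# Sketch: aggregate per-tool metrics from raw call records and flag tools
# below minimum thresholds. Field names and floors are illustrative.

def tool_metrics(calls):
    n = len(calls)
    lat = sorted(c["latency_ms"] for c in calls)
    return {
        "success_rate": sum(c["ok"] for c in calls) / n,
        "parameter_validity": sum(c["params_valid"] for c in calls) / n,
        "p99_latency_ms": lat[min(n - 1, int(0.99 * n))],  # nearest-rank p99
    }

THRESHOLDS = {"success_rate": 0.95, "parameter_validity": 0.90}

def underperforming(metrics):
    """Metrics below their floor -> candidates for refine/deprecate/tier."""
    return [k for k, floor in THRESHOLDS.items() if metrics[k] < floor]
```

The monthly review then becomes a query over this output rather than a manual hunt.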

Operationalizing Multiple Agent Surfaces

Different teams consume agents differently. Infrastructure must support all patterns.

No-Code Agent Builders

Product and business teams assemble agents through configuration. Select tools, set scoping rules, deploy. Platform handles orchestration.

Tradeoff: Requires rock-solid tool definitions and scoping. Bad definitions ship fast.

Code-First SDKs

Complex workflows (payment, support, supply chain). Teams write code with access to full tool registry, can override tool definitions, implement domain-specific validation.

SDK is thin: gateway client + registry client. Hard problems (governance, security, discovery) solved in platform layer.

Autonomous Development Agents

Autonomous coding agents (built on Claude and similar models) that generate and execute code changes against production repositories.

Non-negotiable requirements:

  • Scope enforcement: What repos can the agent modify?
  • Approval gating: Which changes require human review before merge?
  • Rollback capability: Can the agent revert changes if tests fail?
  • Complete audit trail: Every change, every decision, fully logged
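
Scope enforcement is the simplest of these to mechanize: check the target repo against the agent's allowlist before any change is applied. A sketch with a hypothetical scope list:

```python
# Sketch of repo-scope enforcement for an autonomous dev agent.
# Glob-style patterns via stdlib fnmatch; the scope list is hypothetical.
from fnmatch import fnmatch

AGENT_SCOPE = ["org/payments-*", "org/docs"]

def change_allowed(repo, scope=AGENT_SCOPE):
    """True if the agent may modify this repo."""
    return any(fnmatch(repo, pattern) for pattern in scope)
```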

Operating Metrics

At scale:

  • 5,000+ engineers using agentic tools monthly
  • 10,000+ services available as tools
  • 1,500+ active agents in production
  • 60,000+ agent executions per week

At this scale, infrastructure decisions multiply. Flaky tool discovery affects thousands of agents. Security gaps become incidents affecting millions in transaction volume.

The gateway, registry, schema introspection, and observability are not optional. They're the baseline cost.

Requirements from Engineering Leadership

  1. Architectural thinking: This is a platform, not a feature. Expect months to build, years to operationalize.

  2. Cross-team alignment: Service teams, security, platform engineering, and agent builders must align on schemas, policies, tool definitions. Requires governance.

  3. Baseline metrics: Tool quality, agent reliability, security—define and track from day one. You can't improve what you don't measure.

  4. Staged rollout:

    • Phase 1: Code-first agents (highest control, lowest blast radius)
    • Phase 2: SDKs (broader teams, scoped access)
    • Phase 3: No-code builders (only after phases 1-2 are hardened)

  5. Hardened observability: Every execution traceable. Every tool call logged. Every authorization decision auditable. Non-negotiable.

Log requirement:

  • Agent execution ID
  • Tool calls with parameters
  • Authorization decisions (allow/deny)
  • Response filtering actions
  • Latency and error codes
  • Audit trail retention: 2 years minimum
  • Encryption at rest and in transit
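
One structured record per event covers the fields above. A sketch emitting JSON lines; the field names are illustrative:

```python
# Sketch: one structured audit record per agent event, emitted as a JSON
# line. Field names mirror the log requirements; all are illustrative.
import json
import time
import uuid

def audit_record(execution_id, tool, params, decision, redactions,
                 latency_ms, error=None):
    return json.dumps({
        "execution_id": execution_id,
        "ts": time.time(),
        "tool": tool,
        "params": params,           # post-redaction parameters only
        "authorization": decision,  # "allow" | "deny"
        "redactions": redactions,   # filtering actions applied to the response
        "latency_ms": latency_ms,
        "error": error,
    }, sort_keys=True)

line = audit_record(str(uuid.uuid4()), "ticket_read", {"id": "T-123"},
                    "allow", ["pii"], 42)
```

Retention and encryption are storage-layer concerns; the record format just has to carry enough to reconstruct every decision.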

Alert on:

  • Authorization failures > 0.1%
  • Tool error rate > 5%
  • PII redaction failures (zero tolerance)
  • Unauthorized scope access attempts (zero tolerance)
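
These alert rules reduce to a few comparisons over windowed counters. A sketch; the counter names are illustrative:

```python
# Sketch: evaluate the alert rules above over a window of counters.
# Thresholds mirror the list; counter names are illustrative.

def alerts(counters):
    fired = []
    total = counters["calls"]
    if counters["auth_failures"] / total > 0.001:   # > 0.1%
        fired.append("authorization_failures")
    if counters["tool_errors"] / total > 0.05:      # > 5%
        fired.append("tool_error_rate")
    if counters["pii_redaction_failures"] > 0:      # zero tolerance
        fired.append("pii_redaction_failure")
    if counters["scope_violations"] > 0:            # zero tolerance
        fired.append("unauthorized_scope_access")
    return fired
```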

Unsolved Problems

Hard problems remain unsolved by infrastructure alone:

  • LLM hallucination in tool selection: Model problem. Better descriptions help. Scoping helps. Model will still invent parameters.

  • Planning under uncertainty: Agent orchestration across services with partial failures is complex. Infrastructure observes failures; can't auto-recover.

  • Cost and latency tradeoffs: More executions = more inference = higher costs. More tools available = longer context = higher latency. Resource tradeoffs are fundamental.

  • Evaluation: "Did the agent do the right thing?" is harder to measure than "did the API respond?" Requires domain expertise and statistical rigor.

Conclusion

This is an infrastructure problem, not an AI problem. The agents are straightforward. The platform keeping them safe, observable, and reliable is the hard part.

Execute:

  • Derivative tool definitions from schemas
  • Central gateway for all agent-to-service calls
  • Tool scoping and parameter constraints
  • Measured tool quality (success rate, latency, validity)
  • Multiple consumption patterns with appropriate controls

This is table stakes. Requires engineering discipline and sustained investment. The alternative—agents operating against inconsistent, undocumented services—is catastrophic.

Build the platform first.

