At scale, three structural problems will break your agentic systems. Solving them is not an optional concern; it's an infrastructure requirement.
The Three Core Problems
1. Tool Definition Fragmentation
Service owners publish their own MCP servers. Result: inconsistent implementations, definitions that drift from reality, silent failures.
The structural issue: Tool definitions live outside the deployment pipeline. Developers update endpoints. MCP server definitions lag. Agents call deprecated parameters. The LLM hallucinates a recovery that doesn't exist.
Fix: Make tool definitions derivative artifacts computed from source-of-truth schemas, not independent sources.
2. Security at Machine Velocity
Agents operate at microsecond decision latencies. Humans cannot review every execution. Your agents access production systems, sensitive data, and business-critical workflows.
Requirements:
- Centralized authorization: Every tool call validated against RBAC policies
- PII redaction: Response filtering before agent context window
- Mutation gating: Read-only by default; writes require explicit approval
- Traceability: Full audit trail of every agent execution
Standard API gateways are insufficient: they secure individual requests, but you're securing an entire execution plan generated ad hoc by an LLM.
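To make this concrete, here is a minimal sketch of plan-level enforcement. The `Policy` and `PlanStep` shapes are hypothetical, not a real gateway API; the point is that every step is checked against policy before anything executes.

```python
# Sketch: validating an LLM-generated execution plan step by step.
# All names (Policy, PlanStep) are illustrative, not a real gateway API.
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    tool: str
    params: dict
    mutates: bool  # does this step write to a downstream system?

@dataclass
class Policy:
    allowed_tools: set
    approved_mutations: set = field(default_factory=set)

def validate_plan(steps: list, policy: Policy) -> list:
    """Collect every violation; a plan with any violation never runs."""
    violations = []
    for i, step in enumerate(steps):
        if step.tool not in policy.allowed_tools:
            violations.append(f"step {i}: tool '{step.tool}' not in allowlist")
        if step.mutates and step.tool not in policy.approved_mutations:
            violations.append(f"step {i}: mutation via '{step.tool}' requires approval")
    return violations

policy = Policy(allowed_tools={"kb_search", "zendesk_update_ticket"},
                approved_mutations={"zendesk_update_ticket"})
plan = [PlanStep("kb_search", {"query": "refunds"}, mutates=False),
        PlanStep("payments_refund", {"amount": 100}, mutates=True)]
print(validate_plan(plan, policy))
# step 1 fails both checks: not allowlisted, and an unapproved mutation
```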
3. Discovery and Quality
Thousands of available tools. The agent must solve: which tools exist, what do they do, which combination solves the task?
Bad descriptions → LLM hallucination. The agent sees "PaymentService.process()" and invokes it with non-existent parameters. The model can't distinguish documented APIs from hallucinated ones.
Requirements:
- Schema-derived descriptions: Generated from actual service contracts, not human prose
- Workflow-scoped availability: Not all agents need all tools
- Parameter constraints: Prevent LLM parameter invention
- Quality metrics: Success rate, parameter validity, latency, user satisfaction
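A sketch of the parameter-constraint requirement using a strict JSON Schema, validated with the `jsonschema` Python package; the tool and its fields are hypothetical:

```python
# Sketch: rejecting invented parameters with a strict JSON Schema.
# Requires the 'jsonschema' package; tool name and fields are hypothetical.
from jsonschema import validate, ValidationError

SEARCH_FILES_SCHEMA = {
    "type": "object",
    "properties": {
        "query": {"type": "string", "minLength": 1},
        "max_files": {"type": "integer", "minimum": 1, "maximum": 100},
    },
    "required": ["query"],
    "additionalProperties": False,  # hallucinated parameters fail validation
}

def check_call(params: dict) -> bool:
    try:
        validate(instance=params, schema=SEARCH_FILES_SCHEMA)
        return True
    except ValidationError as err:
        print(f"rejected: {err.message}")
        return False

check_call({"query": "retry logic", "max_files": 10})    # True
check_call({"query": "retry logic", "recursive": True})  # False: invented parameter
```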
The Infrastructure Response
Ad-hoc tooling fails. You need a unified control plane.
Protocol-Driven Tool Generation
Stop requiring service owners to write MCP servers.
Instead:
- Introspect service schemas (Protobuf, Thrift, OpenAPI)
- Auto-generate MCP tool definitions from schema
- Generate natural-language descriptions via an LLM tuned for agent understanding
- Version and distribute from central registry
Tool definitions stay synchronized with service contracts because they're computed from source of truth.
Constraint: Services must have formalized definitions. If documentation lives in Slack, this fails.
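A toy sketch of the introspection step, assuming an OpenAPI document already parsed into a dict; a production pipeline would also walk Protobuf and Thrift descriptors:

```python
# Sketch: deriving an MCP-style tool definition from an OpenAPI operation.
# Toy example; output shape is illustrative, not the MCP wire format.

def tool_from_openapi(path: str, method: str, op: dict) -> dict:
    """Compute a tool definition as a derivative artifact of the service schema."""
    body = (op.get("requestBody", {})
              .get("content", {})
              .get("application/json", {})
              .get("schema", {"type": "object"}))
    return {
        "name": op["operationId"],
        "description": op.get("summary", f"{method.upper()} {path}"),
        "inputSchema": body,  # the agent-facing schema IS the service schema
    }

op = {
    "operationId": "create_refund",
    "summary": "Create a refund for a captured payment.",
    "requestBody": {"content": {"application/json": {"schema": {
        "type": "object",
        "properties": {"payment_id": {"type": "string"},
                       "amount_cents": {"type": "integer", "minimum": 1}},
        "required": ["payment_id", "amount_cents"],
    }}}},
}
print(tool_from_openapi("/refunds", "post", op)["name"])  # create_refund
```

When the service contract changes, the tool definition changes with it on the next pipeline run; there is no second source of truth to drift.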
Gateway-Based Control Plane
A single gateway service becomes the enforcement point:
- All agent-to-service calls route through gateway
- Authentication and authorization centralized
- Response filtering (PII redaction) before agent sees data
- Complete metrics and observability by default
- Rate limiting and circuit breaking protect downstream systems
Gateway is config-driven. Policies live in version control, deployed like code.
Gateway policy requirements:
- Tool allowlist per agent
- Rate limits (calls/minute, calls/hour)
- Parameter constraints (max_files, max_size, etc.)
- Data redaction rules (api_keys, secrets, pii)
- Approval gating for write operations
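A sketch of what one agent's policy might look like, expressed here as a Python literal (in practice it would live as YAML or JSON in the policy repo); field names mirror the list above and are illustrative:

```python
# Sketch: a per-agent gateway policy, version-controlled like code.
# Field names and tool names are illustrative.
SUPPORT_AGENT_POLICY = {
    "agent": "support-triage-agent",
    "tool_allowlist": ["zendesk_get_ticket", "kb_search", "github_create_issue"],
    "rate_limits": {"calls_per_minute": 30, "calls_per_hour": 600},
    "parameter_constraints": {
        "kb_search": {"max_files": 20, "max_size_mb": 5},
    },
    "redaction_rules": ["api_keys", "secrets", "pii"],
    "approval_required_for": ["github_create_issue"],  # gate all writes
}
```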
Tool Scoping and Refinement
Not all agents need all tools.
Support:
- Workflow-specific tool sets: Agent builders select allowed tools explicitly
- Parameter overrides: Constrain parameters for known workflows to prevent hallucination
- Derived tools: Specialized tool definitions layered on base service tools
Automation gets you to 80%. Explicit scoping and overrides get you to production.
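A sketch of a derived tool: a base definition plus a workflow-specific override that pins a parameter server-side so the LLM never chooses it. All names are hypothetical:

```python
# Sketch: deriving a workflow-scoped tool from a base tool definition.
# Pinned parameters are fixed server-side and removed from the agent-facing schema.
import copy

def derive_tool(base: dict, name: str, pinned: dict, description: str) -> dict:
    tool = copy.deepcopy(base)
    tool["name"] = name
    tool["description"] = description
    for param, value in pinned.items():
        tool["inputSchema"]["properties"].pop(param, None)
        tool.setdefault("pinned_params", {})[param] = value
    return tool

base_query = {
    "name": "orders_query",
    "description": "Run a query against the orders service.",
    "inputSchema": {"type": "object",
                    "properties": {"query": {"type": "string"},
                                   "environment": {"type": "string"}}},
}
readonly = derive_tool(base_query, "orders_query_prod_readonly",
                       pinned={"environment": "prod-replica"},
                       description="Read-only order lookups against the prod replica.")
```

The derived tool ships with a narrower schema, so scoping happens in the definition itself, not in prompt instructions.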
Quality Metrics
Tools have SLAs. Track:
- Success rate: % of calls that work vs. fail
- Parameter validity: % of calls with valid parameters
- Latency distribution: p50, p99, p99.9
- User satisfaction: Correctness of results from the agent's perspective
Actions for underperforming tools:
- Refine (better descriptions, tighter constraints)
- Deprecate (remove from agent access)
- Tier by SLA (restrict to high-priority agents)
SLA tracking:

```yaml
tool_metrics:
  github_create_pull_request:
    success_rate: 0.98
    parameter_validity: 0.95
    p99_latency_ms: 1200
    user_satisfaction: 4.2/5.0
  payment_process:
    success_rate: 0.99
    parameter_validity: 0.98
    p99_latency_ms: 2500
    user_satisfaction: 4.7/5.0
```
Establish minimum thresholds. Publish to an internal SLA dashboard. Review and act on underperformers monthly.
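A sketch of that monthly sweep, reusing the metrics above; the thresholds are examples, not recommendations:

```python
# Sketch: flagging tools that breach minimum SLA thresholds.
# Threshold values are illustrative; tune them per tool tier.
THRESHOLDS = {"success_rate": 0.97, "parameter_validity": 0.95, "p99_latency_ms": 2000}

tool_metrics = {
    "github_create_pull_request": {"success_rate": 0.98, "parameter_validity": 0.95,
                                   "p99_latency_ms": 1200},
    "payment_process": {"success_rate": 0.99, "parameter_validity": 0.98,
                        "p99_latency_ms": 2500},
}

for tool, m in tool_metrics.items():
    # Rates must stay above their floor; latency must stay below its ceiling.
    breaches = [k for k, limit in THRESHOLDS.items()
                if (m[k] < limit if k != "p99_latency_ms" else m[k] > limit)]
    if breaches:
        print(f"{tool}: breaches {breaches} -> refine, deprecate, or tier")
# payment_process: breaches ['p99_latency_ms'] -> refine, deprecate, or tier
```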
Operationalizing Multiple Agent Surfaces
Different teams consume agents differently. Infrastructure must support all patterns.
No-Code Agent Builders
Product and business teams assemble agents through configuration. Select tools, set scoping rules, deploy. Platform handles orchestration.
Tradeoff: Requires rock-solid tool definitions and scoping. Bad definitions ship fast.
Code-First SDKs
Complex workflows (payments, support, supply chain). Teams write code with access to the full tool registry; they can override tool definitions and implement domain-specific validation.
SDK is thin: gateway client + registry client. Hard problems (governance, security, discovery) solved in platform layer.
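A sketch of how thin that SDK surface can be. The class and endpoints are hypothetical; the only real dependency is an HTTP client, because everything hard happens behind the gateway:

```python
# Sketch of a "thin SDK": a wrapper over the gateway and registry.
# Endpoints and class names are hypothetical; transport is an implementation detail.
import requests

class AgentSDK:
    def __init__(self, gateway_url: str, agent_id: str, token: str):
        self.gateway_url, self.agent_id = gateway_url, agent_id
        self.session = requests.Session()
        self.session.headers["Authorization"] = f"Bearer {token}"

    def list_tools(self, workflow: str) -> list:
        """Registry lookup: only tools scoped to this workflow come back."""
        resp = self.session.get(f"{self.gateway_url}/registry/tools",
                                params={"workflow": workflow, "agent": self.agent_id})
        resp.raise_for_status()
        return resp.json()

    def call_tool(self, tool: str, params: dict) -> dict:
        """Every call routes through the gateway, where policy is enforced."""
        resp = self.session.post(f"{self.gateway_url}/tools/{tool}:call",
                                 json={"agent": self.agent_id, "params": params})
        resp.raise_for_status()
        return resp.json()
```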
Autonomous Development Agents
Production ML systems (Claude and similar) that generate and execute code changes.
Non-negotiable requirements:
- Scope enforcement: What repos can the agent modify?
- Approval gating: Which changes require human review before merge?
- Rollback capability: Can the agent revert changes if tests fail?
- Complete audit trail: Every change, every decision, fully logged
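A sketch of the scope-enforcement requirement; patterns and repo names are illustrative, and a real system would enforce this at the gateway rather than in agent code:

```python
# Sketch: repo-scope enforcement for an autonomous coding agent.
# Glob patterns and repo names are illustrative.
from fnmatch import fnmatch

AGENT_SCOPE = {
    "writable_repos": ["platform/tooling-*"],        # repos the agent may modify
    "approval_required": ["platform/tooling-core"],  # merges here need human review
}

def can_modify(repo: str) -> bool:
    return any(fnmatch(repo, pat) for pat in AGENT_SCOPE["writable_repos"])

def needs_approval(repo: str) -> bool:
    return any(fnmatch(repo, pat) for pat in AGENT_SCOPE["approval_required"])

assert can_modify("platform/tooling-ci")
assert not can_modify("payments/ledger")        # out of scope: hard deny
assert needs_approval("platform/tooling-core")  # in scope, but gated on review
```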
Operating Metrics
At scale:
- 5,000+ engineers using agentic tools monthly
- 10,000+ services available as tools
- 1,500+ active agents in production
- 60,000+ agent executions per week
At this scale, infrastructure decisions multiply. Flaky tool discovery affects thousands of agents. Security gaps become incidents affecting millions in transaction volume.
The gateway, registry, schema introspection, and observability are not optional. They're the baseline cost.
Requirements from Engineering Leadership
- Architectural thinking: This is a platform, not a feature. Expect months to build, years to operationalize.
- Cross-team alignment: Service teams, security, platform engineering, and agent builders must align on schemas, policies, and tool definitions. This requires governance.
- Baseline metrics: Tool quality, agent reliability, security. Define and track these from day one; you can't improve what you don't measure.
- Staged rollout:
  - Phase 1: Code-first agents (highest control, lowest blast radius)
  - Phase 2: SDKs (broader teams, scoped access)
  - Phase 3: No-code builders (only after phases 1-2 are hardened)
- Hardened observability: Every execution traceable. Every tool call logged. Every authorization decision auditable. Non-negotiable.
Logging requirements:
- Agent execution ID
- Tool calls with parameters
- Authorization decisions (allow/deny)
- Response filtering actions
- Latency and error codes
- Audit trail retention: 2 years minimum
- Encryption at rest and in transit
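A sketch of a per-call audit record satisfying the fields above; the schema itself is illustrative:

```python
# Sketch: one audit record per tool call, structured for long-term retention.
# Field names mirror the logging requirements; the schema is illustrative.
from dataclasses import dataclass, asdict
from typing import Optional
import json, uuid

@dataclass(frozen=True)
class ToolCallAuditRecord:
    execution_id: str        # ties the call to a full agent execution trace
    tool: str
    params: dict             # redacted before persistence, per the rules above
    authz_decision: str      # "allow" or "deny"
    redactions_applied: list
    latency_ms: int
    error_code: Optional[str]

record = ToolCallAuditRecord(
    execution_id=str(uuid.uuid4()), tool="kb_search",
    params={"query": "refund policy"}, authz_decision="allow",
    redactions_applied=["pii"], latency_ms=142, error_code=None,
)
print(json.dumps(asdict(record)))  # ship to encrypted, append-only storage
```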
Alert on:
- Authorization failures > 0.1%
- Tool error rate > 5%
- PII redaction failures (zero tolerance)
- Unauthorized scope access attempts (zero tolerance)
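A sketch of evaluating those alert conditions over a metrics window; the window source is hypothetical:

```python
# Sketch: checking the alert thresholds above against one metrics window.
# Thresholds come straight from the list; the metrics source is hypothetical.
def check_alerts(window: dict) -> list:
    alerts = []
    if window["authz_failures"] / window["total_calls"] > 0.001:
        alerts.append("authorization failure rate > 0.1%")
    if window["tool_errors"] / window["total_calls"] > 0.05:
        alerts.append("tool error rate > 5%")
    if window["pii_redaction_failures"] > 0:       # zero tolerance
        alerts.append("PII redaction failure")
    if window["unauthorized_scope_attempts"] > 0:  # zero tolerance
        alerts.append("unauthorized scope access attempt")
    return alerts

print(check_alerts({"total_calls": 10_000, "authz_failures": 15,
                    "tool_errors": 120, "pii_redaction_failures": 0,
                    "unauthorized_scope_attempts": 0}))
# ['authorization failure rate > 0.1%']  (15/10,000 = 0.15%)
```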
Unsolved Problems
Hard problems remain that infrastructure alone can't solve:
- LLM hallucination in tool selection: A model problem. Better descriptions help; scoping helps; the model will still invent parameters.
- Planning under uncertainty: Agent orchestration across services with partial failures is complex. Infrastructure observes failures; it can't auto-recover.
- Cost and latency tradeoffs: More executions mean more inference and higher costs. More available tools mean longer context and higher latency. These resource tradeoffs are fundamental.
- Evaluation: "Did the agent do the right thing?" is harder to measure than "did the API respond?" It requires domain expertise and statistical rigor.
Conclusion
This is an infrastructure problem, not an AI problem. The agents are straightforward. The platform keeping them safe, observable, and reliable is the hard part.
Execute:
- Derivative tool definitions from schemas
- Central gateway for all agent-to-service calls
- Tool scoping and parameter constraints
- Measured tool quality (success rate, latency, validity)
- Multiple consumption patterns with appropriate controls
This is table stakes, and it requires engineering discipline and sustained investment. The alternative, agents operating against inconsistent and undocumented services, is catastrophic.
Build the platform first.
References
Agentic AI & Tool Use
- Model Context Protocol — Foundation for tool definition standards
- Building Effective Agents — Architecture patterns and best practices (Anthropic)
- OpenAI Function Calling — Reference implementation for constrained tool invocation
- On the Dangers of Stochastic Parrots — Foundational work on model limitations
Platform Engineering & Control Planes
- Kong API Gateway Architecture — Production gateway patterns
- OPA (Open Policy Agent) — Declarative policy enforcement
- Protocol Buffers Best Practices — Formalized service contracts
- Gartner Platform Engineering Research — Enterprise platform patterns
Security & Governance at Scale
- Google BeyondCorp — Authorization model for distributed systems
- OWASP Data Protection — Sensitive data handling patterns
- NIST Attribute-Based Access Control — Enterprise RBAC/ABAC reference
- OpenTelemetry Observability — Complete observability framework
Tool Quality & Observability
- Google SRE Book: Monitoring Distributed Systems — Measuring reliability
- Graph Databases for Service Mesh — Finding and cataloging tools at scale
- The Tail at Scale (Google Research) — Understanding p99 latency in distributed systems
- Datadog Observability 101 — Industry standard observability patterns
Staged Deployment & Risk Management
- Stripe Scaling Through Phases — Staged rollout strategy (video)
- Flagger Canary Deployment — Progressive delivery patterns
- Emma Tosch: Formal Methods for Systems — Rigorous change analysis (video)
Unsolved Research Problems
- ALFWorld: Autonomous Agents in Simulated Worlds — Benchmark for agent orchestration
- Retrieval-Augmented Generation (RAG) — Grounding agents to external facts
- LLM Inference Optimization — Reducing inference costs at scale
- HELM: Holistic Evaluation of Language Models — Rigorous agent evaluation framework