
MCP vs Agent-to-Agent: keeping your agents from drowning in tools

MCP connects agents to countless tools, but chaos follows. Our A2A layer fixes that: discovery-first orchestration with exploration, planning, and disciplined tool use. Fewer, task-shaped tools → smarter agents, leaner context, real autonomy in HR ops.

The problem we keep seeing

MCP makes it easy to connect agents to many capabilities. The failure mode is equally easy: “one click, 100 tools.” Each tool ships with instructions and schemas. Add a handful and the model starts second-guessing itself, or you end up telling it explicitly, “use payroll.adjustment.create from the HR MCP.” That’s not the autonomy we were promised.

This isn’t just prompt engineering. It’s systems design.

Discovery-first, not exposure-first

Our preferred architecture wraps (not replaces) MCP with an Agent-to-Agent (A2A) layer:

  1. Intent in → The user states the goal (e.g., “Fix duplicate deductions for Sarah Chen and notify her manager.”).
  2. Exploration Agent → Queries a Tool Registry (an “affordance index” with embeddings over tool docs, examples, and reliability stats) and proposes a minimal plan.
  3. Planner/Executor split → Planner chooses tools and sequence; Executor runs calls with retry/fallback and policy checks.
  4. Tight context budgets → Only the chosen tools (and their relevant specs) enter the model’s context.
  5. Self-healing & caching → Cache world models (UI screenshots/DOM maps, schema summaries) and recover with vision when selectors break.

MCP remains the transport. The A2A layer supplies the discipline.
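
To make the Exploration Agent's registry step concrete, here is a minimal TypeScript sketch. It assumes tool embeddings are precomputed, and names like ToolSpec and selectTools are illustrative rather than a published API: rank tools by similarity to the intent, weight by past reliability, and stop once the spec token budget is spent.

// Minimal Tool Registry lookup sketch (illustrative names, precomputed embeddings).
interface ToolSpec {
  name: string;              // e.g. "payroll.resolve_duplicate_deduction"
  description: string;       // short, task-shaped description
  embedding: number[];       // embedding of docs + examples
  specTokens: number;        // rough token cost of loading this spec
  reliability: number;       // 0..1 success rate from past runs
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Pick the few most relevant tools, then stop when the spec budget is spent.
function selectTools(intentEmbedding: number[], registry: ToolSpec[], tokenBudget: number): ToolSpec[] {
  const ranked = registry
    .map(t => ({ tool: t, score: cosine(intentEmbedding, t.embedding) * t.reliability }))
    .sort((a, b) => b.score - a.score);

  const chosen: ToolSpec[] = [];
  let spent = 0;
  for (const { tool } of ranked) {
    if (spent + tool.specTokens > tokenBudget) break;
    chosen.push(tool);
    spent += tool.specTokens;
  }
  return chosen;
}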

What this looks like in an HR services company

Exposure-first (what breaks):

  • ATS, Payroll, LMS, and CRM MCP servers are all attached. The agent sees 120+ tools.
  • A simple request like “Schedule a panel interview, attach last panel’s notes, and pre-book a room” turns into tool roulette.

Discovery-first (what works):

  • The Exploration Agent retrieves 2–3 task-shaped tools:
    • interviews.schedule_event (finds availability, books room, attaches notes)
    • employees.get_context (recent feedback, manager, risk flags)
    • messaging.send_update (candidate + panel summary)
  • Only those three specs are loaded. The planner executes, validates, and notifies. Done.

Hard-won principles we apply (inspired by recent research)

1) Choose fewer, better tools

Don’t mirror every API endpoint. Build task-shaped tools that match real workflows.

  • Instead of list_employees → employees.search(name|email|team, response_format=CONCISE)
  • Instead of list_timesheets + list_deductions + create_adjustment → payroll.resolve_duplicate_deduction(employee, period) that does the chained work inside the tool.
  • Instead of list_events + create_event → interviews.schedule_event(candidates, panel, constraints).

Why? Agents have finite context. Consolidated tools shift complexity out of the prompt and into deterministic code.
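
As a sketch of what “chained work inside the tool” can look like, assuming a hypothetical payroll client with listDeductions and createAdjustment methods (all names here are illustrative):

// Task-shaped tool sketch: the chaining lives in deterministic code, not in the prompt.
interface DuplicateFixResult {
  employee: string;
  period: string;               // "YYYY-MM"
  duplicatesFound: number;
  adjustmentsCreated: string[]; // human-readable summaries
}

async function resolveDuplicateDeduction(
  employee: string,
  period: string,
  payroll: {
    listDeductions(employee: string, period: string): Promise<{ id: string; code: string; amount: number }[]>;
    createAdjustment(employee: string, period: string, deductionId: string): Promise<string>;
  },
): Promise<DuplicateFixResult> {
  const deductions = await payroll.listDeductions(employee, period);

  // Group by deduction code; everything beyond the first occurrence is a duplicate.
  const seen = new Set<string>();
  const duplicates = deductions.filter(d => {
    if (seen.has(d.code)) return true;
    seen.add(d.code);
    return false;
  });

  const adjustmentsCreated: string[] = [];
  for (const dup of duplicates) {
    adjustmentsCreated.push(await payroll.createAdjustment(employee, period, dup.id));
  }

  return { employee, period, duplicatesFound: duplicates.length, adjustmentsCreated };
}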

2) Namespace for clarity

Prefix by domain and resource:

payroll.adjustment.create, payroll.anomaly.search, ats.candidates.search, ats.interviews.schedule.

Yes, namespacing sounds cosmetic; in practice it reduces wrong-tool calls and makes traces interpretable.

3) Return meaningful context

Prefer human-legible fields (name, role, team, risk_notes) over cryptic IDs.

If IDs are required downstream, support a response_format switch:

enum ResponseFormat { CONCISE = "concise", DETAILED = "detailed" }

Default to CONCISE; use DETAILED only when chaining calls needs IDs like employee_id or thread_ts.
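
A minimal sketch of how a tool handler might honour that switch; the Employee shape is illustrative:

enum ResponseFormat { CONCISE = "concise", DETAILED = "detailed" }

interface Employee {
  employee_id: string;
  name: string;
  role: string;
  team: string;
  risk_notes: string[];
}

function formatEmployee(e: Employee, format: ResponseFormat = ResponseFormat.CONCISE) {
  if (format === ResponseFormat.CONCISE) {
    // Human-legible fields only; no IDs the model has to carry around.
    return { name: e.name, role: e.role, team: e.team, risk_notes: e.risk_notes };
  }
  // DETAILED adds the identifiers needed when chaining into other tools.
  return { ...e };
}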

4) Design for token efficiency

Implement filtering, pagination, date ranges, and sensible truncation.

When truncating, return a helpful footer like:

“Truncated to 50 results. Re-query with page=2 or add team='Sales'.”
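
A small sketch of pagination with an actionable truncation footer; the page size and footer wording are choices, not requirements:

// Return a capped page of results plus a footer that tells the agent what to do next.
function paginate<T>(items: T[], page: number, pageSize = 50): { results: T[]; footer?: string } {
  const start = (page - 1) * pageSize;
  const results = items.slice(start, start + pageSize);
  if (start + pageSize < items.length) {
    return {
      results,
      footer: `Truncated to ${pageSize} results. Re-query with page=${page + 1} or add a narrower filter like team='Sales'.`,
    };
  }
  return { results };
}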

5) Prompt-engineer tool specs like you would onboard a new hire

Be explicit about inputs, preconditions, edge cases, and good examples.

Don’t rely on tribal knowledge (“you’re supposed to pass period as YYYY-MM”). Say it.
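
For example, a spec that spells the format out instead of assuming it, written here as a generic JSON-Schema-style definition (the exact shape your MCP SDK expects may differ):

// A tool spec written the way you'd onboard a new hire: explicit formats,
// preconditions, and a worked example in the description.
const payrollAdjustmentCreate = {
  name: "payroll.adjustment.create",
  description: [
    "Create a payroll adjustment for one employee in one period.",
    "Preconditions: the employee must be active and the period must still be open for corrections.",
    "Example: { employee: 'sarah.chen@acme.com', period: '2025-05', reason: 'duplicate deduction' }",
  ].join("\n"),
  inputSchema: {
    type: "object",
    properties: {
      employee: { type: "string", description: "Work email, not display name." },
      period: { type: "string", description: "Payroll period as YYYY-MM, e.g. '2025-05'." },
      reason: { type: "string", description: "Short human-readable justification, shown to payroll reviewers." },
    },
    required: ["employee", "period", "reason"],
  },
};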

Evaluate like you mean it (and let agents help you improve)

Prototype quickly, then build a realistic evaluation set that mirrors HR operations. Strong tasks are multi-step and verifiable:

  • “Schedule a Sales panel next week for Jane Doe, attach the last panel’s notes, and reserve a room near the Brussels office.”
  • “Customer ID 9182 (a partner company) reports duplicate payroll deductions for May. Find similar cases and fix them if policy allows; otherwise escalate.”
  • “Sarah Chen requested cancellation of benefits. Determine why, propose a retention offer, and flag any compliance risks.”

Avoid toy prompts (“Schedule a meeting”); they don’t surface failure modes.

Track more than accuracy: tool-call counts, runtime, token usage, and error rates. These metrics reveal when tools should be merged, paginated, or better documented.
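
A minimal sketch of an evaluation loop that records those signals per task; runTask stands in for however you invoke the agent on a task:

interface EvalMetrics {
  task: string;
  passed: boolean;
  toolCalls: number;
  tokensUsed: number;
  runtimeMs: number;
  errors: string[];
}

async function evaluate(
  tasks: string[],
  runTask: (task: string) => Promise<Omit<EvalMetrics, "task" | "runtimeMs">>,
): Promise<EvalMetrics[]> {
  const results: EvalMetrics[] = [];
  for (const task of tasks) {
    const started = Date.now();
    const outcome = await runTask(task);
    // High tool-call counts or token usage on passing tasks usually point at
    // tools that should be merged, paginated, or better documented.
    results.push({ task, ...outcome, runtimeMs: Date.now() - started });
  }
  return results;
}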

Then let the agent critique its own runs. Have it read transcripts, spot confusing parameter names, and propose spec edits. Rerun the evaluation on a held-out set to avoid overfitting.

A minimal spec that passes the sniff test

Good (interviews.schedule_event)

  • Inputs: candidate_id, panel_user_ids[], duration_min, week_of, location_pref, attach_last_panel_notes: boolean
  • Behavior: finds overlapping availability, books a room, attaches notes, posts a summary in the hiring channel
  • Errors: explains what to change (e.g., “No overlapping 60-min slot. Try duration_min=45 or week_of=2025-11-10.”)
  • Returns (CONCISE): human-readable summary + meeting link
  • Returns (DETAILED): same + event_id, room_id, channel_id

Not great (list_events)

  • Streams raw calendars, burns tokens, leaves planning to the agent
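
To make the good spec’s “explains what to change” behaviour concrete, here is a tiny sketch of an actionable error result (names and suggested values are illustrative):

type ScheduleResult =
  | { ok: true; summary: string; meeting_link: string }
  | { ok: false; error: string };

// Suggest a concrete next move instead of returning a bare failure code.
function noOverlapError(durationMin: number, nextWeekOf: string): ScheduleResult {
  return {
    ok: false,
    error: `No overlapping ${durationMin}-min slot. Try duration_min=${Math.max(durationMin - 15, 15)} or week_of=${nextWeekOf}.`,
  };
}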

Where to start: QA automation for HR portals

QA is a practical proving ground:

  • Vision-first exploration caches UI state (screenshots/DOM) on first run
  • Hybrid selectors: use fast deterministic locators; fall back to vision when the UI shifts
  • Self-healing: if a step fails, re-infer and retry before escalating
  • Output: agent-written test cases that your runner can execute daily
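
A minimal sketch of the hybrid-selector fallback, where findBySelector and findByVision stand in for your runner’s primitives:

interface Locator {
  selector: string;       // cached deterministic locator from the first run
  screenshotHint: string; // cached visual description of the element
}

async function locate(
  page: { findBySelector(sel: string): Promise<string | null> },
  vision: { findByVision(hint: string): Promise<string | null> },
  target: Locator,
): Promise<string> {
  // Deterministic path: cheap and stable while the DOM matches the cache.
  const bySelector = await page.findBySelector(target.selector);
  if (bySelector) return bySelector;

  // Self-healing path: re-infer the element from the cached visual hint.
  const byVision = await vision.findByVision(target.screenshotHint);
  if (byVision) return byVision;

  throw new Error(`Could not locate element for selector '${target.selector}'; escalate to a human.`);
}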

Once stable, reuse the same A2A pattern for onboarding flows, payroll anomaly resolution, and compliance reminders.

A humble roadmap we recommend

  1. Inventory & namespace your existing MCP tools. Deprecate overlaps.
  2. Build a Tool Registry (descriptions, examples, reliability, embeddings).
  3. Ship an Exploration Agent + Planner/Executor with strict context budgets.
  4. Convert 3–5 critical flows into task-shaped tools.
  5. Stand up a real evaluation pack (20–50 tasks) and track accuracy, tokens, runtime, errors.
  6. Iterate with the agent in the loop: let it propose spec changes, then re-evaluate on a held-out set.
  7. Roll into QA first, then HR operations.

Closing

We’re bullish on MCP as a plumbing standard and realistic about its limits as a strategy. The path forward is a small layer of discipline: discovery-first orchestration, fewer and clearer tools, and a culture of evaluation. In HR, where workflows are human, policy-heavy, and cross-system, this shift pays for itself quickly.

If you want to trial this in your HR stack, we’re happy to start small, publish results, and let the data tell us what to build next.


Stef Nimmegeers, Co-Founder Nimble