Wrong tool selection is when an agent calls the wrong tool for the task, often because tool descriptions overlap or are ambiguous.
A practical guide to wrong tool selection: what it is, what causes it, how to stop it before it causes harm, and how to catch it when prevention fails.
What it actually looks like in production
- Agent called searchlegal when searchinternal would have answered the question
- Agent called writeemail when reademail was intended
- Agent chained multiple tools when a single one would have sufficed
Why it happens
- Overlapping tool descriptions
- Ambiguous tool naming
- Insufficient examples in descriptions
- Tool ordering biases (the model may favor a tool because of where it appears in the list)
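Overlapping descriptions, the first cause above, can be surfaced before deployment by comparing descriptions pairwise. The following is a minimal sketch under assumptions, not a Prefactor feature: the description texts are hypothetical (only the tool names come from the examples earlier), and word-level Jaccard similarity with a 0.5 threshold is an assumed heuristic that would need tuning. Embedding similarity is a common alternative.

```python
import re
from itertools import combinations

# Hypothetical descriptions; only the tool names come from the examples above.
TOOLS = {
    "searchlegal": "Search documents for information relevant to the question",
    "searchinternal": "Search internal documents for information relevant to the question",
    "writeemail": "Compose and send a new email message to a recipient",
}

def tokens(text: str) -> set[str]:
    """Lowercased set of words in a description."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Fraction of distinct words that the two descriptions share."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def overlapping_pairs(tools: dict[str, str], threshold: float = 0.5):
    """Return (name_a, name_b, score) for description pairs at or above threshold."""
    flagged = []
    for (name_a, desc_a), (name_b, desc_b) in combinations(tools.items(), 2):
        score = jaccard(desc_a, desc_b)
        if score >= threshold:
            flagged.append((name_a, name_b, round(score, 2)))
    return flagged

if __name__ == "__main__":
    for pair in overlapping_pairs(TOOLS):
        print(pair)
```

With these sample descriptions, only the searchlegal / searchinternal pair is flagged, because the two texts differ by a single word. A flagged pair is a prompt to rewrite the descriptions, not proof that the agent will confuse them.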
How to prevent it (vendor-neutral)
1. Write tool descriptions for the LLM that clearly differentiate each tool from its neighbors
2. Add few-shot examples to tool descriptions
3. Evaluate tool selection on adversarial inputs
4. Restrict available tools per task type
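Items 1, 2, and 4 above can be sketched together. This is an illustration under stated assumptions rather than a standard schema: `ToolSpec`, its field names, the `REGISTRY` and `TOOLS_BY_TASK` tables, the task type names, and the example requests are all hypothetical. The tool names are borrowed from the examples earlier in this guide.

```python
from dataclasses import dataclass, field

@dataclass
class ToolSpec:
    """Hypothetical tool definition; field names are assumptions, not a standard."""
    name: str
    purpose: str
    use_when: str
    do_not_use_when: str
    examples: list[str] = field(default_factory=list)

    def description(self) -> str:
        """Render a description that states the boundary with neighboring tools."""
        lines = [
            self.purpose,
            f"Use when: {self.use_when}",
            f"Do not use when: {self.do_not_use_when}",
        ]
        # Few-shot examples appended to the description (item 2).
        lines += [f"Example request: {ex}" for ex in self.examples]
        return "\n".join(lines)

REGISTRY = {
    spec.name: spec
    for spec in [
        ToolSpec(
            name="reademail",
            purpose="Read messages already in the user's mailbox.",
            use_when="the user asks what an email says or who sent it.",
            do_not_use_when="the user wants to draft or send a message; use writeemail.",
            examples=["What did Dana say about the invoice?"],
        ),
        ToolSpec(
            name="writeemail",
            purpose="Draft and send a new email.",
            use_when="the user asks to reply to, draft, or send a message.",
            do_not_use_when="the user only wants to look something up; use reademail.",
            examples=["Reply to Dana that the invoice is approved."],
        ),
    ]
}

# Hypothetical allowlist (item 4): each task type sees only the tools it needs.
TOOLS_BY_TASK = {
    "inbox_triage": ["reademail"],
    "outreach": ["reademail", "writeemail"],
}

def tools_for(task_type: str) -> list[ToolSpec]:
    """Return the specs allowed for a task type; unknown types get none (fail closed)."""
    return [REGISTRY[name] for name in TOOLS_BY_TASK.get(task_type, [])]
```

Calling `tools_for("inbox_triage")` returns only the reademail spec, so an agent handling that task cannot pick writeemail by mistake. Failing closed on unknown task types is a design choice here. A deployment could instead fall back to a default set.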
How Prefactor helps detect and prevent it
Prefactor sits at the agent runtime and contributes the following:
- Runtime guardrails that flag or block matching patterns before they land
- Continuous eval suites that catch quality regressions on every change
- Tamper-evident logs of every incident and response action
- Per-agent anomaly alerts on the signals listed below
Detection — what to monitor
- Tool-choice eval scores
- Argument shape mismatches
- User feedback on wrong-action incidents
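Of these signals, argument shape mismatches are the most mechanical to check. A call whose arguments do not fit the chosen tool often means the model was aiming at a different tool. The sketch below assumes a hypothetical `EXPECTED_ARGS` table with made-up parameter names. A real deployment would more likely validate against each tool's JSON Schema.

```python
# Hypothetical expected parameters per tool: param name -> (type, required).
EXPECTED_ARGS = {
    "reademail": {"message_id": (str, True)},
    "writeemail": {"to": (str, True), "subject": (str, True), "body": (str, True)},
}

def shape_mismatches(tool_name: str, args: dict) -> list[str]:
    """Return human-readable problems with a tool call's arguments (empty if none)."""
    spec = EXPECTED_ARGS.get(tool_name)
    if spec is None:
        return [f"unknown tool: {tool_name}"]
    problems = []
    # Check every expected parameter for presence and type.
    for param, (expected_type, required) in spec.items():
        if param not in args:
            if required:
                problems.append(f"missing argument: {param}")
        elif not isinstance(args[param], expected_type):
            problems.append(f"wrong type for {param}: {type(args[param]).__name__}")
    # Arguments the tool does not accept suggest a different tool was intended.
    for param in args:
        if param not in spec:
            problems.append(f"unexpected argument: {param}")
    return problems
```

For example, `shape_mismatches("reademail", {"to": "dana@example.com", "body": "Approved"})` reports a missing `message_id` and two unexpected arguments, which is the pattern a reademail call carrying writeemail arguments would produce.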
Response — what to do when it happens
Immediate (minutes): confirm the incident from the trace; pause the affected agent if active harm is possible; hotfix the trigger.
Short-term (hours): add the failure case to the eval suite; patch the root cause; redeploy with regression validation.
Medium-term (days): root cause analysis; tighten guardrails or controls; document the incident for post-mortem and audit.
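The short-term step of adding the failure case to the eval suite can be as small as appending one labeled example and re-scoring. In this sketch the prompts and the incident are hypothetical, and `select_tool` is a stand-in for whatever function asks the agent which tool it would call. The keyword rule inside it exists only so the example runs on its own.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Case:
    prompt: str
    expected_tool: str

# Existing suite plus one case captured from a hypothetical incident.
SUITE = [
    Case("What did Dana say about the invoice?", "reademail"),
    Case("Reply to Dana that the invoice is approved.", "writeemail"),
]
INCIDENT = Case("Show me the email where Dana replied about the invoice.", "reademail")

def select_tool(prompt: str) -> str:
    """Stand-in selector (assumption): replace with a call to the real agent."""
    return "writeemail" if "repl" in prompt.lower() else "reademail"

def tool_choice_score(cases, selector=select_tool):
    """Return (accuracy, failing_cases) for a selector over labeled cases."""
    failures = [c for c in cases if selector(c.prompt) != c.expected_tool]
    accuracy = 1 - len(failures) / len(cases) if cases else 0.0
    return accuracy, failures

if __name__ == "__main__":
    print(tool_choice_score(SUITE))
    print(tool_choice_score(SUITE + [INCIDENT]))
```

The stand-in selector passes the original suite but fails the incident case, because "replied" trips its keyword rule. That is the behavior a regression gate should show until the root cause is patched.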
FAQ
Can wrong tool selection be eliminated entirely? Usually not. You can reduce its frequency and severity dramatically and contain the blast radius. Aim for incidents that are rare, detected, and contained.
How often should we test for this? Continuously, with every change. Every reported incident becomes a test case.
Can Prefactor detect this in real time? Yes, for many variants: guardrails run inline with sub-second latency.