Glossary
Benchmark (Agent)
An agent benchmark is a standardised set of tasks used to measure and compare agent performance across dimensions such as accuracy, reasoning depth, tool-use efficiency, and instruction following. Because the tasks and scoring rules are fixed, benchmarks provide a reproducible baseline for tracking improvement over time and for comparing models or agent configurations.
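As a rough illustration of the idea, the sketch below runs a fixed task set against two toy agents and reports one score per agent. Everything in it is hypothetical and chosen for illustration: the `Task` type, the three tasks, the `run_benchmark` and `compare` helpers, and the two agents. It measures only exact-match accuracy; dimensions such as reasoning depth or tool-use efficiency would need their own scoring functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

# Hypothetical types and data for illustration only.
Agent = Callable[[str], str]


@dataclass(frozen=True)
class Task:
    prompt: str
    expected: str


# A fixed task set is what makes runs comparable across agents and over time.
TASKS: Sequence[Task] = (
    Task("2 + 2", "4"),
    Task("capital of France", "Paris"),
    Task("reverse 'abc'", "cba"),
)


def run_benchmark(agent: Agent, tasks: Sequence[Task] = TASKS) -> float:
    """Return exact-match accuracy: the fraction of tasks answered correctly."""
    if not tasks:
        raise ValueError("benchmark needs at least one task")
    correct = sum(1 for t in tasks if agent(t.prompt).strip() == t.expected)
    return correct / len(tasks)


def compare(agents: Dict[str, Agent], tasks: Sequence[Task] = TASKS) -> Dict[str, float]:
    """Score every agent on the same tasks so the results are comparable."""
    return {name: run_benchmark(agent, tasks) for name, agent in agents.items()}


def lookup_agent(prompt: str) -> str:
    """Toy agent that knows two of the three answers."""
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "")


def echo_agent(prompt: str) -> str:
    """Toy agent that just repeats the prompt."""
    return prompt


if __name__ == "__main__":
    print(compare({"lookup": lookup_agent, "echo": echo_agent}))
```

Because both agents see the same tasks and the same scoring rule, the two scores can be compared directly, and rerunning the script later against a changed agent shows whether it improved.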