AI Glossary
Benchmark
Standardized tests for comparing AI model capability
Definition
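A benchmark is a standardized test or dataset used to measure and compare AI model performance. Common benchmarks include MMLU (multiple-choice knowledge across many academic subjects), HumanEval (Python code generation), GPQA (graduate-level science questions), and LMSYS Chatbot Arena (head-to-head user preference votes). Benchmarks give buyers a common yardstick for comparing models, but scores can be "gamed": a model trained on benchmark data, whether deliberately or through contamination of its training set, can score higher than its real-world capability warrants.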
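To make the idea concrete, here is a minimal sketch of how a multiple-choice benchmark might be scored. Everything in it is hypothetical and for illustration only: the four-item `BENCHMARK` list, the two toy "models" (`always_first` and `pick_longest`), and the helper names `accuracy` and `rank_models` are assumptions, not part of any real benchmark or evaluation library.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy benchmark: every model sees the same fixed items.
BENCHMARK: List[dict] = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5"], "answer": "4"},
    {"question": "Capital of France?", "choices": ["Paris", "Rome", "Oslo"], "answer": "Paris"},
    {"question": "Largest planet?", "choices": ["Mars", "Jupiter", "Venus"], "answer": "Jupiter"},
    {"question": "H2O is commonly called?", "choices": ["water", "salt", "sand"], "answer": "water"},
]

# A "model" here is any function that picks one choice for a question.
Model = Callable[[str, List[str]], str]


def always_first(question: str, choices: List[str]) -> str:
    """Toy model: always picks the first choice."""
    return choices[0]


def pick_longest(question: str, choices: List[str]) -> str:
    """Toy model: picks the longest choice (first one wins ties)."""
    return max(choices, key=len)


def accuracy(model: Model, items: List[dict]) -> float:
    """Fraction of items where the model's pick matches the gold answer."""
    correct = sum(
        1 for item in items
        if model(item["question"], item["choices"]) == item["answer"]
    )
    return correct / len(items)


def rank_models(models: Dict[str, Model], items: List[dict]) -> List[Tuple[str, float]]:
    """Score every model on the same items and sort best-first."""
    scores = [(name, accuracy(model, items)) for name, model in models.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    models = {"always_first": always_first, "pick_longest": pick_longest}
    for name, score in rank_models(models, BENCHMARK):
        print(f"{name}: {score:.2f}")
```

Because both models are scored on identical items with an identical metric, their numbers are directly comparable. The same property explains why leaked test items are a problem: a model that has memorized the answers would score well here without being any more capable.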