# LLM Benchmarks

A comparison of model benchmark scores and API prices. MMLU tests broad knowledge, HumanEval tests coding, and GPQA tests graduate-level science reasoning; higher scores are better. Prices are approximate per-million-token API rates. Data is updated periodically, so check provider documentation for the latest figures.
## Frontier
| Model | Provider | MMLU | HumanEval | GPQA | Context | Input / 1M | Output / 1M |
|---|---|---|---|---|---|---|---|
| GPT-4o | OpenAI | 88.7% | 90.2% | 53.6% | 128K | $2.50 | $10.00 |
| Claude 3.5 Sonnet | Anthropic | 88.3% | 92.0% | 59.4% | 200K | $3.00 | $15.00 |
| Gemini 1.5 Pro | Google | 85.9% | 84.1% | 46.2% | 2M | $1.25 | $5.00 |
| Llama 3.1 405B | Meta | 88.6% | 89.0% | 51.1% | 128K | $0.80 | $0.80 |
| DeepSeek V3 | DeepSeek | 88.5% | 84.7% | 59.1% | 128K | $0.07 | $1.10 |
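To make the per-million-token rates concrete, here is a minimal Python sketch that estimates the cost of one request. The rates are copied from the table above; the token counts and the `request_cost` helper are illustrative, not any provider's actual SDK:

```python
# Estimate the cost of one API request from per-1M-token rates.
# Rates below come from the Frontier table; the token counts and
# this helper are hypothetical examples.

def request_cost(input_tokens: int, output_tokens: int,
                 input_per_1m: float, output_per_1m: float) -> float:
    """Cost in USD for a single request, given per-1M-token prices."""
    return (input_tokens * input_per_1m + output_tokens * output_per_1m) / 1_000_000

# A 4K-token prompt with a 1K-token reply:
print(f"GPT-4o:      ${request_cost(4_000, 1_000, 2.50, 10.00):.4f}")  # $0.0200
print(f"DeepSeek V3: ${request_cost(4_000, 1_000, 0.07, 1.10):.4f}")   # $0.0014
```

The same request is roughly 15x cheaper on DeepSeek V3 than on GPT-4o at these list rates, which is why the price columns matter as much as the score columns for high-volume workloads.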
## Mid-tier
| Model | Provider | MMLU | HumanEval | GPQA | Context | Input / 1M | Output / 1M |
|---|---|---|---|---|---|---|---|
| Mistral Large 2 | Mistral | 84.0% | 92.1% | 46.0% | 128K | $2.00 | $6.00 |
| Qwen 2.5 72B | Alibaba | 86.0% | 86.6% | — | 131K | $0.40 | $1.20 |
| Llama 3.1 70B | Meta | 83.6% | 80.5% | 41.8% | 128K | $0.40 | $0.40 |
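The Context column is a hard limit, not a score, so it is worth budgeting against. Here is a rough sketch of a fit check; the ~4 characters-per-token ratio is a crude English-text heuristic (use a real tokenizer for exact counts), and the window sizes are taken from the tables:

```python
# Rough check of whether a document fits a model's context window.
# ~4 chars/token is a crude English-text assumption; window sizes
# (in tokens) are copied from the tables in this page.

CONTEXT_WINDOW = {
    "Llama 3.1 70B":  128_000,
    "Qwen 2.5 72B":   131_000,
    "Gemini 1.5 Pro": 2_000_000,
}

def approx_tokens(text: str) -> int:
    return len(text) // 4

def fits(model: str, text: str, reserved_output: int = 4_096) -> bool:
    """Leave headroom for the model's reply when budgeting the prompt."""
    return approx_tokens(text) + reserved_output <= CONTEXT_WINDOW[model]

doc = "x" * 600_000  # a ~600K-character document, roughly 150K tokens
print(fits("Llama 3.1 70B", doc))   # False: overflows the 128K window
print(fits("Gemini 1.5 Pro", doc))  # True: well inside 2M
```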
## Efficient
| Model | Provider | MMLU | HumanEval | GPQA | Context | Input / 1M | Output / 1M |
|---|---|---|---|---|---|---|---|
| Gemini 2.0 Flash | Google | 76.2% | 81.4% | 40.1% | 1M | $0.10 | $0.40 |
| Claude 3 Haiku | Anthropic | 75.2% | 75.9% | 33.3% | 200K | $0.25 | $1.25 |
| GPT-4o mini | OpenAI | 82.0% | 87.2% | 40.2% | 128K | $0.15 | $0.60 |
| Gemini 1.5 Flash | Google | 78.9% | 74.3% | — | 1M | $0.075 | $0.30 |
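One way to read the tiers together is price-per-capability. The sketch below ranks a handful of the models above by a blended per-1M price next to their MMLU scores; the 3:1 input-to-output token mix is an arbitrary workload assumption, and MMLU-per-dollar is only one lens on value:

```python
# Rank models by blended per-1M price, shown alongside MMLU.
# The 3:1 input:output token mix is an arbitrary workload assumption;
# scores and prices are copied from the tables above.

MODELS = [
    # (name, MMLU %, $ input / 1M, $ output / 1M)
    ("DeepSeek V3",       88.5, 0.07,  1.10),
    ("Llama 3.1 405B",    88.6, 0.80,  0.80),
    ("GPT-4o",            88.7, 2.50, 10.00),
    ("Claude 3.5 Sonnet", 88.3, 3.00, 15.00),
    ("GPT-4o mini",       82.0, 0.15,  0.60),
    ("Gemini 2.0 Flash",  76.2, 0.10,  0.40),
]

def blended(inp: float, out: float, ratio: float = 3.0) -> float:
    """Per-1M price assuming `ratio` input tokens per output token."""
    return (ratio * inp + out) / (ratio + 1)

for name, mmlu, inp, out in sorted(MODELS, key=lambda m: blended(m[2], m[3])):
    print(f"{name:<18} ${blended(inp, out):5.2f}/1M  MMLU {mmlu:.1f}%")
```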
Benchmark scores and prices are approximate and change frequently. Sources: model provider documentation, the LMSYS Leaderboard, and official benchmark papers. Last reviewed May 2026.