LLM Benchmarks

Model benchmark comparison

MMLU tests broad knowledge, HumanEval tests coding, and GPQA tests graduate-level science reasoning. Higher scores are better. Prices are approximate API rates per million tokens. Data is updated periodically; check provider documentation for the latest figures.
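
To make the price columns concrete, here is a minimal Python sketch that estimates the cost of one request from per-million-token rates. The token counts in the example are illustrative; the rates are GPT-4o's from the Frontier table below.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_1m: float, output_per_1m: float) -> float:
    """Estimate the USD cost of one request from per-million-token rates."""
    return (input_tokens / 1_000_000) * input_per_1m \
         + (output_tokens / 1_000_000) * output_per_1m

# Example: a 2,000-token prompt with a 500-token completion on GPT-4o
# ($2.50 input / $10.00 output per 1M tokens, from the Frontier table).
cost = request_cost(2_000, 500, 2.50, 10.00)
print(f"${cost:.4f}")  # $0.0100
```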

Frontier

| Model | Provider | MMLU | HumanEval | GPQA | Context | Input / 1M | Output / 1M |
|---|---|---|---|---|---|---|---|
| GPT-4o | OpenAI | 88.7% | 90.2% | 53.6% | 128K | $2.50 | $10.00 |
| Claude 3.5 Sonnet | Anthropic | 88.3% | 92% | 59.4% | 200K | $3.00 | $15.00 |
| Gemini 1.5 Pro | Google | 85.9% | 84.1% | 46.2% | 2M | $1.25 | $5.00 |
| Llama 3.1 405B | Meta | 88.6% | 89% | 51.1% | 128K | $0.80 | $0.80 |
| DeepSeek V3 | DeepSeek | 88.5% | 84.7% | 59.1% | 128K | $0.07 | $1.10 |

Mid-tier

| Model | Provider | MMLU | HumanEval | GPQA | Context | Input / 1M | Output / 1M |
|---|---|---|---|---|---|---|---|
| Mistral Large 2 | Mistral | 84% | 92.1% | 46% | 128K | $2.00 | $6.00 |
| Qwen 2.5 72B | Alibaba | 86% | 86.6% | n/a | 131K | $0.40 | $1.20 |
| Llama 3.1 70B | Meta | 83.6% | 80.5% | 41.8% | 128K | $0.40 | $0.40 |

Efficient

| Model | Provider | MMLU | HumanEval | GPQA | Context | Input / 1M | Output / 1M |
|---|---|---|---|---|---|---|---|
| Gemini 2.0 Flash | Google | 76.2% | 81.4% | 40.1% | 1M | $0.10 | $0.40 |
| Claude 3 Haiku | Anthropic | 75.2% | 75.9% | 33.3% | 200K | $0.25 | $1.25 |
| GPT-4o mini | OpenAI | 82% | 87.2% | 40.2% | 128K | $0.15 | $0.60 |
| Gemini 1.5 Flash | Google | 78.9% | 74.3% | n/a | 1M | $0.075 | $0.30 |
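
As one way to read the tables together, the sketch below ranks a few of the models above by GPQA score per blended dollar. The 3:1 input-to-output token weighting is an assumption about a typical workload, not something the tables specify; the scores and prices are copied from the tables above.

```python
# Rough comparison by GPQA score per blended $/1M tokens.
models = {
    # name: (GPQA %, input $/1M, output $/1M) -- values from the tables above
    "GPT-4o":            (53.6, 2.50, 10.00),
    "Claude 3.5 Sonnet": (59.4, 3.00, 15.00),
    "Gemini 1.5 Pro":    (46.2, 1.25, 5.00),
    "DeepSeek V3":       (59.1, 0.07, 1.10),
    "GPT-4o mini":       (40.2, 0.15, 0.60),
}

def blended_price(inp: float, out: float, ratio: float = 3.0) -> float:
    """Blended $/1M tokens, assuming `ratio` input tokens per output token."""
    return (ratio * inp + out) / (ratio + 1)

ranked = sorted(models.items(),
                key=lambda kv: kv[1][0] / blended_price(kv[1][1], kv[1][2]),
                reverse=True)
for name, (gpqa, inp, out) in ranked:
    print(f"{name:18s} GPQA {gpqa:.1f}%  blended ${blended_price(inp, out):.2f}/1M")
```

Under this weighting, cheaper models such as DeepSeek V3 and GPT-4o mini dominate on score per dollar; change the ratio or the metric to match your own workload.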

Benchmark scores and prices are approximate and change frequently. Sources: model provider documentation, the LMSYS Leaderboard, and official benchmark papers. Last reviewed May 2026.