LLM Benchmarks

Model benchmark comparison

MMLU tests broad knowledge, HumanEval tests coding, and GPQA tests graduate-level science reasoning. Higher scores are better. Prices are approximate API rates per million tokens. Data is updated periodically; check provider documentation for the latest figures.
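
To make the price columns concrete, here is a minimal Python sketch that estimates the cost of one request from per-million-token rates. The token counts in the example are illustrative; the rates are GPT-4o's from the Frontier table below.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_1m: float, output_per_1m: float) -> float:
    """Estimate the USD cost of one request from per-million-token rates."""
    return (input_tokens / 1_000_000) * input_per_1m \
         + (output_tokens / 1_000_000) * output_per_1m

# Example: a 2,000-token prompt with a 500-token completion on GPT-4o
# ($2.50 input / $10.00 output per 1M tokens, from the Frontier table).
cost = request_cost(2_000, 500, 2.50, 10.00)
print(f"${cost:.4f}")  # $0.0100
```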

Frontier

| Model | Provider | MMLU | HumanEval | GPQA | Context | Input / 1M | Output / 1M |
|---|---|---|---|---|---|---|---|
| GPT-4o | OpenAI | 88.7% | 90.2% | 53.6% | 128K | $2.50 | $10.00 |
| Claude 3.5 Sonnet | Anthropic | 88.3% | 92% | 59.4% | 200K | $3.00 | $15.00 |
| Gemini 1.5 Pro | Google | 85.9% | 84.1% | 46.2% | 2M | $1.25 | $5.00 |
| Llama 3.1 405B | Meta | 88.6% | 89% | 51.1% | 128K | $0.80 | $0.80 |
| DeepSeek V3 | DeepSeek | 88.5% | 84.7% | 59.1% | 128K | $0.07 | $1.10 |

Mid-tier

| Model | Provider | MMLU | HumanEval | GPQA | Context | Input / 1M | Output / 1M |
|---|---|---|---|---|---|---|---|
| Mistral Large 2 | Mistral | 84% | 92.1% | 46% | 128K | $2.00 | $6.00 |
| Qwen 2.5 72B | Alibaba | 86% | 86.6% | n/a | 131K | $0.40 | $1.20 |
| Llama 3.1 70B | Meta | 83.6% | 80.5% | 41.8% | 128K | $0.40 | $0.40 |

Efficient

| Model | Provider | MMLU | HumanEval | GPQA | Context | Input / 1M | Output / 1M |
|---|---|---|---|---|---|---|---|
| Gemini 2.0 Flash | Google | 76.2% | 81.4% | 40.1% | 1M | $0.10 | $0.40 |
| Claude 3 Haiku | Anthropic | 75.2% | 75.9% | 33.3% | 200K | $0.25 | $1.25 |
| GPT-4o mini | OpenAI | 82% | 87.2% | 40.2% | 128K | $0.15 | $0.60 |
| Gemini 1.5 Flash | Google | 78.9% | 74.3% | n/a | 1M | $0.075 | $0.30 |
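
As one way to read the tables together, the sketch below ranks a few of the models above by GPQA score per blended dollar. The 3:1 input-to-output token weighting is an assumption about a typical workload, not something the tables specify; the scores and prices are copied from the tables above.

```python
# Rough comparison by GPQA score per blended $/1M tokens.
models = {
    # name: (GPQA %, input $/1M, output $/1M) -- values from the tables above
    "GPT-4o":            (53.6, 2.50, 10.00),
    "Claude 3.5 Sonnet": (59.4, 3.00, 15.00),
    "Gemini 1.5 Pro":    (46.2, 1.25, 5.00),
    "DeepSeek V3":       (59.1, 0.07, 1.10),
    "GPT-4o mini":       (40.2, 0.15, 0.60),
}

def blended_price(inp: float, out: float, ratio: float = 3.0) -> float:
    """Blended $/1M tokens, assuming `ratio` input tokens per output token."""
    return (ratio * inp + out) / (ratio + 1)

ranked = sorted(models.items(),
                key=lambda kv: kv[1][0] / blended_price(kv[1][1], kv[1][2]),
                reverse=True)
for name, (gpqa, inp, out) in ranked:
    print(f"{name:18s} GPQA {gpqa:.1f}%  blended ${blended_price(inp, out):.2f}/1M")
```

Under this weighting, cheaper models such as DeepSeek V3 and GPT-4o mini dominate on score per dollar; change the ratio or the metric to match your own workload.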

Benchmark scores and prices are approximate and change frequently. Sources: model provider documentation, the LMSYS Leaderboard, and official benchmark papers. Last reviewed May 2026.