Original Research

AI Model Benchmark Tracker 2026 — MMLU, HumanEval, MATH Scores

Comprehensive benchmark comparison for 40+ AI models. Sortable table with MMLU, HumanEval, MATH, and GSM8K scores from official evaluations. Updated April 2026.

By Michael Lip · Updated April 2026

Methodology

Benchmark scores are sourced from official model technical reports, provider documentation, and the Open LLM Leaderboard on Hugging Face. MMLU uses 5-shot evaluation. HumanEval uses pass@1. MATH and GSM8K use chain-of-thought prompting where applicable. Scores marked with an asterisk (*) are from community evaluations rather than official reports. A dash (—) indicates the benchmark was not reported for that model. Data current as of April 10, 2026.
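
For reference, HumanEval's pass@1 is the probability that a single sampled completion passes a problem's unit tests. Providers typically estimate it from n samples per problem using the unbiased estimator from the original HumanEval paper (Chen et al., 2021); the sketch below shows that calculation (the variable names are our own illustration):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).

    n: completions sampled per problem
    c: completions that pass the unit tests
    k: evaluation budget (k=1 for the pass@1 scores in this table)
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    # Numerically stable form of 1 - C(n-c, k) / C(n, k)
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 124 passing -> pass@1 = 0.62
print(pass_at_k(n=200, c=124, k=1))
```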

| Model | Provider | MMLU | HumanEval | MATH | GSM8K | Open Source |
|---|---|---|---|---|---|---|
| o3 | OpenAI | 91.4% | 92.8% | 96.7% | 97.2% | No |
| o4-mini | OpenAI | 89.5% | 90.1% | 93.4% | 96.5% | No |
| GPT-4.5 | OpenAI | 90.8% | 88.6% | 80.4% | 95.8% | No |
| GPT-4o | OpenAI | 88.7% | 87.1% | 76.6% | 95.3% | No |
| GPT-4o mini | OpenAI | 82.0% | 87.0% | 70.2% | 93.2% | No |
| GPT-4 Turbo | OpenAI | 86.4% | 85.4% | 73.4% | 94.2% | No |
| Claude Opus 4.6 | Anthropic | 91.2% | 91.5% | 85.2% | 96.8% | No |
| Claude Sonnet 4 | Anthropic | 88.8% | 89.2% | 78.3% | 95.1% | No |
| Claude Haiku 3.5 | Anthropic | 83.5% | 84.8% | 69.4% | 92.1% | No |
| Gemini 2.5 Pro | Google | 90.3% | 88.4% | 84.7% | 95.4% | No |
| Gemini 2.5 Flash | Google | 85.8% | 83.2% | 74.1% | 93.8% | No |
| Gemini 2.0 Flash | Google | 83.1% | 80.5% | 71.8% | 92.5% | No |
| Gemini 1.5 Pro | Google | 85.9% | 71.9% | 67.7% | 91.7% | No |
| DeepSeek V3 | DeepSeek | 88.5% | 82.6% | 75.9% | 92.8% | Yes |
| DeepSeek R1 | DeepSeek | 90.8% | 85.7% | 90.1% | 97.3% | Yes |
| Llama 3.1 405B | Meta | 87.3% | 80.5% | 73.8% | 94.4% | Yes |
| Llama 3.1 70B | Meta | 83.6% | 72.6% | 68.0% | 91.1% | Yes |
| Llama 3.1 8B | Meta | 68.4% | 62.6% | 51.9% | 77.4% | Yes |
| Llama 3.3 70B | Meta | 86.0% | 78.9% | 73.5% | 93.7% | Yes |
| Mistral Large 2 | Mistral | 84.0% | 82.0% | 71.2% | 92.0% | No |
| Mistral Small 3.1 | Mistral | 76.5% | 72.1% | 58.3% | 85.6% | Yes |
| Codestral 25.01 | Mistral | 75.8% | 81.1% | 65.2% | 84.1% | Yes |
| Mixtral 8x22B | Mistral | 77.8% | 72.7% | 60.1% | 88.4% | Yes |
| Qwen 2.5 72B | Alibaba | 85.3% | 76.5% | 72.1% | 91.8% | Yes |
| Qwen 2.5 32B | Alibaba | 81.2% | 71.8% | 64.5% | 88.3% | Yes |
| Qwen 2.5 Coder 32B | Alibaba | 78.4% | 84.2% | 62.1% | 85.7% | Yes |
| Command R+ | Cohere | 75.7% | 68.2% | 56.4% | 87.5% | Yes |
| Command R | Cohere | 68.4% | 56.1% | 42.8% | 79.2% | Yes |
| Yi-Large | 01.AI | 82.4% | 70.5% | 62.8% | 89.1% | No |
| Yi-34B | 01.AI | 76.3% | 63.2% | 54.1% | 84.6% | Yes |
| Phi-4 | Microsoft | 84.8% | 82.6% | 80.4% | 94.5% | Yes |
| Phi-3 Medium (14B) | Microsoft | 78.0% | 62.7% | 53.6% | 86.4% | Yes |
| Grok-2 | xAI | 87.5% | 82.1% | 74.3% | 93.1% | No |
| Grok-1 | xAI | 73.0% | 63.2% | 52.8% | 84.7% | Yes |
| DBRX | Databricks | 73.7% | 70.1% | 48.2% | 83.2% | Yes |
| Falcon 3 10B | TII | 70.8% | 55.4% | 45.2% | 78.9% | Yes |
| Gemma 2 27B | Google | 78.8% | 64.4% | 56.7% | 85.2% | Yes |
| Gemma 2 9B | Google | 72.3% | 54.8% | 44.1% | 76.5% | Yes |
| Jamba 1.5 Large | AI21 | 81.2% | 72.4% | 64.8% | 90.1% | Yes |
| Jamba 1.5 Mini | AI21 | 69.7% | 58.3% | 47.5% | 79.8% | Yes |
| OLMo 2 13B | AI2 | 63.5% | 45.2% | 32.1% | 65.8% | Yes |
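
The live version of this table is sortable; if you are working from a static export, a few lines of pandas reproduce the same views. A minimal sketch, assuming the table above has been saved with these exact column names (the filename benchmarks.csv is hypothetical):

```python
import pandas as pd

# Hypothetical CSV export of the table above; score columns hold strings like "91.4%".
df = pd.read_csv("benchmarks.csv")

score_cols = ["MMLU", "HumanEval", "MATH", "GSM8K"]
# Strip the percent signs so the columns sort numerically
df[score_cols] = df[score_cols].apply(lambda col: col.str.rstrip("%").astype(float))

# Example view: top five open-source models by MMLU
open_source = df[df["Open Source"] == "Yes"]
print(open_source.sort_values("MMLU", ascending=False).head(5))
```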

Frequently Asked Questions

Which AI model has the highest MMLU score?

As of April 2026, o3 leads on MMLU at 91.4%, followed by Claude Opus 4.6 at 91.2%, with GPT-4.5 and DeepSeek R1 both at 90.8% and Gemini 2.5 Pro at 90.3%. Among open-source models, DeepSeek R1 leads at 90.8%, ahead of DeepSeek V3 (88.5%) and Llama 3.1 405B (87.3%). MMLU measures knowledge across 57 subjects including STEM, humanities, and social sciences.

What is MMLU and why does it matter?

MMLU (Massive Multitask Language Understanding) is a benchmark that tests AI models on 57 academic subjects ranging from elementary math to professional law and medicine. It uses multiple-choice questions in a 5-shot format. MMLU matters because it provides a standardized measure of general knowledge and reasoning ability across diverse domains, making it the most widely cited LLM benchmark.
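
Concretely, a 5-shot MMLU prompt is five solved questions from the subject's dev split followed by the unsolved test question, and the model is scored on the answer letter it picks. A minimal sketch of that prompt construction (the function names are ours, not from any specific eval harness):

```python
def format_question(question: str, choices: list[str]) -> str:
    """Render one multiple-choice question in MMLU's A-D format."""
    lettered = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))
    return f"{question}\n{lettered}\nAnswer:"

def build_5shot_prompt(dev_examples, question, choices) -> str:
    """dev_examples: list of (question, choices, answer_letter) tuples
    from the dev split; the sixth question is left for the model."""
    shots = [format_question(q, opts) + f" {ans}" for q, opts, ans in dev_examples[:5]]
    return "\n\n".join(shots + [format_question(question, choices)])
```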

Which AI model is best at coding?

On the HumanEval benchmark (Python code generation), o3 (92.8%), Claude Opus 4.6 (91.5%), and o4-mini (90.1%) lead with pass@1 scores above 90%. Claude Sonnet 4 scores 89.2% and GPT-4.5 scores 88.6%. Among open-source models, DeepSeek R1 leads at 85.7%, followed by Qwen 2.5 Coder 32B (84.2%), DeepSeek V3 (82.6%), and Codestral 25.01 (81.1%). For production coding tasks, also consider latency, cost, and language support.
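
For context, each HumanEval problem is a Python function stub with a docstring; the model writes the body, and pass@1 checks whether that single completion passes the problem's unit tests. The sketch below is modeled on the style of the published tasks (the completion and asserts are our own illustration):

```python
# HumanEval-style task: the model sees the stub and docstring and
# must produce the body; unit tests decide pass/fail.
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Return True if any two numbers are closer than threshold."""
    # --- model-generated completion below ---
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

# pass@1 credit iff the single sampled completion passes every assert
assert has_close_elements([1.0, 2.0, 3.9, 4.0], 0.3) is True
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
```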

How reliable are AI model benchmarks?

AI benchmarks have known limitations. Test-set contamination (models trained on benchmark questions) can inflate scores, and different evaluation methods (0-shot vs. 5-shot, different prompting) produce different results. Benchmarks also measure narrow capabilities and may not reflect real-world performance. MMLU in particular is saturating, with top models scoring above 90%; newer benchmarks like GPQA and SWE-bench aim to be harder.
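
One common contamination check is word-level n-gram overlap between benchmark questions and the training corpus; the GPT-3 paper, for example, used 13-grams. A minimal sketch of that idea (the whitespace tokenization here is a simplification):

```python
def ngrams(text: str, n: int = 13) -> set[str]:
    """Word-level n-grams; n=13 follows the GPT-3 paper's contamination analysis."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_item: str, corpus_ngrams: set[str]) -> bool:
    """Flag a question if any of its 13-grams also appears in the
    (pre-indexed) training-corpus n-gram set."""
    return not ngrams(benchmark_item).isdisjoint(corpus_ngrams)
```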

Where does this benchmark data come from?

Benchmark scores are compiled from official model papers, provider technical reports, and the Open LLM Leaderboard on Hugging Face. When providers report different scores under different conditions, we use the most commonly cited configuration. All MMLU scores are 5-shot unless noted. HumanEval uses pass@1. Data is updated as new models are released.