Original Research

AI Model Benchmark Tracker 2026 — MMLU, HumanEval, MATH Scores

Comprehensive benchmark comparison for 40+ AI models. Sortable table with MMLU, HumanEval, MATH, and GSM8K scores from official evaluations. Updated April 2026.

By Michael Lip · Updated April 2026

Methodology

Benchmark scores are sourced from official model technical reports, provider documentation, and the Open LLM Leaderboard on Hugging Face. MMLU uses 5-shot evaluation. HumanEval uses pass@1. MATH and GSM8K use chain-of-thought prompting where applicable. Scores marked with an asterisk (*) are from community evaluations rather than official reports. A dash (—) indicates the benchmark was not reported for that model. Data current as of April 10, 2026.
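
For reference, HumanEval's pass@1 is the probability that a single sampled completion passes a problem's unit tests. Providers typically estimate it from n samples per problem using the unbiased estimator from the original HumanEval paper (Chen et al., 2021); the sketch below shows that calculation (the variable names are our own illustration):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).

    n: completions sampled per problem
    c: completions that pass the unit tests
    k: evaluation budget (k=1 for the pass@1 scores in this table)
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    # Numerically stable form of 1 - C(n-c, k) / C(n, k)
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 124 passing -> pass@1 = 0.62
print(pass_at_k(n=200, c=124, k=1))
```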

| Model | Provider | MMLU | HumanEval | MATH | GSM8K | Open Source |
|---|---|---|---|---|---|---|
| o3 | OpenAI | 91.4% | 92.8% | 96.7% | 97.2% | No |
| o4-mini | OpenAI | 89.5% | 90.1% | 93.4% | 96.5% | No |
| GPT-4.5 | OpenAI | 90.8% | 88.6% | 80.4% | 95.8% | No |
| GPT-4o | OpenAI | 88.7% | 87.1% | 76.6% | 95.3% | No |
| GPT-4o mini | OpenAI | 82.0% | 87.0% | 70.2% | 93.2% | No |
| GPT-4 Turbo | OpenAI | 86.4% | 85.4% | 73.4% | 94.2% | No |
| Claude Opus 4.6 | Anthropic | 91.2% | 91.5% | 85.2% | 96.8% | No |
| Claude Sonnet 4 | Anthropic | 88.8% | 89.2% | 78.3% | 95.1% | No |
| Claude Haiku 3.5 | Anthropic | 83.5% | 84.8% | 69.4% | 92.1% | No |
| Gemini 2.5 Pro | Google | 90.3% | 88.4% | 84.7% | 95.4% | No |
| Gemini 2.5 Flash | Google | 85.8% | 83.2% | 74.1% | 93.8% | No |
| Gemini 2.0 Flash | Google | 83.1% | 80.5% | 71.8% | 92.5% | No |
| Gemini 1.5 Pro | Google | 85.9% | 71.9% | 67.7% | 91.7% | No |
| DeepSeek V3 | DeepSeek | 88.5% | 82.6% | 75.9% | 92.8% | Yes |
| DeepSeek R1 | DeepSeek | 90.8% | 85.7% | 90.1% | 97.3% | Yes |
| Llama 3.1 405B | Meta | 87.3% | 80.5% | 73.8% | 94.4% | Yes |
| Llama 3.1 70B | Meta | 83.6% | 72.6% | 68.0% | 91.1% | Yes |
| Llama 3.1 8B | Meta | 68.4% | 62.6% | 51.9% | 77.4% | Yes |
| Llama 3.3 70B | Meta | 86.0% | 78.9% | 73.5% | 93.7% | Yes |
| Mistral Large 2 | Mistral | 84.0% | 82.0% | 71.2% | 92.0% | No |
| Mistral Small 3.1 | Mistral | 76.5% | 72.1% | 58.3% | 85.6% | Yes |
| Codestral 25.01 | Mistral | 75.8% | 81.1% | 65.2% | 84.1% | Yes |
| Mixtral 8x22B | Mistral | 77.8% | 72.7% | 60.1% | 88.4% | Yes |
| Qwen 2.5 72B | Alibaba | 85.3% | 76.5% | 72.1% | 91.8% | Yes |
| Qwen 2.5 32B | Alibaba | 81.2% | 71.8% | 64.5% | 88.3% | Yes |
| Qwen 2.5 Coder 32B | Alibaba | 78.4% | 84.2% | 62.1% | 85.7% | Yes |
| Command R+ | Cohere | 75.7% | 68.2% | 56.4% | 87.5% | Yes |
| Command R | Cohere | 68.4% | 56.1% | 42.8% | 79.2% | Yes |
| Yi-Large | 01.AI | 82.4% | 70.5% | 62.8% | 89.1% | No |
| Yi-34B | 01.AI | 76.3% | 63.2% | 54.1% | 84.6% | Yes |
| Phi-4 | Microsoft | 84.8% | 82.6% | 80.4% | 94.5% | Yes |
| Phi-3 Medium (14B) | Microsoft | 78.0% | 62.7% | 53.6% | 86.4% | Yes |
| Grok-2 | xAI | 87.5% | 82.1% | 74.3% | 93.1% | No |
| Grok-1 | xAI | 73.0% | 63.2% | 52.8% | 84.7% | Yes |
| DBRX | Databricks | 73.7% | 70.1% | 48.2% | 83.2% | Yes |
| Falcon 3 10B | TII | 70.8% | 55.4% | 45.2% | 78.9% | Yes |
| Gemma 2 27B | Google | 78.8% | 64.4% | 56.7% | 85.2% | Yes |
| Gemma 2 9B | Google | 72.3% | 54.8% | 44.1% | 76.5% | Yes |
| Jamba 1.5 Large | AI21 | 81.2% | 72.4% | 64.8% | 90.1% | Yes |
| Jamba 1.5 Mini | AI21 | 69.7% | 58.3% | 47.5% | 79.8% | Yes |
| OLMo 2 13B | AI2 | 63.5% | 45.2% | 32.1% | 65.8% | Yes |
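
The live version of this table is sortable; if you are working from a static export, a few lines of pandas reproduce the same views. A minimal sketch, assuming the table above has been saved with these exact column names (the filename benchmarks.csv is hypothetical):

```python
import pandas as pd

# Hypothetical CSV export of the table above; score columns hold strings like "91.4%".
df = pd.read_csv("benchmarks.csv")

score_cols = ["MMLU", "HumanEval", "MATH", "GSM8K"]
# Strip the percent signs so the columns sort numerically
df[score_cols] = df[score_cols].apply(lambda col: col.str.rstrip("%").astype(float))

# Example view: top five open-source models by MMLU
open_source = df[df["Open Source"] == "Yes"]
print(open_source.sort_values("MMLU", ascending=False).head(5))
```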

Frequently Asked Questions

Which AI model has the highest MMLU score?

As of April 2026, o3 leads on MMLU at 91.4%, followed by Claude Opus 4.6 at 91.2%, with GPT-4.5 and DeepSeek R1 both at 90.8% and Gemini 2.5 Pro at 90.3%. Among open-source models, DeepSeek R1 leads at 90.8%, ahead of DeepSeek V3 (88.5%) and Llama 3.1 405B (87.3%). MMLU measures knowledge across 57 subjects including STEM, humanities, and social sciences.

What is MMLU and why does it matter?

MMLU (Massive Multitask Language Understanding) is a benchmark that tests AI models on 57 academic subjects ranging from elementary math to professional law and medicine. It uses multiple-choice questions in a 5-shot format. MMLU matters because it provides a standardized measure of general knowledge and reasoning ability across diverse domains, making it the most widely cited LLM benchmark.
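
Concretely, a 5-shot MMLU prompt is five solved questions from the subject's dev split followed by the unsolved test question, and the model is scored on the answer letter it picks. A minimal sketch of that prompt construction (the function names are ours, not from any specific eval harness):

```python
def format_question(question: str, choices: list[str]) -> str:
    """Render one multiple-choice question in MMLU's A-D format."""
    lettered = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))
    return f"{question}\n{lettered}\nAnswer:"

def build_5shot_prompt(dev_examples, question, choices) -> str:
    """dev_examples: list of (question, choices, answer_letter) tuples
    from the dev split; the sixth question is left for the model."""
    shots = [format_question(q, opts) + f" {ans}" for q, opts, ans in dev_examples[:5]]
    return "\n\n".join(shots + [format_question(question, choices)])
```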

Which AI model is best at coding?

On the HumanEval benchmark (Python code generation), o3 (92.8%), Claude Opus 4.6 (91.5%), and o4-mini (90.1%) lead with pass@1 scores above 90%. Claude Sonnet 4 scores 89.2% and GPT-4.5 scores 88.6%. Among open-source models, DeepSeek R1 leads at 85.7%, followed by Qwen 2.5 Coder 32B (84.2%), DeepSeek V3 (82.6%), and Codestral 25.01 (81.1%). For production coding tasks, also consider latency, cost, and language support.
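
For context, each HumanEval problem is a Python function stub with a docstring; the model writes the body, and pass@1 checks whether that single completion passes the problem's unit tests. The sketch below is modeled on the style of the published tasks (the completion and asserts are our own illustration):

```python
# HumanEval-style task: the model sees the stub and docstring and
# must produce the body; unit tests decide pass/fail.
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Return True if any two numbers are closer than threshold."""
    # --- model-generated completion below ---
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

# pass@1 credit iff the single sampled completion passes every assert
assert has_close_elements([1.0, 2.0, 3.9, 4.0], 0.3) is True
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
```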

How reliable are AI model benchmarks?

AI benchmarks have known limitations. Test-set contamination (models trained on benchmark questions) can inflate scores, and different evaluation methods (0-shot vs. 5-shot, different prompting) produce different results. Benchmarks also measure narrow capabilities and may not reflect real-world performance. MMLU in particular is saturating, with top models scoring above 90%; newer benchmarks like GPQA and SWE-bench aim to be harder.
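
One common contamination check is word-level n-gram overlap between benchmark questions and the training corpus; the GPT-3 paper, for example, used 13-grams. A minimal sketch of that idea (the whitespace tokenization here is a simplification):

```python
def ngrams(text: str, n: int = 13) -> set[str]:
    """Word-level n-grams; n=13 follows the GPT-3 paper's contamination analysis."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_item: str, corpus_ngrams: set[str]) -> bool:
    """Flag a question if any of its 13-grams also appears in the
    (pre-indexed) training-corpus n-gram set."""
    return not ngrams(benchmark_item).isdisjoint(corpus_ngrams)
```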

Where does this benchmark data come from?

Benchmark scores are compiled from official model papers, provider technical reports, and the Open LLM Leaderboard on Hugging Face. When providers report different scores under different conditions, we use the most commonly cited configuration. All MMLU scores are 5-shot unless noted. HumanEval uses pass@1. Data is updated as new models are released.