AI Benchmark Tracker

Q: Which AI model scores highest on HumanEval for coding?

As of May 2026, Claude Opus 4 leads HumanEval with a 92.7% pass@1 score, followed by Gemini 2.5 Pro at 91.2% and GPT-4o at 90.2%. Among open-source models, DeepSeek V3 scores 82.6% and Llama 4 Scout reaches 80.5%. HumanEval tests Python code generation with 164 programming challenges, measuring the model's ability to write correct code from docstring descriptions on the first attempt.

Q: How do open-source AI models compare to closed-source on benchmarks?

The gap between open-source and closed-source models has narrowed dramatically by 2026. On MMLU, the best open-source model (Llama 4 Scout at 88.5%) trails the best closed-source model (Claude Opus 4 at 92.0%) by only 3.5 points. On HumanEval, the gap is about 10 points. On GSM8K math, open-source models like Qwen 2.5 72B (89.5%) approach closed-source performance. The remaining gap is largest on complex reasoning tasks (MATH) where closed-source models maintain a 10-15 point lead.
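The gap figures in this answer follow from simple subtraction of the quoted scores. A minimal sketch, assuming only the numbers stated above; the dictionary layout and the `gap` helper are illustrative names, not part of the tracker itself.

```python
# Best scores quoted in the answers above, in percent.
best_scores = {
    "MMLU": {"closed": 92.0, "open": 88.5},       # Claude Opus 4 vs Llama 4 Scout
    "HumanEval": {"closed": 92.7, "open": 82.6},  # Claude Opus 4 vs DeepSeek V3
}


def gap(benchmark: str) -> float:
    """Closed-source lead over open-source, in percentage points."""
    pair = best_scores[benchmark]
    # Round to one decimal to hide floating-point noise in the subtraction.
    return round(pair["closed"] - pair["open"], 1)


gaps = {name: gap(name) for name in best_scores}
print(gaps)
```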

Q: What are the limitations of AI benchmarks?

AI benchmarks have several known limitations: (1) Data contamination — models may have seen benchmark questions during training, inflating scores. (2) Benchmark saturation — when top models all score 90%+, the benchmark loses discriminative power. (3) Gaming — models can be specifically optimized for benchmark performance without corresponding real-world improvement. (4) Narrow measurement — MMLU measures knowledge breadth but not reasoning depth; HumanEval tests simple functions but not complex systems. (5) Static snapshots — benchmarks don't capture how models perform on novel, evolving real-world tasks. For these reasons, benchmark scores should be one factor among many in model selection.

Q: Can I download the benchmark data?

Yes, you can download the complete benchmark dataset from this page in both JSON and CSV formats using the download buttons above the chart. The data includes all 12 models with their MMLU, HumanEval, GSM8K, MATH, and HellaSwag scores, along with provider and release date information. The tool runs entirely in your browser with no server interaction: the export is generated client-side from the same dataset displayed in the table.

Compare MMLU, HumanEval, GSM8K, MATH, and HellaSwag scores across 12 major AI models. Color-coded scores, sortable table, bar chart visualization, and JSON/CSV export.

No data leaves your browser

[Interactive table and bar chart not reproduced here. Table columns: Model, Provider, MMLU, HumanEval, GSM8K, MATH, HellaSwag. Chart title: MMLU Scores Comparison.]

Understanding AI Benchmarks

AI benchmarks provide standardized measurements of model capabilities across specific domains. While no single benchmark captures the full picture of a model's usefulness, the five benchmarks tracked on this page cover the most important dimensions: general knowledge (MMLU), code generation (HumanEval), mathematical reasoning (GSM8K and MATH), and common-sense understanding (HellaSwag). Together, they provide a multi-dimensional view of model capability that helps developers, researchers, and businesses make informed decisions about which model to use for specific tasks.

All benchmark data on this page is sourced from published papers, official model cards, and verified third-party evaluations. Scores are reported as percentages where available, with higher scores indicating better performance. The color coding provides instant visual assessment: green (90%+) indicates state-of-the-art performance, yellow (70-90%) indicates strong performance, orange (50-70%) indicates moderate performance, and red (below 50%) indicates below-average performance for current-generation models.
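The color bands described above can be expressed as a small threshold function. This is a sketch under one assumption the text does not settle: a score sitting exactly on a boundary (90, 70, or 50) is placed in the higher band. The function name `score_color` is hypothetical.

```python
def score_color(score: float) -> str:
    """Map a percentage score to the color band described in the text.

    Assumption: boundary values (90, 70, 50) fall into the higher band.
    """
    if score >= 90:
        return "green"   # state-of-the-art
    if score >= 70:
        return "yellow"  # strong
    if score >= 50:
        return "orange"  # moderate
    return "red"         # below average for current-generation models


# Scores taken from elsewhere on this page.
for value in (92.0, 82.6, 64.3, 43.9):
    print(value, score_color(value))
```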

MMLU — Massive Multitask Language Understanding

MMLU is the most widely cited benchmark for measuring an AI model's breadth of knowledge. Introduced by Hendrycks et al. in 2021, it consists of 14,042 multiple-choice questions spanning 57 academic subjects organized into four categories: STEM (physics, chemistry, computer science, mathematics), humanities (history, philosophy, law), social sciences (economics, political science, sociology), and other (clinical medicine, professional accounting, US foreign policy).

The benchmark uses a 5-shot evaluation format, meaning the model is shown 5 example question-answer pairs before each test question. This tests both knowledge recall and in-context learning ability. A score of 90%+ generally indicates expert-level knowledge across most domains. By 2026, the leading models have reached 90-92% on MMLU, approaching the saturation point where the benchmark loses its discriminative power. This has led to the development of MMLU-Pro, a harder variant with more challenging questions and 10 answer choices instead of 4.
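To make the 5-shot format concrete, the sketch below assembles a prompt from five worked examples followed by one unanswered test question. The exact prompt layout varies between evaluation harnesses, so the "A./B./C./D." and "Answer:" layout here is an assumption, the function names are hypothetical, and the questions are placeholders rather than real MMLU items.

```python
def format_example(question, choices, answer=None):
    """Render one multiple-choice item; leave the answer blank for the test item."""
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip("ABCD", choices)]
    lines.append("Answer:" + (f" {answer}" if answer else ""))
    return "\n".join(lines)


def build_few_shot_prompt(dev_examples, test_question, test_choices, k=5):
    """Concatenate k solved examples and then the unsolved test question."""
    shots = [format_example(q, c, a) for q, c, a in dev_examples[:k]]
    shots.append(format_example(test_question, test_choices))
    return "\n\n".join(shots)


# Placeholder items only; these are not drawn from the benchmark.
demo = [(f"Placeholder question {i}?", ["w", "x", "y", "z"], "A") for i in range(1, 6)]
prompt = build_few_shot_prompt(demo, "Placeholder test question?", ["w", "x", "y", "z"])
print(prompt)
```

The model's next predicted token after the final "Answer:" is then compared with the correct letter.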

Among the models tracked here, Claude Opus 4 leads at 92.0%, followed by Gemini 2.5 Pro at 91.7% and GPT-4o at 90.2%. The strong performance of open-source models like Llama 4 Scout (88.5%) and Qwen 2.5 72B (85.3%) demonstrates how quickly the gap between open and closed-source models is closing on knowledge benchmarks.

HumanEval — Code Generation Benchmark

HumanEval, created by OpenAI in 2021, measures a model's ability to generate correct Python code from function docstrings. The benchmark contains 164 hand-written programming challenges, each with a function signature, docstring description, and a suite of unit tests. The primary metric is pass@1 — the percentage of problems where the model's first generated solution passes all unit tests.
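The pass@1 metric can be computed with the unbiased pass@k estimator from the paper that introduced HumanEval: for a problem with n generated samples of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over all problems. The sketch below is a minimal illustration, not the official evaluation code. The count of 152 solved problems is an assumption chosen only because 152 of 164 rounds to 92.7%; it is not a reported figure.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Mean pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# With one sample per problem, pass@1 reduces to the fraction of problems solved.
# Illustrative assumption: 152 of the 164 problems solved on the first attempt.
results = [(1, 1)] * 152 + [(1, 0)] * 12
score = benchmark_pass_at_k(results, k=1)
print(f"{score:.1%}")
```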

HumanEval tests fundamental programming skills: string manipulation, list operations, mathematical computations, recursive algorithms, and basic data structure operations. It does not test complex system design, multi-file projects, or language-specific idioms beyond Python. For a more comprehensive evaluation of real-world coding ability, benchmarks like SWE-bench (which tests ability to resolve actual GitHub issues) have become increasingly important.

As of 2026, Claude Opus 4 leads HumanEval at 92.7%, with Gemini 2.5 Pro at 91.2% and GPT-4o at 90.2%. The gap between the best closed-source and open-source models (DeepSeek V3 at 82.6%) remains around 10 points, but this gap has shrunk considerably from the 30+ point difference seen in 2023. Open-source code models like DeepSeek V3 benefit from specialized training on large code corpora.

GSM8K — Grade School Math

GSM8K (Grade School Math 8K) is a dataset of 8,500 grade-school-level math word problems created by OpenAI. Each problem requires 2-8 steps of arithmetic and logical reasoning to solve. Despite the "grade school" label, the benchmark is genuinely challenging for AI models because it tests multi-step reasoning — the model must plan a solution strategy, execute multiple calculations correctly, and arrive at a precise numerical answer.

GSM8K has become one of the best proxies for reasoning ability in language models. Models that perform well on GSM8K typically also perform well on real-world tasks requiring step-by-step problem solving. Chain-of-thought prompting dramatically improves GSM8K scores, which is why most model evaluations use CoT prompts for this benchmark.
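Because a chain-of-thought response contains intermediate numbers as well as the final answer, scoring needs an answer-extraction step. The sketch below takes the last number in the model output and compares it with the reference, relying on the fact that GSM8K reference solutions end with a line of the form "#### <answer>". Taking the last number is a common heuristic rather than a standard, the helper names are hypothetical, and the word problem is made up for illustration.

```python
import re

# Integers or decimals, optionally negative, optionally with thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text):
    """Return the last number appearing in text as a float, or None if there is none."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def reference_answer(solution):
    """Read the final answer that follows the '####' marker in a reference solution."""
    return last_number(solution.split("####")[-1])


def is_correct(model_output, solution):
    predicted = last_number(model_output)
    return predicted is not None and predicted == reference_answer(solution)


# Made-up example, not a benchmark item.
output = "Each box holds 12 eggs, so 3 boxes hold 3 * 12 = 36 eggs. The answer is 36."
solution = "3 * 12 = 36\n#### 36"
print(is_correct(output, solution))
```

Different extraction rules can score the same response differently, which is one reason scores vary between evaluation setups.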

The leading models have largely saturated GSM8K, with Claude Opus 4 at 97.2%, Gemini 2.5 Pro at 96.8%, and GPT-4o at 95.5%. Even open-source models like Qwen 2.5 72B (89.5%) and Llama 4 Scout (88.7%) perform strongly. The community has shifted attention to harder math benchmarks like MATH for differentiating model capabilities.

MATH — Competition Mathematics

The MATH benchmark contains 12,500 problems from high school mathematics competitions (AMC, AIME, and similar), covering seven subjects: algebra, counting and probability, geometry, intermediate algebra, number theory, prealgebra, and precalculus. Problems range from AMC-8 difficulty (accessible to strong middle school students) to AIME difficulty (challenging for talented high schoolers and beyond).

MATH is significantly harder than GSM8K and remains the best benchmark for measuring deep mathematical reasoning. A human expert-level score is estimated at around 90%. The benchmark tests not just arithmetic but mathematical insight: the ability to recognize when a problem requires a specific technique, to apply abstract reasoning, and to carry a multi-step derivation through to a final answer.

As of 2026, Gemini 2.5 Pro leads at 80.2%, followed by Claude Opus 4 at 78.5% and GPT-4o at 76.4%. The gap between closed-source and open-source models is largest on MATH, with the best open-source model (DeepSeek V3 at 64.3%) trailing the leader by about 16 points. This gap reflects the advantage that larger models and more sophisticated training techniques (reinforcement learning from human feedback, chain-of-thought fine-tuning) provide for complex reasoning.

HellaSwag — Common Sense Reasoning

HellaSwag tests common-sense reasoning about everyday situations. Given the beginning of a scenario, the model must select the most plausible continuation from four options. Created by Zellers et al. in 2019, it was originally designed to be "adversarially hard" — the wrong answers are generated by a language model and filtered to be superficially plausible but nonsensical to humans.

While HellaSwag was once considered extremely difficult (GPT-2 scored below 50%), modern large language models have largely solved it. The top models score above 95%, and even smaller models score above 85%. HellaSwag remains useful as a baseline check for common-sense understanding but has lost much of its discriminative power among frontier models. Its inclusion in standard benchmark suites is partly historical — it was a key differentiator during the GPT-3 era and continues to be reported for trend comparison.

Historical Progression and Trends

The pace of benchmark improvement has been remarkable. On MMLU, the best score improved from 43.9% (GPT-3, 2020) to 86.4% (GPT-4, 2023) to 92.0% (Claude Opus 4, 2025) — a 48-point improvement in five years. On HumanEval, scores went from 26.2% (Codex, 2021) to 67.0% (GPT-4, 2023) to 92.7% (Claude Opus 4, 2025). The rate of improvement has slowed as models approach saturation on current benchmarks, but each percentage point at the top represents significantly harder problems being solved.

The most significant trend in 2025-2026 is the convergence between open-source and closed-source models. On MMLU, the gap narrowed from 15+ points in early 2024 to under 4 points by mid-2026. This convergence is driven by larger open-source training datasets, more efficient architectures (mixture-of-experts), and the open publication of training techniques by leading labs. For benchmark-sensitive applications, this convergence means that self-hosted open-source models are now viable alternatives to cloud APIs.

Limitations and Caveats

Benchmark scores should be interpreted carefully. Data contamination — where benchmark questions appear in training data — is an ongoing concern that can inflate scores by 2-5 points. Different evaluation harnesses (the software used to run benchmarks) can produce different scores for the same model due to variations in prompting format, sampling parameters, and answer extraction methods. Self-reported scores from model providers tend to be higher than independently verified scores.
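One rough way to screen for the contamination described above is to measure how many word n-grams of a benchmark item also appear verbatim in a training document. The sketch below is a crude heuristic under stated assumptions: whitespace tokenization, lowercase matching, and a default n of 8 are all illustrative choices, and the function names are hypothetical. Real contamination audits are considerably more involved.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in text (empty if text has fewer than n words)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, corpus_text, n=8):
    """Fraction of the item's n-grams that also occur in the corpus text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_text, n)) / len(item_grams)


# Toy strings for illustration only.
item = "a b c d e f g h i j"
leaked = "x y a b c d e f g h i j z"
clean = "k l m n o p q r s t u v"
print(overlap_ratio(item, leaked), overlap_ratio(item, clean))
```

A ratio near 1.0 suggests the item appears nearly verbatim in the corpus; a ratio of 0.0 only shows there is no exact n-gram match, not that the model never saw a paraphrase.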

Most importantly, benchmark scores measure narrow, well-defined tasks that may not correlate with real-world performance on your specific use case. A model that scores 2 points higher on MMLU may not produce noticeably better results for your customer support chatbot. Always complement benchmark comparisons with empirical testing on representative samples from your actual workload.

Frequently Asked Questions

What is MMLU and why is it the most cited AI benchmark?

MMLU (Massive Multitask Language Understanding) tests AI models on 14,042 multiple-choice questions across 57 academic subjects. It became the most cited benchmark because it measures broad knowledge and reasoning across diverse fields in a standardized format. A score of 90%+ indicates expert-level knowledge. Top models like Claude Opus 4 (92.0%) and Gemini 2.5 Pro (91.7%) have reached near-saturation, prompting development of harder variants like MMLU-Pro.

Which AI model scores highest on HumanEval for coding?

Claude Opus 4 leads HumanEval with a 92.7% pass@1 score, followed by Gemini 2.5 Pro at 91.2% and GPT-4o at 90.2%. Among open-source models, DeepSeek V3 scores 82.6% and Llama 4 Scout reaches 80.5%. HumanEval tests Python code generation with 164 programming challenges.

How do open-source AI models compare to closed-source on benchmarks?

The gap has narrowed dramatically by 2026. On MMLU, the best open-source model (Llama 4 Scout at 88.5%) trails the best closed-source (Claude Opus 4 at 92.0%) by only 3.5 points. The remaining gap is largest on complex reasoning tasks (MATH) where closed-source models maintain a 10-15 point lead.

What are the limitations of AI benchmarks?

Key limitations include data contamination (models may have seen questions during training), benchmark saturation (when top models all score 90%+), gaming (optimization for benchmarks without real-world improvement), narrow measurement (single benchmarks do not capture full capability), and static snapshots that do not reflect performance on novel tasks. Benchmark scores should be one factor among many in model selection.

Can I download the benchmark data?

Yes, use the "Download JSON" or "Download CSV" buttons above the chart to export the complete benchmark dataset. The data includes all 12 models with MMLU, HumanEval, GSM8K, MATH, and HellaSwag scores, plus provider and release date. Export is generated client-side with no server interaction.
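The export described above is produced in the browser by the page itself. The sketch below only illustrates the same idea, serializing one set of rows to both JSON and CSV, using Python's standard library. The column names are assumptions, since the actual export schema is not documented here, and the rows use a subset of the scores quoted on this page.

```python
import csv
import io
import json

# Hypothetical column names; values are scores quoted elsewhere on this page.
rows = [
    {"model": "Claude Opus 4", "mmlu": 92.0, "humaneval": 92.7, "gsm8k": 97.2},
    {"model": "GPT-4o", "mmlu": 90.2, "humaneval": 90.2, "gsm8k": 95.5},
]


def to_json(rows):
    """Serialize the rows as a pretty-printed JSON array."""
    return json.dumps(rows, indent=2)


def to_csv(rows):
    """Serialize the rows as CSV with a header taken from the first row's keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


json_text = to_json(rows)
csv_text = to_csv(rows)
print(csv_text)
```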

About the Author

Built by Michael Lip, a solo developer with 10+ years of experience. 140+ PRs merged into open source projects including Google Chrome and Axios. Creator of 20+ developer tools across the Zovo network. No tracking, no ads, no data collection.