Open-Source LLM Leaderboard

Interactive rankings of 20+ open-source language models across MMLU, HumanEval, GSM8K, MATH, MT-Bench, and ARC benchmarks. Filter by model size, license, and specific benchmarks. Compare any two models head-to-head. Includes closed-model reference scores for context. Updated May 2026.

Understanding the Open-Source LLM Landscape in 2026

The open-source AI model ecosystem has matured dramatically. Models like Llama 3.3 70B now match GPT-4-level performance on many benchmarks, while smaller models (7-8B parameters) can run on consumer hardware with competitive quality for focused tasks. This leaderboard tracks the most important open-source models and provides context by including closed-model reference scores. Understanding benchmark scores helps you choose the right model for your specific deployment constraints — GPU budget, latency requirements, and task focus.

Benchmark Descriptions

MMLU (Massive Multitask Language Understanding)

Tests knowledge with multiple-choice questions across 57 subjects, from STEM to humanities, ranging in difficulty from elementary to professional level. A strong MMLU score indicates broad general knowledge. Scores above 80 are considered strong; above 85 is state-of-the-art for open models. MMLU is the most cited overall capability benchmark but does not measure reasoning depth or creative abilities.

HumanEval

Evaluates code generation by asking models to implement Python functions from docstrings; each solution is checked against unit tests. The score represents pass@1: the percentage of problems solved correctly on the first attempt. Scores above 70 indicate strong coding ability. HumanEval is particularly relevant for code assistant applications, although its short, self-contained problems capture only part of practical programming capability.
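To make the pass@1 figure concrete, here is a minimal sketch of the unbiased pass@k estimator commonly used with HumanEval. How this leaderboard's scores were actually computed is not stated, so treat the function names and the input format (one `(n, c)` pair per problem) as assumptions for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples generated, c of them passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every possible size-k subset of the samples contains a passing one.
        return 1.0
    # 1 minus the probability that k samples drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list, k: int) -> float:
    """Mean pass@k over all problems, as a percentage. `results` holds (n, c) pairs."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to c / n per problem, so a model that solves half of the problems on its single attempt scores 50.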

GSM8K

Grade School Math: about 8,500 math word problems requiring multi-step arithmetic reasoning. Tests the model's ability to break down and solve structured mathematical problems. Scores above 85 are strong. GSM8K performance tends to correlate with general reasoning ability beyond just math.
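GSM8K is usually scored by comparing the model's final number with the reference answer, which in the dataset follows a `####` marker at the end of each solution. The sketch below shows one way to do that. The "last number in the output" heuristic and all function names are assumptions; real evaluation harnesses differ in how they extract the prediction.

```python
import re
from typing import Optional

# Matches integers and decimals, optionally negative, with thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def reference_answer(solution: str) -> str:
    """Return the final answer that follows '####' in a GSM8K reference solution."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(model_output: str) -> Optional[str]:
    """Heuristic (an assumption): treat the last number in the output as the answer."""
    matches = _NUMBER.findall(model_output)
    return matches[-1].replace(",", "") if matches else None


def is_correct(model_output: str, solution: str) -> bool:
    """Numeric comparison, so that '18' and '18.0' count as the same answer."""
    pred = predicted_answer(model_output)
    return pred is not None and float(pred) == float(reference_answer(solution))
```

The benchmark score is then the percentage of problems for which `is_correct` returns true.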

MATH

Competition-level mathematics covering algebra, geometry, number theory, counting and probability, and precalculus. Significantly harder than GSM8K: problems require genuine mathematical reasoning rather than pattern matching. Scores above 50 are considered strong for open models. MATH scores have the highest variance across models, making it a good differentiator.

MT-Bench

Multi-turn conversation benchmark scored by GPT-4 as a judge on a 1-10 scale. Tests the model's ability to maintain coherent, helpful conversations across follow-up questions. Scores above 8.0 indicate strong conversational ability. MT-Bench is the most relevant benchmark for chatbot and assistant applications.

ARC (AI2 Reasoning Challenge)

Science reasoning questions at grade-school level. Tests common-sense and scientific reasoning rather than factual recall. ARC-Challenge (the harder subset) is the standard variant. Scores above 85 are strong. ARC complements MMLU by testing reasoning rather than knowledge.

Model Size vs Performance Tradeoffs

Model size affects both capability and deployment cost. A 70B model in 16-bit precision requires 2x A100 80GB GPUs (~$4-6/hr), or 1x A100 80GB with 4-bit quantization (~$2/hr). An 8B model runs on a single A10G (~$0.50/hr) or even consumer GPUs. The quality gap between 8B and 70B models is typically 5-15% on benchmarks, which is significant for complex reasoning but negligible for simpler tasks like classification, extraction, and formatting. For production, the optimal strategy is often to use a small model for most requests and route complex requests to a larger model.
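The small-model-first routing strategy could look roughly like the sketch below. The model names, the set of "simple" task types, and the prompt-length threshold are all hypothetical placeholders; a real router would be tuned on your own traffic and quality measurements.

```python
from dataclasses import dataclass

# Task types the text describes as showing a negligible 8B vs 70B gap.
SIMPLE_TASKS = {"classification", "extraction", "formatting"}


@dataclass
class Route:
    model: str   # hypothetical deployment name, not a real endpoint
    reason: str


def choose_model(task_type: str, prompt: str, max_small_prompt_chars: int = 4000) -> Route:
    """Send simple, short requests to the small model and everything else to the large one."""
    if task_type in SIMPLE_TASKS and len(prompt) <= max_small_prompt_chars:
        return Route("small-8b", "simple task and short prompt")
    return Route("large-70b", "complex task or long prompt")
```

For example, `choose_model("classification", "Is this review positive?")` selects the small model, while any task type outside `SIMPLE_TASKS` falls through to the large one.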

License Considerations

Model licenses affect what you can do in production. Apache 2.0 and MIT licenses (Mistral 7B, some Qwen models) allow commercial use with minimal conditions, mainly preserving the license and copyright notices. The Llama license permits commercial use, but it requires accepting Meta's terms and obtaining a separate license from Meta for products with more than 700 million monthly active users. Some models use custom licenses that restrict specific use cases (medical, legal) or require attribution. Always verify the license before deploying in production: license violations can result in legal liability regardless of whether the model is "open source."

Frequently Asked Questions

What is the best open-source LLM in 2026?

Llama 3.3 70B leads on most general benchmarks. DeepSeek V3 offers the best quality-to-cost ratio. Qwen 2.5 72B excels at multilingual tasks. For smaller models, Llama 3.1 8B and Mistral 7B lead.

How do open-source LLMs compare to GPT-4o and Claude?

The gap has narrowed to 3-5% on coding and math benchmarks. Closed models still lead on complex reasoning and safety. For many production tasks, open-source models are functionally equivalent at 5-20x lower cost when self-hosted.

What GPU do I need to run a 70B parameter model?

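Float16: 2x A100 80GB ($4-6/hr). 4-bit quantized: 1x A100 80GB ($2/hr). The 4-bit weights alone take roughly 35 GB, so a single 24 GB RTX 4090 only works with part of the model offloaded to CPU memory, which is slow. Use vLLM or TGI for production serving.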
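The GPU counts above follow from simple arithmetic on weight size: parameters times bits per parameter, divided by 8 to get bytes. The sketch below covers the weights only; the KV cache and activations need additional memory on top, so treat the result as a lower bound. The function names are illustrative.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8


def weights_fit(params_billions: float, bits_per_param: int,
                gpu_vram_gb: float, num_gpus: int = 1) -> bool:
    """True if the weights fit in the combined VRAM; ignores KV cache and activations."""
    return weight_memory_gb(params_billions, bits_per_param) <= gpu_vram_gb * num_gpus


if __name__ == "__main__":
    configs = [
        ("70B float16 on 2x 80 GB", 70, 16, 80, 2),
        ("70B 4-bit on 1x 80 GB", 70, 4, 80, 1),
        ("70B 4-bit on 1x 24 GB", 70, 4, 24, 1),
    ]
    for label, params, bits, vram, gpus in configs:
        size = weight_memory_gb(params, bits)
        print(f"{label}: {size:.0f} GB of weights, fits={weights_fit(params, bits, vram, gpus)}")
```

A 70B model comes to 140 GB of weights at 16 bits and 35 GB at 4 bits, which is why the float16 setup needs two 80 GB cards and the 4-bit weights do not fit in 24 GB.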

What do the different benchmarks measure?

MMLU: broad knowledge. HumanEval: code generation. GSM8K: math reasoning. MATH: advanced math. MT-Bench: conversation quality. ARC: scientific reasoning. No single benchmark predicts overall quality.

Are open-source LLMs safe to use in production?

They require more safety engineering than commercial APIs. Add output filters, content moderation, guardrails frameworks, and domain-specific red-teaming. This takes more effort but gives you more control over safety behavior.

About the Author

Built by Michael Lip, a solo developer with 10+ years of experience. 140+ PRs merged into open source projects including Google Chrome and Axios. Creator of 20+ developer tools across the Zovo network. No tracking, no ads, no data collection.