See How the Top LLMs Stack Up

LLM Leaderboard

Knowing which LLMs perform best is key to building great AI applications, but evaluating them is a daunting task.

MMLU Benchmarks

Comprehensive testing across 57 subjects including mathematics, history, law, and medicine to evaluate LLM knowledge breadth.

HumanEval+

An extended version of HumanEval that augments each programming problem with many more test cases, catching code that passes the original checks but is subtly incorrect.

GPQA Evaluation

Graduate-level, Google-proof question answering designed to test expert reasoning in specialized scientific domains such as biology, physics, and chemistry.

MT-Bench Analysis

Multi-turn benchmarking that evaluates conversation abilities, reasoning, and instruction following across complex dialogues.

SWE Benchmarks

Software engineering tests including code generation, debugging, and algorithm design to measure programming capabilities.

GSM8K Reasoning

Grade school math word problems requiring multi-step reasoning to evaluate logical thinking and problem-solving capabilities.
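
To make these scores concrete, here is a minimal sketch of how an MMLU-style multiple-choice item can be scored: the model is asked for a letter and accuracy is averaged over questions. The sample question and the ask_model stub are illustrative placeholders, not any benchmark's official harness.

```python
# Minimal sketch of MMLU-style multiple-choice scoring. The sample
# question and the ask_model() stub are illustrative assumptions,
# not any benchmark's official harness.

QUESTIONS = [
    {
        "question": "Which organ produces insulin?",
        "choices": {"A": "Liver", "B": "Pancreas", "C": "Kidney", "D": "Spleen"},
        "answer": "B",
    },
]

def format_prompt(q: dict) -> str:
    lines = [q["question"]]
    lines += [f"{key}. {text}" for key, text in q["choices"].items()]
    lines.append("Answer with a single letter (A-D).")
    return "\n".join(lines)

def ask_model(prompt: str) -> str:
    # Placeholder: call the LLM under evaluation and return its answer.
    return "B"

def accuracy(questions: list[dict]) -> float:
    correct = sum(
        ask_model(format_prompt(q)).strip().upper().startswith(q["answer"])
        for q in questions
    )
    return correct / len(questions)

print(f"Accuracy: {accuracy(QUESTIONS):.1%}")
```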

Top models

The best LLMs in the world, sorted by price, context size, and parameters.

Top LLM Models by MMLU Score

The top LLMs ranked by MMLU score, measuring breadth of knowledge and reasoning across 57 subjects.

MMLU measures general knowledge and reasoning

Model                  MMLU Score
Gemini 3 Pro           89.8%
Gemini 3.1 Flash Lite  89.2%
Claude 4.5 Opus        88.9%
Gemini 3 Flash         88.2%
Seed 2.0 Lite          87.7%

Fastest LLM Models by Throughput

The fastest LLMs ranked by tokens processed per second, measuring raw processing speed and efficiency.

Tokens processed per second - higher is better

Model                  Throughput (tokens/s)  Inference Speed (ms/token)  Latency (ms)  Provider
Mercury 2              870.9                  0                           3.67          Inception
Granite 4.0 Small      448.8                  0                           0.55          IBM
Nemotron 3 Super 120B  390.7                  0                           0.7           NVIDIA
GPT-OSS 120B           319.541                0                           0.47          OpenAI
GPT-OSS 20B            297.704                0                           0.503         OpenAI

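For intuition about these numbers, throughput and time-to-first-token can be approximated client-side by timing a streaming response. A minimal Python sketch, assuming an OpenAI-compatible SDK; the model name is a placeholder and streamed chunks stand in for tokens, so this is not the leaderboard's actual measurement methodology.

```python
# Rough client-side throughput/latency measurement for a streaming chat
# completion. Assumes an OpenAI-compatible endpoint and SDK; the model
# name is a placeholder, and streamed chunks approximate tokens.
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def measure(model: str, prompt: str) -> dict:
    start = time.perf_counter()
    first_chunk_at = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_chunk_at is None:
                first_chunk_at = time.perf_counter()  # time to first token
            chunks += 1
    end = time.perf_counter()
    gen_time = end - (first_chunk_at or start)
    return {
        "latency_s": (first_chunk_at or end) - start,
        "throughput_tok_s": chunks / gen_time if gen_time > 0 else 0.0,
    }

print(measure("gpt-oss-20b", "Explain tokenization in one paragraph."))
```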

Most Cost-Effective LLM Models

The most affordable LLMs ranked by cost per token, helping you optimize your budget without compromising quality.

Price per million tokens - lower is better

Model                  Input Price ($/M)  Output Price ($/M)  Effective Price ($/M)  Provider
Gemma 3n E4B           0.03               0.06                0.037                  Google
Llama 3.2 1B Instruct  0.053              0.055               0.053                  Meta
Command R7B            0.0375             0.15                0.066                  Cohere
Granite 4.0 Small      0.05               0.15                0.075                  IBM
Qwen 3.5 9B            0.05               0.15                0.075                  Qwen
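
The effective prices above are consistent with a blended rate that assumes roughly three input tokens for every output token; that ratio is inferred from the numbers, not a documented formula. A quick sketch of the arithmetic:

```python
def effective_price(input_price: float, output_price: float,
                    input_share: float = 0.75) -> float:
    """Blended $/M tokens for a workload with the given input-token share.

    input_share=0.75 (a 3:1 input:output mix) is an assumption inferred
    from the table above, not a documented formula.
    """
    return input_share * input_price + (1 - input_share) * output_price

# Command R7B: 0.75 * 0.0375 + 0.25 * 0.15 = 0.065625, i.e. ~$0.066/M
print(round(effective_price(0.0375, 0.15), 3))
```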

Longest Context Window

Maximum number of tokens a model can process in a single input

While tokenization varies between models, on average, 1 token ≈ 3.5 characters in English

Model         Context Window (tokens)
Grok 4 Fast   2,000,000
Grok 4 Heavy  2,000,000
Grok 4.1      2,000,000
Grok 4.20     2,000,000
GPT 4.1       1,280,000
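
The 3.5-characters-per-token figure is only a rule of thumb, but it makes back-of-the-envelope sizing easy. A minimal sketch; for real counts, use the model's own tokenizer:

```python
CHARS_PER_TOKEN = 3.5  # rough English average; real tokenizers vary

def estimate_tokens(text: str) -> int:
    """Back-of-the-envelope estimate; use the model's tokenizer for accuracy."""
    return max(1, round(len(text) / CHARS_PER_TOKEN))

# A 2,000,000-token window fits roughly 7 million English characters:
print(f"{int(2_000_000 * CHARS_PER_TOKEN):,} characters")
```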

Token Generation Speed

Observe how different processing speeds affect real-time token generation. The interactive demo streams the same sample passage at 1200 t/s, 200 t/s, and 40 t/s, resetting every 5 seconds to demonstrate the different speeds.

Sample passage: "The quick brown fox jumps over the lazy dog. Meanwhile, a clever rabbit watches from nearby bushes, intrigued by the scene unfolding before its eyes. The fox continues its playful pursuit, demonstrating remarkable agility and grace in motion. As the sun sets on the horizon, the forest comes alive with the sounds of nature, creating a symphony of rustling leaves and gentle breezes. The fox pauses, alert to these changes, its ears perked up to catch every subtle noise in the surroundings."
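
The same effect can be reproduced offline with a few lines of Python that emit words at a fixed rate; words stand in for tokens here, which is an approximation since real tokens are subword units.

```python
import sys
import time

def stream_text(text: str, tokens_per_second: float) -> None:
    """Emit whitespace-delimited words at a fixed rate to mimic streaming.

    Words stand in for tokens, so the pacing is only approximate.
    """
    delay = 1.0 / tokens_per_second
    for word in text.split():
        sys.stdout.write(word + " ")
        sys.stdout.flush()
        time.sleep(delay)
    print()

stream_text("The quick brown fox jumps over the lazy dog.", 40)
```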

Compare LLM Models

Compare any two LLM models side by side across different metrics, including MMLU, GPQA, HumanEval, DROP, Context Size, Parameters, Input Price, Output Price, Inference Speed, Throughput, and Latency.

Metric                 Claude 3.5 Haiku  Claude 3.7 Sonnet
Provider               Anthropic         Anthropic
MMLU Score             63.4%             80.3%
GPQA Score             40.8%             65.6%
Context Size           200,000           200,000
Parameters             N/A               N/A
Input Price ($/M)      0.8               3
Throughput (tokens/s)  49.093            N/A
Latency                0.689             N/A
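
A side-by-side comparison like this is straightforward to script once the metrics live in a small data structure. A minimal sketch using the two rows above, with None standing in for N/A; this is illustrative, not the site's implementation.

```python
# Records transcribed from the comparison table; None marks N/A values.
MODELS = {
    "Claude 3.5 Haiku": {"MMLU %": 63.4, "GPQA %": 40.8, "Context": 200_000,
                         "Input $/M": 0.8, "Throughput": 49.093, "Latency": 0.689},
    "Claude 3.7 Sonnet": {"MMLU %": 80.3, "GPQA %": 65.6, "Context": 200_000,
                          "Input $/M": 3.0, "Throughput": None, "Latency": None},
}

def compare(a: str, b: str) -> None:
    fmt = lambda v: "N/A" if v is None else f"{v:,}"
    width = max(len(k) for k in MODELS[a])
    print(f"{'Metric':<{width}}  {a:>17}  {b:>17}")
    for metric in MODELS[a]:
        print(f"{metric:<{width}}  {fmt(MODELS[a][metric]):>17}  "
              f"{fmt(MODELS[b][metric]):>17}")

compare("Claude 3.5 Haiku", "Claude 3.7 Sonnet")
```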

Get started

Let’s Build AI Agents, Together

Book a demo to see how AI agents can help your team process unstructured documents and perform complex analysis faster and more accurately.
