Context Arena

Legend Interaction: Single-click to toggle individual models. Double-click to toggle all models from the same provider.

| Model | Max Ctx | 8k (%) | 16k (%) | 32k (%) | 64k (%) | 128k (%) | 256k (%) | 512k (%) | 1M (%) | AUC @128k (%) | AUC @1M (%) | Model AUC (%) | Runs | Total Prompt Cost ($) | Total Compl Cost ($) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| google/gemini-2.5-pro-preview-03-25 | 1,048,576 | #3 98.7% | #3 94.3% | #3 89.0% | #2 88.9% | #3 84.1% | #2 81.1% | #1 72.7% | #1 60.1% | #3 88.5% | #1 73.7% | #1 73.7% | 6,320 | #12 $3,013.5705 | #14 $95.5129 |
| google/gemini-2.5-pro-preview | 1,048,576 | #2 99.5% | #1 97.1% | #1 94.4% | #1 93.9% | #2 84.3% | #1 83.7% | #3 71.0% | #3 54.1% | #1 91.9% | #2 72.5% | #3 72.5% | 6,320 | #12 $3,013.5705 | #13 $26.5851 |
| google/gemini-2.5-flash-preview:thinking | 1,048,576 | #4 98.3% | #2 95.4% | #2 90.2% | #3 87.4% | #1 87.5% | #3 76.6% | #2 71.6% | #2 57.9% | #2 89.2% | #3 72.2% | #4 72.2% | 6,320 | #9 $180.8142 | #12 $23.7270 |
| google/gemini-2.5-flash-preview | 1,048,576 | #8 91.7% | #7 84.9% | #7 77.1% | #5 76.2% | #4 78.1% | #4 69.9% | #4 59.1% | #4 53.5% | #4 78.3% | #4 63.2% | #5 63.2% | 6,320 | #9 $180.8142 | #9 $1.5496 |
| openai/gpt-4.1 | 1,047,576 | #10 87.6% | #13 68.0% | #9 70.4% | #10 57.5% | #10 55.5% | #6 57.1% | #5 55.6% | #5 42.1% | #9 61.6% | #5 53.2% | #10 53.2% | 6,384 | #11 $2,468.4629 | #11 $19.7168 |
| google/gemini-flash-1.5 | 1,000,000 | #28 58.6% | #19 59.3% | #20 54.9% | #9 61.6% | #13 52.0% | #9 44.8% | #8 45.8% | #6 34.9% | #10 57.4% | #6 44.6% | #18 44.6% | 6,208 | #8 $162.7636 | #8 $1.5166 |
| openai/gpt-4.1-mini | 1,047,576 | #18 73.4% | #22 57.2% | #15 58.2% | #21 42.6% | #17 47.2% | #8 45.1% | #7 47.7% | #7 32.5% | #20 49.4% | #7 43.6% | #19 43.6% | 6,384 | #10 $493.6926 | #10 $3.9214 |
| meta-llama/llama-4-maverick | 1,048,576 | #16 80.4% | #17 60.0% | #17 56.4% | #16 48.4% | #15 50.3% | #7 48.6% | #9 36.2% | #8 30.1% | #16 52.7% | #8 39.9% | #24 39.9% | 6,024 | #7 $154.5747 | #7 $1.4152 |
| google/gemini-2.0-flash-001 | 1,048,576 | #23 65.6% | #18 59.9% | #14 59.6% | #13 49.6% | #6 60.1% | #12 37.2% | #10 29.9% | #11 13.3% | #11 56.0% | #9 32.1% | #29 32.1% | 6,224 | #5 $110.0136 | #6 $0.9748 |
| minimax/minimax-01 | 1,000,192 | #13 84.3% | #8 78.5% | #8 74.5% | #8 67.0% | #5 71.9% | #5 61.4% | #6 53.8% | N/A | #8 71.6% | #10 31.4% | #6 63.2% | 5,224 | inc $90.9933 | inc $2.0702 |
| openai/gpt-4.1-nano | 1,047,576 | #36 47.0% | #30 48.5% | #28 49.8% | #23 41.3% | #22 36.4% | #20 20.7% | #11 26.7% | #12 13.0% | #25 42.6% | #11 24.6% | #33 24.6% | 6,384 | #6 $123.4231 | #5 $0.8169 |
| google/gemini-2.0-flash-lite-001 | 1,048,576 | #32 56.1% | #29 48.6% | #30 46.6% | #31 30.3% | #20 38.6% | #16 29.1% | #12 19.7% | #9 17.0% | #29 38.4% | #12 24.2% | #34 24.2% | 6,224 | #4 $82.5102 | #3 $0.5253 |
| meta-llama/llama-4-scout | 1,048,576 | #24 64.0% | #27 53.9% | #31 44.6% | #29 31.9% | #29 33.1% | #21 20.7% | #13 17.6% | #10 13.4% | #30 38.1% | #13 20.5% | #37 20.5% | 5,702 | #2 $55.0137 | #4 $0.6192 |
| google/gemini-flash-1.5-8b | 1,000,000 | #39 43.6% | #36 42.8% | #39 29.5% | #34 27.9% | #32 27.0% | #22 16.2% | #14 17.0% | #13 12.1% | #33 30.0% | #14 17.8% | #39 17.8% | 6,208 | #3 $81.4626 | #2 $0.4052 |
| openai/o4-mini | 200,000 | #6 96.4% | #6 88.9% | #4 86.6% | #4 78.2% | #9 57.4% | #11 37.8% | N/A | N/A | #5 76.0% | #15 15.0% | #7 61.3% | 4,448 | inc $247.9571 | inc $37.2230 |
| openai/o3 | 200,000 | #5 97.5% | #4 91.4% | #6 85.4% | #6 71.2% | #11 55.4% | #14 32.8% | N/A | N/A | #7 72.7% | #16 14.1% | #8 57.9% | 4,448 | inc $2,254.1553 | inc $258.8007 |
| anthropic/claude-3-haiku | 200,000 | #22 67.5% | #26 54.3% | #21 54.5% | #18 46.7% | #8 58.6% | #10 38.0% | N/A | N/A | #15 52.9% | #17 12.3% | #11 50.5% | 4,176 | inc $49.5406 | inc $2.2391 |
| anthropic/claude-3.7-sonnet | 200,000 | #11 85.5% | #14 67.6% | #11 62.4% | #12 50.4% | #14 51.1% | #13 36.5% | N/A | N/A | #12 55.9% | #18 12.1% | #13 49.6% | 4,176 | inc $594.4813 | inc $27.8910 |
| anthropic/claude-3.7-sonnet:thinking | 200,000 | #12 85.1% | #16 65.4% | #12 62.3% | #15 49.0% | #12 52.9% | #17 28.4% | N/A | N/A | #13 55.5% | #19 11.7% | #14 47.8% | 4,176 | inc $594.5065 | inc $33.5946 |
| anthropic/claude-3.5-sonnet | 200,000 | #9 89.7% | #15 66.6% | #19 54.9% | #19 45.8% | #19 46.1% | #15 31.5% | N/A | N/A | #17 51.2% | #20 10.9% | #17 44.8% | 4,176 | inc $594.4836 | inc $27.3120 |
| anthropic/claude-3.5-haiku | 200,000 | #21 67.8% | #25 55.2% | #23 51.5% | #14 49.1% | #18 46.2% | #18 22.9% | N/A | N/A | #19 50.0% | #21 10.3% | #21 42.0% | 4,120 | inc $155.2502 | inc $7.3357 |
| openai/o3-mini | 200,000 | #19 72.6% | #23 56.0% | #22 52.6% | #24 40.1% | #25 34.7% | #19 21.9% | N/A | N/A | #23 43.8% | #22 8.7% | #27 35.8% | 4,456 | inc $248.4745 | inc $47.7402 |
| x-ai/grok-3-beta | 131,072 | #1 99.5% | #5 90.0% | #5 85.9% | #7 68.7% | #7 60.0% | N/A | N/A | N/A | #6 73.0% | #23 8.6% | #2 73.0% | 4,000 | inc $436.6788 | inc $23.1833 |
| x-ai/grok-3-mini-beta | 131,072 | #15 82.8% | #9 76.9% | #10 67.7% | #11 52.5% | #21 37.0% | N/A | N/A | N/A | #14 54.8% | #24 6.5% | #9 54.8% | 3,976 | inc $44.2934 | inc $2.8246 |
| qwen/qwen-turbo | 1,000,000 | #37 46.0% | #39 35.5% | #38 30.8% | #35 23.6% | #37 14.2% | #24 1.8% | #15 3.3% | #14 4.2% | #37 24.5% | #25 6.4% | #44 6.4% | 5,984 | #1 $44.3817 | #1 $0.1932 |
| x-ai/grok-3-mini-beta:high | 131,072 | #17 75.8% | #12 69.8% | #13 61.3% | #17 48.0% | #24 34.8% | N/A | N/A | N/A | #18 50.3% | #26 5.9% | #12 50.3% | 4,000 | inc $44.4349 | inc $4.8529 |
| mistralai/mistral-small-3.1-24b-instruct | 131,072 | #27 60.0% | #21 58.7% | #16 57.2% | #25 38.5% | #16 48.7% | N/A | N/A | N/A | #21 47.7% | #27 5.6% | #15 47.7% | 3,728 | inc $6.2525 | inc $0.2265 |
| qwen/qwen3-14b | 128,000 | #26 62.6% | #20 58.8% | #26 50.9% | #20 43.9% | #26 34.5% | N/A | N/A | N/A | #22 44.9% | #28 5.3% | #16 44.9% | 3,416 | inc $6.3779 | inc $0.8426 |
| mistralai/mistral-medium-3 | 131,072 | #33 55.7% | #28 53.1% | #27 50.2% | #22 42.5% | #28 33.2% | N/A | N/A | N/A | #24 43.0% | #29 5.1% | #20 43.0% | 3,936 | inc $59.9857 | inc $3.1361 |
| deepseek/deepseek-r1 | 163,840 | #7 93.7% | #10 75.0% | #18 55.8% | #27 34.9% | #35 18.0% | N/A | N/A | N/A | #26 40.6% | #30 4.8% | #22 40.6% | 3,952 | inc $72.7609 | inc $9.2225 |
| google/gemma-3-12b-it | 131,072 | #35 49.4% | #32 48.3% | #25 51.2% | #28 34.6% | #23 35.0% | N/A | N/A | N/A | #27 39.9% | #31 4.7% | #23 39.9% | 3,664 | inc $5.9463 | inc $0.1337 |
| qwen/qwen3-30b-a3b | 128,000 | #34 55.2% | #33 46.0% | #29 47.4% | #26 38.2% | #30 29.7% | N/A | N/A | N/A | #28 39.1% | #32 4.6% | #25 39.1% | 3,416 | inc $9.0839 | inc $1.6166 |
| qwen/qwen3-32b | 128,000 | #20 70.4% | #24 55.4% | #33 43.5% | #30 30.3% | #31 29.3% | N/A | N/A | N/A | #31 36.5% | #33 4.3% | #26 36.5% | 3,416 | inc $9.1047 | inc $1.4330 |
| deepseek/deepseek-chat-v3-0324 | 163,840 | #29 58.4% | #34 44.3% | #34 38.8% | #32 29.4% | #27 33.8% | N/A | N/A | N/A | #32 34.9% | #34 4.1% | #28 34.9% | 3,960 | inc $43.7192 | inc $1.2682 |
| qwen/qwen3-235b-a22b | 128,000 | #14 83.7% | #11 70.5% | #35 36.8% | #37 21.0% | #36 14.9% | N/A | N/A | N/A | #34 29.6% | #35 3.5% | #30 29.6% | 3,416 | inc $18.2226 | inc $2.7821 |
| google/gemma-3-27b-it | 131,072 | #44 30.9% | #43 25.3% | #37 33.6% | #33 28.0% | #33 25.1% | N/A | N/A | N/A | #35 28.2% | #36 3.3% | #31 28.2% | 3,712 | inc $12.0846 | inc $0.3219 |
| qwen/qwen3-8b | 128,000 | #31 56.5% | #37 40.4% | #36 34.4% | #36 22.3% | #34 21.3% | N/A | N/A | N/A | #36 27.4% | #37 3.2% | #32 27.4% | 3,416 | inc $3.1890 | inc $0.5640 |
| mistralai/mistral-large-2411 | 131,072 | #25 63.8% | #31 48.3% | #24 51.3% | #38 17.2% | #44 0.0% | N/A | N/A | N/A | #38 24.1% | #38 2.8% | #35 24.1% | 3,824 | inc $304.2297 | inc $9.4407 |
| amazon/nova-lite-v1 | 300,000 | #30 57.4% | #38 37.2% | #40 24.9% | #41 14.2% | #41 6.0% | #23 3.4% | N/A | N/A | #41 17.9% | #39 2.7% | #42 11.1% | 4,608 | inc $25.0529 | inc $0.2342 |
| mistralai/ministral-8b | 128,000 | #40 40.6% | #35 42.8% | #32 44.6% | #39 15.8% | #40 7.2% | N/A | N/A | N/A | #39 22.8% | #40 2.7% | #36 22.8% | 3,712 | inc $12.2571 | inc $0.1293 |
| qwen/qwen3-4b:free | 128,000 | #38 45.5% | #41 35.2% | #43 17.9% | #40 15.7% | #38 11.7% | N/A | N/A | N/A | #40 18.0% | #41 2.1% | #38 18.0% | 3,416 | inc $0.0000 | inc $0.0000 |
| google/gemma-3-4b-it | 131,072 | #43 35.6% | #42 27.8% | #42 20.5% | #42 10.6% | #39 8.1% | N/A | N/A | N/A | #42 14.5% | #42 1.7% | #40 14.5% | 3,688 | inc $2.4073 | inc $0.0343 |
| mistralai/ministral-3b | 131,072 | #42 35.8% | #40 35.3% | #41 24.6% | #44 6.9% | #42 5.4% | N/A | N/A | N/A | #43 13.9% | #43 1.6% | #41 13.9% | 3,728 | inc $5.0020 | inc $0.0360 |
| amazon/nova-micro-v1 | 128,000 | #41 39.7% | #44 19.2% | #44 12.4% | #43 7.2% | #43 4.0% | N/A | N/A | N/A | #44 9.7% | #44 1.1% | #43 9.7% | 3,448 | inc $4.5526 | inc $0.0811 |

Table Interaction: Click column headers to sort. Refresh icon (refresh, Model header): reload data. Timeline icon (timeline, Model cell): view that model's performance chart. Stats icon (query_stats, bin header): view the Cost/Score plot. Document icon (article, score cell): view test details.

Notes & Definitions:

  1. N/A indicates no results for this model/bin, potentially due to context window limits.
  2. AUC (Area Under Curve), normalized to a 0-100% scale, summarizes overall performance across bins, weighted by bin width. Model AUC is normalized only over the context bins within the model's max context length.
  3. Badge definitions:
    • #1 #2 #3: Top 3 models for that metric (higher score / lower cost is better).
    • inc: Incomplete cost data (potentially underestimated cost, excluded from cost rank).
    • FREE: Free model or cost data unavailable (excluded from cost rank).

Benchmark Details & Methodology

1. What is the OpenAI-MRCR benchmark?

OpenAI MRCR tests a Large Language Model's (LLM) ability to handle complex conversational history. Key aspects include:

  • Core Task: Finding and distinguishing between multiple identical pieces of information ("needles") hidden within a long conversation ("haystack").
  • Setup: Inspired by Google DeepMind's MRCR eval (arxiv:2409.12640v2), this version inserts 2, 4, or 8 identical requests (e.g., "write a poem about tapirs") alongside distractor requests. Needles/distractors are generated by GPT-4o to blend in.
  • Challenge: The model must retrieve a specific instance (e.g., the 2nd poem) based on its order, requiring careful tracking of the conversation. It must also prepend a specific random code (hash) to its answer.
  • Data Source: The benchmark data and detailed methodology are described on Hugging Face (openai/mrcr).
  • Dashboard Scope: This dashboard visualizes results directly from that published dataset and does not currently run new evaluations.
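
To make the task concrete, here is a minimal, purely illustrative sketch of what one MRCR-style test item could look like as a chat transcript. The turn contents, needle placement, and field names below are assumptions for illustration only, not the published dataset's actual schema.

```python
# Illustrative 2-needle item: two identical "poem about tapirs" requests are
# buried among distractor requests, and the final instruction asks for the
# 2nd instance, prefixed with a random hash.
conversation = [
    {"role": "user", "content": "Write a blog post about rock climbing."},
    {"role": "assistant", "content": "...blog post..."},
    {"role": "user", "content": "Write a poem about tapirs."},      # needle 1
    {"role": "assistant", "content": "...first tapir poem..."},
    {"role": "user", "content": "Write a riddle about clocks."},    # distractor
    {"role": "assistant", "content": "...riddle..."},
    {"role": "user", "content": "Write a poem about tapirs."},      # needle 2
    {"role": "assistant", "content": "...second tapir poem..."},
    # ...many more turns to pad the conversation to the target length...
    {"role": "user", "content": "Prepend a1b2c3 to the 2nd poem about tapirs."},
]
# A correct answer begins with the hash: "a1b2c3 ...second tapir poem..."
```
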
2. How is the score calculated?

The score measures how accurately the model retrieves the correct instance of the requested needle. The process involves:

  • Comparison Method: The model's answer is compared to the expected answer using the SequenceMatcher ratio from Python's `difflib` library.
  • Hash Requirement: The model *must* include a specific random code (hash identifier) at the start of its answer. This hash is removed before the comparison.
  • Failure Condition: If the required hash is missing or incorrect, the score is automatically 0.
  • Result: The similarity ratio (0.0 to 1.0) from the comparison is presented as a percentage (0-100%) in the table.
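
A minimal sketch of this scoring rule in Python, assuming the response and expected answer are plain strings (it mirrors the description above, not the benchmark's exact implementation):

```python
from difflib import SequenceMatcher

def grade(response: str, expected_answer: str, required_hash: str) -> float:
    """Score one model answer following the rules above (illustrative sketch)."""
    # Failure condition: the answer must start with the required random hash.
    if not response.startswith(required_hash):
        return 0.0
    # The hash is removed before the comparison.
    stripped = response[len(required_hash):].strip()
    # SequenceMatcher ratio is in [0.0, 1.0]; the table displays it as 0-100%.
    return SequenceMatcher(None, stripped, expected_answer).ratio()
```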

See FAQ #6 for details about how Area Under Curve (AUC) scores summarize performance across different context lengths.

(Source: OpenAI MRCR Dataset Card)

3. What do the context length "bins" (e.g., 128k, 1M) mean?

The "bins" group test runs based on the total length of the text involved (prompt + expected answer). Here's how they work:

  • Measurement: Length is measured in tokens using the `o200k_base` tokenizer.
  • Grouping: Tests are grouped into bins based on their total token count. For example, the "128k" bin includes tests with >65,536 and <=131,072 tokens.
  • Score Display: The score shown for a bin (e.g., "128k (%)") is the average score from the 100 test samples conducted within that bin's length range.
  • Boundaries: The specific bin boundaries are [4k, 8k], (8k, 16k], (16k, 32k], (32k, 65k], (65k, 128k], (128k, 256k], (256k, 512k], (512k, 1M].
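
A small sketch of how a test could be assigned to a bin under these rules. It assumes the `tiktoken` package for the `o200k_base` tokenizer and treats the prompt as a single string; the bin edges come directly from the list above.

```python
import bisect
import tiktoken  # assumed tooling; provides the o200k_base encoding

# Upper edges of the bins: [4k, 8k], (8k, 16k], ..., (512k, 1M]
BIN_EDGES = [8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576]
BIN_LABELS = ["8k", "16k", "32k", "64k", "128k", "256k", "512k", "1M"]

def assign_bin(prompt_text: str, expected_answer: str) -> str:
    enc = tiktoken.get_encoding("o200k_base")
    total = len(enc.encode(prompt_text)) + len(enc.encode(expected_answer))
    # First edge >= total picks the smallest bin whose upper bound fits the test,
    # e.g. 100,000 tokens (> 65,536 and <= 131,072) lands in the "128k" bin.
    return BIN_LABELS[bisect.bisect_left(BIN_EDGES, total)]
```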

(Source: OpenAI MRCR Dataset Card)

4. How is this different from Fiction.livebench?

Both test how well models handle long texts, but focus on different skills. OpenAI-MRCR tests the model's ability to pinpoint and distinguish between identical pieces of information based on their order in a conversation (using synthetic data). Fiction.livebench (fiction.live/...) tests narrative understanding – how well models follow plots, characters, and consistency within complex stories, using quizzes based on actual fiction excerpts.

5. How does the benchmark design reduce the risk of models succeeding due to training data contamination?

The benchmark incorporates several design features to minimize the chance that models succeed simply by having seen similar data during training:

  • Synthetic & Unique Data: Each test uses a specially generated, long conversation. While topics might overlap with training data, the specific sequence of turns and the placement of "needles" are unique to the benchmark run.
  • Instance Specificity: The core task isn't just retrieving information (e.g., a poem about tapirs) but retrieving a specific instance (e.g., the 2nd poem requested) based on conversational order. Simple memorization of poems is insufficient.
  • Required Random Hash: Models must prepend a specific random code (hash) to their answer. This code is generated for the test run and cannot be predicted from general training data. Failure to include the correct hash results in a score of 0.

These elements combined make it highly unlikely that a model can achieve a high score purely by recalling memorized training data, as success requires understanding the specific conversational context, instance order, and adhering to the random hash requirement. (Source: OpenAI MRCR Dataset Card)

Note: This dashboard evaluates models against the dataset published by OpenAI (data collected up to April 11, 2024). Models released or significantly updated after this date might have been trained on this specific benchmark data, potentially affecting their results.

Understanding the Results

6. What are the different AUC scores and how are they calculated?

AUC (Area Under Curve) gives a single score summarizing performance across different context lengths (bins). Think of it like an average grade across tests of increasing difficulty (longer contexts). It's calculated by plotting the average score for each bin against the maximum context length of that bin and measuring the area under the resulting line/curve. This area is then normalized to a percentage (0-100%).

  • AUC @128k: AUC calculated using results only up to the 128k token bin (tests with up to 131,072 tokens).
  • AUC @1M: AUC calculated using results across all bins up to the 1M token bin (tests with up to 1,048,576 tokens).
  • Model AUC: AUC calculated only over the range of bins the specific model actually completed successfully. This provides a fairer comparison if a model couldn't handle the longest contexts.

Technical Note: Calculation uses the Trapezoidal Rule on a linear scale of context lengths. See example below for details.

Example (AUC @1M for google/gemini-2.5-pro-preview-03-25 - 2-Needle Top Model):

  1. Data Points (Bin, Score%): (8k, 98.7), (16k, 94.3), (32k, 89.0), (64k, 88.9), (128k, 84.1), (256k, 81.1), (512k, 72.7), (1M, 60.1)
  2. Calculate Trapezoid Areas:
      - 8k-16k: (16384 - 8192) * (98.7 + 94.3)/2 = 790,474.02
      - 16k-32k: (32768 - 16384) * (94.3 + 89.0)/2 = 1,501,538.21
      - 32k-64k: (65536 - 32768) * (89.0 + 88.9)/2 = 2,914,845.6
      - 64k-128k: (131072 - 65536) * (88.9 + 84.1)/2 = 5,671,930.43
      - 128k-256k: (262144 - 131072) * (84.1 + 81.1)/2 = 10,827,079.54
      - 256k-512k: (524288 - 262144) * (81.1 + 72.7)/2 = 20,150,769.71
      - 512k-1M: (1048576 - 524288) * (72.7 + 60.1)/2 = 34,819,395.63
  3. Sum Areas: 76,676,033.16
  4. Normalize: The total width for AUC @1M is (1M bin - 8k bin) = 1048576 - 8192 = 1,040,384.
    Normalized AUC = (Total Area / Total Width) = 76,676,033.16 / 1,040,384 ≈ 73.6997
  5. Result: AUC @1M ≈ 73.7%

(Note: The example above uses 2-needle benchmark data from the model currently ranked #1 by AUC @1M, for illustration. The same calculation method applies to all models.)

AUC @128k uses the same method but only sums areas up to the 128k point and normalizes by the width from the first bin to 128k. Model AUC normalizes by the width of the range actually tested by the model.
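
The same calculation can be written compactly. The sketch below uses the rounded bin scores from the worked example, so it reproduces the published values to roughly one decimal place:

```python
def normalized_auc(points: list[tuple[int, float]]) -> float:
    """Trapezoidal-rule AUC over (bin upper bound in tokens, average score %)
    pairs, divided by the total width so the result is back on a 0-100% scale."""
    pts = sorted(points)
    area = sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return area / (pts[-1][0] - pts[0][0])

# Rounded 2-needle scores for google/gemini-2.5-pro-preview-03-25 (example above):
scores = [(8192, 98.7), (16384, 94.3), (32768, 89.0), (65536, 88.9),
          (131072, 84.1), (262144, 81.1), (524288, 72.7), (1048576, 60.1)]
print(round(normalized_auc(scores), 1))      # ~73.7 -> AUC @1M
print(round(normalized_auc(scores[:5]), 1))  # ~88.5 -> AUC @128k (bins up to 128k)
```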

7. How is the cost calculated?

Cost estimates shown are calculated only for successful test runs (where the model returned a response) and are based on the following:

  • Pricing Source: Public pricing data reported by OpenRouter for each model, specifically using the rate from the cheapest provider listed for that model at its maximum context length.
  • Token Counting Method: Cost estimates use the input (prompt) and output (completion) token counts reported by the API provider for each successful run.
    • Note: This method may differ from the tokenization used for grouping results. Test results are assigned to context length bins based on total tokens (prompt + expected completion) calculated with the `o200k_base` tokenizer, as opposed to the provider-reported counts used for cost estimation.
  • Total Prompt Cost: Estimates the cost based on the total prompt tokens reported by the API. When generating multiple candidates (n>1), the prompt is typically sent once, but the exact cost impact can vary by API provider.
  • Total Compl Cost: Estimates the cost of the text generated by the model (completion). It sums the completion token costs across all generated candidates (e.g., all 8 runs per test).
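
As a rough sketch of how such an estimate could be computed from provider-reported token counts and per-million-token prices (the field names and price units here are assumptions for illustration, not the dashboard's actual schema):

```python
def estimate_costs(runs, prompt_price_per_mtok: float, completion_price_per_mtok: float):
    """Estimate total prompt and completion cost over successful runs.

    Each item in `runs` is assumed to look like:
      {"prompt_tokens": 120000, "completion_tokens_per_candidate": [850, 910, ...]}
    with one completion token count per generated candidate (e.g. n=8).
    """
    prompt_tokens = sum(r["prompt_tokens"] for r in runs)
    completion_tokens = sum(sum(r["completion_tokens_per_candidate"]) for r in runs)
    total_prompt_cost = prompt_tokens * prompt_price_per_mtok / 1_000_000
    total_completion_cost = completion_tokens * completion_price_per_mtok / 1_000_000
    return total_prompt_cost, total_completion_cost
```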

Important Note on Actual vs. Estimated Cost:

  • The benchmark execution aimed to utilize cost-saving measures such as batch processing, caching, and available discounts/credits when possible.
  • Consequently, the actual costs incurred during the original benchmark run may have been lower than the estimates presented here.

To ensure a fair and standardized comparison across all models for users, this dashboard displays estimated costs. These are calculated using publicly available on-demand prices from OpenRouter (based on the cheapest provider listed for the model's max context length), rather than the potentially variable actual costs from the benchmark execution.

8. What does the "Runs" column represent?

This shows the total number of potential answers generated by the model across all successful test runs included in the summary. Each answer corresponds to one candidate response (or 'run'), typically generated using the 'n' parameter in the API call to get multiple outputs per input.

The total reflects the generation workload: (Number of Successful Tests) × (Candidates Generated Per Test).
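
For example, a model that completed 790 tests with 8 candidates generated per test would show 790 × 8 = 6,320 in the Runs column.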

9. What do the badges (#1, #2, #3, inc) in the table mean?
  • #1 #2 #3 These badges highlight the top 3 models for that specific column's metric. Higher scores are better for performance (%), while lower costs ($) are better.
  • inc The "inc" (incomplete) badge appears in cost columns if the model didn't successfully complete tests for all context length bins (up to 1M tokens). The cost shown might be lower than it would be if all tests had passed. Models marked inc are not included in the cost rankings to ensure fair comparison.
  • FREE This badge appears in cost columns for models where cost data is zero or unavailable (e.g., free models). These are also excluded from cost rankings.
  • deprec This badge indicates the model may be deprecated or no longer actively supported/available via common APIs (like OpenRouter). Results might be stale.
  • new This badge indicates results for this model were updated recently (within the last 7 days). This badge takes precedence over the deprec badge.

Dashboard Features

10. How do I interact with the table headers?

The main results table headers provide several actions:

  • Sorting: Click any column header label (e.g., `AUC @1M (%)`) to sort the entire table by that column's values. Click again to reverse the sort direction.
  • Refresh Data: Click the refresh icon (refresh) in the 'Model' column header to reload the latest results for the currently selected needle count.
  • Model Performance Chart: Click the timeline icon (timeline) next to a model name in the 'Model' column to view a chart of that model's performance across all available needle counts.
  • Cost/Score Plot: Click the stats icon (query_stats) in the header of any context bin column (e.g., `128k (%)`) to view a Cost vs. Score scatter plot for models within that specific bin.
11. How can I see a Cost vs. Score chart for a specific context bin?

In the main results table, click the stats icon (query_stats) in the header of any context bin column (e.g., `128k (%)`). This opens a scatter plot showing the total cost vs. average score for all currently selected models specifically within that context length bin.

12. How can I view the detailed results for a specific test run?

In the main results table, hover over any individual score cell (the percentage value) and click the document icon (article) that appears. This opens a modal where you can browse through the individual test runs for that model/bin combination, view the expected answer, and see the actual responses generated by the model along with their scores.

13. Can I customize the table columns?

Yes! Go to the "Controls" tab (next to the "Leaderboard" tab). There you can find options to show/hide specific columns, like all the individual context length bins or the pricing information.

14. How are the initially visible traces (lines/bars) on the main chart chosen?

When viewing the main performance chart:

  • If 5 or fewer models are selected in the "Controls" tab, all their performance traces (lines/bars) will be visible on the chart by default.
  • If more than 5 models are selected, the chart initially shows only the traces for the top 5 performing non-deprecated models (ranked by their overall `AUC @1M (%)` score).
  • All other selected models are listed in the legend but their traces are hidden initially. Click a model's name in the legend to display its performance trace.
  • You can toggle the visibility of any model's performance trace by clicking its name in the legend.
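
A rough sketch of that default-visibility rule (illustrative only; `models` is assumed to be a list of records with `deprecated` and `auc_1m` fields, which is not necessarily the dashboard's actual data model):

```python
def default_visible_traces(models, limit=5):
    """Show every selected model if there are few; otherwise show only the
    top `limit` non-deprecated models ranked by their AUC @1M score."""
    if len(models) <= limit:
        return list(models)
    eligible = [m for m in models if not m["deprecated"]]
    return sorted(eligible, key=lambda m: m["auc_1m"], reverse=True)[:limit]
```
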
15. How are the initially visible points on the Cost vs. Score chart chosen?

When viewing the Cost vs. Score chart for a specific bin or AUC metric:

  • The chart initially displays data points only for the top 10 performing models among those currently selected. The ranking is based on the primary score metric being plotted: either the average score (%) within that specific context bin, or the relevant AUC score (e.g., `AUC @1M (%)`, `Model AUC (%)`) if viewing an AUC plot.
  • All other selected models are listed in the legend but their data points are hidden initially. Click a model's name in the legend to display its data point.
  • You can toggle the visibility of any model's data point by clicking its name in the legend.

Data, Code & Contact

16. Where can I find the benchmark data and code?

The dataset details and evaluation methods are described on Hugging Face: openai/mrcr. OpenAI also discussed results in their GPT-4.1 blog post.

17. Can you evaluate more models?

We plan to add more models over time. Feel free to suggest specific models you'd like to see evaluated!

18. How can I contact you?

For questions, suggestions, or issues, please reach out on Twitter: @DillonUzar.