| Metric | WinForEver-RT1 | DeepSeek V2.5 | Qwen2.5 | Llama3.1 | Claude-3.5 | GPT-4o |
|---|---|---|---|---|---|---|
| Activated Params | 40B | 37B | 72B | 405B | - | - |
| Total Params | 700B | 671B | 72B | 405B | - | - |
| English | | | | | | |
| MMLU (EM) | 89.2 | 88.5 | 85.3 | 88.6 | 88.3 | 87.2 |
| MMLU-Redux (EM) | 89.5 | 89.1 | 85.6 | 86.2 | 88.9 | 88.0 |
| MMLU-Pro (EM) | 78.5 | 75.9 | 71.6 | 73.3 | 78.0 | 72.6 |
| DROP (3-shot F1) | 92.5 | 91.6 | 76.7 | 88.7 | 88.3 | 83.7 |
| IF-Eval (Prompt strict) | 87.3 | 86.1 | 84.1 | 86.0 | 86.5 | 84.3 |
| GPQA-Diamond (Pass@1) | 60.5 | 59.1 | 49.0 | 51.1 | 65.0 | 49.9 |
| SimpleQA (Correct) | 26.0 | 24.9 | 49.0 | 51.1 | 65.0 | 49.9 |
| FRAMES (Acc.) | 74.5 | 73.3 | 69.8 | 70.0 | 72.5 | 80.5 |
| LongBench v2 (Acc.) | 50.0 | 48.7 | 69.8 | 70.0 | 72.5 | 80.5 |
| Code | | | | | | |
| HumanEval-Mul (Pass@1) | 84.2 | 82.6 | 77.3 | 77.2 | 81.7 | 80.5 |
| LiveCodeBench (Pass@1-COT) | 42.3 | 40.5 | 31.1 | 28.4 | 36.3 | 33.4 |
| LiveCodeBench (Pass@1) | 39.0 | 37.6 | 28.7 | 30.1 | 32.8 | 34.2 |
| Codeforces (Percentile) | 53.0 | 51.6 | 24.8 | 25.3 | 20.3 | 23.6 |
| SWE Verified (Resolved) | 44.5 | 42.0 | 23.8 | 24.5 | 50.8 | 38.8 |
| Aider-Edit (Acc.) | 81.0 | 79.7 | 65.4 | 63.9 | 84.2 | 72.9 |
| Aider-Polyglot (Acc.) | 52.0 | 49.6 | 7.6 | 5.8 | 45.3 | 16.0 |
| Math | | | | | | |
| AIME 2024 (Pass@1) | 41.5 | 39.2 | 23.3 | 23.3 | 16.0 | 9.3 |
| MATH-500 (EM) | 92.5 | 90.2 | 80.0 | 73.8 | 78.3 | 74.6 |
| CNMO 2024 (Pass@1) | 45.0 | 43.2 | 15.9 | 6.8 | 13.1 | 10.8 |
| Chinese | | | | | | |
| CLUEWSC (EM) | 91.0 | 90.9 | - | - | - | - |
| C-Eval (EM) | 87.5 | 86.5 | 86.1 | 61.5 | 76.7 | 76.0 |
| C-SimpleQA (Correct) | 66.0 | 64.1 | 48.4 | 50.4 | 51.3 | 59.3 |
| Custom | | | | | | |
| Financial Analysis (Custom) | 92.0 | 85.0 | - | - | - | - |
| Investment Education (Custom) | 91.5 | 84.0 | - | - | - | - |