Benchmark | Reflection 70B | Claude 3.5 Sonnet | Claude 3 Opus | GPT-4o | Gemini 1.5 Pro | Llama 3.1 405B |
---|---|---|---|---|---|---|
GPQA | 55.3% (0-shot Reflection) | 59.4%* (0-shot CoT) | 50.4% (0-shot CoT) | 53.6% (0-shot CoT) | — | 50.7% (0-shot) |
MMLU | 89.9% (0-shot Reflection) | 88.7%** (5-shot) 88.3% (0-shot CoT) | 85.7% (0-shot CoT) | 88.7% (5-shot) 85.9% (0-shot CoT) | 87.3% (5-shot) 88.6% (0-shot CoT) | — |
HumanEval | 91% (0-shot Reflection) | 92.0% (0-shot) | 84.9% (0-shot) | 90.2% (0-shot) | 84.1% | 89.0% (0-shot) |
MATH | 79.7% (0-shot Reflection) | 71.1% (0-shot CoT) | 60.1% (0-shot CoT) | 76.6% (4-shot) | 67.7% | 73.8% (0-shot CoT) |
GSM8K | 99.2% (0-shot Reflection) | 96.4% (0-shot CoT) | 95.0% (0-shot CoT) | — | 90.8% | 96.8% (8-shot CoT) |
IFEval | 90.13% (0-shot Reflection) | — | — | 85.6% | — | 88.6% |
How to use Reflection 70B Model Online?
Follow these simple steps to start chatting with Reflection 70B.
Reflection 70B Features
🧠
Architecture
Built on the Llama-3.1 framework, incorporating special tokens like <thinking>, <reflection>, and <output> to structure the reasoning process.
📊
Training Data
Trained on synthetic data generated by Glaive, utilizing large datasets to enhance performance in natural language processing tasks.
🏆
Performance
Demonstrated superior performance across benchmarks such as MMLU, MATH, IFEval, and GSM8K, outperforming closed-source models like GPT-4o.
🎯
Reduced Hallucinations
Employs stricter control mechanisms during information verification stages to significantly reduce false information, enhancing user trust and reliability.
FAQ
Frequently Asked Questions about Reflection-70B
- Reflection-70B is an advanced open-source language model designed to minimize hallucinations and improve accuracy in AI-generated outputs through a technique called Reflection-Tuning.
- Reflection-Tuning teaches the model to detect and correct its own reasoning errors by introducing special tokens like thinking , reflection , and output to structure its thought process.
- Reflection-70B has demonstrated superior performance across various benchmarks, including MMLU, MATH, IFEval, and GSM8K, outperforming even closed-source models like GPT-4o.
- By employing stricter control mechanisms during information verification stages, Reflection-70B significantly reduces the generation of false information, enhancing user trust and reliability.
- The weights for Reflection-70B are available on Hugging Face, and an API is set to be released through Hyperbolic Labs for easier integration into applications.
- An even more powerful version, Reflection-405B, is expected to be released soon, anticipated to outperform top proprietary models significantly.