The Cryptocurrency Post

AI Reasoning Benchmark: MathR-Eval in 2025

We designed a new benchmark, the Mathematical Reasoning Eval (MathR-Eval), to test LLMs’ reasoning abilities on 100 logical mathematics questions.

Benchmark results

Results show that OpenAI’s o1 and o3-mini are the best-performing LLMs in our benchmark.

Methodology

Our dataset includes 100 mathematics questions that avoid advanced calculus but require reasoning and problem-solving techniques. We chose these questions because they are objective to grade: each has exactly one correct answer.

We also used the same dataset in our LMC-Eval: Logic/Math Coding Benchmark to compare the models’ reasoning and coding abilities. You can see an example question from our dataset in that article. This is a zero-shot benchmark: we did not provide example questions in the prompts.
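The grading setup described above can be sketched as a short loop. This is a hypothetical illustration, not the authors’ actual harness: `ask_model` stands in for whatever LLM API call is used, and the question/answer dataset schema is an assumption.

```python
# Hypothetical sketch of a zero-shot, exact-match grading loop.
# `ask_model` stands in for an actual LLM API call; the dataset
# schema (question/answer pairs) is an assumption for illustration.
from typing import Callable

def grade_zero_shot(dataset: list[dict], ask_model: Callable[[str], str]) -> float:
    """Score a model on question/answer pairs with no example prompts."""
    correct = 0
    for item in dataset:
        # Zero-shot: the prompt contains only the question itself,
        # with no worked examples.
        reply = ask_model(item["question"])
        # A single correct answer allows objective, exact-match grading.
        if reply.strip().lower() == item["answer"].strip().lower():
            correct += 1
    return correct / len(dataset)

# Toy run with a stub "model" that always answers "4".
sample = [{"question": "What is 2 + 2?", "answer": "4"}]
print(grade_zero_shot(sample, lambda q: "4"))  # 1.0
```

Real harnesses typically add answer normalization (stripping units, parsing fractions) before comparison, since exact string matching alone penalizes correctly reasoned but differently formatted answers.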

We tested not only reasoning models but also non-reasoning models, to see how the two compare.

For the models’ hallucination rates, see our AI hallucination benchmark.

AI reasoning models

OpenAI o1, o1-mini and o1-pro: OpenAI released o1-preview and o1-mini on September 12, 2024; the full o1 and o1-pro followed in December 2024. o1-mini is a faster model optimized for STEM tasks.

OpenAI o3, o3-mini and o3-mini-high: o3-mini is the most cost-efficient of OpenAI’s reasoning models and matched o1 in our benchmark. It is a smaller model than o3; o3-mini-high is o3-mini run with a higher reasoning-effort setting.

Claude 3.7 Sonnet: Claude 3.7 Sonnet has an extended-thinking mode in which users can adjust the budget of reasoning tokens.

DeepSeek R1: DeepSeek R1 is the only open-source model in this benchmark. It also offers the cheapest API among the reasoning models tested.

For details about their API pricing, see LLM pricing.

Types of AI reasoning

Different reasoning models employ various approaches:

  • Deductive reasoning: Drawing specific conclusions from general principles
  • Inductive reasoning: Forming general conclusions from specific observations
  • Abductive reasoning: Finding the most likely explanation for observations
  • Analogical reasoning: Applying solutions from similar past problems
  • Causal reasoning: Understanding cause-and-effect relationships
  • Common sense reasoning: Making intuitive judgments
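The first two styles above can be contrasted with a toy sketch: deduction applies a general rule to a specific case, while induction generalizes from specific observations. The rules and observations below are invented purely for illustration.

```python
# Toy contrast between deductive and inductive reasoning.
# All rules and observations here are invented for illustration.

def deduce(general_rule, case):
    """Deduction: apply a general rule to a specific case."""
    return general_rule(case)

def induce(observations):
    """Induction: generalize from specific observations
    (here: 'every observed x had the property, so assume all do')."""
    return all(observations)

# Deduction: "the square of any even number is even", applied to 6.
is_even = lambda n: n % 2 == 0
print(deduce(lambda n: is_even(n * n), 6))  # True

# Induction: every swan observed so far was white,
# so conclude all swans are white (valid-looking, but fallible).
observed_swans_white = [True, True, True]
print(induce(observed_swans_white))  # True
```

The swan example also shows why induction is weaker than deduction: one counterexample overturns the generalization, whereas a valid deduction from true premises cannot fail.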

Characteristics of AI reasoning models

Reasoning models typically exhibit:

  • Step-by-step processing: Rather than producing immediate answers, these models break down problems into logical components and work through them sequentially.
  • Chain-of-thought capabilities: These models can show their work and explain the reasoning pathway from question to conclusion, and users can inspect this trace in the chat interface.
  • Extended thinking modes: Some models incorporate dedicated “thinking time” before generating responses, improving accuracy on complex problems.
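The difference between a direct answer and step-by-step processing often comes down to how the prompt is framed. A minimal sketch of the two prompt styles follows; it builds prompt strings only, and any real API client for sending them is left out as an assumption.

```python
# Sketch of direct vs. chain-of-thought prompt construction.
# Sending the prompt to a model is out of scope; any API client
# used for that would be an assumption not shown here.

def build_prompt(question: str, chain_of_thought: bool) -> str:
    if chain_of_thought:
        # Ask the model to show intermediate steps before answering,
        # mirroring the "step-by-step processing" described above.
        return (
            f"{question}\n"
            "Think through the problem step by step, "
            "then give the final answer on the last line."
        )
    # Direct prompting: request only the final result.
    return f"{question}\nAnswer with only the final result."

print(build_prompt("What is 17 * 24?", chain_of_thought=True))
```

Dedicated reasoning models internalize this behavior and emit reasoning tokens without being asked, but the prompt-level contrast is a useful mental model for why step-by-step generation improves accuracy on multi-step problems.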

Reasoning models are developed with machine learning techniques, including supervised, unsupervised, and reinforcement learning.

FAQ

What is AI reasoning?

AI reasoning is a key component of artificial intelligence systems, enabling them to make informed decisions and take action. It allows systems to draw logical conclusions from complex inputs, using logical techniques such as deduction and induction to make decisions and solve problems.

What are the benefits and limitations of AI reasoning?

AI reasoning has many benefits, including making informed decisions and solving complex problems. It also has limitations, including the potential for bias and errors, and its quality depends on the data and knowledge used to train the system. AI reasoning is not a replacement for human reasoning but a tool to augment and support human decision-making: it can help with mathematical logic, but it can make mistakes.
AI models still hallucinate in complex reasoning tasks. They can perform certain types of logical inference within their trained domains, but they often struggle with complex deductions that require consistent belief structures across multiple reasoning steps.

What are the real-world applications of AI reasoning?

AI reasoning is used in expert systems, decision support systems, and other types of AI systems, as well as in medical diagnosis, financial analysis, and customer service applications.

What is the future of AI reasoning?

Large language models and other AI systems enable reasoning to be applied across many applications. Systems that model human thought processes can support decision-making, and advances in machine learning and automated reasoning will make them increasingly useful.
The structured, step-by-step approach produces fewer factual errors than generating a response in a single pass, which helps reduce hallucination.

What are procedural reasoning systems?

Procedural Reasoning Systems (PRS) represent a specialized AI reasoning approach that explicitly encodes and executes procedural knowledge to achieve specific goals.
Unlike statistical reasoning in modern machine learning, PRS organizes knowledge as context-specific procedures within a Belief-Desire-Intention (BDI) framework, where agents maintain beliefs about their environment, desires they aim to fulfill, and intentions they commit to pursuing.
This structured approach offers greater transparency and explainability in reasoning processes than neural network approaches.
While modern AI systems often rely on learned statistical patterns, PRS principles continue to influence autonomous systems, cognitive architectures, and hybrid approaches that combine procedural knowledge with machine learning techniques for more robust reasoning capabilities.
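The BDI loop described above can be sketched in a few lines. This is a minimal toy agent, not a real PRS implementation: the thermostat domain and every name in it are invented for illustration.

```python
# Minimal Belief-Desire-Intention (BDI) agent sketch.
# The thermostat domain and all names are invented for illustration;
# real PRS implementations are far richer (plan libraries, events).
from dataclasses import dataclass, field

@dataclass
class BDIAgent:
    beliefs: dict = field(default_factory=dict)     # what the agent holds true
    desires: list = field(default_factory=list)     # (goal, precondition) pairs
    intentions: list = field(default_factory=list)  # goals it has committed to

    def perceive(self, observation: dict) -> None:
        """Update beliefs from new observations about the environment."""
        self.beliefs.update(observation)

    def deliberate(self) -> None:
        """Commit to desires whose preconditions hold under current beliefs."""
        self.intentions = [
            goal for goal, precondition in self.desires
            if precondition(self.beliefs)
        ]

    def act(self) -> list:
        """Return the committed goals to execute (procedures elided)."""
        return list(self.intentions)

agent = BDIAgent()
# Desire: turn on heating whenever the believed temperature drops below 18.
agent.desires.append(("turn_on_heating", lambda b: b.get("temp", 20) < 18))
agent.perceive({"temp": 15})
agent.deliberate()
print(agent.act())  # ['turn_on_heating']
```

The separation of perceive, deliberate, and act stages is what gives BDI systems their transparency: at any point one can inspect exactly which beliefs triggered which commitments.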
