OpenAI has announced two new models, o3 and o3-mini, which are scheduled to appear early next year. Compared to o1, the models again show significant progress on reasoning tasks and achieve extraordinarily high scores in a benchmark designed to measure progress toward AGI.
o3 models are coming in 2025
Initially, o3-mini is scheduled to appear in January, with the larger model in the series following a little later. What is particularly causing a stir are the results o3 achieved in the ARC benchmark.
The benchmark comes from François Chollet. He was an AI developer at Google until November 2024, but he is best known to the public for his critical examination of the concept of intelligence as used in the context of LLMs. This thinking also shapes the ARC benchmark, which is intended to test AI systems with tasks that tend to be simple for humans but cannot be derived from the training data.
The challenge is to develop a solution independently. While humans solve around 80 percent of the tasks on average, current models such as o1 only manage around 31 percent; previous generations such as GPT-4o achieved just 5 percent. The ARC benchmark is therefore also considered a yardstick for the development of AGI, i.e. an artificial general intelligence that can keep up with or outperform humans in most tasks.
New milestone in the ARC benchmark
The new o3 models now set record scores: the efficient version reaches 75.7 percent, while the compute-intensive version achieves 87.5 percent. While OpenAI unsurprisingly describes o3's progress as remarkable, Chollet is also impressed. In a blog post (via The Decoder), he speaks of a surprising and important leap in AI capabilities that has not been observed before.
That the o1 and o3 models stand out in this way is due to their revised architecture. The models "think" while solving tasks: the focus shifts to the inference phase, in which the model computes its answer. To arrive at the correct result, the model can try out different solution paths.
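OpenAI has not published the exact mechanism behind this "thinking". A minimal sketch of the general idea, spending more inference compute by sampling several candidate solutions and keeping the one a checker scores best (all function names and the toy task are hypothetical placeholders, not OpenAI's method):

```python
import random

def sample_candidate(task, rng):
    # Placeholder for one sampled "thought": a noisy guess at a sum.
    # In a real system this would be a sampled reasoning chain plus answer.
    return sum(task) + rng.choice([-1, 0, 0, 0, 1])

def score(task, answer):
    # Placeholder verifier: 1.0 if the answer checks out, else 0.0.
    return 1.0 if answer == sum(task) else 0.0

def solve_with_inference_compute(task, n_candidates=16, seed=0):
    """Sample n candidates and return the best-scoring one (best-of-N).
    More candidates means more inference compute and better odds."""
    rng = random.Random(seed)
    candidates = [sample_candidate(task, rng) for _ in range(n_candidates)]
    return max(candidates, key=lambda a: score(task, a))

print(solve_with_inference_compute([2, 3, 5]))  # more samples raise the chance of 10
```

The trade-off the article describes falls out directly: raising `n_candidates` improves the chance of hitting the correct answer, but multiplies the tokens processed, and with them the cost.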
However, operating the models in this form is also expensive. According to Chollet, the efficient model processed 33 million tokens to solve the tasks, at a total cost of $2,012, or around $20 per task. The compute-intensive variant requires 172 times as much computing power as the efficient one.
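The reported figures can be cross-checked with simple arithmetic; the derived quantities (task count, tokens per task, effective token price) are approximations computed from the numbers above, not values stated by Chollet:

```python
# Figures reported by Chollet for the efficient o3 configuration.
total_tokens = 33_000_000   # tokens processed across the benchmark run
total_cost_usd = 2012       # total cost in USD
cost_per_task_usd = 20      # approximate cost per task

tasks = total_cost_usd / cost_per_task_usd            # roughly 100 tasks in the run
tokens_per_task = total_tokens / tasks                # roughly 330,000 tokens per task
usd_per_million_tokens = total_cost_usd / (total_tokens / 1_000_000)

print(f"tasks: {tasks:.0f}")
print(f"tokens per task: {tokens_per_task:,.0f}")
print(f"USD per million tokens: {usd_per_million_tokens:.2f}")
```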
o3 is not an AGI system
Despite the progress, Chollet does not consider the new o3 models to be AGI systems. They still fail at tasks that are quite easy for humans, which is not comparable to human intelligence. He also emphasizes that the ARC benchmark alone is not an indicator of AGI. Rather, it focuses on one of the central problems of AI systems: solving tasks for which there is no pattern in the training data.
Chollet has now announced a successor to the ARC benchmark. ARC-AGI-2 is scheduled to appear next year and relies on the same principle: tasks that are easy for humans to solve but pose major challenges for AI systems.
The first model to become available will be o3-mini, the version intended for most everyday tasks. In addition, customers who access the models via the API can choose between different efficiency modes, which determine how much compute the model spends on a task.
Google also announces a “thinking” model with Gemini 2.0 Flash
OpenAI is not the only provider presenting models that shift part of the computation to the inference phase. This week, Google presented Gemini 2.0 Flash Thinking, a model that also solves tasks step by step.
In general, AI researchers see such an architecture as a way to keep the pace of AI development high. According to Ilya Sutskever, formerly of OpenAI and now a startup founder, scaling has reached a plateau in the pre-training phase. It is therefore no longer enough to increase the amount of training data and computing time to make progress; new approaches are needed, Sutskever said. Scaling inference computation is one of them.