GPT-o4 is HERE - OpenAI is BACK!

Channel: Matthew Berman · Published: April 16th, 2025 · AI Score: 98 · 15:47

AI Generated Summary

Airdroplet AI v0.2

OpenAI just shook things up again by dropping two new AI models, O3 and O4 Mini! These aren't just incremental upgrades; they represent a big step forward, chiefly because they ship with full tool-use capabilities right out of the box, something earlier reasoning models initially lacked. The presenter is really hyped about this 'agentic tool use', seeing it as a potential new 'scaling law' alongside traditional model training.

What's New with O3 and O4 Mini?

  • Models Released: OpenAI launched O3 and O4 Mini. Confusingly, O3 is currently the top-performing model, while O4 Mini is positioned as a faster, cheaper alternative. The expectation is a full O4 model will eventually take the top spot.
  • Intelligence & Novelty: OpenAI claims these are their smartest models yet, capable of generating truly novel ideas. This is a big deal because the ability to create new concepts is seen as a key step towards the 'intelligence explosion' where AI can improve itself.
  • Multimodality: Like previous advanced models, these can handle various inputs (text, images, audio) and produce outputs in different formats.
  • Agentic Tool Use: This is the star feature. The models don't just use one tool; they can iteratively select and use multiple tools (like browsing the web, running code, analyzing images) to solve complex problems. This makes them feel much more like 'agents' that can actively work towards a goal.
  • Demo: A live demo showed O3 tackling a complex physics problem based on an old poster. It involved understanding the image, realizing data was missing, figuring out how to extrapolate the missing data, browsing the web to find recent research papers, comparing results, and summarizing its findings. The model navigated the task intelligently, even pointing out limitations in the original project's precision compared to modern research. This iterative tool use was impressive and reminded the presenter of agentic systems like Manus.
  • Underlying Tech: Greg Brockman from OpenAI emphasized that this is still fundamentally 'next token prediction', suggesting there's still plenty of room for scaling. They're scaling both pre-training and post-training (using reinforcement learning), and the presenter personally believes tool use is effectively another scaling dimension.
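The iterative tool-use loop described above (pick a tool, observe the result, repeat until the task is solved) can be sketched generically. Everything here is illustrative: the tool names, the toy step-by-step policy, and the stopping condition are stand-ins, not OpenAI's actual implementation.

```python
# Minimal sketch of an agentic tool-use loop: the model repeatedly selects
# a tool, folds the observation back into its context, and stops once it
# can answer. All tools and the "policy" below are hypothetical stand-ins.

def search_web(query: str) -> str:
    return f"results for {query!r}"      # stand-in for a real web-search tool

def run_code(snippet: str) -> str:
    return f"output of {snippet!r}"      # stand-in for a sandboxed interpreter

TOOLS = {"search_web": search_web, "run_code": run_code}

def agent_loop(task: str, max_steps: int = 5) -> str:
    """Iterate: choose a tool, call it, append the observation to context."""
    context = [task]
    for step in range(max_steps):
        # A real model would choose the next action from the full context;
        # this toy policy just alternates two tools, then answers.
        if step == 0:
            name, arg = "search_web", task
        elif step == 1:
            name, arg = "run_code", "extrapolate_missing_data()"
        else:
            return f"answer based on {len(context)} context items"
        context.append(TOOLS[name](arg))
    return "gave up after max_steps"

print(agent_loop("analyze old physics poster"))
```

The point of the sketch is the shape of the loop, not the tools: each observation becomes input to the next decision, which is what made the physics-poster demo feel agentic rather than single-shot.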

Benchmark Performance

The new models show significant gains across various benchmarks:

  • Math (AIME): Strong improvements over the previous O1 model.
  • Coding (Codeforces): O3 and O4 Mini (with terminal access) achieved ratings placing them among the top 200 human coders globally on this platform. This is a massive leap.
  • Science (GPQA Diamond): Decent improvements.
  • Humanity's Last Exam: Substantial jumps, especially when allowed to use tools like Python and web browsing. Deep Research still leads here, which the presenter attributes to potentially more advanced agent setups.
  • Multimodal Tasks (MMMU, MathVista, CharXiv): Good gains, particularly on the CharXiv reasoning tasks.
  • Real-World Coding (SWE-Lancer, SWE-bench): Massive performance increases. O3 High reportedly 'earned' $65k on the SWE-Lancer benchmark compared to O1 High's $28.5k, highlighting a potential 'arbitrage opportunity' in using these AIs for software tasks. They are much better at coding overall.
  • Code Editing (Aider Polyglot): O3 High shows strong performance.
  • Instruction Following/Tool Use: O3 leads the pack.
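The simulated-earnings gap quoted above works out to roughly a 2.3× improvement, a quick check of the arithmetic:

```python
# Ratio of simulated freelance earnings on the SWE-Lancer-style benchmark,
# using the figures quoted in the summary above.
o3_high_earnings = 65_000   # USD, O3 High
o1_high_earnings = 28_500   # USD, O1 High

ratio = o3_high_earnings / o1_high_earnings
print(f"{ratio:.2f}x")  # prints 2.28x
```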

Where Do These Models Come From? (Presenter's Theory)

The presenter speculates that O3, O4 Mini, etc., aren't entirely new architectures but might be derived from ongoing GPT-5 training. The idea is that OpenAI keeps training a powerful base model (GPT-5), periodically takes snapshots ('checkpoints'), fine-tunes these checkpoints using reinforcement learning to enhance their 'thinking' and tool-using abilities, and releases them as these new 'O' models. This would explain the rapid succession and performance jumps, aligning with Sam Altman's recent comments about GPT-5 being better than initially expected.

Cost Efficiency: A Major Focus

OpenAI is heavily emphasizing that O3 and O4 Mini are not just more capable but also significantly cheaper and faster to run for the performance they deliver. Graphs show improved benchmark scores at similar or lower inference costs compared to previous models. This is seen as a strategic move to attract developers and enterprises who are very sensitive to cost when choosing AI platforms.
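One way to read those graphs is as cost per unit of benchmark performance. The numbers below are made-up placeholders, not OpenAI's published prices or scores; the snippet only illustrates the comparison the graphs are making.

```python
# Compare hypothetical models by inference cost per benchmark point.
# All prices and scores are illustrative placeholders, not real figures.
models = {
    "older-model": {"score": 70.0, "usd_per_1m_tokens": 15.0},
    "newer-model": {"score": 85.0, "usd_per_1m_tokens": 10.0},
}

def cost_per_point(m: dict) -> float:
    """Dollars of inference spend per benchmark point scored."""
    return m["usd_per_1m_tokens"] / m["score"]

for name, m in models.items():
    print(f"{name}: ${cost_per_point(m):.3f} per benchmark point")
```

A lower cost-per-point at a higher absolute score is exactly the "cheaper and more capable" story OpenAI is pitching to cost-sensitive developers and enterprises.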

The 'One More Thing': Codex CLI

OpenAI also dropped a surprise open-source project called Codex CLI. It's essentially an agentic coding assistant that runs in your command line interface (terminal).

  • Functionality: It uses OpenAI's models (like O3/O4 Mini) to understand requests, read files on your computer, write code, execute commands, and generally help with software development tasks directly in your local environment. It looks very similar to tools like Claude Code.
  • Capabilities: Leverages the multimodal reasoning, tool use, and thinking abilities of the new models.
  • Auto Mode: It has an optional 'full auto mode' to execute tasks more autonomously (with safety considerations mentioned).
  • Initiative: OpenAI is launching a $1M grant program (in API credits) to encourage projects built using Codex CLI.

Platform Risk: A Word of Caution

While Codex CLI is cool, its release highlights a significant risk for developers building tools on top of OpenAI's platform. If you create a successful AI-powered coding tool, what's to stop OpenAI from releasing a similar, integrated feature themselves (like Codex CLI)? The presenter, citing experience in Silicon Valley, warns that large platforms often absorb functionalities developed by their users if the market is attractive enough. The advice is to be aware of this 'platform risk' and consider diversifying dependencies, potentially favoring open-source models where possible.

Availability

  • Paid Users: Plus, Pro, and Team users get access to O3, O4 Mini, and O4 Mini High immediately in the model selector.
  • Enterprise/Edu: Access rolls out in about a week.
  • Free Users: Can try O4 Mini using the 'Think' option.
  • Rate Limits: Remain the same as before.

The presenter plans to test these new models against competitors like Gemini 2.5 Pro, considered a benchmark leader, especially for coding.