
The Industry Reacts to o3 and o4!

Channel: Matthew Berman · Published: April 19th, 2025 · AI Score: 98 · Duration: 15:15

AI Generated Summary

Airdroplet AI v0.2

OpenAI recently dropped its new O3 and O4 Mini models, and the AI world is buzzing. These models, especially O3, are being hailed as potentially "genius level," even setting a new record on the Mensa IQ test, surpassing Google's Gemini 2.5 Pro. The reactions highlight incredible leaps in reasoning, tool use, and even multimodal capabilities like interpreting images.

Here's a breakdown of what's got everyone talking:

O3 Model - The "Genius" Challenger:

  • IQ Champion: O3 scored a whopping 136 on the Mensa IQ test, making it the highest-scoring AI model currently known. This significantly beats Gemini 2.5 Pro (128) and previous OpenAI models like O1 (122).
  • Expert-Level Responses: Early testers, like Derya Unutmaz, call it a major milestone comparable to the O1 releases, describing it as smarter, more reliable, and rarely hallucinating. Its responses to complex medical questions read like they come from top specialists: precise, thorough, and evidence-based.
  • Discovering New Knowledge: OpenAI itself claims O3 is capable of discovering genuinely new information, a significant step for AI.
  • Agent-Style Tool Use: A standout feature is its ability to use tools (like code execution) iteratively within its thinking process (chain of thought). This allows it to tackle complex, multi-step tasks with impressive reasoning and precision. This is seen as a major unlock for AI capabilities.
  • Handling Long Contexts ("Needle in a Haystack"): O3 performs almost perfectly when retrieving information from large amounts of text (up to 120k tokens), though it had tiny dips at 16k and 60k tokens in tests. It holds up very well compared to Gemini 2.5 Pro.
  • Geoguessing Master: O3 has shown mind-blowing ability in geoguessing (identifying locations from images). It aced an "impossible" challenge designed for expert human geoguessers and even identified a specific Chicago restaurant and patio spot just from a photo of a dish! This is seriously impressive, almost scary.
  • A Word of Caution: Because AI is getting so good at geoguessing, be extra careful about posting images that might reveal your location online. Anyone could potentially figure out where you are now, not just dedicated experts.
  • Not Perfect (Yet): Despite the hype, O3 isn't infallible. It initially failed a simple test, counting the 'R's in "strawberry", for one user (though it worked for another). It also failed to correctly identify people and their colors in a specific drawing test. This reminds us that even advanced models have limitations.
  • Amazing Multimodality: Beyond geoguessing, O3 flawlessly solved a complex 200x200 maze image in one attempt, showcasing strong visual understanding.
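The agent-style tool use described above, where a model calls tools iteratively inside its reasoning process and feeds the results back into the next step, can be sketched as a simple loop. This is a hypothetical illustration, not OpenAI's actual API or implementation; the `stub_model`, `run_python` tool, and `agent_loop` names are invented for the example.

```python
# Minimal sketch of agent-style tool use inside a reasoning loop.
# All names here are illustrative; a real reasoning model produces
# these tool calls inside its chain of thought.

def run_python(expr: str) -> str:
    """A toy 'code execution' tool: evaluate an arithmetic expression."""
    return str(eval(expr, {"__builtins__": {}}, {}))

TOOLS = {"run_python": run_python}

def stub_model(task, observations):
    """Stand-in for the model: decide whether to call a tool or answer.
    Here it always runs one calculation, then answers from the result."""
    if not observations:
        return {"action": "tool", "tool": "run_python", "input": "37 * 43"}
    return {"action": "answer", "text": f"The product is {observations[-1]}."}

def agent_loop(task, model, max_steps=5):
    """Alternate between model reasoning steps and tool executions."""
    observations = []
    for _ in range(max_steps):
        step = model(task, observations)
        if step["action"] == "tool":
            result = TOOLS[step["tool"]](step["input"])
            observations.append(result)  # fed back into the next reasoning step
        else:
            return step["text"]
    return "step budget exhausted"

print(agent_loop("What is 37 * 43?", stub_model))  # -> The product is 1591.
```

The key idea the reactions single out is exactly this interleaving: the tool result becomes part of the model's context for the next reasoning step, so multi-step tasks can be decomposed and verified as the model goes.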

O4 Mini Model - Smarter, Faster, Cheaper:

  • Even Better at Math: O4 Mini High absolutely crushed recent math problems. It solved a brand new Project Euler problem faster than any human (sometimes under a minute!) and achieved a perfect score on the MathArena AIME 2025 II benchmark.
  • Top Coder: Independent benchmarks from Artificial Analysis show O4 Mini High taking the top spot in their coding intelligence index, showing significant gains over O3 Mini.
  • Coding Prowess: Both O3 and O4 Mini handled a complex physics simulation coding task (balls in hexagons) flawlessly, looking better than even Gemini 2.5 Pro in that specific test (though past tests showed Gemini 2.5 Pro doing well).
  • Benchmark Leader: O4 Mini High achieved the highest score to date on the Artificial Analysis Intelligence Index, slightly edging out Gemini 2.5 Pro and O3 Mini High.
  • Efficient: O4 Mini uses fewer computational tokens during its thinking process compared to models like Claude 3.7 Sonnet and Gemini 2.5 Pro on benchmark tests. This means it's potentially faster and cheaper to run for complex tasks.
  • Pricing & Context: O4 Mini is priced similarly to O3 Mini (though cached tokens are cheaper), but Google's Gemini 2.5 Flash is even more budget-friendly. The main drawback is its 200k-token context window: the same as O3 Mini but smaller than competitors like Llama 4 (1M) and Gemini 2.5 Pro.

Overall Industry Sentiment:

  • Major Step Change: Many view O3/O4 Mini as a significant leap, possibly the most exciting since the original ChatGPT release, particularly in terms of practical usefulness and reasoning power.
  • Tool Use is Key: The ability for models to seamlessly use tools within their reasoning chain is seen as incredibly important and impressive.
  • Still Room for Improvement: While incredibly powerful, the models aren't perfect and can still fail on certain tasks, highlighting ongoing challenges in AI development.
  • Competition is Fierce: OpenAI's releases put pressure on competitors like Google (Gemini) and Anthropic (Claude), but options like Gemini 2.5 Flash offer compelling lower-cost alternatives.