
Did Meta Really Fake Benchmarks?
AI Generated Summary
Airdroplet AI v0.2

Meta just dropped their new Llama 4 AI models, and honestly, the whole release feels a bit weird and confusing. They've been huge pioneers in open-weight AI, basically kickstarting the revolution with the original Llama, but this time around, things are less clear-cut. The new lineup includes Scout, Maverick, and the giant Behemoth, boasting features like a massive 10 million token context window and a cool 'Mixture of Experts' architecture, but the launch has been overshadowed by controversy around benchmarks and questions about how good these models actually are compared to the competition.
Here's a breakdown of what's going on:
The Llama 4 Models & Release:
- Meta unveiled Llama 4, the next generation of their open-weight models: Scout (17B active params, 16 experts), Maverick (17B active params, 128 experts), and the massive Behemoth (288B active params, ~2T total params, used for distillation).
- The release on a Saturday (April 5th) felt rushed, possibly spurred by competitor moves like the Grok 3 API or that mysterious anonymous model, Quasar Alpha (which is definitely not Llama 4, based on testing).
- They're still 'open weight,' so you can download them, but there are new license strings attached.
- A headline feature is the gigantic 10 million token context window, designed to handle huge amounts of information.
Mixture of Experts (MOE) Explained Simply:
- Forget 'multimodal' for a sec (that's about handling text, images, etc.). The key tech here is Mixture of Experts (MOE).
- Imagine the AI model has tons of knowledge (parameters), but for any specific question (like coding), it only needs a fraction of that knowledge.
- MOE means the model has different 'expert' sections. When you ask it something, it activates only the most relevant experts (like 17 billion parameters out of potentially trillions); there's a rough sketch of this routing idea right after this list.
- This makes massive models computationally practical. Maverick having 128 experts is pretty wild, suggesting it could be good at many different things.
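To make the routing idea concrete, here's a minimal sketch of a top-k MOE layer in PyTorch. It's purely illustrative: the layer size, expert count, and top-k value are made up and have nothing to do with Llama 4's actual architecture.

```python
# Minimal, illustrative Mixture-of-Experts routing layer (not Meta's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # scores each expert per token
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_experts)]
        )
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.router(x)                        # (tokens, n_experts)
        weights, picked = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only the chosen experts run for each token -- this is why a model
        # with a huge total parameter count can stay cheap per token.
        for slot in range(self.top_k):
            for e_idx in range(len(self.experts)):
                mask = picked[:, slot] == e_idx
                if mask.any():
                    out[mask] += weights[mask, slot:slot+1] * self.experts[e_idx](x[mask])
        return out

moe = TinyMoELayer()
tokens = torch.randn(5, 64)        # 5 fake token embeddings
print(moe(tokens).shape)           # torch.Size([5, 64])
```

The thing to notice is that each token only passes through the experts the router picks for it, so most of the model's parameters sit idle on any given request.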
Benchmark Drama and Performance Questions:
- This is where things get messy. Meta claimed Scout beats models like Gemini 2.0 Flash-Lite.
- Using 'Flash-Lite' instead of the slightly better, standard 'Flash' model immediately raised eyebrows. It felt like cherry-picking a weaker competitor because maybe Scout couldn't beat the regular Flash.
- Maverick claims to beat both GPT-4o and Gemini 2.0 Flash, a bold claim considering their different strengths.
- The huge 10M token context window sounds great, but tests show Llama 4 struggles with actually retrieving information from large contexts. Scout bombed a 60k token retrieval test (11% accuracy!), and other tests show it lagging significantly behind Gemini 2.0 Flash at pulling data out of HTML (see the retrieval-test sketch after this list).
- Basically, Google's Gemini models (especially Flash) still seem way better at using large context windows effectively, even if Llama 4's number is bigger on paper.
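If you want to sanity-check long-context retrieval yourself, a crude needle-in-a-haystack test like the sketch below is usually enough to see the gap. It assumes an OpenAI-compatible endpoint serving the model; the base_url, API key, and model name are placeholders, not real values.

```python
# Crude needle-in-a-haystack check for long-context retrieval.
# base_url, api_key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://example-provider/v1", api_key="YOUR_KEY")

filler = "The sky was grey and nothing happened. " * 2000   # roughly 20k tokens of filler
needle = "The secret launch code is PELICAN-42."
haystack = filler + needle + filler

resp = client.chat.completions.create(
    model="llama-4-scout",          # placeholder model name
    messages=[
        {"role": "user",
         "content": haystack + "\n\nWhat is the secret launch code?"},
    ],
)
print(resp.choices[0].message.content)   # solid long-context recall should return PELICAN-42
```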
Did Meta Fake Benchmarks?
- There's a serious controversy brewing. The high ELO score Meta reported for Maverick on LM Arena apparently used a special 'experimental chat version' not available to the public, which looks like gaming the system.
- Worse, there are rumors (supposedly from an ex-employee) that Meta pushed teams to train the models on the benchmark test questions to inflate scores. Several people, including a VP of AI, reportedly quit over this.
- Meta strongly denies training on test sets, blaming inconsistent performance reports on third-party deployment partners needing time to optimize for the new complex MOE models.
- George Hotz (TinyGrad) is skeptical Meta would do something so obviously wrong, suggesting implementation issues are more likely the culprit for performance variance.
- It's true that MOE models are hard to deploy correctly, and partners like Groq or Together might need time to tune things.
Speed, Cost, and Market Position:
- On the plus side, Llama 4 (especially Scout) is incredibly fast when run on optimized platforms like Groq (used for T3 chat).
- But looking at price vs. performance charts (like on Artificial Analysis), Llama 4's place is unclear. Scout seems slightly worse and pricier than Gemini 2.0 Flash, while Maverick is pricier and only marginally better.
- Gemini 2.0 Flash is just so good and cheap right now, it makes it hard to see why you'd choose Llama 4 unless you absolutely need an open-weight model.
- Currently, models like o3-mini and DeepSeek R1 seem to offer the best bang for your buck in the smaller model category.
New License Restrictions:
- Meta tightened the license for Llama 4.
- Big companies (>700M monthly users) need special permission.
- You have to display 'Built with Llama' branding prominently.
- Any model you build on top must start with the name 'Llama'.
- These changes add friction and might make companies hesitant to invest heavily in the Llama ecosystem.
Overall Feeling:
- Right now, Llama 4 feels confusing and underwhelming. It doesn't seem groundbreaking like previous Llama releases.
- While Meta is still championing open source (like Google with Android vs. Apple's iOS), they seem to be struggling to keep up with competitors technically and are hurting their own image with the benchmark issues and license changes.
- There's still hope that performance will improve as implementations mature and that the upcoming Behemoth model might impress.
- But for now, Llama 4 doesn't feel like the future of AI. You can try Scout and Maverick on T3 chat to see for yourself.