
LLaMA 4 is HERE! Meta Just COOKED
Channel: Matthew Berman · Published: April 6th, 2025 · AI Score: 98
AI Generated Summary
Airdroplet AI v0.2
Meta just dropped the Llama 4 series of AI models, and it's a pretty big deal, especially the massive 10 million token context window on one of the models. These new models come in three sizes (Scout, Maverick, and the upcoming Behemoth), are all multimodal (meaning they handle text, images, etc.), and use a Mixture of Experts (MoE) architecture.
Here’s the breakdown of what was discussed:
Llama 4 Models Announced:
- General:
- Meta announced three new Llama 4 models: Scout, Maverick, and Behemoth.
- All three are multimodal, capable of processing inputs and outputs beyond just text, including images.
- They all use a Mixture of Experts (MoE) architecture: a router sends each token only to the most relevant 'expert' sub-networks, so only a fraction of the total parameters does work on any given input (a minimal routing sketch follows this list).
- Currently, these are not described as "thinking" models, but the foundation is there for future improvements.
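To make the MoE idea concrete, here is a minimal routing sketch in PyTorch. The dimensions, top-1 routing, and two-layer experts are illustrative assumptions, not Llama 4's actual design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal Mixture-of-Experts layer: a learned router sends each token
    to one expert, so only a fraction of the parameters runs per token."""
    def __init__(self, d_model=512, d_hidden=2048, n_experts=16):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # scores every expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)   # routing probabilities
        top_p, top_idx = probs.max(dim=-1)          # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i                     # tokens routed to expert i
            if mask.any():
                out[mask] = top_p[mask].unsqueeze(1) * expert(x[mask])
        return out

print(MoELayer()(torch.randn(8, 512)).shape)  # torch.Size([8, 512])
```

This is how "17 billion active parameters" can coexist with 109B or 400B total: the router only activates a small slice of the network for each token.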
- Llama 4 Scout:
- The "smallest" model, though still quite large at 109 billion total parameters.
- Has 17 billion active parameters and 16 experts.
- Features an absolutely insane 10 million token context window. This dwarfs the previous leader (Gemini at 2 million) and is described by Meta folks as "nearly infinite".
- This massive context window unlocks tons of new possibilities, especially for enterprise use cases analyzing huge amounts of data.
- It's efficient enough to fit on a single NVIDIA H100 GPU.
- Benchmarks show it outperforms models like Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1, although the presenter notes these competitor models have smaller total parameter counts, making the comparison somewhat apples-to-oranges.
- It performs exceptionally well on the "Needle in the Haystack" test, reliably retrieving planted information from a 10 million token text block (a sketch of how such a test is built follows this subsection).
- It also shows strong performance understanding video content, tested up to 20 hours in length.
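A needle-in-the-haystack test simply buries a unique fact at a random depth in filler text and checks whether the model can retrieve it. Here's a rough sketch of the setup, using word count as a crude stand-in for tokens; `query_model` is a hypothetical placeholder for whatever API actually serves the model:

```python
import random

def build_haystack(n_words: int, needle: str, filler: str) -> str:
    """Bury a unique fact (the 'needle') at a random depth in repeated filler."""
    words = (filler.split() * (n_words // len(filler.split()) + 1))[:n_words]
    words.insert(random.randint(0, len(words)), needle)
    return " ".join(words)

needle = "The secret launch code is AZURE-HORIZON-42."
prompt = build_haystack(10_000, needle, "Llama models can read very long documents.")
question = "\n\nWhat is the secret launch code? Reply with just the code."

# answer = query_model(prompt + question)   # hypothetical model call
# passed = "AZURE-HORIZON-42" in answer     # score by exact recall of the needle
```

Real evaluations repeat this across many haystack lengths and needle depths, which is what Scout's no-failure result at 10M tokens implies.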
- Llama 4 Maverick:
- Larger than Scout, with 400 billion total parameters.
- It has 17 billion active parameters but utilizes 128 experts.
- Comes with a 1 million token context window, which is expected to grow.
- Benchmark results show it beating GPT-4o and Gemini 2.0 Flash across the board.
- It achieves results comparable to the new DeepSeek V3 in reasoning and coding, but uses less than half the active parameters.
- A major highlight is its cost-effectiveness. It's estimated to cost around 19-49 cents per million tokens (a blend of input and output pricing), significantly cheaper than competitors like GPT-4o ($4.38); the blended-cost arithmetic is sketched after this subsection.
- It currently holds the #2 spot on the LM Arena leaderboard for chat models, just behind Gemini 2.5 Pro.
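Blended per-million-token figures like these assume a traffic mix between input and output tokens. A quick sketch of the arithmetic: the 3:1 input-to-output ratio is a common convention (and the one under which GPT-4o's $2.50/$10.00 list prices reproduce the video's $4.38); the Maverick input/output split shown is hypothetical, chosen only to land inside the quoted range:

```python
def blended_cost(input_price: float, output_price: float,
                 input_share: float = 0.75) -> float:
    """Blended $/1M tokens, assuming a 3:1 input:output traffic mix by default."""
    return input_price * input_share + output_price * (1 - input_share)

print(f"GPT-4o:   ${blended_cost(2.50, 10.00):.2f}/1M tokens")  # $4.38, matching the video
print(f"Maverick: ${blended_cost(0.12, 0.80):.2f}/1M tokens")   # $0.29, inside the $0.19-$0.49 range
```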
- Llama 4 Behemoth:
- This model is announced but not yet released – it's still "baking" (training).
- It's a truly massive, frontier-grade model with 2 trillion total parameters.
- It has 288 billion active parameters and 16 experts.
- This giant model was actually used as the 'teacher' to train, or 'distill', the smaller Scout and Maverick models (a generic distillation sketch follows this section).
- Even while still in training, it's reported to outperform models like GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Pro on several STEM benchmarks.
- Like the others, it's not yet a "thinking" model, but adding that capability is expected to be straightforward for Meta's team.
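Meta hasn't published the exact distillation recipe, but teacher-student distillation generally means training the smaller model to match the big model's output distribution in addition to the ground-truth labels. A standard soft-label sketch; the temperature, mixing weight, and vocabulary size below are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard soft-label distillation: KL divergence between softened
    teacher/student distributions, mixed with cross-entropy on true labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps gradient scale comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# e.g. Behemoth as the frozen teacher, Scout or Maverick as the student:
student = torch.randn(4, 32_000)              # (batch, vocab) logits
teacher = torch.randn(4, 32_000)
labels = torch.randint(0, 32_000, (4,))
print(distillation_loss(student, teacher, labels))
```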
Technical Aspects & Training:
- Mixture of Experts (MoE): Llama 4 is Meta's first model series to use MoE. The presenter notes the industry trend is shifting toward "thinking" models, but these MoE models provide a strong base to build that on.
- Multilingual: Trained on 200 languages, with over 100 having more than a billion tokens each – 10 times more multilingual data than Llama 3.
- Training Efficiency: Used FP8 precision (a lower-precision number format) for training without sacrificing quality, achieving 390 TFLOPs per GPU across 32,000 GPUs for Behemoth (the aggregate-throughput arithmetic follows this list).
- Context Window Training: Scout was pre-trained and post-trained with a 256k context length, which helps it generalize to the massive 10 million token window in practice.
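For a sense of scale, multiplying the quoted per-GPU throughput across the cluster is simple arithmetic on the video's numbers:

```python
per_gpu_flops = 390e12          # 390 TFLOPs per GPU at FP8 (from the video)
n_gpus = 32_000
total = per_gpu_flops * n_gpus
print(f"{total:.3e} FLOP/s")    # 1.248e+19, i.e. ~12.5 exaFLOP/s in aggregate
```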
Reasoning Capability:
- While the current models aren't explicitly reasoning models, Meta has put up an Easter egg page (llama.com/llama4) teasing that reasoning capabilities are coming soon.
Licensing Concerns:
- The licensing remains a point of contention, similar to Llama 3.
- It's not a standard open-source license like MIT.
- Key restrictions include: companies with over 700 million monthly active users need a special license from Meta, "Built with Llama" attribution is mandatory, derived models must have names starting with "Llama", and usage must comply with Meta's Acceptable Use Policy. The presenter feels this limits the 'openness'.
Running Llama 4:
- These are very large models. AI expert Jeremy Howard points out that even the smallest (Scout) is likely too big to run effectively on current consumer GPUs (like the RTX 4090), even with quantization (reducing the numeric precision of the weights to shrink the model); the memory math after this section shows why.
- Potential for Macs: There's speculation that Macs with high amounts of unified memory (like the presenter's 96GB Mac Studio) might be suitable for running these MoE models, as the unified memory helps, and the lower compute power is less critical since only a fraction of the experts are active at once.
- Quantization might need to be aggressive (like the 1.58-bit scheme suggested by Emad Mostaque) to run them on less powerful hardware. A Meta employee hinted something is "cooking" regarding accessibility.
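A back-of-envelope check on why: weight-only memory is roughly total parameters times bits per parameter. This ignores the KV cache, activations, and runtime overhead, so real requirements are higher:

```python
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Rough weight-only footprint: params * bits / 8 bytes, in GB."""
    return params_billions * bits_per_param / 8

for name, params in [("Scout", 109), ("Maverick", 400)]:
    for bits in (16, 8, 4, 1.58):
        print(f"{name} @ {bits}-bit: ~{weight_memory_gb(params, bits):.1f} GB")
```

Even at 4-bit, Scout's weights alone are ~54 GB: well past a 24 GB RTX 4090, but within a 96 GB Mac Studio's unified memory, which is exactly why the Mac speculation and the ~1.58-bit idea (~21.5 GB) come up.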
Overall Sentiment:
- The presenter is clearly excited, calling the 10M context window "insane" and highlighting the impressive benchmarks and cost-efficiency of Maverick.
- There's huge enthusiasm for the potential these powerful, openly available (though restrictively licensed) models bring to the AI space.