Feature Request: llama 4 #12774

gulldan · 2025-04-05T19:17:29Z

Prerequisites

I am running the latest code. Mention the version if possible as well.
I carefully followed the README.md.
I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

llama 4 released
tech rep
https://ai.meta.com/blog/llama-4-multimodal-intelligence/
weights
https://www.llama.com/llama4/

https://huggingface.co/collections/meta-llama/llama-4-67f0c30d9fe03840bc9d0164

Motivation

great to be use a multimodal LLM

Possible Implementation

No response

Downtown-Case · 2025-04-05T19:43:36Z

Still waiting for access. How's the architecture different from llama 3.3?

gulldan · 2025-04-05T19:59:35Z

details will be on LlamaCon on April 29, guess...

@stalkermustang posted on TG https://t.me/seeallochnaya/2496

— The main emphasis is that the models are much better at multimodality (understanding images—even several at once), and that this is just the beginning. META will have LLAMACon at the end of April, and they might show even more models there, including reasoning ones.

— Llama 4 Scout, the “small” model with 109 billion parameters, has only 17 active (so it will be faster than, say, Gemma 3 27b). They say it can even run on a single GPU with 80 GB in 4-bit mode, but that’s a complete perversion. There is no “community” version of the small model.

— Llama 4 Maverick, the medium version (also with 17 billion active parameters, but with more experts and therefore more weights—400B) received an Elo rating of 1417 on LMSYS Arena. This is second place, above GPT-4.5, but below Gemini 2.5 Pro. However, this is without taking Style Control into account, and the leaderboard hasn’t been updated yet, so we’ll assess it a little later. The Maverick model is optimized to run on a single H100 DGX node (8 GPUs).

— Llama 4 Behemoth, the huge model with 2 trillion parameters, is still in training; it hasn’t been released yet but is planned for the future. It was used as a teacher when training the smaller Scout and Maverick models, which is why they turned out so powerful for their size. Without Behemoth, such quality wouldn’t have been achieved (the same applies to Claude Opus, which “doesn’t exist”, Gemini Ultra, which “doesn’t exist”, and GPT-4.5, which exists, but for some reason people are worried about its cost and speed 😀).

— For image processing, the approach has changed—they now use early fusion (if you don’t know what that is, it’s okay).

— For Llama 4’s training data, they added 10 times more tokens from languages other than English. The dataset now totals around 30 trillion tokens (twice as many as before). In total, there are more than 200 languages, 100 of which have at least 1 billion tokens.

— Behemoth is being trained on only 32k GPUs, but with FP8.

— Llama 4 Scout was initially trained with a 256k-token context, which was later expanded to 10M. They use a modified version of RoPE with some insights from this article. 10M tokens allow it to process ~20 hours of video.

— The long-context metrics were measured, among other things, on the MTOB benchmark, “translation by one book” (as explained here, TL;DR: a language that is almost undocumented, but linguists have worked on it; the LLM is given a book and asked to translate it—it's important to be able to read an entire book), and it performed better than Gemini 2.0 Flash Lite, but apparently worse than Flash (since Flash wasn’t measured).

— Fine-tuning Behemoth is a very challenging engineering task; META is boasting about their new framework, which significantly speeds up the process (almost 10 times faster). Interestingly, while for the small models they discarded 50% of the SFT datasets, for Behemoth they discarded 95%—keeping only the highest quality. In such a configuration, it turned out both efficient (since the training cycle is shorter) and better (because the models are given only the best quality data).

— Mark confirmed that reasoning models will be announced at LLAMACon at the end of April.

Downtown-Case · 2025-04-05T20:14:17Z

An interesting bit from their blog:

A key innovation in the Llama 4 architecture is the use of interleaved attention layers without positional embeddings. Additionally, we employ inference time temperature scaling of attention to enhance length generalization. We call this the iRoPE architecture, where “i” stands for “interleaved” attention layers, highlighting the long-term goal of supporting “infinite” context length, and “RoPE” refers to the rotary position embeddings employed in most layers.

Sounds like that would shink the K/V cache, no?

Also, base 256K is crazy by itself.

joseph777111 · 2025-04-05T20:29:25Z

https://github.com/meta-llama/llama-models/tree/main/models/llama4

etemiz · 2025-04-05T22:00:02Z

unsloth fork here https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct

ngxson · 2025-04-05T22:58:20Z

From what I understand from the HF transformers code, the difference between llama 3 vs 4 are:

chunked attention is enabled with a layer pattern of (3 chunked, 1 non-chunk), see this LOC and this visualization ==> the mask for this "chunked" attention can be a bit tricky to add
not all layers are MoE, some are dense
no RoPE on dense layer, see this LOC ==> simply add an if condition to toggle ggml_rope_ext
some layer norms are added (like qk_norm), but this can be added easily
attn_temperature_tuning
For MoE FFN: Router top-k takes the raw logits instead of the sigmoid-ed probs ; the sigmoid-ed probs (a.k.a router weights) is applied before gate activation (as opposed to normal moe ffn, which applies this after the activation)

So overall I think the most complicated part would be to support the chunked attn mask

Edit: yes I was missing the attn_temperature_tuning, though I think currently it's not working correctly on transformers

RefractAI · 2025-04-06T00:45:56Z

So overall I think the most complicated part would be to support the chunked attn mask

ml-explore/mlx-lm#74
As per mlx, it should be possible to get it working up to the chunk size (8192) without needing to implement the chunked attention.

jacekpoplawski · 2025-04-06T14:52:47Z

I was thinking maybe we should dynamically load experts, do you think it could work?
#12781

gulldan added the enhancement New feature or request label Apr 5, 2025

ngxson mentioned this issue Apr 7, 2025

llama : Support llama 4 text-only #12791

Merged

3 tasks

ngxson closed this as completed in #12791 Apr 7, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature Request: llama 4 #12774

Feature Request: llama 4 #12774

gulldan commented Apr 5, 2025 •

edited

Loading

Downtown-Case commented Apr 5, 2025

gulldan commented Apr 5, 2025

Downtown-Case commented Apr 5, 2025 •

edited

Loading

joseph777111 commented Apr 5, 2025

etemiz commented Apr 5, 2025

ngxson commented Apr 5, 2025 •

edited

Loading

RefractAI commented Apr 6, 2025

jacekpoplawski commented Apr 6, 2025

Feature Request: llama 4 #12774

Feature Request: llama 4 #12774

Comments

gulldan commented Apr 5, 2025 • edited Loading

Prerequisites

Feature Description

Motivation

Possible Implementation

Downtown-Case commented Apr 5, 2025

gulldan commented Apr 5, 2025

Downtown-Case commented Apr 5, 2025 • edited Loading

joseph777111 commented Apr 5, 2025

etemiz commented Apr 5, 2025

ngxson commented Apr 5, 2025 • edited Loading

RefractAI commented Apr 6, 2025

jacekpoplawski commented Apr 6, 2025

gulldan commented Apr 5, 2025 •

edited

Loading

Downtown-Case commented Apr 5, 2025 •

edited

Loading

ngxson commented Apr 5, 2025 •

edited

Loading