Feature Request: llama 4 #12774

Closed
4 tasks done
gulldan opened this issue Apr 5, 2025 · 8 comments · Fixed by #12791
Labels
enhancement New feature or request

Comments

@gulldan

gulldan commented Apr 5, 2025

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Llama 4 has been released.
Tech report: https://ai.meta.com/blog/llama-4-multimodal-intelligence/
Weights: https://www.llama.com/llama4/

Hugging Face collection: https://huggingface.co/collections/meta-llama/llama-4-67f0c30d9fe03840bc9d0164

Motivation

It would be great to use a multimodal LLM.

Possible Implementation

No response

gulldan added the enhancement (New feature or request) label on Apr 5, 2025
@Downtown-Case

Still waiting for access. How's the architecture different from llama 3.3?

@gulldan
Author

gulldan commented Apr 5, 2025

Details will probably come at LlamaCon on April 29, I guess...

@stalkermustang posted a summary on Telegram: https://t.me/seeallochnaya/2496

- The main emphasis is that the models are much better at multimodality (understanding images, even several at once), and that this is just the beginning. Meta will hold LlamaCon at the end of April, and they might show even more models there, including reasoning ones.

- Llama 4 Scout, the "small" model with 109 billion parameters, has only 17 billion active (so it will be faster than, say, Gemma 3 27B). They say it can even run on a single 80 GB GPU in 4-bit mode, but that's a complete perversion. There is no "community" version of the small model.

- Llama 4 Maverick, the medium version (also with 17 billion active parameters, but with more experts and therefore more total weights, 400B), received an Elo rating of 1417 on LMSYS Arena. This is second place, above GPT-4.5 but below Gemini 2.5 Pro. However, this is without taking Style Control into account, and the leaderboard hasn't been updated yet, so we'll assess it a little later. The Maverick model is optimized to run on a single H100 DGX node (8 GPUs).

- Llama 4 Behemoth, the huge model with 2 trillion parameters, is still in training; it hasn't been released yet but is planned for the future. It was used as a teacher when training the smaller Scout and Maverick models, which is why they turned out so powerful for their size. Without Behemoth, such quality wouldn't have been achieved (the same applies to Claude Opus, which "doesn't exist", Gemini Ultra, which "doesn't exist", and GPT-4.5, which exists, but for some reason people are worried about its cost and speed :D).

- For image processing, the approach has changed: they now use early fusion (if you don't know what that is, that's okay).

- For Llama 4's training data, they added 10 times more tokens from languages other than English. The dataset now totals around 30 trillion tokens (twice as many as before). In total, there are more than 200 languages, 100 of which have at least 1 billion tokens.

- Behemoth is being trained on only 32k GPUs, but with FP8.

- Llama 4 Scout was initially trained with a 256k-token context, which was later expanded to 10M. They use a modified version of RoPE with some insights from this article. 10M tokens is enough to process roughly 20 hours of video.

- The long-context metrics were measured, among other things, on the MTOB benchmark, "translation by one book" (as explained here; TL;DR: a language that is almost undocumented but that linguists have worked on; the LLM is given a book and asked to translate it, so it matters that the model can read an entire book), and it performed better than Gemini 2.0 Flash Lite, but apparently worse than Flash (since Flash wasn't measured).

- Fine-tuning Behemoth is a very challenging engineering task; Meta is boasting about their new framework, which significantly speeds up the process (almost 10 times faster). Interestingly, while for the small models they discarded 50% of the SFT datasets, for Behemoth they discarded 95%, keeping only the highest quality. In this configuration, training turned out both more efficient (the training cycle is shorter) and better (the model sees only the best-quality data).

- Mark confirmed that reasoning models will be announced at LlamaCon at the end of April.

@Downtown-Case

Downtown-Case commented Apr 5, 2025

An interesting bit from their blog:

A key innovation in the Llama 4 architecture is the use of interleaved attention layers without positional embeddings. Additionally, we employ inference time temperature scaling of attention to enhance length generalization. We call this the iRoPE architecture, where "i" stands for "interleaved" attention layers, highlighting the long-term goal of supporting "infinite" context length, and "RoPE" refers to the rotary position embeddings employed in most layers.

Sounds like that would shrink the K/V cache, no?

Also, base 256K is crazy by itself.
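For a sense of scale: if the chunked-attention layers only ever need to keep the current chunk's K/V around, the cache for those layers stops growing with context. A back-of-the-envelope sketch, assuming a 3:1 chunked-to-global layer pattern and made-up model dimensions (these are not Llama 4's real numbers):

```python
# Hypothetical dimensions for illustration only; not Llama 4's real config.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, cached_tokens, bytes_per_elem=2):
    # K and V are both cached, hence the factor of 2; fp16 -> 2 bytes per element.
    return 2 * n_layers * n_kv_heads * head_dim * cached_tokens * bytes_per_elem

ctx       = 1_000_000           # target context length
chunk     = 8192                # local/chunked attention window
n_layers  = 48                  # made-up layer count
n_chunked = n_layers * 3 // 4   # 3 out of every 4 layers use chunked attention
n_global  = n_layers - n_chunked

all_global  = kv_cache_bytes(n_layers, 8, 128, ctx)
interleaved = (kv_cache_bytes(n_chunked, 8, 128, min(ctx, chunk))
               + kv_cache_bytes(n_global, 8, 128, ctx))

print(f"all layers global:  {all_global / 2**30:.1f} GiB")
print(f"3:1 chunked/global: {interleaved / 2**30:.1f} GiB")
```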

@etemiz

etemiz commented Apr 5, 2025

unsloth fork here https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct

@ngxson
Collaborator

ngxson commented Apr 5, 2025

From what I understand of the HF transformers code, the differences between Llama 3 and Llama 4 are:

  • chunked attention is enabled with a layer pattern of (3 chunked, 1 non-chunked), see this LOC and this visualization ==> the mask for this "chunked" attention can be a bit tricky to add
  • not all layers are MoE; some are dense
  • no RoPE on the dense layers, see this LOC ==> simply add an if condition to toggle ggml_rope_ext
  • some layer norms are added (like qk_norm), but these can be added easily
  • attn_temperature_tuning
  • for the MoE FFN: the router top-k takes the raw logits instead of the sigmoid-ed probs; the sigmoid-ed probs (a.k.a. router weights) are applied before the gate activation (as opposed to a normal MoE FFN, which applies this after the activation); see the routing sketch after this list
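A minimal NumPy sketch of that last routing point, as I read it from the HF transformers code; the shapes, names, and top_k value here are illustrative assumptions, not llama.cpp or transformers code:

```python
import numpy as np

def moe_ffn_llama4_style(x, router_w, experts, top_k=1):
    """Sketch of the routing order described above (hypothetical shapes).

    x:        (n_tokens, d_model) hidden states
    router_w: (d_model, n_experts) router weight matrix
    experts:  list of callables, each mapping a (d_model,) vector to a (d_model,) vector
    """
    logits = x @ router_w                              # raw router logits
    top_idx = np.argsort(-logits, axis=-1)[:, :top_k]  # top-k on raw logits, NOT on sigmoid(logits)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in top_idx[t]:
            w = 1.0 / (1.0 + np.exp(-logits[t, e]))    # sigmoid-ed prob = router weight
            # the router weight scales the expert's *input*, i.e. it is applied
            # before the gate activation inside the expert FFN, not after it
            out[t] += experts[e](w * x[t])
    return out
```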

So overall I think the most complicated part would be to support the chunked attn mask
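For what it's worth, a tiny standalone sketch (toy sizes, my own illustration, not llama.cpp's mask code) of what the chunked causal mask means: a token may attend only to earlier tokens within its own chunk:

```python
import numpy as np

def chunked_causal_mask(n_tokens, chunk_size):
    """True where attention is allowed: causal AND within the same chunk."""
    pos = np.arange(n_tokens)
    causal = pos[None, :] <= pos[:, None]                                 # key <= query
    same_chunk = (pos[None, :] // chunk_size) == (pos[:, None] // chunk_size)
    return causal & same_chunk

# Toy example with chunk_size = 4: tokens 4..7 cannot attend to tokens 0..3.
print(chunked_causal_mask(8, 4).astype(int))
```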

Edit: yes I was missing the attn_temperature_tuning, though I think currently it's not working correctly on transformers
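On attn_temperature_tuning: as far as I can tell from the HF code, it scales the query (on the layers without RoPE) by a factor that grows logarithmically with token position. A rough sketch; the exact form and the attn_scale / floor_scale values here are assumptions for illustration:

```python
import math

def attn_temperature_scale(position, attn_scale=0.1, floor_scale=8192.0):
    # Assumed form and constants, for illustration only: the scale is 1.0 for
    # early positions and grows slowly (logarithmically) with position, which
    # is meant to help length generalization at very long contexts.
    return math.log(math.floor((position + 1) / floor_scale) + 1.0) * attn_scale + 1.0

for pos in (0, 10_000, 1_000_000, 10_000_000):
    print(pos, round(attn_temperature_scale(pos), 3))
```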

@RefractAI
Contributor

So overall I think the most complicated part would be to support the chunked attn mask

As per the MLX implementation (ml-explore/mlx-lm#74), it should be possible to get it working up to the chunk size (8192) without needing to implement the chunked attention.
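That fits the mask sketched above: for sequences no longer than one chunk, every token falls in the same chunk, so the chunked mask degenerates to a plain causal mask. A quick standalone check with toy sizes:

```python
import numpy as np

n_tokens, chunk_size = 256, 8192   # any sequence length <= one chunk
pos = np.arange(n_tokens)

causal  = pos[None, :] <= pos[:, None]
chunked = causal & ((pos[None, :] // chunk_size) == (pos[:, None] // chunk_size))

# With n_tokens <= chunk_size every token falls into chunk 0, so the chunk
# condition is always true and the chunked mask equals the plain causal mask.
assert np.array_equal(chunked, causal)
print("chunked mask == causal mask for sequences up to one chunk")
```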

@jacekpoplawski

I was thinking we could maybe load experts dynamically; do you think that could work? See #12781.
