Fix for cohere plus #650

Merged
awni merged 2 commits into main from cohere_plus
Apr 5, 2024

Conversation

@awni

@awni awni commented Apr 4, 2024

Member

Use the qk norm param to work with cohere plus.
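
For readers unfamiliar with the term, here is a minimal sketch of what a query/key normalization flag typically gates. It is not the code from this PR: the key name `use_qk_norm`, the helper names, and the choice of a parameter-free layer norm per head are assumptions for illustration, and it is written in plain Python rather than MLX.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one head vector to zero mean and unit variance (no learned scale)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def prepare_qk(q_heads, k_heads, config):
    """Normalize queries and keys per head only when the config flag is set.

    `config` is a hypothetical dict standing in for the model config;
    `use_qk_norm` is an assumed key name.
    """
    if config.get("use_qk_norm", False):
        q_heads = [layer_norm(h) for h in q_heads]
        k_heads = [layer_norm(h) for h in k_heads]
    return q_heads, k_heads

# Two heads of dimension 4, with made-up values.
q = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.5, 9.5, 10.0]]
k = [[0.5, -0.5, 1.5, -1.5], [2.0, 4.0, 6.0, 8.0]]

q_plain, k_plain = prepare_qk(q, k, {"use_qk_norm": False})  # passed through unchanged
q_norm, k_norm = prepare_qk(q, k, {"use_qk_norm": True})     # each head normalized
```

The point of the sketch is only that the normalization is applied per head, to queries and keys, before attention scores are computed, and that a model without the flag takes the unchanged path.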

Machine setting:

sudo sysctl iogpu.wired_lwm_mb=100000

Command for generation:

python -m mlx_lm.generate --model mlx-community/c4ai-command-r-plus-4bit --prompt "Write a quicksort in c++" --temp 0.0 --max-tokens 256 --use-default-chat-template

Command for QLoRA:

python -m mlx_lm.lora --model mlx-community/c4ai-command-r-plus-4bit --data ../lora/data --train --iters 1000  --batch-size 1 --lora-layers 16
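
If you want to launch these two commands from a script, one possible sketch is to build them as argument lists. The function names are hypothetical, and the flags are copied verbatim from the commands above; whether your installed MLX LM version accepts them is not checked here.

```python
import shlex

MODEL = "mlx-community/c4ai-command-r-plus-4bit"

def generate_command(prompt, temp=0.0, max_tokens=256, model=MODEL):
    """Argument list for the generation command quoted in this PR."""
    return [
        "python", "-m", "mlx_lm.generate",
        "--model", model,
        "--prompt", prompt,
        "--temp", str(temp),
        "--max-tokens", str(max_tokens),
        "--use-default-chat-template",
    ]

def lora_command(data_dir, iters=1000, batch_size=1, lora_layers=16, model=MODEL):
    """Argument list for the QLoRA command quoted in this PR."""
    return [
        "python", "-m", "mlx_lm.lora",
        "--model", model,
        "--data", data_dir,
        "--train",
        "--iters", str(iters),
        "--batch-size", str(batch_size),
        "--lora-layers", str(lora_layers),
    ]

gen = generate_command("Write a quicksort in c++")
lora = lora_command("../lora/data")

# Print shell-quoted versions for inspection.
print(shlex.join(gen))
print(shlex.join(lora))

# To actually run one of them:
#   import subprocess
#   subprocess.run(gen, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with prompts that contain spaces or quotes.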

@Blaizzy

Blaizzy commented Apr 4, 2024

Contributor

I was about to submit a PR; glad I checked first 😄.

Already uploaded the model to the hub.
https://huggingface.co/mlx-community/c4ai-command-r-plus-4bit

@DenisSergeevitch

@Blaizzy Thank you! How much RAM does it require to run 4bit q?

@awni

awni commented Apr 5, 2024

Member Author

It needs about 65GB to generate with 4-bit. Generation is slow right now, though; I'm trying to debug the performance issue.
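
As a rough sanity check on that figure, here is a back-of-the-envelope estimate of the weight storage alone. The inputs are assumptions not stated in this thread: a parameter count of about 104 billion, a quantization group size of 64, and one 16-bit scale plus one 16-bit bias per group. It also assumes every parameter is quantized the same way, which is a simplification.

```python
def quantized_weight_gb(n_params, bits=4, group_size=64, scale_bits=16, bias_bits=16):
    """Estimate storage for group-quantized weights, in decimal gigabytes."""
    # Each group of `group_size` weights carries one scale and one bias.
    overhead_bits_per_weight = (scale_bits + bias_bits) / group_size
    bits_per_weight = bits + overhead_bits_per_weight
    return n_params * bits_per_weight / 8 / 1e9

# 104e9 parameters is an assumed figure for this model.
weights_gb = quantized_weight_gb(104e9)
print(f"{weights_gb:.1f} GB for weights alone")
```

Under these assumptions the weights come to roughly 58.5 GB, so a total of about 65GB once the KV cache, activations, and runtime overhead are included is plausible.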

@Blaizzy

Blaizzy commented Apr 5, 2024

Contributor

@Blaizzy Thank you! How much RAM does it require to run 4bit q?

@DenisSergeevitch, as @awni said 👆🏽.

I can't run it myself; I use an M1 Air with 16GB :)

@DenisSergeevitch

Thank you, I will wait for i_q1 then

@awni

awni commented Apr 5, 2024

Member Author

Btw to get this to run reasonably fast on an M2 Ultra you need to set the wired GPU memory lower limit appropriately. Something like:

sudo sysctl iogpu.wired_lwm_mb=100000
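
A small sketch of how one might build that command with a basic sanity check follows. The helper name and the check against total RAM are hypothetical additions for illustration, not something MLX or macOS provides; the 192 GB used in the example is likewise just an example value.

```python
def wired_limit_command(limit_mb, total_ram_gb):
    """Build the sysctl command, refusing limits that exceed physical RAM.

    The guard is only an illustrative safety check; choosing a good value
    still depends on the model size and what else is running.
    """
    total_mb = total_ram_gb * 1024
    if not 0 < limit_mb < total_mb:
        raise ValueError(
            f"limit of {limit_mb} MB is outside (0, {total_mb}) MB for this machine"
        )
    return ["sudo", "sysctl", f"iogpu.wired_lwm_mb={limit_mb}"]

# Example: the value from this thread on a hypothetical 192 GB machine.
cmd = wired_limit_command(100000, total_ram_gb=192)
print(" ".join(cmd))
```

The setting is changed with `sudo sysctl`, so the sketch only prints the command instead of running it.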

@angeloskath angeloskath left a comment

Member

Thanks!

@awni awni merged commit c386dd5 into main Apr 5, 2024
@awni awni deleted the cohere_plus branch April 5, 2024 21:11
@jeanromainroy

I am running the 4-bit version of Command-R-Plus and I consistently see the GPU usage dropping during generation and performance becoming abysmal.

[Screenshot, 2024-04-08 at 10:10:21 PM]

My machine is an M2 Ultra with 192GB, running:

ProductName: macOS
ProductVersion: 14.3
BuildVersion: 23D56

@Blaizzy

Blaizzy commented Apr 9, 2024

Contributor

@awni 👆🏽

@awni

awni commented Apr 9, 2024

Member Author

@jeanromainroy did you set the memory limits? You could try making it larger:

sudo sysctl iogpu.wired_lwm_mb=150000

@jeanromainroy

jeanromainroy commented Apr 9, 2024


Even after setting this,

sudo sysctl iogpu.wired_lwm_mb=150000

I still see the GPU usage dropping before the completion ends.

[Screenshot, 2024-04-09 at 12:34:55 PM]

@awni

awni commented Apr 9, 2024

Member Author

Do you mind opening an issue and including the command, the versions of MLX / MLX LM, the OS, etc.?
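
One way to gather that information for an issue report is sketched below. The distribution names `mlx` and `mlx-lm` are assumed to be the installed package names, and the function name is hypothetical.

```python
import platform
from importlib import metadata

def collect_versions(packages=("mlx", "mlx-lm")):
    """Return OS, Python, and package versions as a dict for pasting into an issue."""
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of failing.
            info[name] = "not installed"
    return info

report = collect_versions()
for key, value in report.items():
    print(f"{key}: {value}")
```

Pasting this output together with the exact command that was run gives the details asked for here.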
