CWIC is a new method for creating efficient transformers that automatically decide when to use more or less compute.
⭐ CWIC makes models faster, more efficient, and more interpretable!
📖 See more visual results on our project page
Stats

- CWIC yields a 3x increase in CPU throughput with only a 10% reduction in benchmark performance.
- CWIC uses a different amount of compute for each token, making task difficulty interpretable.
- CWIC directly optimizes compute as a loss function, and learns to budget compute without labelled data or hand-crafted heuristics.
- The CWIC architecture uses learned activation thresholds and expressive sparsity patterns to enable adaptive computation (see the sketch below).
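As a rough illustration of what this means (hypothetical PyTorch code, not the CWIC implementation; the function `threshold_gated_linear` and its fixed threshold are made up for this sketch), a threshold-gated linear layer drops low-magnitude activations, so each token ends up touching a different number of weight parameters:

```python
import torch

def threshold_gated_linear(x, weight, threshold):
    """Toy threshold-gated linear layer (illustrative, not the CWIC kernel).

    x:         (seq, d_in) activations, one row per token
    weight:    (d_out, d_in) dense weight matrix
    threshold: scalar cutoff; in CWIC thresholds are learned, here it is fixed
    """
    mask = x.abs() > threshold        # which input features survive, per token
    x_sparse = x * mask               # zero out small activations
    y = x_sparse @ weight.T           # a sparse kernel would skip the masked weight columns
    # per-token count of weight parameters that actually contributed to the output
    active_params = mask.sum(dim=-1) * weight.shape[0]
    return y, active_params

torch.manual_seed(0)
x = torch.randn(4, 16)                # 4 tokens, 16 features
w = torch.randn(32, 16)
y, active = threshold_gated_linear(x, w, threshold=0.5)
print(active)                         # each token activates a different number of parameters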
Made with ❤️ by Cyris, Adam and Niveditha at Crystal Computing Corp.
- [Aug 25, 2025] torch training code is ready!
- [Aug 22, 2025] torch inference code and huggingface weights are released!
- [Aug 18, 2025] blog & jax training code are released!
- Clone Repo

  ```bash
  git clone https://github.com/crystal-ai-org/cwic
  cd cwic
  ```

- Install Pixi to manage the Python environment

  ```bash
  curl -fsSL https://pixi.sh/install.sh | sh
  source ~/.bashrc # make sure pixi is on your PATH
  ```

- Install Python Dependencies

  ```bash
  pixi shell
  ```

```bash
python cwic/chat_with_it.py
```

This will let you chat with one of our Pretrained Models and get output highlighted based on the number of active parameters used! Note: this might take a few minutes the first time you run it, as it will download Hugging Face checkpoints.
The model output is highlighted to indicate the amount of compute spent on each token: darker blue indicates more compute, lighter yellow indicates less. Here is an example from a run where we requested Python code for the Fibonacci sequence:
(On macOS, it's recommended to run this script in a code editor terminal, Ghostty, or iTerm; the default Terminal.app doesn't seem to display the highlights.)
```bash
wandb login
# gcloud auth application-default login --no-launch-browser # for saving checkpoints to google cloud

# CWIC-train a huggingface model directly
python cwic/train.py

# or train using the original jax code for the paper
# python cwic/cwic_scripts/train_cwic.py
```

Pretrained models are uploaded to Hugging Face, finetuned on up to 1.3B tokens of crystal-ai/chat-compilation-benchmark-5x-Llama-3.2-Instruct-Shuffled.
The models will be downloaded automatically by the generation script below.
```bash
python cwic/chat_with_it.py
```

| Model | Parameters | Avg. Active Parameters | Avg. Reduction | Training Tokens |
|---|---|---|---|---|
| crystal-ai/CWICLlama-3.2-1B-A620M-Instruct | 1.2B | 620M | 2x | 0.26B |
| crystal-ai/CWICLlama-3.2-1B-A413M-Instruct | 1.2B | 413M | 3x | 0.52B |
| crystal-ai/CWICLlama-3.2-1B-A310M-Instruct | 1.2B | 310M | 4x | 0.78B |
| crystal-ai/CWICLlama-3.2-1B-A248M-Instruct | 1.2B | 248M | 5x | 1.04B |
| crystal-ai/CWICLlama-3.2-1B-A206M-Instruct | 1.2B | 206M | 6x | 1.30B |
Note: these are base models trained for only 1.3B tokens, without any form of downstream modification (instruction tuning, etc.). Performance is expected to be comparable to or better than other sparsity methods trained on similar data, but may not always beat dedicated small models such as SmolLM, which are trained on trillions of tokens.
Large language models have become ubiquitous tools for natural language tasks. However, LLM inference requirements have grown beyond consumer devices and drive massive industry hardware expenses. For many applications, especially agentic ones, inference speed and cost are critical bottlenecks for real world deployment.
Therefore, many methods have been proposed to improve LLM inference efficiency. These include quantization, pruning, and sparse Mixture of Experts (MoE). Activation sparsity, the category in which CWIC falls, is another such approach. It focuses on removing small and inconsequential activations from the inputs of matrix multiplications, allowing some computations to be skipped without affecting the model's output.
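As a toy illustration of this idea (not code from CWIC or any of the methods cited below), the snippet zeroes the smallest-magnitude inputs to a matrix multiplication and reports how many multiplications could be skipped and how much the output moves; a real sparse kernel would simply never read the corresponding weight columns:

```python
import torch

torch.manual_seed(0)
x = torch.randn(4096)                      # activations entering a linear layer
W = torch.randn(4096, 4096) / 4096 ** 0.5  # toy weight matrix

# magnitude thresholding: keep only the ~50% largest-magnitude activations
threshold = x.abs().median()
x_sparse = torch.where(x.abs() > threshold, x, torch.zeros_like(x))

dense = W @ x
sparse = W @ x_sparse                      # a sparse kernel never reads the zeroed columns of W

skipped = (x_sparse == 0).float().mean().item()
rel_err = ((dense - sparse).norm() / dense.norm()).item()
print(f"skipped {skipped:.0%} of multiplications, relative output change {rel_err:.3f}")
```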
One of the earliest activation sparsity methods for LLMs was Relufication, which inserted ReLU activation functions into LLMs to induce sparsity. ProSparse further increased sparsity by adding an L1 penalty to the ReLU activations. Deja Vu and ShadowLLM predicted sparsity on the fly by training small auxiliary MLPs. Q-Sparse discarded all but the top-K largest activations, and demonstrated a sparse scaling law where larger models are more robust to sparsity.
Most similar to our work are CATS, TEAL, and R-Sparse. These methods all remove activations with smaller magnitude than a threshold. However, none of these methods directly learns activation thresholds. Furthermore, they suffer from performance collapse at high sparsity levels. CWIC addresses both limitations.
- Learned parameters perform better than heuristically chosen ones. The often-quoted "bitter lesson" states that general learning methods have historically outperformed hand-crafted approaches. We noticed that previous activation sparsity methods like TEAL (block-wise greedy optimization) and R-Sparse (a search algorithm) used heuristics to determine activation thresholds. We hypothesized that learning thresholds directly through backpropagation would lead to better results (see the sketch after this list).
- Adaptive computation methods with higher combinatorial expressiveness perform better. This was theorized and demonstrated by DeepSeekMoE, which improved over previous MoE methods by increasing the number of experts to choose from. We posited that the same principle would apply to activation sparsity: sparsity patterns with higher flexibility than the standard column pattern would yield better performance.
- Different parameters should have different sparsity levels. This insight was drawn from our own preliminary experiments. We found that, among other patterns, the Q attention matrix was more robust to sparsity than the K and V matrices. This reveals a limitation of methods like CATS and Q-Sparse, which use the same sparsity level for every parameter. While the sparsity level of each parameter could be tuned manually, we wanted to automate this by making sparsity thresholds learnable.
- Easier problems should require less compute. As discussed in Dynamic Routing in MoE Models, it is intuitively obvious that some outputs should have simpler derivations and therefore need less compute. This is exemplified by GPT-5, which routes some inputs to a less costly model. We wanted to see if a sparsely activated model could learn this behavior on its own.
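To make the first and third hypotheses concrete, here is a minimal sketch of learning a per-matrix activation threshold by backpropagation. It is hypothetical code, not the CWIC objective: the sigmoid surrogate, its temperature, and the compute-penalty weight are illustrative choices. A straight-through estimator passes gradients through the hard mask, and an explicit compute term in the loss lets each matrix settle at its own sparsity level.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedThresholdLinear(nn.Module):
    """Linear layer with a trainable activation threshold (illustrative sketch only).

    The hard mask |x| > t is not differentiable, so a straight-through estimator
    is used: the forward pass applies the hard mask, while the backward pass
    routes gradients through a sigmoid surrogate so the threshold can be learned.
    """
    def __init__(self, d_in, d_out):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.log_threshold = nn.Parameter(torch.tensor(-2.0))  # one learned threshold per matrix

    def forward(self, x):
        t = self.log_threshold.exp()
        hard = (x.abs() > t).float()               # mask actually applied in the forward pass
        soft = torch.sigmoid((x.abs() - t) / 0.1)  # differentiable surrogate for the backward pass
        gate = hard + soft - soft.detach()         # straight-through estimator
        self.active_frac = gate.mean()             # fraction of features (and weight columns) used
        return self.linear(x * gate)

# Toy objective: task loss plus a penalty on compute, so the threshold is optimized directly.
torch.manual_seed(0)
layer = LearnedThresholdLinear(64, 64)
opt = torch.optim.Adam(layer.parameters(), lr=1e-2)
x = torch.randn(256, 64)
target = x @ (torch.randn(64, 64) / 8)             # arbitrary regression target, scaled to keep the loss O(1)
for _ in range(200):
    loss = F.mse_loss(layer(x), target) + 0.1 * layer.active_frac
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"learned threshold: {layer.log_threshold.exp().item():.3f}, "
      f"active fraction: {layer.active_frac.item():.2f}")
```

In this sketch the penalty weight acts as a compute-budget knob: raising it trades task accuracy for a smaller fraction of active parameters, and each weight matrix can arrive at its own threshold.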