crabml is a reimplementation of GGML in the Rust programming language. The project is an active experiment that can already run inference on a Q8_0 quantized Llama 3B model. While inference is not yet optimized for speed, crabml is a promising endeavor for efficient machine learning inference.
crabml is designed with the following objectives in mind:
- Focus on inference only.
- Limit tensor operators to the bare minimum required for LLM inference.
- Achieve fast enough inference on cheap hardware.
- Use `mmap()` from day one (see the sketch after this list).
- Prioritize SIMD ahead of GPU.
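As context for the `mmap()` objective, here is a minimal, hypothetical sketch of memory-mapping a GGUF model file in Rust. It assumes the third-party `memmap2` crate and is not necessarily how crabml's loader works:

```rust
use std::fs::File;

use memmap2::Mmap; // assumed dependency: memmap2

fn main() -> std::io::Result<()> {
    // Map the model file into memory instead of copying it into a buffer;
    // the OS pages tensor data in lazily as inference touches it.
    let file = File::open("./testdata/tinyllamas-stories-15m-f32.gguf")?;
    let mmap = unsafe { Mmap::map(&file)? };

    // GGUF files begin with the 4-byte magic "GGUF"; check it before parsing.
    assert_eq!(&mmap[0..4], b"GGUF");
    println!("mapped {} bytes", mmap.len());
    Ok(())
}
```

The payoff of this design is near-instant startup: multi-gigabyte weights are never read up front, and mapped pages can be shared across processes.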
To build crabml, set the `RUSTFLAGS` environment variable to enable specific target features. For example, to enable NEON on ARM architectures, use `RUSTFLAGS="-C target-feature=+neon"`. Then build the project with the following command:

```
cargo build --release
```

This command compiles the project in release mode, which optimizes the binary for performance.
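To see why the target feature matters at compile time, here is a hedged sketch (not crabml's actual code) of a dot product whose NEON path only exists when the crate is built with `RUSTFLAGS="-C target-feature=+neon"` on an AArch64 target:

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    // This branch is compiled in only when NEON is enabled at build time.
    #[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
    unsafe {
        use std::arch::aarch64::*;
        let mut acc = vdupq_n_f32(0.0);
        let chunks = a.len() / 4;
        for i in 0..chunks {
            let va = vld1q_f32(a.as_ptr().add(i * 4));
            let vb = vld1q_f32(b.as_ptr().add(i * 4));
            acc = vfmaq_f32(acc, va, vb); // acc += va * vb, four lanes at once
        }
        let mut sum = vaddvq_f32(acc); // horizontal add across the four lanes
        for i in chunks * 4..a.len() {
            sum += a[i] * b[i]; // scalar tail for lengths not divisible by 4
        }
        return sum;
    }
    #[cfg(not(all(target_arch = "aarch64", target_feature = "neon")))]
    {
        // Portable scalar fallback when NEON is not enabled at build time.
        return a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
    }
}

fn main() {
    let a = [1.0_f32, 2.0, 3.0, 4.0, 5.0];
    let b = [5.0_f32, 4.0, 3.0, 2.0, 1.0];
    println!("{}", dot(&a, &b)); // 35
}
```

Without the flag, the `cfg` gate removes the SIMD path entirely and only the portable fallback is compiled.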
After building the project, you can run an example inference by executing the `crabml-cli` binary with appropriate arguments. For instance, to use the `tinyllamas-stories-15m-f32.gguf` model to generate text based on the prompt "captain america", execute the command below:
```
./target/release/crabml-cli \
    -m ./testdata/tinyllamas-stories-15m-f32.gguf \
    "captain america" --steps 100 \
    -t 0.8 -p 1.0
```

In this command:
- `-m` specifies the checkpoint file.
- `--steps` defines the number of tokens to generate.
- `-t` sets the temperature, which controls the randomness of the output.
- `-p` sets the top-p threshold used for nucleus sampling.
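To make the `-t` and `-p` flags concrete, here is a hedged, self-contained sketch of temperature scaling followed by top-p (nucleus) sampling. The function name and signature are hypothetical; this illustrates the general technique, not crabml's actual sampler:

```rust
/// Sample a token index from `logits`, given a uniform random value in [0, 1).
fn sample_top_p(logits: &[f32], temperature: f32, top_p: f32, rand01: f32) -> usize {
    // Temperature scaling: lower values sharpen the distribution
    // (less random), higher values flatten it (more random).
    let scaled: Vec<f32> = logits.iter().map(|&l| l / temperature).collect();

    // Softmax, subtracting the max for numerical stability.
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> =
        exps.iter().map(|&e| e / sum).enumerate().collect();

    // Top-p: keep the smallest set of tokens whose cumulative probability
    // reaches top_p, then sample only within that "nucleus".
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    let mut cum = 0.0;
    let mut cutoff = probs.len();
    for (i, &(_, p)) in probs.iter().enumerate() {
        cum += p;
        if cum >= top_p {
            cutoff = i + 1;
            break;
        }
    }
    let nucleus = &probs[..cutoff];

    // Sample proportionally within the nucleus.
    let total: f32 = nucleus.iter().map(|&(_, p)| p).sum();
    let mut r = rand01 * total;
    for &(idx, p) in nucleus {
        r -= p;
        if r <= 0.0 {
            return idx;
        }
    }
    nucleus.last().unwrap().0
}
```

Note that with `-t 0.8 -p 1.0`, as in the example command, the nucleus spans the whole vocabulary (a top-p of 1.0 disables truncation), so only the temperature reshapes the distribution.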
This project is licensed under the Apache License, Version 2.0 (LICENSE or http://www.apache.org/licenses/LICENSE-2.0).