Fast and lightweight LLM inference engine for mobile and edge devices
| Arm CPU | x86 CPU | Qualcomm NPU (QNN) |
|---|---|---|
MLLM-advanced is an extension of the MLLM project, offering additional features and functionality.
- Dynamic-static integrated computation graph for easy implementation of your own algorithms.
- Multi-backend support for CPU and NPU, designed for mobile devices.
- A complete graph-level IR with customizable passes, plus a QNN lowering pipeline for compiling your graph into a QNN graph.
- MobiLMCache: an edge-side LMCache with paged cache management (see the sketch below).
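The paged cache design is similar in spirit to paged attention: KV memory is carved into fixed-size pages, so sequences can grow and be evicted at page granularity instead of reallocating contiguous buffers. Below is a minimal Python sketch of that bookkeeping; the class, method names, and page size are hypothetical, not MobiLMCache's actual API.

```python
# Minimal sketch of paged KV-cache bookkeeping. NOT MobiLMCache's API;
# all names and the page size are hypothetical, for illustration only.

PAGE_SIZE = 16  # tokens per physical page (assumed)

class PagedKVCache:
    """Maps each sequence's logical token positions onto fixed-size pages."""

    def __init__(self, num_pages: int) -> None:
        self.free_pages = list(range(num_pages))     # pool of physical page ids
        self.page_tables: dict[int, list[int]] = {}  # seq_id -> page ids
        self.lengths: dict[int, int] = {}            # seq_id -> tokens stored

    def append(self, seq_id: int, num_tokens: int) -> None:
        """Grow a sequence, allocating a new page only when the last one fills."""
        length = self.lengths.get(seq_id, 0) + num_tokens
        table = self.page_tables.setdefault(seq_id, [])
        while len(table) * PAGE_SIZE < length:
            if not self.free_pages:
                raise MemoryError("cache exhausted; evict a sequence first")
            table.append(self.free_pages.pop())
        self.lengths[seq_id] = length

    def locate(self, seq_id: int, pos: int) -> tuple[int, int]:
        """Translate a logical position into (physical page id, in-page offset)."""
        return self.page_tables[seq_id][pos // PAGE_SIZE], pos % PAGE_SIZE

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's pages to the pool."""
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_pages=8)
cache.append(seq_id=0, num_tokens=20)  # spans two pages
print(cache.locate(0, 17))             # (physical page id, offset 1)
```

Freeing a finished sequence returns whole pages to the pool, so requests of very different lengths can share one fixed cache budget without per-token fragmentation.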
| Model | Quantization Methods | Backends |
|---|---|---|
| DeepSeek-Distill-Qwen-1.5B | W32A32, W4A32 | Arm-CPU |
| Qwen2VL-2B-Instruct | W32A32, W4A32 | Arm-CPU |
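In the quantization column, WxAy means x-bit weights and y-bit activations, so W4A32 quantizes only the weights and keeps activations in fp32. The sketch below shows group-wise symmetric 4-bit weight quantization; the group size, rounding, and function names are assumptions for illustration, not MLLM-advanced's exact scheme.

```python
# Illustrative group-wise symmetric 4-bit weight quantization (W4A32-style).
# NOT MLLM-advanced's exact scheme; group size and rounding are assumptions.
import numpy as np

def quantize_w4(weights: np.ndarray, group_size: int = 32):
    """Quantize a flat fp32 weight vector to int4 values with per-group scales."""
    w = weights.reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0  # int4 range is [-8, 7]
    scales = np.maximum(scales, 1e-8)                    # guard all-zero groups
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_w4(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover fp32 weights; activations stay fp32 throughout (the A32 part)."""
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(128).astype(np.float32)
q, s = quantize_w4(w)
print(np.abs(dequantize_w4(q, s) - w).max())  # small reconstruction error
```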
We demonstrate usage with DeepSeek-R1-Distill-Qwen-1.5B as the example model:

```shell
./demo_ds_qwen2 -m {model path} -j {tokenizer.json file}
```

The following commands have been tested on Linux systems.
```shell
git clone --recursive https://github.com/chenghuaWang/mllm-advanced.git
export ANDROID_NDK_PATH=/path/to/android-ndk

# build
python task.py tasks/android_build.yaml

# push to your device
python task.py tasks/adb_push.yaml
```

Install the Python package:

```shell
pip install .
```

Use the following command to convert models:
```shell
python tools/convertor.py --input {safetensors file} --output {output file} --format safetensors
```
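Before converting, you can preview what the converter will read using the `safetensors` Python package. This is a generic sanity check, not part of MLLM-advanced's tooling, and `model.safetensors` is a placeholder path.

```python
# List tensor names, shapes, and dtypes in a .safetensors checkpoint.
# Generic check (pip install safetensors); not part of MLLM-advanced.
from safetensors.numpy import load_file

tensors = load_file("model.safetensors")  # placeholder: your input checkpoint
for name, t in tensors.items():
    print(name, t.shape, t.dtype)
```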
Usage:

```text
[-h|--help] <FILE> [-s|--show]

Options:
  -h, --help    Show help message
  <FILE>        Input file path
  -s, --show    Show parameters metadata
```
Usage:

```text
[-h|--help] [-j|--json] [-m|--merge] [-t|--type] [-i|--input_str]

Options:
  -h, --help         Show help message.
  -j, --json         SentencePiece json file path.
  -m, --merge        Merge file path.
  -t, --type         Model type.
  -i, --input_str    Input string for testing.
```

Example:
```shell
./mllm-tokenize-checker -j ../mllm-models/DeepSeek-R1-Distill-Qwen-1.5B/tokenizer.json -t ds-qwen2 -i "你好"
```

Output:
```text
Tensor Meta Info
address: 0x7f85e8224000
name: qwen2-tokenizer-i0
shape: 1x2
device: kCPU
dtype: kInt64
[[151646, 108386]]
```
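To cross-check these ids, you can load the same tokenizer.json with Hugging Face's `tokenizers` package. This is a generic check, not part of MLLM-advanced; whether the leading id 151646 (the begin-of-sentence token) shows up depends on the tokenizer's post-processor.

```python
# Cross-check mllm-tokenize-checker output with the Hugging Face `tokenizers`
# package (pip install tokenizers). Generic check, not part of MLLM-advanced.
from tokenizers import Tokenizer

tok = Tokenizer.from_file(
    "../mllm-models/DeepSeek-R1-Distill-Qwen-1.5B/tokenizer.json"
)
ids = tok.encode("你好").ids
print(ids)  # expect [151646, 108386] when the post-processor prepends BOS
```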