A Python Flask server that makes it easier to run and use local and commercial LLMs.
- Local Models
  - Qwen 3
  - Qwen 3 Next
  - Llama-4 Scout
- Commercial Models
  - GPT-4.1-nano
  - GPT-5
- Add the model class in `llm_server/`. The model class should inherit from `BaseModel` in `model.py` and implement the `__init__` and `request` methods.
- Add the model to `__main__.py` to enable loading it from the command line. Add it in the `add_argument` function and in the model initialization section.
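A new model class might look like the sketch below. The `BaseModel` stub here only illustrates the expected shape; the real interface lives in `llm_server/model.py` and may differ, and `EchoModel` is a made-up example, not a model shipped with the project.

```python
class BaseModel:
    """Stand-in for llm_server.model.BaseModel (assumed interface)."""
    def __init__(self, name):
        self.name = name

    def request(self, system_msg, prompt):
        raise NotImplementedError


class EchoModel(BaseModel):
    """Toy model that echoes the prompt back, for illustration only."""
    def __init__(self):
        super().__init__(name="echo")

    def request(self, system_msg, prompt):
        # A real model would run inference here (local weights or an API call).
        return f"[{self.name}] {prompt}"
```

The server then calls `request(system_msg, prompt)` on the loaded model for each incoming POST.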
- Python 3.10+
- PyTorch with CUDA support (if using a GPU)
Anaconda or Miniconda is strongly recommended.
- Note: Qwen recommends Python 3.11+ for best performance.
Install dependencies:

```
pip install -r requirements.txt
```

Run the following command to start the server:

```
python -m llm_server <options> <model_name>
```

Replace `<model_name>` with the desired model from the list below (e.g., `qwen-3-next`).
Options:
- `-j`, `--tensor-parallel-size`: Number of GPUs to use for tensor parallelism (default: 1)
- `--gpu-id`: Space-separated list of GPU IDs to use, starting from 0. For example, `--gpu-id 0 1 2` uses the first three GPUs. (default: all available GPUs)
- `--hf-token`: Hugging Face token for downloading models from the Hugging Face Hub. Required for some models (e.g., Llama).
- `-p`, `--port`: Port to run the server on (default: 5000)
- `--max-tokens`: Maximum tokens for the model (prompt + response) (default: 8192)
- `--temperature`: Sampling temperature (default: 0.0)
- `--log-debug`: Enable debug logging
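For reference, the options above could be registered with `argparse` roughly as follows. This is a sketch of the CLI described above, not the project's actual `__main__.py`:

```python
import argparse

def build_parser():
    # Sketch of the command-line interface documented above.
    p = argparse.ArgumentParser(prog="llm_server")
    p.add_argument("model_name", help="Model to load (e.g., qwen-3-next)")
    p.add_argument("-j", "--tensor-parallel-size", type=int, default=1,
                   help="Number of GPUs to use for tensor parallelism")
    p.add_argument("--gpu-id", type=int, nargs="+", default=None,
                   help="Space-separated GPU IDs, starting from 0")
    p.add_argument("--hf-token", help="Hugging Face token")
    p.add_argument("-p", "--port", type=int, default=5000,
                   help="Port to run the server on")
    p.add_argument("--max-tokens", type=int, default=8192,
                   help="Maximum tokens (prompt + response)")
    p.add_argument("--temperature", type=float, default=0.0,
                   help="Sampling temperature")
    p.add_argument("--log-debug", action="store_true",
                   help="Enable debug logging")
    return p
```

For example, `build_parser().parse_args(["--gpu-id", "0", "1", "-p", "8000", "qwen-3-next"])` yields `gpu_id=[0, 1]`, `port=8000`, and `model_name="qwen-3-next"`.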
Send a POST request with a JSON body to communicate with the LLM.
The body should follow the format below:
```
{
    "system_msg": "<system message>",
    "prompt": "<prompt>"
}
```

Usually, sending a POST request can be done simply with the Python `requests` library.
For example:

```python
import requests

response = requests.post('http://localhost:<port>/request', json=input).json()
```

Replace `<port>` with the proper port (default: 5000) and `input` with your JSON input.
To use the response:

```python
response['response']
```

This gives the response as a string.
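Putting the pieces together, a minimal client could look like the sketch below. The system message, prompt, and port are placeholders, and the `/request` endpoint is assumed to be running locally:

```python
# Illustrative JSON body in the format described above.
PAYLOAD = {
    "system_msg": "You are a helpful assistant.",
    "prompt": "Say hello.",
}

def ask(payload, port=5000):
    """POST the JSON body to /request and return the 'response' string."""
    import requests  # imported here so the sketch can be read offline
    r = requests.post(f"http://localhost:{port}/request", json=payload)
    r.raise_for_status()  # surface HTTP errors instead of failing silently
    return r.json()["response"]
```

Once the server is up, `ask(PAYLOAD)` returns the model's reply as a string.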
Before running the server, we highly recommend downloading the local LLMs with the `transformers` library in advance (this can take several hours).
- `llama-4-scout`: Llama-4 Scout 17B 16E Instruct (requires Hugging Face token)
- `qwen-3-next`: Qwen 3 Next 80B A3B Instruct
- `qwen-3`: Qwen 3 1.7B
- `gpt-5`: GPT-5 (requires `OPENAI_API_KEY` environment variable)
- `gpt-4-nano`: GPT-4.1-nano (requires `OPENAI_API_KEY` environment variable)