This repository contains all the code needed to run a simple LLM - such as Mistral 7B, probably currently the best model to run locally - for inference; it includes simple RAG functionality. Most importantly, it exposes metrics on how long it took to create a response, as well as how long it took to generate the tokens. A stripped-down OpenAI-like API is available so that other services such as aider or Open WebUI can easily be integrated with it.
This is for testing only; use at your own risk! The main purpose is to learn hands-on how this stuff works and to instrument and characterize the behaviour of LLMs.
The following key metrics are exposed through Prometheus:
- token_creation_duration - Histogram for the time it took to generate the tokens.
- inference_response_duration - Histogram for the time it took to generate the full response (includes tokenization and embedding additional context).
- embedding_duration - Histogram for the time it took to create a vector representation of the query and lookup contextual information in the knowledge base.
- TODO: add more, such as the time it took to tokenize, read from the KV store, etc.; also check if we can add tracing.
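To quickly check that these metrics are being exported, you can scrape the Prometheus endpoint directly. A minimal sketch, assuming the exporter serves the conventional /metrics path on the default PROMETHEUS_HTTP_ADDRESS (see the configuration table below):

$ curl -s http://127.0.0.1:8081/metrics | grep -E 'token_creation_duration|inference_response_duration|embedding_duration'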
Here is an example dashboard that captures the metrics described above, as well as some host metrics such as power, CPU utilisation, etc.:
You will need to download a model and an embedding model:
- The mistral-7b-instruct-v0.2.Q4_K_M.gguf model seems to give reasonably good results. Otherwise, give Phi-3.5-mini-instruct-Q4_K_S.gguf a try. For resource-constrained environments, test with tinyllama-1.1b-chat-v1.0.Q2_K.gguf.
- The bge-base-en-v1.5.Q8_0.gguf embedding model seems to work well, too.
It is best to put both files into a model/ folder as model.gguf and embed.gguf.
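For example, assuming the files were downloaded to ~/Downloads (the download location is just a placeholder):

$ mkdir -p model
$ mv ~/Downloads/mistral-7b-instruct-v0.2.Q4_K_M.gguf model/model.gguf
$ mv ~/Downloads/bge-base-en-v1.5.Q8_0.gguf model/embed.gguf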
This service can be configured through environment variables. The following variables are supported:
| Environment variable | Description | Example/Default |
|---|---|---|
| DATA_PATH | Directory path from which to read text files into the knowledge base. | data |
| EMBEDDING_MODEL | Full path of the embedding model to use. | model/embed.gguf |
| HTTP_ADDRESS | Bind address to use. | 127.0.0.1:8080 |
| HTTP_WORKERS | Number of threads to run with the HTTP server. | 1 |
| INSTANCE_LABEL | Label used to tag the Prometheus metrics. | default |
| MAIN_GPU | Identifies which GPU to use. | 0 |
| MODEL_GPU_LAYERS | Number of layers to offload to GPU. | 0 |
| MODEL_MAX_TOKEN | Maximum number of tokens to generate. | 128 |
| MODEL_PATH | Full path to the gguf file of the model. | model/model.gguf |
| MODEL_PROMPT_TEMPLATE | A prompt template - should contain {context} and {query} elements. | Mistral prompt |
| MODEL_THREADS | Number of threads to use for inference. | 6 |
| PROMETHEUS_HTTP_ADDRESS | Bind address to use for the Prometheus metrics endpoint. | 127.0.0.1:8081 |
Other environment variables such as RUST_LOG can also be used.
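For example, a minimal way to start the service with a few of the defaults overridden (starting it via cargo run is an assumption about how you build and run the binary; adapt to your setup):

$ DATA_PATH=data \
  MODEL_PATH=model/model.gguf \
  EMBEDDING_MODEL=model/embed.gguf \
  MODEL_GPU_LAYERS=0 \
  MODEL_THREADS=6 \
  RUST_LOG=info \
  cargo run --release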
The following curl command shows the format the service understands:
$ curl -N -X POST http://localhost:8080/v1/chat/completions \
-d '{"stream": true, "model": "rusty_llm",
"messages": [{"role": "user", "content": "Who was Albert Einstein?"}]}' \
-H 'Content-Type: application/json'
data: {"id":"foo","object":"chat.completion.chunk","created":1733007600,"model":"rusty_llm","system_fingerprint":"fp0","choices":[{"index":0,"delta":{"content": " Albert"},"logprobs":null,"finish_reason":null}]}
...
You can also test if the RAG works by asking "Who was Thom Rhubarb?" - notice how easy it is to trick these word-prediction machines - and see if it responds with something about screw-printers.
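The request uses the same format as above, only with a different question:

$ curl -N -X POST http://localhost:8080/v1/chat/completions \
  -d '{"stream": true, "model": "rusty_llm",
  "messages": [{"role": "user", "content": "Who was Thom Rhubarb?"}]}' \
  -H 'Content-Type: application/json'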
The Dockerfile to build the image is best used on a machine with the same CPU as the one it will be deployed on, as it uses the target-cpu=native flag. Note that it can also optionally include options to build with CLBlast for GPU support.
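A minimal build command (the image tag is just an example; check the Dockerfile itself for the exact options/build arguments needed to enable CLBlast):

$ docker build -t rusty_llm:latest .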
Use the following example manifest to deploy this application:
kubectl apply -f k8s_deployment.yaml
Note: make sure to adapt the Docker image and the paths - the manifest above uses hostPaths!
- 0.1.0 - initial release.
- 0.2.0 - switched to llama_cpp as development of the llm crate stopped.
- 0.3.0 - replaced the way we store knowledge.
- 0.4.0 - switched to llama-cpp-2 as it is under active development.
- 0.5.0 - added support for an OpenAI-like API.
- 0.6.0 - version fixes.
- 0.7.0 - added support for caching context, etc.
The following links may be useful: