Commit 03f171e (1 parent: b564d05)

example: LLM inference with Ray Serve (abetlen#1465)

File tree: 3 files changed (+42, −0 lines)


examples/ray/README.md

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
This is an example of doing LLM inference with [Ray](https://docs.ray.io/en/latest/index.html) and [Ray Serve](https://docs.ray.io/en/latest/serve/index.html).

First, install the requirements:

```bash
$ pip install -r requirements.txt
```

Deploy a GGUF model to Ray Serve with the following command:

```bash
$ serve run llm:llm_builder model_path='../models/mistral-7b-instruct-v0.2.Q4_K_M.gguf'
```

This will start an API endpoint at `http://localhost:8000/`. You can query the model like this:

```bash
$ curl -k -d '{"prompt": "tell me a joke", "max_tokens": 128}' -X POST http://localhost:8000
```
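
A hedged addition (not part of the commit): the same query can be issued from Python. A minimal client sketch, assuming the endpoint above is running and the `requests` package is installed:

```python
# Client sketch (assumption: the Ray Serve endpoint from the README is up
# at http://localhost:8000 and `requests` is installed).
import requests

response = requests.post(
    "http://localhost:8000/",
    json={"prompt": "tell me a joke", "max_tokens": 128},
)
response.raise_for_status()

# The deployment returns llama-cpp-python's completion dict;
# the generated text is under choices[0]["text"].
print(response.json()["choices"][0]["text"])
```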

examples/ray/llm.py

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
1+
from starlette.requests import Request
2+
from typing import Dict
3+
from ray import serve
4+
from ray.serve import Application
5+
from llama_cpp import Llama
6+
7+
@serve.deployment
8+
class LlamaDeployment:
9+
def __init__(self, model_path: str):
10+
self._llm = Llama(model_path=model_path)
11+
12+
async def __call__(self, http_request: Request) -> Dict:
13+
input_json = await http_request.json()
14+
prompt = input_json["prompt"]
15+
max_tokens = input_json.get("max_tokens", 64)
16+
return self._llm(prompt, max_tokens=max_tokens)
17+
18+
19+
def llm_builder(args: Dict[str, str]) -> Application:
20+
return LlamaDeployment.bind(args["model_path"])
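
As a usage note (my assumption, not shown in the commit), the application can also be deployed from Python rather than the `serve run` CLI, using Ray Serve's standard `serve.run` API. A minimal sketch, assuming `llm.py` is importable and the model path exists:

```python
# Hypothetical driver script; mirrors `serve run llm:llm_builder model_path=...`.
from ray import serve

from llm import llm_builder

app = llm_builder({"model_path": "../models/mistral-7b-instruct-v0.2.Q4_K_M.gguf"})
serve.run(app)  # deploys the app; Serve's HTTP proxy listens on port 8000 by default
```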

examples/ray/requirements.txt

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
ray[serve]
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
llama-cpp-python
