Official large language model (LLM) server for the narrator and NPCs in the Raymond Maarloeve game project.

A lightweight REST API for managing the local language models used by NPCs and the narrator in the game. It supports loading multiple models simultaneously, chat-style response generation, and dynamic resource management.
Full project documentation is available at:
🔗 https://raymondmaarloeve.github.io/LLMServer/
Main repo:
🔗 https://github.com/RaymondMaarloeve/RaymondMaarloeve
- 🔁 Supports multiple LLMs simultaneously (keyed by `model_id`)
- 🔌 Simple `/chat` endpoint with full conversation history handling
- 🚦 Automatic response termination detection using special tags (`<npc>`, `<human>`, etc.; see the sketch below)
- 🧹 Ability to unload models from memory (`/unload`)
- 📂 File browsing via API (`/list-files`)
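One plausible way to implement that termination detection is llama-cpp-python's `stop` parameter: generation halts as soon as the model emits one of the tags, so an NPC line never bleeds into the next speaker's turn. A minimal sketch of the idea (the model path is reused from the `/load` example below; this is an illustration, not the project's actual code):

```python
from llama_cpp import Llama

# Illustrative model path, borrowed from the /load example below.
llm = Llama(model_path="models/ggml-npc-q4.bin", n_ctx=2048)

# Generation stops the moment one of these tags is emitted,
# which cuts the reply off cleanly at a turn boundary.
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello there!"}],
    stop=["<npc>", "<human>"],
)
print(result["choices"][0]["message"]["content"])
```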
- Python 3.12
- Flask – REST API framework
- llama-cpp-python – Python bindings for llama.cpp, used to run local LLaMA models
- PyInstaller – packages the server into a single binary
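To show how these pieces fit together, here is a heavily simplified sketch of a Flask server wrapping llama-cpp-python. It mirrors the `/load` and `/chat` endpoints described below, but error handling and locking are omitted and the port is an assumption; the real implementation lives in `main.py`:

```python
from flask import Flask, jsonify, request
from llama_cpp import Llama

app = Flask(__name__)
models = {}  # model_id -> loaded Llama instance

@app.route("/load", methods=["POST"])
def load():
    body = request.get_json()
    # n_ctx and n_gpu_layers map directly onto llama-cpp-python's
    # Llama constructor, matching the /load payload shown below.
    models[body["model_id"]] = Llama(
        model_path=body["model_path"],
        n_ctx=body.get("n_ctx", 2048),
        n_gpu_layers=body.get("n_gpu_layers", 0),
    )
    return jsonify({"status": "loaded", "model_id": body["model_id"]})

@app.route("/chat", methods=["POST"])
def chat():
    body = request.get_json()
    llm = models[body["model_id"]]
    result = llm.create_chat_completion(messages=body["messages"])
    return jsonify(result["choices"][0]["message"])

if __name__ == "__main__":
    app.run(port=5000)  # assumed port; the README does not pin one down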
- Run the server:

  ```bash
  python main.py
  ```

- Load a model:

  ```
  POST /load
  {
    "model_id": "npc_village",
    "model_path": "models/ggml-npc-q4.bin",
    "n_ctx": 2048,
    "n_gpu_layers": 16
  }
  ```

- Send a chat request:

  ```
  POST /chat
  {
    "model_id": "npc_village",
    "messages": [
      {"role": "system", "content": "You are a grumpy blacksmith."},
      {"role": "user", "content": "Hello there!"},
      {"role": "assistant", "content": "Hmph. What do you want?"},
      {"role": "user", "content": "Got any gossip?"}
    ]
  }
  ```

- Receive the response and display it in-game (see the client sketch below).
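From the game's side (or any other client), this flow is a pair of HTTP calls. A quick sketch with Python's `requests`; the host and port are assumptions, since the README does not specify them:

```python
import requests

BASE = "http://127.0.0.1:5000"  # assumed address of the running server

# Load the model once, e.g. at game startup.
requests.post(f"{BASE}/load", json={
    "model_id": "npc_village",
    "model_path": "models/ggml-npc-q4.bin",
    "n_ctx": 2048,
    "n_gpu_layers": 16,
}).raise_for_status()

# Request a reply, sending the full conversation history each time.
response = requests.post(f"{BASE}/chat", json={
    "model_id": "npc_village",
    "messages": [
        {"role": "system", "content": "You are a grumpy blacksmith."},
        {"role": "user", "content": "Got any gossip?"},
    ],
})
print(response.json())
```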
To build a standalone version:

```bash
CMAKE_ARGS="-DGGML_VULKAN=on" uv pip install llama-cpp-python --no-cache
uv run pyinstaller --onefile --additional-hooks-dir hooks main.py
```

The first command rebuilds llama-cpp-python with its Vulkan backend enabled, so the packaged binary can use the GPU.

| Endpoint | Description |
|---|---|
| `/load` | Load a model into memory |
| `/chat` | Generate a response in chat style |
| `/unload` | Release model resources |
| `/status` | Check available models and GPU status |
| `/list-files` | List files in a specified directory |
| `/register` | Register a model for lazy-loading |
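`/register` differs from `/load` in that, presumably, the model file is only read into memory on first use. A plausible request shape, mirroring the `/load` payload; the field names here are an assumption, so check the generated docs linked above for the real schema:

```python
import requests

# Hypothetical payload mirroring /load; the real schema may differ.
requests.post("http://127.0.0.1:5000/register", json={
    "model_id": "narrator",
    "model_path": "models/ggml-narrator-q4.bin",
})
```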
The `LLMServer` project is the foundation of narration and NPC behavior in the world of Raymond Maarloeve.