
Goinfer

Inference proxy – swap between multiple *.gguf models on remote machines and expose them through an HTTPS API with credentials, so you can securely connect from any device to your home GPU computers, or let employees connect to idle GPUs within the company office.

TL;DR – Deploy a client on a GPU-rich desktop, a server on a machine with a static IP (or DNS), and let the server forward inference requests to the client. No VPN, no port-forwarding, end-to-end encryption.

Built on top of llama.cpp and llama-swap, Goinfer is designed to be DevOps-friendly: easily deployable and monitored on remote computers with minimal manual operations (inspired by llamactl), and with meaningful logs.

Problem: remote access to office/home-hosted LLM

⚠️ Not yet implemented. Please contribute. Contact us [email protected] ⚠️

Local-LLM enthusiasts often hit a wall when they try to expose a model to the Internet:

  • Security – exposing a raw llama-server or ollama instance gives anyone access to the GPU.
  • Network topology – most home routers block inbound connections, so the GPU machine can’t be reached from outside, and the home IP address changes.
  • Privacy – using third-party inference services defeats the purpose of running models locally.

Existing tools (llamactl, llama-swap, olla, llm-proxy/rust, llm-proxy/py, langroute, optillm, VPNs, WireGuard, SSH...) either require inbound ports, complex network plumbing, or a custom client on every device.

Goinfer solves these issues by flipping the connection direction: the GPU-rich client (home) initiates a secure outbound connection to a server with a static IP. The server then acts as a public façade, forwarding inference requests back to the client (home-hosted LLM).

Key features

Category Feature
Model handling Load multiple *.gguf models, switch at runtime, change any inference parameter
API OpenAI-compatible HTTP API /v1/, LLama.cpp-compatible /completions API, streaming responses
Security API key, CORS control
Robustness Independent of ISP-provided IP, graceful reconnects
Admin control Remote monitoring, delete/upload new GGUF files, reload config, git pull llama.cpp, re-compile
Home-hosted LLM Run Goinfer on your GPU desktop and another Goinfer in a data-center (static IP/DNS)

Build

  • Go (any version, go will automatically use Go-1.25 to build Goinfer)
  • GCC/LLVM if you want to build llama.cpp or ik_llama.cpp or …
  • NodeJS (optional, llama.cpp frontend is already built)
  • One or more *.gguf model files

Container

See the Containerfile to build a Docker/Podman image with official Nvidia images, CUDA-13, GCC-14 and optimized CPU flags.
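
A minimal sketch, assuming the Containerfile sits at the repository root (image name, ports and volume paths are illustrative; add the GPU device flags required by your container runtime):

# build the image
podman build -t goinfer -f Containerfile .

# run it, exposing the default ports and mounting your GGUF models read-only
podman run --rm -p 4444:4444 -p 5555:5555 \
  -v /home/me/models:/models:ro \
  -e GI_MODELS_DIR=/models \
  goinfer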

First run

git clone https://github.com/LM4eu/goinfer
cd goinfer

# discover the parent directories of your GGUF files
#   - find the files *.gguf 
#   - -printf their folders (%h)
#   - sort them, -u to keep a unique copy of each folder
#   - while read xxx; do xxx; done  =>  print the parent folders separated by ":"
export GI_MODELS_DIR="$(find ~ /mnt -name '*.gguf' -printf '%h\0' | sort -zu |
while IFS= read -rd '' d; do [[ $p && $d == "$p"/* ]] && continue; echo -n "$d:"; p=$d; done)"

# set the path of your inference engine (llama.cpp/ik_llama.cpp/...)
export GI_LLAMA_EXE=/home/me/bin/llama-server

# generates the config
go run . -write

# voilà, it's running
go run . -no-api-key

Goinfer listens on the ports defined in goinfer.yml. Default ports:

  • :4444 for extra-featured endpoints /models, /completions, /v1/chat/completions
  • :5555 for OpenAI-compatible API (provided by llama-swap)
# use the default model
curl -X POST localhost:4444/completions -d '{"prompt":"Hello"}'

# list the models
curl -X GET localhost:4444/models | jq

# pick up a model and prompt it
curl -X POST localhost:4444/completions \
  -d '{ "model":"qwen-3b", "prompt":"Hello AI" }'

# same using the OpenAI API
curl -X POST localhost:5555/v1/chat/completions \
  -d '{ "model": "qwen-3b",
        "messages": [ {"role":"user",
                       "content":"Hello AI"}]
      }'

All-in-one script

Build all dependencies and run Goinfer with the bash script clone-pull-build-run.sh

  • clone and build llama.cpp using optimization flags

  • clone and build the llama-swap frontend with --build--swap:

    git clone https://github.com/LM4eu/goinfer
    goinfer/scripts/clone-pull-build-run.sh --build--swap

Perfect for setting up the environment and for updating/building the dependencies daily.

No need to edit the configuration files manually: this script discovers your GGUF files, and your personalized configuration files are automatically generated.

The script ends by running a fully configured Goinfer server.

To reuse your own llama-server, set:
export GI_LLAMA_EXE=/home/me/path/llama-server (this prevents cloning/building llama.cpp)

If this script finds too many *.gguf files, set:
export GI_MODELS_DIR=/home/me/models:/home/me/other/path (this disables the GGUF search and speeds up the script)

Run Goinfer locally without the API key:
./clone-pull-build-run.sh -no-api-key

Full example:

git -C path/repo/goinfer pull --ff-only
export GI_MODELS_DIR=/home/me/models
export GI_DEFAULT_MODEL=my-favorite-model.gguf
export GI_LLAMA_EXE=/home/me/bin/llama-server
path/repo/goinfer/scripts/clone-pull-build-run.sh -no-api-key

Use the --help flag or read the usage section within the script.

Configuration

Environment variables

Discover the parent folders of your GGUF models:

  • find the files *.gguf in $HOME and /mnt
  • -printf their folders %h separated by the nul character \0 (supports folder names containing newline characters)
  • sort them, -u to keep a unique copy of each folder (z = input is \0 separated)
  • while read …; do …; done keeps only the top-most parent folders (sub-folders of an already kept folder are skipped)
  • echo -n "$d:" prints each kept folder followed by : (-n = no newline)
export GI_MODELS_DIR="$(find "$HOME" /mnt -type f -name '*.gguf' -printf '%h\0' | sort -zu |
while IFS= read -rd '' d; do [[ $p && $d == "$p"/* ]] && continue; echo -n "$d:"; p=$d; done)"

# else manually

export GI_MODELS_DIR=/path/to/my/models

# multiple paths

export GI_MODELS_DIR=/path1:/path2:/path3

GI_MODELS_DIR is the root path where your models are stored. Goinfer searches for *.gguf files within all GI_MODELS_DIR sub-folders, so you can organize your models within a folder tree.
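
For example, a hypothetical layout like the following is discovered automatically:

/home/me/models
├── qwen/Qwen2.5-Coder-1.5B-Q8_0.gguf
└── llama/Llama-3.2-3B-Instruct-Q4_K_M.gguf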

The other environment variables are:

export GI_LLAMA_EXE=/path/to/my/llama-server
export GI_HOST=0.0.0.0  # exposing llama-server is risky
export GI_ORIGINS=      # disabling CORS is risky
export GI_API_KEY="PLEASE SET SECURE API KEY"
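
If you prefer to set the API key yourself instead of letting ./goinfer -write generate one, a possible way to produce a 64-hex-digit (32-byte) key:

# generate a random 64-hex-digit (32-byte) API key
export GI_API_KEY="$(openssl rand -hex 32)"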

Disable Gin debug logs:

export GIN_MODE=release 

API key

The flag -write also generates a random API key in goinfer.yml. This flag can be combined with:

  • -debug sets the debug API key (only during the dev cycle)

  • -no-api-key sets placeholder API keys (such as "PLEASE SET USER API KEY"), so requests are accepted without a real key (local use only)

Set the Authorization header within the HTTP request:

curl -X POST https://localhost:4444/completions  \
  -H "Authorization: Bearer $GI_API_KEY"         \
  -d '{ "prompt": "Say hello in French" }'

goinfer.yml

# Goinfer recursively searches for GGUF files in one or multiple folders separated by ':'
# List your GGUF dirs with `locate .gguf | sed -e 's,/[^/]*$,,' | uniq`
models_dir: /home/me/models 

# ⚠️ Set your API key, can be 64-hex-digit (32-byte) 🚨
# Generate a random API key with: ./goinfer -write
api_key: "PLEASE SET USER API KEY"
origins:   # CORS whitelist
  - "https://my-frontend.example.com"
  - "http://localhost"
listen:
  # format:  <address>: <list of enabled services>
  # <address> can be <ip|host>:<port> or simply :<port> when <host> is localhost
  ":4444": goinfer     # /completions endpoint letting tools like Agent-Smith doing the templating
  ":5555": llama-swap  # OpenAI-compatible API by llama-swap

llama:
  exe: /home/me/llama.cpp/build/bin/llama-server
  args:
    # common args used for every model
    common: --props --no-warmup --no-mmap
    # extra args to let tools like Agent-Smith do the templating (/completions endpoint)
    goinfer: --jinja --chat-template-file template.jinja
    # extra llama-server flag when ./goinfer is used without the -q flag
    verbose: --verbose-prompt
    # extra llama-server flag for ./goinfer -debug
    debug: --verbosity 3

  • API key – Never commit it. Use the env var GI_API_KEY or a secrets manager in production.
  • Origins – Set to the domains you’ll be calling the server from (including localhost for testing).
  • Ports – Adjust as needed; make sure the firewall on the server allows them.
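
As a rough illustration (the exact assembly is done by Goinfer when generating llama-swap.yml, see below), the llama.args sub-sections above concatenate into llama-server command lines such as:

# sketch: /completions service, ./goinfer run without -q
#   exe + common + goinfer + verbose   (model path is illustrative)
/home/me/llama.cpp/build/bin/llama-server \
  --props --no-warmup --no-mmap \
  --jinja --chat-template-file template.jinja \
  --verbose-prompt \
  -m /home/me/models/some-model.gguf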

llama-swap.yml

At startup, Goinfer verifies the available GGUF files. The -write flag allows Goinfer to write the llama-swap.yml file.

Official documentation: github/mostlygeek/llama-swap/wiki/Configuration

logLevel: info            # debug, info, warn, error
healthCheckTimeout: 500   # seconds to wait for a model to become ready
metricsMaxInMemory: 1000  # maximum number of metrics to keep in memory
startPort: 6000           # first ${PORT} incremented for each model

macros:  # macros to reduce common conf settings
    cmd-fim: /home/me/llama.cpp/build/bin/llama-server --props --no-warmup --no-mmap --verbose-prompt
    cmd-common: ${cmd-fim} --jinja --port ${PORT}
    cmd-goinfer: ${cmd-common} --chat-template-file template.jinja

models:

  # model name used in API requests
  ggml-org/Qwen2.5-Coder-0.5B-Q8_0-GGUF_qwen2.5-coder-0.5b-q8_0:
    description: "Small but capable model for quick testing"
    name: Qwen2.5-Coder-0.5B-Q8_0-GGUF  # for /v1/models response
    useModelName: "Qwen2.5-Coder"       # overrides the model name for /upstream (used by llama-swap web UI)
    aliases:
      - "Qwen2.5-Coder-0.5B-Q8_0"       # alternative names (unique globally)
      - "Qwen2.5-Coder-0.5B"
    env: []
    cmd: ${cmd-common}  -m /home/c/.cache/llama.cpp/ggml-org_Qwen2.5-Coder-0.5B-Q8_0-GGUF_qwen2.5-coder-0.5b-q8_0.gguf
    proxy: http://localhost:${PORT}     # default: http://localhost:${PORT}
    checkEndpoint: /health              # default: /health endpoint
    unlisted: false                     # unlisted=false => list model in /v1/models and /upstream responses
    ttl: 3600                           # stop the cmd after 1 hour of inactivity
    filters:
      # inference params to remove from the request, default: ""
      # useful for preventing overriding of default server params by requests
      strip_params: "temperature,top_p,top_k"

  # GI_ prefix for goinfer /completions endpoint
  GI_ggml-org/Qwen2.5-Coder-0.5B-Q8_0-GGUF_qwen2.5-coder-0.5b-q8_0:
      cmd: ${cmd-goinfer}  -m /home/c/.cache/llama.cpp/ggml-org_Qwen2.5-Coder-0.5B-Q8_0-GGUF_qwen2.5-coder-0.5b-q8_0.gguf
      proxy: http://localhost:${PORT}
      checkEndpoint: /health
      unlisted: true   # hide model name in /v1/models and /upstream responses
      useModelName: ggml-org/Qwen2.5-Coder-0.5B-Q8_0-GGUF # for /upstream (used by llama-swap web UI)

  # selected models by llama.cpp are also available with their specific port
  ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF:
      cmd: ${cmd-fim} --fim-qwen-1.5b-default
      proxy: http://localhost:8012
      checkEndpoint: /health
      unlisted: false

# preload some models on startup 
hooks:
  on_startup:
    preload:
      - "Qwen2.5-1.5B-Instruct-Q4_K_M"

# Keep some models loaded indefinitely, while others are swapped out
# see https://github.com/mostlygeek/llama-swap/pull/109
groups:
  # example1: only one model is allowed to run at a time (default mode)
  "group1":
    swap: true
    exclusive: true
    members:
      - "llama"
      - "qwen-unlisted"
  # example2: all the models in this group2 can run at the same time
  # loading another model => unloads all of group2
  "group2":
    swap: false
    exclusive: false
    members:
      - "docker-llama"
      - "modelA"
      - "modelB"
  # example3: persistent models are never unloaded
  "forever":
    persistent: true
    swap: false
    exclusive: false
    members:
      - "forever-modelA"
      - "forever-modelB"
      - "forever-modelC"

Developer info

  • flags override environment variables that override YAML config: Cfg defined in conf.go
  • GGUF file discovery: Search() in models.go
  • Graceful shutdown handling: handleShutdown() in goinfer.go
  • API-key authentication per service: configureAPIKeyAuth() in router.go
  • Comprehensive error handling: gie package in errors.go
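
For instance (illustrative), the GI_MODELS_DIR environment variable overrides models_dir from goinfer.yml, and command-line flags override both:

# flags (e.g. -q) override env vars (GI_*), which override goinfer.yml
GI_MODELS_DIR=/home/me/models ./goinfer -q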

API endpoints

Each service can be enabled/disabled in goinfer.yml.

Method Path Description
GET / llama.cpp Web UI
GET /ui llama-swap Web UI
GET /models List available GGUF models
POST /completions Llama.cpp inference API
GET /v1/models List models by llama-swap
POST /v1/chat/completions OpenAI-compatible chat endpoint
POST /v1/* Other OpenAI endpoints
POST /rerank /v1/rerank Reorder or answer questions about a document
POST /infill Auto-complete source code (or other text editing)
GET /logs /logs/stream Retrieve the llama-swap or llama.cpp logs
GET /props Get the llama.cpp settings
GET /unload Stop all inference engines
GET /running List the running inference engines
GET /health Check if everything is OK

All endpoints require an Authorization: Bearer $GI_API_KEY header.
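
For example, assuming these endpoints are served on the :4444 listener (adjust the port to match your goinfer.yml):

# check health, list the running engines, then stop them all
curl -H "Authorization: Bearer $GI_API_KEY" localhost:4444/health
curl -H "Authorization: Bearer $GI_API_KEY" localhost:4444/running
curl -H "Authorization: Bearer $GI_API_KEY" localhost:4444/unload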

llama-swap starts llama-server using the command lines configured in llama-swap.yml. Goinfer generates that llama-swap.yml file setting two different command lines for each model:

  1. classic command line for models listed by /v1/models (to be used by tools like Cline / RooCode)
  2. with extra arguments --jinja --chat-template-file template.jinja when the requested model is prefixed with GI_

The first one suits most use cases, such as Cline / RooCode. The second one covers the specific needs of tools like agent-smith that require full inference control (e.g. no default Jinja template), as shown in the sketch below.
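
For example (a sketch using the aliases from the llama-swap.yml above; model names and ports depend on your own config), the same model is reachable through both paths:

# 1. classic command line – OpenAI-compatible endpoint (Cline / RooCode style)
curl -X POST localhost:5555/v1/chat/completions \
  -H "Authorization: Bearer $GI_API_KEY" \
  -d '{ "model": "Qwen2.5-Coder-0.5B",
        "messages": [{"role":"user","content":"Hello AI"}] }'

# 2. goinfer /completions endpoint – served by the GI_-prefixed entry
#    (--jinja --chat-template-file), the caller does its own templating
curl -X POST localhost:4444/completions \
  -H "Authorization: Bearer $GI_API_KEY" \
  -d '{ "model": "Qwen2.5-Coder-0.5B", "prompt": "Hello AI" }'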

Server/Client mode

⚠️ Not yet implemented. Please contribute. Contact us [email protected] ⚠️

Design

╭──────────────────╮  1 ──>  ╭───────────────────╮         ╭──────────────╮
│ GPU-rich desktop │         │ host static IP/DNS│  <── 2  │ end-user app │
│ (Goinfer client) │  <── 3  │  (Goinfer server) │         │ (browser/API)│
╰──────────────────╯  4 ──>  ╰───────────────────╯  5 ──>  ╰──────────────╯
  1. Goinfer client connects to the Goinfer server having a static IP (and DNS)
  2. the end user sends a prompt to the cloud-hosted Goinfer server
  3. the Goinfer server reuses the connection to the Goinfer client and forwards the prompt to it
  4. the Goinfer client replies with the response generated by the local LLM (llama.cpp)
  5. the Goinfer server forwards the response to the end-user

No inbound ports are opened on either the Goinfer client or the end-user app, maximizing security and anonymity between the GPU-rich desktop and the end-user.

Another layer of security is the encrypted two-way authentication between the Goinfer client and the Goinfer server. Furthermore, we recommend using HTTPS on port 443 for all these communications and avoiding sub-domains, because sub-domain names remain visible over HTTPS whereas URL paths do not.

High availability is provided by the multiple-clients/multiple-servers architecture:

  • The end-user app connects to one of the available Goinfer servers.
  • All the running Goinfer clients connect to all the Goinfer servers.
  • The Goinfer servers favor the most idle Goinfer clients depending on their capacity (vision prompts are sent to GPU-capable clients running the adequate LLM).
  • Fallback to CPU offloading when appropriate.

1. Run the server (static IP / DNS)

On a VPS, cloud VM, or any machine with a public address.

./goinfer

2. Run the client (GPU machine)

On your desktop with a GPU

./goinfer

The client will connect, register its available models and start listening for inference requests.

3. Test the API

# list the available models
curl -X GET https://my-goinfer-server.com/v1/models

# pick up a model and prompt it
curl -X POST https://my-goinfer-server.com/v1/chat/completions \
  -H "Authorization: Bearer $GI_API_KEY"                       \
  -d '{
        "model":"Qwen2.5-1.5B-Instruct-Q4_K_M",
        "messages":[{"role":"user",
                     "content":"Say hello in French"}]
      }'

You receive a JSON response generated by the model running on your GPU rig.

History & Roadmap

March 2023

In March 2023, Goinfer was an early local-LLM proxy swapping models and supporting Ollama, Llama.cpp, and KoboldCpp. Goinfer was initiated to address two needs:

  1. to swap engine and model at runtime, something that didn’t exist back then
  2. to infer pre-configured templated prompts

This second point has been moved to the project github/synw/agent-smith with more templated prompts in github/synw/agent-smith-plugins.

August 2025

To simplify the maintenance, we decided in August 2025 to replace our process management with another well-maintained project. As we do not use Ollama/KoboldCpp any more, we integrated llama-swap into Goinfer to handle communication with llama-server.

New needs

Today the needs have evolved. What we need most right now is a proxy that can act as a secure intermediary between a client (frontend/CLI) and an inference engine (local/cloud), with these constraints:

Client Server Constraint
Frontend OpenRouter Intermediate proxy required to manage the OpenRouter key without exposing it on the frontend
Any Home GPU rig Access to another home GPU rig that forbids external TCP connections

Next implementation

Integrate a Web UI to select the model(s) to enable.

Optimizer of the llama-server command line: finding the best --gpu-layers --override-tensor --n-cpu-moe --ctx-size ... by iterating over GPU allocation errors and benchmarking. Timeline of llama.cpp optimization:

The Python script llama-optimus is nice and could also be used for ik_llama.cpp. Its README explains:

Flag: --mmap / --no-mmap (memory-map the model vs. fully load it)
  Why it matters: on fast NVMe & Apple SSD, --mmap 1 (default) is fine; on slower HDD/remote disks, disabling mmap (--no-mmap or --mmap 0) and loading the whole model into RAM often gives 10-20 % faster generation (no page-fault stalls).
  Suggested search values: [0, 1] (boolean)
  Notes: keep default 1; let Optuna see if 0 wins on a given box.

Flag: --cache-type-k / --cache-type-v
  Why it matters: setting the key/value cache to f16 vs q4 or i8 trades RAM vs speed. Most Apple-Metal & CUDA users stick with f16 (fast, larger). For low-RAM CPUs, increasing speed is impossible if it swaps; q4 can shrink the cache 2-3× at ~3-5 % speed cost.
  Suggested search values: ["f16","q4"] for both k & v (skip i8 unless you target very tiny devices).
  Notes: only worth searching when the user is on a CPU-only or small-VRAM GPU. You can gate this by detecting “CUDA not found” or VRAM < 8 GB.

Flag: --main-gpu / --gpu-split (or --tensor-split)
  Why it matters: relevant only for multi-GPU rigs (NVIDIA). Picking the right primary GPU or tensor split can cut VRAM fragmentation and enable a higher -ngl.
  Suggested search values: if multi-GPU is detected, expose [0,1] for main-gpu and a handful of tensor-split presets (0,1, 0,0.5,0.5, etc.).
  Notes: keep disabled on single-GPU/Apple Silicon to avoid wasted trials.

Flag: [Preliminary] --flash-attn-type 0/1/2 (v0.2+ of llama.cpp)
  Why it matters: Metal + CUDA now have two flash-attention kernels (0 ≈ old GEMM, 1 = FMHA, 2 = in-place FMHA). Note: not yet merged into llama.cpp main. Some M-series Macs get +5-8 % with type 2 vs 1.
  Suggested search values: [0,1,2], but only if the llama.cpp commit is ≥ May 2025.
  Notes: add a version guard: skip the flag on older builds.

When the VRAM is not enough, or when the user needs to increase the context size, Goinfer needs to offload some layers to the CPU. The idea is to identify the least-used tensors and offload them first. The command llama-gguf lists the tensors (experts are usually suffixed with _exps).
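
For example (illustrative flags; the tensor-name pattern and model path are placeholders that depend on the model architecture), MoE expert tensors can be kept in CPU RAM while the remaining layers go to VRAM:

# list the tensors first with llama-gguf, then keep the experts (*_exps) on CPU
llama-server -m /home/me/models/my-moe-model.gguf \
  --n-gpu-layers 99 \
  --override-tensor 'exps=CPU'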

Other priority task

Two Goinfer instances (client / server mode):

  • a Goinfer on a GPU machine that runs in client mode
  • a Goinfer on a machine in a data-center (static IP) that runs in server mode
  • the client Goinfer connects to the server Goinfer (here, the server is the backend of a web app)
  • the user sends their inference request to the backend (data-center) which forwards it to the client Goinfer
  • we could imagine installing a client Goinfer on every computer with a good GPU, and the server Goinfer that forwards inference requests to the connected client Goinfer according to the requested model

Medium priority

Manage the OpenRouter API key of an AI-powered frontend.

Low priority

  • Comprehensive web admin (monitoring, download/delete .gguf, edit config, restart, git pull + rebuild llama.cpp, remote shell, upgrade Linux, reboot the machine, and other SysAdmin tasks)

Contribute – if you’re interested in any of the above, open an issue or submit a PR :)

Prompting creator UI

Integrate a Web UI to ease creation of multi-step AI agents like:

The agentic prompt syntax YALP and its VS Code extension, authored by two Germans: Nils Abegg and einfachai (see also einfachai.com).

Nice to have

Some inspiration to extend the Goinfer stack:

Contributions welcomed

  1. Fork the repository.
  2. Create a feature branch (git checkout -b your-feature)
  3. Run the test suite: go test ./... (more tests are welcome)
  4. Ensure code is formatted and linted with golangci-lint-v2 run --fix
  5. Submit a PR with a clear description and reference any related issue

Feel free to open discussions for design ideas/decisions.

License

Merci

Special thanks to:

  • Georgi Gerganov for releasing and improving llama.cpp in 2023 so we could freely play with Local LLM.
  • All other contributors of llama.cpp.
  • Benson Wong for maintaining llama-swap with clean and well-documented code.
  • The open-source community that makes GPU-based LLM inference possible on commodity hardware. ❤️

See also

Some active local-LLM proxies:

Language Repository
Go github/inference-gateway/inference-gateway
Go github/lordmathis/llamactl
Go github/mostlygeek/llama-swap
Go github/thushan/olla
Python github/codelion/optillm
Python github/llm-proxy/llm-proxy (inactive?)
Rust github/x5iu/llm-proxy
TypeScript github/bluewave-labs/langroute

Compared to alternatives, we like llama-swap for its readable source code and because its author contributes regularly. So we integrated it into Goinfer to handle communication with llama-server (or other compatible forks such as ik_llama.cpp). We also like llamactl ;-)

Enjoy remote GPU inference with Goinfer! 🚀

If you have questions, need help setting up your first client/server pair, or want to discuss future features, open an issue or ping us on the repo’s discussion board.