Inference proxy – swap between multiple *.gguf
models on remote machines and expose them through an HTTPS API with credentials. Securely connect from any device to your home GPU computers, or let employees connect to idle GPUs within the company office.
TL;DR – Deploy a client on a GPU-rich desktop, a server on a machine with a static IP (or DNS), and let the server forward inference requests to the client. No VPN, no port-forwarding, end-to-end encryption.
Built on top of llama.cpp and llama-swap, Goinfer is designed to be DevOps-friendly: easily deployable and monitored on remote computers with minimal manual operations (inspired by llamactl), and producing meaningful logs.
Local-LLM enthusiasts often hit a wall when they try to expose a model to the Internet:
- Security – exposing a raw llama-server or ollama instance can leak the GPU to anyone.
- Network topology – most home routers block inbound connections, so the GPU machine can’t be reached from outside, and the home IP changes.
- Privacy – using third-party inference services defeats the purpose of running models locally.
Existing tools (llamactl, llama-swap, olla, llm-proxy/rust, llm-proxy/py, langroute, optillm, VPNs, WireGuard, SSH...) either require inbound ports, complex network plumbing, or a custom client on every device.
Goinfer solves these issues by flipping the connection direction: the GPU-rich client (home) initiates a secure outbound connection to a server with a static IP. The server then acts as a public façade, forwarding inference requests back to the client (home-hosted LLM).
Category | Feature |
---|---|
Model handling | Load multiple *.gguf models, switch at runtime, change any inference parameter |
API | OpenAI-compatible HTTP API /v1/ , Llama.cpp-compatible /completions API, streaming responses |
Security | API key, CORS control |
Robustness | Independent of ISP-provided IP, graceful reconnects |
Admin control | Remote monitoring, delete/upload new GGUF files, reload config, git pull llama.cpp , re-compile |
Home-hosted LLM | Run Goinfer on your GPU desktop and another Goinfer in a data-center (static IP/DNS) |
- Go (any version, go will automatically use Go-1.25 to build Goinfer)
- GCC/LLVM if you want to build llama.cpp or ik_llama.cpp or …
- NodeJS (optional, the llama.cpp frontend is already built)
- One or more *.gguf model files
See the Containerfile to build a Docker/Podman image with official Nvidia images, CUDA-13, GCC-14 and optimized CPU flags.
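As an illustration, building and running that image could look like the following sketch. The image tag, GPU device flag, and mounted paths are assumptions, adapt them to your setup:
# build the image from the Containerfile (tag is an example)
podman build -f Containerfile -t goinfer .
# run it with GPU access, your models, and the two default ports (flags are illustrative)
podman run --rm --device nvidia.com/gpu=all \
  -v /home/me/models:/models -p 4444:4444 -p 5555:5555 goinfer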
git clone https://github.com/LM4eu/goinfer
cd goinfer
# discover the parent directories of your GGUF files
# - find the files *.gguf
# - -printf their folders (%h)
# - sort them, -u to keep a unique copy of each folder
# - while read xxx; do xxx; done => print the parent folders separated by ":"
export GI_MODELS_DIR="$(find ~ /mnt -name '*.gguf' -printf '%h\0' | sort -zu |
while IFS= read -rd '' d; do [[ $p && $d == "$p"/* ]] && continue; echo -n "$d:"; p=$d; done)"
# set the path of your inference engine (llama.cpp/ik_llama.cpp/...)
export GI_LLAMA_EXE=/home/me/bin/llama-server
# generate the config
go run . -write
# voilà, it's running
go run . -no-api-key
Goinfer listens on the ports defined in goinfer.yml.
Default ports:
- :4444 for the extra-featured endpoints /models, /completions, /v1/chat/completions
- :5555 for the OpenAI-compatible API (provided by llama-swap)
# use the default model
curl -X POST localhost:4444/completions -d '{"prompt":"Hello"}'
# list the models
curl -X GET localhost:4444/models | jq
# pick a model and prompt it
curl -X POST localhost:4444/completions \
-d '{ "model":"qwen-3b", "prompt":"Hello AI" }'
# same using the OpenAI API
curl -X POST localhost:5555/v1/chat/completions \
  -d '{ "model": "qwen-3b",
        "messages": [ {"role":"user", "content":"Hello AI"} ] }'
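The feature table above also mentions streaming responses. A minimal streaming request might look like this sketch, assuming the OpenAI-style "stream": true parameter is honored (the -N flag keeps curl from buffering the server-sent events):
# stream the answer token by token
curl -N -X POST localhost:5555/v1/chat/completions \
  -d '{ "model": "qwen-3b", "stream": true,
        "messages": [ {"role":"user", "content":"Hello AI"} ] }'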
Build all dependencies and run Goinfer with the bash script clone-pull-build-run.sh:
- clone and build llama.cpp using optimization flags
- clone and build the llama-swap frontend with --build--swap:
git clone https://github.com/LM4eu/goinfer
goinfer/scripts/clone-pull-build-run.sh --build--swap
Perfect for setting up the environment and for updating/building the dependencies daily.
No need to edit the configuration files manually: this script discovers your GGUF files, and your personalized configuration files are generated automatically.
The script ends by running a fully configured Goinfer server.
To reuse your own llama-server, set:
export GI_LLAMA_EXE=/home/me/path/llama-server
(this prevents cloning/building llama.cpp)
If this script finds too many *.gguf files, set:
export GI_MODELS_DIR=/home/me/models:/home/me/other/path
(this disables the GGUF search and speeds up the script)
Run Goinfer locally without the API key:
./clone-pull-build-run.sh -no-api-key
Full example:
git -C path/repo/goinfer pull --ff-only
export GI_MODELS_DIR=/home/me/models
export GI_DEFAULT_MODEL=my-favorite-model.gguf
export GI_LLAMA_EXE=/home/me/bin/llama-server
path/repo/goinfer/scripts/clone-pull-build-run.sh -no-api-key
Use the --help flag or read the usage text within the script.
Discover the parent folders of your GGUF models:
- find the *.gguf files in $HOME and /mnt
- -printf their folders (%h) separated by the nul character \0 (supports folder names containing newline characters)
- sort them, -u keeps a unique copy of each folder (-z = input is \0-separated)
- the while read loop keeps only the top-most parent folders
- echo -n "$d:" prints each parent folder followed by : (-n = no trailing newline)
export GI_MODELS_DIR="$(find "$HOME" /mnt -type f -name '*.gguf' -printf '%h\0' | sort -zu |
while IFS= read -rd '' d; do [[ $p && $d == "$p"/* ]] && continue; echo -n "$d:"; p=$d; done)"
# or set it manually
export GI_MODELS_DIR=/path/to/my/models
# multiple paths
export GI_MODELS_DIR=/path1:/path2:/path3
GI_MODELS_DIR is the root path where your models are stored.
goinfer will search for *.gguf files within all GI_MODELS_DIR sub-folders,
so you can organize your models in a folder tree.
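For example, with GI_MODELS_DIR=/home/me/models, a layout like this (paths are illustrative) is discovered recursively:
# example model tree, every *.gguf below the root is found
/home/me/models/qwen/Qwen2.5-Coder-0.5B-Q8_0.gguf
/home/me/models/qwen/Qwen2.5-Coder-1.5B-Q8_0.gguf
/home/me/models/mistral/Mistral-7B-Instruct-Q4_K_M.gguf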
The other environment variables are:
export GI_LLAMA_EXE=/path/to/my/llama-server
export GI_HOST=0.0.0.0 # exposing llama-server is risky
export GI_ORIGINS= # disabling CORS is risky
export GI_API_KEY="PLEASE SET SECURE API KEY"
Disable Gin debug logs:
export GIN_MODE=release
The flag -write also generates a random API key in goinfer.yml.
This flag can be combined with (see the examples below):
- -debug sets the debug API key (only during the dev cycle)
- -no-api-key sets placeholder values ("Please ⚠️ Set your API key") for both the user and the admin API keys
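A sketch of those combinations, assuming the flags behave as described above:
# generate goinfer.yml with a random API key
go run . -write
# generate the config with the debug API key (dev cycle only)
go run . -write -debug
# generate the config with the placeholder keys (no real API key)
go run . -write -no-api-key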
Set the Authorization header within the HTTP request:
curl -X POST https://localhost:4444/completions \
-H "Authorization: Bearer $GI_API_KEY" \
-d '{ "prompt": "Say hello in French" }'
# Goinfer recursively searches GGUF files in one or multiple folders separated by ':'
# List your GGUF dirs with `locate .gguf | sed -e 's,/[^/]*$,,' | uniq`
models_dir: /home/me/models
# ⚠️ Set your API key, can be 64-hex-digit (32-byte) 🚨
# Generate a random API key with: ./goinfer -write
api_key: "PLEASE SET USER API KEY"
origins: # CORS whitelist
- "https://my-frontend.example.com"
- "http://localhost"
listen:
# format: <address>: <list of enabled services>
# <address> can be <ip|host>:<port> or simply :<port> when <host> is localhost
":4444": goinfer # /completions endpoint letting tools like Agent-Smith doing the templating
":5555": llama-swap # OpenAI-compatible API by llama-swap
llama:
exe: /home/me/llama.cpp/build/bin/llama-server
args:
# common args used for every model
common: --props --no-warmup --no-mmap
# extra args to let tools like Agent-Smith do the templating (/completions endpoint)
goinfer: --jinja --chat-template-file template.jinja
# extra llama-server flag when ./goinfer is used without the -q flag
verbose: --verbose-prompt
# extra llama-server flag for ./goinfer -debug
debug: --verbosity 3
- API key – never commit it; use the env var GI_API_KEY or a secrets manager in production (see the key-generation example below).
- Origins – set to the domains you’ll be calling the server from (including localhost for testing).
- Ports – adjust as needed; make sure the firewall on the server allows them.
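The api_key can be a 64-hex-digit (32-byte) value, and ./goinfer -write generates one for you. If you prefer to craft your own, a sketch (openssl is an assumption, any random generator works):
# let Goinfer generate a random key in goinfer.yml
./goinfer -write
grep api_key goinfer.yml
# or craft your own 32-byte hex key and export it
export GI_API_KEY="$(openssl rand -hex 32)"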
At startup, Goinfer verifies the available GGUF files.
The flag -write allows Goinfer to write the llama-swap.yml file.
Official documentation: github/mostlygeek/llama-swap/wiki/Configuration
logLevel: info # debug, info, warn, error
healthCheckTimeout: 500 # seconds to wait for a model to become ready
metricsMaxInMemory: 1000 # maximum number of metrics to keep in memory
startPort: 6000 # first ${PORT} incremented for each model
macros: # macros to reduce common conf settings
cmd-fim: /home/me/llama.cpp/build/bin/llama-server --props --no-warmup --no-mmap --verbose-prompt
cmd-common: ${cmd-fim} --jinja --port ${PORT}
cmd-goinfer: ${cmd-common} --chat-template-file template.jinja
models:
# model name used in API requests
ggml-org/Qwen2.5-Coder-0.5B-Q8_0-GGUF_qwen2.5-coder-0.5b-q8_0:
description: "Small but capable model for quick testing"
name: Qwen2.5-Coder-0.5B-Q8_0-GGUF # for /v1/models response
useModelName: "Qwen2.5-Coder" # overrides the model name for /upstream (used by llama-swap web UI)
aliases:
- "Qwen2.5-Coder-0.5B-Q8_0" # alternative names (unique globally)
- "Qwen2.5-Coder-0.5B"
env: []
cmd: ${cmd-common} -m /home/c/.cache/llama.cpp/ggml-org_Qwen2.5-Coder-0.5B-Q8_0-GGUF_qwen2.5-coder-0.5b-q8_0.gguf
proxy: http://localhost:${PORT} # default: http://localhost:${PORT}
checkEndpoint: /health # default: /health endpoint
unlisted: false # unlisted=false => list model in /v1/models and /upstream responses
ttl: 3600 # stop the cmd after 1 hour of inactivity
filters:
# inference params to remove from the request, default: ""
# useful for preventing overriding of default server params by requests
strip_params: "temperature,top_p,top_k"
# GI_ prefix for goinfer /completions endpoint
GI_ggml-org/Qwen2.5-Coder-0.5B-Q8_0-GGUF_qwen2.5-coder-0.5b-q8_0:
cmd: ${cmd-goinfer} -m /home/c/.cache/llama.cpp/ggml-org_Qwen2.5-Coder-0.5B-Q8_0-GGUF_qwen2.5-coder-0.5b-q8_0.gguf
proxy: http://localhost:${PORT}
checkEndpoint: /health
unlisted: true # hide model name in /v1/models and /upstream responses
useModelName: ggml-org/Qwen2.5-Coder-0.5B-Q8_0-GGUF # for /upstream (used by llama-swap web UI)
# selected models by llama.cpp are also available with their specific port
ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF:
cmd: ${cmd-fim} --fim-qwen-1.5b-default
proxy: http://localhost:8012
checkEndpoint: /health
unlisted: false
# preload some models on startup
hooks:
on_startup:
preload:
- "Qwen2.5-1.5B-Instruct-Q4_K_M"
# Keep some models loaded indefinitely, while others are swapped out
# see https://github.com/mostlygeek/llama-swap/pull/109
groups:
# example1: only one model is allowed to run at a time (default mode)
"group1":
swap: true
exclusive: true
members:
- "llama"
- "qwen-unlisted"
# example2: all the models in this group2 can run at the same time
# loading another model => unloads all this group2
"group2":
swap: false
exclusive: false
members:
- "docker-llama"
- "modelA"
- "modelB"
# example3: persistent models are never unloaded
"forever":
persistent: true
swap: false
exclusive: false
members:
- "forever-modelA"
- "forever-modelB"
- "forever-modelC"
- Flags override environment variables, which override the YAML config: Cfg defined in conf.go (see the sketch below)
- GGUF file discovery: Search() in models.go
- Graceful shutdown handling: handleShutdown() in goinfer.go
- API-key authentication per service: configureAPIKeyAuth() in router.go
- Comprehensive error handling: gie package in errors.go
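A quick illustration of that precedence, a sketch assuming the behavior described above:
# goinfer.yml contains api_key: "..."   (lowest precedence)
export GI_API_KEY="my-env-key"          # the env var overrides the YAML value
./goinfer -no-api-key                   # the flag overrides the env var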
Each service can be enabled/disabled in goinfer.yml.
Method | Path | Description |
---|---|---|
GET | / | llama.cpp Web UI |
GET | /ui | llama-swap Web UI |
GET | /models | List available GGUF models |
POST | /completions | Llama.cpp inference API |
GET | /v1/models | List models by llama-swap |
POST | /v1/chat/completions | OpenAI-compatible chat endpoint |
POST | /v1/* | Other OpenAI endpoints |
POST | /rerank /v1/rerank | Reorder or answer questions about a document |
POST | /infill | Auto-complete source code (or other edits) |
GET | /logs /logs/stream | Retrieve the llama-swap or llama.cpp logs |
GET | /props | Get the llama.cpp settings |
GET | /unload | Stop all inference engines |
GET | /running | List the running inference engines |
GET | /health | Check if everything is OK |
All endpoints require an Authorization: Bearer $GI_API_KEY header.
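For example, a few calls against the default :4444 listener; which endpoint is served on which port depends on your listen: configuration, so adjust the port if needed:
# all calls need the Authorization header
curl -H "Authorization: Bearer $GI_API_KEY" localhost:4444/models
curl -H "Authorization: Bearer $GI_API_KEY" localhost:4444/running
curl -H "Authorization: Bearer $GI_API_KEY" localhost:4444/health
# stop all inference engines
curl -H "Authorization: Bearer $GI_API_KEY" localhost:4444/unload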
llama-swap starts llama-server using the command lines configured in llama-swap.yml.
Goinfer generates that llama-swap.yml file, setting two different command lines for each model:
- a classic command line for the models listed by /v1/models (to be used by tools like Cline / RooCode)
- a command line with the extra arguments --jinja --chat-template-file template.jinja when the requested model is prefixed with GI_
The first one is suitable for most use cases, such as Cline / RooCode.
The second one targets tools like agent-smith that require full inference control (e.g. no default Jinja template).
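A sketch of that second case: calling /completions with a GI_-prefixed model name and a raw, already-templated prompt. The model name comes from the generated llama-swap.yml, the ChatML markers are just an example of caller-side templating, and whether the caller or Goinfer adds the GI_ prefix may differ in your setup:
curl -X POST localhost:4444/completions \
  -H "Authorization: Bearer $GI_API_KEY" \
  -d '{ "model": "GI_ggml-org/Qwen2.5-Coder-0.5B-Q8_0-GGUF_qwen2.5-coder-0.5b-q8_0",
        "prompt": "<|im_start|>user\nHello AI<|im_end|>\n<|im_start|>assistant\n" }'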
╭──────────────────╮ 1 ──> ╭────────────────────╮        ╭──────────────╮
│ GPU-rich desktop │       │ host static IP/DNS │ <── 2  │ end-user app │
│ (Goinfer client) │ <── 3 │ (Goinfer server)   │        │ (browser/API)│
╰──────────────────╯ 4 ──> ╰────────────────────╯ 5 ──>  ╰──────────────╯
- the Goinfer client connects to the Goinfer server that has a static IP (or DNS)
- the end user sends a prompt to the cloud-hosted Goinfer server
- the Goinfer server reuses the connection to the Goinfer client and forwards it the prompt
- the Goinfer client replies with the answer generated by the local LLM (llama.cpp)
- the Goinfer server forwards the response to the end user
No inbound ports are opened on the Goinfer client or on the end-user side, maximizing security and anonymity between the GPU-rich desktop and the end user.
Another layer of security is the encrypted double authentication between the Goinfer client and the Goinfer server. Furthermore, we recommend using HTTPS on port 443 for all these communications and avoiding sub-domains, because sub-domain names remain visible over HTTPS while URL paths do not.
High availability is provided by the multiple-clients/multiple-servers architecture:
- The end-user app connects to one of the available Goinfer servers.
- All the running Goinfer clients connect to all the Goinfer servers.
- The Goinfer servers favor the most idle Goinfer clients depending on their capacity (vision prompts are sent to GPU-capable clients running the adequate LLM).
- Fallback to CPU offloading when appropriate.
On a VPS, cloud VM, or any machine with a public address:
./goinfer
On your desktop with a GPU:
./goinfer
The client will connect, register its available models and start listening for inference requests.
# list the available models
curl -X GET https://my-goinfer-server.com/v1/models \
  -H "Authorization: Bearer $GI_API_KEY"
# pick a model and prompt it
curl -X POST https://my-goinfer-server.com/v1/chat/completions \
  -H "Authorization: Bearer $GI_API_KEY" \
  -d '{ "model":"Qwen2.5-1.5B-Instruct-Q4_K_M",
        "messages":[{"role":"user", "content":"Say hello in French"}] }'
You receive a JSON response generated by the model running on your GPU rig.
In March 2023, Goinfer was an early local-LLM proxy that swapped models and supported Ollama, Llama.cpp, and KoboldCpp. Goinfer was created to meet two needs:
- to swap the engine and the model at runtime, something that didn’t exist back then
- to run pre-configured templated prompts
This second point has been moved to the project github/synw/agent-smith with more templated prompts in github/synw/agent-smith-plugins.
To simplify maintenance, we decided in August 2025 to replace our process management with another well-maintained project.
As we no longer use Ollama/KoboldCpp, we integrated llama-swap into Goinfer to handle communication with llama-server.
Today the needs have evolved. What we need most right now is a proxy that can act as a secure intermediary between a client (frontend/CLI) and an inference engine (local/cloud), with these constraints:
Client | Server | Constraint |
---|---|---|
Frontend | OpenRouter | Intermediate proxy required to manage the OpenRouter key without exposing it on the frontend |
Any | Home GPU rig | Access to a home GPU rig whose network forbids inbound TCP connections |
Integrate a Web UI to select the model(s) to enable.
Optimizer for the llama-server command line: find the best --gpu-layers --override-tensor --n-cpu-moe --ctx-size ... by iterating on GPU-allocation errors and benchmarking. Timeline of llama.cpp optimization efforts:
- Apr 2024 llama.cpp PR -ngl auto (draft)
- Jan 2025 [study](https://github.com/robbiemu/llama-gguf-optimize)
- Mar 2025 Python script determining -ngl
- Jun 2025 another llama.cpp PR (draft) based on these ideas
- Jun 2025 Python script running llama-bench to find the best -b -ub -fa -t -ngl -ot (maybe integrated into llama.cpp)
This last Python script, llama-optimus, is nice and could also be used for ik_llama.cpp. Its README explains:
Flag | Why it matters | Suggested search values | Notes |
---|---|---|---|
--mmap / --no-mmap (memory-map model vs. fully load) | • On fast NVMe & Apple SSD, --mmap 1 (default) is fine. • On slower HDD/remote disks, disabling mmap (--no-mmap or --mmap 0) and loading the whole model into RAM often gives 10-20 % faster generation (no page-fault stalls). | [0, 1] (boolean) | Keep default 1; let Optuna see if 0 wins on a given box. |
--cache-type-k / --cache-type-v | Setting key/value cache to f16 vs q4 or i8 trades RAM vs speed. Most Apple-Metal & CUDA users stick with f16 (fast, larger). For low-RAM CPUs increasing speed is impossible if it swaps; q4 can shrink cache 2-3× at ~3-5 % speed cost. | ["f16","q4"] for both k & v (skip i8 unless you target very tiny devices). | Only worth searching when the user is on CPU-only or small-VRAM GPU. You can gate this by detecting “CUDA not found” or VRAM < 8 GB. |
--main-gpu / --gpu-split (or --tensor-split) | Relevant only for multi-GPU rigs (NVIDIA). Picking the right primary or a tensor split can cut VRAM fragmentation and enable higher -ngl. | If multi-GPU detected, expose [0,1] for main-gpu and a handful of tensor-split presets (0,1 , 0,0.5,0.5 , etc.). | Keep disabled on single-GPU/Apple Silicon to avoid wasted trials. |
[Preliminary] --flash-attn-type 0/1/2 (v0.2+ of llama.cpp) | Metal + CUDA now have two flash-attention kernels (0 ≈ old GEMM, 1 = FMHA, 2 = in-place FMHA). !!Note!!: Not yet merged to llama.cpp main. Some M-series Macs get +5-8 % with type 2 vs 1. | [0,1,2], but only if llama.cpp commit ≥ May 2025. | Add a version guard: skip the flag on older builds. |
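As an illustration of the kind of sweep such a tool automates, llama-bench accepts comma-separated candidate values; the model path and value ranges below are placeholders:
# -ngl GPU layers, -fa flash attention, -t threads, -b batch size
llama-bench -m /path/to/model.gguf -ngl 20,30,40 -fa 0,1 -t 8,16 -b 512,2048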
When the VRAM is not enough, or when the user needs to increase the context size, Goinfer needs to offload some layers to the CPU.
The idea is to identify the least-used tensors and offload them in priority.
The command llama-gguf lists the tensors (experts are usually suffixed with _exps).
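A sketch of that idea; the exact llama-gguf invocation and the tensor-name pattern are assumptions that may differ for your build and model:
# list the tensor names of a model (expert tensors usually end with _exps)
llama-gguf /path/to/model.gguf r | grep _exps
# keep the expert FFN tensors on the CPU, everything else on the GPU
llama-server -m /path/to/model.gguf --gpu-layers 99 \
  --override-tensor 'ffn_.*_exps.*=CPU'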
Two Goinfer instances (client / server mode):
- a Goinfer on a GPU machine that runs in client mode
- a Goinfer on a machine in a data-center (static IP) that runs in server mode
- the client Goinfer connects to the server Goinfer (here, the server is the backend of a web app)
- the user sends their inference request to the backend (data-center) which forwards it to the client Goinfer
- we could imagine installing a client Goinfer on every computer with a good GPU, with the server Goinfer forwarding inference requests to the connected client Goinfer according to the requested model
Manage the OpenRouter API key of an AI-powered frontend.
- Comprehensive web admin (monitoring, download/delete .gguf files, edit config, restart, git pull + rebuild llama.cpp, remote shell, upgrade Linux, reboot the machine, and other SysAdmin tasks)
Contribute – if you’re interested in any of the above, open an issue or submit a PR :)
Integrate a Web UI to ease creation of multi-step AI agents like:
- github/synw/agent-smith
- n8n
- flowiseai
- tools from github/HKUDS
The agentic prompt syntax YALP and its VS-Code extension, authored by two Germans: Nils Abegg and einfachai (see also einfachai.com).
Some inspiration to extend the Goinfer stack:
- compose.yml with something like github/j4ys0n/local-ai-stack and github/LLemonStack/llemonstack
- WebUI: github/oobabooga/text-generation-webui, github/danny-avila/LibreChat, github/JShollaj/Awesome-LLM-Web-UI
- Vector Database and Vector Search Engine github/qdrant/qdrant
- Convert a webpage (URL) into clean markdown or structured data: github/firecrawl/firecrawl github/unclecode/crawl4ai github/browser-use/browser-use
- github/BerriAI/litellm + github/langfuse/langfuse
- github/claraverse-space/ClaraCore automates the installation & configuration of llama-swap
- Fork the repository.
- Create a feature branch (git checkout -b your-feature)
- Run the test suite: go test ./... (more tests are welcome)
- Ensure the code is formatted and linted with golangci-lint-v2 run --fix
- Submit a PR with a clear description and reference any related issue
Feel free to open discussions for design ideas/decisions.
- License: MIT – see the LICENSE file.
- Dependencies:
  - llama.cpp – Apache-2.0
  - llama-swap – MIT
Special thanks to:
- Georgi Gerganov for releasing and improving llama.cpp in 2023 so we could freely play with Local LLM.
- All other contributors of llama.cpp.
- Benson Wong for maintaining llama-swap with clean and well-documented code.
- The open-source community that makes GPU-based LLM inference possible on commodity hardware. ❤️
Some active local-LLM proxies:
Language | Repository |
---|---|
Go | github/inference-gateway/inference-gateway |
Go | github/lordmathis/llamactl |
Go | github/mostlygeek/llama-swap |
Go | github/thushan/olla |
Python | github/codelion/optillm |
Python | github/llm-proxy/llm-proxy (inactive?) |
Rust | github/x5iu/llm-proxy |
TypeScript | github/bluewave-labs/langroute |
Compared to the alternatives, we like llama-swap for its readable source code and because its author contributes regularly, so we integrated it into Goinfer to handle communication with llama-server (or compatible forks such as ik_llama.cpp). We also like llamactl ;-)
Enjoy remote GPU inference with Goinfer! 🚀
If you have questions, need help setting up your first client/server pair, or want to discuss future features, open an issue or ping us on the repo’s discussion board.