rama-swap is a simple alternative to llama-swap that wraps ramalama. It is designed to make ramalama as easy to use as ollama.
rama-swap can be run as either a container image or directly on the host.

The rama-swap docker image bundles rama-swap with ramalama's inference container image. All model inference servers will run inside one container.
```sh
# run the container (you will need to add flags to enable gpu inference)
podman run --rm -v ~/.local/share/ramalama:/app/store:ro,Z -p 127.0.0.1:4917:4917 ghcr.io/wk-y/rama-swap:master
```
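As a hedged example, on a host where podman's CDI GPU support is configured for an NVIDIA card, the GPU can typically be exposed by adding a `--device` flag; other GPUs and runtimes will need different flags:

```sh
# Assumption: NVIDIA GPU made available to podman via CDI; adjust for your hardware.
podman run --rm --device nvidia.com/gpu=all \
  -v ~/.local/share/ramalama:/app/store:ro,Z \
  -p 127.0.0.1:4917:4917 ghcr.io/wk-y/rama-swap:master
```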
rama-swap can be built and run using standard Go tooling:

```sh
go run github.com/wk-y/rama-swap@latest
```
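If you prefer an installed binary over `go run`, the standard `go install` works too (placing rama-swap in `$GOBIN`, by default `$HOME/go/bin`):

```sh
go install github.com/wk-y/rama-swap@latest
```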
rama-swap supports a few command-line flags for configuration. See [HELP.txt](HELP.txt) or run `rama-swap -help` for the list of supported flags.
The following OpenAI-compatible endpoints are proxied to the underlying ramalama instances:

- /v1/models
- /v1/completions
- /v1/chat/completions
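For example, a chat completion can be requested with a standard OpenAI-style call. The address assumes the 127.0.0.1:4917 port mapping from the container example above, and the model name is a placeholder for whatever `/v1/models` reports:

```sh
# Placeholder model name; list the real ones with: curl http://127.0.0.1:4917/v1/models
curl http://127.0.0.1:4917/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "example-model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```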
Ollama-compatible endpoints are also implemented:

- /api/version
- /api/tags[^1]
- /api/chat[^1]
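As a quick check, the Ollama-style endpoints can be exercised the same way (again assuming the 127.0.0.1:4917 address from the container example):

```sh
# List models in Ollama's tag format and query the reported server version.
curl http://127.0.0.1:4917/api/tags
curl http://127.0.0.1:4917/api/version
```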
Similar to llama-swap, the /upstream/{model}/... endpoints provide access to the upstream model servers. Models with slashes in their name are accessible through /upstream by replacing the slashes with underscores. /upstream/ provides links to each model's URL.
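To illustrate the slash-to-underscore rule, a hypothetical model named library/smollm would be reachable under /upstream/library_smollm/:

```sh
# Index page linking to each model's upstream server:
curl http://127.0.0.1:4917/upstream/
# Hypothetical model "library/smollm" (slashes replaced with underscores):
curl http://127.0.0.1:4917/upstream/library_smollm/
```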