rama-swap is a simple alternative to llama-swap that wraps ramalama.
It is designed to make using ramalama as easy as ollama.
rama-swap can be run either as a container or directly on the host.
The rama-swap container image bundles rama-swap with ramalama's inference container image, so all model inference servers run inside one container.
```sh
# run the container (you will need to add flags to enable gpu inference)
podman run --rm -v ~/.local/share/ramalama:/app/store:ro,Z -p 127.0.0.1:4917:4917 ghcr.io/wk-y/rama-swap:master
```

rama-swap can also be built and run using standard Go tooling:

```sh
go run github.com/wk-y/rama-swap@latest
```

rama-swap supports a few command-line flags for configuration.
See [HELP.txt](HELP.txt) or run `rama-swap -help` for the list of supported flags.
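As a quick illustration (assuming the `rama-swap` binary is on your `PATH`, e.g. after a `go install`), the flag list can be printed directly:

```sh
# print the supported flags (the same list documented in HELP.txt)
rama-swap -help
```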
The following OpenAI-compatible endpoints are proxied to the underlying ramalama instances (see the example requests after this list):

- `/v1/models`
- `/v1/completions`
- `/v1/chat/completions`
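As a rough sketch, these endpoints can be exercised with `curl` against the address published in the podman example above (`127.0.0.1:4917`); the model name `smollm:135m` is only a placeholder and should be replaced with a model available in your local ramalama store:

```sh
# list the models rama-swap exposes
curl http://127.0.0.1:4917/v1/models

# send a chat completion request ("smollm:135m" is a placeholder model name)
curl http://127.0.0.1:4917/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "smollm:135m", "messages": [{"role": "user", "content": "Hello!"}]}'
```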
Ollama-compatible endpoints are also implemented (see the example requests after this list):

- `/api/version`
- `/api/tags`¹
- `/api/chat`¹
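A similar sketch for the Ollama-style endpoints, again assuming the address from the podman example and a placeholder model name:

```sh
# report the server version
curl http://127.0.0.1:4917/api/version

# list models in Ollama's tag format
curl http://127.0.0.1:4917/api/tags

# chat through the Ollama-style endpoint ("smollm:135m" is a placeholder model name)
curl http://127.0.0.1:4917/api/chat \
  -H "Content-Type: application/json" \
  -d '{"model": "smollm:135m", "messages": [{"role": "user", "content": "Hello!"}]}'
```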
Similar to llama-swap, the `/upstream/{model}/...` endpoints provide access to the upstream model servers. Models with slashes in their names are accessible through `/upstream` by replacing the slashes with underscores. `/upstream/` provides links to each model's URL (see the sketch below).
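To illustrate the slash-to-underscore mapping, the sketch below uses a hypothetical model named `library/smollm:135m` and a hypothetical upstream path `/v1/models`; both are placeholders, and the actual paths depend on what the upstream inference server serves:

```sh
# browse the index of upstream servers (links to each model's URL)
curl http://127.0.0.1:4917/upstream/

# reach the upstream server for a hypothetical model named "library/smollm:135m";
# the slash in the model name becomes an underscore, and "/v1/models" is a placeholder path
curl http://127.0.0.1:4917/upstream/library_smollm:135m/v1/models
```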