tools/cli/README.md (2 additions, 1 deletion)
@@ -161,7 +161,7 @@
161
161
|`-mmu, --mmproj-url URL`| URL to a multimodal projector file. see tools/mtmd/README.md<br/>(env: LLAMA_ARG_MMPROJ_URL) |
162
162
|`--mmproj-auto, --no-mmproj, --no-mmproj-auto`| whether to use multimodal projector file (if available), useful when using -hf (default: enabled)<br/>(env: LLAMA_ARG_MMPROJ_AUTO) |
163
163
|`--mmproj-offload, --no-mmproj-offload`| whether to enable GPU offloading for multimodal projector (default: enabled)<br/>(env: LLAMA_ARG_MMPROJ_OFFLOAD) |
164
-
|`--image, --audio FILE`| path to an image or audio file. use with multimodal models, use comma-separated values for multiple files |
164
+
|`--image, --audio, --video FILE`| path to an image, audio, or video file. use with multimodal models, use comma-separated values for multiple files |
165
165
|`--image-min-tokens N`| minimum number of tokens each image can take, only used by vision models with dynamic resolution (default: read from model)<br/>(env: LLAMA_ARG_IMAGE_MIN_TOKENS) |
166
166
|`--image-max-tokens N`| maximum number of tokens each image can take, only used by vision models with dynamic resolution (default: read from model)<br/>(env: LLAMA_ARG_IMAGE_MAX_TOKENS) |
167
167
|`--chat-template-kwargs STRING`| sets additional params for the json template parser, must be a valid json object string, e.g. '{"key1":"value1","key2":"value2"}'<br/>(env: LLAMA_ARG_CHAT_TEMPLATE_KWARGS) |
@@ -174,6 +174,7 @@
174
174
|`--chat-template-file JINJA_TEMPLATE_FILE`| set custom jinja chat template file (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted (unless --jinja is set before this flag):<br/>list of built-in templates:<br/>bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek-ocr, deepseek2, deepseek3, exaone-moe, exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, granite-4.0, granite-4.1, grok-2, hunyuan-dense, hunyuan-moe, hunyuan-vl, kimi-k2, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, solar-open, vicuna, vicuna-orca, yandex, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE_FILE) |
175
175
|`--skip-chat-parsing, --no-skip-chat-parsing`| force a pure content parser, even if a Jinja template is specified; model will output everything in the content section, including any reasoning and/or tool calls (default: disabled)<br/>(env: LLAMA_ARG_SKIP_CHAT_PARSING) |
176
176
|`--simple-io`| use basic IO for better compatibility in subprocesses and limited consoles |
177
+
|`--log-prompts-dir PATH`| Log prompts to directory (only used for debugging, default: disabled) |
177
178
|`--spec-draft-hf, -hfd, -hfrd, --hf-repo-draft <user>/<model>[:quant]`| Same as --hf-repo, but for the draft model (default: unused)<br/>(env: LLAMA_ARG_SPEC_DRAFT_HF_REPO) |
178
179
|`--spec-draft-threads, -td, --threads-draft N`| number of threads to use during generation (default: same as --threads) |
179
180
|`--spec-draft-threads-batch, -tbd, --threads-batch-draft N`| number of threads to use during batch and prompt processing (default: same as --threads-draft) |
|`-mm, --mmproj FILE`| path to a multimodal projector file. see tools/mtmd/README.md<br/>note: if -hf is used, this argument can be omitted<br/>(env: LLAMA_ARG_MMPROJ) |
178
-
|`-tk, --talker-model FILE`| path to the qwen3-omni talker gguf, enables the /v1/audio/speech endpoint<br/>(env: LLAMA_ARG_TALKER_MODEL) |
179
-
|`-c2w, --code2wav-model FILE`| path to the qwen3-omni code2wav gguf, the talker code detokenizer<br/>(env: LLAMA_ARG_CODE2WAV_MODEL) |
180
178
|`-mmu, --mmproj-url URL`| URL to a multimodal projector file. see tools/mtmd/README.md<br/>(env: LLAMA_ARG_MMPROJ_URL) |
181
179
|`--mmproj-auto, --no-mmproj, --no-mmproj-auto`| whether to use multimodal projector file (if available), useful when using -hf (default: enabled)<br/>(env: LLAMA_ARG_MMPROJ_AUTO) |
182
180
|`--mmproj-offload, --no-mmproj-offload`| whether to enable GPU offloading for multimodal projector (default: enabled)<br/>(env: LLAMA_ARG_MMPROJ_OFFLOAD) |
183
181
|`--image-min-tokens N`| minimum number of tokens each image can take, only used by vision models with dynamic resolution (default: read from model)<br/>(env: LLAMA_ARG_IMAGE_MIN_TOKENS) |
184
182
|`--image-max-tokens N`| maximum number of tokens each image can take, only used by vision models with dynamic resolution (default: read from model)<br/>(env: LLAMA_ARG_IMAGE_MAX_TOKENS) |
183
+
|`--mtmd-batch-max-tokens N`| maximum number of image tokens per batch when encoding images (default: 1024)<br/>(env: LLAMA_ARG_MTMD_BATCH_MAX_TOKENS) |
185
184
|`-a, --alias STRING`| set model name aliases, comma-separated (to be used by API)<br/>(env: LLAMA_ARG_ALIAS) |
186
185
|`--tags STRING`| set model tags, comma-separated (informational, not used for routing)<br/>(env: LLAMA_ARG_TAGS) |
|`--webui-mcp-proxy, --no-webui-mcp-proxy`| [DEPRECATED: use --ui-mcp-proxy/--no-ui-mcp-proxy] experimental: whether to enable MCP CORS proxy<br/>(env: LLAMA_ARG_WEBUI_MCP_PROXY) |
198
-
|`--ui-mcp-proxy, --no-ui-mcp-proxy`| experimental: whether to enable MCP CORS proxy - do not enable in untrusted environments (default: disabled)<br/>(env: LLAMA_ARG_UI_MCP_PROXY) |
|`--ui-mcp-proxy, --webui-mcp-proxy, --no-ui-mcp-proxy, --no-webui-mcp-proxy`| experimental: whether to enable MCP CORS proxy - do not enable in untrusted environments (default: disabled)<br/>(env: LLAMA_ARG_UI_MCP_PROXY) |
199
195
|`--tools TOOL1,TOOL2,...`| experimental: whether to enable built-in tools for AI agents - do not enable in untrusted environments (default: no tools)<br/>specify "all" to enable all tools<br/>available tools: read_file, file_glob_search, grep_search, exec_shell_command, write_file, edit_file, apply_diff, get_datetime<br/>(env: LLAMA_ARG_TOOLS) |
200
-
|`--webui, --no-webui`| [DEPRECATED: use --ui/--no-ui] whether to enable the Web UI<br/>(env: LLAMA_ARG_WEBUI) |
201
-
|`--ui, --no-ui`| whether to enable the Web UI (default: enabled)<br/>(env: LLAMA_ARG_UI) |
196
+
|`-ag, --agent, -no-ag, --no-agent`| whether to enable CORS proxy and all built-in tools - do not enable in untrusted environments (default: disabled)<br/>(env: LLAMA_ARG_AGENT) |
197
+
|`--ui, --webui, --no-ui, --no-webui`| whether to enable the Web UI (default: enabled)<br/>(env: LLAMA_ARG_UI) |
202
198
|`--embedding, --embeddings`| restrict to only support embedding use case; use only with dedicated embedding models (default: disabled)<br/>(env: LLAMA_ARG_EMBEDDINGS) |
203
199
|`--rerank, --reranking`| enable reranking endpoint on server (default: disabled)<br/>(env: LLAMA_ARG_RERANKING) |
204
200
|`--api-key KEY`| API key to use for authentication, multiple keys can be provided as a comma-separated list (default: none)<br/>(env: LLAMA_API_KEY) |
@@ -207,6 +203,7 @@ For the full list of features, please refer to [server's changelog](https://gith
207
203
|`--ssl-cert-file FNAME`| path to file a PEM-encoded SSL certificate<br/>(env: LLAMA_ARG_SSL_CERT_FILE) |
208
204
|`--chat-template-kwargs STRING`| sets additional params for the json template parser, must be a valid json object string, e.g. '{"key1":"value1","key2":"value2"}'<br/>(env: LLAMA_ARG_CHAT_TEMPLATE_KWARGS) |
209
205
|`-to, --timeout N`| server read/write timeout in seconds (default: 3600)<br/>(env: LLAMA_ARG_TIMEOUT) |
206
+
|`--sse-ping-interval N`| server SSE ping interval in seconds (-1 = disabled, default: 30)<br/>(env: LLAMA_ARG_SSE_PING_INTERVAL) |
210
207
|`--threads-http N`| number of threads used to process HTTP requests (default: -1)<br/>(env: LLAMA_ARG_THREADS_HTTP) |
|`--cache-reuse N`| min chunk size to attempt reusing from the cache via KV shifting, requires prompt caching to be enabled (default: 0)<br/>[(card)](https://ggml.ai/f0.png)<br/>(env: LLAMA_ARG_CACHE_REUSE) |
@@ -231,6 +228,7 @@ For the full list of features, please refer to [server's changelog](https://gith
231
228
|`-sps, --slot-prompt-similarity SIMILARITY`| how much the prompt of a request must match the prompt of a slot in order to use that slot (default: 0.10, 0.0 = disabled) |
232
229
|`--lora-init-without-apply`| load LoRA adapters without applying them (apply later via POST /lora-adapters) (default: disabled) |
233
230
|`--sleep-idle-seconds SECONDS`| number of seconds of idleness after which the server will sleep (default: -1; -1 = disabled) |
231
+
|`--log-prompts-dir PATH`| Log prompts to directory (only used for debugging, default: disabled) |
234
232
|`--spec-draft-hf, -hfd, -hfrd, --hf-repo-draft <user>/<model>[:quant]`| Same as --hf-repo, but for the draft model (default: unused)<br/>(env: LLAMA_ARG_SPEC_DRAFT_HF_REPO) |
235
233
|`--spec-draft-threads, -td, --threads-draft N`| number of threads to use during generation (default: same as --threads) |
236
234
|`--spec-draft-threads-batch, -tbd, --threads-batch-draft N`| number of threads to use during batch and prompt processing (default: same as --threads-draft) |
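The `--chat-template-kwargs` rows above require the value to be a valid JSON object string. A minimal sketch of building that value from a script, so that quoting and escaping are handled by a JSON encoder rather than by hand, might look like the following. The helper name `chat_template_kwargs_arg` is hypothetical, and the keys are the placeholder ones from the help text, not real template parameters.

```python
import json
import shlex


def chat_template_kwargs_arg(params: dict) -> list[str]:
    """Return the flag and its JSON-object value as two argv entries.

    Hypothetical helper for illustration; it is not part of any tool
    documented in this README.
    """
    if not isinstance(params, dict):
        # The help text asks for a JSON *object*, so reject lists, strings, etc.
        raise TypeError("chat template kwargs must be a JSON object (dict)")
    # Compact separators mirror the style of the example in the help text.
    value = json.dumps(params, separators=(",", ":"))
    return ["--chat-template-kwargs", value]


# Placeholder keys taken from the example in the help text.
argv = chat_template_kwargs_arg({"key1": "value1", "key2": "value2"})

# shlex.join shows how the argument would need to be quoted in a POSIX shell.
print(shlex.join(argv))
```

The returned list can be appended to an argument vector passed to `subprocess.run`, which avoids shell quoting entirely. The same JSON string could instead be placed in the `LLAMA_ARG_CHAT_TEMPLATE_KWARGS` environment variable listed in the table.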