Getting Started with Shared Models
Prerequisites
Auth token created, using one of the following options:
- Create token via STACKIT Portal
- Create token via product API
Shared Models
The term "Shared Models" refers to models that are used communally by all clients. Through the shared hosting of our LLMs, we enable a large number of users to cost-effectively access these powerful models and utilize them for their specific applications.
Further information about the licenses and endpoints of the provided models can be found on Models & Licenses.
Chat Models
Model | Description | Modalities | Size (Parameters) | Context Length | Max Output Length | Bits | TPM limit* | RPM limit** |
---|---|---|---|---|---|---|---|---|
cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic | The provided model is an 8-bit quantized version of the original Meta Llama 3.3 70B. Meta Llama 3.3 is a significantly enhanced 70 billion parameter auto-regressive language model, offering similar performance to the 405B parameter Llama 3.1 model. It was trained on a new mix of publicly available online data. The model can process and generate multilingual text and can also produce code. It uses Grouped-Query Attention (GQA) for improved inference scalability, was trained on over 15 trillion tokens, and has a knowledge cutoff of December 2023. Meta Llama 3.3 supports eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. The model is intended for assistant-like chat and can be used in a variety of applications, e.g. agentic AI, RAG, code generation, chatbots. | Input: text / Output: text | 70.60 Billion | 128k tokens | 4096 tokens | 8 | 130000 | 80 |
google/gemma-3-27b-it | Gemma is a family of lightweight, state-of-the-art open models from Google. Gemma 3 models are multimodal, handling text and image input and generating text output. Gemma 3 has a large 128K context window, multilingual support in over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well suited for a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning. The model is intended for assistant-like chat with vision understanding and can be used in a variety of applications, e.g. image understanding, visual document understanding, agentic AI, RAG, code generation, chatbots. | Input: text, image / Output: text | 27.40 Billion | 37k tokens | 4096 tokens | 16 | 130000 | 80 |
neuralmagic/Mistral-Nemo-Instruct-2407-FP8 | The provided model is an 8-bit quantized version of the original Mistral Nemo Instruct 2407. The Mistral-Nemo-Instruct-2407 Large Language Model (LLM) is an instruct fine-tuned version of Mistral-Nemo-Base-2407. Trained jointly by Mistral AI and NVIDIA, it significantly outperforms existing models of smaller or similar size. The model was trained with a 128k context window on a large proportion of multilingual and code data. It supports multiple languages, including French, German, Spanish, Italian, Portuguese, Russian, Chinese, and Japanese, with varying levels of proficiency. The model is intended for commercial and research use, particularly for assistant-like chat applications. | Input: text / Output: text | 12.20 Billion | 128k tokens | 4096 tokens | 8 | 130000 | 80 |
neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 | The provided model is an 8-bit quantized version of the original Meta Llama 3.1 8B. Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. The Meta Llama 3.1 model supports eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. It is optimized for multilingual dialogue use cases and outperforms many available open-source and closed chat models on common industry benchmarks. | Input: text / Output: text | 8.03 Billion | 128k tokens | 4096 tokens | 8 | 130000 | 80 |
* TPM - Tokens per minute: The TPM limit is calculated by adding the prompt tokens to the generation tokens, with generation tokens weighted by a factor of 5.
** RPM - Requests per minute.
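For illustration, under this weighting a request with 1000 prompt tokens and 200 generated tokens counts as 1000 + 5 × 200 = 2000 tokens against the per-minute budget. The following sketch only restates that calculation; the function and variable names are chosen for this example and are not part of the product API.

```python
# Illustration only: how a request's weighted token usage counts against the TPM limit.
GENERATION_WEIGHT = 5      # generation tokens are weighted by a factor of 5 (see note above)
TPM_LIMIT_CHAT = 130_000   # TPM limit of the shared chat models

def weighted_tokens(prompt_tokens: int, completion_tokens: int) -> int:
    """Tokens a single request counts against the per-minute token budget."""
    return prompt_tokens + GENERATION_WEIGHT * completion_tokens

print(weighted_tokens(1000, 200))                     # 2000
print(weighted_tokens(1000, 200) <= TPM_LIMIT_CHAT)   # True
```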
Example usage:
MODEL_SERVING_AUTH_TOKEN="ey..."
curl -X POST https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1/chat/completions \
--header "Authorization: Bearer $MODEL_SERVING_AUTH_TOKEN" \
--header "Content-Type: application/json" \
--data '{"model": "cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic","messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Why is this documentation great?"}], "max_completion_tokens": 250,"temperature": 0.1}'
# {"id":"cmpl-a7a2f78e5ff74fc5b975b8d0059a0001","object":"text_completion","created":1729776894,"model":"cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic","choices":[{"index":0,"text":"ΒΆ\n\nThis documentation is designed to help you understand how to use the PyTorch library, which is a popular open-source machine learning framework. Here are some reasons why you should use this documentation:\n\n1. **Comprehensive coverage**: This documentation covers all aspects of PyTorch, including its core features, modules, and tools. You'll find detailed explanations, examples, and tutorials to help you master PyTorch.\n2. **Official source**: This documentation is maintained by the PyTorch team, ensuring that the information is accurate, up-to-date, and authoritative.\n3. **Easy to navigate**: The documentation is organized in a logical and intuitive way, making it easy to find the information you need. You can browse by topic, search for specific keywords, or use the table of contents to navigate.\n4. **Code examples and tutorials**: The documentation includes numerous code examples, tutorials, and guides to help you get started with PyTorch. You can learn by doing, and the examples will help you understand how to apply PyTorch to your own projects.\n5. **Community involvement**: The PyTorch community is active and engaged, and the documentation is open-source. This means that you can contribute to the documentation,","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":8,"total_tokens":258,"completion_tokens":250}}
Embedding Models
Model | Description | Modalities | Maximum Input Tokens | Output Dimension | TPM limit* | RPM limit** |
---|---|---|---|---|---|---|
intfloat/e5-mistral-7b-instruct | This is an embedding model and has no chat capabilities. The E5 Mistral 7B Instruct model is a powerful language model that excels in text embedding tasks, particularly in English. With 32 layers and an embedding size of 4096, it's well-suited for tasks like passage ranking and retrieval. However, it's recommended to use this model for English-only tasks, as its performance may degrade for other languages. It's capable of handling long input sequences up to 4096 tokens, making it well-suited for complex tasks. Overall, the E5 Mistral 7B Instruct model offers a robust and efficient solution for text embedding tasks, making it a valuable tool for natural language processing applications. | Input: text | 4096 | 4096 | 60000 | 600 |
* TPM - Tokens per minute: The TPM limit is calculated by summing all input tokens.
** RPM - Requests per minute.
Example usage:
curl -i -X POST https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1/embeddings \
  --header "Authorization: Bearer ey..." \
  --header "Content-Type: application/json" \
  --data '{"model": "intfloat/e5-mistral-7b-instruct","input": ["Test"]}'
# {"id":"embd-96d405966aa14e8eb3d7e202a006e2cf","object":"list","created":1262540,"model":"intfloat/e5-mistral-7b-instruct","data":[{"index":0,"object":"embedding","embedding":[0.0167388916015625,0.005096435546875,0.01302337646484375,0.006805419921875,0.0089569091796875,-0.01406097412109375,...]}],"usage":{"prompt_tokens":3,"total_tokens":3,"completion_tokens":0}}