Large language models (LLMs) have made significant strides in text generation, problem-solving, and following instructions. As businesses integrate LLMs to develop cutting-edge solutions, the need for scalable, secure, and efficient deployment platforms becomes increasingly imperative. Kubernetes has risen as the preferred option for its scalability, flexibility, portability, and resilience.
In this demo, we show how to deploy fine-tuned LLM inference containers on Oracle Container Engine for Kubernetes (OKE), an OCI-managed Kubernetes service that simplifies deployments and operations at scale for enterprises. This approach lets enterprises keep their custom models and datasets within their own tenancy without relying on a third-party inference API.
We will use Text Generation Inference (TGI) as the inference framework to serve the fine-tuned Large Language Models.
Check out the demo [here](TODO LINK)
Text Generation Inference (TGI) is an open source toolkit, available as a container image, for serving popular LLMs. The example fine-tuned model in this post is based on Llama 2, but you can use TGI to deploy other open source LLMs, including Mistral, Falcon, BLOOM, and GPT-NeoX. TGI enables high-performance text generation with various optimization features supported on multiple AI accelerators, including NVIDIA GPUs with CUDA 12.2 or later.
The GPU memory requirement is largely determined by the pretrained LLM's size. For example, Llama 2 7B (7 billion parameters) loaded in 16-bit precision requires 7 billion parameters * 2 bytes per parameter (16 bits / 8 bits per byte) = 14 GB for the model weights.
Quantization is a technique used to reduce model size and improve inference performance by decreasing precision without significantly sacrificing accuracy. In this example, we use the quantization feature of TGI to load a fine-tuned model based on Llama 2 13B in 8-bit precision and fit it on VM.GPU.A10.1 (a single NVIDIA A10 Tensor Core GPU with 24 GB of VRAM). The following image depicts the real memory utilization after the inference container loads the quantized model. Alternatively, consider employing a smaller model, opting for a GPU instance with larger memory capacity, or selecting an instance with multiple GPUs, such as VM.GPU.A10.2 (2x NVIDIA A10 GPUs), to prevent CUDA out-of-memory errors. By default, TGI shards across and uses all available GPUs to run the model.
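As a rough sketch of loading a 13B model in 8-bit with the TGI launcher: 13 billion parameters at 1 byte per parameter is roughly 13 GB of weights, which fits within the A10's 24 GB. The model ID below is a placeholder; point TGI at your own fine-tuned model instead.

```bash
# Illustrative only: load a 13B model in 8-bit so it fits on a single A10 (24 GB)
model=meta-llama/Llama-2-13b-chat-hf   # placeholder model ID; substitute your fine-tuned model
volume=$PWD/data                       # cache weights locally between runs

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:2.0 \
  --model-id $model \
  --quantize bitsandbytes              # 8-bit quantization via bitsandbytes
```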
These are the steps we will perform in this demo:

- (Optional) Take one of the pretrained LLMs from the HuggingFace model hub, such as Meta Llama 2 13B, and fine-tune it with a targeted dataset on an OCI NVIDIA GPU Compute instance. If you're particularly interested in this step, you can refer to this AI solution to learn how to do the fine-tuning.
- Save the customized LLM locally and upload it to OCI Object Storage to store it as a model repository.
- Deploy an OKE cluster and create a node pool consisting of `VM.GPU.A10.1` Compute instances, powered by NVIDIA A10 Tensor Core GPUs (or any other Compute shape you want). OKE offers worker node images with preinstalled NVIDIA GPU drivers.
- Install the NVIDIA device plugin for Kubernetes, a DaemonSet that allows you to run GPU-enabled containers in the Kubernetes cluster.
- Build a Docker image for the `model-downloader` container to pull model files from the Object Storage service. (The model-downloader section provides more details.)
- Create a Kubernetes deployment to roll out the TGI container and the `model-downloader` container. To schedule the TGI container on a GPU, specify the resource limit `nvidia.com/gpu`. Run `model-downloader` as an init container to ensure that the TGI container only starts after the model download has completed successfully. (A minimal sketch of this deployment and the load balancer service follows this list.)
- Create a Kubernetes service of type `LoadBalancer`. OKE automatically spawns an OCI load balancer to expose the TGI API to the internet, allowing us to consume it wherever we want in our AI applications.
- To interact with the model, use `curl` to send a request to `<Load Balancer IP address>:<port>/generate`, or deploy an inference client, such as Gradio, to observe your custom LLM in action. We also prepared a Python script to run requests against the model with Python.
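The deployment and service described above might look roughly like the following. This is a minimal sketch, not the exact manifests from this repository; the resource names, the `model-downloader` image path in OCI Container Registry, and the `/data/model` path are assumptions for illustration.

```bash
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi
  template:
    metadata:
      labels:
        app: tgi
    spec:
      volumes:
        - name: model-store          # shared between the init container and TGI
          emptyDir: {}
        - name: dshm                 # larger /dev/shm for TGI
          emptyDir:
            medium: Memory
      initContainers:
        - name: model-downloader     # pulls the fine-tuned model from Object Storage
          image: <region-key>.ocir.io/<tenancy-namespace>/model-downloader:latest
          volumeMounts:
            - name: model-store
              mountPath: /data
      containers:
        - name: tgi
          image: ghcr.io/huggingface/text-generation-inference:2.0
          args: ["--model-id", "/data/model", "--quantize", "bitsandbytes"]
          ports:
            - containerPort: 80
          resources:
            limits:
              nvidia.com/gpu: 1      # schedules the pod onto a GPU worker node
          volumeMounts:
            - name: model-store
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
---
apiVersion: v1
kind: Service
metadata:
  name: tgi
spec:
  type: LoadBalancer               # OKE provisions an OCI load balancer for this service
  selector:
    app: tgi
  ports:
    - port: 80
      targetPort: 80
EOF
```

Running `model-downloader` as an init container guarantees the weights are present on the shared `emptyDir` volume before the TGI container starts.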
- An OCI tenancy with available credits to spend, and access to NVIDIA A10 Tensor Core GPU(s).
- A registered and verified HuggingFace account with a valid Access Token
For more information, see the following resources:
- HuggingFace text generation inference
- NVIDIA device plugin for Kubernetes
- HuggingFace model hub
- OCI Container Engine for Kubernetes (OKE)
- OCI Container Registry
- Kubernetes GPU scheduling
- NVIDIA GPU instances on OCI
To create an OKE Cluster, we can perform this step through the OCI Console:
Then wait for the cluster to be created; this takes around 5 minutes.
You can access this cluster however you prefer. We recommend using OCI Cloud Shell to access and connect to the cluster, as all OCI configuration is performed automatically. If you still want to use a Compute instance or your own local machine, you will need to set up authentication to your OCI tenancy. You must also have downloaded and installed OCI CLI version 2.24.0 (or later) and configured it for use. If your version of the OCI CLI is earlier than 2.24.0, download and install a newer version from here.
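If you go down the local machine or Compute instance route, the following sketch shows how to check your CLI version and create the configuration file; `oci setup config` prompts interactively for your tenancy OCID, user OCID, region, and an API signing key.

```bash
# Check the installed OCI CLI version (2.24.0 or later is required)
oci --version

# Interactively create ~/.oci/config and an API signing key
oci setup config

# Sanity check: confirm the CLI can reach OCI
oci iam region list
```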
After the cluster has been provisioned, follow these steps to access the OKE cluster:
- Click **Access Cluster** on the **Cluster details** page.
- Accept the default **Cloud Shell Access** and click **Copy** to copy the `oci ce cluster create-kubeconfig ...` command.
- To access the cluster, paste the command into your Cloud Shell session and hit Enter.
- Verify that `kubectl` is working by using the `get nodes` command:

      kubectl get nodes

- Repeat this command multiple times until all three nodes show `Ready` in the `STATUS` column. When all nodes are `Ready`, your OKE installation has finished successfully.
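With the nodes Ready, make sure the NVIDIA device plugin DaemonSet is running so Kubernetes can schedule pods onto the GPUs. The manifest URL below follows the pattern documented in the NVIDIA k8s-device-plugin project; the version tag is only an example, so check that project for the current release.

```bash
# Deploy the NVIDIA device plugin as a DaemonSet (version tag is an example)
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml

# Confirm the GPU worker nodes now advertise nvidia.com/gpu as an allocatable resource
kubectl describe nodes | grep -i "nvidia.com/gpu"
```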
To run the TGI container, use the following command:
model=HuggingFaceH4/zephyr-7b-beta # huggingface model's repository (model_creator_name/model_name) (can be any model)
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model
This will set up whichever model from HuggingFace you want, as long as the repository is well-formed and contains the necessary files. To see all possible deploy flags and options, you can use the `--help` flag. It's possible to configure the number of shards, quantization, generation parameters, and more.
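For example, the sketch below prints the launcher help and shows a couple of commonly used options; the values are illustrative.

```bash
# Print every launcher flag and its description
docker run ghcr.io/huggingface/text-generation-inference:2.0 --help

# Illustrative: shard the model across 2 GPUs and quantize to 8-bit
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:2.0 \
  --model-id $model \
  --num-shard 2 \
  --quantize bitsandbytes
```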
After the container has been pulled, you can make API requests to `/generate_stream` on localhost (from within the container) or to `<Load Balancer IP address>:<port>/generate`:
curl 127.0.0.1:8080/generate_stream \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":50}}' \
-H 'Content-Type: application/json'
Here is a full API specification for all callable endpoints.
We've also prepared a Python script here if you'd rather make the requests programmatically using Python.
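As a sketch (assuming the Kubernetes service is named `tgi`, as in the manifest sketch earlier), you can look up the public IP that OKE assigned to the load balancer and call `/generate` through it:

```bash
# Fetch the public IP of the OCI load balancer backing the tgi service
LB_IP=$(kubectl get svc tgi -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Send a generation request through the load balancer
curl http://$LB_IP/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":50}}' \
  -H 'Content-Type: application/json'
```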
We need to set up NVIDIA CUDA and NVCC (the NVIDIA CUDA Compiler) to run parallel code on GPUs. For that, run the following commands:
- Configure the repository:

      curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
        && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
          sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
          sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
        && sudo apt-get update

- Install the NVIDIA Container Toolkit packages:

      sudo apt-get install -y nvidia-container-toolkit

- Configure the container runtime by using the `nvidia-ctk` command:

      sudo nvidia-ctk runtime configure --runtime=docker

- Restart the Docker daemon:

      sudo systemctl restart docker

- Check that CUDA and `nvcc` are installed by running the following commands:

      nvcc -V
      nvidia-smi
It's recommended to install NVIDIA drivers with CUDA 12.2 or higher.
To run the Docker container on a machine with no GPUs or CUDA support, it is enough to remove the `--gpus all` flag and add `--disable-custom-kernels`. Please note that CPU is not the intended platform for this project, so performance might be subpar.
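For example, the earlier `docker run` command becomes the following sketch; expect much slower generation on CPU:

```bash
model=HuggingFaceH4/zephyr-7b-beta   # same example model as above
volume=$PWD/data

# No --gpus flag, and custom CUDA kernels are disabled for CPU-only execution
docker run --shm-size 1g -p 8080:80 -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:2.0 \
  --model-id $model \
  --disable-custom-kernels
```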
TGI supports loading models from the HuggingFace model hub or locally. To retrieve a custom LLM from the OCI Object Storage service, we created a Python script using the OCI Python SDK, packaged it as a container, and stored the Docker image in OCI Container Registry. This `model-downloader` container runs before the initialization of the TGI container. It retrieves the model files from Object Storage and stores them on an `emptyDir` volume, enabling sharing with the TGI container within the same pod.
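The downloader in this repository is a Python script built on the OCI Python SDK. As an illustration of what it does, an equivalent download step using the OCI CLI might look like the following; the bucket name and prefix are placeholders.

```bash
# Download every object under the model prefix into the shared /data volume
# (assumes instance principal auth or a configured ~/.oci/config)
oci os object bulk-download \
  --bucket-name <your-model-bucket> \
  --prefix llama2-13b-finetuned/ \
  --download-dir /data/model
```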
Deploying a production-ready LLM becomes straightforward when using the HuggingFace TGI container and OKE. This approach allows you to harness the benefits of Kubernetes without the complexities of deploying and managing a Kubernetes cluster. The customized LLMs are fine-tuned and hosted within your Oracle Cloud Infrastructure tenancy, offering complete control over data privacy and model security.
This project welcomes contributions from the community. Before submitting a pull request, please review our contribution guide.
Please consult the security guide for our responsible security vulnerability disclosure process.
Copyright (c) 2024 Oracle and/or its affiliates.
Licensed under the Universal Permissive License (UPL), Version 1.0.
See LICENSE for more details.
ORACLE AND ITS AFFILIATES DO NOT PROVIDE ANY WARRANTY WHATSOEVER, EXPRESS OR IMPLIED, FOR ANY SOFTWARE, MATERIAL OR CONTENT OF ANY KIND CONTAINED OR PRODUCED WITHIN THIS REPOSITORY, AND IN PARTICULAR SPECIFICALLY DISCLAIM ANY AND ALL IMPLIED WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. FURTHERMORE, ORACLE AND ITS AFFILIATES DO NOT REPRESENT THAT ANY CUSTOMARY SECURITY REVIEW HAS BEEN PERFORMED WITH RESPECT TO ANY SOFTWARE, MATERIAL OR CONTENT CONTAINED OR PRODUCED WITHIN THIS REPOSITORY. IN ADDITION, AND WITHOUT LIMITING THE FOREGOING, THIRD PARTIES MAY HAVE POSTED SOFTWARE, MATERIAL OR CONTENT TO THIS REPOSITORY WITHOUT ANY REVIEW. USE AT YOUR OWN RISK.