December 31, 2024
The integration of Large Language Models (LLMs) into applications has become a core part of building modern products. Running these models on Kubernetes can bring scalability, fault tolerance, and efficiency to your AI workflows. In this blog post, we will explore how to deploy and manage LLMs on Kubernetes.
I'm not going to cover scaling in depth, but this guide walks you through a setup that covers the basics, so that once you do need to scale it, it will be possible.
The full manifests repo is here: ollama-kubernetes.
Kubernetes provides a robust orchestration layer for managing containerized applications. For LLMs, that means scalability, fault tolerance, and efficient use of resources such as GPUs.
The ollama-kubernetes repository provides a streamlined way to deploy LLMs. Let's walk through the steps to get started.
Before diving in, ensure you have the tools used below installed: git, kubectl, Flox, and Task, plus a container runtime such as Docker for the local cluster.
Start by cloning the ollama-kubernetes repository:
git clone https://github.com/edenreich/ollama-kubernetes.git
cd ollama-kubernetes
I kept everything simple by using vanilla Kubernetes manifests, which also makes it easier to see how everything is structured. I didn't want to reach for Helm, because it comes with a lot of templating that can be hard to understand.
At the beginning I came across a repository claiming to offer a Kubernetes Operator with CRDs for Ollama, which seemed official. Apparently it isn't, and it looked only partially implemented, so I decided to skip it.
The manifests deploy the model server into the ollama namespace. To run it locally and experiment with it, just run:
flox activate
task cluster-create
You should now have a running three-node cluster.
To deploy the LLM as a service (let's start with phi3, a small model, to make sure it fits on the local hardware):
kubectl apply -f ollama/phi3/
It may take some time to download the model, so be patient; the container will only be up and running once the LLM has finished downloading.
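To give you a feel for what kubectl apply is creating, here is a stripped-down sketch of a Deployment and Service for an Ollama-served model. This is my own simplified illustration - the image tag, resource names, and the postStart pull of phi3 are assumptions on my part - so treat the manifests in ollama/phi3/ as the source of truth.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest # assumption: the official Ollama image
          ports:
            - containerPort: 11434 # Ollama's default API port
          lifecycle:
            postStart:
              exec:
                # Pull the model once the server is up; retry until it succeeds.
                command: ["/bin/sh", "-c", "until ollama pull phi3; do sleep 2; done"]
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  selector:
    app: ollama
  ports:
    - port: 8080 # matches the port-forward used later in this post
      targetPort: 11434

The repo may pull the model differently (for example via an init step), but the shape is the same: a Deployment running the Ollama server plus a Service exposing it inside the cluster.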
Check the status of your pods:
kubectl -n ollama get pods
You should see pods running the LLM service. To access the API:
kubectl -n ollama port-forward svc/ollama 8080:8080
curl http://localhost:8080/api/tags
This lists the models that were pulled earlier, confirming the LLM is available via the API.
If your cluster has GPUs, ensure the NVIDIA device plugin is installed; it can be installed via its Helm chart on all major cloud providers. Don't forget to comment back in the nvidia.com/gpu resource limits, so the pod will automatically be scheduled onto nodes that have a suitable NVIDIA GPU.
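For reference, the block to comment back in is the standard GPU resource limit on the Ollama container - something along these lines (the exact snippet in the repo may differ):

resources:
  limits:
    nvidia.com/gpu: 1 # schedules the pod onto a node that exposes an NVIDIA GPU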
You can interact with the model using port-forward and a handful of curl requests from the terminal to generate responses, or...
You can interact with it the way I like to: by simply using OpenWeb-UI, which is fully open source - it's awesome.
Interacting with the API directly only makes sense if you're planning to build agents. If you'd like me to cover that, let me know.
Back to the subject. To deploy OpenWeb-UI:
kubectl apply -f openweb-ui/
Verify the UI is ready:
kubectl -n openweb-ui rollout status deploy openweb-ui
It might take some time to come up.
To view the UI, run:
kubectl -n openweb-ui port-forward svc/openweb-ui-service 8080:8080
Open http://localhost:8080 and create a sample account.
Note that there is an OLLAMA_BASE_URLS environment variable - a semicolon-separated list of the deployed LLM services. If you deploy another LLM, make sure to add an entry so it will be discoverable by the UI.
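In the OpenWeb-UI Deployment that looks something like the snippet below. The first URL assumes the ollama Service and port used earlier; the second is a hypothetical extra model service, included only to illustrate the separator:

env:
  - name: OLLAMA_BASE_URLS
    # second entry is hypothetical, just to show the ";" separated list
    value: "http://ollama.ollama.svc.cluster.local:8080;http://ollama-llama3.ollama.svc.cluster.local:8080"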
Running LLMs on Kubernetes is straightforward once you get the initial setup right.
There are some existing solutions for Kubernetes, but they are not tailored for LLMs, so I decided to drop them. For clarity, think of this more as a "Kubernetes the hard way" style guide, meant to teach the fundamentals without abstractions that reduce clarity. With these building blocks you can easily wrap the same setup in a Helm chart, or perhaps use the Kubernetes operator once it becomes a more mature project.
Using OpenWeb-UI gives you a seamless experience, much as if you were using ChatGPT. Personally I like OpenWeb-UI's features better, due to its flexibility and the fact that it's fully open source.
Now that you have the fundamentals of deploying LLMs and interacting with them through a UI, you can easily add Home Assistant - which is also fully open source - into the mix and talk to those local LLMs with your voice, allowing you to control your smart home devices.
Running an LLM fully locally can be cost-effective - not as expensive as you might think.
All the manifests and a quick guide are here.
If you'd like me to cover deploying and managing Home Assistant, or how to deploy this setup on Google Kubernetes Engine, let me know.
Let me know what you think, or if you have any questions, drop a comment down below.