llamafile: An Executable LLM
LLM Deployment with llamafile: Discover how Mozilla’s llamafile simplifies LLM deployment into a single executable. Learn to optimize Docker images, explore alternatives like llama.cpp and ollama, and leverage quantized models for efficient resource use.
LLMs have always been powerful tools, but they can be a real headache to set up, use, and especially deploy as a service. That's where llamafile comes in. Llamafile, an open-source project by Mozilla, collapses the complexity of an entire LLM chatbot stack into a single executable file built from C/C++ code. If you don't believe me, here are the instructions to run a fully functional Llama-3.1 8B model on your local machine:
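(The download URL and quantization suffix below are illustrative; pick whichever build suits you from Mozilla's llamafile collection on Hugging Face.)

```bash
# Download a Llama 3.1 8B Instruct llamafile
# (exact filename/quantization may differ depending on the build you pick)
wget https://huggingface.co/Mozilla/Meta-Llama-3.1-8B-Instruct-llamafile/resolve/main/Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile

# Make it executable
chmod +x Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile

# Run it: this starts a local chat UI and an OpenAI-compatible API server
./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile
```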
It's really that easy. LLMs can be painful to set up and use, let alone deploy as a service, but llamafile packs state-of-the-art LLM capabilities into a lightweight executable you can slot into just about any service, and it can be customized to suit almost any use case you can imagine.
Use Case
In my current project, I’m focused on creating curated datasets from the Common Crawl Web Archives. To handle the massive amount of data, I’m using an Apache Spark cluster setup with JupyterLab for tasks like filtering and script extraction. This entire setup is containerized using Docker, with different services managed through Docker Compose.
Having an LLM service is a crucial part of this setup. It allows me to enrich my data and perform a wide range of NLP-specific tasks such as Text Classification and Named Entity Recognition. However, running the Meta-Llama-3.1 8B Instruct llamafile on my system and integrating it with this Docker setup turned out to be a bit more challenging than I expected.
In a Docker Compose setup, containers communicate with each other through an internal network managed by Docker. Services running outside this network, like llamafile running directly on my machine, can only be accessed via methods like port forwarding. This can get tedious and introduces additional complexity to the setup.
To solve this, I decided to create a Docker image for the llamafile service. This approach ensures that all services within my Docker Compose setup can communicate seamlessly without the need for cumbersome network configurations. By containerizing the llamafile service, I’ve managed to keep the entire cluster’s services tightly integrated, making the whole system more efficient and easier to manage.
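To make this concrete, here is a minimal sketch of what the relevant part of the Compose file can look like. The service names, image tag, and network name are placeholders from my setup, and 8080 is the llamafile server's default port:

```yaml
services:
  llamafile:
    image: llamafile-llama3.1-8b:latest   # built from the Dockerfile shown below
    ports:
      - "8080:8080"                       # optional: also expose the API to the host
    networks:
      - spark-net

  jupyterlab:
    image: jupyter/pyspark-notebook:latest
    depends_on:
      - llamafile
    networks:
      - spark-net

networks:
  spark-net:
```

With both services on the same network, a notebook or Spark job can call the model directly by service name through the server's OpenAI-compatible endpoint, for example:

```bash
# From inside any container on the spark-net network (the prompt is a placeholder)
curl http://llamafile:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Classify this text: ..."}]}'
```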
Initial Docker Build
Here's the code for my initial llamafile Docker build:
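It went roughly like this; the download URL, package list, and server flags are illustrative stand-ins for what I had at the time:

```dockerfile
FROM debian:bookworm-slim

# Install download tooling plus a few convenience packages (vim, ssh)
RUN apt-get update && \
    apt-get install -y wget ca-certificates vim ssh

# Download the ~6 GB llamafile in its own layer
RUN wget -O /tmp/llamafile.tmp \
    https://huggingface.co/Mozilla/Meta-Llama-3.1-8B-Instruct-llamafile/resolve/main/Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile

# Rename it in a separate layer...
RUN mv /tmp/llamafile.tmp /usr/local/bin/llamafile

# ...and make it executable in yet another layer
RUN chmod +x /usr/local/bin/llamafile

EXPOSE 8080
# Shell-form CMD so the llamafile (an APE binary) is launched through /bin/sh
CMD /usr/local/bin/llamafile --server --host 0.0.0.0 --port 8080 --nobrowser
```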
The llamafile I used was around 6 GB, and since I based my Docker image on a slim Debian base, I expected the final image to be about 7 GB. Shockingly, the final Docker image ballooned to a massive 20 GB!
On further snooping, I narrowed the cause of this excessive bloat down to these factors:
- Layered Filesystem: In Docker, each command in the Dockerfile (like `RUN`, `COPY`, `ADD`) creates a new layer in the image. If you run multiple commands that download, install, or modify files, each change adds to the final Docker image size.
- Package Installation: Installing packages like `wget`, `ca-certificates`, `vim`, and `ssh` adds a lot of files that are not required for the final application. Even if you remove these packages afterward, the layers containing those installations still exist in the image's history unless handled with care.
- wget Usage: When you use `wget` to download a large file, the file itself contributes to the image size. If you download the file in one layer and then move or modify it in another, the original copy is still retained in the earlier layer, adding unnecessary size to the final image.
- Redundant Layers: Operations like moving or copying files in separate `RUN` commands each create an additional layer of their own.
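Given the factors above, the download, move, and chmod steps each end up as a layer carrying roughly the full weight of the llamafile, which `docker history` makes easy to spot (the image tag below is a placeholder):

```bash
# Show per-layer sizes of the bloated image
docker history llamafile-initial:latest
```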
Optimized Docker Build
Here's the code for my optimized llamafile Docker build:
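In outline it looks like this; as before, the URL and server flags are illustrative:

```dockerfile
# Stage 1 (builder): download and prepare the llamafile, all in one RUN command
FROM debian:bookworm-slim AS builder

RUN apt-get update && \
    apt-get install -y --no-install-recommends wget ca-certificates && \
    wget https://huggingface.co/Mozilla/Meta-Llama-3.1-8B-Instruct-llamafile/resolve/main/Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile && \
    mv Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile /llamafile && \
    chmod +x /llamafile && \
    apt-get purge -y wget && \
    apt-get autoremove -y && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/* /usr/share/doc/* /usr/share/man/*

# Stage 2 (runtime): only the prepared llamafile is copied over
FROM debian:bookworm-slim

COPY --from=builder /llamafile /usr/local/bin/llamafile

EXPOSE 8080
# Shell-form CMD so the llamafile (an APE binary) is launched through /bin/sh
CMD /usr/local/bin/llamafile --server --host 0.0.0.0 --port 8080 --nobrowser
```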
After optimizing the Dockerfile and addressing the factors mentioned earlier, I managed to reduce the final Docker image size to 6.91 GB, a reduction of roughly 65%! Here's how I achieved this:
- Multi-stage Builds: The Dockerfile uses a multi-stage build strategy to separate the build environment from the runtime environment. In the `builder` stage, packages such as `wget` and `ca-certificates` are used to download the llamafile, which is then renamed and made executable. Because this happens in a dedicated build stage and only the necessary files are copied into the final image, the build-time packages and other dependencies never reach the final image. The second stage can focus solely on running the prepared llamafile, which keeps the overall image lean.
- Clean-Up & Removal of Build Dependencies: In the `builder` stage, tools such as `wget` are installed only to download and prepare the llamafile. Once the download is complete, they are removed with `apt-get purge` and `apt-get autoremove`. The Dockerfile also runs `apt-get clean` to remove cached package files, and `rm -rf /var/lib/apt/lists/* /usr/share/doc/* /usr/share/man/*` to delete APT lists and documentation files.
- Efficient Layer Management: Combining commands into fewer layers reduces the overall size of the image, since Docker no longer has to keep multiple layers with overlapping content. The optimized Dockerfile downloads the llamafile, renames it, and sets its execute permissions all within a single `RUN` command, so Docker processes this as one layer instead of carrying the overhead of a separate layer for each step.
Conclusion
Llamafile simplifies LLM deployment by turning a complex stack into a single executable file. Its ease of integration and reduced complexity make it a great fit for a wide range of tasks, from data enrichment to advanced NLP.
But llamafile isn't alone. Alternatives like llama.cpp and ollama each bring their own strengths: llama.cpp offers efficient, low-level inference with fine-grained control, while ollama focuses on easy model management and execution. On top of that, quantized models are making it easier to run LLMs on a wider range of hardware by cutting down memory and compute requirements.
In short, llamafile offers a streamlined approach to deploying LLMs, but exploring other solutions and quantized models can provide even more flexibility and efficiency.