How to launch an LLM API server with minimal coding


The landscape of open-source language models has been rapidly evolving in recent months. The release of LLaMA by Meta, followed by a slew of other models, has sparked a surge of interest among organizations in running their own large language models (LLMs).

Having your own LLM, such as LLaMA 2, offers several advantages. It provides greater control over the model’s usage, ensures privacy of data, allows customization to suit specific needs, and facilitates seamless integration with existing systems. However, the journey to running your own LLM has many challenges. It involves setting up the necessary IT infrastructure, selecting the appropriate LLM, running and potentially fine-tuning the model with your own data, scaling it to meet demand, and navigating through licensing issues.

Fortunately, there are numerous solutions available that can simplify the process of getting started with open-source LLMs. In this article, I will take you through two solutions that can set up an open-source LLM API server without writing code.

Why a web API server?

How a web application and an LLM server can interact

One of the inherent challenges with machine learning projects is the long development cycle. When you’re looking to add LLM capabilities to an existing product, it’s crucial to minimize the time it takes to develop and deploy the model. 

The open-source LLM ecosystem is especially challenging because models are written in a variety of programming languages and frameworks. This diversity makes it harder to incorporate these models into existing applications.

One solution to minimize development and integration time is to set up a web API. If you’re already using a commercial LLM API like GPT-4 or Cohere, setting up an API for your open-source LLM allows you to test and compare the models with minimal changes to your existing code. This approach provides the flexibility to easily switch between your open-source and commercial APIs.
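As a rough illustration of that flexibility, the sketch below keeps the connection details for each provider in one place, so the rest of the application does not need to know which one is active. The `Backend` type, the `LLM_BACKEND` environment variable, and the local URL and model name are assumptions made for this example and are not part of any library.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    """Connection details for one LLM API (hypothetical helper type)."""
    base_url: str
    api_key: str
    model: str


def select_backend(env=None):
    """Pick a backend based on the LLM_BACKEND variable (assumed name)."""
    env = os.environ if env is None else env
    backends = {
        # Commercial API: the key is read from the environment.
        "openai": Backend(
            base_url="https://api.openai.com/v1",
            api_key=env.get("OPENAI_API_KEY", ""),
            model="gpt-4",
        ),
        # Self-hosted server: URL and model are placeholders for your setup.
        "local": Backend(
            base_url="http://localhost:8000/v1",
            api_key="EMPTY",
            model="meta-llama/Llama-2-13b-hf",
        ),
    }
    name = env.get("LLM_BACKEND", "local")
    if name not in backends:
        raise ValueError(f"unknown backend: {name!r}")
    return backends[name]
```

With something like this in place, switching providers for a test run becomes a matter of changing one environment variable rather than editing call sites.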

Beyond integration, a web API can also address other challenges. For instance, your web application might be running serverless or on a virtual machine without specialized hardware for running LLMs.

In this case, an LLM API allows you to decouple different parts of your application and run them separately. For example, you can deploy the LLM on a separate virtual machine equipped with an A100 GPU and make it accessible to the web server through the API endpoint. This decoupling means that as your model evolves, you can modify the underlying server and scale it without needing to make any changes to the web server.

However, it’s important to note that an API deployment might not be the ultimate solution for your application. As you experiment and progress through development, you might discover a more efficient way to integrate the LLM into your application. As you move from prototype to production, remain flexible and evolve your approach as you gain more insights into the capabilities and potential of your chosen LLM.

Which language model should you use?


The diversity of the open-source LLM landscape also makes it hard to select the right model for your application. A good starting point is LLaMA. Its popularity makes it a compelling choice, as many libraries and projects support it. Furthermore, numerous models have been built on top of LLaMA and LLaMA 2, which means if you build something that works with one of them, there’s a high probability that adapting your project to those other models will be straightforward.


LLaMA 2, in particular, stands out for its impressive benchmarks among open-source models. If you’re aiming to be as close as possible to the state-of-the-art API LLMs, LLaMA 2 is likely your best bet. It also comes in different sizes, ranging from 7 billion to 70 billion parameters. This scalability allows you to adjust the model’s complexity without making major changes to your application (although you’ll need to review your hardware settings to accommodate the model’s size).

Another advantage of LLaMA 2 is its permissive license that allows commercial use. This essentially means you can use it for virtually any application. There are some caveats in the license regarding the number of users, but these restrictions apply to very few products.

However, once you’ve conducted your tests with a model of choice, you might want to explore alternate LLM families. The MPT and the Cerebras-GPT family of models are also highly performant and could offer unique advantages for your specific use case.

With the choice of LLM out of the way, the next step is to find a framework that can serve your LLM as an API endpoint with minimal or no coding. In the following sections, we’ll explore two frameworks that can help you get your LLM API server up and running quickly and efficiently.

Launching an API server with vLLM


vLLM is a powerful Python library that provides quick and easy access to a wide array of models. Developed by researchers at UC Berkeley, vLLM supports not only LLaMA and LLaMA 2, but also other state-of-the-art open-source language models such as MPT, OPT, Falcon, Dolly, and BLOOM.


One of the most compelling features of vLLM is its speed. In some experiments, it has demonstrated a throughput that is 14-24x higher than Hugging Face Transformers. This speed, combined with its ready-made API endpoint feature, makes vLLM a highly efficient tool for launching a language model web API server.

To run vLLM, you’ll need a Linux machine equipped with Python 3.8 or higher, CUDA 11.0–11.8, and a suitable GPU. Alternatively, you can use an NVIDIA PyTorch Docker image that comes with all the necessary preinstalled packages and libraries. (If you opt for the Docker image, you must uninstall PyTorch before installing vLLM.)

Installing vLLM is easy with a simple command:

pip install vllm

The installation may take a few minutes, depending on your internet connection. Once installed, launching an API endpoint for a LLaMA-style model is as easy as running the following command:

python -m vllm.entrypoints.api_server --model openlm-research/open_llama_13b

This command starts the server on http://localhost:8000. You can change the address and port with the --host and --port arguments, and you can replace the model name with that of your own model.

To check the server, you can run a curl command:

curl http://localhost:8000/generate \
    -d '{
        "prompt": "The most famous book of J.R.R. Tolkien is",
        "temperature": 0,
        "max_tokens": 50
    }'

If everything is working correctly, you should receive the model’s output from the server. 
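If you would rather check the server from Python than from the shell, the same request can be sketched with the standard library alone. The function names here are my own, and the shape of the JSON that comes back depends on your vLLM version, so treat this as a starting point rather than a reference client.

```python
import json
import urllib.request


def build_generate_request(prompt, base_url="http://localhost:8000",
                           temperature=0, max_tokens=50):
    """Build (but do not send) a POST request for the /generate endpoint."""
    payload = {
        "prompt": prompt,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the decoded JSON response."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `generate("The most famous book of J.R.R. Tolkien is")` sends the same payload as the curl command and returns the parsed response body.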

One of the key benefits of vLLM is its compatibility with the OpenAI API. If your application is already working with an OpenAI model, you can run a vLLM server that imitates the OpenAI API:

python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-13b-hf

After that, you only need to change your code to point the API calls to your server instead of the OpenAI API:

import openai

# Module-level settings used by versions of the openai package before 1.0
openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"

If you used the --host and --port arguments to change the default settings, change the api_base variable accordingly.
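To make the relationship between the server settings and the client explicit, here is a small sketch that derives the base URL from a host and port and builds a request for the OpenAI-style completions route. The helper names are hypothetical, and the `model` value should match the model the server was launched with.

```python
import json
import urllib.request


def api_base_for(host="localhost", port=8000):
    """Return the base URL to use as api_base for a given host and port."""
    return f"http://{host}:{port}/v1"


def build_completion_request(prompt, model, api_base, max_tokens=50):
    """Build a POST request for the OpenAI-style /completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{api_base}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder key, mirroring the "EMPTY" value used above.
            "Authorization": "Bearer EMPTY",
        },
        method="POST",
    )
```

For a server started with `--port 9000`, `api_base_for(port=9000)` gives the value to assign to the api_base variable.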

vLLM also supports parallel GPU inference and distribution across multiple servers. If your server has multiple GPUs, you can tell vLLM to use them with the --tensor-parallel-size argument:

python -m vllm.entrypoints.api_server \
    --model openlm-research/open_llama_13b \
    --tensor-parallel-size 4

This feature is particularly useful for applications that require high computational power.
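One practical detail is that tensor parallelism splits the attention heads across GPUs, so the number of heads has to be divisible by the parallel size. The sketch below builds the launch command and checks that condition first. The function name is my own, and the head count is something you would read from your model's configuration; the value of 40 used in the example is an assumption for a 13B LLaMA-style model.

```python
def build_launch_command(model, num_gpus, num_attention_heads):
    """Return the argument list for launching the server across num_gpus GPUs."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    # Each GPU receives an equal share of the attention heads.
    if num_attention_heads % num_gpus != 0:
        raise ValueError(
            f"{num_attention_heads} attention heads cannot be split "
            f"evenly across {num_gpus} GPUs"
        )
    return [
        "python", "-m", "vllm.entrypoints.api_server",
        "--model", model,
        "--tensor-parallel-size", str(num_gpus),
    ]
```

For example, `build_launch_command("openlm-research/open_llama_13b", 4, 40)` returns an argument list that can be passed to `subprocess.run`, while asking for 3 GPUs with 40 heads raises an error before anything is launched.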

Another notable feature of vLLM is token streaming. This means that the model can return output tokens as they are being generated, rather than waiting for the entire sequence to be complete. This feature is especially useful if your application generates long responses and you don’t want to keep the user waiting.
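On the client side, consuming a stream mostly means reading the response line by line. The sketch below assumes you are using the OpenAI-compatible server with `"stream": true` in a completions request, in which case the response arrives as server-sent events: lines of the form `data: {...}`, ending with `data: [DONE]`. The function name is hypothetical, and the exact chunk fields may differ between versions.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # sentinel that marks the end of the stream
        chunk = json.loads(data)
        text = chunk["choices"][0].get("text", "")
        if text:
            yield text
```

Because the response object returned by `urllib.request.urlopen` can be iterated line by line, it can be passed straight to `iter_stream_text`, and each fragment can be shown to the user as soon as it arrives.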

But vLLM is not without its limitations. It does not support LoRA and QLoRA adapters, which are popular techniques for fine-tuning open-source LLMs without modifying the original model weights. vLLM also does not support quantization, which is a technique used to make LLMs compact enough to fit on smaller GPUs. Despite these limitations, vLLM remains a highly convenient tool for quickly testing an LLM.
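To see why quantization matters for GPU choice, a back-of-the-envelope estimate helps: the memory needed for the weights is roughly the number of parameters times the bytes used per parameter. The sketch below works through that arithmetic for a few model sizes. It covers the weights only and ignores activations and the attention cache, so real requirements will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Parameter counts and precisions used for illustration.
SIZES = {"7B": 7e9, "13B": 13e9, "70B": 70e9}
PRECISIONS = {"fp16": 16, "int8": 8, "4-bit": 4}


def memory_table():
    """Return {model size: {precision: gigabytes}} for the values above."""
    return {
        size: {
            precision: weight_memory_gb(params, bits)
            for precision, bits in PRECISIONS.items()
        }
        for size, params in SIZES.items()
    }
```

By this estimate, a 13-billion-parameter model needs about 26 GB for its weights at 16-bit precision but about 6.5 GB at 4 bits, which is the difference between needing a data-center GPU and fitting on a consumer card.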

For more information about vLLM, you can visit its GitHub page.

Creating an OpenLLM server

OpenLLM (source: OpenLLM GitHub)

OpenLLM is another widely used platform for creating web servers for language models. It’s known for its simplicity and versatility, making it a popular choice among developers and researchers alike.

To install OpenLLM, ensure your machine meets the necessary prerequisites. Once confirmed, the installation process is straightforward. Simply run the following command:

pip install openllm

After successful installation, you can launch the server directly from the command line. Here’s an example of how to do it:

openllm start llama --model-id openlm-research/open_llama_7b_v2 \
  --max-new-tokens 200 \
  --temperature 0.95

One of the standout features of OpenLLM is its support for adapters such as LoRA. This feature allows you to combine a full LLM and several lightweight adapters, enabling you to run multiple applications on a single model.

OpenLLM also integrates seamlessly with popular libraries like LangChain. This integration simplifies the process of writing applications or porting code to new LLMs, saving developers valuable time and effort.

However, OpenLLM is not without its drawbacks. Unlike vLLM, it doesn’t support batched inference, which can become a bottleneck in high-usage applications. Additionally, OpenLLM lacks built-in support for distributed inference across multiple GPUs. Despite these limitations, OpenLLM remains a robust and user-friendly platform for deploying language model servers. It is particularly useful for prototyping LLM applications, especially if you want to experiment with LLM fine-tuning.
