This article is part of our coverage of the latest in AI research.
Large language models (LLMs) such as ChatGPT and GPT-4 are very convenient. You can get them to do impressive things with a few API calls. Each API call costs only a little, so you can put together proofs of concept and working examples in a short time.
However, when using LLMs for real applications that send thousands of API calls per day, the costs can quickly pile up. You can end up paying thousands of dollars per month to accomplish tasks that would otherwise require a fraction of the money.
A recent study by researchers at Stanford University shows that you can considerably reduce the costs of using GPT-4, ChatGPT, and other LLM APIs. In a paper titled “FrugalGPT,” they introduce several techniques to cut the costs of LLM APIs by up to 98 percent while preserving or even improving their performance.
Which language model API should you use?
GPT-4 is arguably the most capable large language model. But it is also the most expensive. And the costs only grow as your prompt becomes longer. In many cases, you can find another language model, API provider, or even prompt that can reduce the costs of inference. For example, OpenAI provides a wide range of models, whose costs range from $0.0004 to $0.12 per 1,000 tokens, a 300x difference. Moreover, you can try other providers such as AI21 Labs, Cohere, and Textsynth for other pricing options.
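To make the price gap concrete, here is a back-of-the-envelope sketch using the two per-1,000-token prices quoted above. The workload (5,000 calls a day at roughly 1,000 tokens each) and the function name `monthly_cost` are assumptions chosen for illustration, not figures from the paper.

```python
def monthly_cost(calls_per_day: int, tokens_per_call: int,
                 price_per_1k_tokens: float, days: int = 30) -> float:
    """Estimated monthly spend in dollars for a fixed daily call volume."""
    tokens_per_day = calls_per_day * tokens_per_call
    return tokens_per_day / 1000 * price_per_1k_tokens * days


if __name__ == "__main__":
    # Hypothetical workload: 5,000 calls a day, about 1,000 tokens per call.
    for label, price in [("most expensive", 0.12), ("cheapest", 0.0004)]:
        print(f"{label}: ${monthly_cost(5000, 1000, price):,.2f} per month")
```

Under these assumptions, the same traffic costs about $18,000 a month at the top price and about $60 at the bottom, which is the 300x gap in dollar terms.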
Fortunately, most API services have similar interfaces. With a bit of effort, you can create a layer of abstraction that can be applied to different APIs seamlessly. In fact, Python libraries such as LangChain have already done most of the work for you. However, without a systematic approach to select the most efficient LLM for each task, you’ll have to choose between quality and costs.
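A thin abstraction layer of this kind might look like the sketch below. It does not use LangChain or any real provider SDK: the `LLMBackend` class, the `ask` function, and the two stub models are hypothetical stand-ins, and the prices are only the illustrative range mentioned earlier.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class LLMBackend:
    name: str
    price_per_1k_tokens: float          # dollars, illustrative only
    complete: Callable[[str], str]      # prompt in, completion text out


# Stand-ins for real provider clients; a real version would call each SDK here.
def _stub_small(prompt: str) -> str:
    return f"[small-model] answer to: {prompt}"


def _stub_large(prompt: str) -> str:
    return f"[large-model] answer to: {prompt}"


BACKENDS: Dict[str, LLMBackend] = {
    "small": LLMBackend("small", 0.0004, _stub_small),
    "large": LLMBackend("large", 0.12, _stub_large),
}


def ask(backend_name: str, prompt: str) -> str:
    """Route a prompt to the chosen backend through one shared interface."""
    return BACKENDS[backend_name].complete(prompt)


print(ask("small", "Summarize this paragraph."))
```

Because every backend exposes the same `complete` callable, the rest of the application can switch models by changing a single key.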
In their paper, the researchers from Stanford University propose an approach that keeps LLM API costs within a budget constraint. To achieve this, they propose three strategies: prompt adaptation, LLM approximation, and LLM cascade. While these techniques have not been applied in production settings, initial tests show promising results.
All LLM APIs have a pricing model that is a function of the prompt length. Therefore, the simplest way to reduce the costs of API usage is to shorten your prompts. There are several ways to do so.
For many tasks, LLMs require few-shot prompting. This means that to improve the model’s performance, you must prepend a few examples to your prompt, usually in the prompt->answer format. Frameworks like LangChain provide tools that enable you to create templates that include few-shot examples.
With LLMs supporting longer and longer contexts, developers are sometimes tempted to create very large few-shot templates to improve the model’s accuracy. However, the model might not need so many examples.
The researchers propose “prompt selection,” where you reduce the number of few-shot examples to the minimum that preserves the output quality. Even shaving 100 tokens off the template can result in huge savings when the template is used many times.
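One simple way to approximate prompt selection is to add examples one at a time and stop as soon as quality reaches a target. The sketch below is a simplification, not the paper's procedure: it only tries prefixes of the example list, and `fake_evaluate` returns made-up accuracy numbers where a real implementation would run the model on a held-out set.

```python
from typing import Callable, List, Sequence


def select_examples(examples: Sequence[str],
                    evaluate: Callable[[Sequence[str]], float],
                    min_quality: float) -> List[str]:
    """Return the shortest prefix of `examples` whose quality meets the target."""
    for k in range(len(examples) + 1):
        subset = list(examples[:k])
        if evaluate(subset) >= min_quality:
            return subset
    return list(examples)  # nothing met the target: keep every example


# Made-up validation accuracy for 0..5 examples; a real `evaluate` would
# run the model on a held-out set with the given examples in the prompt.
_ACCURACY_BY_COUNT = [0.62, 0.78, 0.88, 0.91, 0.92, 0.92]


def fake_evaluate(subset: Sequence[str]) -> float:
    return _ACCURACY_BY_COUNT[len(subset)]


EXAMPLES = [f"Q{i} -> A{i}" for i in range(1, 6)]
chosen = select_examples(EXAMPLES, fake_evaluate, min_quality=0.90)
print(len(chosen), "of", len(EXAMPLES), "examples kept")
```

With these invented numbers, three examples are enough to reach the 0.90 target, so the last two can be dropped from every call.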
Another technique they propose is “query concatenation,” where you bundle several prompts into one and have the model generate multiple outputs in one call. Again, this is especially effective when using few-shot prompting. If you send your questions one at a time, you’ll have to include the few-shot examples with every prompt. But if you concatenate your prompts, you’ll only need to send the context once and get several answers in the output.
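Query concatenation can be sketched as two helpers: one that builds a single prompt from the shared few-shot block plus several numbered questions, and one that splits the numbered completion back into individual answers. The numbering convention and the function names are assumptions; a real model may not always follow the requested output format, which is why unanswered slots come back as `None`.

```python
import re
from typing import List, Optional


def build_batched_prompt(few_shot: str, questions: List[str]) -> str:
    """Send the few-shot block once, followed by all questions, numbered."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))
    return (
        f"{few_shot}\n\n"
        "Answer each question below on its own line, prefixed with its number.\n"
        f"{numbered}\n"
    )


def split_answers(completion: str, expected: int) -> List[Optional[str]]:
    """Map lines like '2. Berlin' or '2) Berlin' back to their question slot."""
    answers: List[Optional[str]] = [None] * expected
    for line in completion.splitlines():
        match = re.match(r"\s*(\d+)[.)]\s*(.+)", line)
        if match:
            index = int(match.group(1)) - 1
            if 0 <= index < expected:
                answers[index] = match.group(2).strip()
    return answers


FEW_SHOT = "Q: What is the capital of Italy?\nA: Rome"
QUESTIONS = ["What is the capital of France?", "What is the capital of Germany?"]
print(build_batched_prompt(FEW_SHOT, QUESTIONS))
print(split_answers("1. Paris\n2. Berlin", expected=2))
```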
One tip I would add is optimizing context documents. For some applications, the vanilla LLM will not have the knowledge to provide the right answers to user queries. One popular method to address this gap is retrieval augmentation. Here, you have a set of documents (PDF files, documentation pages, etc.) that contain the knowledge for your application. When the user sends a prompt, you find the most relevant document and prepend it to the prompt as context before sending it to the LLM. This way, you condition the model to answer the user based on the knowledge in the document.
This is a very effective method to address the hallucination problem of ChatGPT and customize it for your own applications. But it can also increase the size of the prompts. You can reduce the costs of retrieval augmentation by experimenting with smaller chunks of context.
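As a rough illustration of the smaller-chunks idea, the sketch below splits a document into fixed-size word chunks, ranks them by word overlap with the query, and keeps only what fits in a word budget. Word counts stand in for tokens and word overlap stands in for a real retriever (which would more likely use embeddings); all names and the sample document are hypothetical.

```python
from typing import List


def chunk_words(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


def overlap_score(query: str, chunk: str) -> int:
    """Count distinct lowercase words shared by the query and the chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def build_context(query: str, document: str,
                  chunk_size: int, max_words: int) -> str:
    """Keep the best-matching chunks until the word budget is used up."""
    chunks = chunk_words(document, chunk_size)
    ranked = sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)
    picked, used = [], 0
    for chunk in ranked:
        size = len(chunk.split())
        if used + size > max_words:
            break
        picked.append(chunk)
        used += size
    return "\n".join(picked)


DOC = ("Refunds are issued within 14 days of purchase. "
       "Shipping takes three to five business days. "
       "Support is available by email on weekdays.")
context = build_context("when are refunds issued", DOC, chunk_size=8, max_words=8)
print(context)
```

Here only the refund sentence is prepended to the prompt instead of the whole document, so the context shrinks from 22 words to 8.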
Another solution to lower costs is to reduce the number of API calls made to the LLM. The researchers propose to approximate costly LLMs “using more affordable models or infrastructure.”
One method for approximating LLMs is “completion cache,” in which you store the prompts and responses of the LLM in an intermediate server. If a user submits a prompt that is identical or similar to a previously cached prompt, you retrieve the cached response instead of querying the model again. While implementing completion cache is easy, it has some severe tradeoffs. First, it reduces the creativity and variations of the LLM’s response. Second, its applicability will depend on how similar users’ queries are. Third, the cache might become very large if the stored prompts and responses are very diverse. Finally, if the LLM’s output depends on user context, then caching responses will not be very efficient.
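A minimal in-memory completion cache could look like the sketch below. It matches prompts after normalizing case and whitespace, then falls back to a character-level similarity check with the standard library's `difflib.SequenceMatcher`. The 0.9 threshold, the class name, and the `fake_llm` stub are assumptions; a production cache would more likely use embeddings for similarity and an external store with eviction.

```python
import difflib
from typing import Callable, Dict, Optional


class CompletionCache:
    def __init__(self, llm: Callable[[str], str], min_similarity: float = 0.9):
        self._llm = llm
        self._min_similarity = min_similarity
        self._store: Dict[str, str] = {}
        self.api_calls = 0  # how many prompts actually reached the model

    @staticmethod
    def _normalize(prompt: str) -> str:
        return " ".join(prompt.lower().split())

    def _lookup(self, key: str) -> Optional[str]:
        if key in self._store:                      # identical prompt
            return self._store[key]
        for cached_key, response in self._store.items():   # similar prompt
            ratio = difflib.SequenceMatcher(None, key, cached_key).ratio()
            if ratio >= self._min_similarity:
                return response
        return None

    def complete(self, prompt: str) -> str:
        key = self._normalize(prompt)
        cached = self._lookup(key)
        if cached is not None:
            return cached
        self.api_calls += 1
        response = self._llm(prompt)
        self._store[key] = response
        return response


def fake_llm(prompt: str) -> str:  # stand-in for a paid API call
    return f"answer to: {prompt}"


cache = CompletionCache(fake_llm)
first = cache.complete("What is the capital of France?")
second = cache.complete("what is the  capital of France")   # near-duplicate
third = cache.complete("Summarize the plot of Hamlet.")
print(cache.api_calls)
```

The linear scan over cached keys also illustrates the third tradeoff above: lookups get slower and memory grows as the stored prompts become more diverse.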
The Stanford researchers propose “model fine-tuning” as another approximation method. In this case, you gather a collection of prompt-response pairs from a powerful and expensive LLM such as ChatGPT or GPT-4. You then use these responses to fine-tune a smaller and more affordable model, possibly an open-source LLM that is run on your own servers. Alternatively, you can fine-tune a more affordable online model (e.g., GPT-3 Ada or Babbage) with the collected data.
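The data-gathering step can be sketched as follows. The `teacher` callable stands in for the expensive model, and the JSON Lines layout with `prompt` and `completion` fields is one common format for fine-tuning data; treat it as an assumption and check what your fine-tuning tool or provider expects.

```python
import json
from typing import Callable, Iterable, List, Tuple


def collect_pairs(prompts: Iterable[str],
                  teacher: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Query the expensive model once per prompt and keep its responses."""
    return [(prompt, teacher(prompt)) for prompt in prompts]


def to_jsonl(pairs: Iterable[Tuple[str, str]]) -> str:
    """Serialize prompt/response pairs as one JSON object per line."""
    return "\n".join(
        json.dumps({"prompt": prompt, "completion": completion})
        for prompt, completion in pairs
    )


def fake_teacher(prompt: str) -> str:  # stand-in for the large model's API
    return prompt.upper()


pairs = collect_pairs(["classify: great movie", "classify: dull plot"], fake_teacher)
training_file = to_jsonl(pairs)
print(training_file)
```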
This approach, sometimes referred to as “model imitation,” is a viable method to approximate the capabilities of the larger model, but it also has limits. Notably, small LLMs trained through model imitation have been observed to mimic the style of the larger model without acquiring its knowledge. As a result, the smaller model’s accuracy drops.
A more complex solution is to create a system that selects the best API for each prompt. Instead of sending everything to GPT-4, the system can be optimized to choose the cheapest LLM that can respond to the user’s prompt. This can result in both cost reduction and performance improvement.
The researchers propose a method called “LLM cascade” that works as follows: The application keeps track of a list of LLM APIs that range from simple/cheap to complex/expensive. When the app receives a new prompt, it starts by sending it to the simplest model. If the response is reliable, it stops and returns it to the user. If not, it continues down the cascade and queries the next language model. If you get reliable responses early in the pipeline, you’ll reduce the costs of your application considerably.
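The cascade loop itself is short. The sketch below assumes each stage carries its own acceptance threshold and that a `score` function rates an answer's reliability between 0 and 1. The stub models, the toy scorer, and the relative cost units are all invented for illustration and are not FrugalGPT's actual components.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    cost: float                         # relative cost of one call
    generate: Callable[[str], str]      # prompt in, answer out
    threshold: float                    # minimum score needed to stop here


def run_cascade(prompt: str, stages: Sequence[Stage],
                score: Callable[[str, str], float]) -> Tuple[str, str, float]:
    """Try models from cheapest to priciest; stop at the first reliable answer."""
    total_cost = 0.0
    for position, stage in enumerate(stages):
        answer = stage.generate(prompt)
        total_cost += stage.cost
        is_last = position == len(stages) - 1
        if is_last or score(prompt, answer) >= stage.threshold:
            return answer, stage.name, total_cost
    raise ValueError("the cascade needs at least one stage")


# Stub models and scorer, for illustration only.
def small_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "I am not sure."


def large_model(prompt: str) -> str:
    return f"Detailed answer to: {prompt}"


def toy_score(prompt: str, answer: str) -> float:
    return 0.0 if "not sure" in answer.lower() else 1.0


STAGES = [Stage("small", 1.0, small_model, 0.5),
          Stage("large", 30.0, large_model, 0.5)]

print(run_cascade("What is 2 + 2?", STAGES, toy_score))
print(run_cascade("Explain quantum tunneling.", STAGES, toy_score))
```

Note that the second prompt pays for both calls (31 cost units in this toy setup), which is the overhead that arises when the cheap model fails often.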
There are a few catches, however. First, if your application is too complicated for the smaller models, you’ll add unnecessary overhead that will increase costs and reduce performance.
The other challenge is creating a system that can determine the quality and reliability of the output of an LLM. The researchers suggest training a regression model that determines, from the query and the generated answer, whether a generation is correct. This adds complexity and requires an upfront effort from the development team to test each of the LLM APIs on a range of prompts that represent the kind of queries their application receives. It also remains to be seen how practical this is in real production environments.
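To show the shape of such a scorer, here is a toy logistic regression trained in pure Python on a handful of labeled (query, answer, correct) triples. It is a stand-in, not the paper's scorer: the two hand-made features (a hedging flag and query/answer word overlap), the hedge phrases, and the four labeled examples are all assumptions, and a real scorer would need far more data and a stronger model.

```python
import math
import re
from typing import List, Tuple

HEDGES = ("not sure", "don't know")  # hypothetical signs of an unreliable answer


def features(query: str, answer: str) -> List[float]:
    """Two hand-made features: hedging flag and query/answer word overlap."""
    q_words = set(re.findall(r"\w+", query.lower()))
    a_words = set(re.findall(r"\w+", answer.lower()))
    hedged = 1.0 if any(h in answer.lower() for h in HEDGES) else 0.0
    overlap = len(q_words & a_words) / len(q_words) if q_words else 0.0
    return [hedged, overlap]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(data: List[Tuple[str, str, int]], epochs: int = 2000,
          lr: float = 0.5) -> List[float]:
    """Fit logistic regression by batch gradient descent: [bias, w1, w2]."""
    rows = [([1.0] + features(q, a), y) for q, a, y in data]
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for x, y in rows:
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad[j] += err * xi / len(rows)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


def reliability(weights: List[float], query: str, answer: str) -> float:
    """Estimated probability that `answer` is a correct reply to `query`."""
    x = [1.0] + features(query, answer)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


# Tiny hand-labeled set (1 = correct, 0 = not), invented for this sketch.
LABELED = [
    ("What is the capital of France?", "The capital of France is Paris.", 1),
    ("Who wrote Hamlet?", "Hamlet was written by William Shakespeare.", 1),
    ("What is the boiling point of water?", "I am not sure.", 0),
    ("Who painted the Mona Lisa?", "I don't know who painted it.", 0),
]
WEIGHTS = train(LABELED)
print(reliability(WEIGHTS, "What is the capital of Spain?", "I am not sure."))
```

The returned probability can then be compared against a threshold to decide whether to accept a cheap model's answer or escalate to a more expensive one.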
Another solution is to combine different strategies to create a more efficient (albeit more complex) LLM cascade. For example, the researchers propose “joint prompt and LLM selection” to select the smallest prompt and most affordable LLM that can achieve satisfactory task performance.
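In its simplest form, joint prompt and LLM selection is a search over (model, prompt size) combinations for the cheapest one that clears an accuracy bar. The sketch below assumes you have already measured accuracy and cost for each combination on your own validation set; every number in the table is invented.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    model: str
    num_examples: int       # few-shot examples included in the prompt
    accuracy: float         # measured on a validation set (made up here)
    cost_per_call: float    # dollars per call (made up here)


def cheapest_satisfactory(candidates: Sequence[Candidate],
                          min_accuracy: float) -> Optional[Candidate]:
    """Pick the lowest-cost combination that meets the accuracy target."""
    acceptable = [c for c in candidates if c.accuracy >= min_accuracy]
    if not acceptable:
        return None
    return min(acceptable, key=lambda c: c.cost_per_call)


CANDIDATES = [
    Candidate("large", 8, 0.93, 0.0300),
    Candidate("large", 2, 0.91, 0.0120),
    Candidate("small", 8, 0.90, 0.0009),
    Candidate("small", 2, 0.84, 0.0004),
]

print(cheapest_satisfactory(CANDIDATES, min_accuracy=0.90))
```

With these numbers, the small model with a longer prompt meets the 0.90 target at a fraction of the large model's cost, while the small model with a short prompt is cheaper still but falls below the bar.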
The researchers implemented the LLM cascade strategy with FrugalGPT, a system that uses 12 different APIs from OpenAI, Cohere, AI21 Labs, Textsynth, and ForeFrontAI.
They tested FrugalGPT with several natural language benchmarks. Their initial results show that they were able to reduce the costs by orders of magnitude while sometimes improving the performance.
The researchers write, “FrugalGPT enables smooth performance-cost trade-offs across all evaluated datasets. This offers flexible choices to LLM users and potentially helps LLM API providers save energy and reduce carbon emissions. In fact, FrugalGPT can simultaneously reduce costs and improve accuracy.”
It is worth noting that benchmark tests are not necessarily accurate indicators of how a model will perform in real-world applications. The researchers also note that the approach has some limitations, including the need for labeled data and compute resources to train FrugalGPT’s response evaluator. “We view this as an [sic] one-time upfront cost; this is beneficial when the final query dataset is larger than the data used to train the cascade,” the researchers write.
Still, FrugalGPT provides interesting directions to explore in LLM applications. While this work focuses on costs, similar approaches can be used for other concerns, such as risk criticality, latency, and privacy. “The continuous evolution of LLMs and their applications will inevitably unveil new challenges and opportunities, fostering further research and development in this dynamic field,” the researchers write.