What to expect of Microsoft’s “cheaper generative AI” efforts

[Image: LLM race, generated with Bing Image Creator]

This article is part of our series that explores the business of artificial intelligence

Microsoft has formed a new team to develop "cheaper generative AI" systems, according to a recent report by The Information. The move comes even as Microsoft remains deeply invested in OpenAI, which sells access to expensive large language models (LLMs).

Microsoft is already taking major steps to diversify its AI portfolio and reduce its dependence on OpenAI. CEO Satya Nadella told Bloomberg at Davos, “Our products are not about one model. We care about having the best frontier model, which happens to be GPT-4 today. But we also have Mixtral in Azure as a model as a service. We use Llama in places. We have Phi, which is the best SLM from Microsoft. So there is going to be diversity in capability and models that we will have, that we will invest in.”

This is a smart move because all signs point to the market for LLMs becoming commoditized. There is no guarantee that OpenAI will remain the dominant player in the field. And with advances in open-source and customized models, the market for LLMs is growing in different directions. Proprietary models such as GPT-4 and Claude might become a niche market as more enterprises explore open LLMs. Meanwhile, there is growing interest in small language models (SLMs) that run on phones and personal computers.

But what will Microsoft’s cheap generative AI team do? We’re already seeing some hints. 

More efficient open-source models

There are now several open LLMs that compete with GPT-3.5 at a fraction of the size and cost, and it is likely only a matter of time before open models surpass GPT-4. With the growing ecosystem of tools for fine-tuning and compressing them, open LLMs can become far more practical than proprietary models in enterprise settings.
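For instance, parameter-efficient fine-tuning with LoRA lets an enterprise adapt an open model while training only a tiny fraction of its weights. Here is a minimal sketch using Hugging Face's peft library (the model name and hyperparameters are illustrative, not a recommendation):

```python
# Minimal sketch: wrapping an open LLM with LoRA adapters for fine-tuning.
# Only the small adapter matrices are trained; the base weights stay frozen.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

config = LoraConfig(
    r=16,                                 # rank of the adapter matrices (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # prints the tiny fraction of weights that train
```

The result is an adapter that weighs in at megabytes rather than gigabytes, which is exactly what makes the deployment tricks discussed below economical.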

Microsoft is already investing in the open LLM ecosystem. In addition to creating its own models, it supports models from Meta and Hugging Face on its Azure cloud platform. Companies can build copilot products on top of Llama, Mistral, and other advanced open LLMs.

Given Microsoft's financial and computational resources, its new team will probably add to the open LLM catalog. It might also build new enterprise tools such as S-LoRA, a serving system that runs many fine-tuned LoRA adapters on top of a single base LLM, cutting deployment costs by orders of magnitude.
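S-LoRA itself is a dedicated serving system, but its core idea, a single base model shared across many fine-tuned adapters, can be sketched with the peft library. The adapter repository names below are hypothetical:

```python
# Sketch: serving several LoRA fine-tunes from one shared base model.
# Adapter repo names ("acme/...") are hypothetical placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Load multiple task-specific adapters on top of the same base weights.
model = PeftModel.from_pretrained(base, "acme/llama2-support-lora", adapter_name="support")
model.load_adapter("acme/llama2-legal-lora", adapter_name="legal")

def generate(prompt: str, adapter: str) -> str:
    # Switch the active adapter per request; the base weights load only once.
    model.set_adapter(adapter)
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(generate("Summarize this support ticket: ...", adapter="support"))
```

S-LoRA goes further by batching requests for different adapters in a single forward pass, but the saving comes from the same place: the multi-gigabyte base model sits in memory once, while each adapter adds only a few megabytes.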

Small language models

Microsoft is also looking into small language models (SLMs) that can run on low-memory edge devices. Phi-2, released in December, has 2.7 billion parameters, small enough to fit on many edge devices. Its performance is impressive, and it could become an important part of Microsoft's Copilot ecosystem, running behind the scenes for tasks that require on-device inference.
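To make the memory argument concrete: at 16-bit precision, 2.7 billion parameters take roughly 5.4 GB, and quantization shrinks that further. A minimal sketch, assuming the Hugging Face transformers and bitsandbytes libraries (settings and figures are illustrative):

```python
# Minimal sketch: loading Phi-2 in 4-bit precision to cut its memory
# footprint from ~5.4 GB (fp16) to roughly 1.5-2 GB. Assumes transformers
# and bitsandbytes are installed; figures are approximate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tokenizer("Write a one-line summary of LoRA:", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```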

In SLMs, Microsoft is competing with companies such as Stability AI, whose recently released Stable LM 2 1.6B is smaller than Phi-2 yet performs strongly on key LLM benchmarks. Alibaba has released Qwen-1.8B. And other research labs are close behind, finding ways to squeeze more out of small language models.

While cheap to run, SLMs are still expensive to build. Phi-2 was trained on 96 Nvidia A100 GPUs with 80 gigabytes of memory each for 14 days, a budget beyond most organizations' reach. This is why, for the moment, SLMs will remain the domain of wealthy tech companies that can run expensive experiments, especially since there is no direct path to profitability for such models yet.
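For a back-of-the-envelope sense of that budget (the hourly rate below is an assumed illustrative figure, not a quoted price):

```python
# Rough cost of the Phi-2 training run described above.
# The $2 per A100-hour rate is an assumed, illustrative cloud price.
gpus, days, usd_per_gpu_hour = 96, 14, 2.0
gpu_hours = gpus * days * 24           # 32,256 GPU-hours
cost = gpu_hours * usd_per_gpu_hour
print(f"{gpu_hours:,} GPU-hours = ${cost:,.0f}")   # 32,256 GPU-hours = $64,512
```

And that is a single successful run; the failed experiments that precede a model like Phi-2 typically multiply the bill several times over.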

More efficient inference

There are limits to how much you can shrink a language model without rendering it useless. The smallest language models still require gigabytes of memory and can run slowly on consumer devices. This is why another important direction of research is finding ways to run generative models more efficiently.

Microsoft is also making efforts in this area. A recent paper by researchers at Microsoft and ETH Zurich introduces a method that reduces the size of models after training. The technique, called SliceGPT, exploits the sparsity of LLM representations to replace the model's weight matrices with smaller, dense ones.

This allows them to remove up to 25% of the parameters of models such as Llama 2 70B, OPT 66B, and Phi-2 without causing a significant drop in performance.
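The paper's method rests on an invariance argument specific to transformers, but the slicing step itself can be illustrated with a toy example: project a layer's inputs onto the principal directions of its activations and drop the weakest ones, yielding a smaller dense weight matrix. This is a simplified sketch, not the authors' implementation:

```python
# Toy sketch of the slicing idea behind SliceGPT: keep only the strongest
# principal directions of a layer's activations, replacing a dense weight
# matrix with a smaller dense one. For illustration only.
import numpy as np

rng = np.random.default_rng(0)
d, d_out, n = 512, 512, 4096

# Synthetic "activations" concentrated in few directions, as in real LLM layers.
X = rng.normal(size=(n, 96)) @ rng.normal(size=(96, d))
X += 0.01 * rng.normal(size=(n, d))
W = rng.normal(size=(d, d_out))        # a dense weight matrix: y = x @ W

cov = X.T @ X / n                      # activation covariance
eigvals, Q = np.linalg.eigh(cov)       # columns of Q are principal directions
keep = Q[:, np.argsort(eigvals)[::-1][: int(d * 0.75)]]  # slice off 25%

W_small = keep.T @ W                   # (384, d_out): 25% fewer rows, still dense
err = np.linalg.norm(X @ W - (X @ keep) @ W_small) / np.linalg.norm(X @ W)
print(f"relative output error after slicing 25%: {err:.4f}")
```

Roughly speaking, SliceGPT folds the projection into the adjacent layers using the transformer's computational invariance, so no extra matrices are carried around at inference time.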

These kinds of efforts can go a long way toward reducing the costs of running LLMs. ETH Zurich, in particular, has been leading impressive work in this field. In a previous paper, its researchers introduced a modified transformer architecture that removes up to 16% of an LLM's parameters. Another paper from the university presents a technique that can speed up LLM inference by up to 300%. I expect closer collaboration between Microsoft's GenAI team and ETH Zurich researchers in the future.

Apple vs Microsoft

The other tech giant Microsoft will be up against in the battle for efficiency is Apple. While Apple has not been making much noise, it has been publishing interesting research, including Ferret, a multimodal LLM with 7- and 13-billion-parameter versions that it quietly released in October. But the battle for dominance in cheap generative AI will go beyond releasing new model architectures.
