What OpenELM language models say about Apple’s generative AI strategy


This article is part of our coverage of the latest in AI research.

Apple is best known for its walled-garden approach to its software and hardware. However, the company has recently been sharing information and code about its machine learning models.

Its latest release, OpenELM, is a family of small language models (SLMs) designed to run on memory-constrained devices. Apple has yet to reveal its generative AI strategy, but everything it has shown so far suggests it aims to dominate the yet-to-flourish on-device AI market. And the potential could be big enough for Apple to shed its usual culture of secrecy.

While Apple is not the only company working on SLMs, it has several factors that can work to its advantage.

What is OpenELM?

OpenELM is a family of language models pre-trained and fine-tuned on publicly available datasets. OpenELM comes in four sizes, ranging from 270 million to 3 billion parameters, small enough to easily run on laptops and phones. Apple's experiments on various benchmarks show that OpenELM models outperform other SLMs of similar size by a fair margin.

The main feature of OpenELM is its resource efficiency. The question driving its design is: given a limited amount of resources (e.g., memory and compute), how do you get the best-performing model?

OpenELM uses a series of tried-and-tested techniques to improve the performance and efficiency of the models. These include removing learnable bias parameters from the fully connected layers of the transformer block; better normalization (RMSNorm) and positional encoding (rotary positional embeddings) to improve the attention mechanism; grouped-query attention (GQA) to make the attention mechanism more compute-efficient; and Flash Attention to make the model more memory-efficient.
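To make one of these techniques concrete, here is a minimal PyTorch sketch of grouped-query attention. It is illustrative only: the function name, projection shapes, and head counts are hypothetical and not taken from Apple's implementation.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_q_heads, n_kv_heads):
    # Illustrative sketch of grouped-query attention (GQA), not Apple's code.
    # Several query heads share one key/value head, shrinking the KV projections
    # and the KV cache at inference time.
    batch, seq_len, d_model = x.shape
    head_dim = d_model // n_q_heads

    q = (x @ wq).view(batch, seq_len, n_q_heads, head_dim).transpose(1, 2)
    k = (x @ wk).view(batch, seq_len, n_kv_heads, head_dim).transpose(1, 2)
    v = (x @ wv).view(batch, seq_len, n_kv_heads, head_dim).transpose(1, 2)

    # Repeat each KV head so a contiguous group of query heads shares it.
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)

    out = F.scaled_dot_product_attention(q, k, v)
    return out.transpose(1, 2).reshape(batch, seq_len, d_model)

# Example: 8 query heads share 2 KV heads (4 queries per KV head).
x = torch.randn(1, 16, 64)
wq = torch.randn(64, 64)
wk = torch.randn(64, 16)  # 2 KV heads * head_dim of 8
wv = torch.randn(64, 16)
y = grouped_query_attention(x, wq, wk, wv, n_q_heads=8, n_kv_heads=2)
```

In this toy configuration, the key/value projections are a quarter the size of the query projection, which is what makes GQA attractive on memory-constrained devices.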

However, one standout feature of OpenELM is its non-uniform structure. Transformer models are typically designed with the same configuration across all of their layers and blocks. While this makes the architecture much more manageable, it means the models do not allocate parameters efficiently. In contrast, each transformer layer in OpenELM has a different configuration, such as the number of attention heads and the dimensions of the feed-forward network. This makes the architecture more complicated but enables OpenELM to make better use of the available parameter budget for higher accuracy.

The researchers implement this non-uniform allocation with "layer-wise scaling," which adjusts each layer's configuration based on how close it is to the model's input and output. The method uses smaller latent dimensions in the attention and feed-forward modules of the layers closer to the input and gradually widens the layers as they approach the output.
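As a rough illustration, here is a short Python sketch of the idea: each layer's attention head count and feed-forward width are interpolated between a narrow configuration at the input and a wide one at the output. The scaling ranges and rounding scheme here are illustrative, not OpenELM's actual hyperparameters (see the paper for the exact formulation).

```python
# Simplified sketch of layer-wise scaling: each layer gets its own number of
# attention heads and FFN width, growing linearly from input to output.
# The alpha/beta ranges below are made up for illustration.

def layerwise_configs(n_layers, d_model, head_dim,
                      alpha=(0.5, 1.0), beta=(2.0, 4.0)):
    configs = []
    for i in range(n_layers):
        t = i / max(1, n_layers - 1)  # 0.0 at the first layer, 1.0 at the last
        # Interpolate the attention-width and FFN-width multipliers.
        a = alpha[0] + (alpha[1] - alpha[0]) * t
        b = beta[0] + (beta[1] - beta[0]) * t
        n_heads = max(1, round(a * d_model / head_dim))
        ffn_dim = int(b * d_model)
        configs.append({"layer": i, "n_heads": n_heads, "ffn_dim": ffn_dim})
    return configs

for cfg in layerwise_configs(n_layers=8, d_model=512, head_dim=64):
    print(cfg)
```

The result is that early layers spend fewer parameters than late ones, while the total parameter budget stays fixed.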

Apple’s on-device AI strategy

But perhaps more surprising is the open release of OpenELM. While there is much debate about what is and isn't open source, Apple has gone out of its way to make everything public, including the model weights, training logs, multiple training checkpoints, and pre-training configurations of OpenELM. It has also released two sets of models: plain pre-trained OpenELM models and instruction fine-tuned versions.

Apple has also released the code for converting the models to MLX, Apple's array framework for machine learning on Apple silicon. The assets are released under Apple's own license, which places no restrictions on their use in commercial applications.
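For readers who want to try the models, the checkpoints are published on Hugging Face. A minimal sketch with the transformers library might look like the following; the model ID reflects the release at the time of writing, loading requires trust_remote_code because OpenELM ships custom modeling code, and the release reuses the (gated) Llama 2 tokenizer.

```python
# Sketch of loading an OpenELM checkpoint from Hugging Face. Model IDs and
# tokenizer choice reflect the release at the time of writing and may change.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "apple/OpenELM-270M-Instruct", trust_remote_code=True
)
# OpenELM reuses the Llama 2 tokenizer, a gated repo that requires
# accepting Meta's license on Hugging Face.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Once upon a time there was", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```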

“This comprehensive release aims to empower and strengthen the open research community, paving the way for future open research endeavors,” the researchers write.

While this is a break from Apple’s secretive culture, it makes sense from a business standpoint. In less than two years the generative AI market has undergone major changes. One way to corner the market is with huge private models, as OpenAI has done.

But in the past year, open models have made impressive advances. Running them costs a fraction of what private models cost, and their performance is quickly catching up. More importantly, open models make it possible for the research community to repurpose them for new applications and environments. For example, within days of its release, Meta's Llama 3 had been forked, fine-tuned, and modified in thousands of ways.

And the market for on-device language models is showing great promise as researchers are finding ways to make SLMs accomplish complicated tasks. Among the authors of OpenELM is Mohammad Rastegari, a scientist who also worked on Apple’s “LLM in a flash” paper, which introduces a technique that reduces the memory consumption of language models on low-memory devices such as phones and laptops. (Rastegari recently moved to Meta.)

While Apple doesn't have the advantages of a hyperscaler like Microsoft or Google, it certainly has the advantage when it comes to on-device inference. Apple has full control over the software and hardware of its devices. It can optimize its models for its processors, and it can optimize the next generation of its processors for its models. This is why every model Apple releases also includes a version optimized for Apple silicon. At the same time, opening the models will stimulate activity among researchers interested in creating applications for the billions of Apple devices on users' desks and in their pockets. This can create a network effect that gives Apple devices an edge in on-device AI: more developers building SLM applications for the Apple ecosystem, and Apple gaining a better understanding of how to optimize the next generation of its hardware and software.

But Apple will also be facing competition from other companies, including Microsoft, which is betting big on small language models and is creating an ecosystem of AI Copilots that run seamlessly on device and in the cloud. It remains to be seen who will be the ultimate winner of the generative AI market and whether there will be parallel markets with many dominant companies.
