Amazon has officially entered the competitive arena of commercial large language models (LLMs) with the announcement of its next-generation Alexa assistant. Like its major tech counterparts, Amazon is banking on the burgeoning market for generative AI, confident that its unique approach to LLMs will set it apart.
With an abundance of data, huge computing resources, and a substantial budget at its disposal, Amazon is well-equipped to train a high-quality LLM to power Alexa. The company has not yet released a paper on its LLM (or a pseudo-paper, aka "technical report," of the kind other big tech companies have been releasing lately).
Nevertheless, a blog post by Alexa’s Vice President, Daniel Rausch, and a product demo video offer a glimpse behind the curtain. In this article, I will delve into these hints and speculate on the potential capabilities and features of Amazon’s Alexa LLM.
It will be multimodal
According to the Amazon blog, the Alexa LLM has been “custom-built and specifically optimized for voice interactions.”
However, the blog also suggests that Alexa’s LLM will not be limited to voice alone. It will be a multimodal LLM, capable of processing voice, video, and text embeddings. Multimodal language models can provide a richer experience because they capture nuances in conversations that text alone cannot.
The blog post states, “In any conversation, we process tons of additional information, such as body language, knowledge of the person you’re talking with, and eye contact. To enable that with Alexa, we fused the input from the sensors in an Echo—the camera, voice input, its ability to detect presence—with AI models that can understand those non-verbal cues.”
This multimodal functionality could be achieved in several ways. One approach is a modular architecture that processes voice, text, and video separately and maps each into an embedding for the model. Alternatively, an end-to-end model could process all modalities simultaneously, producing a unified embedding for the LLM.
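To make the modular approach concrete, here is a minimal sketch of per-modality encoders whose outputs are fused into a single embedding for a downstream language model. Amazon has not described its architecture; the encoder names, dimensions, and the simple concatenate-and-project fusion are all illustrative assumptions.

```python
import numpy as np

EMBED_DIM = 64  # shared embedding size (illustrative)
rng = np.random.default_rng(0)

# Stand-in encoders: in a real system each of these would be a
# separate neural network trained on its own modality.
def encode_voice(audio: np.ndarray) -> np.ndarray:
    return audio.mean(axis=0)  # e.g. pooled acoustic features

def encode_text(token_ids: list) -> np.ndarray:
    table = rng.standard_normal((1000, 32))  # toy embedding table
    return table[token_ids].mean(axis=0)

def encode_video(frames: np.ndarray) -> np.ndarray:
    return frames.reshape(frames.shape[0], -1).mean(axis=0)

# Fusion: concatenate the per-modality vectors and project them
# into one shared embedding the LLM can consume.
def fuse(voice, text, video, w):
    joint = np.concatenate([voice, text, video])
    return w @ joint

voice = encode_voice(rng.standard_normal((50, 32)))
text = encode_text([1, 5, 42])
video = encode_video(rng.standard_normal((10, 4, 8)))
w = rng.standard_normal((EMBED_DIM, voice.size + text.size + video.size))
embedding = fuse(voice, text, video, w)
print(embedding.shape)  # (64,)
```

An end-to-end model would instead feed raw (or lightly processed) inputs from all modalities into one network, letting it learn cross-modal interactions directly rather than through a hand-designed fusion step.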
Interestingly, the demo exclusively used Amazon Echo Show. I have previously discussed the limitations of voice-only interfaces, and it appears Amazon also recognizes these constraints. The company seems to be moving towards multimodal outputs, where voice and graphical user interfaces (VUI and GUI) complement each other. This approach provides a more robust user experience and opens the door to more complex use cases involving multiple steps and options. It can also enable Amazon to experiment with different business models.
It will be extensible
The demo video for Alexa’s LLM primarily showcases users asking for information or requesting the composition of poems. However, the Amazon blog suggests that the LLM’s capabilities extend beyond these basic tasks. It will be able to perform “the things we know our customers love—getting real-time information, efficient smart home control, and maximizing their home entertainment.”
Amazon reveals that the Alexa LLM is connected to thousands of APIs and can execute complex sequences of tasks, which suggests it aims to venture beyond the home. The blog post illustrates this with an example: “For example, the LLM gives you the ability to program complex Routines entirely by voice—customers can just say, ‘Alexa, every weeknight at 9 p.m., make an announcement that it’s bed time for the kids, dim the lights upstairs, turn on the porch light, and switch on the fan in the bedroom.’ Alexa will then automatically program that series of actions to take place every night at 9 p.m.”
This feature is akin to the ChatGPT plugins or the external service calling technique proposed by the Toolformer paper. It will be intriguing to see how Amazon integrates existing Alexa skills into this new framework or whether it provides developers with tools to create their own API endpoints for the LLM to call.
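The general pattern behind plugin- and Toolformer-style tool use is that the model emits a structured plan of API calls, which a runtime then dispatches to registered functions. A minimal sketch of that dispatch loop is below; the action names, the JSON plan format, and the registry are hypothetical and do not reflect Amazon's actual skill API.

```python
import json

# Hypothetical registry of device APIs the model can call.
ACTIONS = {
    "announce": lambda message: f"announced: {message}",
    "dim_lights": lambda room: f"dimmed lights in {room}",
    "switch_on": lambda device: f"switched on {device}",
}

def execute_plan(plan_json: str) -> list:
    """Run a sequence of API calls emitted by the model as JSON."""
    results = []
    for step in json.loads(plan_json):
        fn = ACTIONS[step["action"]]  # look up the registered API
        results.append(fn(**step["args"]))
    return results

# The LLM would translate "dim the lights upstairs and turn on the
# bedroom fan" into a structured plan like this:
plan = json.dumps([
    {"action": "dim_lights", "args": {"room": "upstairs"}},
    {"action": "switch_on", "args": {"device": "bedroom fan"}},
])
print(execute_plan(plan))
```

Opening such a registry to third-party developers would be the natural analog of today's Alexa skills, with each skill exposing one or more callable endpoints to the model.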
However, it’s important to note that external API calls are not a fully solved problem and can potentially go awry. This is where the multimodal interface can increase robustness by enabling users to intervene or provide final confirmation before the model executes sensitive actions such as placing orders or making purchases.
It will adapt to the user
Amazon has placed a strong emphasis on customization with the Alexa LLM to enhance the user experience. For instance, customers can opt to use Visual ID, which allows them to initiate a conversation with Alexa simply by facing the screen on an Echo Show, eliminating the need for a wake word. While this is an on-device feature and not directly linked to the LLM, it is an example of putting the right pieces together to streamline the user experience.
More crucially, Amazon has stated that the Alexa LLM will tailor its responses based on user preferences. According to the blog, “The next generation of Alexa will be able to deliver unique experiences based on the preferences you’ve shared, the services you’ve interacted with, and information about your environment.”
Given the prohibitive costs of training and serving a unique LLM for each user, it’s likely that the Alexa LLM will employ some form of retrieval augmentation. This means that each interaction with the LLM will be enhanced by context from your past experiences and the current conversation. However, incorporating user context can become complex as more data accumulates, necessitating a mechanism to filter the data and use only the parts relevant to the current discussion. A recent paper by Stanford researchers proposes an intriguing approach to this filtering process. The model could construct a hierarchy of memories and preferences, retrieving them based on a ranking algorithm that assesses their relevance.
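In the spirit of that Stanford work, memory retrieval can be sketched as scoring each stored memory by recency, importance, and relevance, then injecting the top-ranked ones into the prompt. The scoring functions, weights, and two-dimensional embeddings below are simplified stand-ins, not anything Amazon has disclosed.

```python
import math

def recency(ts: float, now: float, decay: float = 0.995) -> float:
    """Exponential decay per hour since the memory was recorded."""
    hours = (now - ts) / 3600
    return decay ** hours

def relevance(memory_vec, query_vec) -> float:
    """Cosine similarity between memory and query embeddings."""
    dot = sum(a * b for a, b in zip(memory_vec, query_vec))
    norm = (math.sqrt(sum(a * a for a in memory_vec))
            * math.sqrt(sum(b * b for b in query_vec)))
    return dot / norm if norm else 0.0

def retrieve(memories, query_vec, now, k=2):
    """Return the k highest-scoring memory texts."""
    scored = []
    for m in memories:
        score = (recency(m["ts"], now) + m["importance"]
                 + relevance(m["vec"], query_vec))
        scored.append((score, m["text"]))
    return [text for _, text in sorted(scored, reverse=True)[:k]]

now = 1_000_000.0
memories = [
    {"text": "User prefers jazz", "ts": now - 3600,
     "importance": 0.8, "vec": [1, 0]},
    {"text": "User asked about weather", "ts": now - 60,
     "importance": 0.2, "vec": [0, 1]},
    {"text": "Kids' bedtime is 9 p.m.", "ts": now - 86400,
     "importance": 0.9, "vec": [0.5, 0.5]},
]
print(retrieve(memories, query_vec=[1, 0], now=now, k=2))
```

The key point is that the prompt only ever carries a small, relevance-filtered slice of the user's history, which keeps per-interaction costs bounded as the memory store grows.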
How Amazon will convert past user data into a format compatible with the new model remains unclear. One possibility is that they built the model on top of their legacy user data format. This approach could impose some limitations on the model, as the old data format wasn’t designed with LLMs in mind. Alternatively, Amazon may have developed an entirely new user data infrastructure for the new model. While this would offer greater flexibility, it would likely entail significant costs. Without concrete information, we can only speculate based on the user experience.
It will use in-context learning
One of the key features of Amazon’s Alexa LLM is its ability to build context within each conversation. The blog post states, “Ask Alexa a question about a museum, and you’ll be able to ask a series of follow-ups about its hours, exhibits, and location without needing to restate any of the prior context, like the name or the day you plan to go.”
This suggests that Alexa’s LLM will be using a form of in-context learning to maintain a continuous thread of conversation and use prior interactions to reduce the need for repeating past information.
In essence, each conversation with the Alexa LLM will be a long, evolving prompt, incorporating both user input and model responses. However, Amazon has not disclosed how long these conversations can be or the number of tokens that the Alexa LLM can support. Given that voice interactions with Alexa are unlikely to be as lengthy as text-based sessions with GPT-4, where users may write full articles or papers, it’s reasonable to assume a manageable conversation length.
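A rolling-context scheme like this can be sketched simply: each turn is appended to the prompt, and the oldest turns are dropped once a token budget is exceeded. The tiny budget and whitespace "tokenizer" below are illustrative assumptions; production systems use real tokenizers and far larger windows.

```python
MAX_TOKENS = 20  # tiny budget for illustration

def count_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

class Conversation:
    def __init__(self):
        self.turns = []

    def add(self, role: str, text: str):
        self.turns.append((role, text))
        # Drop the oldest turns until the context fits the budget.
        while sum(count_tokens(t) for _, t in self.turns) > MAX_TOKENS:
            self.turns.pop(0)

    def prompt(self) -> str:
        return "\n".join(f"{role}: {text}" for role, text in self.turns)

chat = Conversation()
chat.add("user", "When is the museum open?")
chat.add("alexa", "It is open from 9 a.m. to 5 p.m.")
chat.add("user", "And where is it located?")  # prior context retained
chat.add("alexa", "The museum is at 5th and Main, about ten minutes away.")
print(chat.prompt())  # earliest turns have been evicted to fit the budget
```

More sophisticated variants summarize evicted turns instead of discarding them, but the basic mechanism of a bounded, evolving prompt is the same.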
Another feature is the ability to interrupt the assistant and modify your request. This is similar to clicking the “Stop Generation” button and adding a new prompt in LLM interfaces. It will be interesting to see how this feature performs in real-world environments, where multiple people may be speaking simultaneously, and the LLM will need to discern which voice stream to follow. But if done right, it could significantly enhance the responsiveness of the Alexa LLM.
It will be biased
The issue of bias in large language models has sparked considerable controversy, with concerns that these models may reflect and perpetuate certain political or cultural views at the expense of others. In response, many LLM service providers have adopted conservative strategies, designing their models to avoid expressing opinions.
However, Amazon appears to be charting a different course. According to the Amazon blog, “the most boring dinner party is one where nobody has an opinion—and, with this new LLM, Alexa will have a point of view, making conversations more engaging.” Among the examples provided, Alexa might offer an opinion on which movie should have won the Oscars.
While Amazon will undoubtedly implement safeguards to prevent the LLM from expressing views on sensitive topics, this approach is not without risks. By enabling Alexa to form and express opinions, Amazon is treading a fine line, and any missteps could potentially lead to backlash. It will be interesting to see how this strategy plays out in practice.
It will probably not be free
While Amazon has not yet announced any pricing plans, it’s worth noting that Alexa was not making money prior to this update. Moreover, operating a model that can compete with GPT-4 would entail significant costs. The Amazon blog states that the Alexa LLM will be offered as a “free preview” to U.S. customers in the near future. However, the long-term business model remains unclear, and I find it unlikely that Amazon will continue to provide this service free of charge.
Despite its impressive capabilities, the Alexa LLM seems limited compared to other multimodal LLMs that support a variety of file types and prompt engineering techniques. Nevertheless, Amazon appears confident in its approach, even taking a playful swipe at ChatGPT and other LLMs in the blog: “To our knowledge this is the largest integration of an LLM, real time services, and a suite of devices—and it’s not limited to a tab in a browser.”
While such claims may be premature, given the upcoming experiences like Microsoft Copilot, Google’s integration of Bard into its services, and Apple’s similar efforts, Amazon’s entry into the commercial LLM space is undoubtedly significant. It will be fascinating to observe how Amazon’s presence shapes the market and influences the development and deployment of LLMs in the future.