This article is part of our coverage of the latest in AI research.
Artificial intelligence research lab OpenAI made headlines again, this time with DALL-E 2, a machine learning model that can generate stunning images from text descriptions. DALL-E 2 builds on the success of its predecessor DALL-E and improves the quality and resolution of the output images thanks to advanced deep learning techniques.
The announcement of DALL-E 2 was accompanied by a social media campaign by OpenAI’s engineers and its CEO, Sam Altman, who shared wonderful photos created by the generative machine learning model on Twitter.
DALL-E 2 shows how far the AI research community has come toward harnessing the power of deep learning and addressing some of its limits. It also provides an outlook of how generative deep learning models might finally unlock new creative applications for everyone to use. At the same time, it reminds us of some of the obstacles that remain in AI research and disputes that need to be settled.
The beauty of DALL-E 2
Like other milestone OpenAI announcements, DALL-E 2 comes with a detailed paper and an interactive blog post that shows how the machine learning model works. There’s also a video that provides an overview of what the technology is capable of doing and what its limitations are.
DALL-E 2 is a “generative model,” a special branch of machine learning that creates complex output instead of performing prediction or classification tasks on input data. You provide DALL-E 2 with a text description, and it generates an image that fits the description.
Generative models are a hot area of research that received much attention with the introduction of generative adversarial networks (GAN) in 2014. The field has seen tremendous improvements in recent years, and generative models have been used for a vast variety of tasks, including creating artificial faces, deepfakes, synthesized voices, and more.
However, what sets DALL-E 2 apart from other generative models is its capability to maintain semantic consistency in the images it creates.
For example, the following images (from the DALL-E 2 blog post) are generated from the description “An astronaut riding a horse.” One of the descriptions ends with “as a pencil drawing” and the other “in photorealistic style.”
The model remains consistent in drawing the astronaut sitting on the back of the horse and holding his/her hands in front. This kind of consistency shows itself in most examples OpenAI has shared.
The following examples (also from OpenAI’s website) show another feature of DALL-E 2, which is to generate variations of an input image. Here, instead of providing DALL-E 2 with a text description, you provide it with an image, and it tries to generate other forms of the same image. Here, DALL-E maintains the relations between the elements in the image, including the girl, the laptop, the headphones, the cat, the city lights in the background, and the night sky with moon and clouds.
Other examples suggest that DALL-E 2 seems to understand depth and dimensionality, a great challenge for algorithms that process 2D images.
Even if the examples on OpenAI’s website were cherry-picked, they are impressive. And the examples shared on Twitter show that DALL-E 2 seems to have found a way to represent and reproduce the relationships between the elements that appear in an image, even when it is “dreaming up” something for the first time.
In fact, to prove how good DALL-E 2 is, Altman took to Twitter and asked users to suggest prompts to feed to the generative model. The results (see the thread below) are fascinating.
The science behind DALL-E 2
DALL-E 2 takes advantage of CLIP and diffusion models, two advanced deep learning techniques created in the past few years. But at its heart, it shares the same concept as all other deep neural networks: representation learning.
Consider an image classification model. The neural network transforms pixel colors into a set of numbers that represent its features. This vector is sometimes also called the “embedding” of the input. Those features are then mapped to the output layer, which contains a probability score for each class of image that the model is supposed to detect. During training, the neural network tries to learn the best feature representations that discriminate between the classes.
Ideally, the machine learning model should be able to learn latent features that remain consistent across different lighting conditions, angles, and background environments. But as has often been seen, deep learning models often learn the wrong representations. For example, a neural network might think that green pixels are a feature of the “sheep” class because all the images of sheep it has seen during training contain a lot of grass. Another model that has been trained on pictures of bats taken during the night might consider darkness a feature of all bat pictures and misclassify pictures of bats taken during the day. Other models might become sensitive to objects being centered in the image and placed in front of a certain type of background.
Learning the wrong representations is partly why neural networks are brittle, sensitive to changes in the environment, and poor at generalizing beyond their training data. It is also why neural networks trained for one application need to be finetuned for other applications—the features of the final layers of the neural network are usually very task-specific and can’t generalize to other applications.
In theory, you could create a huge training dataset that contains all kinds of variations of data that the neural network should be able to handle. But creating and labeling such a dataset would require immense human effort and is practically impossible.
This is the problem that Contrastive Learning-Image Pre-training (CLIP) solves. CLIP trains two neural networks in parallel on images and their captions. One of the networks learns the visual representations in the image and the other learns the representations of the corresponding text. During training, the two networks try to adjust their parameters so that similar images and descriptions produce similar embeddings.
One of the main benefits of CLIP is that it does not need its training data to be labeled for a specific application. It can be trained on the huge number of images and loose descriptions that can be found on the web. Additionally, without the rigid boundaries of classic categories, CLIP can learn more flexible representations and generalize to a wide variety of tasks. For example, if an image is described as “a boy hugging a puppy” and another described as “a boy riding a pony,” the model will be able to learn a more robust representation of what a “boy” is and how it relates to other elements in images.
CLIP has already proven to be very useful for zero-shot and few-shot learning, where a machine learning model is shown on-the-fly to perform tasks that it hasn’t been trained for.
The other machine learning technique used in DALL-E 2 is “diffusion,” a kind of generative model that learns to create images by gradually noising and denoising its training examples. Diffusion models are like autoencoders, which transform input data into an embedding representation and then reproduce the original data from the embedding information.
DALL-E trains a CLIP model on images and captions. It then uses the CLIP model to train the diffusion model. Basically, the diffusion model uses the CLIP model to generate the embeddings for the text prompt and its corresponding image. It then tries to generate the image that corresponds to the text.
Disputes over deep learning and AI research
For the moment, DALL-E 2 will only be made available to a limited number of users who have signed up for the waitlist. Since the release of GPT-2, OpenAI has been reluctant to release its AI models to the public. GPT-3, its most advanced language model, is only available through an API interface. There’s no access to the actual code and parameters of the model.
OpenAI’s policy of not releasing its models to the public has not rested well with the AI community and has attracted criticism from some renowned figures in the field.
DALL-E 2 has also resurfaced some of the longtime disagreements over the preferred approach toward artificial general intelligence. OpenAI’s latest innovation has certainly proven that with the right architecture and inductive biases, you can still squeeze more out of neural networks.
Proponents of pure deep learning approaches jumped on the opportunity to slight their critics, including a recent essay by cognitive scientist Gary Marcus titled, “Deep Learning is Hitting a Wall.” Marcus endorses a hybrid approach that combines neural networks with symbolic systems.
Based on the examples that have been shared by the OpenAI team, DALL-E 2 seems to manifest some of the commonsense capabilities that have so long been missing in deep learning systems. But it remains to be seen how deep this commonsense and semantic stability goes, and how DALL-E 2 and its successors will deal with more complex concepts such as compositionality.
The DALL-E 2 paper mentions some of the limitations of the model in generating text and complex scenes. Responding to the many tweets directed his way, Marcus pointed out that the DALL-E 2 paper in fact proves some of the points he has been making in his papers and essays.
Some scientists have pointed out that despite the fascinating results of DALL-E 2, some of the key challenges of artificial intelligence remain unsolved. Melanie Mitchell, Professor of Complexity at the Santa Fe Institute and author of Artificial Intelligence: A Guide For Thinking Humans, raised some important questions in a Twitter thread.
Mitchell referred to Bongard problems, a set of challenges that test the understanding of concepts such as sameness, adjacency, numerosity, concavity/convexity, and closedness/openness.
“We humans can solve these visual puzzles due to our core knowledge of basic concepts and our abilities of flexible abstraction and analogy,” Mitchell tweeted. “If such an AI system were created, I would be convinced that the field is making real progress on human-level intelligence. Until then, I will admire the impressive products of machine learning and big data, but will not mistake them for progress toward general intelligence.”
The business case for DALL-E 2
Since switching from non-profit to a “capped profit” structure, OpenAI has been trying to find the balance between scientific research and product development. The company’s strategic partnership with Microsoft has given it solid channels to monetize some of its technologies, including GPT-3 and Codex.
In a blog post, Altman suggested a possible DALL-E 2 product launch in the summer. Many analysts are already suggesting applications for DALL-E 2, such as creating graphics for articles (I could certainly use some for mine) and doing basic edits on images. DALL-E 2 will enable more people to express their creativity without the need for special skills with tools.
Altman suggests that advances in AI are taking us toward “a world in which good ideas are the limit for what we can do, not specific skills.”
In any case, the more interesting applications of DALL-E will surface as more and more users tinker with it. For example, the idea for Copilot and Codex emerged as users started using GPT-3 to generate source code for software.
If OpenAI releases a paid API service a la GPT-3, then more and more people will be able to build apps with DALL-E 2 or integrate the technology into existing applications. But as was the case with GPT-3, building a business model around a potential DALL-E 2 product will have its own unique challenges. A lot of it will depend on the costs of training and running DALL-E 2, the details of which have not been published yet.
And as the exclusive license holder to GPT-3’s technology, Microsoft will be the main winner of any innovation built on top of DALL-E 2 because it will be able to do it faster and cheaper. Like GPT-3, DALL-E 2 is a reminder that as the AI community continues to gravitate toward creating larger neural networks trained on ever-larger training datasets, power will continue to be consolidated in a few very wealthy companies that have the financial and technical resources needed for AI research.