What will come after OpenAI’s Sora

Pirate ships in a coffee cup
Image generated with Bing Image Creator

The release of OpenAI Sora has caused fascination and panic among scientists, artists, and politicians. The quality of the videos generated by Sora is really impressive, especially when compared to AI-generated videos from last year. 

The model is still experimental and few people have access to it. But from the examples OpenAI has shared so far, it is clear that despite the impressive results, text-to-video still has some fundamental flaws that need to be fixed before it can be used in production.

What we know about how Sora works

Unfortunately, OpenAI has not released much information about the model(s) behind Sora except that it uses diffusion and transformer architectures. We also know that the model has been trained at a very large scale thanks to OpenAI’s vast compute and data resources.

The accompanying “technical report,” which also doesn’t go into implementation details, contains some hints about what kind of models and techniques it uses. Sora has built on much of the research done at Google, Facebook, and university labs, a reminder of the good old days of sharing knowledge.

One Google DeepMind researcher poked fun at OpenAI for using their open research without sharing theirs in return.

Saining Xie, a deep learning researcher at New York University, also speculated on how Sora works based on the technical report. OpenAI has apparently used its massive compute and data resources to scale a simple architecture to a degree that led to “emerging simulating capabilities.”
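The technical report does offer one concrete hint about the representation: video is compressed into a latent space and cut into “spacetime patches” that a diffusion transformer processes as a sequence of tokens, much like a language model processes words. Below is a minimal sketch of that patching step in PyTorch. The patch sizes are illustrative guesses (OpenAI has not disclosed them), and for simplicity it patchifies raw frames rather than a compressed latent representation.

```python
import torch

def video_to_spacetime_patches(video, patch_t=2, patch_h=16, patch_w=16):
    """Split a video tensor into flattened spacetime patches.

    video: (T, C, H, W) tensor whose dimensions are assumed to be divisible
    by the (hypothetical) patch sizes. Returns a (num_patches, patch_dim)
    sequence that a transformer can consume as tokens.
    """
    T, C, H, W = video.shape
    patches = (
        video.reshape(T // patch_t, patch_t, C, H // patch_h, patch_h, W // patch_w, patch_w)
             .permute(0, 3, 5, 1, 2, 4, 6)  # group into a (time, height, width) grid of patches
             .flatten(3)                     # flatten each patch into a single vector
             .flatten(0, 2)                  # flatten the grid into one token sequence
    )
    return patches  # shape: (num_patches, patch_t * C * patch_h * patch_w)

# Example: a 16-frame, 3-channel, 256x256 clip becomes 2,048 tokens of size 1,536
tokens = video_to_spacetime_patches(torch.randn(16, 3, 256, 256))
print(tokens.shape)  # torch.Size([2048, 1536])
```

According to the report, this token-based representation is what lets the same model train on videos of different durations, resolutions, and aspect ratios.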

However, despite its impressive results, Sora still shows clear artifacts that reveal it does not understand the world. On the one hand, it can produce great detail in individual scenes and objects. On the other, it violates basic rules of physics and cause and effect.

For example, objects can appear out of nowhere, or the model can get the scale of objects wrong throughout the video. Sometimes it mixes up different objects. It is especially bad at simulating limbs: feet and hands can bend in the wrong direction, and when characters walk, their legs can get mixed up as they cross each other from the camera’s viewpoint, disrupting the character’s gait mid-stride.

Videos with only a few simple objects are more consistent, which shows that compositionality remains a big challenge for current generative models. OpenAI’s blog acknowledges that the model “may struggle with accurately simulating the physics of a complex scene, and may not understand specific instances of cause and effect.” Sora can also confuse “spatial details,” another problem it shares with DALL-E and other image-generation tools.

Continuing to scale

There are different opinions about how to solve Sora’s current problems. One obvious approach is to continue scaling the models. The technical report shows that the researchers were able to improve the results with more parameters, data, and compute. This is a pattern that has regularly been seen with transformer-based models: we still haven’t reached the ceiling of what transformers can do as their size and training data continue to increase.

But scaling is cost-prohibitive and is only accessible to companies like OpenAI, which have great financial and compute resources and also have a profitable business model that enables them to throw cash at such experiments.

Another possible direction is to explore other ways to improve the existing model with different training techniques. Jim Fan from Nvidia compared Sora to GPT-3, the precursor to ChatGPT.

The first version of ChatGPT was built on top of GPT-3 but improved with reinforcement learning from human feedback (RLHF) and better training data. It turned out that the existing model had a lot of untapped potential that could be activated with the right training techniques. After that, it was a short jump to GPT-4 (at least given the vague information that OpenAI has made available). The Sora report indicates that the team was able to use synthetic data to annotate the training examples, which is an approach that can be scaled with more compute resources. Therefore, combining scale, better data, and new training techniques might help Sora take the next leap.
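Part of that “better data” story, according to the report, is the re-captioning technique borrowed from DALL-E 3: a separate captioning model writes detailed synthetic descriptions for the training videos. The sketch below illustrates the idea only in outline; every name in it is a placeholder, not OpenAI’s actual pipeline.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrainingPair:
    video_path: str
    caption: str

def build_training_set(video_paths: List[str], captioner) -> List[TrainingPair]:
    """Annotate unlabeled clips with synthetic captions from a captioner model."""
    pairs = []
    for path in video_paths:
        # The key idea: the text labels are generated by a model rather than
        # scraped, so their detail and coverage scale with compute.
        caption = captioner.describe(path, style="highly detailed")
        pairs.append(TrainingPair(video_path=path, caption=caption))
    return pairs

class FakeCaptioner:
    """Trivial stand-in so the sketch runs; a real captioner would be a learned model."""
    def describe(self, path: str, style: str) -> str:
        return f"[{style}] description of {path}"

pairs = build_training_set(["clip_001.mp4", "clip_002.mp4"], FakeCaptioner())
print(pairs[0].caption)
```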

Other approaches that might work

An alternative solution is to redesign the generative models or combine them with other systems to obtain more accurate results. 

For example, a model like Sora can pass its output to another generative model such as a neural radiance field (NeRF) to create a 3D map of the video it has generated. Those objects and their movements can then be refined with a physics simulator such as Unreal Engine, which already provides very accurate results. Finally, other generative models such as StyleGAN can change the lighting, style, and other aspects of the final output. Many other small bits can be added to further control the pipeline, such as modifying specific objects or backgrounds. 
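None of this exists as an integrated product today, but the shape of such a pipeline is easy to sketch. Everything below is speculative: each function is a hypothetical stand-in for one of the systems mentioned above (a Sora-like generator, a NeRF reconstructor, a physics engine such as Unreal Engine, a StyleGAN-like style model), not a real API.

```python
from dataclasses import dataclass

@dataclass
class Video:
    frames: list      # placeholder for decoded frames

@dataclass
class Scene3D:
    objects: list     # placeholder for reconstructed geometry and motion

def generate_draft(prompt: str) -> Video:
    """Stand-in for a Sora-like text-to-video model drafting an initial clip."""
    return Video(frames=[f"frame for: {prompt}"])

def reconstruct_3d(video: Video) -> Scene3D:
    """Stand-in for a NeRF-style model that lifts 2D frames into a 3D scene."""
    return Scene3D(objects=video.frames)

def enforce_physics(scene: Scene3D) -> Scene3D:
    """Stand-in for a physics-engine pass that re-simulates object motion."""
    return scene

def restyle(scene: Scene3D, prompt: str) -> Video:
    """Stand-in for a style model that adjusts lighting and look."""
    return Video(frames=scene.objects)

def hybrid_pipeline(prompt: str) -> Video:
    draft = generate_draft(prompt)
    scene = reconstruct_3d(draft)
    scene = enforce_physics(scene)
    return restyle(scene, prompt)

print(hybrid_pipeline("pirate ships battling in a cup of coffee"))
```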

Nvidia is using a similar technique in its Neural Reconstruction Engine, which creates highly detailed 3D environments from videos recorded by cameras installed on its cars. It uses these environments to create photo-realistic simulated scenarios to train the models used in self-driving cars.

I’m also looking forward to other architectures that can either complement or replace Sora. The Joint Embedding Predictive Architecture (JEPA) by Meta can be a good solution to explore. The idea behind JEPA is to learn latent representations that ensure consistency throughout time without the need to predict pixel-level features. For example, JEPA can measure how different objects should move in a scene without the need to predict their finest details. Such models can be used to measure and correct errors that generative models make across frames.
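To make the distinction concrete, here is a minimal sketch of a JEPA-style training signal: an encoder maps frames to latent vectors, a small predictor guesses the latent of the next frame, and the loss is computed in latent space instead of pixel space. The architecture and sizes are arbitrary placeholders (the real V-JEPA works on masked video patches and uses an EMA target encoder), so treat this as an illustration of the objective, not Meta’s implementation.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Toy frame encoder; maps a 64x64 RGB frame to a 128-d latent vector."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, x):
        return self.net(x)

encoder = Encoder()
predictor = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 128))

frame_t = torch.randn(8, 3, 64, 64)    # a batch of current frames
frame_t1 = torch.randn(8, 3, 64, 64)   # the corresponding next frames

z_t = encoder(frame_t)
with torch.no_grad():                   # target latents carry no gradient in this sketch
    z_t1 = encoder(frame_t1)

# The loss compares predicted and actual latents, so the model only has to
# capture how the scene should evolve, not every pixel-level detail.
loss = nn.functional.mse_loss(predictor(z_t), z_t1)
loss.backward()
```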

There is a lot of appeal to an end-to-end system that goes straight from the prompt to a finished movie. But in practice, if these generative models are to become useful in production, they should provide more control to their users. While a modular system that combines different models and physics engines might have some limitations, it will provide more accurate results and allow users to adjust the final video as they see fit.

I can easily see this evolving into a tool where every user can start with a prompt, generate a 3D scene, and then use more natural language commands or visual tools to refine the scene to their liking. Adobe is already exploring this hybrid approach with its generative AI tools.

It will be interesting to see how the text-to-video landscape evolves after the release of Sora. What is evident is that we’re seeing an accelerating pace of innovation and progress.
