What will come after OpenAI’s Sora

Pirate ships in a coffee cup
Image generated with Bing Image Creator

The release of OpenAI Sora has caused fascination and panic among scientists, artists, and politicians. The quality of the videos generated by Sora is really impressive, especially when compared to AI-generated videos from last year. 

The model is still experimental and few people have access to it. But from the examples OpenAI has shared so far, it is clear that despite the impressive results, text-to-video still has some fundamental flaws that need to be fixed before it can be used in production.

What we know about how Sora works

Unfortunately, OpenAI has not released much information about the model(s) behind Sora except that it uses diffusion and transformer architectures. We also know that the model has been trained at a very large scale thanks to OpenAI’s vast compute and data resources.

The accompanying “technical report,” which also doesn’t go into implementation details, contains some hints about what kind of models and techniques it uses. Sora has built on much of the research done at Google, Facebook, and university labs, a reminder of the good old days of sharing knowledge.

One Google DeepMind researcher poked fun at OpenAI for using their open research without sharing theirs in return.

Saining Xie, a deep learning researcher at New York University, also speculated on how Sora works based on the technical report. OpenAI has apparently used its massive compute and data resources to scale a simple architecture to a degree that led to “emerging simulating capabilities.”
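The technical report does offer one concrete hint about the representation: video is compressed into a latent space and cut into “spacetime patches” that a diffusion transformer processes as a sequence of tokens, much like a language model processes words. Below is a minimal sketch of that patching step in PyTorch. The patch sizes are illustrative guesses (OpenAI has not disclosed them), and for simplicity it patchifies raw frames rather than a compressed latent representation.

```python
import torch

def video_to_spacetime_patches(video, patch_t=2, patch_h=16, patch_w=16):
    """Split a video tensor into flattened spacetime patches.

    video: (T, C, H, W) tensor whose dimensions are assumed to be divisible
    by the (hypothetical) patch sizes. Returns a (num_patches, patch_dim)
    sequence that a transformer can consume as tokens.
    """
    T, C, H, W = video.shape
    patches = (
        video.reshape(T // patch_t, patch_t, C, H // patch_h, patch_h, W // patch_w, patch_w)
             .permute(0, 3, 5, 1, 2, 4, 6)  # group into a (time, height, width) grid of patches
             .flatten(3)                     # flatten each patch into a single vector
             .flatten(0, 2)                  # flatten the grid into one token sequence
    )
    return patches  # shape: (num_patches, patch_t * C * patch_h * patch_w)

# Example: a 16-frame, 3-channel, 256x256 clip becomes 2,048 tokens of size 1,536
tokens = video_to_spacetime_patches(torch.randn(16, 3, 256, 256))
print(tokens.shape)  # torch.Size([2048, 1536])
```

According to the report, this token-based representation is what lets the same model train on videos of different durations, resolutions, and aspect ratios.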

However, despite its impressive results, Sora still shows clear artifacts that reveal it does not understand the world. On the one hand, it can produce great detail in individual scenes and objects. On the other, it violates basic rules of physics and cause and effect.

For example, objects can appear out of nowhere, or the model can get the scale of objects wrong throughout the video. Sometimes it mixes up different objects. It is especially bad at simulating limbs: feet and hands can bend in the wrong direction, and when characters walk, their legs can get mixed up as they cross each other from the camera’s viewpoint, disrupting the character’s gait mid-stride.

Videos with only a few simple objects are more consistent, which shows that compositionality remains a big challenge for current generative models. OpenAI’s blog acknowledges that the model “may struggle with accurately simulating the physics of a complex scene, and may not understand specific instances of cause and effect.” Sora can also confuse “spatial details,” another problem it shares with DALL-E and other image-generation tools.

Continuing to scale

There are different opinions about how to solve Sora’s current problems. One obvious approach is to continue scaling the models. The technical report shows that the researchers were able to improve the results with more parameters, data, and compute. This is a pattern that has regularly been seen with transformer-based models: we still haven’t reached the ceiling of what transformers can do as their size and training data continue to increase.

But scaling is cost-prohibitive and is only accessible to companies like OpenAI, which have great financial and compute resources and also have a profitable business model that enables them to throw cash at such experiments.

Another possible direction is to explore other ways to improve the existing model with different training techniques. Jim Fan from Nvidia compared Sora to GPT-3, the precursor to ChatGPT.

The first version of ChatGPT was built on top of GPT-3 but improved with reinforcement learning from human feedback (RLHF) and better training data. It turned out that the existing model had a lot of untapped potential that could be activated with the right training techniques. After that, it was a short jump to GPT-4 (at least given the vague information that OpenAI has made available). The Sora report indicates that the team was able to use synthetic data to annotate the training examples, which is an approach that can be scaled with more compute resources. Therefore, combining scale, better data, and new training techniques might help Sora take the next leap.
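Part of that “better data” story, according to the report, is the re-captioning technique borrowed from DALL-E 3: a separate captioning model writes detailed synthetic descriptions for the training videos. The sketch below illustrates the idea only in outline; every name in it is a placeholder, not OpenAI’s actual pipeline.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrainingPair:
    video_path: str
    caption: str

def build_training_set(video_paths: List[str], captioner) -> List[TrainingPair]:
    """Annotate unlabeled clips with synthetic captions from a captioner model."""
    pairs = []
    for path in video_paths:
        # The key idea: the text labels are generated by a model rather than
        # scraped, so their detail and coverage scale with compute.
        caption = captioner.describe(path, style="highly detailed")
        pairs.append(TrainingPair(video_path=path, caption=caption))
    return pairs

class FakeCaptioner:
    """Trivial stand-in so the sketch runs; a real captioner would be a learned model."""
    def describe(self, path: str, style: str) -> str:
        return f"[{style}] description of {path}"

pairs = build_training_set(["clip_001.mp4", "clip_002.mp4"], FakeCaptioner())
print(pairs[0].caption)
```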

Other approaches that might work

An alternative solution is to redesign the generative models or combine them with other systems to obtain more accurate results. 

For example, a model like Sora can pass its output to another generative model such as a neural radiance field (NeRF) to create a 3D map of the video it has generated. Those objects and their movements can then be refined with a physics simulator such as Unreal Engine, which already provides very accurate results. Finally, other generative models such as StyleGAN can change the lighting, style, and other aspects of the final output. Many other small bits can be added to further control the pipeline, such as modifying specific objects or backgrounds. 
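None of this exists as an integrated product today, but the shape of such a pipeline is easy to sketch. Everything below is speculative: each function is a hypothetical stand-in for one of the systems mentioned above (a Sora-like generator, a NeRF reconstructor, a physics engine such as Unreal Engine, a StyleGAN-like style model), not a real API.

```python
from dataclasses import dataclass

@dataclass
class Video:
    frames: list      # placeholder for decoded frames

@dataclass
class Scene3D:
    objects: list     # placeholder for reconstructed geometry and motion

def generate_draft(prompt: str) -> Video:
    """Stand-in for a Sora-like text-to-video model drafting an initial clip."""
    return Video(frames=[f"frame for: {prompt}"])

def reconstruct_3d(video: Video) -> Scene3D:
    """Stand-in for a NeRF-style model that lifts 2D frames into a 3D scene."""
    return Scene3D(objects=video.frames)

def enforce_physics(scene: Scene3D) -> Scene3D:
    """Stand-in for a physics-engine pass that re-simulates object motion."""
    return scene

def restyle(scene: Scene3D, prompt: str) -> Video:
    """Stand-in for a style model that adjusts lighting and look."""
    return Video(frames=scene.objects)

def hybrid_pipeline(prompt: str) -> Video:
    draft = generate_draft(prompt)
    scene = reconstruct_3d(draft)
    scene = enforce_physics(scene)
    return restyle(scene, prompt)

print(hybrid_pipeline("pirate ships battling in a cup of coffee"))
```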

Nvidia is using a similar technique in its Neural Reconstruction Engine, which creates highly detailed 3D environments from videos recorded by cameras installed on its cars. It uses these environments to create photo-realistic simulated scenarios to train the models used in self-driving cars.

I’m also looking forward to other architectures that can either complement or replace Sora. The Joint Embedding Predictive Architecture (JEPA) by Meta can be a good solution to explore. The idea behind JEPA is to learn latent representations that ensure consistency throughout time without the need to predict pixel-level features. For example, JEPA can measure how different objects should move in a scene without the need to predict their finest details. Such models can be used to measure and correct errors that generative models make across frames.
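To make the distinction concrete, here is a minimal sketch of a JEPA-style training signal: an encoder maps frames to latent vectors, a small predictor guesses the latent of the next frame, and the loss is computed in latent space instead of pixel space. The architecture and sizes are arbitrary placeholders (the real V-JEPA works on masked video patches and uses an EMA target encoder), so treat this as an illustration of the objective, not Meta’s implementation.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Toy frame encoder; maps a 64x64 RGB frame to a 128-d latent vector."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, x):
        return self.net(x)

encoder = Encoder()
predictor = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 128))

frame_t = torch.randn(8, 3, 64, 64)    # a batch of current frames
frame_t1 = torch.randn(8, 3, 64, 64)   # the corresponding next frames

z_t = encoder(frame_t)
with torch.no_grad():                   # target latents carry no gradient in this sketch
    z_t1 = encoder(frame_t1)

# The loss compares predicted and actual latents, so the model only has to
# capture how the scene should evolve, not every pixel-level detail.
loss = nn.functional.mse_loss(predictor(z_t), z_t1)
loss.backward()
```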

There is a lot of appeal to an end-to-end system that goes straight from the prompt to a finished movie. But in practice, if these generative models are to become useful in production, they should provide more control to their users. While a modular system that combines different models and physics engines might have some limitations, it will provide more accurate results and allow users to adjust the final video as they see fit.

I can easily see this evolving into a tool where every user can start with a prompt, generate a 3D scene, and then use more natural language commands or visual tools to refine the scene to their liking. Adobe is already exploring this hybrid approach with its generative AI tools.

It will be interesting to see how the text-to-video landscape evolves after the release of Sora. What is evident is that we’re seeing an accelerating pace of innovation and progress.
