This generative AI model can be a big deal for the gaming industry

Image: pile of 3D objects (generated with Bing Image Creator)

This article is part of our coverage of the latest in AI research.

Generative AI systems, particularly diffusion models, have made significant strides in creating art. However, one of their biggest challenges is generating consistent objects from different angles. 

Humans can easily visualize an object from various perspectives given a single image, but replicating this ability in AI remains a challenging task. A new technique, SyncDreamer, developed by researchers at several universities, has made significant progress in this area. It uses a special variant of the diffusion model to generate multiple views of an object from a single image. 

This tool could have great benefits for game developers by reducing the time and effort required to create art and assets for games and other virtual environments.

Diffusion models and 3D objects

Diffusion models are the backbone of generative AI systems such as Stable Diffusion, DALL-E, and Midjourney. During training, these models learn to predict the noise that has been added to an image. Noise is added in several stages until the image goes from fully clear to pure noise, and the model learns to reverse the process to recreate the original image. Repeating this process on numerous images enables the trained model to generate highly detailed images from pure noise. 
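To make this concrete, here is a minimal Python sketch of the idea. The linear noise schedule and the model signature are simplified illustrations, not the exact formulation used by any of these systems:

```python
import torch

def add_noise(image, t, num_steps=1000):
    """Forward process: blend the clean image with Gaussian noise.
    The blend shifts toward noise as the step t grows, so at the
    final step the result is (almost) pure noise."""
    alpha = 1.0 - t / num_steps              # simple linear schedule (illustrative)
    noise = torch.randn_like(image)
    noisy = alpha ** 0.5 * image + (1 - alpha) ** 0.5 * noise
    return noisy, noise

def training_step(model, image, num_steps=1000):
    """The model is trained to predict the added noise; learning this
    is what lets it run the process in reverse at generation time."""
    t = torch.randint(0, num_steps, (1,)).item()
    noisy, noise = add_noise(image, t, num_steps)
    predicted = model(noisy, t)              # hypothetical model signature
    return torch.nn.functional.mse_loss(predicted, noise)
```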

Text-to-image models take this concept a step further by tying it with text descriptions. The model learns the distributions that connect text embeddings to images based on billions of image/description pairs.
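The text-conditioned variant changes little in the training loop. A rough sketch, reusing the add_noise helper from the previous example (the conditioned model signature is hypothetical):

```python
def conditioned_training_step(model, image, text_embedding, num_steps=1000):
    """Same denoising objective as before, but the noise predictor also
    receives a text embedding, tying text descriptions to image content."""
    t = torch.randint(0, num_steps, (1,)).item()
    noisy, noise = add_noise(image, t, num_steps)
    predicted = model(noisy, t, text_embedding)  # hypothetical conditioned model
    return torch.nn.functional.mse_loss(predicted, noise)
```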

Despite the impressive advances in diffusion models, multiview consistency remains a challenge. A diffusion model can create stunning images of scenes and objects, but it struggles to take a 2D image of an object and depict the same object from another angle.

Various efforts have been made to address this issue. One approach involves creating diffusion models that generate 3D objects directly, but this requires a large volume of annotated 3D objects to capture the complexity of 3D shapes. Another approach combines diffusion models with neural radiance fields (NeRF), a type of neural network that can create 3D objects from 2D images. However, this method requires additional steps of adding textual descriptions and generating the objects, which can be computationally intensive and require significant manual effort.

SyncDreamer

Examples of images created with SyncDreamer (source: SyncDreamer website)

SyncDreamer offers a third approach to creating 3D objects. It takes a 2D image of an object and generates other 2D images of the same object from different angles. These images can then be used by another model, such as a NeRF, to create the 3D object. The main challenge lies in training the diffusion model to maintain consistency across the different angles of the object.

According to the researchers, the key idea behind SyncDreamer is “to extend the diffusion framework to model the joint probability distribution of multiview images.” The paper says that “modeling the joint distribution can be achieved by introducing a synchronized multiview diffusion model.” 

The model has multiple noise predictors. For example, if the model creates eight views of the input image, it has eight noise predictors and learns to generate eight images simultaneously. These predictors learn their distributions together and share information, which enables them to synchronize their intermediate states and ensure consistency across the images they create, hence the name “SyncDreamer.”
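A rough sketch of what this synchronization could look like, with a single pooled context standing in for the model's actual cross-view information sharing (the module below is illustrative, not SyncDreamer's architecture):

```python
import torch
import torch.nn as nn

class SynchronizedDenoiser(nn.Module):
    """Illustrative sketch: N per-view noise predictors that exchange a
    shared context at every denoising step, nudging their outputs toward
    consistency. This greatly simplifies the paper's actual design."""
    def __init__(self, num_views=8, channels=64):
        super().__init__()
        self.predictors = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_views)
        )
        self.mix = nn.Conv2d(channels, channels, 1)  # stand-in for cross-view attention

    def forward(self, views):  # views: (num_views, batch, C, H, W)
        shared = self.mix(views.mean(dim=0))         # pooled cross-view context
        return torch.stack(
            [p(v + shared) for p, v in zip(self.predictors, views)]
        )
```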

SyncDreamer can reconstruct shapes from both photorealistic images and hand drawings, making it a versatile tool for tasks such as scene reconstruction or prototyping. The authors state, “Our method is able to generate images that not only are semantically consistent with the input image but also maintain multiview consistency in colors and geometry.” 

The images generated by SyncDreamer are consistent enough to be used directly with another generative model, such as a NeRF or NeuS, to reconstruct the object in 3D with impressive accuracy. While the reconstructed object may not be perfect, a 3D artist can refine it with little effort, making SyncDreamer a valuable tool for 3D object creation. 

SyncDreamer also has the ability to generate different plausible instances of the input image, providing more flexibility in creating objects. SyncDreamer can be combined with a text-to-image model such as Stable Diffusion or DALL-E, which produces the initial 2D image of the object. This combination allows designers to quickly generate ideas, iterate over them, and pass them on to 3D artists for refinement and integration into an asset library.

How SyncDreamer works

SyncDreamer architecture (image source: arxiv.org)

The authors explain, “we formulate the generation process as a multiview diffusion model that correlates the generation of each view.” During training, the algorithm generates several noisy images of the same object from different angles. It then selects one of the views at random, applies its corresponding noise predictor, and compares the prediction to the actual noise to tune the model’s parameters. Throughout the training process, the algorithm synchronizes the noise predictors of the different views to ensure they learn the same distribution.
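Here is a minimal sketch of that training step, reusing the add_noise helper from the earlier example. The predictor signature and sync_fn are placeholders for the paper's synchronization mechanism, not its actual API:

```python
import random
import torch

def multiview_training_step(predictors, views, t, sync_fn):
    """Sketch of the training step described above: noise every view,
    pick one at random, and train its predictor against the true noise
    while conditioning on a synchronized state shared across views."""
    noisy, noises = zip(*(add_noise(v, t) for v in views))  # add_noise from earlier sketch
    context = sync_fn(torch.stack(noisy))                   # stand-in for cross-view sharing
    i = random.randrange(len(views))                        # supervise a random view
    predicted = predictors[i](noisy[i], t, context)         # hypothetical signature
    return torch.nn.functional.mse_loss(predicted, noises[i])
```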

The model uses a UNet to denoise the noisy input. To maintain consistency across the multi-view images, the architecture includes a module that concatenates the features of the views and projects them into 3D space. A three-dimensional convolutional neural network (CNN) then captures the spatial relationships of the features and projects them back onto two-dimensional space. These features, along with attention layers in the main diffusion network, ensure that the denoising steps are aligned with the predicted features. The authors refer to this architecture as “3D-aware feature attention UNet.”
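The sketch below illustrates the general shape of such a module: lifting per-view 2D features into a voxel volume, mixing them with a 3D CNN, and collapsing back to 2D. The real model unprojects features using camera geometry, which is replaced here by a simple reshape for brevity:

```python
import torch
import torch.nn as nn

class VolumeFeatureMixer(nn.Module):
    """Illustrative stand-in for the 3D-aware feature module: view features
    are lifted into a shared voxel grid, mixed with a 3D CNN, then returned
    to 2D per-view maps. Not the paper's actual implementation."""
    def __init__(self, channels=32, depth=16):
        super().__init__()
        self.depth = depth
        self.lift = nn.Conv2d(channels, channels * depth, 1)          # 2D -> pseudo-3D
        self.mix3d = nn.Conv3d(channels, channels, 3, padding=1)      # spatial mixing
        self.collapse = nn.Conv3d(channels, channels, (depth, 1, 1))  # 3D -> 2D

    def forward(self, feats):  # feats: (batch, C, H, W) per-view features
        b, c, h, w = feats.shape
        vol = self.lift(feats).view(b, c, self.depth, h, w)  # build a voxel volume
        vol = self.mix3d(vol)                                # capture spatial relations
        return self.collapse(vol).squeeze(2)                 # project back to 2D
```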

The researchers trained their model on Objaverse, a dataset that contains around 800,000 annotated 3D objects and scenes. For each set of multi-view images that the model produces, they use NeuS, a type of deep learning architecture that can reconstruct 3D objects from 2D images, to build the final 3D shape.

The examples shared by the authors show SyncDreamer working on a range of artwork, including hand-drawn sketches and ink paintings. The impressive advances that generative AI has made in the past few years are a prelude to the fundamental shifts it can bring to many of the things we do every day.
