This article is part of our reviews of AI research papers, a series of posts that explore the latest findings in artificial intelligence.
Intel recently unveiled a deep learning system that turns 3D rendered graphics into photorealistic images. Tested on Grand Theft Auto 5, the neural network showed impressive results. The game’s developers have already done a great job of recreating Los Angeles and southern California in detail. But with Intel’s new machine learning system, the graphics turn from high-quality synthetic 3D to real-life depictions (with very minor glitches).
What’s even more impressive is that Intel’s AI does this at a relatively high framerate, as opposed to photorealistic render engines that can take minutes or hours for a single frame. And these are just preliminary results. The researchers say they can optimize the deep learning models to work much faster.
Does it mean that real-time photorealistic game engines are on the horizon, as some analysts have suggested? I would not bet on it yet, because several fundamental problems remain unsolved.
Deep learning for image enhancement
Before we can evaluate the feasibility of running real-time image enhancement, let’s have a high-level look at the deep learning system Intel has used.
The researchers at Intel have not provided full implementation details about the deep learning system they have developed. But they have published a paper on arXiv and posted a video on YouTube that provide useful hints on the kind of computing power you would need to run this model.
The full system is composed of several interconnected neural networks.
The G-buffer encoder transforms different render maps (G-buffers) into a set of numerical features. G-buffers are maps for surface normal information, depth, albedo, glossiness, atmosphere, and object segmentation. The neural network uses convolution layers to process this information and output a vector of 128 features that improve the performance of the image enhancement network and avoid artifacts that other similar techniques produce. The G-buffers are obtained directly from the game engine.
The image enhancement network takes as input the game’s rendered frame and the features from the G-buffer encoder and generates the photorealistic version of the image.
The remaining components, the discriminator and the LPIPS loss function, are used during training. They grade the output of the enhancement network by evaluating its consistency with the original game-rendered frame and by comparing its photorealistic quality with real images.
Inference costs for image enhancement
First, let’s see whether gamers will be able to run the technology on their computers if it becomes available. For this, we need to calculate inference costs, or how much memory and computing power you need to run the trained model. For inference, you’ll only need the G-buffer encoder and the image enhancement network, and we can cut the discriminator network.
The enhancement network accounts for the bulk of the work. According to Intel’s paper, this neural network is based on HRNetV2, a deep learning architecture meant for processing high-resolution images. High-resolution neural networks produce fewer visual artifacts than models that down-sample images.
According to Intel’s paper, “The HRNet processes an image via multiple branches that operate at different resolutions. Importantly, one feature stream is kept at relatively high resolution (1/4 of the input resolution) to preserve fine image structure.”
This means that, if you’re running the game at full HD (1920×1080), then the top row layers will be processing inputs at 480×270 pixels. The resolution halves on each of the lower rows. The researchers have changed the structure of each block in the neural network to also compute inputs from the G-buffer encoder (the RAD layers).
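The resolution arithmetic above can be sketched in a few lines of Python. The number of branches (four) is an assumption borrowed from typical HRNet configurations, not a figure stated in this article, and the floor division is a simplification, since a real network would pad or round the feature maps.

```python
def branch_resolutions(width, height, num_branches=4, top_scale=4):
    """Return the (width, height) each branch works on.

    The top branch runs at 1/top_scale of the input resolution and every
    lower branch halves it again. num_branches=4 is an assumption.
    """
    resolutions = []
    for branch in range(num_branches):
        scale = top_scale * (2 ** branch)
        # Floor division is a simplification of real padding/rounding.
        resolutions.append((width // scale, height // scale))
    return resolutions


for w, h in branch_resolutions(1920, 1080):
    print(f"{w}x{h}")
```

For a full HD frame, the first two entries are 480×270 and 240×135, which matches the quarter-resolution top row described above.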
According to Intel’s paper, the G-buffer’s inputs include “one-hot encodings for material information, dense continuous values for normals, depth, and color, and sparse continuous information for bloom and sky buffers.”
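To make the quote more concrete, here is a minimal sketch of how such mixed inputs could be flattened into one feature list for a single pixel. The material classes, the channel order, and the `encode_pixel` helper are all hypothetical and chosen for illustration; they are not taken from Intel’s paper.

```python
# Hypothetical material classes, for illustration only.
MATERIALS = ["road", "vegetation", "vehicle", "sky"]


def one_hot(material):
    """Encode a material label as a one-hot list over MATERIALS."""
    return [1.0 if m == material else 0.0 for m in MATERIALS]


def encode_pixel(material, normal, depth, color, bloom=0.0, sky=0.0):
    """Concatenate the G-buffer channels of one pixel into a flat list.

    Dense channels (normal, depth, color) always carry values, while
    sparse ones (bloom, sky) are zero for most pixels.
    """
    return one_hot(material) + list(normal) + [depth] + list(color) + [bloom, sky]


pixel = encode_pixel("road", normal=(0.0, 1.0, 0.0), depth=0.35, color=(0.2, 0.2, 0.2))
print(len(pixel), pixel)
```

In this toy layout, each pixel carries 13 numbers: four for the material, three for the normal, one for depth, three for color, and two for the sparse buffers.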
The researchers note elsewhere in their paper that the deep learning model can still perform well with a subset of the G-buffers.
So, how much memory does the model need? Intel’s paper doesn’t state the memory size, but according to the HRNetV2 paper, the full network requires 1.79 gigabytes of memory for a 1024×2048 input. The image enhancement network used by Intel has a smaller input size, but we also need to account for the extra parameters introduced by the RAD layers and the G-buffer encoder. Therefore, it would be fair to assume that you’ll need at least one gigabyte of video memory to run deep learning–based image enhancement for full HD games and probably more than two gigabytes if you want 4K resolution.
One gigabyte of memory is not much given that gaming computers commonly have graphics cards with 4-8 GB of VRAM. And high-end graphics cards such as the GeForce RTX series can have up to 24 GB of VRAM.
But it is also worth noting that 3D games consume much of the graphics card’s resources. Games store as much data as possible on video memory to speed up render times and avoid swapping between RAM and VRAM, an operation that incurs a huge speed penalty. According to one estimate, GTA 5 consumes up to 3.5 GB of VRAM at full HD resolution. And GTA was released in 2013. Newer games such as Cyberpunk 2077, which have much larger 3D worlds and more detailed objects, can easily gobble up to 7-8 GB of VRAM. And if you want to play at high resolutions, then you’ll need even more memory.
So basically, with the current mid- and high-end graphics cards, you’ll have to choose between low-resolution photorealistic quality and high-resolution synthetic graphics.
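The trade-off can be checked with some back-of-the-envelope arithmetic. The sketch below uses only the figures quoted above (3.5 GB for GTA 5, up to 8 GB for Cyberpunk 2077, at least 1 GB for the enhancer at full HD) and ignores driver and operating system overhead, which is a simplifying assumption.

```python
ENHANCER_GB = 1.0  # lower-bound estimate for full HD, as discussed above


def free_vram_gb(card_gb, game_gb, enhancer_gb=ENHANCER_GB):
    """VRAM left after loading both the game and the enhancer.

    A negative result means the two do not fit on the card together.
    Driver and OS overhead are ignored (simplifying assumption).
    """
    return card_gb - game_gb - enhancer_gb


scenarios = [
    ("GTA 5 on a 4 GB card", 4.0, 3.5),
    ("GTA 5 on an 8 GB card", 8.0, 3.5),
    ("Cyberpunk 2077 on an 8 GB card", 8.0, 8.0),
    ("Cyberpunk 2077 on a 24 GB card", 24.0, 8.0),
]

for label, card_gb, game_gb in scenarios:
    headroom = free_vram_gb(card_gb, game_gb)
    verdict = "fits" if headroom >= 0 else "does not fit"
    print(f"{label}: {headroom:+.1f} GB headroom ({verdict})")
```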
But memory usage is not the only problem deep learning–based image enhancement faces.
Delays caused by non-linear processing
A much bigger problem, in my opinion, is the sequential and non-linear nature of deep learning operations. To understand this problem, we must first compare 3D graphics processing with deep learning inference.
Three-dimensional graphics rely on very large numbers of matrix multiplications. A rendered frame of 3D graphics starts from a collection of vertices, which are basically a set of numbers that represent the properties (e.g., coordinates, color, material, normal direction, etc.) of points on a 3D object. Before every frame is rendered, the vertices must go through a series of matrix multiplications that map their local coordinates to world coordinates, then to camera space coordinates, and finally to image frame coordinates. An index buffer bundles vertices into groups of three to form triangles. These triangles are rasterized, or transformed into pixels, and every pixel then goes through its own set of matrix operations to determine its color based on material color, textures, reflection and refraction maps, transparency levels, etc.
This sounds like a lot of operations, especially when you consider that today’s 3D games are composed of millions of polygons. But there are two reasons you get very high framerates when playing games on your computer. First, graphics cards have been designed specifically for parallel matrix multiplications. As opposed to the CPU, which has at most a few dozen computing cores, graphics processors have thousands of cores, each of which can independently perform matrix multiplications.
Second, graphics transformations are mostly linear. And linear transformations can be bundled together. For instance, if you have separate matrices for world, view, and projection transformations, you can multiply them together to create one matrix that performs all three operations. This cuts the matrix multiplications applied to each vertex from three to one, a reduction of two-thirds. Graphics engines also use plenty of tricks to further cut down operations. For instance, if an object’s bounding box falls outside the view frustum (the pyramid that represents the camera’s perspective), it will be excluded from the render pipeline altogether. And triangles that are occluded by others are automatically removed from the pixel rendering process.
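Here is a minimal sketch of how linear transformations bundle together, written in plain Python so it runs without any libraries. The matrices are toy stand-ins: two translations play the role of the world and view transforms, and a uniform scale stands in for a real perspective projection, which would also involve a perspective divide.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transform(m, v):
    """Apply matrix m to column vector v."""
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]


def translation(tx, ty, tz):
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]


def scaling(s):
    return [[s, 0, 0, 0], [0, s, 0, 0], [0, 0, s, 0], [0, 0, 0, 1]]


world = translation(10, 0, 0)   # place the object in the world
view = translation(0, 0, -5)    # move the world relative to the camera
projection = scaling(0.5)       # toy stand-in for a projection matrix

vertex = [1, 2, 3, 1]           # homogeneous coordinates

# Three matrix-vector products per vertex...
step_by_step = transform(projection, transform(view, transform(world, vertex)))

# ...or one, after merging the matrices once per frame.
combined = matmul(projection, matmul(view, world))
one_shot = transform(combined, vertex)

print(step_by_step, one_shot)
```

Both paths give the same result, but the merged matrix is computed once and then reused for every vertex of the object.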
Deep learning also relies on matrix multiplications. Every neural network is composed of layers upon layers of matrix computations. This is why graphics cards have become very popular among the deep learning community in the past decade.
But unlike 3D graphics, the operations of deep learning can’t be combined. Layers in neural networks rely on non-linear activation functions to perform complicated tasks. Basically, this means that you can’t compress the transformations of several layers into a single operation.
For instance, say you have a deep neural network that takes a 100×100 pixel input image (10,000 features) and runs it through seven layers. A graphics card with several thousand cores might be able to process all pixels in parallel. But it will still have to perform the seven layers of neural network operations sequentially, which can make it difficult to provide real-time image processing, especially on lower-end graphics cards.
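The claim that non-linear layers can’t be merged can be checked on a toy example. The two 2×2 weight matrices below are arbitrary values picked for illustration. Without an activation function, the two layers collapse into one matrix, but with a ReLU in between, the merged matrix gives a different answer.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(m, v):
    """Apply matrix m to vector v."""
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]


def relu(v):
    """Element-wise ReLU: negative values become zero."""
    return [max(0.0, x) for x in v]


# Arbitrary toy weights for two layers.
W1 = [[1.0, -1.0], [-1.0, 1.0]]
W2 = [[1.0, 1.0], [0.0, 1.0]]
x = [2.0, 3.0]

# Purely linear layers: applying them in sequence equals one merged matrix.
linear_two_steps = matvec(W2, matvec(W1, x))
linear_merged = matvec(matmul(W2, W1), x)

# With a ReLU between the layers, the merged matrix no longer matches.
nonlinear = matvec(W2, relu(matvec(W1, x)))

print(linear_two_steps, linear_merged, nonlinear)
```

Because the ReLU zeroes out the negative intermediate value, the second layer sees a different input, so each layer has to wait for the previous one to finish.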
Therefore, another bottleneck we must consider is the number of sequential operations that must take place. If we consider the top row of the image enhancement network, there are 16 residual blocks that are sequentially linked. In each residual block, there are two convolution layers, RAD blocks, and ReLU operations that are sequentially linked. That amounts to 96 layers of sequential operations (16 blocks × 6 operations each). And the image enhancement network can’t start its operations before the G-buffer encoder outputs its feature encodings. Therefore, we must add at least the two residual blocks that process the first set of high-resolution features. That’s eight more layers added to the sequence, which brings us to at least 104 layers of operations for image enhancement.
This means that, in addition to memory, you need high clock speeds to run all these operations in time. Here’s an interesting quote from Intel’s paper: “Inference with our approach in its current unoptimized implementation takes half a second on a GeForce RTX 3090 GPU.”
The RTX 3090 has 24 GB of VRAM, which means the slow, 2 FPS render rate is not due to memory limitations but rather due to the time it takes to sequentially process all the layers of the image enhancer network. And this isn’t a problem that will be solved by adding more memory or CUDA cores, but by having faster processors.
Again, from the paper: “Since G-buffers that are used as input are produced natively on the GPU, our method could be integrated more deeply into game engines, increasing efficiency and possibly further advancing the level of realism.”
Integrating the image enhancer network into the game engine would probably give a good boost to the speed, but it won’t result in playable framerates.
For reference, we can go back to the HRNet paper. The researchers used a dedicated Nvidia V100, a massive and extremely expensive GPU specially designed for deep learning inference. With no memory limitation and no hindrance by other in-game computations, the inference time for the V100 was 150 milliseconds per input, which is ~7 fps, not nearly enough to play a smooth game.
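To put the two timings side by side, the sketch below converts them to frames per second and estimates how much faster inference would have to run to reach a target framerate. The 30 and 60 FPS targets are my assumption for what counts as playable, not figures from either paper.

```python
def fps(frame_time_s):
    """Frames per second for a given per-frame inference time."""
    return 1.0 / frame_time_s


def speedup_needed(frame_time_s, target_fps):
    """How many times faster inference must get to hit target_fps."""
    return frame_time_s * target_fps


# Per-frame inference times quoted above, in seconds.
measurements = {
    "RTX 3090, Intel enhancer": 0.5,
    "V100, HRNet": 0.150,
}

for label, seconds in measurements.items():
    print(f"{label}: {fps(seconds):.1f} FPS")
    for target in (30, 60):  # assumed playable targets
        factor = speedup_needed(seconds, target)
        print(f"  {target} FPS needs a {factor:.1f}x speedup")
```

Even the dedicated V100 figure would need to improve several times over before reaching the lower of the two targets.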
Developing and training neural networks
Another vexing problem is the development and training costs of the image-enhancing neural network. Any company that wants to replicate Intel’s deep learning models will need three things: data, computing resources, and machine learning talent.
Gathering training data can be very problematic. Luckily for Intel, someone had solved it for them. They used the Cityscapes dataset, a rich collection of annotated images captured from 50 cities in Germany. The dataset contains 5,000 finely annotated images. According to the dataset’s paper, each of the annotated images required an average of 1.5 hours of manual effort to precisely specify the boundaries and types of objects contained in the image. These fine-grained annotations enable the image enhancer to map the right photorealistic textures onto the game graphics. Cityscapes was the result of a huge effort supported by government grants, commercial companies, and academic institutions. It might prove to be useful for other games that, like Grand Theft Auto, take place in urban settings.
But what if you want to use the same technique in a game that doesn’t have a corresponding dataset? In that case, it will be up to the game developers to gather the data and add the required annotations (a photorealistic version of Rise of the Tomb Raider, maybe?).
Compute resources will also pose a challenge. Training a network of the size of the image enhancer for tasks such as image segmentation would be feasible for a few thousand dollars, which is not a problem for large gaming companies. But when you want to do a generative task such as photorealistic enhancement, then training becomes much more challenging. It requires a lot of testing and tweaking of hyperparameters, and many more epochs of training, which can blow up the costs. Intel tuned and trained their model exclusively for GTA 5. Games that are similar to GTA 5 might be able to slash training costs by fine-tuning Intel’s trained model on the new game. Others might need to test with totally new architectures. Intel’s deep learning model works well for urban settings, where objects and people are easily separable. But it’s not clear how it would perform in natural settings, such as jungles and caves.
Many gaming companies don’t have in-house machine learning engineers, so they’ll also have to outsource the task or hire engineers, which adds more costs. The company will have to decide whether the huge costs of adding photorealistic rendering are worth the improvement in gaming experience.
Intel’s photorealistic image enhancer shows how far you can push machine learning algorithms to perform interesting feats. But it will take a few more years before the hardware, the companies, and the market are ready for real-time AI-based photorealistic rendering.