By Nick Ni
The field of artificial intelligence moves swiftly, and the pace of innovation is only accelerating. While the software industry has successfully deployed AI in production, the hardware industry – including automotive, industrial, and smart retail – is still in its infancy when it comes to AI productization. Major gaps still keep AI algorithm proofs-of-concept (PoCs) from becoming real hardware deployments. These gaps stem largely from small-data problems, “non-perfect” inputs, and ever-changing “state-of-the-art” models. How can software developers and AI scientists overcome these challenges? The answer lies in adaptable hardware.
Internet giants, such as Google and Facebook, routinely collect and analyze massive amounts of data every day. They then use this data to create AI models that have acceptable performance off the bat. In such cases, the hardware used to train the models is very different from the hardware used to run the models.
On the other hand, in the hardware industry, the availability of big data is much more limited, resulting in less mature AI models. Therefore, there is a major push to collect more data and run “online models,” where training and inference are performed on the same deployed hardware to continuously improve the accuracy.
To address this, adaptive computing devices, such as field-programmable gate arrays (FPGAs) and adaptable system-on-chip (SoC) devices that are proven at the edge, can run both inference and training, constantly updating themselves with newly captured data. Traditional AI training requires the cloud or large on-premises data centers and takes days or weeks. The real data, on the other hand, is generated mostly at the edge. Running both AI inference and training in the same edge device not only improves the total cost of ownership (TCO) but also reduces latency and the risk of security breaches.
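As a rough illustration of what an “online model” means in practice, the sketch below keeps inference and training in one object and updates it after every sample. Everything here is an assumption made for illustration: the class and function names are hypothetical, the model is a tiny linear regressor, and the data is a synthetic, noise-free stream rather than real sensor input. It says nothing about how training is actually mapped onto an FPGA or adaptable SoC.

```python
# Minimal sketch (hypothetical names): one model that serves predictions and
# learns from every new labelled sample, as an "online model" would at the edge.


class OnlineLinearModel:
    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        """Inference step: a plain weighted sum."""
        return sum(w * x for w, x in zip(self.weights, features)) + self.bias

    def update(self, features, target):
        """Training step: one stochastic-gradient update on squared error.

        Returns the prediction error measured before the update.
        """
        error = self.predict(features) - target
        for i, x in enumerate(features):
            self.weights[i] -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error
        return error


def make_stream(passes):
    """Synthetic, noise-free stand-in for sensor data: y = 2*x0 - x1 + 0.5."""
    points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.25)]
    return [(p, 2.0 * p[0] - p[1] + 0.5) for _ in range(passes) for p in points]


def run_stream(model, stream):
    """Predict, then learn from, each sample in arrival order."""
    abs_errors = []
    for features, target in stream:
        abs_errors.append(abs(model.update(features, target)))
    return abs_errors


if __name__ == "__main__":
    model = OnlineLinearModel(n_features=2)
    errors = run_stream(model, make_stream(passes=400))
    print("first error:", errors[0], "last error:", errors[-1])
```

The point of the sketch is only that the prediction error shrinks as the deployed model keeps seeing data, without a round trip to a separate training system.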
While it’s becoming easier to publish an AI model PoC that shows, for example, better accuracy in detecting COVID-19 from X-ray images, these PoCs are almost always based on carefully cleaned-up input pictures. In real life, camera and sensor inputs from medical devices, robots, and moving cars contain random distortions, such as dark images and objects seen at various angles. These inputs must first go through sophisticated preprocessing that cleans and reformats them before they can be fed into AI models. Postprocessing is just as important: it makes sense of the AI model outputs and turns them into the proper decision.
Indeed, some chips may be very good at AI inference acceleration, but they almost always accelerate only a portion of the full application. Using smart retail as an example, preprocessing includes multi-stream video decode followed by conventional computer vision algorithms that resize, reshape, and format-convert the videos. Postprocessing includes object tracking and database look-up. End customers care less about how fast the AI inference runs than about whether they can meet the video stream performance and/or real-time responsiveness of the full application pipeline. FPGAs and adaptable SoCs have a proven track record of accelerating these pre- and postprocessing algorithms using domain-specific architectures (DSAs). Adding an AI inference DSA then allows the whole system to be optimized end to end to meet the product requirements.
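A back-of-the-envelope sketch shows why the full pipeline matters more than inference speed alone. The stage names follow the smart retail example above, but the frames-per-second figures are invented purely for illustration, and the model assumes an idealized pipeline in which every stage works on a different frame at the same time.

```python
# Hypothetical per-stage throughput for a smart retail pipeline, in frames per
# second. These numbers are made up for illustration, not measured.
STAGES_FPS = {
    "video_decode": 400.0,
    "resize_and_convert": 600.0,
    "ai_inference": 2000.0,
    "object_tracking": 300.0,
    "database_lookup": 1000.0,
}


def pipeline_throughput(stage_fps):
    """Return (slowest stage, fps): a fully pipelined chain runs at the pace
    of its slowest stage."""
    bottleneck = min(stage_fps, key=stage_fps.get)
    return bottleneck, stage_fps[bottleneck]


def per_frame_latency_ms(stage_fps):
    """Time for one frame to pass through every stage in sequence."""
    return sum(1000.0 / fps for fps in stage_fps.values())


if __name__ == "__main__":
    print(pipeline_throughput(STAGES_FPS))
    # Doubling inference speed alone leaves the bottleneck where it was.
    faster = dict(STAGES_FPS, ai_inference=4000.0)
    print(pipeline_throughput(faster))
    print(per_frame_latency_ms(STAGES_FPS), "ms per frame")
```

With these assumed figures, object tracking limits the whole system, so making inference faster changes nothing the end customer can see; only accelerating the surrounding stages does.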
Constantly changing “state-of-the-art” models
The AI research community is arguably one of the most active anywhere, with new AI models being invented daily by top researchers around the world. These models improve accuracy, reduce computational requirements, and address new types of AI applications. This fast innovation continues to put pressure on existing semiconductor hardware devices, demanding newer architectures to efficiently support modern algorithms. Standard benchmarks, such as MLPerf, show that state-of-the-art CPUs, GPUs, and AI ASIC chips fall well below 30 percent of the vendor-advertised performance when running real-life AI workloads. This constantly drives the need for new DSAs to keep up with the innovation.
Several recent trends are driving this need. Depthwise convolution is an emerging layer that requires large memory bandwidth and specialized internal memory caching to be efficient. Typical AI chips and GPUs have a fixed L1/L2/L3 cache architecture and limited internal memory bandwidth, resulting in very low efficiency.
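The bandwidth point can be made concrete with a simple count of multiply-accumulate operations (MACs) per byte moved. The sketch below is an idealized estimate under stated assumptions: stride 1 with same-size output, a depthwise multiplier of 1, one byte per value, and each tensor read or written exactly once. The function name and the example layer shape are hypothetical.

```python
def conv_layer_stats(height, width, channels_in, channels_out, kernel,
                     depthwise=False, bytes_per_value=1):
    """Estimate MACs and ideal memory traffic for one convolution layer.

    Assumes stride 1, same-size output, and that inputs, weights, and outputs
    each cross the memory interface exactly once.
    """
    if depthwise:
        if channels_in != channels_out:
            raise ValueError("this sketch assumes a depthwise multiplier of 1")
        # Each channel is filtered independently by its own kernel.
        macs = height * width * channels_in * kernel * kernel
        weights = channels_in * kernel * kernel
    else:
        # Every output channel combines every input channel.
        macs = height * width * channels_in * channels_out * kernel * kernel
        weights = channels_in * channels_out * kernel * kernel

    values_moved = (height * width * channels_in      # input feature map
                    + weights                         # filter weights
                    + height * width * channels_out)  # output feature map
    bytes_moved = values_moved * bytes_per_value
    return {"macs": macs, "bytes": bytes_moved,
            "macs_per_byte": macs / bytes_moved}


if __name__ == "__main__":
    standard = conv_layer_stats(56, 56, 128, 128, 3)
    depthwise = conv_layer_stats(56, 56, 128, 128, 3, depthwise=True)
    print("standard :", standard)
    print("depthwise:", depthwise)
```

For this example shape, the depthwise layer moves nearly as much data as the standard one but performs roughly two orders of magnitude fewer MACs per byte, so its speed is set by memory bandwidth rather than by compute units.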
Researchers are constantly inventing new custom layers that today’s chips do not support natively. These layers must therefore run on host CPUs without acceleration, where they often become the performance bottleneck.
Sparse neural networks are another promising optimization, in which networks are heavily pruned, sometimes by up to 99 percent, by trimming network edges, removing fine-grained matrix values in convolutions, and so on. However, running them efficiently in hardware requires a specialized sparse architecture, plus an encoder and decoder for these operations, which most chips simply do not have.
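To give a feel for what pruning and sparse encoding involve, here is a small software sketch. It assumes magnitude-based pruning and the compressed sparse row (CSR) format, which are common choices but only one possibility; the article does not say which pruning method or encoding a given DSA would use, and all function names are hypothetical.

```python
def prune_by_magnitude(matrix, sparsity):
    """Zero out roughly the smallest `sparsity` fraction of weights.

    Ties at the threshold are also dropped, so slightly more may be removed.
    """
    magnitudes = sorted(abs(v) for row in matrix for v in row)
    n_drop = int(len(magnitudes) * sparsity)
    if n_drop == 0:
        return [list(row) for row in matrix]
    threshold = magnitudes[n_drop - 1]
    return [[0.0 if abs(v) <= threshold else v for v in row] for row in matrix]


def to_csr(matrix):
    """Encode: keep only nonzero values, their columns, and row boundaries."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply the encoded matrix by a vector, skipping all zeros."""
    out = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for p in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[p] * x[col_idx[p]]
        out.append(acc)
    return out


if __name__ == "__main__":
    weights = [[0.9, -0.05, 0.0, 0.2],
               [0.01, 0.0, -0.7, 0.03],
               [0.0, 0.4, 0.02, -0.6]]
    pruned = prune_by_magnitude(weights, sparsity=0.5)
    values, col_idx, row_ptr = to_csr(pruned)
    print(len(values), "of 12 weights kept")
    print(csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0, 4.0]))
```

The irregular, index-driven memory access in `csr_matvec` is the kind of work that fixed dense compute units handle poorly, which is why the paragraph above calls for dedicated sparse hardware.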
Binary and ternary networks are the most extreme optimizations, transforming all math operations into bit-manipulation operations. Most AI chips and GPUs only have 8-bit, 16-bit, or floating-point calculation units, so you will not gain any performance or power efficiency from extremely low precision.
FPGAs and adaptable SoCs are well suited here, because a developer can design the ideal DSA and reprogram the existing device for the exact workload of the product. As a proof point, the latest MLPerf included a submission by Xilinx, in collaboration with Mipsology, that achieved 100 percent of the hardware datasheet performance on the standard ResNet-50 benchmark.
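To show how binary networks turn arithmetic into bit manipulation, the sketch below computes a dot product of two vectors whose entries are all +1 or -1 using only XNOR and a population count. The encoding (bit 1 for +1, bit 0 for -1) and the function names are assumptions chosen for illustration; this is the commonly described XNOR-popcount trick, not a description of any particular vendor's implementation.

```python
def pack_signs(values):
    """Pack a +1/-1 vector into an integer: bit i is 1 when values[i] is +1."""
    bits = 0
    for i, v in enumerate(values):
        if v > 0:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n.

    XNOR marks positions where the signs agree (product +1); every other
    position contributes -1, so the result is 2 * matches - n.
    """
    mask = (1 << n) - 1
    matches = bin(~(a_bits ^ b_bits) & mask).count("1")
    return 2 * matches - n


if __name__ == "__main__":
    a = [1, -1, 1, 1, -1, -1, 1, -1]
    b = [1, 1, -1, 1, -1, 1, 1, -1]
    packed = binary_dot(pack_signs(a), pack_signs(b), len(a))
    reference = sum(x * y for x, y in zip(a, b))
    print(packed, reference)
```

No multiplier is involved at any point, which is why hardware built around 8-bit or floating-point units gains little from such networks while bit-level logic can implement them directly.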
No hardware expertise? No problem
Historically, the biggest challenge for FPGAs and adaptable SoCs has been the need for hardware expertise to implement and deploy DSAs. The good news is that now there are tools – like the Vitis unified software platform – that support C++, Python, and popular AI frameworks like TensorFlow and PyTorch, closing the gap for software and AI developers.
In addition to further development of software abstraction tools, open-source libraries, such as the Vitis hardware-accelerated libraries, are significantly boosting adoption within the developer community. The most recent Xilinx design contest attracted more than 1,000 developers and produced many innovative published projects, from a hand-gesture-controlled drone to reinforcement learning using a binarized neural network. Importantly, most of the submitted projects came from software and AI developers who had no previous experience with FPGAs. This is evidence that the FPGA industry is taking the right steps to enable software and AI developers to solve real-world AI productization challenges.
Until recently, the power of hardware adaptability was out of reach for the average software developer and AI scientist, because it required specific hardware expertise. Thanks to new open-source tools, software developers can now work with adaptable hardware directly. With this new ease of programming, FPGAs and adaptable SoCs will continue to become more accessible to hundreds of thousands of software developers and AI scientists, making these devices the hardware solution of choice for next-generation applications. Indeed, DSAs will represent the future of AI inference.
About the author
Nick Ni is the Director of AI Products – software and ecosystem at Xilinx. Ni has a master’s degree in Computer Engineering from the University of Toronto and holds over 10 patents and publications.