Is it possible you’ve been collecting data for no reason?

Artificial intelligence has ignited a fire of excitement throughout the business world. The idea of systems that can identify, categorize and comprehend inputs, allowing computers to effectively see, hear, feel and speak, suggests virtually unlimited applications.

We are just scratching the surface of what can be accomplished with AI, but we face daunting challenges in implementation. The AI revolution is picking up speed in some industries yet is slow to seep into others. Why is that?

There are several reasons for the fragmented adoption of AI in the enterprise. First, it takes time for new ideas to percolate through organizations and for companies to carefully consider their options and the best course of action. Second, many organizations face structural obstacles such as cost, training and education, and the biggest challenge of all: access to abundant, reliable, and well-organized data.

The reason is that modern AI algorithms are based on supervised learning, in which raw data is paired with labels. Given sufficient samples, these algorithms infer the relationship between data and labels and can then be used to predict labels for new data. For example, in order for an AI system to recognize cats in images, it needs a large, representative training set in which the cats have been labeled by human annotators.
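To make that setup concrete, here is a minimal sketch of supervised learning: data paired with labels, a model that infers the relationship, and predictions on unseen data. The random features, the logistic-regression classifier and the "cat" labels are illustrative assumptions, not a real vision pipeline.

```python
# Toy supervised-learning loop: labeled data in, label predictions out.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))               # 200 samples of "raw data" (e.g. image features)
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # labels from annotators: 1 = "cat", 0 = "no cat"

model = LogisticRegression().fit(X, y)      # infer the relationship between data and labels
X_new = rng.normal(size=(5, 8))             # new, unlabeled data
print(model.predict(X_new))                 # predicted labels for the new data
```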

Given this paradigm, companies of all stripes have begun the important process of analyzing the potential for artificial intelligence to transform their businesses. A natural place to start is to look at what data they already have access to, or what they might easily collect. You may have heard the phrase “data is the new oil”, which captures this data-first perspective.

But identifying and collecting data is only the first step. In order to prove out the potential for a proposed AI system, a business must prepare that data by labeling, annotating and categorizing it and then use the prepared data to test and iterate on various deep learning algorithms.

The main problem with this approach is that data collection and preparation are hugely costly and time-consuming processes. Labeling by humans is generally the most accurate approach, and often the only viable one. Research shows that data scientists spend up to 80 percent of their time simply cleaning data, a huge waste of their talent and productivity.

If we take a hypothetical labeling task where every label takes thirty seconds to complete, and each image has three significant features that require labeling, then a labeled dataset of 1 million images would take a single person over 150 months to finish. The task can be accelerated by employing multiple data labelers, but the cost remains regardless, and multiple labelers introduce a further issue: maintaining accuracy and consistency between them. Beyond the cost and limited scalability of this labeling approach, humans are simply unable to label many key attributes (distance to an object, 3D position, partially occluded objects) required for emerging applications such as AR/VR and autonomous vehicle perception and navigation.
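For the curious, the estimate above works out roughly as follows, using the stated numbers plus one assumption not in the text: about 160 productive hours per labeler per month.

```python
# Back-of-the-envelope version of the labeling estimate.
images = 1_000_000
labels_per_image = 3
seconds_per_label = 30
hours_per_month = 160                                    # assumption, not from the text

total_hours = images * labels_per_image * seconds_per_label / 3600
months = total_hours / hours_per_month
print(f"{total_hours:,.0f} hours, roughly {months:.0f} months for one labeler")
# -> 25,000 hours, roughly 156 months
```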

Another problem with this data-first approach to AI development and testing is that the desired functionality of the system can be somewhat of an afterthought that arises out of the development process. In our experience, this can lead to missed targets, project creep and results that are out of touch with the needs of your actual business units and your people on the ground.

There is another way, however. At Neuromation, we have pioneered a Synthetic Data technique that can dramatically lower the cost and increase the speed of prototyping and developing new AI applications for your business.

Synthetic Data that is used to train deep learning computer vision models generally takes the form of digitally created images, video or 3D environments. Much in the same way that humans can effectively learn to pilot airplanes by using a flight simulator instead of practicing in actual planes, an AI system can be trained using Synthetic Data rather than real data. With this technique, the time and cost associated with labeling are virtually eliminated as pixel-perfect labels are provided by the Synthetic Data generator instantaneously and at no additional cost. Furthermore, the incremental cost of each additional image is nearly zero, allowing prototyping, testing and model development to be carried out using extremely large datasets.
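As an illustration of why those labels are free and pixel-perfect, here is a toy sketch of a synthetic image generator: because the generator places the object itself, it also knows the exact bounding box and per-pixel mask, with no annotation step. The scene, the single rectangular "object" and the NumPy-based rendering are purely illustrative assumptions, not a description of any production pipeline.

```python
import numpy as np

def generate_sample(size=128, obj=32, rng=np.random.default_rng(0)):
    image = np.zeros((size, size, 3), dtype=np.uint8)        # blank scene
    x, y = rng.integers(0, size - obj, size=2)                # place the object
    image[y:y + obj, x:x + obj] = rng.integers(64, 255, 3)    # "render" it with a random color
    mask = np.zeros((size, size), dtype=bool)
    mask[y:y + obj, x:x + obj] = True                         # pixel-perfect segmentation mask
    label = {"class": "box",
             "bbox": (int(x), int(y), int(x + obj), int(y + obj)),
             "mask": mask}
    return image, label

# the incremental cost of each extra labeled image is essentially zero
images, labels = zip(*(generate_sample() for _ in range(1_000)))
```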

One example of the benefit of prototyping AI systems with Synthetic Data comes from our own enterprise practice: a major retailer considering the use of camera systems for inventory management, customer analytics and customer/product interactions. By prototyping with Synthetic Data, they were able to easily understand the relative value of the number, type and location of cameras in their stores without having to go through a prolonged process of building representative hardware, acquiring data under various configurations, labeling the images and building various models.

The benefits of Synthetic Data go far beyond faster and more efficient prototyping. Overfitting to over-represented classes is a common problem for AI systems trained on imbalanced data, in which certain labels appear far more often than others. Bias is another pervasive problem, stemming from collected data that does not adequately represent the full range of variation that occurs in reality. Synthetic generation can guarantee a well-balanced dataset, addressing both issues at the source.
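A small sketch of what "balanced by construction" means in practice: a hypothetical class-conditional generator is simply called the same number of times per class, so the dataset is exactly balanced no matter how rare each class is in collected data. The class names and placeholder generator below are illustrative assumptions.

```python
from collections import Counter

def generate_labeled_sample(cls):
    # stand-in for a real renderer; returns (data, label) with the label known exactly
    return {"pixels": None}, cls

classes = ["product_present", "product_missing", "product_misplaced"]
dataset = [generate_labeled_sample(c) for c in classes for _ in range(10_000)]
print(Counter(label for _, label in dataset))   # 10,000 samples of each class, by design
```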

Synthetic Data also allows for reliable generation of edge cases, which can be extremely difficult or impossible to capture in real life. Another example from our own work is an AI system for accurately identifying rare diseases, trained on synthetically generated disease symptoms under a variety of environmental conditions. Given that some of these diseases occur in only one in 10 million people, such data would be virtually impossible to collect otherwise. Other verticals in which edge cases matter include autonomous vehicles (accidents, environmental conditions), manufacturing (QA of defects), and infrastructure (identification of rare faults).
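The same control applies to edge cases: a generator can be asked to produce rare conditions at whatever rate training requires, rather than the rate at which they occur in reality. The condition names and weights below are illustrative assumptions, not figures from any real project.

```python
import random

conditions = {
    "typical_presentation": 0.70,     # plentiful in real data anyway
    "rare_presentation": 0.25,        # vanishingly rare in the wild, oversampled here
    "occluded_or_low_light": 0.05,    # hard-to-capture environmental edge case
}

def sample_condition(rng=random.Random(0)):
    # draw scene conditions according to the training mix we want,
    # not the frequency nature gives us
    names, weights = zip(*conditions.items())
    return rng.choices(names, weights=weights, k=1)[0]

training_mix = [sample_condition() for _ in range(100_000)]
```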

So, before you launch a time-consuming and expensive process of data collection and preparation, consider whether Synthetic Data could allow you to prototype, test and iterate on potential AI applications far more quickly, cheaply and accurately. Then, once your project is up and running and moving in the right direction, you can more confidently begin the process of real data capture and preparation, and look at hybrid strategies in which real data is complemented and enhanced by Synthetic Data for improved accuracy and balance.
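As a rough sketch of what such a hybrid strategy might look like in code, here is a minimal helper that tops up a real dataset with generated samples to reach a chosen mix. The load_real_samples and generate_synthetic_sample helpers are hypothetical placeholders, not a real API.

```python
def build_hybrid_dataset(load_real_samples, generate_synthetic_sample,
                         synthetic_fraction=0.5):
    real = list(load_real_samples())
    n_synthetic = int(len(real) * synthetic_fraction / (1.0 - synthetic_fraction))
    synthetic = [generate_synthetic_sample() for _ in range(n_synthetic)]
    return real + synthetic          # downstream training code sees one combined dataset
```

The right mix is something to tune per project; the point is simply that real and synthetic data complement each other rather than compete.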
