This article is part of our series that explores the business of artificial intelligence.
The latest winner of the growing interest in enterprise AI is Databricks, a startup that has just secured $1.6 billion in series H funding at an insane valuation of $38 billion. This latest round of investment comes only months after Databricks raised another $1 billion.
Databricks is one of several companies that offer services and products for unifying, processing, and analyzing data stored in different sources and architectures. The category also includes Snowflake, which had a massive IPO last year and has a market cap of $90 billion, and C3.ai, another enterprise AI company that went public last year.
Why are investors enamored with companies like Databricks? Because they are addressing some of the biggest challenges standing in the way of companies that are trying to launch machine learning projects to cut down the costs of operations, improve products and user experience, and increase revenue.
There’s a lot of excitement around what companies like Databricks can do for the enterprise AI market. But whether the huge valuation is justified or a byproduct of the hype surrounding the market remains to be seen. Given the structure of these companies and their business models, it’s not clear how they will continue to sustain the growth that investors expect and whether they can withstand the long-term and inevitable competition that tech giants will bring.
Addressing data problems
Many companies are trying to improve data-driven operations and launch machine learning projects, but have a hard time harnessing their data infrastructure. Thanks to scalable cloud services, companies have been able to collect massive amounts of data without making upfront investments in IT infrastructure and talent.
But putting this data to use is easier said than done. At large companies that have been around for a while, data is usually spread across different systems and stored under different standards. They have a combination of classic schema-based data warehouses and schema-less data lakes, stored on company servers and in the cloud. Different data stores might use different conventions to register similar information, making them incompatible with each other. Some databases might contain sensitive information, which poses challenges to making them available to different data science and business intelligence teams.
All of this makes it very hard to consolidate the data and prepare it for consumption by machine learning models and business intelligence tools. In fact, different surveys show that the top barriers in applied machine learning projects are related to data engineering tasks and talent.
This is the problem that companies like Databricks are addressing. Databricks’s founders include the developers of Apache Spark, Delta Lake, and MLflow, three open-source projects that have become key components of machine learning projects running on very big and disparate data sources. Apache Spark is an analytics engine that processes large amounts of data in various formats. Delta Lake is a storage layer that brings data lakes and data warehouses together in an architecture that can be queried like a classic database. MLflow is a tool for managing machine learning pipelines and keeping track of different versions of models.
Lakehouse, Databricks’s main cloud service, uses all these projects to bring different sources of data together and enable data scientists and analysts to run workloads from a single platform.
The company’s unified platform makes it easy for business intelligence and machine learning teams to collaborate and share workspaces. It reduces the load of data engineering by providing unified access to disparate data sources. Under the hood, it can take care of problems such as incompatible schemas, anonymization, and switching between streaming and batch data.
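To make the "incompatible schemas" problem concrete, here is a minimal sketch in plain Python of what reconciling two sources involves. It is not how Databricks implements this. The field names, date formats, and the `SOURCE_SPECS` mapping are all hypothetical, chosen only to show two systems recording the same information under different conventions.

```python
from datetime import date, datetime

# Hypothetical records: two systems describe the same kind of customer
# data with different field names and different date conventions.
crm_rows = [{"customer_id": "C-001", "signup": "2021-03-15"}]
billing_rows = [{"CustID": "C-002", "SignupDate": "15/03/2021"}]

# Hypothetical per-source description of where each unified field lives.
SOURCE_SPECS = {
    "crm": {"id_field": "customer_id", "date_field": "signup", "date_format": "%Y-%m-%d"},
    "billing": {"id_field": "CustID", "date_field": "SignupDate", "date_format": "%d/%m/%Y"},
}

def normalize(row: dict, source: str) -> dict:
    """Map a source-specific row onto one shared schema."""
    spec = SOURCE_SPECS[source]
    parsed = datetime.strptime(row[spec["date_field"]], spec["date_format"])
    return {
        "customer_id": row[spec["id_field"]],
        "signup_date": parsed.date(),
        "source": source,  # keep provenance for later auditing
    }

unified = (
    [normalize(r, "crm") for r in crm_rows]
    + [normalize(r, "billing") for r in billing_rows]
)
```

Once every row shares one schema, downstream analytics and machine learning code no longer need to know which system a record came from. Doing this reliably across hundreds of sources is the data engineering burden these platforms aim to reduce.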
Like other services in the same category, Databricks’s platform supports Microsoft Azure, Amazon Web Services, and Google Cloud, the cloud infrastructure that most enterprises use to store their data. This gives Databricks the advantage of leveraging the sturdy and scalable infrastructure of major cloud providers and obviates the need for its customers to migrate their data (but also comes with some risk to its business, which I’ll discuss later).
Databricks’s services have great value for organizations with large stores of untapped data.
For example, AstraZeneca used Databricks’s platform to unify hundreds of internal and public data sources. This resulted in faster and smoother queries, better collaboration between teams, and faster operations, which is crucial to an industry that spends billions of dollars and years of research on finding promising hypotheses and running experiments.
HSBC used the platform to improve its fraud detection system and recommendation engine. The bank was able to consolidate 14 databases into a single Delta Lake that it made available to its data science and machine learning teams. The Delta Lake was set up to take care of some of the legal and regulatory requirements, such as anonymizing customer data before sending it to machine learning models. The improved data pipelines resulted in orders of magnitude improvement in operation speed, and it helped the machine learning teams to speed up the development, training, and tuning of models. The overall result was an improved customer experience and a 4.5X increase in user engagement on the bank’s mobile app PayMe.
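As an illustration of what "anonymizing customer data before sending it to machine learning models" can look like, the sketch below replaces a customer identifier with a keyed hash and drops direct identifiers. This is an assumption-laden example, not HSBC’s or Databricks’s actual method: the record fields, the `SECRET_KEY` value, and the function names are hypothetical. Keyed hashing is also strictly pseudonymization, which is weaker than full anonymization.

```python
import hashlib
import hmac

# Hypothetical key; in practice it would come from a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable keyed hash so records can be joined without the raw ID."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def prepare_for_training(record: dict) -> dict:
    """Keep only model features and swap the customer ID for a token."""
    return {
        "customer_token": pseudonymize(record["customer_id"]),
        "amount": record["amount"],
        "merchant_category": record["merchant_category"],
    }

# Hypothetical raw transaction record containing personal data.
raw = {
    "customer_id": "C-001",
    "name": "Jane Doe",
    "amount": 42.5,
    "merchant_category": "grocery",
}
safe = prepare_for_training(raw)
```

Because the token is deterministic for a given key, the same customer maps to the same token across tables, so models can still learn per-customer patterns without seeing names or raw account identifiers.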
A look at Databricks’s competitors shows a similar trend. C3.ai’s customers include oil-and-gas giants, government agencies, large manufacturers, and healthcare companies. Snowflake is serving supermarket and restaurant chains, packaged food and beverage companies, and healthcare organizations.
There’s also appeal for enterprise data management and AI services among tech companies, but the market is limited to companies that can’t set up their own data pipelines or are in the initial phases of machine learning projects. Most big tech companies have in-house talent and tools to tailor their data infrastructure to their needs and make optimal use of open-source and cloud services. An interesting case study is Twitter’s use of on-premises and cloud-based data management services to run machine learning workloads.
A competitive market
In its latest funding round, Databricks reported $600 million in annual recurring revenue (ARR), up from $425 million in 2020. This is the exciting kind of growth that has drawn investors to pour even more money into the company. Databricks’s $38 billion valuation is largely due to investors betting on the company’s ability to sustain this pace of growth.
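A quick back-of-the-envelope calculation puts these figures in perspective. The sketch below uses only the numbers quoted above. The variable names are mine, and the valuation-to-ARR ratio is a rough illustrative multiple rather than a formal valuation method.

```python
# Figures quoted in the article, in millions of US dollars.
arr_2020 = 425
arr_latest = 600
valuation = 38_000

# Relative ARR growth between the two reported figures.
arr_growth = (arr_latest - arr_2020) / arr_2020

# How many times the latest ARR the valuation represents.
valuation_multiple = valuation / arr_latest

print(f"ARR growth: {arr_growth:.1%}")                       # ARR growth: 41.2%
print(f"Valuation / ARR: {valuation_multiple:.1f}x")         # Valuation / ARR: 63.3x
```

In other words, investors are paying roughly 63 times current recurring revenue, a price that only makes sense if growth of about 41 percent or more continues for years.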
But there are several challenges that Databricks and its peers must overcome.
First, the market is very competitive. As Databricks CEO Ali Ghodsi told TechCrunch, “[Data lakehouses are] a new category, and we think there’s going to be lots of vendors in this data category. So it’s a land grab. We want to quickly race to build it and complete the picture.”
In some markets, companies take advantage of network effects or superior data to keep their customers locked in and maintain the edge over competitors. In the data-processing industry, the dynamics of the market are different. While Databricks provides a very useful technology, it is something other companies can copy. And since the company’s technology builds on top of major cloud providers, there will be little barrier for customers to switch to competitors.
This means that success will largely depend on the customer acquisition strategies of the market players and their ability to retain customers through continued innovation.
Growth will also depend largely on the kind of customers the company acquires. Databricks announced in its latest round of funding that it has 5,000 customers. Since the company hasn’t filed for an IPO yet, we don’t know the details of its financials. But if the competition is any indication, a few very large customers will account for a large part of its revenue. For example, C3.ai earned 36 percent of its revenue in 2020 from Baker Hughes and Engie. And according to the S-1 filing of Snowflake, nearly 30 percent of its revenue in the first half of 2020 came from 153 of its 3,000 customers.
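The Snowflake figures show how lopsided this concentration is. The sketch below works through the arithmetic, treating "nearly 30 percent" as exactly 30 percent for simplicity. That rounding and the variable names are my assumptions.

```python
# Snowflake figures quoted in the article (first half of 2020).
top_customers = 153
total_customers = 3_000
top_revenue_share = 0.30  # assumption: "nearly 30 percent" rounded to 30%

# Share of the customer base made up by the top group.
top_customer_share = top_customers / total_customers

# Average revenue share per customer inside and outside the top group.
avg_top = top_revenue_share / top_customers
avg_rest = (1 - top_revenue_share) / (total_customers - top_customers)

# How much more an average top customer pays than an average other customer.
spend_ratio = avg_top / avg_rest

print(f"Top customers: {top_customer_share:.1%} of the base")  # 5.1% of the base
print(f"Average top customer spends {spend_ratio:.1f}x more")  # 8.0x more
```

Under that assumption, about 5 percent of customers generate 30 percent of revenue, and the average large customer is worth roughly eight ordinary ones. Losing even one of them has an outsized effect on the top line.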
These companies will grow as long as they can acquire big new customers that are willing to spend large amounts. But once the market becomes saturated, growth will plateau. Then, they will have to upsell to existing customers with new services, which is very difficult, or snatch customers from each other by providing more competitive prices, which will drive down revenue. The loss of every big customer will have a dramatic impact on the financials of each of these companies.
The future of the market
The competitive nature of the market will have the positive effect of driving enterprise AI companies to innovate at a rapid pace. But at some point, the market will face fierce competition from big tech companies.
All three cloud providers have products that can evolve into the kind of services Databricks provides. Google has BigQuery, Microsoft has Azure Synapse, and Amazon has Redshift.
Once the market matures, expect the cloud giants to make their move to get their share. Given their deep pockets, the big three can either buy the smaller data management companies or lure away their customers with more competitive prices.
Of special concern for these companies is Microsoft, which already has deep penetration in the non-tech markets where Databricks and others are thriving, thanks to its enterprise collaboration tools.
Microsoft also has a partnership with Databricks, and a considerable number of Databricks’s large customers are on the Azure Databricks platform. And Microsoft has a history of turning partnerships into acquisitions.
In discussions with the media, Ghodsi did not rule out the possibility of an IPO. But I wouldn’t be surprised if his company ends up becoming a Microsoft subsidiary.