BLOOM can set a new culture for AI research—but challenges remain

July 18, 2022

BLOOM large language model — Image credit: 123RF (with modifications)

This article is part of our coverage of the latest in AI research.

This week, the BigScience research project released BLOOM, a large language model that, at first glance, looks like another attempt to reproduce OpenAI’s GPT-3.

But what makes BLOOM different from other LLMs is the effort that went into researching, developing, training, and releasing the machine learning model.

While big tech companies have in recent years guarded their LLMs like trade secrets, BigScience has put transparency and openness at the center of BLOOM since the beginning of the project.

The result is a large language model that is highly accessible for research and study and available to everyone. The open-source and open-collaboration example that BLOOM has set can be very beneficial to the future of research in LLMs and other areas of artificial intelligence. But some of the challenges that are inherent to large language models remain to be solved.

What is BLOOM?

BLOOM stands for “BigScience Large Open-science Open-access Multilingual Language Model.” From the figures, it doesn’t look much different from GPT-3 and OPT-175B. It is a very large transformer model with 176 billion parameters that has been trained on 1.6 terabytes of data, including natural language and software source code.

Like GPT-3, it can perform many tasks with zero- and few-shot learning, including text generation, summarization, question answering, and programming.
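To make few-shot use concrete, here is a minimal sketch that assembles a prompt from a handful of worked examples followed by a new query. The function name, the `Input:`/`Output:` layout, and the sample translations are illustrative assumptions, not a format prescribed by BigScience. The resulting string would be passed to whichever text-generation interface is in use.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]], query: str, instruction: str = ""
) -> str:
    """Build a prompt of worked examples followed by an unanswered query.

    With an empty `examples` list this reduces to a zero-shot prompt.
    The layout is a hypothetical convention chosen for illustration.
    """
    parts = [instruction] if instruction else []
    for source, target in examples:
        parts.append(f"Input: {source}\nOutput: {target}")
    # The final block leaves the output empty for the model to complete.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo = build_few_shot_prompt(
        examples=[("Bonjour", "Hello"), ("Merci", "Thank you")],
        query="Bonsoir",
        instruction="Translate French to English.",
    )
    print(demo)
```

The model is never retrained here: the examples live only in the prompt, which is what distinguishes few-shot prompting from fine-tuning.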

But what makes BLOOM significant is the organization behind it and the process that went into building it.

BigScience is a research project that was bootstrapped in 2021 by Hugging Face, the popular hub for machine learning models. According to its website, the project “aims to demonstrate another way of creating, studying, and sharing large language models and large research artefacts in general within the AI/NLP research communities.”

In this regard, BigScience takes “inspiration from scientific creation schemes such as CERN and the LHC, in which open scientific collaborations facilitate the creation of large-scale artefacts that are useful for the entire research community.”

In the span of a year, starting from May 2021, more than 1,000 researchers from 60 countries and more than 250 institutions worked together at BigScience to create BLOOM.

Transparency, openness, and inclusivity

While most major LLMs have been trained exclusively on English text, BLOOM’s training corpus includes 46 natural languages and 13 programming languages. This makes it useful for the many regions where English is not the main language.

BLOOM is also a break from the de facto reliance on big tech to train models. One of the main problems of LLMs is the prohibitive costs of training and tuning them. This hurdle has made 100-billion-parameter LLMs the exclusive domain of big tech companies with deep pockets. Recent years have seen AI labs gravitate toward big tech to gain access to subsidized cloud compute resources and fund their research.

In contrast, BigScience got a 3 million euro grant from the Centre National de la Recherche Scientifique (French National Center for Scientific Research) to train BLOOM on the supercomputer Jean Zay. There were no deals to give commercial companies exclusive license to the technology, and no commitment to commercialize the model and turn it into a profitable product.

Furthermore, the team has been completely transparent about the entire process of training the model. They have published the dataset, the meeting notes, discussions, and code, as well as the logs and technical details of training the model.

Researchers are studying the model’s data and metadata and publishing interesting findings.

I've been playing with the training dataset behind the extremely cool new BLOOM model from @BigscienceW and @huggingface. Here's a sample of 10 million chunks from the English-language corpus, about 1.25% (!!) of the total. Encoded with `all-distilroberta-v1`, then UMAP to 2d. pic.twitter.com/a00zBWw83c
— David McClure (@clured) July 12, 2022

And of course, the trained model itself is available for download on Hugging Face’s platform, which relieves researchers of the pain of spending millions of dollars on training.

Last month, Facebook open-sourced one of its LLMs under some restrictions. However, the level of transparency that BLOOM brings is unprecedented and will hopefully set a new standard for the industry.

“BLOOM is a demonstration that the most powerful AI models can be trained and released by the broader research community with accountability and in an actual open way, in contrast to the typical secrecy of industrial AI research labs,” said BLOOM Training co-lead, Teven Le Scao.

Challenges remain

While BigScience’s efforts to bring openness and transparency to AI research and large language models are commendable, the inherent challenges of the field remain unchanged.

LLM research is trending toward bigger and bigger models, which will further increase the costs of training and running them. BLOOM was trained on 384 Nvidia A100 GPUs (~$32,000 each). Larger models will require even larger compute clusters. BigScience has declared that it will continue to create other open-source LLMs, but it remains to be seen how it will fund its increasingly costly research. (OpenAI, which started out as a non-profit organization, ended up becoming a for-profit organization that sells products and relies on funding from Microsoft.)
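As a rough back-of-the-envelope check on the hardware figure above, the sketch below multiplies the GPU count by the approximate per-unit price quoted in this article. It covers GPU purchase price only. Power, networking, storage, and staff costs are ignored, and the function and constant names are illustrative.

```python
# Back-of-the-envelope GPU cost for the BLOOM training cluster.
# Both inputs are the figures quoted in the article; the price is approximate.
NUM_GPUS = 384
PRICE_PER_GPU_USD = 32_000  # approximate price per A100, as quoted above


def cluster_hardware_cost(num_gpus: int, price_per_gpu: float) -> float:
    """Return the total GPU purchase cost, ignoring everything but the GPUs."""
    return num_gpus * price_per_gpu


if __name__ == "__main__":
    total = cluster_hardware_cost(NUM_GPUS, PRICE_PER_GPU_USD)
    print(f"Approximate GPU hardware cost: ${total:,.0f}")
```

At these figures the GPUs alone come to roughly $12.3 million, several times the size of the grant mentioned earlier, which helps explain why access to an existing supercomputer mattered.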

Another problem that remains to be solved is the huge cost of running the models. The compressed BLOOM model is 227 gigabytes in size. Running it requires specialized hardware with hundreds of gigabytes of VRAM. For comparison, GPT-3 requires a computing cluster that is the equivalent of an Nvidia DGX-2, which is priced at around $400,000. Hugging Face plans to launch an API platform that enables researchers to use the model for around $40 per hour, which is not a small cost.
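To see where the "hundreds of gigabytes" figure comes from, the following sketch estimates the memory needed just to hold 176 billion weights at different numeric precisions, and how many accelerators that implies. The 80 GB per-GPU capacity is an assumption made for illustration, and the estimate ignores activations and other runtime overhead, so real requirements are higher. These raw-weight numbers are not expected to match the compressed download size quoted above. The last helper compares the two prices mentioned in the paragraph purely as an illustration.

```python
import math

PARAMS = 176e9  # BLOOM parameter count, from the article

# Bytes per parameter for common numeric formats (standard sizes).
BYTES_PER_PARAM = {"float32": 4, "float16/bfloat16": 2, "int8": 1}


def weights_gb(params: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def gpus_needed(total_gb: float, gpu_memory_gb: float = 80.0) -> int:
    """Minimum GPUs to hold the weights; 80 GB per GPU is an assumption."""
    return math.ceil(total_gb / gpu_memory_gb)


def break_even_hours(hardware_usd: float, api_usd_per_hour: float) -> float:
    """Hours of hourly-billed API use that cost as much as buying hardware."""
    return hardware_usd / api_usd_per_hour


if __name__ == "__main__":
    for fmt, nbytes in BYTES_PER_PARAM.items():
        gb = weights_gb(PARAMS, nbytes)
        print(f"{fmt}: {gb:.0f} GB of weights, at least {gpus_needed(gb)} GPUs")
    # Illustrative only: the $400,000 figure is quoted for GPT-3-class hardware.
    print(f"Break-even: {break_even_hours(400_000, 40):,.0f} hours")
```

Even at 16-bit precision the weights alone exceed the memory of any single GPU, so the model has to be sharded across several devices before the first token is generated.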

The costs of running BLOOM will also affect the applied ML community, startups and organizations that want to build products powered by LLMs. Currently, the GPT-3 API offered by OpenAI is much more attuned to product development. It will be interesting to see which directions BigScience and Hugging Face will take to enable developers to create products on top of their valuable research.

In this regard, I’m looking forward to the smaller versions of the model that BigScience plans to release in the future. Contrary to the way they are often portrayed in the media, LLMs still follow the “no free lunch” theorem. This means that when it comes to applied ML, a more compact model that has been fine-tuned for a specific task is more efficient than a very large model that has average performance on many tasks. An example is Codex, a modified version of GPT-3 that provides superb programming assistance at a fraction of GPT-3’s size and costs. GitHub is currently offering Copilot, a product built on Codex, at $10 per month.

With the new culture that BLOOM hopes to establish, it will be interesting to see which directions academic and applied AI will take in the future.
