AI

AI2 drops biggest open dataset yet for training language models

Comment

Dolma logo
Image Credits: AI2

Language models like GPT-4 and Claude are powerful and useful, but the data on which they are trained is a closely guarded secret. The Allen Institute for AI (AI2) aims to reverse this trend with a new, huge text dataset that’s free to use and open to inspection.

Dolma, as the dataset is called, is intended to be the basis for the research group’s planned open language model, or OLMo (Dolma is short for “Data to feed OLMo’s Appetite). As the model is intended to be free to use and modify by the AI research community, so too (argue AI2 researchers) should be the dataset they use to create it.

This is the first “data artifact” AI2 is making available pertaining to OLMo, and in a blog post, the organization’s Luca Soldaini explains the choice of sources and rationale behind various processes the team used to render it palatable for AI consumption. (“A more comprehensive paper is in the works,” they note at the outset.)

Although companies like OpenAI and Meta publish some of the vital statistics of the datasets they use to build their language models, a lot of that information is treated as proprietary. Apart from the known consequence of discouraging scrutiny and improvement at large, there is speculation that perhaps this closed approach is due to the data not being ethically or legally obtained: for instance, that pirated copies of many authors’ books are ingested.

Thousands of authors sign letter urging AI makers to stop stealing books

You can see in this chart created by AI2 that the largest and most recent models only provide some of the information that a researcher would likely want to know about a given dataset. What information was removed, and why? What was considered high versus low-quality text? Were personal details appropriately excised?

Chart showing different datasets’ openness or lack thereof. Image Credits: AI2

Of course it is these companies’ prerogative, in the context of a fiercely competitive AI landscape, to guard the secrets of their models’ training processes. But for researchers outside the companies, it makes those datasets and models more opaque and difficult to study or replicate.

AI2’s Dolma is intended to be the opposite of these, with all its sources and processes — say, how and why it was trimmed to original English language texts — publicly documented.

AI2 is developing a large language model optimized for science

It’s not the first to try the open dataset thing, but it is the largest by far (3 billion tokens, an AI-native measure of content volume) and, they claim, the most straightforward in terms of use and permissions. It uses the “ImpACT license for medium-risk artifacts,” which you can see the details about here. But essentially it requires prospective users of Dolma to:

  • Provide contact information and intended use cases
  • Disclose any Dolma-derivative creations
  • Distribute those derivatives under the same license
  • Agree not to apply Dolma to various prohibited areas, such as surveillance or disinformation

For those who worry that despite AI2’s best efforts, some personal data of theirs may have made it into the database, there’s a removal request form available here. It’s for specific cases, not just a general “don’t use me” thing.

If that all sounds good to you, access to Dolma is available via Hugging Face.

More TechCrunch

The American sports betting market produced $10.9 billion in revenue in 2023 for casinos, sportsbooks and iGaming, according to the American Gambling Association. One of the reasons this industry is…

DubClub wants amateur sports betters to win more

New climate tech VC firms have emerged in recent years, but existing ones are also raising larger funds. Founded in 2007, Dutch firm SET Ventures is one of the latter.…

Dutch clean energy investor SET Ventures lands new €200 million fund, which will go toward digital tech

Revefi connects to a company’s data stores and databases (e.g. Snowflake, Databricks and so on) and attempts to automatically detect and troubleshoot data-related issues.

Revefi seeks to automate companies’ data operations

If you build an AI search product, you compete with Google. But Google has a lot easier time answering queries with a single, simple answer, such as “how many is…

You.com ‘refocuses’ from AI search to deeper productivity agents with new $50M round

Featured Article

reMarkable’s Paper Pro adds color, light and more but keeps the focus on ‘focus’

The $499 Paper Pro — a new naming convention to indicate it is a higher-end alternative to the now-$379 reMarkable 2, not a direct successor — is momentous for its addition of both color and a “frontlight,” though both features are what you might call muted.

reMarkable’s Paper Pro adds color, light and more but keeps the focus on ‘focus’

Good news for Microsoft: The U.K.’s antitrust regulator says that the tech titan’s high-profile acquihire of the team behind AI startup Inflection doesn’t cause competition concerns, and thus it won’t…

UK regulator greenlights Microsoft’s Inflection acquihire, but also designates it a merger

In the summer of 2023, Lyft was contemplating the sale of its micromobility business after receiving strong interest from prospective buyers. Today, the ride-hail company is doubling down on its…

Why Lyft’s CEO says ‘it would be insane’ not to go all in on bikeshare

Spotify is launching daylist globally. It’s a personalized playlist that evolves throughout the day depending on your listening habits. This rollout comes after the company introduced it first to English-speaking…

Spotify launches its evolving playlist, daylist, globally

Digital lending platforms have become an easy and swift alternative source of credit for microenterprises and individuals overlooked by traditional banking institutions. These platforms have turned into a lifeline for…

Impact investors FMO and BlueOrchard back Ghana’s digital lender Fido in $30M Series B round

Indian online pharmacy startup PharmEasy, once valued at a lofty $5.6 billion, is still about 92% below its peak valuation, according to new estimates by its investor Janus Henderson. The…

PharmEasy still 92% below its peak $5.6 billion valuation, investor estimates

Palm launched in 2023 with the goal of making cash management for enterprise treasury teams easier.

From their experiences at Uber and PayPal, Palm founders want to make moving cash easier for big companies

Canva, the design platform, is increasing prices steeply for some customers. And it’s blaming the move in part on generative AI. In the U.S., some Canva Teams subscribers on older…

Canva has increased prices for its Teams product

Featured Article

Apple Event 2024: iPhone 16, Apple Intelligence and all the other expected ‘Glowtime’ reveals

Apple’s Glowtime iPhone event will include the iPhone 16, but may also feature new AirPods, a new Apple Watch and possibly even new Macs.

Apple Event 2024: iPhone 16, Apple Intelligence and all the other expected ‘Glowtime’ reveals

Snap is testing a “simplified version of Snapchat,” CEO Evan Spiegel wrote in a letter to employees published on Snap’s website Tuesday. The CEO says the simplified version aims to…

Snap CEO says the company is testing a ‘simplified’  Snapchat

Prevention is better than cure, as the saying goes. Today, a splashy startup that has taken that concept to heart — literally and figuratively — is expanding. Neko Health was…

Neko Health, the body-scanning AI health startup from Spotify’s Daniel Ek, opens in London

The Federal Trade Commission (FTC) published a report about increasing fraud at Bitcoin ATMs. These ATMs allow people to turn their cash into crypto, but they’ve become a tool for…

Bitcoin ATMs are a hotbed for scams, FTC says

Volkswagen is taking its ChatGPT voice assistant experiment on the road. Or more specifically, to vehicles it sells in the United States.  The German automaker announced in January at CES…

Volkswagen is rolling out its ChatGPT assistant to the US

From idea to IPO, Disrupt charts startups at every stage on the roadmap to their next breakthrough. TechCrunch will gather some of the startup world’s leading companies — but our…

Learn startup best practices with MongoDB, Venture Backed, InterSystems and others at Disrupt 2024

Android introduced five updates on Tuesday as part of its latest release of the mobile operating system. Available for smartphones, tablets and Wear OS watches, the new features include audio…

Android’s latest update improves text-to-speech, Circle to Search, earthquake alerts and more

Google announced on Tuesday it’s releasing Android 15 and making its source code available ahead of the coming consumer launch, which will bring the new mobile operating system to supported…

Android 15 will be available on supported Pixel devices in the coming weeks

As new users downloaded the app, Bluesky jumped to becoming the app to No. 1 in Brazil over the weekend, ahead of Meta’s X competitor, Instagram Threads.

Bluesky continues to soar, adding 2M more new users in a matter of days

Welcome to TechCrunch Fintech! This week, we’re looking at a new real estate startup that’s making big waves with its offering, Klarna and Affirm’s financials, a neobank focused on immigrants…

The flat-rate real estate startup that’s got big players worried and BNPL’s turning a corner

Instagram’s latest feature aims to boost user interaction within Stories. The social media platform now allows followers to comment on each other’s Stories, making the experience more community-focused, akin to…

As more Instagram users engage with Stories, the app adds a comments feature

Curious about how top venture capitalists are positioning themselves for the next wave in the crypto market?  Dragonfly Capital’s Haseeb Qureshi, Galaxy Ventures’ Will Nuelle, and NFX’s Morgan Beller will…

Dragonfly Capital, Galaxy Ventures and NFX share insights on crypto scaling and strategy at TechCrunch Disrupt 2024

Get ready for TechCrunch Disrupt 2024, our signature event for startups of all stages, happening at Moscone West in San Francisco from October 28-30. This year, we’re expecting a massive…

Announcing the final agenda for the Builders Stage at TechCrunch Disrupt 2024

Spotter, the startup that provides financial solutions to content creators, announced Tuesday the launch of its new AI-powered creative suite. Dubbed Spotter Studio, the solution aims to support YouTubers throughout the…

Spotter launches AI tools to help YouTubers brainstorm video ideas, thumbnails and more

This second fund is significant because Gupta expanded it beyond a corporate fund with one main LP — Prudential Financial — into one supported by a number of financial and…

Former Citi, Battery VC has new $378M fund to back financial services and enterprise startups

The oil and fracking giant says it is “working to identify effects” of the ongoing cyberattack on its oil and fracking operations.

Halliburton confirms data was stolen in ongoing cyberattack

Is Elon’s rumble in the Amazonian jungle on course for a technical knockout? Over the weekend, the Brazilian high court voted to uphold a ban on X that another judge issued…

Elon Musk’s Brazil battle wages on

Flexible green methanol, which is made without fossil fuels, could rid carbon pollution from a range of industries.

Oxylus Energy strikes ‘beautiful balance’ to make e-fuels for aviation and shipping