
AI2 open sources text-generating AI models — and the data used to train them



The Allen Institute for AI (AI2), the nonprofit AI research institute founded by late Microsoft co-founder Paul Allen, is releasing several GenAI language models it claims are more “open” than others and, importantly, licensed in such a way that developers can use them unfettered for training, experimentation and even commercialization.

Called OLMo, an acronym for “Open Language Models,” the models and the dataset used to train them, Dolma — one of the largest public datasets of its kind — were designed to study the high-level science behind text-generating AI, according to AI2 senior software engineer Dirk Groeneveld.

“‘Open’ is an overloaded term when it comes to [text-generating models],” Groeneveld told TechCrunch in an email interview. “We expect researchers and practitioners will seize the OLMo framework as an opportunity to analyze a model trained on one of the largest public data sets released to date, along with all the components necessary for building the models.”

Open source text-generating models are becoming a dime a dozen, with organizations from Meta to Mistral releasing highly capable models for any developer to use and fine-tune. But Groeneveld makes the case that many of these models can’t really be considered open because they were trained “behind closed doors” and on proprietary, opaque sets of data.

By contrast, the OLMo models, which were created with the help of partners including Harvard, AMD and Databricks, ship with the code that was used to produce their training data as well as training and evaluation metrics and logs.

In terms of performance, the most capable OLMo model, OLMo 7B, is a “compelling and strong” alternative to Meta’s Llama 2, depending on the application, Groeneveld asserts. On certain benchmarks, particularly those touching on reading comprehension, OLMo 7B edges out Llama 2. But on others, particularly question-answering tests, OLMo 7B is slightly behind.

The OLMo models have other limitations, like low-quality outputs in languages that aren’t English (Dolma contains mostly English-language content) and weak code-generating capabilities. But Groeneveld stressed that it’s early days.

“OLMo is not designed to be multilingual yet,” he said. “[And while] at this stage, the primary focus of the OLMo framework [wasn’t] code generation, to give a head start to future code-based fine-tuning projects, OLMo’s data mix currently contains about 15% code.”

I asked Groeneveld whether he was concerned that the OLMo models, which can be used commercially and are performant enough to run on consumer GPUs like the Nvidia 3090, might be leveraged in unintended, possibly malicious ways by bad actors. A recent study by Democracy Reporting International’s Disinfo Radar project, which aims to identify and address disinformation trends and technologies, found that two popular open text-generating models, Hugging Face’s Zephyr and Databricks’ Dolly, reliably generate toxic content — responding to malevolent prompts with “imaginative” harmful content.

Groeneveld believes that the benefits outweigh the harms in the end.

“[B]uilding this open platform will actually facilitate more research on how these models can be dangerous and what we can do to fix them,” he said. “Yes, it’s possible open models may be used inappropriately or for unintended purposes. [However, this] approach also promotes technical advancements that lead to more ethical models; is a prerequisite for verification and reproducibility, as these can only be achieved with access to the full stack; and reduces a growing concentration of power, creating more equitable access.”

In the coming months, AI2 plans to release larger and more capable OLMo models, including multimodal models (i.e. models that understand modalities beyond text), and additional datasets for training and fine-tuning. As with the initial OLMo and Dolma release, all resources will be made available for free on GitHub and the AI project hosting platform Hugging Face.
