AI

Tokens are a big reason today’s generative AI falls short

Comment

LLM word with icons as vector illustration. AI concept of Large Language Models
Image Credits: Getty Images

Generative AI models don’t process text the same way humans do. Understanding their “token”-based internal environments may help explain some of their strange behaviors — and stubborn limitations.

Most models, from small on-device ones like Gemma to OpenAI’s industry-leading GPT-4o, are built on an architecture known as the transformer. Due to the way transformers conjure up associations between text and other types of data, they can’t take in or output raw text — at least not without a massive amount of compute.

So, for reasons both pragmatic and technical, today’s transformer models work with text that’s been broken down into smaller, bite-sized pieces called tokens — a process known as tokenization.

Tokens can be words, like “fantastic.” Or they can be syllables, like “fan,” “tas” and “tic.” Depending on the tokenizer — the model that does the tokenizing — they might even be individual characters in words (e.g., “f,” “a,” “n,” “t,” “a,” “s,” “t,” “i,” “c”).

Using this method, transformers can take in more information (in the semantic sense) before they reach an upper limit known as the context window. But tokenization can also introduce biases.

Some tokens have odd spacing, which can derail a transformer. A tokenizer might encode “once upon a time” as “once,” “upon,” “a,” “time,” for example, while encoding “once upon a ” (which has a trailing whitespace) as “once,” “upon,” “a,” ” .” Depending on how a model is prompted — with “once upon a” or “once upon a ,” — the results may be completely different, because the model doesn’t understand (as a person would) that the meaning is the same.

Tokenizers treat case differently, too. “Hello” isn’t necessarily the same as “HELLO” to a model; “hello” is usually one token (depending on the tokenizer), while “HELLO” can be as many as three (“HE,” “El,” and “O”). That’s why many transformers fail the capital letter test.

“It’s kind of hard to get around the question of what exactly a ‘word’ should be for a language model, and even if we got human experts to agree on a perfect token vocabulary, models would probably still find it useful to ‘chunk’ things even further,” Sheridan Feucht, a PhD student studying large language model interpretability at Northeastern University, told TechCrunch. “My guess would be that there’s no such thing as a perfect tokenizer due to this kind of fuzziness.”

This “fuzziness” creates even more problems in languages other than English.

Many tokenization methods assume that a space in a sentence denotes a new word. That’s because they were designed with English in mind. But not all languages use spaces to separate words. Chinese and Japanese don’t — nor do Korean, Thai or Khmer.

A 2023 Oxford study found that, because of differences in the way non-English languages are tokenized, it can take a transformer twice as long to complete a task phrased in a non-English language versus the same task phrased in English. The same study — and another — found that users of less “token-efficient” languages are likely to see worse model performance yet pay more for usage, given that many AI vendors charge per token.

Tokenizers often treat each character in logographic systems of writing — systems in which printed symbols represent words without relating to pronunciation, like Chinese — as a distinct token, leading to high token counts. Similarly, tokenizers processing agglutinative languages — languages where words are made up of small meaningful word elements called morphemes, such as Turkish — tend to turn each morpheme into a token, increasing overall token counts. (The equivalent word for “hello” in Thai, สวัสดี, is six tokens.)

In 2023, Google DeepMind AI researcher Yennie Jun conducted an analysis comparing the tokenization of different languages and its downstream effects. Using a dataset of parallel texts translated into 52 languages, Jun showed that some languages needed up to 10 times more tokens to capture the same meaning in English.

Beyond language inequities, tokenization might explain why today’s models are bad at math.

Rarely are digits tokenized consistently. Because they don’t really know what numbers are, tokenizers might treat “380” as one token, but represent “381” as a pair (“38” and “1”) — effectively destroying the relationships between digits and results in equations and formulas. The result is transformer confusion; a recent paper showed that models struggle to understand repetitive numerical patterns and context, particularly temporal data. (See: GPT-4 thinks 7,735 is greater than 7,926).

That’s also the reason models aren’t great at solving anagram problems or reversing words.

So, tokenization clearly presents challenges for generative AI. Can they be solved?

Maybe.

Feucht points to “byte-level” state space models like MambaByte, which can ingest far more data than transformers without a performance penalty by doing away with tokenization entirely. MambaByte, which works directly with raw bytes representing text and other data, is competitive with some transformer models on language-analyzing tasks while better handling “noise” like words with swapped characters, spacing and capitalized characters.

Models like MambaByte are in the early research stages, however.

“It’s probably best to let models look at characters directly without imposing tokenization, but right now that’s just computationally infeasible for transformers,” Feucht said. “For transformer models in particular, computation scales quadratically with sequence length, and so we really want to use short text representations.”

Barring a tokenization breakthrough, it seems new model architectures will be the key.

More TechCrunch

There’s a man in Florida right now who wants to propose to his girlfriend while they’re on a beach vacation. He couldn’t get the engagement ring before he flew down…

The CrowdStrike outage is a plot point in a rom-com 

Here’s everything you need to know so far about the global outages caused by CrowdStrike’s buggy software update.

What we know about CrowdStrike’s update fail that’s causing global outages and travel chaos

The CrowdStrike outage that hit early Friday morning and knocked out computers running Microsoft Windows has grounded flights globally. Major U.S. airlines including United Airlines, American Airlines and Delta Air…

CrowdStrike outage: How your plane, train and automobile travel may be affected

Featured Article

Faulty CrowdStrike update causes major global IT outage, taking out banks, airlines and businesses globally

Security giant CrowdStrike said the outage was not caused by a cyberattack, as businesses anticipate widespread disruption.

Faulty CrowdStrike update causes major global IT outage, taking out banks, airlines and businesses globally

This serves as an example for how easy it is to spread inaccurate information online during a time of immense global confusion and panic.

From the Sphere to false cyberattack claims, misinformation runs rampant amid CrowdStrike outage

Today is the final chance to save up to $800 on TechCrunch Disrupt 2024 tickets. Disrupt Deal Days event will end tonight at 11:59 p.m. PT. Don’t miss out on…

Last chance today: Secure major savings for TechCrunch Disrupt 2024!

Indian fintech Paytm’s struggles won’t seem to end. The company on Friday reported that its revenue declined by 36% and its loss more than doubled in the first quarter as…

Paytm loss widens and revenue shrinks as it grapples with regulatory clampdown

J. Michael Cline, the co-founder of Fandango and multiple other startups over his multi-decade career, died after falling from a Manhattan hotel, New York’s Deputy Commissioner of Public Information tells…

Fandango founder dies in fall from Manhattan skyscraper

Venture capital giant a16z fixed a security vulnerability in one of the firm’s websites after being warned by a security researcher.

Researcher finds flaw in a16z website that exposed some company data

Apple on Thursday announced its upcoming lineup of immersive video content for the Vision Pro. The list includes behind-the-scenes footage of the 2024 NBA All-Star Weekend, an immersive performance by…

Apple Vision Pro debuts immersive content featuring NBA players, The Weeknd and more

Biden centering Musk in his campaign is a notable escalation, considering he spent most of his presidency seemingly pretending the billionaire didn’t exist.

Elon Musk is now a villain in Joe Biden’s presidential campaign

Waymo would need a ground transportation permit to operate at SFO, which has yet to be approved.

Waymo wants to bring robotaxis to SFO, emails show

When Tade Oyerinde first set out to fundraise for his startup, Campus, a fully accredited online community college, it was incredibly difficult. VCs have backed for-profit education companies in the…

Why it made sense for an online community college to raise venture capital

Canadian private equity firm PartnerOne paid $28.2 million for HeadSpin, a mobile app testing startup whose founder was sentenced for fraud earlier this year, according to documents viewed by TechCrunch.…

PE firm PartnerOne paid $28M for HeadSpin, a fraction of its $1.1B valuation set by ICONIQ and Dell Technologies Capital

Meta has suspended the use of its AI assistant after Brazil’s National Data Protection Authority (ANPD) banned the company from training its AI models on personal data from Brazilians. The…

Meta puts a halt to training its generative AI tools in Brazil 

ChatGPT, OpenAI’s text-generating AI chatbot, has taken the world by storm since its launch in November 2022. What started as a tool to hyper-charge productivity through writing essays and code…

ChatGPT: Everything you need to know about the AI-powered chatbot

The Mumbai-based firm said one of its multisig wallets had suffered a security breach, and it was temporarily pausing all withdrawals from the platform.

WazirX halts withdrawals after losing $230 million, nearly half its reserves

This week’s TechCrunch Mobility looks at Fisker scoring a win, an AV startup rebooting in Texas, why Elon is pushing the Tesla robotaxi reveal and more.

Fisker scores a win, an AV startup reboots in Texas, and why Elon pushed the Tesla robotaxi reveal

Apple Intelligence was designed to leverage things that generative AI already does well, like text and image generation, to improve upon existing features.

What is Apple Intelligence, when is it coming and who will get it?

The European Union’s president, Ursula von der Leyen, was confirmed in the role for another five years Thursday after parliamentarians voted overwhelmingly to re-elect her. The scale of her support…

The EU just re-elected its president for another five years — here’s what that means for tech

Olivia DeRamus is flipping the script: What if scrolling through social media didn’t make us miserable? What if, especially for women, social media could actually make us feel more supported?…

Communia bets social media can be good for you

TikTok is partnering with the music distribution service DistroKid to fast-track the creation of artist accounts for members. The ByteDance-owned short video platform introduced an Artist Account feature last year…

TikTok fast-tracks artist account creation for DistroKid members

Ford is still pushing forward on electrification, notably by increasing hybrid options.

Ford’s EV plans are in flux once again as it invests $3B into its biggest trucks

OpenAI introduced GPT-4o mini on Thursday, its latest small AI model. The company says GPT-4o mini, which is cheaper and faster than OpenAI’s current cutting-edge AI models, is being released…

OpenAI unveils GPT-4o mini, a smaller and cheaper AI model

Featured Article

USPS shared customer postal addresses with Meta, LinkedIn and Snap

The U.S. Postal Service confirmed it took action to “remediate” the data sharing following a TechCrunch investigation.

USPS shared customer postal addresses with Meta, LinkedIn and Snap

The automotive industry is in the midst of dramatic technological change as companies seek out new ways to make money beyond building and selling gas-powered cars. And GM CEO and…

GM CEO Mary Barra is coming to TechCrunch Disrupt 2024

Amazon’s Prime Day event clocked record sales this year, as U.S. consumers spent $14.2 billion across July 16 and 17, according to Adobe Analytics. This marks an 11% jump from…

Amazon Prime Day 2024 sales hit record $14.2 billion

The now-scrapped moon mission would have been the U.S. space agency’s first resource-mapping mission off planet Earth.

NASA cancels $450M Viper moon mission, dashing ice prospecting dreams

Marathon Fusion emerges from stealth with a plan to help fusion power plants extract tritium, a necessary fuel.

Without this company’s technology, future fusion power plants might never light up

A security researcher found that some traffic lights controllers are exposed on the internet and could be manipulated.

Hackers could create traffic jams thanks to flaw in traffic light controller, researcher says