Feb 29, 2024 · Adam has been shown to outperform gradient descent on large language models by a larger margin than on other tasks, but it is unclear why.
Dec 13, 2023 · We provide experimental evidence that gradient descent struggles to fit classification problems with heavy-tailed imbalanced classes.
We show that the heavy-tailed class imbalance found in language modeling tasks leads to difficulties in optimization dynamics when training with gradient descent.
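To make "heavy-tailed class imbalance" concrete, here is a minimal sketch (an illustration added here, not taken from the paper's code) assuming token frequencies roughly follow a Zipfian distribution, where the k-th most frequent token appears with probability proportional to 1/k. A few head classes dominate while most classes sit in a long, rarely seen tail:

```python
# Hypothetical illustration of heavy-tailed class imbalance in next-token
# prediction: under an assumed Zipfian law, the k-th most frequent token
# (class) has probability proportional to 1/k.
def zipf_probs(vocab_size):
    weights = [1.0 / k for k in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]

probs = zipf_probs(10_000)

# The head dominates: a handful of classes account for a large share of all
# tokens, while a thousand tail classes together cover only a sliver.
head_mass = sum(probs[:10])        # 10 most frequent classes
tail_mass = sum(probs[9_000:])     # 1,000 least frequent classes
print(f"top-10 classes:      {head_mass:.1%} of tokens")
print(f"bottom-1000 classes: {tail_mass:.1%} of tokens")
```

Under this assumed distribution, the 10 most frequent classes cover roughly a third of all tokens, while the 1,000 rarest cover about one percent; the paper's claim is that gradient descent makes slow progress on such rare classes, whereas Adam does not.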
Adam outperforms gradient descent on language models: A heavy-tailed class imbalance problem. Robin Yadav, Frederik Kunstner, Mark Schmidt, Alberto Bietti.
This repository contains the code for the paper Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models.