Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models

Feb 29, 2024 · Adam has been shown to outperform gradient descent in optimizing large language transformers empirically, and by a larger margin than on other tasks, but it is unclear why. We show that the heavy-tailed class imbalance found in language modeling tasks leads to difficulties in the optimization dynamics: when training with gradient descent, the loss of infrequent words decreases more slowly than the loss of frequent ones.
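This claim lends itself to a small synthetic check. Below is a minimal sketch, not the authors' code: a linear softmax classifier is trained on Zipf-distributed class labels, comparing full-batch gradient descent against Adam and reporting the loss on frequent ("head") and infrequent ("tail") classes separately. The problem sizes, learning rates, and the head/tail cutoff of 10 classes are all illustrative assumptions.

```python
# Minimal sketch (illustrative, not the authors' code): softmax classification
# under a heavy-tailed (Zipf-like) class distribution, comparing full-batch
# gradient descent against Adam. Sizes and learning rates are assumptions.
import torch

torch.manual_seed(0)
num_classes, dim, n = 200, 32, 4000

# Zipf-like class frequencies: class k is sampled with probability ~ 1/(k+1).
freqs = 1.0 / torch.arange(1, num_classes + 1, dtype=torch.float)
freqs /= freqs.sum()
y = torch.multinomial(freqs, n, replacement=True)

# Inputs are noisy copies of a random per-class mean.
means = torch.randn(num_classes, dim)
x = means[y] + 0.5 * torch.randn(n, dim)

def train(opt_name, steps=500):
    model = torch.nn.Linear(dim, num_classes)
    opt = (torch.optim.Adam(model.parameters(), lr=1e-2) if opt_name == "adam"
           else torch.optim.SGD(model.parameters(), lr=0.1))
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(steps):  # full-batch steps, so SGD here is plain gradient descent
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    # Report the loss separately on frequent (head) and infrequent (tail) classes.
    with torch.no_grad():
        per_sample = torch.nn.functional.cross_entropy(model(x), y, reduction="none")
        head = per_sample[y < 10].mean().item()
        tail = per_sample[y >= 10].mean().item()
    print(f"{opt_name:>4}: head loss {head:.3f}, tail loss {tail:.3f}")

train("gd")
train("adam")
```

On a setup like this, one would expect the tail loss to lag the head loss under gradient descent while Adam narrows that gap more quickly, which is the qualitative behavior the paper attributes to heavy-tailed class imbalance.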
See also: Adam outperforms gradient descent on language models: A heavy-tailed class imbalance problem, by Robin Yadav, Frederik Kunstner, Mark Schmidt, Alberto Bietti.
This repository contains the code for the paper Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models.