Evaluating AdEMAMix: A New Optimizer for Faster, More Efficient LLM Training

But good hyperparameter values are not easy to find!

Benjamin Marie
Oct 09, 2024
(Image generated with Grok)

AdamW has become the optimizer of choice for fine-tuning large language models (LLMs). However, numerous alternatives have been introduced to either reduce memory consumption or improve the learning curves. This week, in The Kaitchup, I analyzed various AdamW variants, focusing on their memory usage, training times, and learning curve characteristics.

Fine-tuning LLMs with 32-bit, 8-bit, and Paged AdamW Optimizers (Benjamin Marie, October 7, 2024)

Recently, EPFL and Apple proposed AdEMAMix, a new contender to replace AdamW. The authors claim AdEMAMix requires only half the training tokens to match AdamW’s performance, making it significantly more efficient.
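
To make the difference from AdamW concrete, here is a minimal, simplified sketch of the AdEMAMix update as described in the paper: it keeps AdamW's fast gradient EMA (beta1) and second-moment EMA (beta2), and adds a slow gradient EMA (beta3) that is mixed into the numerator with a coefficient alpha. The warmup schedulers the paper applies to alpha and beta3 are omitted, and the function and state names below are my own, for illustration only.

import torch

def ademamix_step(param, grad, state, lr=1e-4, beta1=0.9, beta2=0.999,
                  beta3=0.9999, alpha=5.0, eps=1e-8, weight_decay=0.0):
    """Simplified single-tensor AdEMAMix update (alpha/beta3 schedulers omitted)."""
    state["step"] += 1
    t = state["step"]

    # Fast EMA of gradients (same role as AdamW's first moment).
    state["m1"].mul_(beta1).add_(grad, alpha=1 - beta1)
    # Slow EMA of gradients: the extra moment AdEMAMix introduces.
    state["m2"].mul_(beta3).add_(grad, alpha=1 - beta3)
    # EMA of squared gradients (same as AdamW's second moment).
    state["v"].mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    # Bias correction is applied to the fast moment and the second moment only.
    m1_hat = state["m1"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)

    # Decoupled weight decay, as in AdamW.
    if weight_decay > 0:
        param.mul_(1 - lr * weight_decay)

    # The update mixes the fast and slow moments in the numerator.
    param.addcdiv_(m1_hat + alpha * state["m2"], v_hat.sqrt() + eps, value=-lr)

# Example state initialization for one parameter tensor p:
# state = {"step": 0, "m1": torch.zeros_like(p),
#          "m2": torch.zeros_like(p), "v": torch.zeros_like(p)}

Note the extra slow EMA (m2) is also why AdEMAMix stores one more state tensor per parameter than AdamW.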


Can AdEMAMix Realistically Replace AdamW?

For an optimizer to serve as a viable alternative to AdamW, I believe it should meet the following criteria:

  1. Achieve lower training loss on the same dataset (AdEMAMix appears to meet this).

  2. Be implemented in a popular framework such as Transformers, ensuring compatibility with a wide range of LLMs (AdEMAMix has been integrated since September 2024).

  3. Display robustness to hyperparameter adjustments to maintain flexibility and efficiency across different models.

AdEMAMix can achieve lower losses than AdamW, and its integration into Transformers includes support for quantization and paging, making it accessible on smaller GPUs as well. One major advantage of AdamW, however, is its resilience with default hyperparameters: searching for better values can be beneficial but is often unnecessary, since the default beta1 and beta2 perform nearly optimally in many cases. By contrast, alternatives like AdEMAMix may require more careful hyperparameter tuning, which could offset their cost-effectiveness in some scenarios.
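
Because the integration is exposed through Transformers' TrainingArguments, switching optimizers is, in principle, a one-line change. The sketch below is an assumption-laden example, not the notebook's code: the optimizer string and the use of optim_args for AdEMAMix-specific hyperparameters should be verified against the OptimizerNames enum of your Transformers version (a release from September 2024 or later) and your bitsandbytes install.

from transformers import TrainingArguments

# Minimal sketch: swapping AdamW for AdEMAMix in a standard Trainer setup.
# The optim string and the optim_args keys are assumptions; check the exact
# names supported by your Transformers/bitsandbytes versions.
training_args = TrainingArguments(
    output_dir="./llama3.2-ademamix",
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    # An 8-bit paged variant (if available, e.g. "paged_ademamix_8bit") would
    # mirror the paged 8-bit AdamW options for smaller GPUs.
    optim="ademamix",
    # Hypothetical way to override AdEMAMix-specific hyperparameters:
    optim_args="beta3=0.9999,alpha=5.0",
)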

In this article, we’ll review AdEMAMix’s strengths, explore its weaknesses, and test it on recent LLMs, such as Llama 3.2. We’ll draw learning curves and compare them directly with AdamW.

For an implementation of LLM (Llama 3.2) fine-tuning with AdEMAMix, check out the notebook here:

Get the notebook (#12)
