Evaluating AdEMAMix: A New Optimizer for Faster, More Efficient LLM Training

But good hyperparameter values are not easy to find!

Benjamin Marie
Oct 09, 2024
(Image generated with Grok)

AdamW has become the optimizer of choice for fine-tuning large language models (LLMs). However, numerous alternatives have been introduced to either reduce memory consumption or improve the learning curves. This week, in The Kaitchup, I analyzed various AdamW variants, focusing on their memory usage, training times, and learning curve characteristics.

Fine-tuning LLMs with 32-bit, 8-bit, and Paged AdamW Optimizers (Benjamin Marie, October 7, 2024)

Recently, EPFL and Apple proposed AdEMAMix, a new contender to replace AdamW. The authors claim AdEMAMix requires only half the training tokens to match AdamW’s performance, making it significantly more efficient.
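
To make the difference from AdamW concrete, here is a minimal, simplified sketch of the AdEMAMix update as described in the paper: it keeps AdamW's fast gradient EMA (beta1) and second-moment EMA (beta2), and adds a slow gradient EMA (beta3) that is mixed into the numerator with a coefficient alpha. The warmup schedulers the paper applies to alpha and beta3 are omitted, and the function and state names below are my own, for illustration only.

import torch

def ademamix_step(param, grad, state, lr=1e-4, beta1=0.9, beta2=0.999,
                  beta3=0.9999, alpha=5.0, eps=1e-8, weight_decay=0.0):
    """Simplified single-tensor AdEMAMix update (alpha/beta3 schedulers omitted)."""
    state["step"] += 1
    t = state["step"]

    # Fast EMA of gradients (same role as AdamW's first moment).
    state["m1"].mul_(beta1).add_(grad, alpha=1 - beta1)
    # Slow EMA of gradients: the extra moment AdEMAMix introduces.
    state["m2"].mul_(beta3).add_(grad, alpha=1 - beta3)
    # EMA of squared gradients (same as AdamW's second moment).
    state["v"].mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    # Bias correction is applied to the fast moment and the second moment only.
    m1_hat = state["m1"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)

    # Decoupled weight decay, as in AdamW.
    if weight_decay > 0:
        param.mul_(1 - lr * weight_decay)

    # The update mixes the fast and slow moments in the numerator.
    param.addcdiv_(m1_hat + alpha * state["m2"], v_hat.sqrt() + eps, value=-lr)

# Example state initialization for one parameter tensor p:
# state = {"step": 0, "m1": torch.zeros_like(p),
#          "m2": torch.zeros_like(p), "v": torch.zeros_like(p)}

Note the extra slow EMA (m2) is also why AdEMAMix stores one more state tensor per parameter than AdamW.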


Can AdEMAMix Realistically Replace AdamW?

For an optimizer to serve as a viable alternative to AdamW, I believe it should meet the following criteria:

  1. Achieve lower training loss on the same dataset (AdEMAMix appears to meet this).

  2. Be implemented in a popular framework such as Transformers, ensuring compatibility with a wide range of LLMs (AdEMAMix has been integrated since September 2024).

  3. Display robustness to hyperparameter adjustments to maintain flexibility and efficiency across different models.

AdEMAMix can achieve lower losses than AdamW, and its integration into Transformers includes support for quantization and paging, making it accessible on smaller GPUs as well. One major advantage of AdamW, however, is its resilience with default hyperparameters: searching for better values can be beneficial but is often unnecessary, since the default beta1 and beta2 perform nearly optimally in many cases. By contrast, alternatives like AdEMAMix may require more careful hyperparameter tuning, which could offset their cost-effectiveness in some scenarios.
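
Because the integration is exposed through Transformers' TrainingArguments, switching optimizers is, in principle, a one-line change. The sketch below is an assumption-laden example, not the notebook's code: the optimizer string and the use of optim_args for AdEMAMix-specific hyperparameters should be verified against the OptimizerNames enum of your Transformers version (a release from September 2024 or later) and your bitsandbytes install.

from transformers import TrainingArguments

# Minimal sketch: swapping AdamW for AdEMAMix in a standard Trainer setup.
# The optim string and the optim_args keys are assumptions; check the exact
# names supported by your Transformers/bitsandbytes versions.
training_args = TrainingArguments(
    output_dir="./llama3.2-ademamix",
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    # An 8-bit paged variant (if available, e.g. "paged_ademamix_8bit") would
    # mirror the paged 8-bit AdamW options for smaller GPUs.
    optim="ademamix",
    # Hypothetical way to override AdEMAMix-specific hyperparameters:
    optim_args="beta3=0.9999,alpha=5.0",
)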

In this article, we’ll review AdEMAMix’s strengths, explore its weaknesses, and test it on recent LLMs, such as Llama 3.2. We’ll draw learning curves and compare them directly with AdamW.

For an implementation of LLM (Llama 3.2) fine-tuning with AdEMAMix, check out the notebook here:

Get the notebook (#12)
