This week, we read two excellent papers that propose significant modifications to the standard LLM neural architecture:
⭐Transformers without Normalization
⭐ Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
⭐: Papers that I particularly recommend reading.
New code repositories (list of all repositories):
Transformers without Normalization
Normalization layers have become a fundamental part of modern neural networks, originally popularized by Batch Norm in 2015 for improving convergence in vision models. Over the years, numerous variants have been introduced, with Layer Norm emerging as the go-to choice for Transformers.
These layers are now standard in nearly all architectures, primarily because they stabilize and accelerate training. As networks grow deeper and wider, normalization is widely seen as essential, with most new architectures rethinking other components like attention and convolutions but keeping normalization untouched.
This paper challenges that assumption by proposing Dynamic Tanh (DyT), a simple alternative to Layer Norm (LN) in Transformers. LN works by scaling inputs and squashing extreme values in a way that resembles a tanh curve. DyT mimics this behavior with a straightforward element-wise transformation: DyT(x) = tanh(αx), where α is a learnable parameter that adjusts the input scaling. Unlike traditional normalization layers, DyT does not rely on activation statistics, making it a computationally lightweight replacement.
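To make this concrete, here is a minimal PyTorch sketch of a DyT layer based on the description above. The per-channel gamma/beta affine parameters mirror the ones Layer Norm usually carries, and the 0.5 initialization for α is an illustrative default, not necessarily the value the authors use in every setting:

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: a drop-in alternative to LayerNorm (sketch, not the authors' code).

    Applies an element-wise tanh(alpha * x) followed by the usual learnable
    affine transform, without computing any activation statistics.
    """
    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1) * init_alpha)  # learnable input scale
        self.gamma = nn.Parameter(torch.ones(dim))              # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))              # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```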
Replacing Layer Norm with DyT is straightforward, and experiments show that models with DyT train stably and reach comparable or even better performance, often without requiring hyperparameter tuning. The results suggest that normalization layers may not be as indispensable as previously thought. Additionally, preliminary benchmarks indicate that DyT improves both training and inference speed, making it a strong candidate for efficiency-focused neural network design.
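As a rough illustration of how simple the swap is, the hypothetical helper below walks a model and replaces every nn.LayerNorm with a DyT layer (as defined in the previous snippet) of the same width. In practice you would do this before training, since pretrained Layer Norm statistics do not carry over directly:

```python
import torch.nn as nn

def replace_layernorm_with_dyt(module: nn.Module) -> nn.Module:
    """Recursively swap every nn.LayerNorm for a DyT layer of the same width.
    Illustrative helper, not taken from the paper's code."""
    for name, child in module.named_children():
        if isinstance(child, nn.LayerNorm):
            setattr(module, name, DyT(child.normalized_shape[-1]))
        else:
            replace_layernorm_with_dyt(child)
    return module
```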
They also compared DyT against large models that use RMSNorm:
DyT seems to scale well while being significantly more efficient for both training and inference:
I don't see any issues with their experiments, and the implementation seems straightforward. I would expect it to be widely adopted in the next generation of LLMs, but since people tend to stick with familiar, proven methods, we'll have to wait and see.
Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
Diffusion models have become a dominant approach for generating images, videos, and even discrete data like text and biological sequences. Compared to autoregressive models, they offer advantages in controllability and generation speed. However, diffusion models typically generate fixed-length sequences, lack efficient KV caching due to their bidirectional nature, and still lag behind autoregressive methods in terms of perplexity.
This paper introduces Block Discrete Denoising Diffusion Language Models (BD3-LMs), a hybrid approach that combines the strengths of both diffusion and autoregressive models. Instead of generating tokens individually, BD3-LMs use a block diffusion process, where each block of tokens is conditioned on previous blocks, making the method semi-autoregressive. This design enables variable-length sequence generation, improves efficiency by allowing KV caching, and enhances sample quality. However, training BD3-LMs suffers from high gradient variance, which degrades performance. To address this, the authors propose custom noise schedules that reduce gradient variance and improve training stability.
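To illustrate the overall generation scheme, here is a schematic sketch of block-diffusion sampling, my own simplification rather than the authors' implementation: blocks are produced left to right, and within each block masked tokens are progressively denoised while conditioning on all previously committed blocks. The model call, mask token id, block size, and unmasking schedule are placeholders, and KV caching over committed blocks is omitted for brevity:

```python
import torch

MASK_ID = 0        # hypothetical mask-token id
BLOCK_SIZE = 16    # tokens generated per block (placeholder value)

@torch.no_grad()
def sample_bd3lm(model, prompt_ids, num_blocks, steps_per_block):
    """Schematic block-diffusion sampling loop (not the authors' code).

    `model(ids)` is assumed to return per-position logits of shape
    (batch, seq_len, vocab_size). In the real method, keys/values of
    committed blocks would be cached instead of recomputed."""
    seq = prompt_ids
    for _ in range(num_blocks):
        # start each block fully masked
        block = torch.full((seq.size(0), BLOCK_SIZE), MASK_ID, device=seq.device)
        for step in range(steps_per_block):
            logits = model(torch.cat([seq, block], dim=1))[:, -BLOCK_SIZE:]
            sampled = logits.argmax(dim=-1)  # greedy decoding for simplicity
            still_masked = block == MASK_ID
            # unmask roughly an equal fraction of the remaining positions each step
            unmask = still_masked & (
                torch.rand_like(logits[..., 0]) < 1.0 / (steps_per_block - step)
            )
            block = torch.where(unmask, sampled, block)
        seq = torch.cat([seq, block], dim=1)  # commit the block before generating the next
    return seq
```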

The authors released their code here:
GitHub: kuleshov-group/bd3lms
This approach seems to work and might bring back the idea of using diffusion in LLMs.
However, I’m unsure about the validity of their evaluation of BD3-LMs. Let’s take this table of results:

They compared the perplexity of models trained by different people, under different settings, and with varying architectures. However, the paper lacks sufficient details to ensure these numbers are truly comparable. As we discussed in a previous article, comparing perplexities across different models is almost always meaningless: