Sitemap - 2024 - The Salt - Curated AI

Mixture-of-Experts: Mixture-of-Head Attention and Embedding Model

Cancelling Attention Noise with Differential Transformer

Evaluating AdEMAMix: A New Optimizer for Faster, More Efficient LLM Training

Cross Capabilities of LLMs and Contextual Document Embeddings

LLMs Can Follow Instructions Without Instruction Tuning

Qwen2-VL: How Does It Work?

SCoRe: Teach LLMs to Self-Correct

New Advances in Linear-time Sequence Modeling

Efficient Long Context Generalization with LongRecipe

Q-GaLore: Pre-Train 7B Parameter LLMs from Scratch on a 16 GB GPU

Enhanced SSM Training Through Initialization with a Pre-trained Transformer

Add Code to Your Training Data for Better LLMs

Towards Back-propagation of Self-attention in Almost Linear Time

Better Mixture of Experts with a Layerwise Recurrent Router

Generating More Useful Synthetic Data To Train LLMs as Evaluators

How Generative LLMs Achieve Top MMLU Scores without Generating Anything

Pre-training LLMs of Multiple Sizes, Simultaneously

Compress LLMs with Pruning and Knowledge Distillation using Minitron

Why Can't We Compare the Perplexity of Two Different Models?

More Evidence that Ternary LLMs Are Good Enough

Multimodal Self-instruct and Leakage of Code Benchmarks

CriticGPT: How OpenAI Is Improving GPT-4 with GPT-4

RAG and Long-context LLMs, When Do They Perform Better?

Pre-train LLMs on Millions of Synthetic Instructions

LiveBench: Finally a Contamination-Free LLM Benchmark?

Instruction Pre-training for Better Instruct LLMs

SimPO: A Reference-free Preference Optimization

MatMul-Free LLMs and Neural Algorithmic Reasoners

FineWeb-Edu: How to Make a Very High-Quality Dataset to Pre-train LLMs

Mixture-of-Agents: Combining the Feedback of Several LLMs

A Good Week for the State Space Neural Architecture

MoRA: A High-Rank Alternative to LoRA

Smaller KV Cache with Cross-Layer Attention

DeepSeek-V2: A Huge LLM with Efficient Inference

Online RLHF Is Still the Best Method for LLM Alignment

Decoder-decoder and How to Detect Under-trained Tokens

Prometheus 2 and Simple Methods to Extend the Context Length of LLMs

Compress the KV Cache with SnapKV

Jamba: The New Hybrid Transformer/Mamba

Speculative Decoding for Multimodal Models

Transform LLMs into Text Embeddings with LLM2Vec

Ada-instruct: Generate Complex Instruction Datasets for Supervised Fine-tuning

ReFT: Fine-tuning Representations Rather than Weights

Stepwise DPO and Ineffective Deeper Layers

Efficient and Robust Prompt Compression for LLMs

Contaminated LLMs: What Happens When You Train an LLM on the Evaluation Benchmarks?

BurstAttention for Very Long Sequences and Faster Speculative Decoding with ReDrafter

Better Mamba and Another RoPE Improvement

LongRoPE: Towards Unlimited Context Length for the Transformer

Griffin and Hawk: Local Attention for Efficient Language Models

LongRoPE and Language Models as Universal Regressors

Length Generalization for Transformers

State Space Models Are Bad at Copying

RWKV: As Good as the Transformer But Faster?

AI Notebooks

Sparse PEFT and Better LLM Meta-Evaluation

Towards Token-Free LLMs?

An In-Depth Evaluation of Gemini and Mixtral-8x7B

Curated List of AI Code Repositories

Asynchronous Local-SGD and Self-Rewarding LLMs

Insights from the Falcons