Sitemap - 2024 - The Salt - Curated AI
Mixture-of-Experts: Mixture-of-Head Attention and Embedding Model
Cancelling Attention Noise with Differential Transformer
Evaluating AdEMAMix: A New Optimizer for Faster, More Efficient LLM Training
Cross Capabilities of LLMs and Contextual Document Embeddings
LLMs Can Follow Instructions Without Instruction Tuning
SCoRe: Teach LLMs to Self-Correct
New Advances in Linear-time Sequence Modeling
Efficient Long Context Generalization with LongRecipe
Q-GaLore: Pre-Train 7B Parameter LLMs from Scratch on a 16 GB GPU
Enhanced SSM Training Through Initialization with a Pre-trained Transformer
Add Code to Your Training Data for Better LLMs
Towards Back-propagation of Self-attention in Almost Linear Time
Better Mixture of Experts with a Layerwise Recurrent Router
Generating More Useful Synthetic Data To Train LLMs as Evaluators
How Generative LLMs Achieve Top MMLU Scores without Generating Anything
Pre-training LLMs of Multiple Sizes, Simultaneously
Compress LLMs with Pruning and Knowledge Distillation using Minitron
Why Can't We Compare the Perplexity of Two Different Models?
More Evidence that Ternary LLMs Are Good Enough
Multimodal Self-instruct and Leakage of Code Benchmarks
CriticGPT: How OpenAI Is Improving GPT-4 with GPT-4
RAG and Long-context LLMs: When Do They Perform Better?
Pre-train LLMs on Millions of Synthetic Instructions
LiveBench: Finally a Contamination-Free LLM Benchmark?
Instruction Pre-training for Better Instruct LLMs
SimPO: A Reference-free Preference Optimization Method
MatMul-Free LLMs and Neural Algorithmic Reasoners
FineWeb-Edu: How to Make a Very High-Quality Dataset to Pre-train LLMs
Mixture-of-Agents: Combining the Feedback of Several LLMs
A Good Week for the State Space Neural Architecture
MoRA: A High-Rank Alternative to LoRA
Smaller KV Cache with Cross-Layer Attention
DeepSeek-V2: A Huge LLM with Efficient Inference
Online RLHF Is Still the Best Method for LLM Alignment
Decoder-decoder and How to Detect Under-trained Tokens
Prometheus 2 and Simple Methods to Extend the Context Length of LLMs
Compress the KV Cache with SnapKV
Jamba: The New Hybrid Transformer/Mamba
Speculative Decoding for Multimodal Models
Transform LLMs into Text Embedding Models with LLM2Vec
Ada-instruct: Generate Complex Instruction Datasets for Supervised Fine-tuning
ReFT: Fine-tuning Representations Rather than Weights
Stepwise DPO and Ineffective Deeper Layers
Efficient and Robust Prompt Compression for LLMs
Contaminated LLMs: What Happens When You Train an LLM on the Evaluation Benchmarks?
BurstAttention for Very Long Sequences and Faster Speculative Decoding with ReDrafter
Better Mamba and Another RoPE Improvement
LongRoPE: Towards Unlimited Context Length for the Transformer
Griffin and Hawk: Local Attention for Efficient Language Models
LongRoPE and Language Models as Universal Regressors
Length Generalization for Transformers
State Space Models Are Bad at Copying
RWKV: As Good as the Transformer But Faster?
Sparse PEFT and Better LLM Meta-Evaluation
An In-Depth Evaluation of Gemini and Mixtral-8x7B
Curated List of AI Code Repositories