In The Weekly Salt, I review and analyze, in plain English, interesting AI papers published last week.
Reviewed this week
⭐Repeat After Me: Transformers are Better than State Space Models at Copying
LiPO: Listwise Preference Optimization through Learning-to-Rank
Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning
Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs
⭐: Papers that I particularly recommend reading.
New code repositories:
None were added this week.
I maintain a curated list of AI code repositories here:
Repeat After Me: Transformers are Better than State Space Models at Copying
This research shows that generalized state space models (GSSMs) struggle to accurately retrieve and replicate specific segments of the input context, a task at which transformers excel. This finding reminds me of sequence-to-sequence RNNs, which also perform poorly at copying the input.
Through a theoretical analysis, this work demonstrates that transformers can copy sequences whose length is exponential in their number of heads. In contrast, GSSMs cannot copy sequences longer than what their fixed-size latent state can store, revealing a fundamental limitation of the architecture.
Moreover, they evaluated pre-trained models’ ability to remember and access input context, comparing Pythia transformers and Mamba GSSMs of comparable sizes. Their results show that Pythia models surpass Mamba GSSMs in memory-intensive tasks such as copying and context retrieval, despite Mamba models achieving lower perplexity in language modeling tasks.
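To make the setup concrete, here is a minimal sketch of a string-copying probe in the spirit of their evaluation. The model name, prompt format, and sequence lengths are my own illustrative choices, not the paper's exact protocol.

```python
# Minimal string-copying probe (an illustrative sketch, not the paper's exact setup).
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-410m"  # any small causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def copy_accuracy(n_trials=20, n_tokens=50, seed=0):
    random.seed(seed)
    correct = 0
    for _ in range(n_trials):
        # Sample a random token sequence that the model must reproduce verbatim.
        ids = random.sample(range(1000, tokenizer.vocab_size - 1), n_tokens)
        text = tokenizer.decode(ids)
        prompt = f"Repeat the following text exactly.\nText: {text}\nCopy: "
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=n_tokens + 10, do_sample=False)
        completion = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:])
        correct += completion.strip().startswith(text.strip())
    return correct / n_trials

print(f"copy accuracy: {copy_accuracy():.2f}")
```

The same harness can be pointed at a Mamba checkpoint of comparable size to reproduce the qualitative gap described above.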
This is a clear weakness of GSSMs, but I would assume that it will be fixed very soon by some architectural modifications. For instance, when RNNs were the dominant architecture, a similar limitation was addressed by adding a pointer-generator network, which worked very well in NLP tasks that require copying from the input, such as summarization.
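For readers who have not seen it, the pointer-generator idea (See et al., 2017) mixes the decoder's distribution over the vocabulary with a copy distribution obtained from the attention over the source. Below is a sketch of only that mixing step; the tensor shapes and variable names are illustrative.

```python
# Sketch of the pointer-generator mixing step (See et al., 2017), not the full model.
# Shapes and names are illustrative: batch B, source length S, vocab size V.
import torch

def pointer_generator_distribution(vocab_logits, attention, source_ids, p_gen):
    """
    vocab_logits: (B, V)  decoder logits over the vocabulary
    attention:    (B, S)  attention weights over source tokens (sum to 1)
    source_ids:   (B, S)  vocabulary ids of the source tokens
    p_gen:        (B, 1)  probability of generating rather than copying
    """
    vocab_dist = torch.softmax(vocab_logits, dim=-1)   # generation distribution
    copy_dist = torch.zeros_like(vocab_dist)
    # Scatter attention mass back onto the vocabulary ids of the source tokens.
    copy_dist.scatter_add_(1, source_ids, attention)
    # Final distribution: mixture of generating and copying.
    return p_gen * vocab_dist + (1.0 - p_gen) * copy_dist

# Toy usage
B, S, V = 2, 5, 100
dist = pointer_generator_distribution(
    torch.randn(B, V),
    torch.softmax(torch.randn(B, S), dim=-1),
    torch.randint(0, V, (B, S)),
    torch.sigmoid(torch.randn(B, 1)),
)
print(dist.sum(dim=-1))  # each row sums to 1
```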
LiPO: Listwise Preference Optimization through Learning-to-Rank
An increasing number of papers are exploring simpler alternatives to Reinforcement Learning from Human Feedback (RLHF), often focusing on a pairwise ranking optimization approach.
For instance, Direct Preference Optimization (DPO) uses pairwise human preference data to optimize a logistic loss without needing an explicit reward model.
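As a refresher, the DPO objective on a single preference pair looks roughly like the sketch below; the log-probabilities are assumed to be summed over the response tokens, and the β value is just an example.

```python
# Minimal sketch of the DPO pairwise loss on a batch of preference pairs.
# log p values are per-response sums of token log-probabilities (assumed precomputed).
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit rewards are the log-ratios between the policy and the reference model.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Logistic loss pushing the chosen response above the rejected one.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with fake log-probabilities for a batch of 4 pairs
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss.item())
```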
Despite these advancements, most efforts have not moved beyond pairwise preferences, even though human preferences are frequently collected as ranked lists to reduce annotation effort, as was done for InstructGPT and OpenAssistant.
This work introduces a listwise ranking approach to LLM alignment, drawing from the Learning-to-Rank field. Listwise optimization is generally more effective than pairwise methods for ranking tasks.
Their analysis presents the first examination of ranking objectives within a Listwise Preference Optimization (LiPO) framework for LLM preference optimization. It introduces a new method, LiPO-λ, which outperforms existing methods by applying state-of-the-art ranking objectives with sophisticated weighting for listwise data.
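To give an idea of what a listwise objective looks like, here is a generic softmax-based (ListNet-style) loss over a ranked list of responses. This is a simplified stand-in, not the exact LiPO-λ objective, which further applies learning-to-rank lambda weights to the pairs in the list.

```python
# Generic listwise softmax (ListNet-style) objective over a ranked list of responses.
# A simplified stand-in for the listwise idea, not the exact LiPO-lambda loss.
import torch
import torch.nn.functional as F

def listwise_loss(policy_logps, ref_logps, human_scores, beta=0.1):
    """
    policy_logps, ref_logps: (B, K) summed log-probs of K candidate responses
    human_scores:            (B, K) listwise preference labels (higher = better)
    """
    # Implicit rewards, as in DPO, but computed for every item in the list.
    rewards = beta * (policy_logps - ref_logps)
    # Cross-entropy between the normalized label distribution and the model's
    # softmax over rewards: the whole list is optimized jointly.
    target = F.softmax(human_scores, dim=-1)
    return -(target * F.log_softmax(rewards, dim=-1)).sum(dim=-1).mean()

# Toy usage: batch of 2 prompts, each with a list of 4 ranked responses
loss = listwise_loss(torch.randn(2, 4), torch.randn(2, 4), torch.tensor([[3., 2., 1., 0.]] * 2))
print(loss.item())
```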
Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning
This study revisits the creation of instruction-tuning datasets, inspired by recent findings that Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) tend to produce longer outputs.
The researchers experimented with a simple yet effective strategy: selecting the 1,000 longest responses from a larger dataset to create a compact, high-quality instruction-tuning dataset.
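In code, this selection step is essentially a sort by response length, as in the sketch below; the dataset identifier and column names follow the public Alpaca release, but treat them as assumptions to adapt to your own data.

```python
# Sketch of the "1,000 longest responses" selection from an instruction dataset.
# Dataset id and column names follow the public Alpaca release; adjust as needed.
from datasets import load_dataset

dataset = load_dataset("tatsu-lab/alpaca", split="train")

# Sort by output length (in characters here; token count is a reasonable alternative)
# and keep the 1,000 longest examples as the instruction-tuning subset.
sorted_indices = sorted(range(len(dataset)), key=lambda i: len(dataset[i]["output"]), reverse=True)
alpaca_1k_longest = dataset.select(sorted_indices[:1000])

print(len(alpaca_1k_longest))           # 1000
print(alpaca_1k_longest[0]["instruction"])
```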
This approach yielded surprising results. A Llama 2 7B model fine-tuned on these longest Alpaca responses outperformed other models in head-to-head comparisons and on the AlpacaEval 2.0 benchmark.
Further enhancements were made by refining the quality and style of the Alpaca-1k-longest responses using GPT-3.5-Turbo.
Additionally, the team evaluated the models on factual knowledge and reasoning tasks, finding that they performed comparably to, or better than, models fine-tuned on larger datasets.
In conclusion, when building an instruction dataset, you should select the longest examples of the highest quality.
Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs
In this research, the authors investigate the issue of indirect data contamination within proprietary LLMs, specifically focusing on OpenAI's GPT-3.5 and GPT-4.
Through a systematic review of 255 studies, they identify data leakage, i.e., cases where evaluation data sent to these models could be used for their further training, thereby biasing their evaluation. They also highlight various evaluation malpractices found in the scientific literature, such as inadequate comparisons to other models, inconsistencies in evaluation scales, and a lack of transparency regarding code and data access.
The research presents several key findings, including that approximately 42% of the papers reviewed contributed to data leakage, affecting around 4.7 million benchmark samples across 263 benchmarks.
It also critiques the evaluation methods used in these papers, pointing out issues that affect the reproducibility and fairness of research findings. In my opinion, this study stands as a significant effort to quantify and address data leakage in LLMs, offering resources and guidelines to enhance future research integrity.
This study only puts numbers behind what we already knew: LLMs are poorly evaluated. Given that most companies use their own LLM evaluations as marketing material rather than for scientific purposes, I am pessimistic that this will change in the future.
If you have any questions about one of these papers, write them in the comments. I will answer them.
Note: All the figures above are extracted from their corresponding paper.