Reviewed this week:
⭐The Mamba in the Llama: Distilling and Accelerating Hybrid Models
Efficient LLM Scheduling by Learning to Rank
Efficient Detection of Toxic Prompts in Large Language Models
⭐: Papers that I particularly recommend reading.
New code repositories:
No new code repository made it to the list this week.
I maintain a curated list of AI code repositories here:
⭐The Mamba in the Llama: Distilling and Accelerating Hybrid Models
While Transformers are powerful and have driven the success of LLMs like GPT, Llama, and Mistral, recent linear RNN models such as Mamba, Mamba 2, RWKV, and Griffin have shown competitive or better performance in controlled experiments at small to medium scales, while offering much faster inference (up to 5× higher throughput). However, the best Transformers still outperform these models on downstream tasks.
The training times for linear RNNs and highly optimized Transformers are similar, meaning that scaling either type of model requires significant computational resources. This makes training a large linear RNN from scratch unattractive and motivates distilling an existing Transformer instead.
In this paper, the authors first demonstrate that by reusing weights from attention layers, they can successfully distill a large Transformer into a hybrid linear RNN with minimal additional computation. This approach preserves much of the original model's performance, and they propose a modified Mamba architecture that can be directly initialized from the attention blocks of a pre-trained model.
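The weight reuse exploits the linear-attention view of SSMs: the key projection plays the role of Mamba's input matrix B, the query projection the role of the output matrix C, and the value projection feeds the SSM input. A minimal sketch of that mapping, assuming illustrative names (this is not the authors' code):

```python
# Hedged sketch: initialize a Mamba-style layer from a pre-trained
# attention block by reusing its projection weights. Dictionary keys
# and the function name are illustrative assumptions.

def init_mamba_from_attention(attn_weights):
    """Map attention projections onto Mamba-style SSM parameters.

    Under the linear-attention correspondence: keys -> B (input matrix),
    queries -> C (output matrix), values -> the SSM input x, and the
    attention output projection is reused as-is.
    """
    return {
        "B_proj": attn_weights["W_K"],    # keys initialize B
        "C_proj": attn_weights["W_Q"],    # queries initialize C
        "x_proj": attn_weights["W_V"],    # values feed the SSM input
        "out_proj": attn_weights["W_O"],  # output projection reused
    }

# Toy 1x1 "weight matrices" just to show the wiring.
attn = {"W_K": [[1.0]], "W_Q": [[2.0]], "W_V": [[3.0]], "W_O": [[4.0]]}
mamba_params = init_mamba_from_attention(attn)
```

Only the mapping is shown here; the paper additionally trains the remaining SSM-specific parameters during distillation.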
To further improve the distillation process, the authors introduce a multistage distillation approach resulting in better perplexity and improved downstream performance compared to traditional distillation methods.
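At the core of such distillation pipelines sits a token-level divergence between teacher and student output distributions; a standard choice is the word-level KL divergence, sketched below as an illustration (the paper's exact staging and objectives are described in the paper itself):

```python
import math

def kl_distill_loss(teacher_probs, student_probs):
    """Word-level KL divergence KL(teacher || student) for one token
    position. A common core objective in Transformer-to-student
    distillation; shown as a generic illustration, not the authors' code."""
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0  # skip zero-probability teacher entries (0 * log 0 = 0)
    )
```

The loss is zero when the student matches the teacher exactly and grows as the distributions diverge.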
Additionally, the authors develop a hardware-aware speculative sampling algorithm and a fast kernel for speculative decoding on Mamba and hybrid architectures. This development achieves a throughput of over 300 tokens per second for a Mamba 7B model. They also demonstrate that speculative decoding can be effectively applied to their hybrid architecture, further enhancing efficiency.
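For context, the generic speculative-sampling verification loop looks like the sketch below: a cheap draft model proposes tokens and the target model accepts each with probability min(1, p_target/p_draft). The paper's contribution is a hardware-aware variant with a fast kernel suited to Mamba's recurrent state; this sketch only illustrates the general mechanism, with assumed names throughout:

```python
import random

def speculative_accept(draft_probs, target_probs, tokens, rng=random.random):
    """Verify a run of draft-proposed tokens against the target model.

    draft_probs / target_probs: per-position probability vectors.
    tokens: the tokens the draft model proposed.
    Accepts token i with probability min(1, p_target/p_draft); on the
    first rejection the run stops (the real algorithm then resamples
    from the residual distribution, omitted here)."""
    accepted = []
    for tok, q, p in zip(tokens, draft_probs, target_probs):
        if rng() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)  # target agrees often enough: keep it
        else:
            break                 # reject; remaining draft tokens are discarded
    return accepted
```

When draft and target agree, long runs of tokens are accepted per target-model pass, which is where the throughput gain comes from.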
This paper is a must-read!
They didn’t release any code or models.
Efficient LLM Scheduling by Learning to Rank
As demand surges, efficient scheduling of LLM tasks is crucial to ensure high-quality service, minimizing latency for users while maximizing overall system throughput.
Traditional first-come-first-serve (FCFS) scheduling often leads to significant delays, particularly under high load, due to Head-Of-Line (HOL) blocking. Although shortest-job-first (SJF) and shortest-remaining-time-first (SRTF) scheduling algorithms are known to reduce average latency, they are rarely implemented because they require knowledge of request lengths, which are typically assumed to be difficult to predict.
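The effect of HOL blocking is easy to see in a toy single-server model (real LLM serving batches requests, so this is only illustrative): one long request arriving first inflates everyone's completion time under FCFS, while a shortest-first order avoids it.

```python
def avg_completion_time(lengths, order):
    """Average completion time when requests run one at a time in `order`.
    A toy single-server model to illustrate HOL blocking."""
    t, total = 0, 0
    for i in order:
        t += lengths[i]   # this request finishes at time t
        total += t
    return total / len(lengths)

lengths = [10, 1, 1, 1]   # one long request arrives first
fcfs = avg_completion_time(lengths, [0, 1, 2, 3])
sjf = avg_completion_time(lengths, sorted(range(4), key=lambda i: lengths[i]))
# FCFS averages 11.5 time units; shortest-job-first averages 4.75.
```

The three short requests wait behind the long one under FCFS, which is exactly the blocking the paper sets out to avoid.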
The paper challenges this assumption, arguing that precise knowledge of request lengths isn't necessary; instead, just knowing the relative order of request lengths can be sufficient for effective scheduling.
To measure how closely a predicted schedule aligns with the ideal SJF/SRTF schedule, the authors propose using Kendall's Tau, a rank correlation coefficient. They demonstrate that higher similarity to the ideal schedule, as indicated by Kendall's Tau, generally results in lower latency and improved performance.
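Kendall's Tau is simple to compute from pairwise comparisons. A minimal implementation (the tie-free Tau-a variant, as an illustration of the metric):

```python
def kendalls_tau(true_lengths, predicted_scores):
    """Kendall's Tau-a: (concordant - discordant) pairs over all pairs.
    +1 means the predicted scores order requests exactly like the true
    lengths (ideal SJF/SRTF order); -1 means the exact reverse."""
    n = len(true_lengths)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            a = true_lengths[i] - true_lengths[j]
            b = predicted_scores[i] - predicted_scores[j]
            if a * b > 0:
                concordant += 1   # pair ordered the same way in both
            elif a * b < 0:
                discordant += 1   # pair ordered oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Note that only the ordering matters: any monotone transformation of the predicted scores leaves Tau unchanged, which is why ranking suffices for scheduling.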
To optimize LLM scheduling, the authors introduce a learning-to-rank approach. They show that a small auxiliary model can be trained to rank LLM requests by their expected generation lengths, allowing for more efficient on-the-fly scheduling. This method approximates the SRTF/SJF schedule more robustly and with less complexity than directly predicting request lengths. The proposed approach is easy to integrate into existing systems and significantly improves performance, reducing latency in chatbot services by 2.8× and increasing throughput in batch data generation by 6.5×.
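Training such a ranker only requires getting relative orderings right. A generic pairwise margin loss illustrates the idea (the paper's actual training objective may differ; names here are assumptions):

```python
def pairwise_rank_loss(score_shorter, score_longer, margin=1.0):
    """Pairwise margin ranking loss: the request that is truly shorter
    should score lower than the longer one by at least `margin`.
    Zero loss once the pair is correctly separated; generic sketch,
    not the authors' objective."""
    return max(0.0, margin - (score_longer - score_shorter))
```

Requests are then scheduled in ascending score order, approximating SJF/SRTF without ever predicting an exact length.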
They will release their code here:
GitHub: hao-ai-lab/vllm-ltr
Efficient Detection of Toxic Prompts in Large Language Models
LLMs are vulnerable to misuse, where malicious users craft toxic prompts to generate harmful content. These users may employ techniques like "jailbreaking" to bypass safety measures and elicit offensive responses from LLMs. Addressing this challenge is critical for maintaining the safety and integrity of LLM-based applications.
To detect toxic prompts, existing techniques are divided into blackbox and whitebox approaches. Blackbox methods, such as Google's Perspective API and OpenAI's Moderation API, focus on detecting toxic content in prompts but struggle with the diversity and disguise of toxic inputs. Whitebox methods, like PlatonicDetector and PerplexityFilter, exploit internal model states for better detection but are computationally intensive, limiting their scalability for real-time applications. This creates a need for a more efficient, scalable solution to detect toxic prompts effectively.
The authors of this paper propose “ToxicDetector”, a lightweight, grey-box method that detects toxic prompts by computing, at each LLM layer, the maximum inner product between the last token's embedding and the embeddings of toxic concept prompts. It relies only on embeddings the LLM already produces, detects toxic inputs even when disguised, and stays efficient: the whole pipeline amounts to simple inner product calculations followed by a final classification step using a multilayer perceptron (MLP).
ToxicDetector operates by automatically generating toxic concept prompts, which serve as benchmarks for identifying toxicity. It extracts embedding vectors from input prompts and compares them with these benchmarks to form a feature vector, which is then classified as toxic or benign. This method proves to be computationally efficient and scalable, suitable for real-time applications.
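The feature extraction step described above can be sketched as follows, assuming per-layer last-token embeddings for the prompt and for the toxic concept prompts are already available (function names are illustrative, not the authors' implementation):

```python
def dot(u, v):
    """Inner product of two plain-list vectors."""
    return sum(a * b for a, b in zip(u, v))

def toxic_features(prompt_layer_embs, concept_layer_embs):
    """One feature per layer: the maximum inner product between the
    prompt's last-token embedding at that layer and that layer's
    toxic-concept embeddings. The resulting feature vector is what
    the MLP classifier labels toxic or benign."""
    return [
        max(dot(emb, concept) for concept in concepts)
        for emb, concepts in zip(prompt_layer_embs, concept_layer_embs)
    ]
```

A toy call with two layers and two concept prompts per layer: `toxic_features([[1.0, 0.0], [0.0, 1.0]], [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]]])` yields one maximum per layer.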
If you have any questions about one of these papers, write them in the comments. I will answer them.