Reviewed this week:
⭐Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
LongKey: Keyphrase Extraction for Long Documents
⭐: Papers that I particularly recommend reading.
New code repositories (list of all repositories):
I’m also writing a review of the TÜLU 3 paper. We will see how AI2 built TÜLU 3, one of the most advanced LLMs available today. I’ll share my analysis in two parts: the first, covering the datasets, will be published this week; the second, focusing on the post-training recipe, will follow next week. Stay tuned!
Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding
Speculative decoding is a technique that improves the efficiency of large language models (LLMs) by using a smaller draft model to generate sequences and a larger expert model to verify them.
Instead of having the large model generate every token autoregressively, it verifies several drafted tokens in a single forward pass, which accelerates inference without compromising quality. However, previous approaches often use a fixed draft length, which fails to account for token-specific difficulty: how hard a token is to draft varies with the complexity of the content being generated.
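To make the mechanics concrete, here is a minimal sketch of the draft-and-verify loop with greedy verification. The real algorithm uses rejection sampling to preserve the target distribution, and the Pythia checkpoints below are placeholders I picked because they share a tokenizer; this is not the paper's code.

```python
# Minimal draft-and-verify loop with greedy verification (a simplification of
# speculative decoding; the placeholder draft/target models share a tokenizer).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
draft = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m")
target = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1.4b")

@torch.no_grad()
def speculative_step(input_ids, k=5):
    """Draft k tokens with the small model, then verify them with the
    large model in a single forward pass and keep the accepted prefix."""
    draft_ids = input_ids
    for _ in range(k):
        logits = draft(draft_ids).logits[:, -1]
        draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)
    target_logits = target(draft_ids).logits  # one parallel verification pass
    accepted = input_ids
    for i in range(input_ids.shape[1], draft_ids.shape[1]):
        expected = target_logits[:, i - 1].argmax(-1)  # target's choice at position i
        if expected.item() != draft_ids[0, i].item():
            # Reject: substitute the target's own token and stop this round.
            return torch.cat([accepted, expected.unsqueeze(0)], dim=-1)
        accepted = torch.cat([accepted, draft_ids[:, i:i + 1]], dim=-1)
    # All drafted tokens accepted: the target's last logits give one bonus token.
    bonus = target_logits[:, -1].argmax(-1)
    return torch.cat([accepted, bonus.unsqueeze(0)], dim=-1)
```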
To address this limitation, this work introduces SVIP (Self-VerIfication length Policy), a dynamic draft length policy for speculative decoding. SVIP adapts the draft sequence length based on the entropy of the draft model: it keeps drafting through easy stretches and stops early when tokens get harder. The design is guided by the observation that the draft model's entropy correlates with the token acceptance rate, which is intuitive: when the draft model is uncertain about the next token, the larger model is less likely to accept it.
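Here is how such an entropy-based stopping rule could look in practice. The plain Shannon entropy and the fixed threshold are my simplifications; SVIP's actual criterion is tied to the acceptance-rate analysis in the paper, so treat this as a sketch of the general idea rather than SVIP itself.

```python
# Hedged sketch of an entropy-based draft-length policy in the spirit of SVIP:
# keep drafting while the draft model is confident (low entropy) and stop early
# once entropy crosses a threshold. Entropy measure and threshold are my choices.
import torch

@torch.no_grad()
def draft_with_entropy_stop(draft_model, input_ids, max_draft=16, entropy_threshold=2.0):
    ids = input_ids
    for _ in range(max_draft):
        logits = draft_model(ids).logits[:, -1]
        probs = torch.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)
        if entropy.item() > entropy_threshold:
            break  # the draft model is unsure here: stop and let the verifier take over
        ids = torch.cat([ids, probs.argmax(-1, keepdim=True)], dim=-1)
    return ids  # drafted continuation to be verified by the larger model
```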
Their experiments demonstrate that SVIP improves speculative decoding across various LLMs and benchmarks. It achieves over 20% speedup on SpecBench and up to 60% improvement in long-form text generation for Pythia 6.9B.
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
Multimodal large language models (MLLMs, as the paper calls them, though I prefer VLMs, for vision-language models, since only the visual and language modalities are involved) combine visual and textual inputs to process mixed-modality instructions, leveraging the capabilities of pre-trained LLMs. However, the quadratic cost of attention with respect to sequence length limits their practical deployment: visual contexts often contribute a large number of tokens, which are both more abundant and more redundant than textual tokens. I studied their impact on memory consumption with Pixtral (article published in The Kaitchup):
This has prompted efforts to improve inference efficiency by reducing visual token quantities without losing critical information.
This study introduces a unified "filter-correlate-compress" paradigm to address inefficiencies in existing token reduction methods. This paradigm organizes the process into three interpretable and modular stages, making it possible to decompose, understand, and extend various token reduction approaches. By providing a consistent framework, it enables the transfer of design choices for developing new methods, improving both flexibility and performance.
Building on this framework, the study presents FiCoCo, a trio of complementary token reduction variants for different phases of MLLM inference. FiCoCo leverages intermediate outputs during inference to optimize token reduction, achieving significant computational and memory savings. Experiments across 10 multimodal benchmarks show that FiCoCo variants outperform most training-free methods and even surpass some training-based methods.
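To illustrate the paradigm (not FiCoCo's exact scoring and merging rules, which differ across its variants), here is a generic sketch with placeholder choices of my own: filter out visual tokens that receive little attention, correlate each discarded token with its most similar kept token, and compress by merging the two.

```python
# Generic filter-correlate-compress sketch for visual token reduction.
# Scoring (attention received) and merging (simple averaging) are placeholders.
import torch
import torch.nn.functional as F

def filter_correlate_compress(tokens, attn, keep_ratio=0.5):
    """tokens: (n, d) visual token embeddings; attn: (n, n) attention weights."""
    n = tokens.shape[0]
    n_keep = max(1, int(n * keep_ratio))
    # Filter: rank tokens by how much attention they receive from the others.
    importance = attn.sum(dim=0)
    keep_idx = importance.topk(n_keep).indices
    drop_idx = torch.tensor(sorted(set(range(n)) - set(keep_idx.tolist())), dtype=torch.long)
    kept, dropped = tokens[keep_idx], tokens[drop_idx]
    if dropped.numel() == 0:
        return kept
    # Correlate: match each dropped token to its most similar kept token.
    sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).T
    match = sim.argmax(dim=-1)
    # Compress: merge each dropped token into its matched kept token.
    merged = kept.clone()
    for j, m in enumerate(match):
        merged[m] = (merged[m] + dropped[j]) / 2
    return merged  # (n_keep, d) reduced token sequence

reduced = filter_correlate_compress(torch.randn(576, 1024), torch.rand(576, 576).softmax(-1))
print(reduced.shape)  # torch.Size([288, 1024])
```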
I wonder how this work compares with the “pixel shuffle” strategy Hugging Face used to build SmolVLM, which is also very effective at reducing the number of image tokens.
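For reference, pixel shuffle (space-to-depth) folds r×r neighborhoods of the visual feature grid into the channel dimension, cutting the token count by r². The value of r and where the operation sits in the projector are assumptions on my side; this just shows the generic operation.

```python
# Generic pixel shuffle / space-to-depth over a square grid of visual tokens.
import torch

def pixel_shuffle_tokens(x, r=2):
    """x: (batch, h*w, d) tokens on an h x w grid -> (batch, h*w/r^2, d*r^2)."""
    b, n, d = x.shape
    h = w = int(n ** 0.5)
    x = x.view(b, h, w, d)
    x = x.view(b, h // r, r, w // r, r, d)       # split the grid into r x r blocks
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // r) * (w // r), d * r * r)
    return x

tokens = torch.randn(1, 1024, 768)               # e.g. a 32 x 32 grid of patches
print(pixel_shuffle_tokens(tokens).shape)        # torch.Size([1, 256, 3072])
```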
LongKey: Keyphrase Extraction for Long Documents
Extracting important information from text documents is essential for effective information retrieval, especially with the massive amount of data available online and within organizations. Keyphrase Extraction (KPE) helps by identifying keyphrases in a document’s content to make it easier to retrieve and manage information. Keywords and keyphrases, which can represent the main ideas or specific details of a document, are often treated as the same, regardless of their length.
Different methods are used for KPE. Unsupervised techniques like TF-IDF, RAKE, and TextRank rely on term frequency, word co-occurrence, or graph-based analysis to identify important terms. Supervised approaches like KeyBERT and PatternRank use pre-trained language models, such as BERT, combined with techniques like cosine similarity or part-of-speech tagging to improve results. More advanced methods, like JointKPE and HyperMatch, use fine-tuned models and innovative strategies, such as hyperbolic distance or graph-based enhancements, to achieve more accurate results, particularly for longer and more complex documents.
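As a quick illustration of the embedding-similarity family (the idea behind KeyBERT), here is a minimal sketch: embed the document and candidate n-grams with the same encoder and rank candidates by cosine similarity. The encoder name and n-gram range are arbitrary choices for illustration, not a specific paper's setup.

```python
# Minimal embedding-based keyphrase extraction: rank candidate n-grams by
# cosine similarity between their embeddings and the document embedding.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

def extract_keyphrases(doc, top_n=5):
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    candidates = CountVectorizer(ngram_range=(1, 3), stop_words="english") \
        .fit([doc]).get_feature_names_out()
    doc_emb = encoder.encode([doc])
    cand_emb = encoder.encode(list(candidates))
    scores = cosine_similarity(doc_emb, cand_emb)[0]
    ranked = scores.argsort()[::-1][:top_n]
    return [(candidates[i], float(scores[i])) for i in ranked]

print(extract_keyphrases("Keyphrase extraction identifies the most informative "
                         "phrases in a document for indexing and retrieval."))
```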
While KPE has been widely studied, most research focuses on short texts such as abstracts or news articles. Longer documents raise additional challenges: complex structures, varied content, and the difficulty existing language models have in handling large amounts of text efficiently. To address these issues, this paper introduces LongKey, a framework specifically designed for extracting keyphrases from long documents. LongKey improves on previous methods by using models like Longformer, capable of processing up to 96,000 tokens, and a new embedding strategy that captures and integrates context across an entire document.
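As I understand it, the distinctive part is embedding each occurrence of a candidate keyphrase chunk by chunk and max-pooling those occurrence embeddings into one document-level representation. Below is a minimal sketch of that chunk-then-pool idea; the chunk size, the token-level matching, and the pooling details are my assumptions, not LongKey's exact architecture.

```python
# Chunk a long document, embed each occurrence of a candidate keyphrase with a
# long-context encoder, and max-pool across occurrences into one vector.
# Chunking, matching, and pooling choices here are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModel.from_pretrained("allenai/longformer-base-4096")

def chunk_ids(text, max_len=4096):
    """Split a long document into chunks that fit the encoder's context."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    step = max_len - 2  # leave room for [CLS] and [SEP]
    return [ids[i:i + step] for i in range(0, len(ids), step)]

@torch.no_grad()
def candidate_embedding(text, candidate):
    """Embed every occurrence of a candidate keyphrase chunk by chunk, then
    max-pool the occurrence embeddings into one document-level vector."""
    # Leading space so the BPE tokenization matches word-initial occurrences;
    # this naive token matching is brittle at subword and chunk boundaries.
    cand_ids = tokenizer(" " + candidate, add_special_tokens=False)["input_ids"]
    vectors = []
    for chunk in chunk_ids(text):
        input_ids = torch.tensor(
            [[tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]])
        hidden = model(input_ids).last_hidden_state[0]  # (len(chunk) + 2, dim)
        for j in range(len(chunk) - len(cand_ids) + 1):
            if chunk[j:j + len(cand_ids)] == cand_ids:
                # +1 offsets the [CLS] token at position 0.
                vectors.append(hidden[j + 1:j + 1 + len(cand_ids)].mean(dim=0))
    return torch.stack(vectors).max(dim=0).values if vectors else None
```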
While this work doesn’t use recent models (it relies on Longformer and BERT), it is very interesting, as few papers have tackled this important problem recently, especially with LLMs.
The code is available here:
GitHub: jeohalves/longkey