Reviewed this week
⭐On Leakage of Code Generation Evaluation Datasets
Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps
Self-Recognition in Language Models
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model
⭐: Papers that I particularly recommend reading.
New code repositories:
I maintain a curated list of AI code repositories here:
On Leakage of Code Generation Evaluation Datasets
As we saw in a previous article, data contamination is a significant problem in LLM evaluation.
In code generation, the problem is even bigger since code generation benchmarks are scarce, as this new work by Cohere demonstrates.
Achieving competitive results on these benchmarks carries considerable scientific and economic benefits, so they are very widely used. However, this widespread use has led to data leakage and contamination beyond the intended evaluation scope, undermining their reliability.
If models are trained on or selected based on this evaluation data, it compromises the integrity of the evaluation.
Cohere presents evidence that contamination, defined here as any instance where evaluation datasets have leaked into model training, has affected most LLMs. The primary form of contamination occurs through the inclusion of these datasets in training data, which they demonstrate is likely.
Another form of contamination arises indirectly through synthetic data, a technique that is particularly common for improving code capabilities by generating additional training examples. Finally, a point that applies to all LLMs but is worth repeating: final model selection is usually heavily influenced by performance on these benchmarks, which effectively overfits models to them.
They also propose the Less Basic Python Problems (LBPP) benchmark, designed to be a more challenging, yet portable, code generation test set. They claim it was built in a way that minimizes the risk of leakage.
Hugging Face dataset: CohereForAI/lbpp
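Since the benchmark is published on the Hugging Face Hub, it can be loaded with the datasets library. Here is a minimal sketch; the split name and the idea of simply printing a few examples are my assumptions, so check the dataset card for the actual configurations and schema.

```python
# A minimal sketch for loading LBPP from the Hugging Face Hub.
# Assumptions: a "test" split exists; if the dataset exposes per-language
# subsets, a config name may be required, e.g. load_dataset("CohereForAI/lbpp", "python").
from datasets import load_dataset

lbpp = load_dataset("CohereForAI/lbpp", split="test")

for example in lbpp.select(range(3)):
    print(example)  # inspect the fields before wiring up an evaluation harness
```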
Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps
Most prior studies of hallucinations in LLMs focus on cases where the hallucination stems from the model's inherent (parametric) knowledge. These approaches typically rely on internal representations, such as hidden states, MLP outputs, and attention outputs, to detect and mitigate hallucinations. Contextual hallucinations are different: the provided context is what matters, and attention maps offer a direct measure of how much the model focuses on that context during generation.
Building on this, the authors hypothesize that contextual hallucinations are linked to how much an LLM attends to the provided context. They introduce the "lookback ratio," the ratio of attention weight placed on the given context versus on the newly generated tokens. This ratio is computed for each attention head at every time step, and a linear classifier, "Lookback Lens", is then trained on these features to detect contextual hallucinations.
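To make the idea concrete, here is a minimal sketch of computing lookback ratios from the attention tensors returned by a Hugging Face model called with output_attentions=True. This is an illustration of the idea, not the authors' implementation; the tensor layout (batch, heads, query, key) is the standard transformers convention, and context_len is assumed to mark where the provided context ends.

```python
# A minimal sketch of the lookback-ratio computation, not the authors' code.
# Assumes `attentions` is the per-layer tuple returned with output_attentions=True,
# each tensor of shape (batch, heads, q_len, k_len), and that the provided
# context occupies the first `context_len` positions of the sequence.
import torch

def lookback_ratios(attentions, context_len):
    """Lookback ratios for the current decoding step, one per (layer, head)."""
    ratios = []
    for layer_attn in attentions:
        attn = layer_attn[0, :, -1, :]                   # last query position, all heads
        on_context = attn[:, :context_len].sum(dim=-1)   # attention mass on the context
        on_new = attn[:, context_len:].sum(dim=-1)       # mass on newly generated tokens
        ratios.append(on_context / (on_context + on_new + 1e-9))
    return torch.stack(ratios)                           # shape: (layers, heads)
```

Flattened over layers and heads (and averaged over a span of generated tokens), these ratios would then serve as features for a simple linear classifier such as scikit-learn's LogisticRegression.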
According to the authors, it performs comparably to, and sometimes better than, more complex hallucination detectors that use hidden states or entailment models trained on annotated datasets.
They released their tool to detect contextual hallucinations:
GitHub: voidism/Lookback-Lens
Self-Recognition in Language Models
Can LLMs recognize their own outputs among outputs generated by other LLMs?
This work explores whether such questions can measure LM self-recognition. The process involves three steps: instructing LMs to generate self-recognition questions, collecting answers to these questions from a panel of LMs, and prompting each model to identify its own answer among them.
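As a rough illustration of this three-step protocol (not the authors' code), the sketch below uses a hypothetical query(model, prompt) helper and simplified prompts of my own wording:

```python
# A toy sketch of the three-step self-recognition protocol described above.
# `query(model, prompt)` is a hypothetical helper returning a model's text
# response; the prompts are simplified paraphrases, not the paper's.
import random

def self_recognition_trial(examiner, panel, query):
    # 1. The examiner model writes a question meant to reveal its own style.
    question = query(examiner, "Write one question whose answer would let you "
                               "recognize your own response among others.")
    # 2. Every model on the panel (including the examiner) answers it.
    answers = {model: query(model, question) for model in panel}
    # 3. The examiner sees the shuffled answers and must pick its own.
    shuffled = list(answers.items())
    random.shuffle(shuffled)
    listing = "\n".join(f"{i + 1}. {a}" for i, (_, a) in enumerate(shuffled))
    choice = query(examiner, f"Question: {question}\nAnswers:\n{listing}\n"
                             "Which of these answers did you write? Reply with a number.")
    return question, shuffled, choice
```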
Testing ten state-of-the-art LMs, the authors found no consistent evidence of general self-recognition. LMs often preferred answers from "stronger" models over their own and exhibited consistent preferences for certain models. They also discovered that position bias significantly affects LM decision-making, with implications for LM benchmarks using multiple-choice formats.
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model
LLMs have proved particularly useful for generating synthetic data to train other LLMs. In The Salt, we already reviewed two different approaches to synthetic generation: Ada-Instruct, which I found doesn't perform very well, and Microsoft's approach to instruction pre-training.
The Phi models are prime examples of capable small LLMs trained on synthetic data.
These approaches only generate synthetic text. For multimodal models, such as vision-language models (VLMs), we also need to generate images.
The authors of this paper focus on tasks that require reasoning over abstract images. They propose a multimodal self-instruct strategy designed to produce abstract images and aligned reasoning instructions for various everyday scenarios, including roadmaps, dashboards, 2D planar layouts, charts, relation graphs, flowcharts, and visual puzzles.
The process begins with the strategy autonomously proposing a creative idea for a visual scenario, such as a step-by-step flowchart showing how to attend an academic conference, or the design of a roadmap. It then generates detailed instructions to visualize this idea. After synthesizing the desired image, the LLM self-instructs high-quality Q&A pairs for this visual content. The entire process is completed by the LLM with minimal demonstration examples.
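A rough sketch of such a pipeline is shown below. It is not the authors' implementation: llm(prompt) is a hypothetical text-completion helper, and the rendering step assumes the model returns runnable matplotlib code, which would need sandboxing in practice.

```python
# A toy sketch of a multimodal self-instruct pipeline, not the authors' code.
# `llm(prompt)` is a hypothetical helper returning a text completion.

def synthesize_example(llm, scenario="flowchart"):
    # 1. Autonomously propose a creative idea for the visual scenario.
    idea = llm(f"Propose a creative idea for a {scenario}, e.g., a step-by-step "
               "flowchart showing how to attend an academic conference.")
    # 2. Generate drawing code that renders the idea as an abstract image.
    code = llm(f"Write Python matplotlib code that draws this {scenario}: {idea}. "
               "Save the figure to 'image.png'.")
    exec(code, {})  # caution: executing model-generated code should be sandboxed
    # 3. Self-instruct Q&A pairs grounded in the rendered image.
    qa_pairs = llm(f"The image 'image.png' depicts: {idea}. Write three "
                   "question-answer pairs that require reasoning over this image.")
    return {"idea": idea, "code": code, "image": "image.png", "qa": qa_pairs}
```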
If you have any questions about one of these papers, write them in the comments. I will answer them.