Ada-instruct: Generate Complex Instruction Datasets for Supervised Fine-tuning
Complex, yes, but are they good enough?
Chat models are pre-trained large language models (LLMs) fine-tuned on instruction datasets. These datasets consist of prompts or questions, i.e., instructions, each paired with a correct answer. While many instruction datasets are now publicly available for fine-tuning, they are often limited to particular domains and lack complexity.
Recent research has explored using LLMs like ChatGPT to generate extensive training datasets from minimal initial samples, employing a technique known as "self-instruct" (Wang et al., 2023). This method prompts ChatGPT to sequentially create instructions and their corresponding answers. However, a notable limitation of the self-instruct approach is its difficulty in generating complex instructions: even when explicitly prompted for more challenging content, LLMs typically produce oversimplified instructions.
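To make the self-instruct loop concrete, here is a minimal sketch of its prompt-construction step: a few seed instructions are sampled and packed into a numbered few-shot prompt that asks the model to continue the list with a new instruction. The function name and prompt wording are my own illustrations, not the exact template from the paper.

```python
import random

def build_self_instruct_prompt(seed_instructions, n_examples=3, seed=0):
    """Sketch of the self-instruct prompting step (illustrative, not the
    paper's exact template): sample a few seed instructions and format
    them as a numbered list the model is asked to continue."""
    rng = random.Random(seed)
    examples = rng.sample(seed_instructions, k=min(n_examples, len(seed_instructions)))
    lines = ["Come up with a new task instruction in the style of the examples below."]
    for i, inst in enumerate(examples, start=1):
        lines.append(f"{i}. {inst}")
    # The model's completion of this empty slot is the new instruction.
    lines.append(f"{len(examples) + 1}.")
    return "\n".join(lines)
```

In the full self-instruct pipeline, the model's completion of the final numbered slot is parsed out, optionally filtered, and fed back into the seed pool so the dataset grows iteratively.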
To overcome this limitation, Ada-Instruct was proposed in this paper:
Ada-Instruct: Adapting Instruction Generators for Complex Reasoning
This few-shot instruction generation framework only requires a few examples of complex instructions to fine-tune open LLMs for generating complex instruction datasets.
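The core idea is that, instead of prompting a closed model, Ada-Instruct fine-tunes an open LLM on a handful of hand-written complex instructions so that it learns to emit new ones. A minimal sketch of how such seed instructions might be turned into fine-tuning records is below; the record schema and the generation prompt are assumptions for illustration, not the paper's exact format.

```python
# Illustrative generation prompt; the paper's actual fine-tuning
# format may differ.
GENERATION_PROMPT = "Write a new, challenging task instruction for this domain."

def make_finetuning_records(seed_instructions):
    """Turn a handful of seed complex instructions into prompt/completion
    records: the model is trained to produce each seed instruction as the
    completion of a fixed generation prompt (hypothetical schema)."""
    return [
        {"prompt": GENERATION_PROMPT, "completion": inst}
        for inst in seed_instructions
    ]
```

After fine-tuning on records like these, the model is sampled repeatedly with the same generation prompt to produce a large pool of candidate instructions.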
In this article, I review Ada-Instruct and explain how it works. The method is very simple. We will also see how to use an Ada-Instruct LLM to generate our own dataset of complex instructions. While Ada-Instruct LLMs do generate complex instructions, these instructions are often incorrect or repetitive and thus require heavy filtering.
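As a rough sketch of what that filtering might look like, the function below drops very short instructions and near-duplicates. The word-count and similarity thresholds are illustrative defaults I chose, not values from the paper, and `difflib` is just one cheap way to measure string similarity.

```python
import difflib

def filter_instructions(instructions, min_words=8, similarity_threshold=0.8):
    """Hypothetical post-generation filter: discard instructions that are
    too short or near-duplicates of ones already kept. Thresholds are
    illustrative, not taken from the Ada-Instruct paper."""
    kept = []
    for inst in instructions:
        inst = inst.strip()
        if len(inst.split()) < min_words:
            continue  # too short to be a complex instruction
        is_duplicate = any(
            difflib.SequenceMatcher(None, inst.lower(), k.lower()).ratio()
            >= similarity_threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(inst)
    return kept
```

In practice, correctness filtering (checking that an instruction is actually answerable and well-posed) is harder than deduplication and may require an LLM judge or manual review.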
I made a notebook experimenting with NousResearch/Genstruct-7B, a model trained with the Ada-Instruct methodology. You can get it here: