Pre-train LLMs on Millions of Synthetic Instructions
Microsoft's method to synthesize instruction datasets
Microsoft introduced instruction pre-training, a framework that explores supervised multitask learning at the pre-training stage. Instead of pre-training LLMs directly on raw corpora, instruction pre-training augments the raw text with instruction-response pairs generated by an instruction synthesizer.
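To make the idea concrete, here is a minimal sketch of the augmentation step in Python, assuming a generic Hugging Face Transformers text-generation interface. The prompt template, the pair formatting, and the `augment_chunk` helper are illustrative assumptions, not Microsoft's exact interface; I walk through the released code later in the article.

```python
from transformers import pipeline

# The checkpoint released with the paper on the Hugging Face Hub; for a quick
# test, any instruction-following model could stand in as the synthesizer.
synthesizer = pipeline(
    "text-generation",
    model="instruction-pretrain/instruction-synthesizer",
)

def augment_chunk(raw_text: str, max_new_tokens: int = 256) -> str:
    """Turn one raw-text chunk into an instruction-augmented pre-training example."""
    # Hypothetical prompt: ask the synthesizer for instruction-response pairs
    # grounded in the chunk (the real synthesizer uses its own template).
    prompt = (
        f"{raw_text}\n\n"
        "Generate instruction-response pairs based on the text above:\n"
    )
    out = synthesizer(prompt, max_new_tokens=max_new_tokens, return_full_text=False)
    pairs = out[0]["generated_text"]
    # The pre-training example is the raw text followed by its synthesized pairs.
    return raw_text + "\n\n" + pairs

chunk = "The Federal Reserve raised its policy rate by 25 basis points in July..."
print(augment_chunk(chunk))
```

The key point is the output format: the model still sees the original raw text, but every chunk is now followed by task-like pairs derived from it, which is what turns next-token prediction into implicit supervised multitask learning.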
According to Microsoft's evaluation, LLMs pre-trained with instructions perform significantly better on a wide range of tasks than LLMs that underwent standard pre-training.
In this article, I review Microsoft’s work on instruction pre-training. I explain how millions of instructions can be generated with open LLMs using Microsoft’s proposed approach. Then, I show how to use the code released by Microsoft to synthesize instructions. The resulting synthetic instructions can be used for pre-training and continued pre-training.
I also made a notebook showing how to use the instruction synthesizer to generate instructions for the finance domain, which you can then use to train finance chat models: