Pre-train LLMs on Millions of Synthetic Instructions
Microsoft's method to synthesize instruction datasets
Microsoft introduced instruction pre-training, a framework that explores supervised multitask learning at the pre-training stage. Instead of pre-training LLMs directly on raw corpora, instruction pre-training augments the raw text with instruction-response pairs generated by an instruction synthesizer.
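To make the idea concrete, here is a minimal sketch of the augmentation step in Python, assuming a generic Hugging Face Transformers text-generation interface. The prompt template, the pair formatting, and the `augment_chunk` helper are illustrative assumptions, not Microsoft's exact interface; I walk through the released code later in the article.

```python
from transformers import pipeline

# The checkpoint released with the paper on the Hugging Face Hub; for a quick
# test, any instruction-following model could stand in as the synthesizer.
synthesizer = pipeline(
    "text-generation",
    model="instruction-pretrain/instruction-synthesizer",
)

def augment_chunk(raw_text: str, max_new_tokens: int = 256) -> str:
    """Turn one raw-text chunk into an instruction-augmented pre-training example."""
    # Hypothetical prompt: ask the synthesizer for instruction-response pairs
    # grounded in the chunk (the real synthesizer uses its own template).
    prompt = (
        f"{raw_text}\n\n"
        "Generate instruction-response pairs based on the text above:\n"
    )
    out = synthesizer(prompt, max_new_tokens=max_new_tokens, return_full_text=False)
    pairs = out[0]["generated_text"]
    # The pre-training example is the raw text followed by its synthesized pairs.
    return raw_text + "\n\n" + pairs

chunk = "The Federal Reserve raised its policy rate by 25 basis points in July..."
print(augment_chunk(chunk))
```

The key point is the output format: the model still sees the original raw text, but every chunk is now followed by task-like pairs derived from it, which is what turns next-token prediction into implicit supervised multitask learning.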
According to Microsoft's evaluation, LLMs pre-trained with instructions perform significantly better on a wide range of tasks than LLMs that underwent standard pre-training.
In this article, I review Microsoft’s work on instruction pre-training. I explain how millions of instructions can be generated with open LLMs using Microsoft’s proposed approach. Then, I show how to use the code released by Microsoft to synthesize instructions. The resulting synthetic instructions can be used for pre-training and continued pre-training.
I also made a notebook showing how to use the instruction synthesizer to generate instructions for the finance domain, which you can then use to train finance chat models: