Pre-train LLMs on Millions of Synthetic Instructions

Microsoft's method to synthesize instruction datasets

Benjamin Marie
Jul 03, 2024


Microsoft introduced instruction pre-training to bring supervised multitask learning to the pre-training stage. Instead of pre-training LLMs directly on raw corpora, instruction pre-training augments the raw text with instruction-response pairs generated by an instruction synthesizer.
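
Below is a minimal sketch of what this augmentation could look like in code. The "Question:/Answer:" template and the helper function are assumptions for illustration only; Microsoft's actual format may differ:

```python
# Minimal sketch: concatenating a raw document with synthesized
# instruction-response pairs to build one pre-training example.
# The "Question:/Answer:" template is an assumption for illustration;
# Microsoft's released format may differ.

def augment_with_instructions(raw_text: str, pairs: list[tuple[str, str]]) -> str:
    """Append instruction-response pairs to a raw document so the LLM
    is pre-trained on both the text and tasks grounded in that text."""
    blocks = [raw_text]
    for instruction, response in pairs:
        blocks.append(f"Question: {instruction}\nAnswer: {response}")
    return "\n\n".join(blocks)

example = augment_with_instructions(
    "The company reported a 12% increase in quarterly revenue...",
    [("How much did quarterly revenue increase?", "It increased by 12%.")],
)
print(example)
```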

According to Microsoft's evaluation, LLMs pre-trained with instructions perform significantly better across a wide range of tasks than LLMs that underwent standard pre-training.


In this article, I review Microsoft’s work on instruction pre-training. I explain how millions of instructions can be generated with open LLMs using Microsoft’s proposed approach. Then, I show how to use the code released by Microsoft to synthesize instructions. The resulting synthetic instructions can be used for pre-training and continued pre-training.
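
To give a rough idea of what synthesizing instructions looks like in practice, here is a hedged sketch that loads a synthesizer model with Hugging Face Transformers and generates pairs from a raw passage. The model ID and the plain-text prompt are assumptions for illustration; Microsoft's released code may expect a specific input template:

```python
# Hedged sketch: generating instruction-response pairs from raw text
# with a Hugging Face causal LM. The model ID and the plain-text prompt
# are assumptions for illustration; Microsoft's released synthesizer
# may expect a specific input template (see their released code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "instruction-pretrain/instruction-synthesizer"  # assumed model ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

raw_text = "The company reported a 12% increase in quarterly revenue..."
inputs = tokenizer(raw_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens (the synthesized pairs).
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```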

I also made a notebook showing how to use the instruction synthesizer to generate instructions for the finance domain, which you can then use to train finance chat models:

Get the notebook (#8)
