The Salt - Curated AI

Add Code to Your Training Data for Better LLMs

But not too much!

Benjamin Marie
Aug 28, 2024


Including code in the pre-training data of large language models (LLMs) was standard practice even before these models were explicitly used for code generation. Today, code in various programming and markup languages (such as Python, Java, and HTML) constitutes a significant portion of the pre-training data for state-of-the-art LLMs like Llama 3, Gemma 2, and Qwen2, which improves their ability to generate code.

However, the impact of including code in the training data on non-code tasks—such as natural language generation, reasoning, and world knowledge—has not been thoroughly investigated.

Does including code in the training data of LLMs improve or degrade their performance on non-code tasks?

Cohere addressed this question in their recent paper:

To Code, or Not To Code? Exploring the Impact of Code in Pre-training

In this article, I review Cohere's paper. We will see how Cohere found that incorporating code into the training data is crucial for improving LLMs' performance in language generation tasks and their world knowledge. We will also see how this work identified an optimal proportion of code in the training data. Below this proportion, the LLM performs suboptimally. Above it, the LLM gets better at programming tasks but less effective at non-code tasks.

© 2025 Benjamin Marie