Including code in the pre-training data of large language models (LLMs) has become standard practice, even before these models were explicitly used for code generation tasks. Today, code in various languages (such as Python, Java, and HTML) constitutes a significant portion of the pre-training data for state-of-the-art LLMs like Llama 3, Gemma 2, and Qwen2, and improves their performance on code generation tasks.
However, the impact of including code in the training data on non-code tasks—such as natural language generation, reasoning, and world knowledge—has not been thoroughly investigated.
Does the inclusion of code in LLM training data improve or degrade their performance on non-code tasks?
Cohere addressed this question in their recent paper:
To Code, or Not To Code? Exploring the Impact of Code in Pre-training
In this article, I review Cohere's paper. We will explore how Cohere found that incorporating code into the training data is crucial for enhancing LLMs' performance on language generation tasks and improving their world knowledge. We will also see how this work identified an optimal proportion of code in the training data: falling below this threshold results in suboptimal LLM performance, while exceeding it makes the LLM better at programming tasks but less effective on non-code tasks.