Disastrous overtraining: why AIs make mistakes when trained on an abundance of data

Discover the dangers of overtraining artificial intelligences. This article explores why an overabundance of data can lead to catastrophic errors in AI models, underscoring the importance of balancing the quantity and quality of data.

Researchers from prestigious American universities are shedding light on a concerning phenomenon in the training of artificial intelligences (AIs): overtraining, which results from an excessive amount of pre-training data. Their study reveals that more data does not necessarily mean better performance. On the contrary, excessive training can lead to disappointing results, even a deterioration of a model's capabilities. In this article, we explore why an abundance of data can lead to errors and malfunctions in AIs.

The concept of overtraining

Overtraining occurs when an AI model is pre-trained on a volume of data that exceeds an optimal threshold. While one might think that a larger quantity of data always allows for better training, research shows that it can have the opposite effect: overtraining can make the model too sensitive to fluctuations in the data, leading to amplified errors.

Increased sensitivity to data

Scientists have observed that excessive training produces a progressive sensitivity: as the number of tokens (the units of data a model processes) increases, the model becomes increasingly fragile. This fragility raises the risk of errors, particularly when the model is later fine-tuned or when external elements are integrated. Tests have revealed that models trained on less data often perform better after such adjustments than those that have been overtrained.
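The sensitivity described here can be probed directly: save checkpoints at increasing training durations, apply the same small random perturbation to each checkpoint's weights, and compare how much the loss degrades. The sketch below is a toy illustration of that measurement, not the study's actual protocol; the tiny NumPy network, the data, and every hyperparameter are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 20 noisy samples of sin(3x) on [-1, 1] (illustrative only).
X = np.linspace(-1, 1, 20).reshape(-1, 1)
y = np.sin(3 * X) + 0.1 * rng.normal(size=X.shape)

# Tiny one-hidden-layer tanh network, trained by full-batch gradient descent.
H = 16
W1 = 0.5 * rng.normal(size=(1, H)); b1 = np.zeros(H)
W2 = 0.5 * rng.normal(size=(H, 1)); b2 = np.zeros(1)

def forward(params, X):
    W1, b1, W2, b2 = params
    return np.tanh(X @ W1 + b1) @ W2 + b2

def loss(params, X, y):
    return float(np.mean((forward(params, X) - y) ** 2))

def step(params, lr=0.05):
    W1, b1, W2, b2 = params
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    d = 2 * (pred - y) / len(X)            # dL/dpred
    gW2 = h.T @ d; gb2 = d.sum(0)
    dh = (d @ W2.T) * (1 - h ** 2)         # backprop through tanh
    gW1 = X.T @ dh; gb1 = dh.sum(0)
    return (W1 - lr * gW1, b1 - lr * gb1, W2 - lr * gW2, b2 - lr * gb2)

def fragility(params, sigma=0.05, trials=50):
    """Mean loss increase when Gaussian noise of scale sigma hits every weight."""
    base = loss(params, X, y)
    deltas = []
    for _ in range(trials):
        noisy = tuple(p + sigma * rng.normal(size=p.shape) for p in params)
        deltas.append(loss(noisy, X, y) - base)
    return float(np.mean(deltas))

# Perturb the same model at three training durations and compare the damage.
params = (W1, b1, W2, b2)
checkpoints = {}
for t in range(1, 20001):
    params = step(params)
    if t in (200, 2000, 20000):
        checkpoints[t] = (loss(params, X, y), fragility(params))

for t, (l, f) in checkpoints.items():
    print(f"steps={t:>6}  train_loss={l:.4f}  loss_increase_under_noise={f:.4f}")
```

The interesting comparison is the last column across rows: it quantifies how much the same perturbation hurts checkpoints that were trained for longer.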

The performance of AI models and the inflection point

The inflection point refers to the moment when additional training no longer yields benefits and instead begins to harm the quality of the model. This point is often reached beyond 2.5 trillion tokens for certain smaller models, such as OLMo-1B. Once a model passes this critical stage, the gains achieved through training are negated by internal instability that manifests as a deterioration in performance.
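The trade-off behind such an inflection point can be caricatured with a toy formula (entirely hypothetical; the functional forms and coefficients below are made up for intuition and are not the study's model): if pre-training gains grow roughly logarithmically with token count while fragility grows polynomially, their difference peaks and then declines.

```python
import numpy as np

# Toy model (hypothetical): downstream performance after fine-tuning
#   = pre-training gains minus a fragility penalty that grows faster.
#   gains(T)     = a * log(T)   -- diminishing returns from more tokens
#   fragility(T) = b * T ** c   -- sensitivity that compounds with tokens
tokens = np.linspace(0.1, 5.0, 1000)     # trillions of tokens
a, b, c = 1.0, 0.08, 1.8                 # made-up coefficients
performance = a * np.log(tokens) - b * tokens ** c

peak = tokens[np.argmax(performance)]
print(f"toy inflection point near {peak:.2f}T tokens")
```

Past the peak of this curve, every additional token of pre-training lowers the modeled post-fine-tuning performance, which is the qualitative shape the researchers describe.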

A revealing study

A study conducted by scientists from Carnegie Mellon, Stanford, Harvard, and Princeton produced a striking finding: a model trained on less data could outperform its overtrained counterpart on specific benchmarks such as AlpacaEval and ARC. This result underscores that the quality of the data and its relevance to the learning objectives matter more than sheer quantity.

The consequences of overtraining

The effects of overtraining are diverse and damaging to the performance of artificial intelligences. Researchers warn that misinterpreting the data and mismatching pre-training tasks to downstream goals can lead to catastrophic overtraining. Affected models become unable to generalize what they have learned to new data, limiting their usefulness in real-world situations.
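This failure to generalize can be seen in miniature with classic overfitting, a simpler relative of the catastrophic overtraining studied here: a model that chases every fluctuation in its training data scores well on that data and poorly on fresh data. A minimal sketch with made-up data (the function, noise level, and polynomial degrees are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up data: noisy samples of a smooth underlying function.
def make_data(n):
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + 0.3 * rng.normal(size=n)

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)   # fresh data the model never saw

def fit_and_score(degree):
    # Least-squares polynomial fit; returns (train MSE, test MSE).
    p = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    mse = lambda x, y: float(np.mean((p(x) - y) ** 2))
    return mse(x_train, y_train), mse(x_test, y_test)

results = {}
for degree in (3, 15):
    results[degree] = fit_and_score(degree)
    train_mse, test_mse = results[degree]
    print(f"degree={degree:>2}  train_mse={train_mse:.3f}  test_mse={test_mse:.3f}")
```

The high-degree polynomial fits the training points at least as closely as the low-degree one, yet typically does worse on the held-out points: memorizing the data is not the same as learning from it.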

Reflections on training AIs

The scientists do not advocate abandoning pre-training altogether, but emphasize the importance of rethinking the training strategy. They recommend paying special attention to model sizing and the definition of learning objectives. Rather than racing for quantity, they suggest optimizing the quality of training by carefully examining the entire learning pipeline, which would help avoid the pitfalls of overtraining.
