Current language model training leaves large parts of the internet on the table

2026-03-02

Summary

A study by researchers from Apple, Stanford, and the University of Washington shows that language models, which learn from internet-scraped text, are significantly shaped by the HTML extractors used to collect that data. Extractors such as Resiliparse, Trafilatura, and JusText recover different content from the same web pages, affecting both the quantity and the quality of the resulting training data. Combining multiple extractors can increase token yield by up to 71% without compromising benchmark performance, suggesting that existing data pipelines leave much valuable content untapped.

Why This Matters

The research highlights a crucial aspect of language model training that is often overlooked: the choice of HTML extractor. This detail can dramatically alter the amount and type of internet data used to train models, impacting their effectiveness and efficiency. Understanding this can lead to more comprehensive and representative training datasets, which is vital as the availability of high-quality internet data dwindles.

How You Can Use This Info

Professionals involved in AI development or data management should consider using a combination of HTML extractors to maximize data collection from the internet. This approach can help create richer datasets, potentially leading to more robust and capable AI models. Additionally, being aware of the limitations and biases introduced by data extraction tools can inform better decision-making in AI projects and data strategy.
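The combination strategy can be sketched as an ensemble: run several extractors over the same HTML and keep the union of the paragraphs they recover. This is a minimal illustration, not the paper's pipeline; the two toy regex extractors below stand in for real libraries like Trafilatura or JusText (in practice you would call e.g. `trafilatura.extract(html)`), and the merging logic is a simple order-preserving dedupe.

```python
import re

def extract_paragraph_tags(html):
    """Conservative stand-in extractor: keeps only <p> contents."""
    return [re.sub(r"<[^>]+>", "", m).strip()
            for m in re.findall(r"<p[^>]*>(.*?)</p>", html, re.S)]

def extract_all_text(html):
    """Aggressive stand-in extractor: strips every tag, splits on lines."""
    text = re.sub(r"<(script|style)[^>]*>.*?</\1>", "", html, flags=re.S)
    text = re.sub(r"<[^>]+>", "\n", text)
    return [p.strip() for p in text.split("\n") if p.strip()]

def ensemble_extract(html, extractors):
    """Union of paragraphs across extractors, preserving first-seen order."""
    seen, merged = set(), []
    for extract in extractors:
        for para in extract(html):
            if para not in seen:
                seen.add(para)
                merged.append(para)
    return merged

html = """<html><body>
<nav>Home | About</nav>
<p>Main article text.</p>
<div>A sidebar note the stricter extractor misses.</div>
</body></html>"""

paragraphs = ensemble_extract(html, [extract_paragraph_tags, extract_all_text])
```

Note the trade-off the study also observes: the aggressive extractor adds real content the conservative one misses (the sidebar note) but drags in boilerplate ("Home | About"), so a production ensemble would pair this union with quality filtering.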

Read the full article