Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Common Crawl is a 501(c)(3) non-profit founded in 2007. We make wholesale extraction, transformation and analysis of open web data accessible to researchers.
A new Common Crawl index annotation has been added to Hugging Face and our S3 bucket.
Thijs Dalhuijsen
Thijs Dalhuijsen is a Senior Software Engineer at Common Crawl. He works on backend systems, automation, and data infrastructure to power large-scale web access and analysis.