Common Crawl Foundation | Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data
Learn about Common Crawl's insights from a recent data product and informed solutions for the future of public web data.
The Common Crawl Foundation is dedicated to preserving humanity's knowledge and making it accessible through its free public web dataset, a vital resource since 2008. As AI development accelerates, concerns have emerged about the accessibility and transparency of public web data. Open datasets are affected in three key ways: robots.txt exclusions, legal demands, and "bot defenses." Two of these are not publicly visible and remain poorly understood. In this seminar, Common Crawl will present insights from a new data product that uses Common Crawl's crawl metadata to visually explore these three problems, and will advocate for greater transparency and informed solutions for the future of public web data.