Common Crawl Foundation | Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data | Stanford HAI
Event: Seminar

Status: Past
Date: Wednesday, October 22, 2025, 12:00 PM - 1:15 PM PST/PDT
Location: 353 Jane Stanford Way, Stanford, CA, 94305 | Room 119
Topics: Privacy, Safety, Security
Overview

Learn about Common Crawl's insights from a recent data product and informed solutions for the future of public web data.

The Common Crawl Foundation is dedicated to preserving humanity's knowledge and making it accessible through its free public web dataset, a vital resource since 2008. As AI development accelerates, concerns have emerged about the accessibility and transparency of public web data. Three developments in particular affect open datasets: robots.txt exclusions, legal demands, and "bot defenses." Two of these are not publicly visible and are not well understood. In this seminar, Common Crawl will present insights from a new data product that uses its crawl metadata to visually explore these three problems, and will advocate for greater transparency and informed solutions for the future of public web data.

Event Contact
Stanford HAI
stanford-hai@stanford.edu
More from HAI and SDS seminars
  • Inside the 2026 AI Index Report | Stanford HAI
    Seminar, May 20, 2026, 12:00 PM - 1:15 PM

    The AI Index, currently in its ninth year, tracks, collates, distills, and visualizes data relating to artificial intelligence.

Related Events

Eyck Freymann | AI and Strategic Stability: A Framework for U.S.–China Technology Competition
Seminar, May 27, 2026, 12:00 PM - 1:15 PM

Strategic stability exists when neither side thinks it can improve its strategic outcome by striking first.

Ashesh Rambachan | From Next-Token Prediction to Automatic Induction of Automata
Event, Apr 13, 2026, 12:00 PM - 1:00 PM

Sequence data is ubiquitous in economics — job histories in labor economics, diagnosis and treatment sequences in health economics, strategic interactions in game theory. Generative sequence models can learn to predict these sequences well, but their complexity makes it hard to extract interpretable economic insights from their predictions.

Caroline Meinhardt, Thomas Mullaney, Juan N. Pava, and Diyi Yang | How Can AI Support Language Digitization and Digital Inclusion?
Seminar, Apr 15, 2026, 12:00 PM - 1:15 PM

What does digital inclusion look like in the age of AI? Over 6,000 of the world’s 7,000-plus living languages remain digitally disadvantaged.
