Skip to main content Skip to secondary navigation
Page Content

Strengthening AI Accountability Through Better Third Party Evaluations

At a recent Stanford-MIT-Princeton workshop, experts highlight the need for legal protections, standardized evaluation practices, and better terminology to support third-party AI evaluations.

Image
computer and magnifying glass

Millions of people worldwide use general purpose AI systems: They use ChatGPT to write documents, Claude to analyze data, and Stable Diffusion to generate images. While these AI systems offer significant benefits, they also pose serious risks, like producing non-consensual intimate imagery, facilitating production of bioweapons, and contributing to biased decisions. Third party AI evaluations are crucial for assessing these risks because they are independent from company interests and incorporate diverse perspectives and expertise that better reflect the wide range of real-world applications.

While software security has developed reporting infrastructure, legal protections and incentives (e.g., bug bounties) to incentivize third party evaluation, this is not yet the case for general purpose AI systems. This is why researchers at Stanford HAI's Center for Research on Foundation Models, Massachusetts Institute of Technology (MIT), Princeton’s Center for Information Technology Policy, and Humane Intelligence convened leaders from academia, industry, civil society and government for an October 28, 2024 virtual workshop to articulate a vision for third party AI evaluations.

Key takeaways from the workshop, reflecting areas of agreement among many speakers, included the need for more legal and technical protections for third party evaluations – reinforcing scholars' call for more legal and technical protections (also known as "safe harbors") for third party AI evaluators – along with the need for more standardization and coordination of evaluation processes and shared terminology.

The workshop spanned three sessions exploring evaluations in practiceevaluations by design, and evaluation law and policy, beginning with a keynote from Rumman Chowdhury, CEO at Humane Intelligence.

You can watch the full workshop here.

The Need for Independent Oversight 

In her keynote, Chowdhury compared the status quo to a new Gilded Age in that it is characterized by major economic disruption and a lack of protections for users and citizens. She stressed the need for independent oversight: The standard practice where "companies write their own tests and they grade themselves" can result in biased evaluations and limit standardization, information sharing, and generalizability beyond specific settings.

In contrast, third party evaluators can have more depth, breadth, and independence in their assessments. Chowdhury contended that while the software security space and penetration testing may offer some lessons like legal protections for third party evaluators, AI evaluations are more complex than software testing: AI systems are probabilistic and it is difficult to precisely identify negative impacts from these systems and the mechanisms by which they occur. This, in turn, makes mitigation challenging. She called for more legal protections for third party evaluators, building a robust talent pipeline and engaging multiple stakeholders, including lawyers, AI specialists, and auditors.

Current AI Evaluation Practices

The first panel featured presentations by Nicholas Carlini, Research Scientist at Google DeepMind; Lama Ahmad, Technical Program Manager and Red Teaming Lead at OpenAI; Avijit Ghosh, Applied Policy Researcher at Hugging Face; and Victoria Westerhoff, Director of the AI Red Team at Microsoft.

Carlini shared insights about his experience evaluating AI models, such as attacks that lead a foundation model to divulge personal information from the training dataset or that involve stealing parts of a production language model. Carlini started out researching software vulnerabilities as a penetration tester, but later shifted to machine learning. He shared that while penetration testing is a standardized procedure with rules around who to disclose vulnerabilities to and when, this is not true for AI vulnerabilities where his process is as ad hoc as "write research paper, upload to arXiv". For example, it is not clear who to disclose to when a vulnerability resides in an AI model API, and thus affects not just the model developer, but also any deployer using that API. He expressed a wish for better established norms.

Ahmad described how third party evaluations are conducted at OpenAI, distinguishing three forms of evaluations. First, the company solicits external red teaming as part of OpenAI’s Red Teaming Network that aims to discover novel risks or stress tests systems and results in new evaluations. Second, it conducts third party assessments that specialize in particular issue areas to provide subject matter expertise, such as partnerships with AI safety institutes. Third, the company conducts independent research that promotes an ecosystem of better evaluations and methods for alignment. Examples in this area include OpenAI’s efforts to fund research on democratic inputs to AI and its collaboration with Stanford’s Center for Research on Foundation Models on its Holistic Evaluation of Language Models. Ahmad argued that all three forms are needed in the face of a rapidly evolving technological landscape. She also highlighted the challenge of building trust given the growth of the third party evaluation ecosystem and the lack of clear frameworks laying out which third party evaluators are trustworthy and in what areas.

Ghosh presented the coordinated flaw disclosures (CFD) framework for AI vulnerabilities as an analog to the coordinated vulnerability disclosure (CVD) framework for software vulnerabilities. This framework aims to address the unique challenges of AI systems, which include the complexity of ethical and safety questions, scalability issues, and the lack of a unified enumeration of products and weaknesses. The CFD framework has multiple components, including extended model cards, automated verification, an adjudication process and a dynamic scope that adapts to emerging common uses of AI.

Westerhoff described lessons from the work of the internal AI red team she leads at Microsoft. She highlighted the importance of diverse teams to enable diverse testing, especially in light of the multi-modality of models. To achieve this diversity, her team collaborates with teams across Microsoft and external experts, such as experts on specific types of harms. She has found that AI systems are susceptible to similar attacks that humans are susceptible to, and that understanding a product well helps to understand its potential harms. Westerhoff described her team’s development of the open source red teaming framework Python Risk Identification Tool for generative AI (PyRIT), which is part of her general aim to contribute to the field through greater transparency, training, tooling, and reporting mechanisms for third party evaluators. She expressed the hope that going forward, industry researchers would share more insights with each other, including on red teaming techniques.

The Design of Evaluations Can Draw From Audits and Software Security

The second panel provided insights on the design of third party evaluations, featuring presentations by Deb Raji, Fellow at the Mozilla Foundation and PhD Candidate at Berkeley; Casey Ellis, Chairman and Founder of Bugcrowd; Jonathan Spring, Deputy Chief AI Officer at CISA; and Lauren McIlvenny, Technical Director and CERT Threat Analysis Director at the AI Security Incident Response Team (AISIRT) at CMU.

Raji highlighted a number of challenges that make AI audits challenging based on her experiences as an auditor and researcher. She described that AI auditors face a variety of different barriers, including a lack of protection against retaliation, little standardization of the audit and certification process, and little accountability of the audited party. In her role at Mozilla, she shaped the development of open-source audit tooling, and argued that more capacity-building and transparency is needed when it comes to audit tools. In terms of institutional design of audit systems, she pointed to case studies from other domains such as the FDA that may be instructive for the AI space. She called for an expansion of the audit scope to include entire sociotechnical systems, a wider variety of methodologies and tooling, and maintaining clear audit targets and objectives, while also recognizing the limits of audits and calling for a prohibition on certain uses of AI systems in some cases.

As the founder of Bugcrowd, a company that offers bug bounty and vulnerability disclosure programs, Ellis contended that discovering AI vulnerabilities is very similar to discovering software vulnerabilities. He mentioned the need to build a consensus around common language use to shape law and policy effectively. For example, software security researchers tend to use the term "security," whereas AI researchers tend to use the term "safety." Further, in his view, vulnerabilities will be an issue in any system — they are unavoidable, so the best we can do is to prepare for them. He argued that AI can be used as a tool to increase attacker productivity, as a target to exploit weaknesses in a system, and as a threat via unintended security consequences because of the use and integration of AI.

Drawing on his work as Deputy Chief AI Officer at the U.S. Cybersecurity and Infrastructure Security Agency (CISA), Spring laid out seven concrete goals for security by design: (1) increase use of multi-factor authentication, (2) reduce default passwords, (3) reduce entire classes of vulnerability, (4) increase the installation of security patches by customers, (5) publish a vulnerability disclosure policy, (6) demonstrate transparency in vulnerability reporting, and (7) increase the ability for customers to gather evidence of cybersecurity instructions affecting their products. He made the case that how the community addresses AI vulnerabilities should align with, and be integrated with, how cybersecurity vulnerabilities are managed, including as part of CISA’s related work.

In line with Spring, McIlvenny stressed the similarities between software security and AI safety. She described the longstanding practices of CVD for software vulnerabilities, noting that AI vulnerabilities are more complex because vulnerabilities can reside anywhere across an AI system – not only in the AI model, but also other software surrounding it – and that the culture of the AI community does not share the software engineering history of cybersecurity practices. She stressed that AI is software, and software engineering matters: "If secure development practices were followed, about half of those vulnerabilities that I've seen in AI systems would be gone." She also highlighted the need for coordination – between those who discover a problem and those who can fix it – and urged participants to focus on fixing problems rather than arguing about definitions. 

Shaping Law and Policy to Protect Evaluators

The third panel focused on the law and politics of third party AI evaluations, featuring Harley Geiger, Coordinator at the Hacking Policy Council; Ilona Cohen, Chief Legal and Policy Officer at HackerOne; and Amit Elazari, Co-Founder and CEO at OpenPolicy.

Taking a big picture view, Geiger said that law and policy will shape the future of AI evaluation, and existing laws on software security need to be updated to accommodate AI security research. He noted gaps in the current legal landscape, explaining that while existing software covers security testing for software, laws need to explicitly cover non-security AI risks – such as bias, discrimination, toxicity, and harmful outputs – to prevent researchers from facing undue liability. He stressed the importance of consistent terminology and clarifying legal protections for third party evaluation and the developing a process for flaw disclosure, and mentioned that real-life examples of how research has been chilled would be very helpful in advocating for legal protections.

Cohen agreed on the importance of shared terminology, noting that she uses the term "AI security" to describe a focus on protecting AI systems from the risks of the outside world, whereas she uses "AI safety" to describe a focus on protecting the outside world from AI risks. She shared success stories of third party AI evaluations, noting that HackerOne’s programs have helped minimize legal risks for participants and made foundation models safer. Cohen argued that red teaming is more effective the more closely it mimics real-world usage. She also provided a brief overview of the fractured landscape of AI regulation.

Rounding out the speakers' remarks, Elazari stressed the unique moment we are in given that we are still shaping the definitions and concepts such as AI and red teaming. She highlighted the power of policy work, noting that for security research, she sees success in expanded safe harbor agreements alongside a movement for bug bounties and vulnerability disclosures. For the AI space, she argued that "we have this gap between an amazing moment where there is urgency around embracing red teaming, but not enough focus on creating the protections [for third party evaluators] that we need." She closed with a powerful message urging participants to help shape law and policy for the better: "This is a call to action to you [...] There is a lot of power in the community coming together."

Watch the full conference video

Ruth E. Appel is a researcher at Stanford University and holds a Master's in CS and PhD in Political Communication from Stanford, and an MPP from Sciences Po Paris.

More News Topics