November 17 Agenda: Workshop on Sociotechnical AI Safety
Time | Agenda Item
8:30am | Check-In/Breakfast Available
9:00am | Welcome Back
9:00-10:00am | System Safety for Responsible ML Development: Translating and Expanding System Theoretic Process Analysis
Speaker: Shalaleh Rismani (MILA)
Discussant: Julian Michael (NYU)
Abstract: The increased recognition and understanding of algorithmic harm, together with recent developments in the regulation of Machine Learning (ML) systems, has created the need for companies to develop strategies and methods for conducting risk management for their products. As these regulations are operationalized, companies need to implement the processes and tools required to meet incoming regulations and standards. Scholars and practitioners have recognized the potential value of safety engineering frameworks for providing a systematic approach to identifying and managing sources of algorithmic harm and for facilitating the integration of responsible ML practices such as impact assessments and auditing. In this work, we draw attention to this body of work and focus on empirically evaluating System Theoretic Process Analysis (STPA), a hazard analysis framework based on system theory. We propose STPA4ML, a set of guidelines for conducting hazard analysis for ML systems that extends the existing STPA approach. We illustrate the utility of the method using three different ML systems (transformer-based text-to-image models, linear regression, and reinforcement learning) and two different applications (creative practice and medicine). We discovered a broad range of failures and hazards (functional, social, and ethical) by analyzing interactions (between different ML models in a product, between the ML product and its users, and between development teams) and processes (e.g., preparation of training data or workflows for using an ML service/product). Our findings underscore the value of examining algorithmic harms at the system level (i.e., beyond an individual ML model) as the need for impact and conformity assessments continues to increase.
10:00-10:30am | Break
10:30-11:30am | Deep Risk Mapping
Speaker: Tegan Maharaj (Toronto)
Discussant: Deb Raji
Abstract: As an increasing number of nations and world powers release policy guidelines on AI, a trend has emerged of ensuring responsible development via risk assessments. This is worrying: there is very little work on AI risk assessment, it is unclear what such assessment would look like in either theory or practice, and it is far from clear whether techniques currently exist that could achieve the stated objectives. Drawing on black feminist, decolonial, and environmentalist theory, illustrated mathematically, I argue that in both cases the breakdown and simplification of risk has had pervasive negative consequences. Specifically, I argue that risks arising from (1) the interaction of already-identified risks, i.e., intersectional risks, (2) changes to the distribution of risk (while expected value remains approximately constant), (3) agentic behaviour (i.e., actions taken that change the assumed underlying random process), (4) systemic or emergent effects, and (5) unquantifiable or less easily quantifiable phenomena (including non-monetary/non-economic ones) are neglected, under-analyzed, and underestimated. I propose a framework for risk identification and analysis called deep risk mapping. The framework consists of guidelines, techniques, visualizations, and directions for novel theory based on combining agent-based modelling with deep learning as a decision-support tool. Roughly speaking, the framework uses a human-in-the-loop, deep-learning-based approach to training agent-based models to simulate possible futures. In contrast to many technical frameworks, I emphasize responsible research methods and human training and communication as integral to the deep risk mapping framework. I use two case studies to identify ways in which the framework addresses these five categories of risk, illustrating where possible the differences between 'classic' and the proposed methodology for risk mapping, as well as limitations and directions for further study.
11:30-11:45am | Break
11:45am-12:45pm | You Can’t Have AI Safety Without Inclusion
Speaker: Dylan Hadfield-Menell (MIT)
Discussant: Atoosa Kasirzadeh (Edinburgh)
Abstract: It has long been observed that specifying goals for agents is a challenging problem. As Kerr's classic 1975 paper 'On the Folly of Rewarding A while Hoping for B' observes, reward systems often create incentives for undesired behavior. This concern motivates work in AI alignment: how can we specify incentives for AI systems such that optimization induces behavior that reliably accomplishes our subjective goals? In this talk, I will discuss how brittle alignment arises as a natural consequence of incomplete goal specification. I will present a theoretical model that gives sufficient conditions under which uncontrolled optimization of any goal that fails to measure some features of value eventually produces worse outcomes than no optimization at all. Next, I will show how the same theoretical result applies to questions of inclusion in value specification: if we reframe the model so that the different features of value are how different people define value, then optimizing an incomplete goal can be expected to harm excluded people. As a result, technology that aligns an agent with a single person's or organization's values is dangerous. I will conclude with a discussion of research directions for multi-stakeholder alignment and the need for decentralized value learning and specification.
12:45-1:45pm | Lunch
1:45-2:45pm | STELA: Community-Centred Approach to Value Elicitation for AI Alignment
Speaker: Iason Gabriel and Nahema Marchal (Google DeepMind)
Discussant: Meg Young (Data & Society)
Abstract: Value alignment, the process of ensuring that artificial intelligence (AI) systems are aligned with human values and goals, is a critical issue in AI research. Existing scholarship has mainly studied how to encode moral values into agents to guide their behaviour. Less attention has been given to the normative questions of whose values and norms AI systems should be aligned with, and how these choices should be made. To tackle these questions, this paper presents the STELA process (SocioTEchnical Language agent Alignment), a methodology resting on sociotechnical traditions of participatory, inclusive, and community-centred processes. For STELA, we conduct a series of deliberative discussions with four historically underrepresented groups in the United States in order to understand their diverse priorities and concerns when interacting with AI systems. The results of our research suggest that community-centred deliberation on the outputs of large language models is a valuable tool for eliciting latent normative perspectives directly from differently situated groups. In addition to having the potential to engender an inclusive process that is responsive to the needs of communities, this methodology can provide rich contextual insights for AI alignment.
2:45-3:15pm | Break
3:15-4:15pm | Collective Constitutional AI: Aligning a Language Model with Public Input
Speaker: Deep Ganguli (Anthropic)
Discussant: Ting-An Lin (Stanford)
Abstract: TBC
4:15-4:30pm | Break
4:30-5:30pm | Entangled Preferences: The History and Risks of Reinforcement Learning and Human Feedback
Speaker: Nathan Lambert (AI2)
Discussant: Kristian Lum (University of Chicago)
Abstract: Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) easier to use and more effective. A core piece of the RLHF process, the training and use of a model of human preferences that acts as a reward function for optimization, operates at the intersection of many stakeholders and academic disciplines and remains poorly understood. RLHF reward models are often cited as central to achieving performance, yet very few descriptors of their capabilities, evaluations, training methods, or open-source models exist. Given this lack of information, further study of and transparency around learned RLHF reward models are needed. In this paper, we illustrate the complex history of optimizing preferences and articulate lines of inquiry for understanding the sociotechnical context of reward models. In particular, we highlight the ontological differences between costs, rewards, and preferences at stake in RLHF’s foundations, related methodological tensions, and possible research directions to improve general understanding of how reward models function.
5:30-5:45pm | Break
5:45-6:30pm | Closing Roundtable Discussion
This workshop is supported by the ANU Machine Intelligence and Normative Theory Lab, the Stanford Institute for Human-Centered AI, and the Stanford McCoy Family Center for Ethics in Society.