November 17 Agenda: Workshop on Sociotechnical AI Safety

NOVEMBER 17

8:30am

Check-In/Breakfast Available

9:00am

Welcome Back

9:00-10:00am

System Safety for Responsible ML Development: Translating and Expanding System Theoretic Process Analysis

Speaker: Shalaleh Rismani (MILA)

Discussant: Julian Michael (NYU)

Abstract: The increased recognition and understanding of algorithmic harm, together with recent developments in the regulation of machine learning (ML) systems, have created a need for companies to develop strategies and methods for conducting risk management for their products. As these regulations are operationalized, companies need to put in place the processes and tools required to meet incoming regulations and standards. Scholars and practitioners have recognized the potential of safety engineering frameworks to provide a systematic approach to identifying and managing sources of algorithmic harm and to facilitate the integration of responsible ML practices such as impact assessments and auditing. In this work, we draw attention to this body of work and focus on empirically evaluating System Theoretic Process Analysis (STPA), a hazard analysis framework based on systems theory. We propose a set of guidelines for conducting hazard analysis for ML systems, which we call STPA4ML, based on extending the existing STPA approach. This new analytic framework considers that:

  • Harms and losses from ML-based systems go beyond the traditional losses and harms considered for safety-critical systems.

  • ML-based systems are unique along two dimensions: output complexity and evolving capability.

  • The intersection between the ML development pipeline, the ML product, and software development processes can be a source of harm.

We illustrate the utility of the method using three different ML systems (transformer-based text-to-image models, linear regression, and reinforcement learning) and two different applications (creative practice and medicine). We discovered a broad range of failures and hazards (functional, social, and ethical) by analyzing interactions (between different ML models in the product, between the ML product and the user, and between development teams) and processes (e.g., preparation of training data or workflows for using an ML service/product). Our findings underscore the value of examining algorithmic harms at the system level, i.e., beyond an individual ML model, as the need for impact and conformity assessments continues to increase.

10:00-10:30am

Break 

10:30-11:30am

Deep Risk Mapping

Speaker: Tegan Maharaj (Toronto)

Discussant: Deb Raji

Abstract: As an increasing number of nations and world powers release policy guidelines on AI, a trend has emerged of seeking to ensure responsible development via risk assessments. This is worrying: there is very little work on AI risk assessment, it is unclear what such an assessment would look like in either theory or practice, and it is far from clear whether techniques currently exist that could achieve the stated objective(s).

A risk is variously defined as a chance of harm [1], statistically expected loss [2], measurable uncertainty [3], effect of uncertainty on objectives [4], and so on. Some of these notions have specific associated quantifications or qualitative models. For example, a widely used quantification defines risk as the probability of an event multiplied by its magnitude [5]. This formulation guides qualitative risk assessment as well; in general, the work of risk analysis is conceptually and practically separated into phases of identifying (characterizing, specifying, threat modelling, describing, simplifying, modelling, separating) and assessing (quantifying, categorizing, assigning probabilities). Another example, the statistically expected loss, is foundational to empirical risk minimization, the standard training framework for deep learning and hence for modern AI systems such as ChatGPT.
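
For orientation, the two quantitative notions mentioned above are commonly written as follows (standard textbook forms, included here as background rather than notation taken from the talk):

\[ \mathrm{risk}(E) \;=\; \Pr(E) \times \mathrm{magnitude}(E) \]

\[ R(f) \;=\; \mathbb{E}_{(x,y)\sim D}\big[\ell(f(x),y)\big], \qquad \hat{R}_n(f) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i),y_i\big) \]

The first line is the probability-times-magnitude quantification [5]; the second gives the statistically expected loss [2] together with the empirical estimate that empirical risk minimization actually optimizes.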

Drawing on black feminist, decolonial, and environmentalist theory, and illustrating the argument mathematically, I argue that in both cases the breakdown and simplification of risk has had pervasive negative consequences. Specifically, I argue that risks arising from (1) interactions of already-identified risks, i.e. intersectional risks, (2) changes to the distribution of risk (while its expected value remains approximately constant), (3) agentic behaviour (i.e. actions taken that change the assumed underlying random process), (4) systemic or emergent effects, and (5) unquantifiable or less easily quantifiable phenomena (including non-monetary/non-economic ones) are neglected, under-analyzed, and under-estimated.
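
As a purely illustrative example of category (2), not drawn from the talk: a certain loss of 1 and a loss of 1000 incurred with probability 0.001 have the same expected loss,

\[ 1 \times 1 \;=\; 0.001 \times 1000 \;=\; 1, \]

yet they distribute risk very differently, and an assessment that tracks only expected value cannot tell them apart.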

I propose a framework for risk identification and analysis called deep risk mapping. The framework consists of guidelines, techniques, visualizations, and directions for novel theory based on combining agent-based modelling with deep learning as a decision-support tool. Roughly speaking, it uses a humans-in-the-loop, deep-learning-based approach to training agent-based models that simulate possible futures. In contrast to many technical frameworks, I emphasize responsible research methods, human training, and communication as integral to the deep risk mapping framework.

I use two case studies to identify ways in which the framework addresses these five categories of risk, illustrating where possible the differences between 'classic' risk-mapping methodology and the proposed one, as well as limitations and directions for further study.

11:30-11:45am

Break 

11:45-12:45pm

You Can’t Have AI Safety Without Inclusion

Speaker: Dylan Hadfield-Menell (MIT)

Discussant: Atoosa Kasirzadeh (Edinburgh)

Abstract: It has long been observed that specifying goals for agents is a challenging problem. As Kerr's classic 1975 paper 'On the Folly of Rewarding A, While Hoping for B' observes, reward systems often create incentives for undesired behavior. This concern motivates work in AI alignment: how can we specify incentives for AI systems such that optimization induces behavior that reliably accomplishes our subjective goals? In this talk, I will discuss how brittle alignment arises as a natural consequence of incomplete goal specification. I will present a theoretical model that gives sufficient conditions under which uncontrolled optimization of any goal that fails to measure some features of value eventually produces worse outcomes than no optimization at all. Next, I will show how the same theoretical result applies to questions of inclusion in value specification: if we reframe the model so that the different features of value are how different people define value, then optimizing an incomplete goal can be expected to harm excluded people. As a result, technology that aligns an agent with a single person's or organization's values is dangerous. I will conclude with a discussion of research directions for multi-stakeholder alignment and the need for decentralized value learning and specification.
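
One schematic way to formalize the overoptimization claim (a sketch in the spirit of the abstract, not necessarily the exact model presented in the talk) is to let true utility depend on a set of attributes while the specified proxy measures only a strict subset of them:

\[ U(s) \;=\; u\big(a_1(s),\dots,a_n(s)\big), \qquad U_J(s) \;=\; u_J\big(a_j(s) : j \in J\big), \quad J \subsetneq \{1,\dots,n\}. \]

If u is increasing in every attribute and the attributes compete for a shared, bounded resource, then unbounded optimization of the proxy U_J drives the unmeasured attributes toward their worst values, so the true utility of the optimized state eventually falls below that of the starting state. Under the inclusion reframing, each attribute is a different person's notion of value and J is the set of people the objective includes, so the same conditions imply that optimization eventually leaves those outside J worse off than no optimization.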

12:45-1:45pm

Lunch

1:45-2:45pm

STELA: Community-Centred Approach to Value Elicitation for AI Alignment

Speaker: Iason Gabriel and Nahema Marchal (Google DeepMind)

Discussant: Meg Young (Data & Society)

Abstract: Value alignment, the process of ensuring that artificial intelligence (AI) systems are aligned with human values and goals, is a critical issue in AI research. Existing scholarship has mainly studied how to encode moral values into agents to guide their behaviour. Less attention has been given to the normative questions of whose values and norms AI systems should be aligned with, and how these choices should be made. To tackle these questions, this paper presents the STELA process (SocioTEchnical Language agent Alignment), a methodology resting on sociotechnical traditions of participatory, inclusive, and community-centred processes. For STELA, we conduct a series of deliberative discussions with four historically underrepresented groups in the United States in order to understand their diverse priorities and concerns when interacting with AI systems. The results of our research suggest that community-centred deliberation on the outputs of large language models is a valuable tool for eliciting latent normative perspectives directly from differently situated groups. In addition to having the potential to engender an inclusive process that is responsive to the needs of communities, this methodology can provide rich contextual insights for AI alignment.

2:45-3:15pm

Break 

3:15-4:15pm

Collective Constitutional AI: Aligning a Language Model with Public Input

Speaker: Deep Ganguli (Anthropic)

Discussant: Ting-An Lin (Stanford)

Abstract: TBC

4:15-4:30pm

Break 

4:30-5:30pm

Entangled Preferences: The History and Risks of Reinforcement Learning and Human Feedback

Speaker: Nathan Lambert (AI2)

Discussant: Kristian Lum (University of Chicago)

Abstract: Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique for making large language models (LLMs) easier to use and more effective. A core piece of the RLHF process, the training and use of a model of human preferences that acts as a reward function for optimization, operates at the intersection of many stakeholders and academic disciplines and is poorly understood. RLHF reward models are often cited as central to achieving performance, yet very few descriptions of their capabilities, evaluations, training methods, or open-source models exist. Given this lack of information, further study of, and transparency around, learned RLHF reward models are needed. In this paper, we illustrate the complex history of optimizing preferences and articulate lines of inquiry to understand the sociotechnical context of reward models. In particular, we highlight the ontological differences between the costs, rewards, and preferences at stake in RLHF’s foundations, related methodological tensions, and possible research directions to improve general understanding of how reward models function.
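
As background for readers unfamiliar with the pipeline (standard formulations, not specific to this paper): the reward model r_θ is typically trained on pairwise human preferences under a Bradley–Terry model, and the language model policy π is then optimized against it with a KL penalty toward a reference model π_ref,

\[ \Pr(y_w \succ y_l \mid x) \;=\; \sigma\big(r_\theta(x,y_w) - r_\theta(x,y_l)\big), \]

\[ \max_{\pi}\;\; \mathbb{E}_{x\sim D,\, y\sim \pi(\cdot\mid x)}\big[r_\theta(x,y)\big] \;-\; \beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big). \]

It is this learned reward model, standing in for human preferences during optimization, whose sociotechnical context the paper examines.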

5:30-5:45pm

Break 

5:45-6:30pm

Closing Roundtable Discussion

This workshop is supported by the ANU Machine Intelligence and Normative Theory Lab, the Institute for Human-Centered AI, Stanford, and the McCoy Family Center for Ethics in Society, Stanford.