I was once asked by a colleague in the Philosophy Department here at Stanford if robot musicians will ever exist, to which I replied that they may — someday — but only if we first figure out what it means to have robot philosophers. The exchange was admittedly a bit tongue-in-cheek, but it revealed a blind spot in the way we talk about the future of AI: in our tendency to ask whether or when a given task will be taken over by automation, it is easy to ignore the deeper issue of what such a takeover would mean.
This may be an understandable oversight when we’re thinking about manufacturing, clerical work or even driving a car. We’re less concerned with how these tasks are accomplished, and more concerned with the outcome — generally measured in cost, speed and safety. But when we imagine “automating” a pursuit like music making, we’re forced to balance the product of work with something deeper — the meaning we derive from the process of doing it.
Of course, automation is only accelerating in the age of AI, and it’s natural to ask how far it will go. But I believe this vision of robot musicians and philosophers suggests we have a deeper question to answer first: which values should guide it?
The Artful Design of Technology
As an Associate Professor at Stanford University’s CCRMA (Center for Computer Research in Music and Acoustics — pronounced “karma”), I’d like to think I have a unique perspective on a question like this. I design programming languages for music, instruments and musical toys like Ocarina for the iPhone, direct the Stanford Laptop Orchestra, and explore VR/AR design for music. I am a part of the Stanford Human-centered AI initiative, and my students and I research the design of systems that fundamentally fuse technology and human interaction, in the pursuit of new tools for musical and other forms of human creativity. Supported by a 2016 Guggenheim Fellowship, I wrote Artful Design: Technology in Search of the Sublime, a comic book about the shaping of technology — and how technology shapes us.
Full Automation: AI as a Big Red Button
Let’s start by exploring why total automation may not be the endgame for AI we tend to think it is.
As Michael Polanyi once noted about the tacit nature of human knowledge, we know more than we can tell. This is part of what gives AI, and deep learning in particular, its incredible allure; its ability to spot patterns in complex phenomena that defy rule-based descriptions means it can essentially understand them “for us”. Unfortunately, it also makes it tempting to think of AI as a “Big Red Button” — a technology that reliably delivers the right answers while hiding the process that leads to them.
Take the notion of artistic style, for instance. Most of us would agree it’s “a thing”, but would find it tricky to define, and even more difficult — if not impossible — to explicitly program. We might recognize painting “in the style” of Van Gogh, or writing “in the style” of Hemingway, but we’d likely describe such qualities in generalities, and find them nontrivial to emulate ourselves. These are cases of knowing more than we are able to tell — and where deep artificial neural networks really shine.
For example, through style-transfer deep artificial neural networks, made possible by Leon Gatys et al.’s original work on neural style transfer and leading to projects like Google’s Deep Art, one can infuse the “style” of Starry Night onto photos — including, for instance, Cool Cat — creating a kind of hybrid image we might call “Deep Cat”. Technologically, this technique is remarkable, but it also highlights a critical difference between the notions of style and artistic meaning: while the former has been successfully harnessed in this example, the latter has been utterly side-stepped. For an even more extreme example, take a look at the following:
It’s nearly comical to view these images side-by-side. On the left is a snapshot of a blissful moment, worlds apart from the angst captured on the right: an existential meltdown, a reaction to the absurdity of modern life, or however you interpret Edvard Munch’s The Scream of Nature. But once combined, all sense of meaning goes out the window. Is our happy couple still on vacation? And are they really not alarmed by the river of lava behind them, or the fiery firmament above them?!
The style-infused vacation photo is pleasant to look at, but its meaning is underwhelming, casual, kitschy. The Scream, in contrast, is a work of Art, inviting us to reflect, to feel. One image simply says “oh, I’ve been there,” while the other screams, silently, “I’ve been there.” It’s the difference between style — something AI is developing an impressive grasp of — and meaning — something even we humans still struggle with.
This brings us to the three primary shortcomings that define a “Big Red Button” (BRB) system.
First, to improve the BRB system’s output, you’d either have to start over with different input — or start waaay over and modify the system itself.
Second, BRB systems offer little user control. What if you wanted to specify how much of Starry Night’s style to impart onto your cat photo? Or mix and match styles in various balances? This kind of flexibility underlies our notions of experimentation, craft, and creativity — and is fundamental to the very concept of a tool.
Finally, and perhaps most importantly, there’s the fact that we don’t just value the product of our work; we often value the process. For example, while we enjoy eating ready-made dishes, we also enjoy the act of cooking for its intrinsic experience — taking raw ingredients and shaping them into food. Or take music; we may have access to more of it than ever before in the form of recordings — many of which represent the very pinnacle of the art — but we haven’t stopped singing, playing and composing for ourselves. From the earliest days of radio and recording to digital music, and now streaming, there remains, through various technological innovations, an intrinsic joy to the activity of making music.
It’s clear there is something worth preserving in many of the things we do in life, which is why automation can’t be reduced to a simple binary between “manual” and “automatic.” Instead, it’s about searching for the right balance between aspects that we would find useful to automate, versus tasks in which it might remain meaningful for us to participate. As easy as it can be to embrace the extremes — to rush into automating everything or to insist on automating nothing — ideal solutions often exist somewhere in between, as a duality between automation and human interaction, between autonomous technology and the tools we wield.
A Different Approach: Designing with a Human in the Loop
What if, instead of thinking of automation as the removal of human involvement from a task, we imagined it as the selective inclusion of human participation? The result would be a process that harnesses the efficiency of intelligent automation while remaining amenable to human feedback, all while retaining a greater sense of meaning.
Essentially, the human-in-the-loop approach reframes an automation problem as a Human-Computer Interaction (HCI) design problem. In turn, we’ve broadened the question of “how do we build a smarter system?” to “how do we incorporate useful, meaningful human interaction into the system?”
This kind of design is at the center of research in fields like Interactive Machine Learning, in which intelligent systems are designed to augment or enhance the human, serving as a tool to be wielded through human interaction. We can see this type of research expressed in the works of Allison Parrish, poetical engineer, as well as at the Stanford HAI launch event in March 2019, which highlighted collaborative social systems (Michael Bernstein), computers that learn to help (Emma Brunskill), ambient intelligence in AI-assisted hospitals (Serena Yeung), and interaction design for autonomy (Dorsa Sadigh).
Examples of Human-in-the-Loop Design
A student in my Thinking Matters course at Stanford, Design that Understands Us, proposed a tool that uses AI to translate legal documents into more “human readable” forms. But there was a twist: the design included a single slider that allows the user to control the “level of jargon-ness”. Setting this slider to one extreme would lead the system to essentially regurgitate the original document, unchanged, and — at the other extreme — a vastly more colloquial version of it. Simple as it may seem, it’s this gradient of choices between the extremes that makes the proposed system truly useful as a tool — with which one can experiment, learn (e.g., by seeing the same document with different extents of legal “jargon”), and tailor to one’s preference and need. A simple, well-articulated slider, in this case, makes the difference between a Big Red Button and a human-in-the-loop tool. It represents an entirely different ethos about what the system is meant to be. (More whimsically, one can imagine a similar tool that works in reverse: e.g., taking normal writing and making it look like a legal document — or, taking a legal document and further “jargon-izing” it. Not practical, but potentially fun and illuminating!)
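To make the slider idea concrete, here is a minimal sketch in Python. It is a toy, not the student's design: the glossary, the `simplify` function, and the `jargon_level` parameter are all hypothetical, and a plain lookup table stands in for whatever AI model a real tool would use. What it does show is the gradient of outputs between the two extremes.

```python
# Hypothetical glossary, ordered from most to least obscure term.
# A real system would use a learned model instead of a lookup table.
GLOSSARY = [
    ("heretofore", "until now"),
    ("indemnify", "cover the losses of"),
    ("pursuant to", "under"),
    ("terminate", "end"),
]


def simplify(text, jargon_level):
    """Rewrite `text` according to a slider value in [0.0, 1.0].

    1.0 keeps every legal term (the original document, unchanged);
    0.0 replaces every term in the glossary with its plain equivalent.
    """
    if not 0.0 <= jargon_level <= 1.0:
        raise ValueError("jargon_level must be between 0.0 and 1.0")
    # Number of terms to replace, starting with the most obscure ones.
    n_replace = round((1.0 - jargon_level) * len(GLOSSARY))
    for term, plain in GLOSSARY[:n_replace]:
        text = text.replace(term, plain)
    return text


clause = "The tenant shall indemnify the owner pursuant to section 4."
for level in (1.0, 0.5, 0.0):
    print(f"{level:.1f}: {simplify(clause, level)}")
```

Even in this toy form, moving the slider lets a reader compare the same clause at several levels of jargon, which is exactly the kind of experimentation a single button cannot offer.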
In my own field — nestled somewhere between computer science, design, and music — there are researchers whose work embodies this human-in-the-loop ethos as well. Among the most prominent of these is Dr. Rebecca Fiebrink, a professor at the Creative Computing Institute at University of the Arts London, Goldsmiths, University of London, and a world expert at the intersections of AI, HCI, and Music. Her Ph.D. thesis (Princeton, 2011), Real-time Human Interaction with Supervised Learning Algorithms for Music Composition and Performance, is a pioneering work that brought together the aforementioned disciplines, articulating a new way to think about designing systems with AI and humans in the loop, with far-reaching implications well beyond music. She is the author of Wekinator: software for real-time, interactive machine learning. One can adapt Wekinator to iteratively, efficiently, and incrementally train tools by example. Humans can continually refine the system by “showing” the system new examples of control mappings for musical instruments, video games, or any other task that has input, interaction, and output. It reframes machine learning tasks as HCI tasks — something we might even call “Human-AI-Interaction.”
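The train-by-example loop described above can be sketched in a few lines of Python. This is not Wekinator's API or its learning algorithm: the `ExampleMapper` class is hypothetical, and simple inverse-distance weighting stands in for the supervised learners a real tool would use. The point is only the workflow, in which a person demonstrates, listens, and demonstrates again.

```python
import math


class ExampleMapper:
    """Hypothetical train-by-example mapper; this is not Wekinator's API."""

    def __init__(self):
        self.examples = []  # (input_vector, output_vector) pairs shown so far

    def add_example(self, inputs, outputs):
        """The human 'shows' the system one more input-to-output pairing."""
        self.examples.append((tuple(inputs), tuple(outputs)))

    def predict(self, inputs):
        """Blend demonstrated outputs, weighting nearby examples more heavily."""
        if not self.examples:
            raise ValueError("show the system at least one example first")
        weight_sum = 0.0
        blended = [0.0] * len(self.examples[0][1])
        for ex_in, ex_out in self.examples:
            distance = math.dist(inputs, ex_in)
            if distance == 0.0:
                return list(ex_out)  # exact match: reproduce the demonstration
            weight = 1.0 / distance ** 2
            weight_sum += weight
            for i, value in enumerate(ex_out):
                blended[i] += weight * value
        return [total / weight_sum for total in blended]


mapper = ExampleMapper()
# Round 1: two demonstrations mapping a hand position to a pitch in Hz.
mapper.add_example([0.0], [220.0])
mapper.add_example([1.0], [440.0])
first_try = mapper.predict([0.5])

# Round 2: the human dislikes the midpoint and demonstrates a correction.
mapper.add_example([0.5], [500.0])
second_try = mapper.predict([0.5])
print(first_try, second_try)
```

Each new demonstration changes the mapping immediately, so the person refining it never has to leave the loop of trying, hearing, and adjusting.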
This shift in thinking may seem trivial, but it can be incredibly powerful. For example, imagine the task of separating a musical recording into its individual tracks, like vocals, guitars, bass and drums. This has all sorts of applications, from music production to forensics to karaoke. It’s also extremely difficult, with professional tools still delivering imperfect results despite decades of attempts at designing the perfect algorithm.
But what if the goal of a “perfect algorithm” is the wrong way to think about it? That’s the question that inspired Nick Bryan, a former Ph.D. student of mine, to try a new approach with his thesis project. He recognized that humans are still the reigning champions when it comes to picking out the thump of a bass note, the crack of a snare drum or the familiar timbre of a vocalist — even in the midst of a fully mixed recording. So he designed an interactive system that combined machine learning, audio signal processing, and human interaction. It all came together through an interface inviting a human user to draw annotations directly onto a visualization of a waveform, roughly pointing out where, in both time and a range of frequencies from bass to treble, a given instrument or vocal is most prominently heard. These annotations aren’t meant to be precise, of course; they’re simply a hint that tells the underlying algorithm where to focus its efforts. Moreover, this is an iterative process where a user can listen to the intermediate result at each step and further refine the separation by providing additional examples by drawing. It adds genuine structure and context to an otherwise meaningless stream of data — and helped Nick’s selectively automated tool outperform its fully automated rivals. Once again, even a little human involvement, in the right place, can go a long, long way.
Benefits of Human-in-the-Loop
With all this talk of “value” and “meaning”, however, it’s important to remember how practical the benefits of human-in-the-loop design can be.
First, it means significant gains in transparency. Each step that incorporates human interaction demands the system be designed to be understood by humans to take the next action, and that there be some human agency in determining the critical steps. Ultimately human and AI undertake the task alongside one another, making it harder and harder for the process to remain hidden.
Next, human-in-the-loop systems incorporate human judgment in effective ways. At the end of the day, AI systems are built to help humans. The value of such systems lies not solely in efficiency or correctness, but also in human preference and agency. A human-in-the-loop system puts humans in the decision loop.
They also shift pressure away from building “perfect” algorithms. By incorporating human intelligence, judgement, and interaction into the loop, the automated aspects of the system are exempted from “getting everything right all at once” (as in a Big Red Button scenario). Because the system is built around human guidance, the system only needs to make meaningful progress to the next interaction point. In fact, it may be beneficial to do less rather than more at each step. As Rebecca Fiebrink demonstrates, Wekinator “only” uses a single-layer neural network — that’s all it really needs to be useful, as humans do the rest. Could it go deeper? Of course, but it doesn’t have to.
Finally, they often enable more powerful systems, not less. Human-in-the-loop design strategies can often improve the performance of the system compared to fully automated and fully manual systems. This aligns with the notion that a hybrid system can do no worse than fully automated systems — i.e., as the design allows, the human can defer to the rest of the system whenever they choose to do so — but the right kind of human interaction can render the system fundamentally better at what it is built to do. In other words, this is functional excellence, achieved through finding an ideal balance for a given situation.
If there is truth to this, then a critical first ingredient in designing any human-centered AI system is having the awareness to ask: in what ways can the system I am designing incorporate human curation into the loop? At what junctures can human judgement and preferences improve the system in its effectiveness — and the experience of using it? What might an interaction model or “user interface” look like? How does the AI model support such an interaction? What do I need to tweak to make it work?
There is no one-size-fits-all method to designing human-in-the-loop AI systems. But perhaps we can articulate a few general design principles as “things to think with” when designing a system.
- Value human agency. Design AI systems that harness human preference, taste, and judgement.
- Granularity is a virtue. A Big Red Button is an all-or-nothing affair that offers little granularity of control. Break up the task to incorporate human interaction.
- Interfaces should extend us. Build tools — things we can learn to use — instead of oracles, which give us the right answers but withhold an explanation.
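As a rough illustration of these principles, together with the earlier observation that a human can always defer to the system, here is a minimal Python sketch. Every name in it (`run_with_human`, `guess_tempo`, `reviewer`) is hypothetical, and the tempo task is only a stand-in; it assumes a task that can be broken into per-item steps with a review point after each one.

```python
def run_with_human(items, automated_step, human_review):
    """Apply `automated_step` to each item, pausing for human review.

    `human_review(item, proposal)` returns a replacement value, or None to
    defer to the system's proposal.
    """
    results = []
    for item in items:
        proposal = automated_step(item)
        decision = human_review(item, proposal)
        results.append(proposal if decision is None else decision)
    return results


# Hypothetical automated step: guess a tempo marking from beats per minute.
def guess_tempo(bpm):
    return "fast" if bpm >= 120 else "slow"


# Hypothetical reviewer: accepts most guesses, but prefers "moderate"
# for the borderline range that the automated step handles crudely.
def reviewer(bpm, proposal):
    return "moderate" if 100 <= bpm < 130 else None


print(run_with_human([60, 110, 125, 180], guess_tempo, reviewer))
```

When the reviewer always returns None, the result is identical to the fully automated output, which is the sense in which the hybrid can do no worse; any improvement comes from the places where a person chooses to step in.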
How Do We Want to Live with AI?
The broader question, at the end of the day, is perhaps not “how do we design more intelligent machines”, but rather, “how do we want to live with those machines”? How do we align our AI systems to the kind of world we would want to live in? Beyond simple automation, how might we use AI to augment human capabilities, incorporating human interaction, preference, and judgement, in order to design more useful and meaningful AI systems?
It may still be a while before we have to decide where a robot songwriter or algorithmic playwright fits into our lives. But the question we’ll ask then is the same we should ask now — for creative and practical tasks alike: is the product really all that matters, or is there something special in the process too? I believe there is—and that there is a balance between automation and human interaction to be discovered in every situation—which is why I’d like to end by presenting a fourth design principle for the future of automation, a principle from Artful Design: Technology in Search of the Sublime:
Special thanks to Alex Varanese, Patricia Alessandrini, Nick Bryan, and Rebecca Fiebrink for their ideas, suggestions, and edits throughout the writing of this article.
Amershi, S., M. Cakmak, W. B. Knox, and T. Kulesza. 2014. “Power to the People: The Role of Humans in Interactive Machine Learning.” AI Magazine 35(4): 105–120.
Bryan, N. J., G. J. Mysore, and G. Wang. 2014. “ISSE: An Interactive Source Separation Editor.” ACM CHI Conference on Human Factors in Computing Systems. Toronto.
Bryan, N. J. 2014. Interactive Sound Source Separation. Ph.D. Thesis. Stanford University.
Fiebrink, R. 2011. Real-time Human Interaction with Supervised Learning Algorithms for Music Composition and Performance. Ph.D. Thesis. Princeton University.
Gatys, L. A., A. S. Ecker, and M. Bethge. 2015. “A Neural Algorithm of Artistic Style.” arXiv:1508.06576.
Parrish, A. 2019. Portfolio. Website. Accessed March 2019.
Wang, G. 2018. Artful Design: Technology in Search of the Sublime (A MusiComic Manifesto). Stanford University Press.