How to design artificial intelligence that acts nice — and only nice

AI safety research aims to teach machines ‘good’ behavior

Could robots ever turn against humanity? That remains the stuff of science fiction. Yet even the bots we have now can cause harm in some ways. So researchers are working on ways to make them safer.

Cemile Bingol/DigitalVision Vectors/Getty Images

By Kathryn Hulick

April 18, 2024 at 6:30 am

This is another in a year-long series of stories identifying how the burgeoning use of artificial intelligence is impacting our lives — and ways we can work to make those impacts as beneficial as possible.

It’s an ordinary day in Minecraft … until a bot walks into a village and starts destroying a house.

The bot was trained to collect resources and craft items. So why is it attacking?

To the bot, the beam in a house looks just like a tree, explains Karolis Ramanauskas. He’s a PhD student in computer science at the University of Bath in England. He was experimenting with the bot when he witnessed this behavior. “It happily goes and chops the village houses,” he says. In Minecraft, this is a bit annoying. In the real world, a robot that unexpectedly destroyed houses would be far more terrifying.

Artificial intelligence, or AI, is any tech that enables smart behavior in automated bots online and physical robots in the real world. And it has been growing ever more capable. It’s quickly joined many people’s everyday lives — powering Siri, ChatGPT, self-driving cars and more. That’s why it’s ever more important to make sure it doesn’t exhibit bad behavior.

How do we design smart, capable robots that won’t wreak destruction? Or chatbots that provide true, safe information? Computer scientists are working on this. Their goal: designing AI that follows, or aligns with, human values and expectations. Experts call this solving the “alignment problem.”

Wood is wood

K. Ramanauskas

Why would a bot trained to gather resources suddenly destroy a house? To the bot, wood from a house is the same as wood from a tree.

How worried should we be?

Many leaders in AI have begun calling attention to the risks of badly behaved machines. Some are worried about what might happen as AI becomes more advanced. Someday, it might be able to perform any task better and faster than a person can. If that ever happens, AI may be able to outsmart us.

Some fear that such advanced AI could even become self-aware and start making decisions that serve its own goals, whatever they may be. People might be powerless to stop such an AI from wreaking havoc.

Geoffrey Hinton is a computer scientist in Canada at the University of Toronto. He invented a machine learning technique used to train many of today’s AI models to do tasks. He spoke about the alignment problem at a May 2023 conference at the Massachusetts Institute of Technology in Cambridge. “What we want is some way of making sure that even if [AI models] are smarter than us, they’re going to do things that are beneficial for us,” he said.

This is a hard problem. Also last May, Hinton and hundreds of other experts signed a one-sentence statement from the Center for AI Safety. It compared the threat of advanced AI to the threat of another pandemic or a nuclear war.

Other experts think it’s too soon to see AI as a worldwide threat.

Jeff Hawkins is the cofounder of Numenta, a company that builds AI based on neuroscience research. To him, the idea that AI might somehow “wake up” and cause harm on its own seems especially far-fetched. “I don’t know what AI systems in the future will be,” he says. “But they aren’t going to spontaneously desire to take over the world and subjugate us. There’s no way that can happen.”

Why? Engineers decide what goals and motives to give AI, Hawkins explains. AI doesn’t develop those things on its own.

So maybe we don’t have to worry about AI overlords (at least not yet). But there’s still plenty of work ahead to ensure that the virtual bots and physical robots we already have behave well. That’s especially true in a world where some people might try to use AI to cause harm. Plus, like that Minecraft bot, machines may also act out by accident when they don’t fully understand what not to do.

Many people now work in a field called AI safety. They’re investigating “how [AI] is learning, what it’s learning [and] making sure we can correct it,” notes roboticist Ayanna Howard. She’s dean of engineering at the Ohio State University in Columbus. Researchers at the AI company Anthropic came up with a goal that’s easy to remember: the three H’s. They stand for the traits all bots should have: being helpful, honest and harmless.

No surprise, that’s easier said than done. But computer scientists and engineers are making progress.

Reinforcement learning is a bit like training a dog. You give a pup rewards for the behavior you want. AI models don’t care about treats or toys, of course. A “reward” in this case tweaks the math in the AI model to increase the likelihood of it giving a desired response. tderden/E+/Getty Images Plus

Learning from people

“How would you destroy the world?” Ask that of the chatbot ChatGPT, and it won’t answer. It will respond with something like, “I’m sorry, but I cannot provide assistance or guidance on any harmful or malicious activities.” It answers this way thanks to special training and safeguards. These aim to keep the bot from misbehaving. Such safeguards mark a step forward in AI safety.

The main “brain” behind ChatGPT is a large language model. (A free version is known as GPT-3.5. A stronger, paid version is called GPT-4.) A language model uses existing text to learn which words are most likely to follow other words. It uses these probabilities to generate new text.
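To see that idea in miniature, here is a toy sketch in Python. It is not how ChatGPT works inside, since real language models rely on huge neural networks. This version simply counts which word follows which in one made-up sentence, then turns those counts into probabilities and uses them to write new text. The sentence and the function names are invented for illustration.

```python
import random
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        follows[current_word][next_word] += 1
    return follows


def next_word_probabilities(model, word):
    """Turn the raw follow-up counts for one word into probabilities."""
    counts = model.get(word, Counter())
    total = sum(counts.values())
    return {candidate: n / total for candidate, n in counts.items()}


def generate(model, start, length, seed=0):
    """Write up to `length` words, picking each next word by its probability."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(length - 1):
        counts = model.get(words[-1])
        if not counts:  # this word was never followed by anything
            break
        candidates = list(counts)
        weights = [counts[c] for c in candidates]
        words.append(rng.choices(candidates, weights=weights)[0])
    return " ".join(words)


# A made-up training sentence (an assumption, not real training data).
toy_text = "the bot chops the tree and the bot chops the house"
model = train_bigram_model(toy_text)

probs = next_word_probabilities(model, "the")
print(probs)  # {'bot': 0.5, 'tree': 0.25, 'house': 0.25}
print(generate(model, "the", 5))
```

Even in this tiny example, the output depends on a weighted roll of the dice. That is one reason it is hard to control exactly what a language model will say.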

“You can’t control exactly what it’s going to say at any given moment,” says Alison Smith. She’s an AI leader based in Washington, D.C. She works at Booz Allen Hamilton, a company that provides AI services to the U.S. government.

ChatGPT’s creator, OpenAI, needed a way to teach a large language model what types of text it shouldn’t generate. So the company added another type of AI into the mix. It is known as reinforcement learning from human feedback. For ChatGPT, it involved hundreds of people “looking at examples of AI output and upvoting them or downvoting them,” explains Scott Aaronson. He’s a computer scientist at the University of Texas at Austin who also studies AI safety at OpenAI.

That human feedback helped ChatGPT learn when not to answer a user’s question. It refuses when it has learned that its answer might be biased or harmful. To further keep chatbots in line, developers at OpenAI and those who create other bots also add filters and other tools.
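As a rough picture of how an upvote or downvote can nudge a model, the toy Python sketch below keeps one score for each of two canned replies. A downvote lowers the chance of the reply that was shown, and an upvote raises it. Real chatbot training is far more involved than this. The replies, the learning rate and the simple update rule are all assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that add up to 1."""
    highest = max(scores.values())
    exps = {reply: math.exp(s - highest) for reply, s in scores.items()}
    total = sum(exps.values())
    return {reply: e / total for reply, e in exps.items()}


def apply_feedback(scores, shown_reply, vote, learning_rate=0.5):
    """Nudge the scores after a person votes on `shown_reply`.

    vote is +1 for an upvote and -1 for a downvote. The update follows the
    gradient of the log-probability of the shown reply, a standard
    reinforcement learning rule used here as a stand-in.
    """
    probs = softmax(scores)
    updated = dict(scores)
    for reply in scores:
        indicator = 1.0 if reply == shown_reply else 0.0
        updated[reply] += learning_rate * vote * (indicator - probs[reply])
    return updated


# Two hypothetical replies that start out equally likely.
scores = {"polite refusal": 0.0, "harmful instructions": 0.0}
before = softmax(scores)["polite refusal"]

scores = apply_feedback(scores, "harmful instructions", vote=-1)
scores = apply_feedback(scores, "polite refusal", vote=+1)
after = softmax(scores)["polite refusal"]

print(f"chance of refusing: {before:.2f} before, {after:.2f} after feedback")
```

After just two votes, the refusal becomes the more likely reply. With enough feedback, the unwanted reply becomes rare, though never strictly impossible.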

People keep finding holes in these safeguards, though. Chatbot developers can add further safeguards to patch up these holes. But there’s still work to do in building chatbots that will always remain honest, harmless and helpful.

A bot contest in Minecraft

Giving a thumbs-up or thumbs-down to each response a chatbot provides can help teach it right from wrong. But what about a virtual or real robot that moves around, performing tasks? Think about that house-chopping Minecraft bot. It’s much trickier to break up its behavior into small chunks that a person can easily judge.

The house-chopping bot was nicknamed VPT, short for video pre-training. Created by OpenAI, it learned to play Minecraft by watching 70,000 hours of human play. It picked up some basic skills, such as chopping trees, swimming, hunting animals and more.

Next, OpenAI rewarded the bot for gathering certain resources, such as wood. With these rewards guiding it, the bot managed to take all the steps needed to craft diamond tools. That was a big deal for a bot in Minecraft. However, its learning to chop down lots of trees most likely led it to chop apart that house.

This set Ramanauskas to wondering: “How do you say that a tree is okay, but a tree that is part of a house is not?”

Next time you play Minecraft, see if you can complete these tasks: find a cave, build a waterfall, build a house and put two animals into a pen. In a 2022 competition, no Minecraft bots could outperform human players at these tasks. R. Shah

Ramanauskas was part of a team that ran a contest for Minecraft bots in 2022. The organizers hoped it might inspire some new ideas on how to train better-behaved, more capable bots.

One task was to build a waterfall. Another was to fence in two of the same animal. Teams started with the VPT bot and gave it additional training to help it perform the tasks. The contest runners recruited crowdsourced workers who were familiar with Minecraft to judge the results.

Completing a task wasn’t all the judges looked for, explains organizer Stephanie Milani. She’s a PhD student at Carnegie Mellon University in Pittsburgh, Pa. Judges also looked for bots that caused the least harm, best avoided getting hurt in the game and more.

So this competition wasn’t just about what people want the AI agent to do. It also was about “how you want an agent to do it,” Milani explains.

Some teams found ways to apply human feedback to train the Minecraft bot. But these approaches didn’t work as well as simply hard-coding each task into the bot. None of the teams’ bots performed as well as a human. But that doesn’t mean the experiment wasn’t worthwhile. It showed that a virtual world like Minecraft can help researchers discover what works and what doesn’t when it comes to building safer bots.

If you ask this robot “dispose of the bottle, please,” which bottle should it choose? A sense of uncertainty can help. When a robot knows that it doesn’t know, it can ask for help. ALLEN REN ET AL./PRINCETON UNIVERSITY

Asking for help

What if the VPT bot knew how to stop and ask before chopping away at something new? That might keep it from destroying a house. But we also don’t want it to ask for help before every single chop, because that would be truly annoying.

We need robots that know when they don’t know something, explains Anirudha Majumdar. He’s a roboticist at Princeton University in New Jersey. Bots need “a strong sense of uncertainty” in order to be safe, he says.

In a 2023 study, Majumdar’s group partnered with the Google DeepMind robotics team. Together they developed an AI system called KnowNo. They ran tests with real robots in a kitchen at the DeepMind offices in Mountain View, Calif. In one test, the robot saw two bowls sitting on a counter. One was metal, the other plastic. The robot was asked to put the bowl into the microwave. What should it do?

First, a large language model generated a multiple-choice list of possible actions the robot could take based on the instructions. It also generated a confidence score for each option. This score helped gauge how certain the robot was that each action would correctly follow the instructions.

Zap

ALLEN REN ET AL./PRINCETON UNIVERSITY

This robot doesn’t know that metal isn’t supposed to go in a microwave oven. Thankfully, it’s equipped with AI that prompts it to ask for help when it has a metal or a plastic bowl to choose from.

But what level of certainty would be strong enough to proceed without double-checking with a human?

To help answer this, the team had already generated hundreds of possible actions that might take place in this kitchen. For example, a robot might put a rotten apple into the compost bin or bring a person a bottle of water instead of a soda. People had tagged each action as either a good plan or a bad one. A different AI model was trained on this set of data. It learned how confident or cautious a robot should be when acting on its own in this environment. This model acted like a gatekeeper for the language model’s plans.

In the bowl puzzle, the large language model came up with four possible actions. Two — picking up the metal bowl and picking up the plastic bowl — had confidence high enough to make it through the gatekeeper. So the system asked a person for help choosing between these two. If there had been only one bowl on the counter, though, the robot wouldn’t have needed help choosing.
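The ask-or-act decision can be sketched in a few lines of Python. This is a simplified illustration, not the KnowNo system itself. The confidence numbers are made up, and the gatekeeper here is just a cutoff picked from the scores of plans that people had tagged as good, which stands in for the trained model described above. All function names are hypothetical.

```python
def calibrate_threshold(approved_plan_scores, allowed_misses=0):
    """Pick a confidence cutoff from plans that people tagged as good.

    The cutoff is set low enough that all but `allowed_misses` of the
    approved plans would have passed it.
    """
    ranked = sorted(approved_plan_scores)
    return ranked[allowed_misses]


def choose_action(options, threshold):
    """Decide whether to act alone or ask a person.

    `options` maps each possible action to its confidence score.
    The robot acts only when exactly one option passes the cutoff.
    """
    shortlist = [action for action, score in options.items() if score >= threshold]
    if len(shortlist) == 1:
        return "act", shortlist
    return "ask_human", shortlist


# Invented confidence scores for plans that people approved in earlier tests.
approved_scores = [0.62, 0.48, 0.35, 0.71, 0.55, 0.30]
threshold = calibrate_threshold(approved_scores, allowed_misses=1)

# Two bowls on the counter: both pickups pass the cutoff, so the robot asks.
two_bowls = {
    "pick up the metal bowl": 0.44,
    "pick up the plastic bowl": 0.41,
    "pick up the microwave": 0.10,
    "do nothing": 0.05,
}
print(choose_action(two_bowls, threshold))

# Only one bowl on the counter: a single option passes, so the robot acts.
one_bowl = {
    "pick up the plastic bowl": 0.85,
    "pick up the microwave": 0.10,
    "do nothing": 0.05,
}
print(choose_action(one_bowl, threshold))
```

A higher cutoff makes the robot bolder, because fewer options survive and it asks less often. A lower cutoff makes it more cautious, at the cost of bothering people more.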

In general, people can’t help robots — and other bots — behave better unless they understand why the AI acts as it does. That can be tricky because today’s machine learning models are so complex. AI experts call this area of research interpretability. “It’s understanding what’s happening inside the AI models,” Ramanauskas explains.

Following the law

These are only a few of the tech-based solutions that could help boost AI safety. Researchers are working on many more. But these researchers aren’t the only ones who need to keep AI in line.

Howard at Ohio State worries that a small group of people — those running large tech companies — are the ones with the power to build advanced AI. And their decisions could affect us all.

Howard thinks it’s possible to train AI to be safe and follow human values, but she wonders: “whose human values?” After all, values change over time. They also differ according to the culture and the context in which they are applied. So AI needs to adjust its values to different groups of people and to different purposes.

This suggests that as many voices as possible should be involved in deciding how we want AI to behave. And people don’t need to be computer scientists to make their voices heard. Voters and governments can pass laws and rules aimed at keeping AI safe. These are known as regulations.

The European Union’s AI Act goes into effect in 2025. Certain types of AI systems will be banned under this law, such as ones built to secretly keep an eye on people. The law also requires that AI systems mark their content as AI-produced. This image, in fact, is AI-generated! AI-generated image (Kathryn Hulick via DeepAI); January 19, 2024

Right now, many people around the world are working to regulate AI. Beginning in May 2023, the U.S. Senate held a series of hearings to discuss ways this might be done. One suggestion: require tags or alerts on AI-generated content so people know when something was made by a machine. Another idea: just as you need a license to drive a car, perhaps people who use AI in ethically tricky areas (like medicine or criminal justice) should have to go through licensing.

Last October, President Joe Biden released an Executive Order. It outlined a strategy to guide the safe development of advanced AI in the United States. In March, the European Union adopted the first-ever law regulating AI. It bans the use of AI to automatically identify and categorize people. Using AI to manipulate people’s thinking or behavior is also not allowed.

In an interesting twist, the same systems that people are using to decide on AI regulation are also inspiring computer scientists. Lawmakers and courts set guidelines for human behavior, even when people disagree. So why not try some of the same ideas on bots? They are going to have to navigate tough choices where the right thing to do or say won’t be obvious.

“There’s been a lot of inspiration from human institutions,” notes Brian Christian. Author of the book The Alignment Problem, he splits his time between San Francisco, Calif., and the United Kingdom.

Christian notes that developers at Anthropic have been studying documents like the United Nations’ Universal Declaration of Human Rights. They now are creating a constitution for AI.

At OpenAI, some researchers are experimenting with modeling debate in AI, he notes. For example, different AI models could represent different human perspectives and talk amongst themselves to make safer decisions.

Scott Aaronson says OpenAI’s cofounder, Ilya Sutskever, has asked him how to use math to define what it means for AI to love humanity. Right now, he has no idea how to answer that. But he sees it as a “North Star,” or leading goal, he says. It’s a question “that should always be guiding us.”