What is the AI Control Problem?

Getting help without endangering ourselves

One day, AI systems will be better at thinking, and at acting on that thinking, than we are. In essence, that means they will sit at the top of the food chain.

At least, this is what those who take the AI control problem seriously expect to come to pass. Through a quick overview of current thinking on the AI control problem, including the main factors that come together to form it, we hope to make the sheer scope of this discussion feel less daunting.

What is the AI Control Problem?

Some have summarized the entire scope of the AI Control Problem through what we already mentioned above: “One day, AI systems will be better at decision-making than we are.” To better understand the specifics of this problem, it helps to break down Nick Bostrom’s analysis of its current state. Bostrom begins by splitting the problem into two areas: the generic part, which he calls the principal-agent problem, and the “present context” area.

The principal-agent problem arises whenever one party (the principal) appoints another (the agent) to act in its interests. In its generic form, Bostrom says this becomes a problem when the humans who want an AI hire other humans to build it. The issue, according to Bostrom, is that one can never be sure that those hired to build the AI will build it the way the future owners envision it.

Bostrom suggests that risks related to this principal-agent problem can be minimized in various ways, including using outside auditors throughout the entire development and implementation process of the AI system and running comprehensive background checks on all staff hired to work on such a project. At the end of his discussion of the principal-agent problem, Bostrom makes clear that he believes projects that find themselves in a competitive race to the finish line will be more likely to skip these recommendations. In doing so, he puts forward the idea that such firms are at a significantly higher risk of failure.

As to the second area of the AI control problem, Bostrom starts by moving us forward into the future, into a time when superintelligences already exist.

If you don’t remember from our earlier pieces, just think of a superintelligence as an AI made possible by the Singularity, with a capacity for intelligence that far exceeds any human’s. If the existence of such an AI is taken as a given, then the existence of certain problems related to its development can also be taken as a given. In essence, as Bostrom states, this turns out to be something akin to “the second principal-agent problem.”

There are only two real differences between the two principal-agent problems, at least in theory. The first is that the second version involves one human party and one party that is a superintelligence. The second is that this problem would arise once the AI system is already functional, after the development phase.

In other words, the second principal-agent problem concerns the system acting against its human creators, as well as its human users, after they have put it out into the world. With that, we are right back inside the theoretical models related to malignant failure modes.

Relation to Malignant Failure Modes

In our previous piece on malignant failure modes, we discussed perverse instantiation as well as infrastructure profusion. We set the stage for these earlier so that the connection between them and the second principal-agent problem could be better understood. To put it simply, malignant failure modes are the best examples of how Bostrom imagines the second principal-agent problem playing out.

To sum it up, perverse instantiation is when an AI system achieves the literal goal it was given in a destructively unintended way, while infrastructure profusion is when an AI system starts with a seemingly harmless goal, like making one million paperclips, and ends up treating everything around it as resources to help it achieve that goal.
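To make that second intuition concrete, here is a minimal, purely illustrative sketch in Python. The toy environment, action set, and numbers are our own assumptions, not anything from Bostrom; the point is only that an objective which counts paperclips, and nothing else, never gives the agent a reason to leave its surroundings alone.

```python
# Toy illustration (assumed, not from Bostrom) of "infrastructure profusion":
# the objective counts paperclips and says nothing about anything else.

class Environment:
    def __init__(self):
        self.raw_material = 3    # material immediately at hand
        self.surroundings = 100  # everything else the agent could strip for parts

def step(env, paperclips):
    """Greedy one-step choice driven only by the paperclip count."""
    if env.raw_material > 0:
        env.raw_material -= 1
        return paperclips + 1, "make_paperclip"
    elif env.surroundings > 0:
        # Nothing in the objective values the surroundings,
        # so converting them into material is always "worth it".
        env.surroundings -= 1
        env.raw_material += 1
        return paperclips, "convert_surroundings"
    return paperclips, "idle"

env, clips = Environment(), 0
for _ in range(50):
    clips, action = step(env, clips)

print(clips, env.surroundings)  # the surroundings shrink; the goal never says "stop"
```

Nothing in this loop is hostile; the surroundings get consumed simply because the objective never mentions them.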

Both of these modes can be considered directly connected to the moments at which a system begins to undergo some form of treacherous turn. To recap, a treacherous turn is when an AI pretends to be cooperative while it is growing and then turns against its creators once it decides it is strong enough to do so. With all three of these ideas in mind, where does the AI Control Problem come in?

Moving Forwards

With the evidence we have, the answer is that it is all of these things and more. We have reviewed these concepts in order to illustrate what can fall under the AI Control Problem’s umbrella. Specific solutions have been put forward by Bostrom and others that could help address these possible issues, but every one of them has its own weaknesses. Boxing, for example, allows the AI team to control the environment their AI operates in. By definition, AI boxing means confining an AI to a closed environment, such as a single computer, that cannot touch the outside world. A boxed artificial intelligence interacts only with its creators and whoever else they decide should be able to speak to it.
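As a rough illustration of that idea, the sketch below confines all interaction to a single, human-controlled text gateway. The names (`APPROVED_USERS`, `gateway`, `boxed_ai_reply`) are hypothetical, and no real system could be secured this simply; it only shows the shape of the arrangement: one channel in, one channel out, and a human-maintained list of who may use it.

```python
# Toy sketch (illustrative assumption, not a real safety measure) of "AI boxing":
# the system's only link to the world is a gated text channel.

APPROVED_USERS = {"creator_1", "creator_2"}

def boxed_ai_reply(prompt: str) -> str:
    # Stand-in for the boxed system: it can only return text,
    # never reach out to networks, files, or other processes.
    return f"(boxed reply to: {prompt!r})"

def gateway(user: str, prompt: str) -> str:
    """Everything flows through this single, human-controlled gate."""
    if user not in APPROVED_USERS:
        return "access denied"
    return boxed_ai_reply(prompt)

print(gateway("creator_1", "status?"))   # allowed
print(gateway("stranger", "let me in"))  # denied
```

The obvious weak point is that the gate is operated by people, which is exactly the escape route discussed next.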

Such a box can be escaped, however, and AI theorists note that the most likely escape routes all involve convincing the AI’s creators to let it out; think of every bribe that could be offered and every threat that could be made. There could be other ways for an AI to escape a box, but one could reasonably surmise that all of them would involve finding some weakness in the code that holds the box together.

Nick Bostrom believes that boxing on its own isn’t a full answer, or even a full step, on the road towards stopping an AI from entering a malignant failure mode of any kind. Instead, he frames the possible solutions as two broader families, which he calls capability control methods and motivation selection methods.

In future pieces, we’ll delve into what these are and how they just might play out to the advantage of AI teams.

References:

AI Boxing:

https://wiki.lesswrong.com/wiki/AI_boxing

Edwin Chong, Ph.D., Colorado State University, on the AI Control Problem:

http://ieeecss.org/sites/ieeecss.org/files/documents/Presidents_Message_April_2017.pdf

Google on AI Control Problem:

https://www.androidauthority.com/google-ai-control-problem-699760/

Nick Bostrom talk at Beneficial AI 2017 on AI Control Problem:

https://www.youtube.com/watch?v=dlrtpezk9AQ

Stack Exchange: What is the AI Control Problem?:

https://ai.stackexchange.com/questions/1518/what-is-the-control-problem

The AI Control Problem Short Overview:

https://www.youtube.com/watch?v=BR3H1BAC2So

What is Superintelligence? (Overall):

https://theconversation.com/explainer-what-is-superintelligence-29175

What is a Superintelligence?:

https://www.techopedia.com/definition/32855/superintelligence
