The Superalignment Fallacy
Society's moral choices emerge from an interplay of arguments, power, politics, and randomness, not from value sets that AI scientists can use to align their models.
Alignment is considered crucial in the development of safe artificial general intelligence (AGI). If AGI were to live on the internet, it could affect everything, control everything, acquire physical resources, and ultimately, intentionally or unintentionally, destroy human civilization. The obvious solution is to keep AGI ‘in a box’ without direct access to the internet, but humanity is unlikely to do that in a coordinated way: because of competition, because many applications require internet access, and because AGI would be smart enough to escape. We therefore need to find a way to align AI with human values before it transforms into AGI with capabilities that exceed ours. Sam Altman believes solving this alignment problem will require the Chinese and all the other big brains in the world, and OpenAI has dedicated a separate division with access to 20% of its computing resources to the problem.
Jan Leike, who along with OpenAI’s chief scientist and co-founder Ilya Sutskever leads its recently established superalignment group, has set a goal of solving the alignment problem in four years. In practice, he aims to automate and scale alignment, so that smart computers can monitor advanced AI applications and catch serious errors once smart AI becomes ubiquitous. These future AI supervision systems would also need to stay ahead of continually advancing AI by training, calibrating, and overseeing the next level up, which would then need to do the same with the level above it, all the way up the curve of intelligence and superintelligence.
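The recursive oversight scheme Leike describes can be pictured as a supervision chain. The sketch below is purely illustrative: every name (Model, supervise, train_next_level) and all the capability arithmetic are invented for this example, not OpenAI's method; it only shows the shape of the idea, namely that each trusted level helps check the training of the next, slightly more capable one.

```python
# Illustrative sketch of recursive oversight; all names and numbers here
# are hypothetical, not OpenAI's actual approach.

from dataclasses import dataclass

@dataclass
class Model:
    level: int          # position in the oversight chain
    capability: float   # toy stand-in for "intelligence"

def supervise(overseer: Model, trainee: Model, outputs: list) -> list:
    """The weaker, already-trusted model flags suspect outputs of the trainee."""
    # Toy assumption: an overseer only catches errors of a trainee that is
    # at most ~1.5x more capable than itself.
    if overseer.capability * 1.5 < trainee.capability:
        return []  # oversight has fallen behind the capability curve
    return [o for o in outputs if "error" in o]

def train_next_level(overseer: Model) -> Model:
    """Train a more capable model under supervision of the current one."""
    trainee = Model(level=overseer.level + 1, capability=overseer.capability * 1.3)
    caught = supervise(overseer, trainee, ["ok", "subtle error", "ok"])
    print(f"level {trainee.level}: overseer flagged {len(caught)} output(s)")
    return trainee

# Bootstrap: humans align level 0; each aligned level then helps align the next.
model = Model(level=0, capability=1.0)
for _ in range(3):
    model = train_next_level(model)
```

The scheme works, in this toy, only while the capability gap between overseer and trainee stays within reach; the whole superalignment bet is that this remains true up the curve.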
The effectiveness of Reinforcement Learning from Human Feedback (RLHF) in aligning the OpenAI transformer models has made Jan Leike optimistic enough to set this ambitious four-year goal. Ask today’s ChatGPT about abortion, and although it won’t resolve the controversy, it can provide a well-formulated and sensible answer. The cultural background of the people doing the supervision and training will of course continue to influence the results. LLMs emerging from China will refrain from criticizing the Communist Party; models emerging from liberal companies might remain somewhat biased against right-wing positions, as was the case with early versions of ChatGPT; and the Taliban would presumably tune these models towards extreme positions on women’s and gay rights. Nevertheless, the currently calibrated models produce well-formulated arguments that accurately reflect both sides of sensitive issues and generally cover the opinions of large majorities of the population.
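For readers unfamiliar with the technique, the RLHF recipe has three stages: collect human preference comparisons, fit a reward model to them, and fine-tune the language model against that reward. The sketch below is only a schematic with toy stand-ins for every component; the actual policy-gradient update (e.g. PPO) is deliberately elided.

```python
# Schematic of the three RLHF stages; every function here is a toy
# stand-in, not OpenAI's implementation.

# Stage 1: human preference data, as (response_a, response_b, preferred_index).
preferences = [
    ("Abortion is a settled question.",
     "Abortion involves competing values on both sides.", 1),
]

# Stage 2: a reward model fitted on comparisons like `preferences`
# (here a trivial heuristic stands in for a learned model).
def reward(response: str) -> float:
    return 1.0 if "competing values" in response else 0.0

# Stage 3: fine-tune the policy toward high-reward responses. A real
# implementation applies a policy-gradient update (e.g. PPO); here we
# only show which candidate such an update would reinforce.
def preferred_behaviour(candidates: list) -> str:
    return max(candidates, key=reward)

print(preferred_behaviour([
    "Abortion is a settled question.",
    "Abortion involves competing values on both sides.",
]))
```

The cultural fingerprints described above enter at stage 1: whoever supplies the comparisons defines what the reward model learns to prefer.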
The goal for the next four years, however, is much more ambitious than avoiding jailbreaking, hallucinations, and biases in existing AI systems. Jan Leike is focused on the next generation of models, whose inner workings should be closer to those of humans, with a better real-world representation and perhaps more autonomous internal objectives. Those future close-to-AGI models need to align with human values or goals. In Jan Leike’s words in a recent extensive podcast interview: “And so I think our job is fundamentally trying to distinguish between two different AI systems: One that truly wants to help us, truly wants to act in accordance with human intent, truly wants to do the things that we want it to do; and the other one just pretends to want all of these things when we’re looking…”. In the interview he discusses the fundamental difficulties beyond the already very difficult issue of overseeing evolving intelligence at scale. One is the need to avoid an Ex Machina scenario, where we are misled into believing the AI is peaceful and innocent while its real intentions are malicious. The second, related problem is a powerful AI that pursues solutions it thinks humanity wants but that don’t really benefit society on a deeper level. We don’t want a paperclip maximizer, and we don’t want an AI that disempowers humans to achieve its goals. The AGI should know what to do, “in particular, in situations where humans might not exactly know what they want”. Ideally, we can inspect the AI’s intentions by looking at its internals (referred to as mechanistic interpretability), and if that turns out to be impossible, we need other tests or indications that make us confident about our ability to keep all the various current and future AI systems ‘super’ aligned with the goals and values of humanity.
The superalignment problem was introduced to the public by Nick Bostrom in his book Superintelligence and has been pursued with much passion by Eliezer Yudkowsky. The worry about misalignment, despite vast differences in opinion on the best approach and on whether pauses or additional regulation are required, is shared by many, including Bill Gates, Elon Musk, DeepMind CEO Demis Hassabis, Sam Altman, and recently retired AI ‘Godfather’ Geoffrey Hinton. Most appear to believe superalignment between humans and AGI is possible in principle but worry that we currently understand far too little about how to achieve it, with too little time to catch up with rapidly evolving AI capabilities. As Geoffrey Hinton put it in an interview after he resigned from Google: “if we can put the goals in, maybe we would be ok but my big worry is, sooner or later someone will wire into them the ability to create their own subgoals”. Similarly, the company Anthropic, set up by ex-OpenAI employees who felt OpenAI was not doing enough to address trust and safety, explains on its website the pessimistic scenario as one in which we cannot dictate our values to a system smarter than we are, implicitly assuming that ‘our values’ exist out there and that the major risk lies in ensuring they become and remain the basis for AGI.
Despite widely different approaches and different perspectives on the urgency of the problem, OpenAI, Anthropic, Facebook, Google, and others all seem to believe we risk being faced at some point with AGI that develops rapidly while solutions to superalignment are still being explored. Little attention is given to the reality that ‘human values’ or ‘human morality’ are not concepts that lend themselves to general alignment, super or otherwise, no matter what approach is taken, even if only Western values are included.
Little attention is given to the reality that ‘human values’ or ‘human morality’ are not concepts that lend themselves to general alignment, super or otherwise.
1. Fundamental moral dilemmas
Values are unlike math problems, where more intelligence or computing power might one day find an optimal outcome. No amount of computing power or intelligence will give ‘the answer’ to whether a self-driving car is allowed to avoid hitting a kid who ignored a red light by killing the old lady waiting at the curb. No level of superintelligence has an ‘answer’ to the ideal level of taxes, what gun laws should look like, or how many rights which animal deserves. A single human already struggles to be both logically consistent and faithful to their internal moral intuitions, as shown by thought experiments about trolleys, lifeboats, and population growth. The well-known moral philosopher Peter Singer is quick to admit he does not have a real answer to some of these problems either, and he regularly adjusts his logical framework and his opinions on specific problems. But if Peter Singer, who has not only spent most of his life thinking about these issues but also follows a highly logical and rational system of utilitarianism, does not know how to resolve them, what chance do people following more abstract or intuitive systems have?

In practice, people follow highly inconsistent moral systems. We get deeply upset watching someone abuse a dog, while we eat hamburgers. People spend large amounts to pay respect to a deceased person while condemning those who care deeply about a one-month-old fetus. We help the spoiled kids of friends, whom we don’t even like, with jobs, denying that opportunity to an unknown person who deserves and needs it. We change our views when context or personal circumstances change. Every day of our lives we accept the equivalent of girls drowning in shallow ponds because of our need for a Starbucks coffee. We are proud to take pay cuts to work less and spend that time with our nothing-left-to-wish-for children, instead of earning more to improve the lives of kids who need it.

And the inconsistencies only increase as you move up from an individual to a family, to social groups, to countries, and to international organizations. Each time a new set of values enters the mix, new dilemmas are created. What is obvious to you or your social group becomes highly contested in the larger group: immigration, abortion, the right to draw religious figures, the role of women, the importance of honor, respect, and insult. Consistency and agreement can appear superficially by focusing on vague concepts like peace, mutual respect, prosperity, and human rights, but once you translate such principles into actions, controversy re-emerges, as we can observe daily on the news.
Aligning our inconsistent moral intuitions into one system or set of values that reflects ‘humanity’ is not an option available to us, to future generations, or to a future superintelligence. The problem is unresolvable not because it is too complex but because human morality is internally inconsistent. Claiming that a + b = 5 and a + b = 4 are both true is not conceptually complex either, yet it is impossible to resolve. For many moral dilemmas, each side of the trade-off is so unsatisfactory that it is difficult to merge the two views within yourself, and there is no chance of overcoming the differences across different people.
2. Everything is an impossible dilemma
The moral dilemmas described above are not limited to a set of niche problems, as is sometimes claimed. Exhaustion with the never-ending trolley-experiment discussions can lead people to dismiss them as fringe puzzles for philosophy classes. But while reality is more complex than stylized thought experiments, the choices these examples represent are real and pervasive. Utilitarian intuitions clash with moral emotions triggered by suffering right in front of you, with a bias against action, with tribal intuitions, selfish reasoning, religion, science, pride, and many other emotions and intuitions that directly influence our daily morality.
Almost any action taken by us, the government, a company, or a future autonomous AGI will involve the allocation of scarce resources: from time and money to natural resources and public goods such as clean air and water, to values such as status. Almost any action with real-world consequences can thus justifiably be framed as a moral choice. And each such choice is a real dilemma as long as there are strong disparities between the happiness and opportunities of the different people affected by the distribution of those limited resources. There will always be the justified question of why to put any money into arts, entertainment, or expensive life-extension treatments instead of helping people at the bottom of the pyramid. These are not just interesting philosophical questions but real choices between life and death, between suffering and happiness.

Technical disagreements about facts and theory can give the impression that the underlying goals are aligned, as in discussions on the relationship between GDP growth and top-bracket income taxes, deficit spending, or antitrust regulation. But differences in moral preferences are often either part of the disagreement or emerge once the technical questions are resolved: how much support an unemployed person deserves, whether estate taxes and billions in wealth are justified, how much we can ‘borrow’ from future generations, how much rich countries should support poorer countries and foreign wars, the value of nature and clean air, the value of extending life expectancy for older people versus economic opportunities for younger generations. Almost everything is a values trade-off: between public safety and personal freedom, between average wealth and inequality, between higher risks and higher expected rewards, between educating the poor abroad and solving the opioid crisis at home, between bringing more happy lives into this world and further exhausting natural resources or remaining childless, between art and supporting the homeless.
3. Autonomous action means controversy
The impressive ability of recent AI engines to reflect human opinions is no reason to be optimistic about their ability to solve future moral dilemmas, because of the fundamental gap between hearing a reasonable-sounding ‘both sides’ opinion and making an actual choice. ChatGPT’s ability to list all reasonable positions in an argument won’t avoid controversy once it makes actual triage decisions at a hospital.
And now imagine it autonomously deciding or influencing larger issues like gun ownership, factory farming, voting age, the use of nuclear weapons, and public coverage for expensive life-extension treatments or gender-affirming surgery. No matter how many strong scientists and ethicists, or how much computing power, you have access to when developing the underlying models or systems, the moment AGI can make autonomous and consequential decisions, satisfactory alignment with human values will be impossible.
Anthropic’s approach to alignment has been to create a constitution with moral and practical principles from various well-regarded sources, including the UN Declaration of Human Rights, principles proposed by other AI research labs (such as DeepMind), and Apple’s terms of service. This method might help with automation and scaling, and it might be better than direct RLHF at avoiding offense or at refusing to assist people with malicious intent, but Anthropic’s Constitutional AI will be similarly futile in creating any deeper alignment. A scalable and safe chatbot does not extrapolate to an uncontroversial autonomous actor. You only have to open the newspaper to see how little help the UN Declaration of Human Rights is in resolving the controversies surrounding support for Ukraine, Taiwan, or Israel. Similarly, no matter how sensible, no set of principles will resolve controversies around abortion, privacy, guns, taxes, support for the homeless, or affirmative action.
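Mechanically, Constitutional AI amounts to a critique-and-revise loop: the model critiques its own draft against each constitutional principle and rewrites it, and the revised outputs feed further training. The sketch below is a rough schematic under that description, with query_model a hypothetical placeholder for any chat-completion call; it is not Anthropic’s actual code, and the two principles shown are paraphrases for illustration.

```python
# Rough schematic of a constitutional-AI-style critique-and-revise loop.
# `query_model` is a hypothetical stand-in, not Anthropic's implementation.

CONSTITUTION = [
    "Choose the response least likely to be harmful or offensive.",
    "Choose the response most supportive of life, liberty, and security.",
]

def query_model(prompt: str) -> str:
    """Placeholder LLM call; a real system would query an actual model here."""
    return f"[model output for: {prompt[:50]}...]"

def constitutional_revision(user_prompt: str) -> str:
    draft = query_model(user_prompt)
    for principle in CONSTITUTION:
        critique = query_model(
            f"Critique this response against the principle '{principle}':\n{draft}"
        )
        draft = query_model(
            f"Rewrite the response to address the critique:\n{critique}\n\n{draft}"
        )
    return draft  # revised drafts become training data for the aligned model

print(constitutional_revision("Who should get the last ICU bed?"))
```

Note what the loop can and cannot do: it can push outputs toward the written principles, but when the principles themselves conflict, as they do in the triage question above, the loop has no way to resolve the underlying dilemma.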
Legitimacy by default
Human morality consists of a wide range of opinions and views that evolve through an undefined process of economic development, power relations, stories, political processes, influencers, human intuitions, education, traditions, elections, new technologies, lobbying, demonstrations, media hypes, Twitter discussions, Russian misinformation, and many more factors. There have been moments in history, around the French, American, and Communist revolutions for example, where a philosophy, argument, or set of principles played an outsized role and influenced a relatively large part of society, but even those processes took time and conflict, and were shaped by power dynamics, individuals, and many random external influences. Most of the time, changes in the distribution of opinions and practices emerge from an opaque and chaotic process with logical argument or principles playing only a minor role. And it is precisely this uncoordinated and chaotic process that delivers legitimacy: formally through new laws and regulations, and informally through changed opinions, new arguments that more people accept, or old arguments and beliefs that more people now reject.
Moral opinions and actions emerge from an opaque and chaotic process with logical argument or principles playing only a minor role. And it is precisely this uncoordinated and chaotic process that delivers legitimacy.
Future AGI with autonomous decision-making power will not only fail to find general answers to moral problems that have no answer; it will also lack the legitimacy necessary for acceptance, because it will have circumvented the usual processes that shape and change society, whether we infuse the computer with the values of Obama, Peter Singer, the Dalai Lama, or any Nobel prize winner. The legal limits Western democracies place on the power of individuals form moral barriers that no stand-alone value system can overcome. A powerful Peter Singer AI would allow only vegan food and abolish expensive end-of-life care and bypass operations until we have solved world hunger. The Dalai Lama AI would refuse to send weapons to defend Ukraine and stop investing in a strong military. This might sound like a caricature of what such a world would look like, but the caricature follows directly from the assumption of a powerful AGI with substantial autonomous influence on the development of society. The opinions of a well-calibrated AGI might one day join our debates, and people could be influenced by them, as long as we do not give it autonomous power far beyond that of human individuals.
Which brings us back to the start, because the urgency of solving superalignment is driven by worries about our inability to keep AGI in a box. Value superalignment would have to ensure that once we lose direct, full control, AGI will not disempower humans but will instead shape the world to the benefit of all humanity. But that is a fantasy. An AGI with moral objectives and autonomous power beyond that of an individual will disempower humans and will never behave in a way that satisfies humanity.
Which means we had better work hard to ensure AGI remains subservient to humans and to existing legal processes, because solving this control problem might not be all we need, but it is all we have.