Utilitarianism and Goodhart’s Law

A consequentialist theory of morality is a theory that judges the correctness of an action by the consequences of that action. If an action tends to bring about a better state of the world, then it is judged to be moral.

Utilitarianism is a type of consequentialism which holds that the goodness or badness of a state of the world depends on one specific criterion. The measure of goodness under that criterion is called utility. If an action tends to increase the amount of utility in the world, then the action is moral. If it tends to decrease the amount of utility, it is immoral. So to a utilitarian, the goal of morality is to maximize utility.

To say utility is a measure of goodness is pretty vague, and so utilitarian theories of morality usually try to pin down the definition of utility a little more precisely. Jeremy Bentham’s original utilitarianism based morality on achieving “the greatest happiness for the greatest number”[1]. John Stuart Mill’s version of utility considers not only the quantity of happiness but also the quality of the happiness[2]. Preference utilitarianism considers satisfying people’s preferences to be the fundamental measure of utility[3]. Negative utilitarianism values the minimization of pain, rather than the maximization of happiness. Other theories may hold that utility is the balance of pleasure over pain, or the average person’s happiness, or the fulfillment of desires[4].

Criticism of these moral theories usually involves pointing out a situation where the theory clearly does not work properly. For example, if the true goal of morality is maximizing happiness, then wouldn’t this mean that the moral thing to do is to put everyone on drugs that make them happy all the time? Or if morality is to be based on minimizing pain, consider that this could be achieved by simply painlessly killing everyone. If moral actions are actions that maximize pleasure, then perhaps we should be building huge warehouses full of brains in glass jars having their pleasure centers artificially stimulated[5]. If utility is measured by the average person’s happiness, should we just kill all the unhappy people? Suddenly it begins to seem like these simple criteria might not be the ultimate basis for right and wrong after all.

All else being equal, a world with more pleasure, more desire fulfillment, more preference satisfaction, less pain and more happiness certainly seems to be better than a world with more pain, less happiness, etc. Those qualities are all good indicators of morality in the world. But as soon as you treat them as goals instead of indicators, they seem to stop being good measurements of morality.

This effect is similar to an idea originally from economics known as Goodhart’s Law, which states that when an economic indicator is used instead as an economic target, it ceases to be a good indicator[6]. For example, when palaeontologists in Java offered the locals ten cents for each fragment of hominid bone they turned in, the locals started smashing the precious bones into smaller fragments to increase their payment[7]. Another example is the “window tax”, a property tax that Britain introduced in 1696 which used the number of windows on each building as an indicator of how wealthy its occupants were. Soon enough, people began bricking up their windows[8]. In both cases the number of bone fragments or the number of windows was a good proxy for the thing being measured until it was itself made the goal. Then people figured out how to game the system, and the correlation between the real goal and the indicator disappeared.
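The bone-fragment example can be sketched as a toy model. This is only an illustration under invented assumptions: the function names, the smashing options, and the rule that a find loses value in proportion to the number of pieces it is broken into are all hypothetical, chosen just to make the mechanism visible.

```python
# Toy model of the bone-fragment anecdote. All numbers are hypothetical.
PAY_PER_FRAGMENT = 0.10  # ten cents per fragment, as in the anecdote


def turn_in(finds, pieces_per_find):
    """Return (fragments handed in, payment, scientific value).

    Assumption: an intact find is worth 1.0 to science, and its value
    is divided by the number of pieces it is broken into.
    """
    fragments = finds * pieces_per_find
    payment = fragments * PAY_PER_FRAGMENT
    value = finds * (1.0 / pieces_per_find)
    return fragments, payment, value


def best_response(finds, options=(1, 2, 5, 10)):
    """Pick the smashing level that maximizes the collector's payment."""
    return max(options, key=lambda k: turn_in(finds, k)[1])


# While nobody games the count, fragments track scientific value.
honest = turn_in(finds=4, pieces_per_find=1)

# Once the count is the target, the payment-maximizing choice is to smash.
k = best_response(finds=4)
gamed = turn_in(finds=4, pieces_per_find=k)
```

In this sketch the fragment count rises from `honest` to `gamed` while the scientific value falls, which is the sense in which the indicator stops tracking the goal once it is rewarded directly.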

Similarly, things like happiness, desire fulfillment, or preference satisfaction seem like good indicators of goodness in the world, but that doesn’t mean that they are the goal itself. The actual state of the world corresponding to “goodness” is likely to be something much more subtle, more complicated and more difficult to pin down than any of these ideas, and we should be cautious about naming any one simple objective as the one true goal of morality.

This idea is not necessarily opposed to existing utilitarian theories of morality; it just advocates a kind of humility. There’s little reason to expect our human concept of goodness to boil down to just one simple rule. Even though we can think of examples of things that we think are good, we should resist the temptation to feel that we know precisely what goodness really is. So, we may favor the idea that utility really is preference satisfaction, but then qualify this by saying that “preference” in this context is something much subtler and more complex than simply the answer people give when asked what they prefer. What we mean by “preference” is what people would choose if they had all the relevant information, could reason perfectly from that knowledge and could make optimal decisions from that reasoning.

It’s not that there couldn’t, in principle, be some single all-encompassing criterion of morality. In this context, Goodhart’s Law simply warns against the potential danger of confusing a good indicator of morality with the actual goal of morality.