Machine learning finds patterns in data. ‘AI Bias’ means that it might find the wrong patterns - a system for spotting skin cancer might be paying more attention to whether the photo was taken in a doctor’s office. ML doesn’t ‘understand’ anything - it just looks for patterns in numbers, and if the sample data isn’t representative, the output won’t be either. Meanwhile, the mechanics of ML might make this hard to spot.
The most obvious and immediately concerning place that this issue can come up is in human diversity, and there are plenty of reasons why data about people might come with embedded biases. But it’s misleading, or incomplete, to think that this is only about people - exactly the same issues will come up if you’re trying to spot a flood in a warehouse or a failing gas turbine. One system might be biased around different skin pigmentation, and another might be biased against Siemens sensors.
Such issues are not new or unique to machine learning - all complex organizations make bad assumptions and it’s always hard to work out how a decision was taken. The answer is to build tools and processes to check, and to educate the users - make sure people don’t just ‘do what the AI says’. Machine learning is much better at doing certain things than people, just as a dog is much better at finding drugs than people, but you wouldn’t convict someone on a dog’s evidence. And dogs are much more intelligent than any machine learning.
Machine learning is one of the most important fundamental trends in tech today, and it’s one of the main ways that tech will change things in the broader world in the next decade. As part of this, there are aspects to machine learning that cause concern - its potential impact on employment, for example, and its use for purposes that we might consider unethical, such as new capabilities it might give to oppressive governments. Another, and the topic of this post, is the problem of AI bias.
It’s not simple.
What is ‘AI Bias’?
“Raw data is both an oxymoron and a bad idea; to the contrary, data should be cooked, with care.”
Until about 2013, if you wanted to make a software system that could, say, recognise a cat in a photo, you would write logical steps. You’d make something that looked for edges in an image, and an eye detector, and a texture analyser for fur, and try to count legs, and so on, and you’d bolt them all together... and it would never really work. Conceptually, this is rather like trying to make a mechanical horse - it’s possible in theory, but in practice the complexity is too great for us to be able to describe. You end up with hundreds or thousands of hand-written rules without getting a working model.
With machine learning, we don’t use hand-written rules to recognise X or Y. Instead, we take a thousand examples of X and a thousand examples of Y, and we get the computer to build a model based on statistical analysis of those examples. Then we can give that model a new data point and it says, with a given degree of accuracy, whether it fits example set X or example set Y. Machine learning uses data to generate a model, rather than a human being writing the model. This produces startlingly good results, particularly for recognition or pattern-finding problems, and this is the reason why the whole tech industry is being remade around machine learning.
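To make the contrast concrete, here is a deliberately tiny sketch of ‘learning from examples’ - a nearest-centroid classifier in plain Python. The feature values are invented stand-ins for real data; a production system would use far richer representations, but the shape of the process is the same: the model is derived from the examples, not written by hand.

```python
# A tiny sketch of "learning from examples": a nearest-centroid classifier.
# Feature values are invented; real systems use far richer data.

def train(examples_x, examples_y):
    """Build a model: just the average feature vector of each example set."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    return {"X": centroid(examples_x), "Y": centroid(examples_y)}

def classify(model, point):
    """Label a new point by whichever centroid it sits closest to."""
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(model, key=lambda label: dist(model[label], point))

# A thousand examples of X and Y would go here; two each keep the sketch short.
set_x = [[0.9, 0.1], [0.8, 0.2]]   # e.g. 'cat' photos, reduced to two numbers
set_y = [[0.1, 0.9], [0.2, 0.8]]   # e.g. 'dog' photos
model = train(set_x, set_y)
print(classify(model, [0.85, 0.15]))  # → X
```

Everything the model ‘knows’ is in those averaged numbers - which is exactly why whatever else is in the examples ends up in the model too.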
However, there’s a catch. In the real world, your thousand (or hundred thousand, or million) examples of X and Y also contain A, B, J, L, O, R, and P. Those may not be evenly distributed, and they may be prominent enough that the system pays more attention to L and R than it does to X.
What does that mean in practice? My favorite example is the tendency of image recognition systems to look at a photo of a grassy hill and say ‘sheep’. Most of the pictures that are examples of ‘sheep’ were taken on grassy hills, because that’s where sheep tend to live, and the grass is a lot more prominent in the images than the little white fluffy things, so that’s where the systems place the most weight.
A more serious example came up recently with a project to look for skin cancer in photographs. It turns out that dermatologists often put rulers in photos of skin cancer, for scale, but that the example photos of healthy skin do not contain rulers. To the system, the rulers (or rather, the pixels that we see as a ruler) were just differences between the example sets, and sometimes more prominent than the small blotches on the skin. So, the system that was built to detect skin cancer was, sometimes, detecting rulers instead.
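The ruler problem can be reproduced in miniature. In this sketch (invented numbers, not the real study’s data), each photo is reduced to a weak, noisy ‘lesion’ signal plus a flag for whether a ruler is present - and a simple one-feature model scores far better on the ruler than on the lesion:

```python
import random
random.seed(0)

# Toy reconstruction of the ruler problem (invented numbers, not the study's data).
# Each photo is reduced to a weak, noisy 'lesion' signal plus a ruler flag.
def make_photo(cancer):
    lesion = random.gauss(0.6 if cancer else 0.4, 0.3)  # genuine but weak signal
    ruler = 1.0 if cancer else 0.0   # dermatologists add rulers for scale
    return [lesion, ruler]

train_set = [(make_photo(label), label) for label in [True, False] * 500]

def accuracy(feature_index, threshold):
    """Accuracy of a one-feature threshold rule on the training set."""
    hits = sum((photo[feature_index] > threshold) == label
               for photo, label in train_set)
    return hits / len(train_set)

print("lesion feature:", accuracy(0, 0.5))  # modest - the real signal is noisy
print("ruler feature:", accuracy(1, 0.5))   # 1.0 - the ruler 'wins'
```

Any system that optimises for accuracy on this training set will prefer the ruler, because on this data the ruler genuinely is the better predictor.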
A central thing to understand here is that the system has no semantic understanding of what it’s looking at. We look at a grid of pixels and translate that into sheep, or skin, or rulers, but the system just sees a string of numbers. It isn’t seeing 3D space, or objects, or texture, or sheep. It’s just seeing patterns in data.
Meanwhile, the challenge in trying to diagnose issues like this is that the model your machine learning system has generated (the neural network) contains thousands or hundreds of thousands of nodes. There is no straightforward way to look inside the model and see how it’s making the decision - if you could, then the process would be simple enough that you wouldn’t have needed ML in the first place and you could have just written the rules yourself. People worry that ML is a ‘black box’. (As I explain later, however, this issue is often hugely overstated.)
This, hugely simplified, is the ‘AI bias’ or ‘machine learning bias’ problem: a system for finding patterns in data might find the wrong patterns, and you might not realise. This is a fundamental characteristic of the technology, and it is very well-understood by everyone working on this in academia and at large tech companies (data people do understand sample bias, yes), but its consequences are complex, and our potential resolutions to those consequences are also complex.
First, the consequences.
AI bias scenarios
The most obvious and immediately concerning place that this issue can be manifested is in human diversity. It was recently reported that Amazon had tried building a machine learning system to screen resumés for recruitment. Since Amazon’s current employee base skews male, the examples of ‘successful hires’ also, mechanistically, skewed male and so, therefore, did this system’s selection of resumés. Amazon spotted this and the system was never put into production.
The most important part of this example is that the system reportedly manifested this skew even if the gender was not explicitly marked on the resumés. The system was seeing patterns in the sample set of ‘successful employees’ in other things - for example, women might use different words to describe accomplishments, or have played different sports at school. Of course, the system doesn’t know what ice hockey is, nor what people are, nor what ‘success’ is - it was just doing statistical analysis of the text. But the patterns that it was seeing were not necessarily things that a human being would have noticed, and with some things (the vocabulary used to describe success, perhaps, which we now know can vary between genders) a human might have struggled to see them even if they were looking for them.
It gets worse. A machine learning system that is very good at spotting skin cancer on pale skin might be worse at spotting skin cancer on darker coloured skin, or vice versa, not perhaps because of bias in the sample but because you might need to construct the model differently to begin with to pick out different characteristics. Machine learning systems are not interchangeable, even in a narrow application like image recognition. You have to tune the structure of the system, sometimes just by trial and error, to be good at spotting the particular features in the data that you’re interested in, until you get to the desired degree of accuracy. But you might not realise that the system is 98% accurate for one group but only 91% accurate for another group (even if that accuracy still surpasses human analysis).
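This is why a single headline accuracy number can mislead: the aggregate can look fine while hiding a gap. A minimal audit, sketched here with invented records, just computes accuracy per group as well as overall:

```python
# A minimal per-group accuracy audit, with invented records.
# Each record: (group, was_the_prediction_correct).
records = (
    [("group_a", True)] * 98 + [("group_a", False)] * 2 +  # 98% accurate
    [("group_b", True)] * 91 + [("group_b", False)] * 9    # 91% accurate
)

def accuracy_by_group(records):
    totals, hits = {}, {}
    for group, correct in records:
        totals[group] = totals.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + int(correct)
    return {g: hits[g] / totals[g] for g in totals}

overall = sum(correct for _, correct in records) / len(records)
print(round(overall, 3))           # 0.945 - looks fine in aggregate
print(accuracy_by_group(records))  # the 98% / 91% gap only shows per group
```

The check is trivial once you think to make it - the hard part is knowing which groups to slice by in the first place.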
So far I’ve mostly used examples around people and their characteristics, and naturally this is where a lot of the conversation around this tends to focus. But it’s important to understand that bias around people is a subset of the issue: we will use ML for lots of things and sample bias will be part of the consideration in all of those. And equally, even if you are working with people, the bias in the data might not be around people.
To understand this systematically, it’s useful to go back to the skin cancer example from earlier, and consider three hypothetical ways it might break:
You don’t have an even distribution of people: your set of example photos is unbalanced across skin tones, so your system gives false positives or false negatives depending on skin pigmentation.
Your data contains a prominent and unevenly distributed non-human characteristic with no diagnostic value, and the system trains on that - the ruler in the photo of skin cancer, or the grass in the photo of a flock of sheep. In this case the system’s result changes depending on whether the pixels that we see as a ‘ruler’ (but that it does not) are present.
Your data contains some other characteristic that a human cannot see even if they look for it.
What does ‘even if they look for it’ mean? Well, we know a priori, or ought to know, that the data might be skewed around different human groups, and can at least plan to look for this (to put this the other way around, there are all sorts of social reasons why you might expect your data to come with bias around human groups). And if we look at the photo with the ruler, we can see the ruler - we just ignored it, because we knew it was irrelevant and we forgot that the system did not know anything.
But what if all of your photos of unhealthy skin are taken in an office with incandescent light and your photos of healthy skin are taken under fluorescent light? What if you updated the operating system on your smartphone between taking the healthy photos and the unhealthy photos, and Apple or Google made some small change to the noise reduction algorithm? This might be totally invisible to a human, no matter how hard they look, but the machine learning system will see it instantly and use it. It doesn’t know anything.
Next, so far we’ve been talking about correlations that are false, but there may also be patterns in the data that are entirely accurate and correct predictors, but that you don’t want to use, for ethical, legal or product-based reasons. In some jurisdictions, for example, you are not allowed to give better car insurance rates to women even though women might tend to be safer drivers. One could easily imagine a system that looks at the historical data and learns to associate ‘female’ first names with lower risk, so you would remove the first names from the data - but, as with the Amazon example above, there might be other factors that reveal the gender to the system (though of course it would have no concept of gender, or indeed cars), and you might not realise this until the regulator did an ex post statistical analysis of the quotes you’ve given and fined you.
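One way to check for this kind of leakage, sketched here with invented rows: after removing the explicit column, see how well the remaining features still predict the attribute you removed. If even a crude majority-vote guess from a single proxy recovers gender well above chance, the signal is still in the data:

```python
from collections import Counter

# Invented rows after the explicit first-name column has been removed.
# Remaining proxy feature: school sport; hidden attribute: gender.
rows = [("field_hockey", "F"), ("field_hockey", "F"), ("field_hockey", "F"),
        ("field_hockey", "M"), ("football", "M"), ("football", "M"),
        ("football", "M"), ("football", "F")]

def proxy_accuracy(rows):
    """How often a majority-vote guess from the proxy recovers the attribute."""
    majority = {}
    for feature, attribute in rows:
        majority.setdefault(feature, Counter())[attribute] += 1
    correct = sum(attribute == majority[feature].most_common(1)[0][0]
                  for feature, attribute in rows)
    return correct / len(rows)

print(proxy_accuracy(rows))  # 0.75 - well above chance: gender leaks through
```

A real audit would test every remaining feature, and combinations of them, but the principle is the same: deleting the column does not delete the signal.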
Finally, this is sometimes talked about as though we will only use these systems for things that involve people and social interactions and assumptions in some way. We won’t. If you make gas turbines, you would be very interested in applying machine learning to the telemetry coming from dozens or hundreds of sensors on your product (audio, vibration, temperature - any sensor generates data that can easily be repurposed to build a machine learning model). Hypothetically, you might say ‘here is data from a thousand turbines that were about to fail and here is data from a thousand turbines that were working fine - build a model to tell the difference’. Now, suppose that 75% of the bad turbines use a Siemens sensor and only 12% of the good turbines use one (and suppose this has no connection to the failure). The system will build a model to spot turbines with Siemens sensors. Oops.
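Working through those hypothetical numbers shows just how well the wrong rule scores. Flagging a turbine purely on its sensor brand catches 750 of the 1,000 bad turbines and correctly passes 880 of the 1,000 good ones:

```python
# The hypothetical turbine numbers, worked through. A model that flags a
# turbine purely on its sensor brand already scores well on this training set.
bad, good = 1000, 1000
bad_siemens = 0.75 * bad    # 750 failing turbines have the Siemens sensor
good_siemens = 0.12 * good  # 120 healthy turbines have the Siemens sensor

# Rule: predict 'about to fail' iff the turbine has a Siemens sensor.
true_positives = bad_siemens          # 750 bad turbines flagged correctly
true_negatives = good - good_siemens  # 880 good turbines passed correctly
accuracy = (true_positives + true_negatives) / (bad + good)
print(accuracy)  # 0.815 - 81.5%, for a reason with no diagnostic value at all
```

An 81.5% accurate model built on a completely spurious correlation - good enough to pass a casual evaluation and fail badly on turbines from a different fleet.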
AI bias management
What do we do about this? You can divide thinking in the field into three areas:
Methodological rigour in the collection and management of the training data
Technical tools to analyse and diagnose the behavior of the model
Training, education and caution in the deployment of ML in products
There’s a joke in Molière's Bourgeois Gentilhomme about a man who is taught that literature is divided into ‘poetry’ and ‘prose’, and is delighted to discover that he’s been speaking prose his whole life without realising. Statisticians might feel the same way today - they’ve been working on ‘artificial intelligence’ and ‘sample bias’ for their whole careers without realising. Looking for and worrying about sample bias is not a new problem - we just have to be very systematic about it. As mentioned above, in some ways this might actually, mechanistically, be easier when looking at issues around people, since we know a priori that we might have bias against different human groups, whereas we might not realise a priori that we might have bias against Siemens.
The part that’s new, of course, is that the people aren’t doing the statistical analysis directly anymore - it’s being done by machines that generate models of great complexity and size that are not straightforward to analyse. This question of transparency is one of the main areas of concern around bias - the fear is not just that it’s biased but that there is no way to tell, and that this is somehow fundamentally new and different from other forms of organization or automation, where there are (supposedly) clear logical steps that you can audit.
There are two problems with this: we probably can audit ML systems in some ways, and it’s not really any easier to audit any other system.
First, one part of current machine learning research is around finding tools and methods to work out what features are most prominent in a machine learning system. Meanwhile, machine learning (in its current manifestation) is a very new field and the science is changing fast - one should not assume that what is not practical today will not become practical soon. This OpenAI project is an interesting example of exactly this.
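Permutation importance is one simple, widely used example of such a diagnostic (illustrated here with a toy model and invented data, not the OpenAI project): shuffle one feature column and measure how much the model’s accuracy drops. A large drop means the model leans heavily on that feature - without needing to look inside the model at all.

```python
import random
random.seed(1)

# Permutation importance, a common diagnostic (toy model, invented data):
# shuffle one feature column and measure how much accuracy drops. A large
# drop means the model leans heavily on that feature.
def model(features):
    return features[1] > 0.5   # this 'model' secretly uses only feature 1

data = ([([random.random(), 1.0], True) for _ in range(50)] +
        [([random.random(), 0.0], False) for _ in range(50)])

def accuracy(rows):
    return sum(model(x) == y for x, y in rows) / len(rows)

def permutation_importance(rows, i):
    column = [x[i] for x, _ in rows]
    random.shuffle(column)
    permuted = [(x[:i] + [v] + x[i + 1:], y) for (x, y), v in zip(rows, column)]
    return accuracy(rows) - accuracy(permuted)

print(permutation_importance(data, 0))  # 0.0 - feature 0 is never used
print(permutation_importance(data, 1))  # large drop - the model relies on it
```

Run against the skin-cancer model, a check like this would flag the ruler; run against the turbine model, it would flag the sensor brand - which is exactly the prompt you need to go and ask why.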
Second, the idea that you can audit and understand decision-making in existing systems or organisations is true in theory but flawed in practice. It is not at all easy to audit how a decision is taken in a large organisation. There may well be a formal decision process, but that’s not how the people actually interact, and the people themselves often do not have a clearly logical and systematic way of making their individual decisions. As my colleague Vijay Pande argued here, people are black boxes too - combine thousands of people in many overlapping companies and institutions and the problem compounds. We know ex post that the Space Shuttle was going to disintegrate on re-entry, and different people inside NASA had information that made them think something bad might happen, but the system overall did not know that. Meanwhile, NASA had been through exactly this auditing process when it lost the previous space shuttle, and yet it lost another one for very similar reasons. It’s easy to say that organizations and human systems follow clear logical rules that you can audit, understand and change - experience suggests otherwise. This is the Gosplan fallacy.
In this context, I often compare machine learning to databases, and especially relational databases - a new fundamental technology that changed what was possible in computer science and changed the broader world, that became a commodity that was part of everything, and that we now use all the time without noticing. But databases had problems too, and the problems had the same character: the system could be built on bad assumptions or bad data, it would be hard to tell, and the people using it would do what the system told them without questioning it. There are lots of old jokes about the tax office misspelling your name, and it being easier to change your name than persuade them to fix the spelling. Is this best thought of as a technical problem inherent to SQL, an execution failure by Oracle, or an institutional failure by a large bureaucracy? And how easy would it be to work out the exact process whereby a system was deployed with no capability to fix typos, or to know that it had done this before people started complaining?
At an even simpler level, one can see this issue in the phenomenon of people driving their cars into rivers because of an out-of-date SatNav. Yes, the maps should be kept up to date. But how far is it TomTom’s fault that your car is floating out to sea?
All of this is to say that ML bias will cause problems, in roughly the same kinds of ways as problems in the past, and will be resolvable and discoverable, or not, to roughly the same degree as they were in the past. Hence, the scenario for AI bias causing harm that is easiest to imagine is probably not one that comes from leading researchers at a major institution. Rather, it is a third-tier technology contractor or software vendor that bolts together something out of open source components, libraries and tools that it doesn’t really understand and then sells it to an unsophisticated buyer that sees ‘AI’ on the sticker and doesn’t ask the right questions, gives it to minimum-wage employees and tells them to do whatever the ‘AI’ says. This is what happened with databases. This is not, particularly, an AI problem, or even a ‘software’ problem. It’s a ‘human’ problem.
“Machine Learning can do anything you could train a dog to do - but you’re never totally sure what you trained the dog to do.”
I often think that the term ‘artificial intelligence’ is deeply unhelpful in conversations like this. It creates the largely false impression that we have actually created, well, intelligence - that we are somehow on a path to HAL 9000 or Skynet - towards something that actually understands. We aren’t. These are just machines, and it’s much more useful to compare them to, say, a washing machine. A washing machine is much better than a human at washing clothes, but if you put dishes in a washing machine instead of clothes and press start, it will wash them. They’ll even get clean. But this won’t be the result you were looking for, and it won’t be because the system is biased against dishes. A washing machine doesn’t know what clothes or dishes are - it’s just a piece of automation, and it is not conceptually very different from any previous wave of automation.
That is, just as for cars, or aircraft, or databases, these systems can be both extremely powerful and extremely limited, and depend entirely on how they’re used by people, and on how well or badly intentioned and how educated or ignorant people are of how these systems work.
Hence, it is completely false to say that ‘AI is maths, so it cannot be biased’. But it is equally false to say that ML is ‘inherently biased’. ML finds patterns in data - what patterns depends on the data, and the data is up to us, and what we do with it is up to us. Machine learning is much better at doing certain things than people, just as a dog is much better at finding drugs than people, but you wouldn’t convict someone on a dog’s evidence. And dogs are much more intelligent than any machine learning.