Authors: Toren Fronsdal, Eric Gilliam, Aakash Pattabi, and Anoop Manjunath
The gods visit the sins of the fathers upon the children.
- Euripides, Greek tragedian
We are living in the age of big data. Computers can carry out tasks that were previously unthinkable: beating the best humans at chess or Jeopardy, recommending TV shows, recognizing emotions in literary passages, and even predicting sexual orientation based on Facebook profile pictures. But we must be careful. Algorithms learn from data based on humans’ decisions. And humans are still flawed.
If those who have historically made hiring decisions acted on implicit discrimination or subconscious bias, then algorithms trained on their decisions will systematize and perpetuate those biases. Feeding our machines data shaped by humans’ racism, sexism, or other ‘-isms’ only guarantees that the machines will be as biased as we are. Using a new family of machine learning methods, however, my colleagues and I demonstrate that this future is not inevitable. It is possible to build a less biased future.
A Story
In 2004, University of Chicago economists Marianne Bertrand and Sendhil Mullainathan published a paper titled Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination. They conducted a randomized experiment in which they submitted fake resumes to employers who had posted ‘Help Wanted’ advertisements in local newspapers in Chicago and Boston. The authors then randomly assigned each resume a name perceived as stereotypically ‘white-sounding’ or ‘black-sounding’.
The average callback rate for resumes with black-sounding names was about 6.3%; for resumes with white-sounding names, it was about 9.6%. The difference is striking: applicants with white-sounding names were over 50% more likely to receive a callback for an interview. As the authors put it, “A white name yields as many more callbacks as an additional eight years of experience on a resume.”
Our Experiment
This study is about as close as one can get to knowing the true effect that being black has on job outcomes.
Real-world data is usually observational: data collected from the world as it is, without running any kind of experiment. With ordinary hiring data, it is impossible to determine to what extent being black hurt black applicants. An employer could always claim that black applicants’ outcomes were worse simply because they happened to be less qualified for the job.
This study, by contrast, was a randomized controlled trial, commonly referred to as the gold standard of causal inference. A randomized controlled trial is so powerful because it isolates a single facet of a job-seeker’s application –– in this case, being black –– to see how it affects callback rates. Because resumes were randomly assigned black-sounding or white-sounding names, the assignment of names had nothing to do with a resume’s qualifications. A stellar resume was just as likely to receive a black-sounding name as a white-sounding name in this trial.
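To make that logic concrete, here is a minimal, purely illustrative sketch in Python. The resume pool and its single ‘quality’ score are invented for the example; they are not the actual study variables.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# A hypothetical pool of resumes, each summarized by a single "quality" score.
resumes = pd.DataFrame({"quality": rng.normal(size=5000)})

# Randomly assign each resume a black-sounding (1) or white-sounding (0) name,
# independently of its quality.
resumes["black_name"] = rng.integers(0, 2, size=len(resumes))

# Because the assignment ignores quality, the two groups end up with roughly
# the same average quality, so any difference in callback rates can only be
# caused by the names themselves.
print(resumes.groupby("black_name")["quality"].mean())
```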
Randomization took care of the confounding problem, so in this case we know the true effect of being black on callbacks. We wanted to know whether advanced statistical methods could uncover bias like this in real-world data, where no such experiment is available.
To test this, we purposely biased the data from the Bertrand and Mullainathan experiment. We removed many of the less qualified applicants with black-sounding names, until the callback rates for the two groups were about equal. That does not mean the hiring was fair: the remaining pool of resumes with black-sounding names was, on average, more qualified, yet it received callbacks at the same rate.
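Roughly, the biasing procedure looked like the sketch below. The column names are hypothetical stand-ins; the real data contains many qualification variables rather than a single ‘quality’ score.

```python
import pandas as pd

def bias_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the least qualified black-named resumes until callback rates roughly match."""
    white = df[df["black_name"] == 0]
    black = df[df["black_name"] == 1].sort_values("quality", ascending=False)

    target_rate = white["callback"].mean()

    # Shrink the black-named pool from the bottom (lowest quality first)
    # until its callback rate catches up with the white-named pool's.
    kept = black
    for n in range(len(black), 0, -1):
        kept = black.head(n)
        if kept["callback"].mean() >= target_rate:
            break

    return pd.concat([white, kept], ignore_index=True)
```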
We then used new methods from the field of machine learning and causal inference to demonstrate that this hidden bias can be found. This new field of statistics combines the methods of machine learning with the principles of causal inference in order to understand cause and effect in complex datasets.
In the original data, resumes with white-sounding names were 52% more likely to receive a callback than those with black-sounding names, and these methods recovered as much as 91% of that bias from the purposely biased dataset.
Machine Learning and Causal Inference
Machine learning algorithms are often referred to as black boxes. The analogy refers to their uncanny ability to make very accurate predictions without being able to explain how they arrived at them. The machine learning done by most computer scientists is great for prediction, but poor at determining to what extent one thing caused another –– otherwise known as causal inference.
Empirical economists, on the other hand, tend to specialize in causal inference. “The ideal technology that causal inference strives to emulate is in our own mind,” wrote authors Judea Pearl and Dana Mackenzie in The Book of Why: The New Science of Cause and Effect. “Some tens of thousands of years ago, humans began to realize that certain things cause other things, and that tinkering with the former could change the latter. No other species grasps this, certainly not to the extent that we do.”
Economists and statisticians like Susan Athey, Guido Imbens, and Stefan Wager of Stanford University are part of a small group of pioneers combining the statistical methods of machine learning with the principles of causal inference. This field utilizes machine learning methods to augment traditional methods of causal inference.
Machine learning typically uses complex statistical models to build algorithms for prediction or classification. Machine learning and causal inference, however, repurposes these tools to measure causality. When introducing her methods at a 2015 Big Data conference at Harvard, Dr. Athey explained, “We can take all the machinery that we already have and apply it to causal inference.”
This field has produced new methods with names such as X-learners and causal forests, which can be applied to observational data to estimate the causal effect that traits like race have on outcomes such as hiring. We used our deliberately biased hiring dataset to test whether they could determine how biased the data really was.
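To give a flavor of how these estimators work, here is a simplified sketch of the X-learner built from off-the-shelf scikit-learn forests. It illustrates the idea from Künzel et al. rather than our exact implementation, and it omits refinements such as cross-fitting; causal forests, for their part, are available in packages such as grf (R) and EconML (Python).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def x_learner_effects(X, treat, y):
    """Estimate, for each resume, the effect of 'treatment' (having a
    black-sounding name) on the callback outcome."""
    X1, y1 = X[treat == 1], y[treat == 1]   # resumes with black-sounding names
    X0, y0 = X[treat == 0], y[treat == 0]   # resumes with white-sounding names

    # Stage 1: model the outcome separately within each group.
    mu1 = RandomForestRegressor().fit(X1, y1)
    mu0 = RandomForestRegressor().fit(X0, y0)

    # Stage 2: impute individual-level treatment effects, then model them.
    d1 = y1 - mu0.predict(X1)   # observed treated outcome minus predicted control outcome
    d0 = mu1.predict(X0) - y0   # predicted treated outcome minus observed control outcome
    tau1 = RandomForestRegressor().fit(X1, d1)
    tau0 = RandomForestRegressor().fit(X0, d0)

    # Stage 3: blend the two effect models, weighting by the propensity score.
    g = RandomForestClassifier().fit(X, treat).predict_proba(X)[:, 1]
    return g * tau0.predict(X) + (1 - g) * tau1.predict(X)
```

Averaging these per-resume estimates gives the overall effect of a black-sounding name that the method believes is hidden in the data.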
What we found
The machine learning and causal inference methods proved to be a dramatic step in the right direction for detecting and correcting problems like racist hiring practices.
Recall that we had changed the dataset so that resumes with black-sounding and white-sounding names were called for interviews at equal rates, by removing many of the less qualified resumes with black-sounding names from the total pool.
The X-learner –– first published in 2017 –– was able to recover 70% of the bias. Despite the apparent absence of any difference between black-sounding and white-sounding names in the biased dataset, it showed that the true effect of having a black-sounding name was negative. The causal forest did even better, finding 91% of the total bias in the data –– a dramatic improvement over traditional methods.
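In rough terms, ‘recovering 70% of the bias’ means that the average effect an estimator finds in the biased data is about 70% of the true effect measured in the randomized trial. The numbers below are placeholders chosen purely for illustration.

```python
# True effect from the randomized trial: roughly the 3.3-point callback gap
# (9.6% vs. 6.3%) attributable to having a black-sounding name.
true_effect = -0.033

# Placeholder: the average per-resume effect an estimator recovers from
# the deliberately biased data.
estimated_effect = -0.023

recovered_fraction = estimated_effect / true_effect
print(f"Recovered {recovered_fraction:.0%} of the true bias")  # Recovered 70% of the true bias
```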
The methods accomplished this by going beyond a simple comparison of the group of resumes with black-sounding names to the group with white-sounding names. Instead, they compared the callback rate of resumes of a particular caliber with black-sounding names to the callback rate of similarly qualified resumes with white-sounding names.
Of the three less sophisticated methods we tested that do not employ machine learning, only one was able to find any statistically significant bias at all.
These machine learning and causal inference methods were remarkably successful at finding hidden bias in real-world data.
A Better Future
Problems caused by racist, sexist, or otherwise biased data are growing more common. Several years ago, Amazon attempted to develop algorithms that would make hiring decisions for the company, putting its machine learning toolbox at the center of its hiring process. Soon after the company began producing software to rank candidates, it realized that the rankings were far from gender-neutral: the system penalized words such as ‘women’s’ and the names of all-women’s colleges.
Why did this happen?
The old machine learning adage about feeding data into models says it best: “garbage in, garbage out.” The algorithm did exactly what it was asked to do. If you hand a computer data riddled with human bias, that is exactly what it will learn.
Amazon’s models were trained to vet applicants by looking at Amazon’s hiring patterns over the previous ten years. It turns out those patterns had been far from unbiased.
In the future, causal forests or X-learners could be used by companies like Amazon in exactly this situation. Instead of naively training an algorithm on its own biased data, a company could first use these methods to discover to what extent being a woman has unfairly hurt applicants in the past. Companies can still implement hiring algorithms, but now they will be able to modify those algorithms to counteract the bias present in the company’s previous hiring data. In this way, companies can harness the power of machine learning while avoiding the perpetuation of past sexism.
Computer scientists are being tasked with building algorithms for criminal sentencing, loan approval, job hiring, and many more tasks. All of these areas are susceptible to using data riddled with past human bias. Machine learning and causal inference methods might be the right tools to save us from a future doomed to headlines such as, “Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women.”
References
Bertrand, Marianne, and Sendhil Mullainathan. “Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination.” American Economic Review, vol. 94, no. 4, Sept. 2004, pp. 991–1013.
Athey, Susan, and Stefan Wager. “Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests.” ArXiv.org, 10 July 2017, arxiv.org/abs/1510.04342.
Künzel, Sören, et al. “Meta-Learners for Estimating Heterogeneous Treatment Effects Using Machine Learning.” ArXiv.org, 24 Apr. 2019, arxiv.org/abs/1706.03461.
Dastin, Jeffrey. “Amazon Scraps Secret AI Recruiting Tool That Showed Bias against Women.” Reuters, Thomson Reuters, 10 Oct. 2018, www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G.