Despite vigorous debates about the technical characteristics of risk assessments being deployed in the U.S. criminal justice system, remarkably little research has studied how these tools affect actual decision-making processes. After all, risk assessments do not make definitive decisions: they inform judges, who are the final arbiters.
It is therefore essential that considerations of risk assessments be informed by rigorous studies of how judges actually interpret and use them. This paper takes a first step toward such research on human interactions with risk assessments through a controlled experimental study on Amazon Mechanical Turk.
We found several behaviors that call into question the supposed efficacy and fairness of risk assessments. Our study participants:
- underperformed the risk assessment even when presented with its predictions,
- could not effectively evaluate the accuracy of their own or the risk assessment’s predictions, and
- exhibited behaviors fraught with “disparate interactions,” whereby the use of risk assessments led to higher risk predictions about black defendants and lower risk predictions about white defendants.
These results suggest the need for a new “algorithm-in-the-loop” framework that places machine learning decision-making aids into the sociotechnical context of improving human decisions rather than the technical context of generating the best prediction in the abstract. If risk assessments are to be used at all, they must be grounded in rigorous evaluations of their real-world impacts instead of in their theoretical potential.