Even with a decent dataset to learn from, gender-classification software performs worse the darker your skin is.
That’s the conclusion of a new study, “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification”, that compared gender classifiers developed by Microsoft, IBM, and Chinese startup Face++ (also known as Megvii).
The study found that all three services consistently performed worse for people with darker skin, especially women.
The paper found the worst error rate for white males across the three services was only 0.8 per cent, a figure recorded by Face++. IBM's model, by contrast, misclassified 34.7 per cent of black women.
Authors Joy Buolamwini, a researcher at MIT Media Lab, and Timnit Gebru, a postdoctoral researcher at Microsoft, fed the services a dataset they dubbed the Pilot Parliaments Benchmark (PPB). The dataset comprised 1,270 male and female parliamentarians from Rwanda, Senegal, South Africa, Iceland, Finland and Sweden. The authors assert the resulting set of images reflects a fairer approximation of the world's population.
Other datasets, such as IJB-A, used for a facial recognition competition set by the US National Institute of Standards and Technology (NIST), and Adience, used for gender and age classification, were both overwhelmingly skewed towards people with lighter skin.
And the winner is … nobody
After testing the three gender-recognition-as-a-service APIs from Microsoft, IBM and Face++, they found that Microsoft performed best and IBM worst.
Microsoft was perfect at identifying white males in the PPB dataset, but its error rate for white females was 1.7 per cent. Figures worsened for darker skin: for black males the error rate hit six per cent, and for black women it was 20.8 per cent.
In second place was Face++, which was also much better at classifying males than females regardless of skin colour. Its error rate was 0.8 per cent for white males and 0.7 per cent for black males, but six per cent for white women and 34.5 per cent for black women.
IBM struggled the most with darker skin tones. The results report error rates of 0.3 per cent for white males, 7.1 per cent for white females, 12 per cent for black males, and 34.7 per cent for black females.
The authors argue that facial recognition software is "very likely" to be used for identifying criminal suspects. An algorithmic error in the output of commercial recognition services could therefore have dire consequences: people may be wrongfully accused of a crime or misidentified in security footage.
The authors therefore argue for greater transparency, and have called for companies to disclose the demographic and phenotypic makeup of the images used to train AI models, as well as to report performance levels for each subgroup.
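The per-subgroup reporting the authors call for amounts to breaking a model's misclassification rate down by group rather than quoting one aggregate number. As a rough illustration only (the subgroup labels and predictions below are hypothetical, not data from the paper), it can be computed like this:

```python
# Sketch of per-subgroup error reporting: group predictions by subgroup
# and compute the misclassification rate within each group.
from collections import defaultdict

def subgroup_error_rates(records):
    """records: iterable of (subgroup, true_label, predicted_label).
    Returns a dict mapping each subgroup to its error rate."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for subgroup, truth, pred in records:
        totals[subgroup] += 1
        if truth != pred:
            errors[subgroup] += 1
    return {g: errors[g] / totals[g] for g in totals}

# Hypothetical classifier output, not figures from the study
sample = [
    ("lighter male", "male", "male"),
    ("lighter male", "male", "male"),
    ("darker female", "female", "male"),    # misclassified
    ("darker female", "female", "female"),
]
print(subgroup_error_rates(sample))
# → {'lighter male': 0.0, 'darker female': 0.5}
```

An aggregate accuracy over all four samples would read 75 per cent and hide the disparity; the subgroup breakdown is what exposes it.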
Their paper (PDF) has been accepted for the Conference on Fairness, Accountability and Transparency happening in New York later this month.
The Register has contacted all three companies for comment. ®
Updated to add
Big Blue has hit back with a big blog post.
“For the past nine months, IBM has been working toward substantially increasing the accuracy of its new Watson Visual Recognition service for facial analysis, which now uses broader training datasets and more robust recognition capabilities than the service evaluated in this study,” wrote Ruchir Puri, chief architect and IBM fellow for IBM Watson and Cloud Platform.
“Our new service, which will be released on February 23, demonstrates a nearly ten-fold decrease in error rate for facial analysis when measured with a test set similar to the one in Buolamwini and Gebru’s paper.”