By: Arjun N.
School: University High School
Advisor: Tim Smay
Toxicity and bullying in online forums have only grown. AI-driven algorithms deployed to moderate and reduce hateful content often exhibit bias, associating racial, gender, and other identity terms with toxicity and unduly censoring productive discussions. This censorship undermines trust and deters users who use these terms in genuine, non-toxic ways. Bias-free, or at least less biased, AI models that moderate better by classifying toxic comments fairly can regain users' trust and facilitate safer, more productive online discussions.
Current approaches to bias-free AI models are limited in their debiasing scope, do not scale because they rely on manual feature selection, and are often black boxes, i.e., not interpretable.
The novel method used in this project addressed these issues in three ways: it used a hierarchical attention-based sequence-learning neural network, which is more interpretable; it adopted a linguistics-driven Noun-Adjective criterion that broadened the set of identity terms found to contribute to toxicity; and it scaled by auto-filtering the selection of relevant identity terms and grid-searching the metaparameters, then de-biasing by augmenting the dataset with an appropriate number of counter-examples containing those identity terms. The model was trained on over 99,000 labeled (toxic/non-toxic) comments from the Wikipedia comment dataset and tested on a balanced test set. The model achieved an AUC of 0.98, compared to 0.95 in a prior paper. The method identified several hundred more identity terms than prior papers and debiased significantly better than the control set, without any human intervention.
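The auto-filtering and augmentation steps above can be sketched in miniature. The snippet below is an illustrative toy, not the project's actual code: the tiny corpus, the hard-coded POS lexicon (a real pipeline would use a proper part-of-speech tagger), and the `IDENTITY_HINTS` seed list are all assumptions made for the demo. It shows the core idea: keep only nouns and adjectives as candidate terms, flag identity terms that skew toward toxic comments, and append neutral counter-examples until each flagged term appears equally often in both classes.

```python
from collections import Counter

# Toy labeled corpus: (comment, label), label 1 = toxic, 0 = non-toxic.
# The real project used over 99,000 labeled Wikipedia comments.
CORPUS = [
    ("gay people are terrible", 1),
    ("you are a terrible gay idiot", 1),
    ("the gay rights movement made history", 0),
    ("great article, thanks", 0),
    ("what an awful muslim troll", 1),
    ("muslim scholars preserved these texts", 0),
]

# Illustrative POS lexicon; the actual method would run a real POS tagger
# and keep only nouns and adjectives as candidates (the Noun-Adjective criterion).
POS = {"gay": "ADJ", "muslim": "ADJ", "terrible": "ADJ", "awful": "ADJ",
       "idiot": "NOUN", "troll": "NOUN", "people": "NOUN", "article": "NOUN",
       "movement": "NOUN", "history": "NOUN", "rights": "NOUN",
       "scholars": "NOUN", "texts": "NOUN"}

# Hypothetical seed list of identity terms for the demo.
IDENTITY_HINTS = {"gay", "muslim"}

def class_counts(corpus):
    """Count, per noun/adjective, how many toxic vs. non-toxic comments contain it."""
    tox, non = Counter(), Counter()
    for text, label in corpus:
        for word in set(text.split()):
            if POS.get(word) in ("NOUN", "ADJ"):
                (tox if label else non)[word] += 1
    return tox, non

def select_identity_terms(corpus):
    """Auto-filter: keep identity terms that occur more often in toxic comments."""
    tox, non = class_counts(corpus)
    return {t for t in IDENTITY_HINTS if tox[t] > non[t]}

def augment(corpus, terms):
    """De-bias: append neutral counter-examples until each term is class-balanced."""
    tox, non = class_counts(corpus)
    extra = []
    for term in sorted(terms):
        for _ in range(tox[term] - non[term]):
            extra.append((f"the {term} community contributed to this discussion", 0))
    return corpus + extra
```

In this toy corpus only "gay" skews toxic (two toxic vs. one non-toxic occurrence), so one neutral counter-example containing it is appended, balancing its class frequencies before training.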
Unlike prior approaches, this model has no comment-length limitations while being substantially more scalable and adaptable. It could also be used to build debiasing models in other languages, although that is left for future work. The approach breaks new ground on multiple levels: it is novel in its use of attention networks to debias and in automating and scaling both the selection of terms and the debiasing itself, while also significantly improving the accuracy of toxic-comment classification.