Tags: Ablation Analysis, Audio Feature Extraction, Hate Speech Detection, Multiclass Classification
Abstract:
Detecting hate speech on social media is challenging, particularly in low-resource languages such as Malayalam, where annotated data is scarce. To address this challenge, we introduce a new multiclass hate speech dataset in Malayalam, sourced from YouTube. The study benchmarks machine learning classifiers on hate versus non-hate speech detection, in both binary and multiclass settings, using audio features alone. In binary classification, the Random Forest classifier performed best, achieving a macro accuracy of 0.93 and an F1 score of 0.93. In the ablation studies, the other classifiers (Logistic Regression, Support Vector Machines, and Naive Bayes) reached accuracies of around 0.85 and macro F1 scores of around 0.85. In multiclass classification, the Random Forest model achieved an accuracy of 0.8289, a macro accuracy of 0.72, and an F1 score of 0.74, outperforming all other models tested in the ablation study. These results indicate that a Random Forest classifier trained on audio features can reliably detect hate speech in Malayalam, which supports efforts toward a safer online environment.
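The multiclass results above report a higher accuracy (0.8289) than macro-averaged scores (0.72 and 0.74). A gap of this kind is typical when classes are imbalanced, because macro averaging weights every class equally regardless of its size. The following is a minimal sketch of that effect, not the study's evaluation code: the labels are made-up toy data, and the class names `none`, `class_a`, and `class_b` are hypothetical placeholders rather than the dataset's actual categories.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label that appears."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, hypothetical labels (NOT taken from the paper's dataset):
# one large majority class and two small minority classes.
y_true = ["none"] * 8 + ["class_a"] * 2 + ["class_b"] * 2
y_pred = ["none"] * 8 + ["class_a", "none"] + ["none", "none"]

if __name__ == "__main__":
    print("accuracy:", round(accuracy(y_true, y_pred), 4))
    print("macro F1:", round(macro_f1(y_true, y_pred), 4))
    print("per-class F1:", per_class_f1(y_true, y_pred))
```

In this toy case the majority class is predicted correctly every time, so accuracy stays fairly high, while the missed minority classes pull the macro F1 down. This is why macro-averaged scores are the more informative figure when rare hate speech categories matter.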
Audio-Based Hate Speech Detection in Malayalam Using Machine Learning