Tags: Bangla BERT, Deep Learning, Electra, Ensemble Models, Machine Learning, Misogyny, and Text Classification
Abstract:
Misogyny is hatred, disdain, or prejudice toward women or girls. It can take the form of verbal abuse, aggression, harassment, or discrimination, and it harms both society and women’s social, economic, and political status. In Bengali NLP, the scarcity of well-balanced datasets for misogyny detection is a significant problem. Although previous work exists for the Bangla language, the existing dataset contains only 2258 comments, which has proven insufficient for robust detection and has led to suboptimal performance in classification models. Our research therefore centers on proposing an improved dataset for detecting misogyny in digital spaces. We carefully curated this dataset from several sources, including social media and various websites. It features five categories of misogynistic comments (Discredit, Stereotype & Objectification, Sexual Harassment, Threats of Violence, and Dominance & Derailing) and one category of non-misogynistic comments. We evaluated multiple Machine Learning models: Support Vector Machine (SVM), Naive Bayes, KNN, Decision Tree, and Random Forest. Two Deep Learning models, Bangla BERT Base and Bangla BERT Electra, were also used for classification. Additionally, we evaluated ensemble approaches on our improved dataset that combine LSTM, Bi-LSTM, and CNN+LSTM networks with two types of BERT embeddings: Bangla BERT Base and Bert-base-multilingual-cased. Among them, Bi-LSTM with Bert-base-multilingual-cased embeddings achieves the highest accuracy, 91.24%, in classifying misogynistic comments. Our results show significant performance gains across all model types, particularly for transformer-based approaches, which outperform traditional ML and DL models. This improvement in accuracy demonstrates the effectiveness of advanced ensemble techniques and BERT embeddings in addressing NLP challenges.
Towards Better Misogyny Detection in Bangla: Improved Dataset and Cutting-Edge Model Evaluation