Mammo-Bench: a Large-Scale Benchmark Dataset of Mammography Images

EasyChair Preprint 15792

15 pages • Date: February 3, 2025

Gaurav Bhole, Suba Suseela and Nita Parekh

Abstract

Breast cancer remains a significant global health concern, and machine learning algorithms and computer-aided detection systems have shown great promise in improving the accuracy and efficiency of mammography image analysis. However, there is a critical need for large benchmark datasets for training deep learning models for breast cancer detection. In this work, we developed Mammo-Bench, a large-scale benchmark dataset of mammography images, by collating data from seven well-curated resources: DDSM, INbreast, KAU-BCMD, CMMD, CDD-CESM, DMID, and the RSNA Screening Dataset. To ensure consistency across images from diverse sources while preserving clinically relevant features, we propose a preprocessing pipeline that includes breast segmentation, pectoral muscle removal, and intelligent cropping. The dataset consists of 74,436 high-quality mammographic images from 26,500 patients across 7 countries and is, to the best of our knowledge, one of the largest open-source mammography databases. To show the benefit of training on the large dataset, a ResNet101 architecture was trained and evaluated on Mammo-Bench, and the results were compared with those obtained by training independently on a few member datasets and on an external dataset, VinDr-Mammo. An accuracy of 78.8% with data augmentation of the minority classes and 77.8% without data augmentation was achieved on the proposed benchmark dataset, whereas accuracy on the other datasets ranged from 25% to 69%. Notably, prediction of the minority classes improved with the Mammo-Bench dataset. These results establish baseline performance and demonstrate Mammo-Bench's utility as a comprehensive resource for developing and evaluating mammography analysis systems.
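
The abstract names a cropping step but does not describe how it works. As a rough illustration only, the sketch below shows one generic way to crop an image to its non-background region: find the bounding box of pixels brighter than a threshold and keep that rectangle. It is not the authors' pipeline. The function names, the threshold default, and the toy image are all hypothetical, and a real mammogram would first need the segmentation and pectoral muscle removal steps that the abstract mentions.

```python
from typing import List, Tuple

# A grayscale image as rows of pixel intensities (illustrative type alias).
Image = List[List[int]]


def foreground_bbox(image: Image, threshold: int = 0) -> Tuple[int, int, int, int]:
    """Return (top, bottom, left, right) bounds of pixels brighter than threshold.

    Bounds are half-open: rows top..bottom-1 and columns left..right-1.
    """
    # Rows that contain at least one pixel above the threshold.
    rows = [r for r, row in enumerate(image) if any(p > threshold for p in row)]
    if not rows:
        raise ValueError("no pixel exceeds the threshold")
    # Columns that contain at least one pixel above the threshold.
    width = len(image[0])
    cols = [c for c in range(width) if any(row[c] > threshold for row in image)]
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1


def crop_to_foreground(image: Image, threshold: int = 0) -> Image:
    """Crop the image to the bounding box of its above-threshold pixels."""
    top, bottom, left, right = foreground_bbox(image, threshold)
    return [row[left:right] for row in image[top:bottom]]


# Toy 4x5 "image": zeros are background, nonzero values stand in for tissue.
toy = [
    [0, 0, 0, 0, 0],
    [0, 0, 7, 9, 0],
    [0, 5, 8, 0, 0],
    [0, 0, 0, 0, 0],
]
cropped = crop_to_foreground(toy)
```

A fixed threshold is the simplest possible choice here. Images collated from several sources differ in intensity range and background level, so any practical version would likely need a per-image or per-source threshold.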

Keyphrases: Breast Cancer Detection, Breast Cancer Diagnosis, Computer Aided Detection, Mammogram Dataset, Medical Imaging, deep learning, large scale benchmark dataset of mammography images, mammography dataset for breast cancer diagnosis research, mammography datasets, masks for regions of interest, pectoral muscle removal and intelligent cropping, screening mammography breast cancer detection

Links:

https://easychair.org/publications/preprint/DfFr

BibTeX entry

BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:

@booklet{EasyChair:15792,
  author    = {Gaurav Bhole and Suba Suseela and Nita Parekh},
  title     = {Mammo-Bench: a Large-Scale Benchmark Dataset of Mammography Images},
  howpublished = {EasyChair Preprint 15792},
  year      = {EasyChair, 2025}}
