The EMory BrEast imaging Dataset (EMBED): A Racially Diverse, Granular Dataset of 3.4 Million Screening and Diagnostic Mammog...
dataset 2025-11-29 https://doi.org/10.1148/atlas.1764459072234

Overview

Schema Version

https://atlas.rsna.org/schemas/2025-11/dataset.json

Name

The EMory BrEast imaging Dataset (EMBED): A Racially Diverse, Granular Dataset of 3.4 Million Screening and Diagnostic Mammog...

Link

https://doi.org/10.1148/ryai.220047

Indexing

Keywords: Mammography, Breast Imaging, Breast Cancer, Digital Breast Tomosynthesis, Lesion Annotation, Pathologic Outcomes, Racial Diversity, Dataset Curation, De-identification, BI-RADS
Content: BR, OI, BQ

Author(s)

Jiwoong J. Jeong
Brianna L. Vey
Ananth Bhimireddy
Thomas Kim
Thiago Santos
Ramon Correa
Raman Dutt
Marina Mosunjac
Gabriela Oprea-Ilies
Geoffrey Smith
Minjae Woo
Christopher R. McAdams
Mary S. Newell
Imon Banerjee
Judy Gichoya
Hari Trivedi

Organization(s)

Arizona State University
Emory University
Georgia Institute of Technology
Kennesaw State University

Contact

jjeong35@asu.edu

Funding

Supported in part by the National Center for Advancing Translational Sciences of the National Institutes of Health (award no. UL1TR002378).

Ethical review

Approved by Emory University’s institutional review board. Written informed consent was waived due to the use of de-identified data.

Comments

The EMory BrEast imaging Dataset (EMBED) contains two-dimensional (2D) and digital breast tomosynthesis (DBT) screening and diagnostic mammograms with lesion-level annotations and pathologic information in racially diverse patients. It includes 3,383,659 2D and DBT mammograms from 116,000 women, with an equal representation of African American and White patients. The dataset also contains 40,000 annotated lesions linked to structured imaging descriptors and 56 ground truth pathologic outcomes grouped into seven severity classes. A 20% subset of the dataset is freely available through the Amazon Web Services Open Data Program.

Date

Updated: 2022-12-16
Created: 2022-03-07

Dataset

Motivation

To develop deep learning models for breast cancer screening that are generalizable across diverse demographics, specifically addressing the underrepresentation of African American and other minority patients in existing datasets.

Sampling

Patients with screening or diagnostic mammograms at Emory University institutional hospitals from January 2013 through December 2020 were identified. Women aged 18 years or older with at least one available mammogram were included; patients younger than 18 were excluded. Data were collected from four institutional hospitals (two community, one large inner-city, and one private academic).
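The inclusion criteria above can be illustrated with a minimal sketch. It is not the authors' cohort-selection code: the `Exam` record, its field names, and the use of age at examination are assumptions made for illustration only.

```python
from dataclasses import dataclass
from datetime import date

# Study window stated in the sampling description.
STUDY_START = date(2013, 1, 1)
STUDY_END = date(2020, 12, 31)
MIN_AGE_YEARS = 18


@dataclass
class Exam:
    """Hypothetical exam record; field names are illustrative."""
    patient_id: str
    exam_date: date
    age_at_exam: int
    sex: str  # "F" or "M"


def eligible_patients(exams):
    """Return IDs of patients with at least one qualifying mammogram."""
    return {
        e.patient_id
        for e in exams
        if STUDY_START <= e.exam_date <= STUDY_END
        and e.age_at_exam >= MIN_AGE_YEARS
        and e.sex == "F"
    }
```

A patient is kept as soon as one of her examinations falls inside the window and meets the age threshold, which mirrors the "at least one available mammogram" wording.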

Missing information

Approximately 20% of lesions were classified as ambiguous because regions of interest (ROIs) could not be linked automatically for patients with multiple imaging findings. DICOM metadata was sometimes corrupted, so only metadata elements present in at least 10% of files were retained. Pathologic diagnoses may be missing if specimens were not obtained at Emory. Linking ROIs to imaging and pathologic findings in MagView is challenging for examinations with multiple findings per breast.
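The 10% retention rule for metadata can be sketched as follows. This is a hedged illustration, not the dataset's actual pipeline: headers are modeled as plain dictionaries rather than parsed DICOM files, and the function names and the treatment of `None` as "absent" are assumptions.

```python
from collections import Counter


def tags_to_keep(headers, min_fraction=0.10):
    """Return the metadata tags present in at least min_fraction of headers.

    headers: list of dicts mapping tag name -> value (None counts as absent).
    """
    total = len(headers)
    if total == 0:
        return set()
    counts = Counter()
    for header in headers:
        counts.update(tag for tag, value in header.items() if value is not None)
    return {tag for tag, n in counts.items() if n / total >= min_fraction}


def filter_headers(headers, min_fraction=0.10):
    """Drop rarely populated tags from every header."""
    keep = tags_to_keep(headers, min_fraction)
    return [{t: v for t, v in h.items() if t in keep} for h in headers]
```

Filtering by prevalence removes sparsely populated or corrupted fields at the cost of discarding any tag that is genuinely rare.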

Re-identification

The dataset is de-identified. All metadata elements covered by the Health Insurance Portability and Accountability Act (HIPAA) were removed or de-identified. Dates were shifted using a fixed offset for each patient. A master key was retained for adding new examinations.
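Date shifting with a fixed patient-level offset can be sketched as below. The `MasterKey` class, the offset range, and the direction of the shift are assumptions for illustration; the source does not describe how offsets were generated or stored.

```python
import random
from datetime import date, timedelta


class MasterKey:
    """Hypothetical lookup table mapping patient ID -> fixed offset in days."""

    def __init__(self, max_shift_days=365, seed=None):
        self._offsets = {}
        self._rng = random.Random(seed)
        self._max_shift_days = max_shift_days

    def offset_for(self, patient_id):
        # Assign an offset the first time a patient is seen, then reuse it,
        # so examinations added later are shifted consistently.
        if patient_id not in self._offsets:
            self._offsets[patient_id] = self._rng.randint(1, self._max_shift_days)
        return self._offsets[patient_id]


def shift_date(key, patient_id, exam_date):
    """Shift an exam date backward by the patient's fixed offset."""
    return exam_date - timedelta(days=key.offset_for(patient_id))
```

Because every date for a given patient moves by the same number of days, the intervals between that patient's examinations are preserved while the true calendar dates are hidden.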

Sensitive data

Demographic characteristics (age, race, ethnicity, insurance status), family history, procedure history, and treatment history (including hormone replacement therapy) were collected.