EMory BrEast imaging Dataset (EMBED)
dataset | 2026-01-24 | https://doi.org/10.1148/atlas.1769272149180

Overview

Schema Version

https://atlas.rsna.org/schemas/2025-11/dataset.json

Name

EMory BrEast imaging Dataset (EMBED)

Link

https://registry.opendata.aws/emory-breast-imaging-dataset-embed/

Indexing

Keywords: Mammography, Breast, Machine Learning, Digital breast tomosynthesis, BI-RADS
Content: BR, IN
RadLex: RID10359, RID29897, RID12634, RID49697, RID10357
SNOMED: 254837009

Author(s)

Jiwoong J. Jeong
Brianna L. Vey
Ananth Bhimireddy
Thomas Kim
Thiago Santos
Ramon Correa
Raman Dutt
Marina Mosunjac
Gabriela Oprea-Ilies
Geoffrey Smith
Minjae Woo
Christopher R. McAdams
Mary S. Newell
Imon Banerjee
Judy Gichoya
Hari Trivedi

Organization(s)

Emory University
Arizona State University
Georgia Institute of Technology
Kennesaw State University

Contact

Corresponding author: Jiwoong J. Jeong

Funding

Supported in part by the National Center for Advancing Translational Sciences, NIH (award UL1TR002378).

Ethical review

Institutional review board approved; waiver of written informed consent due to use of de-identified data.

Date

Published: 2023-01-04

References

[1] Jeong JJ, Vey BL, Bhimireddy A, Kim T, Santos T, Correa R, Dutt R, Mosunjac M, Oprea-Ilies G, Smith G, Woo M, McAdams CR, Newell MS, Banerjee I, Gichoya J, Trivedi H. "The EMory BrEast imaging Dataset (EMBED): A Racially Diverse, Granular Dataset of 3.4 Million Screening and Diagnostic Mammographic Images". Radiology: Artificial Intelligence. 2023-01-01. doi:10.1148/ryai.220047. PMID: 36721407. PMCID: PMC9885379.

Dataset

Motivation

To provide a large, racially diverse, and granular mammography dataset with lesion-level annotations and linked pathology to improve generalizability and reduce bias in AI for breast cancer screening and diagnosis.

Sampling

Retrospective inclusion of women ≥18 years with screening or diagnostic mammograms performed 2013–2020 at four hospitals (two community, one inner-city, one private academic).

Missing information

Approximately 20% of lesions were classified as ambiguous because ROIs could not be automatically linked to specific findings when a breast had multiple findings. Pathologic outcomes may be missing when specimens were obtained outside the institution. Diagnostic images have few annotations, and no ROIs are provided for digital breast tomosynthesis (DBT).
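
The linking rule described above can be illustrated with a minimal sketch. It assumes findings and ROIs are plain records keyed by exam and breast side; the field names (exam_id, side, finding_id, roi_id) and the function name are hypothetical and are not taken from the EMBED tables. An ROI is linked only when its breast has exactly one finding; otherwise it is left unlinked, which corresponds to the ambiguous case.

```python
from collections import defaultdict

def link_rois_to_findings(findings, rois):
    """Map each roi_id to a finding_id, or to None when no unique link exists.

    findings: list of dicts with keys exam_id, side, finding_id (hypothetical names)
    rois:     list of dicts with keys exam_id, side, roi_id (hypothetical names)
    """
    # Group finding ids by (exam, breast side).
    by_breast = defaultdict(list)
    for f in findings:
        by_breast[(f["exam_id"], f["side"])].append(f["finding_id"])

    links = {}
    for r in rois:
        candidates = by_breast.get((r["exam_id"], r["side"]), [])
        # Exactly one finding on that breast: the link is unambiguous.
        # Zero or several findings: no automatic link can be made.
        links[r["roi_id"]] = candidates[0] if len(candidates) == 1 else None
    return links
```

ROIs that map to None would need manual review or an additional signal (for example, lesion location) before they can be tied to a pathology result.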

Relationships between instances

Multiple exams per patient; per exam and per breast zero or more findings; ROIs saved as burned-in circles on screen-save images and mapped back to source mammograms; pathology recorded per finding with up to 10 pathologic results, and a derived worst-pathology label.
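
The hierarchy above (patient, exam, finding, pathology results) can be sketched as nested records. This is an illustrative model only: the class names, field names, and the severity ranking are assumptions and do not reproduce the dataset's actual columns or coding scheme. The limit of 10 pathologic results per finding comes from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_PATHOLOGY_RESULTS = 10  # from the description: up to 10 results per finding

# HYPOTHETICAL severity ranking (lower rank = worse outcome).
# The dataset's real severity coding is not given in this record.
SEVERITY_RANK = {
    "invasive_cancer": 0,
    "in_situ_cancer": 1,
    "high_risk": 2,
    "benign": 3,
    "negative": 4,
}

@dataclass
class Finding:
    side: str  # "L" or "R"
    pathology: List[str] = field(default_factory=list)

    def __post_init__(self):
        if len(self.pathology) > MAX_PATHOLOGY_RESULTS:
            raise ValueError("a finding holds at most 10 pathologic results")

    def worst_pathology(self) -> Optional[str]:
        """Derived label: the most severe result, or None if none recorded.

        Raises KeyError for a label missing from SEVERITY_RANK.
        """
        if not self.pathology:
            return None
        return min(self.pathology, key=SEVERITY_RANK.__getitem__)

@dataclass
class Exam:
    exam_id: str
    findings: List[Finding] = field(default_factory=list)  # zero or more

    def findings_for(self, side: str) -> List[Finding]:
        """Return the findings recorded for one breast."""
        return [f for f in self.findings if f.side == side]

@dataclass
class Patient:
    patient_id: str
    exams: List[Exam] = field(default_factory=list)  # multiple exams per patient
```

For example, Finding("L", ["benign", "invasive_cancer"]).worst_pathology() returns "invasive_cancer" under the assumed ranking.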

Noise

Variability and occasional corruption in DICOM metadata across manufacturers; differences in window-level mapping between GE and Hologic scanners required manufacturer-specific normalization; pathologic outcomes in MagView were manually entered and may include discrepancies (secondary NLP checks proposed).
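
One way to organize manufacturer-specific normalization is to dispatch on the manufacturer string, as in the sketch below. This record does not say how the GE and Hologic mappings differ, so the pairing of each manufacturer with a linear or sigmoid window is purely illustrative, and all function names are hypothetical. The two formulas are intended to follow the linear and sigmoid VOI LUT functions defined in the DICOM standard, scaled to the range 0 to 1.

```python
import math

def linear_window(x, center, width):
    """Linear windowing of one pixel value to [0, 1]; width must be > 1."""
    lo = center - 0.5 - (width - 1) / 2
    hi = center - 0.5 + (width - 1) / 2
    if x <= lo:
        return 0.0
    if x > hi:
        return 1.0
    return (x - (center - 0.5)) / (width - 1) + 0.5

def sigmoid_window(x, center, width):
    """Sigmoid windowing of one pixel value to [0, 1]."""
    t = -4.0 * (x - center) / width
    if t > 700:  # avoid overflow in math.exp for values far below the window
        return 0.0
    return 1.0 / (1.0 + math.exp(t))

# ILLUSTRATIVE assignment only; not taken from the dataset documentation.
NORMALIZERS = {
    "HOLOGIC": linear_window,
    "GE": sigmoid_window,
}

def normalize_pixel(x, manufacturer, center, width):
    """Pick the normalization by manufacturer-name prefix and apply it."""
    key = manufacturer.strip().upper()
    for prefix, fn in NORMALIZERS.items():
        if key.startswith(prefix):
            return fn(x, center, width)
    raise ValueError(f"no normalization registered for {manufacturer!r}")
```

Raising an error for an unregistered manufacturer, rather than falling back to a default, makes it harder for images with corrupted or unexpected metadata to pass through silently.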

External data

Twenty percent of the dataset is freely available via the Amazon Web Services Open Data Program; access to the full dataset is considered on a case-by-case basis following IRB approval.

Confidentiality

All images de-identified; HIPAA identifiers removed or de-identified. Dates shifted using fixed patient-level offsets to preserve temporality.
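
The effect of a fixed patient-level offset can be shown with a short sketch. Every date belonging to one patient is moved by the same number of days, so the intervals between that patient's exams are unchanged. How the offsets were actually generated is not stated here; deriving the offset from a keyed hash of the patient identifier, the secret value, and the one-year range are all assumptions made for illustration.

```python
import hashlib
from datetime import date, timedelta

def patient_offset(patient_id, secret, max_days=365):
    """Return a fixed offset in [-max_days, +max_days] for one patient.

    ASSUMPTION: the offset is derived from a keyed hash so that it is
    stable across runs; the real offset scheme is not described.
    """
    digest = hashlib.sha256((secret + patient_id).encode("utf-8")).digest()
    days = int.from_bytes(digest[:4], "big") % (2 * max_days + 1) - max_days
    return timedelta(days=days)

def shift_date(d, patient_id, secret):
    """Shift a date by the patient's fixed offset."""
    return d + patient_offset(patient_id, secret)

# Two exams for the same hypothetical patient, one year apart.
first = shift_date(date(2014, 3, 1), "P001", "example-secret")
second = shift_date(date(2015, 3, 1), "P001", "example-secret")
interval = second - first  # same interval as between the original dates
```

Because the offset depends only on the patient, exams added later for the same patient can be shifted consistently with those already released.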

Re-identification

Data de-identified; DICOM PHI removed; dates shifted. A master key was retained internally to allow addition of new exams.

Sensitive data

Includes demographic data (age, race, ethnicity, insurance), family/procedure/treatment histories when available; all data de-identified.