EMory BrEast imaging Dataset (EMBED)
2026-01-24
https://doi.org/10.1148/atlas.1769272149180
Overview
Schema Version
https://atlas.rsna.org/schemas/2025-11/dataset.json
Name
EMory BrEast imaging Dataset (EMBED)
Link
https://registry.opendata.aws/emory-breast-imaging-dataset-embed/
Indexing
Keywords: Mammography, Breast, Machine Learning, Digital breast tomosynthesis, BI-RADS
Content: BR, IN
RadLex: RID10359, RID29897, RID12634, RID49697, RID10357
SNOMED: 254837009
Author(s)
Jiwoong J. Jeong
Brianna L. Vey
Ananth Bhimireddy
Thomas Kim
Thiago Santos
Ramon Correa
Raman Dutt
Marina Mosunjac
Gabriela Oprea-Ilies
Geoffrey Smith
Minjae Woo
Christopher R. McAdams
Mary S. Newell
Imon Banerjee
Judy Gichoya
Hari Trivedi
Organization(s)
Emory University
Arizona State University
Georgia Institute of Technology
Kennesaw State University
Contact
Corresponding author: Jiwoong J. Jeong
Funding
Supported in part by the National Center for Advancing Translational Sciences, NIH (award UL1TR002378).
Ethical review
Approved by the institutional review board; written informed consent was waived because the data are de-identified.
Date
Published: 2023-01-04
References
[1] Jeong JJ, Vey BL, Bhimireddy A, Kim T, Santos T, Correa R, Dutt R, Mosunjac M, Oprea-Ilies G, Smith G, Woo M, McAdams CR, Newell MS, Banerjee I, Gichoya J, Trivedi H. "The EMory BrEast imaging Dataset (EMBED): A Racially Diverse, Granular Dataset of 3.4 Million Screening and Diagnostic Mammographic Images". Radiology: Artificial Intelligence. 2023-01-01. doi:10.1148/ryai.220047. PMID: 36721407. PMCID: PMC9885379.
Dataset
Motivation
To provide a large, racially diverse, and granular mammography dataset with lesion-level annotations and linked pathology to improve generalizability and reduce bias in AI for breast cancer screening and diagnosis.
Sampling
Retrospective inclusion of women ≥18 years with screening or diagnostic mammograms performed 2013–2020 at four hospitals (two community, one inner-city, one private academic).
Missing information
Approximately 20% of lesions were classified as ambiguous because ROIs could not be automatically linked to a specific finding when a breast had multiple findings. Pathologic outcomes may be missing when specimens were obtained outside the institution. Diagnostic images have few annotations, and digital breast tomosynthesis (DBT) images have no ROIs.
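The ambiguity rule described above can be illustrated with a minimal sketch. It assumes a hypothetical record layout (`exam_id`, `side`, `finding_id`, `roi_id`); these field names are not taken from the dataset. An ROI is linked only when its breast has exactly one finding in that exam, and is otherwise left unlinked.

```python
from collections import defaultdict

def link_rois(findings, rois):
    """Link each ROI to a finding when the match is unambiguous.

    Field names (exam_id, side, finding_id, roi_id) are hypothetical
    and chosen for illustration only.
    """
    # Group finding ids by (exam, breast side).
    by_breast = defaultdict(list)
    for f in findings:
        by_breast[(f["exam_id"], f["side"])].append(f["finding_id"])

    links = {}
    for r in rois:
        candidates = by_breast.get((r["exam_id"], r["side"]), [])
        # Exactly one finding on that breast: safe to link.
        # Zero or several findings: leave unlinked (None).
        links[r["roi_id"]] = candidates[0] if len(candidates) == 1 else None
    return links

findings = [
    {"exam_id": 1, "side": "L", "finding_id": "f1"},
    {"exam_id": 1, "side": "R", "finding_id": "f2"},
    {"exam_id": 1, "side": "R", "finding_id": "f3"},
]
rois = [
    {"exam_id": 1, "side": "L", "roi_id": "r1"},
    {"exam_id": 1, "side": "R", "roi_id": "r2"},
]
links = link_rois(findings, rois)
```

Here `r1` links to the single left-breast finding, while `r2` stays unlinked because the right breast has two candidate findings.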
Relationships between instances
Each patient may have multiple exams, and each exam records zero or more findings per breast. ROIs were saved as circles burned into screen-save images and then mapped back to the source mammograms. Pathology is recorded per finding, with up to 10 pathologic results and a derived worst-pathology label.
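One way a worst-pathology label could be derived from a finding's pathologic results is sketched below. The category names and the severity ordering in `SEVERITY` are assumptions made for illustration; they are not the dataset's actual coding, which should be taken from the dataset documentation.

```python
# Hypothetical severity ranking: a lower number means a more severe result.
SEVERITY = {
    "invasive cancer": 0,
    "non-invasive cancer": 1,
    "high-risk lesion": 2,
    "benign": 3,
}

def worst_pathology(results):
    """Return the most severe known result, or None if there is none.

    `results` holds the pathologic results for one finding (the source
    states up to 10); missing or unrecognized entries are ignored.
    """
    known = [r for r in results if r in SEVERITY]
    if not known:
        return None
    return min(known, key=SEVERITY.get)

label = worst_pathology(["benign", None, "invasive cancer"])
```

Under this assumed ranking, a finding with both a benign and an invasive result is labeled with the invasive result.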
Noise
DICOM metadata vary across manufacturers and are occasionally corrupted. Differences in window-level mapping between GE and Hologic scanners required manufacturer-specific normalization. Pathologic outcomes in MagView were entered manually and may contain discrepancies; secondary NLP checks have been proposed.
External data
Twenty percent of the dataset is freely available via the Amazon Web Services Open Data Program; access to the full dataset is considered on a case-by-case basis following IRB approval.
Confidentiality
All images de-identified; HIPAA identifiers removed or de-identified. Dates shifted using fixed patient-level offsets to preserve temporality.
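Why a fixed patient-level offset preserves temporality can be shown with a short sketch. The patient identifiers, offsets, and dates below are invented for illustration; the actual offsets and how they were generated are not described here.

```python
from datetime import date, timedelta

# Hypothetical per-patient offsets in days; one fixed value per patient.
offsets = {"patient_a": -37, "patient_b": 112}

def shift_date(patient_id, d):
    """Shift a date by the patient's fixed offset."""
    return d + timedelta(days=offsets[patient_id])

# Two exams for the same (invented) patient.
exams = [("patient_a", date(2015, 3, 1)), ("patient_a", date(2016, 3, 8))]
shifted = [(pid, shift_date(pid, d)) for pid, d in exams]

# The interval between exams is unchanged because both dates moved
# by the same number of days.
gap_before = (exams[1][1] - exams[0][1]).days
gap_after = (shifted[1][1] - shifted[0][1]).days
```

Because every date for a given patient moves by the same amount, exam order and the time between exams are unchanged, while the absolute dates no longer match the true ones.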
Re-identification
Data de-identified; DICOM PHI removed; dates shifted. A master key was retained internally to allow addition of new exams.
Sensitive data
Includes demographic data (age, race, ethnicity, insurance), family/procedure/treatment histories when available; all data de-identified.