Augmenting the National Institutes of Health Chest Radiograph Dataset with Expert Annotations of Possible Pneumonia
2025-11-29
https://doi.org/10.1148/atlas.1764457585966
Overview
Schema Version
https://atlas.rsna.org/schemas/2025-11/dataset.json
Name
Augmenting the National Institutes of Health Chest Radiograph Dataset with Expert Annotations of Possible Pneumonia
Link
https://rsna.org/challenge-datasets/2018
Indexing
Keywords: Chest Radiography, Pneumonia Detection, Machine Learning, Bounding Box Annotation, Pulmonary Opacity, NIH CXR8 Dataset, DICOM, Radiologist Annotation, Image Classification, Object Detection
Content: CH, IN
Author(s)
George Shih
Carol C. Wu
Safwan S. Halabi
Marc D. Kohli
Luciano M. Prevedello
Tessa S. Cook
Arjun Sharma
Judith K. Amorosa
Veronica Arteaga
Maya Galperin-Aizenberg
Ritu R. Gill
Myrna C.B. Godoy
Stephen Hobbs
Jean Jeudy
Archana Laroia
Palmi N. Shah
Dharshan Vummidi
Kavitha Yaddanapudi
Anouk Stein
Organization(s)
Radiological Society of North America
Society of Thoracic Radiology
MD.ai
Weill Cornell Medical College
University of Texas MD Anderson Cancer Center
Stanford University School of Medicine
University of California–San Francisco
The Ohio State University Wexner Medical Center
Hospital of the University of Pennsylvania
Amita Health
Rutgers Robert Wood Johnson Medical School
University of Arizona College of Medicine
Beth Israel Deaconess Medical Center
University of Kentucky College of Medicine
University of Maryland School of Medicine
University of Iowa Carver College of Medicine
Rush University Medical Center
University of Michigan Health System
Stony Brook School of Medicine
Version
Adjudicated Version
Contact
george@cornellradiology.org
Ethical review
No institutional review board approval was obtained; the examinations were part of a publicly available NIH dataset.
Comments
This dataset augments the NIH CXR8 dataset with expert radiologist annotations, including bounding boxes for pulmonary opacities suggestive of pneumonia. It was used for the RSNA 2018 Machine Learning Challenge.
Dataset
Motivation
To provide an annotated dataset that supports the development of machine learning algorithms to assist in the diagnosis of pneumonia, especially in areas of the world lacking the requisite expertise.
Sampling
The dataset comprises 30,000 frontal-view chest radiographs randomly selected from the 112,000-image public NIH CXR8 dataset: 15,000 examinations with pneumonia-like labels, 7500 with the 'no findings' label, and 7500 with neither.
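The selection described above can be sketched as a stratified random draw. This is a minimal illustration, not the authors' procedure: the stratum names, the `sample_strata` helper, the seed, and the synthetic pool are all assumptions; only the three quotas come from the text.

```python
import random

# Quotas from the Sampling description; the stratum names are hypothetical.
QUOTAS = {"pneumonia_like": 15000, "no_findings": 7500, "neither": 7500}

def sample_strata(stratum_of_exam, quotas, seed=0):
    """Randomly draw quotas[stratum] examination IDs from each stratum."""
    rng = random.Random(seed)
    pools = {}
    for exam_id, stratum in stratum_of_exam.items():
        pools.setdefault(stratum, []).append(exam_id)
    selected = {}
    for stratum, quota in quotas.items():
        pool = sorted(pools.get(stratum, []))  # sort for reproducibility
        if len(pool) < quota:
            raise ValueError(f"stratum {stratum!r} has only {len(pool)} examinations")
        selected[stratum] = rng.sample(pool, quota)  # sampling without replacement
    return selected

# Synthetic stand-in pool; it does NOT reflect the real NIH label distribution.
names = list(QUOTAS)
toy_pool = {f"exam{i:05d}": names[i % 3] for i in range(60000)}
selected = sample_strata(toy_pool, QUOTAS)
print({name: len(ids) for name, ids in selected.items()})
# {'pneumonia_like': 15000, 'no_findings': 7500, 'neither': 7500}
```

Because each stratum is drawn from its own pool, the three groups are disjoint and sum to 30,000 examinations.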
Partitioning scheme
A 90:10 training-to-test split was used for the RSNA Machine Learning Challenge, yielding 3000 cases in the test set. The remaining 27,000 cases formed the training set.
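The arithmetic of the split can be checked with a short sketch. The seeded shuffle and the `split_exams` helper are assumptions made for illustration; the source gives only the ratio and the resulting test set size, not how cases were assigned.

```python
import random

def split_exams(exam_ids, test_percent=10, seed=0):
    """Shuffle examination IDs and hold out test_percent of them as a test set."""
    ids = sorted(exam_ids)
    random.Random(seed).shuffle(ids)
    n_test = len(ids) * test_percent // 100  # integer arithmetic avoids float rounding
    return ids[n_test:], ids[:n_test]

# Placeholder integer IDs stand in for the 30,000 examinations.
train_ids, test_ids = split_exams(range(30000))
print(len(train_ids), len(test_ids))  # 27000 3000
```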
Missing information
Only 4527 of the 30,000 cases were read by three radiologists, and only a limited number of examinations were adjudicated. No pathologic findings other than pneumonia were annotated because of time constraints. Bounding box intersections, rather than weighted averages, were used for the test set, which yielded smaller test set bounding boxes (mean area: 51,063 pixels for the test set vs 77,663 pixels for the training set).
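A small sketch shows why intersecting boxes produces smaller regions than averaging them. The `(x, y, width, height)` box format, the function names, and the three example boxes are assumptions for illustration; the source does not specify how boxes were encoded or which weights were used.

```python
def intersection_box(boxes):
    """Region common to all boxes, or None if they do not all overlap."""
    left = max(x for x, y, w, h in boxes)
    top = max(y for x, y, w, h in boxes)
    right = min(x + w for x, y, w, h in boxes)
    bottom = min(y + h for x, y, w, h in boxes)
    if right <= left or bottom <= top:
        return None
    return (left, top, right - left, bottom - top)

def weighted_average_box(boxes, weights=None):
    """Box whose corners are the weighted means of the input corners."""
    weights = weights or [1.0] * len(boxes)
    total = sum(weights)

    def mean(values):
        return sum(v * wt for v, wt in zip(values, weights)) / total

    left = mean([x for x, y, w, h in boxes])
    top = mean([y for x, y, w, h in boxes])
    right = mean([x + w for x, y, w, h in boxes])
    bottom = mean([y + h for x, y, w, h in boxes])
    return (left, top, right - left, bottom - top)

def area(box):
    return box[2] * box[3]

# Hypothetical boxes from three readers marking the same opacity.
reader_boxes = [(100, 100, 200, 300), (120, 90, 220, 280), (110, 120, 180, 310)]
inter = intersection_box(reader_boxes)
avg = weighted_average_box(reader_boxes)
print(inter, area(inter))  # (120, 120, 170, 250) 42500
print(round(area(avg)))    # 59333
```

The intersection always lies inside the corner-averaged box, since the largest left edge is at least the mean left edge and the smallest right edge is at most the mean right edge (likewise vertically). This is consistent with the smaller mean area reported for the test set.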
Relationships between instances
There were 12,274 unique patients (6747 male and 5527 female) in the 30,000-examination dataset; several patients had multiple examinations (up to 75 radiographs for one patient).
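With 30,000 examinations from 12,274 patients, each patient contributes about 2.4 examinations on average, so examinations are not independent of one another. The sketch below summarizes examinations per patient and lists patients that appear in two partitions. The mapping format, helper names, and toy IDs are assumptions; the source does not state whether the challenge split was grouped by patient.

```python
from collections import Counter

def patient_summary(patient_of_exam):
    """Count patients and examinations, with the per-patient maximum and mean."""
    counts = Counter(patient_of_exam.values())
    n_exams = sum(counts.values())
    return {
        "patients": len(counts),
        "exams": n_exams,
        "max_exams_per_patient": max(counts.values()),
        "mean_exams_per_patient": n_exams / len(counts),
    }

def shared_patients(exams_a, exams_b, patient_of_exam):
    """Patients with at least one examination in each of the two partitions."""
    patients_a = {patient_of_exam[e] for e in exams_a}
    patients_b = {patient_of_exam[e] for e in exams_b}
    return patients_a & patients_b

# Toy mapping from examination ID to patient ID (hypothetical identifiers).
toy = {"e1": "p1", "e2": "p1", "e3": "p2", "e4": "p3", "e5": "p1"}
summary = patient_summary(toy)
print(summary["patients"], summary["exams"], summary["max_exams_per_patient"])  # 3 5 3
print(shared_patients(["e1", "e3"], ["e2", "e4"], toy))  # {'p1'}
```

A non-empty result from `shared_patients` would indicate that the same patient contributes radiographs to both partitions, which is worth checking before interpreting test performance.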
Noise
The original NIH dataset contained categorical labels derived automatically from radiology reports using natural language processing; these labels were not always accurate. The expert annotations in this dataset aim to improve label accuracy.
External data
National Institutes of Health (NIH) CXR8 dataset
Confidentiality
Publicly available NIH dataset
Re-identification
Random unique identifiers were generated for each examination, implying de-identification.