Augmenting the National Institutes of Health Chest Radiograph Dataset with Expert Annotations of Possible Pneumonia
dataset | 2025-11-29 | https://doi.org/10.1148/atlas.1764457585966

Overview

Schema Version

https://atlas.rsna.org/schemas/2025-11/dataset.json

Name

Augmenting the National Institutes of Health Chest Radiograph Dataset with Expert Annotations of Possible Pneumonia

Link

https://rsna.org/challenge-datasets/2018

Indexing

Keywords: Chest Radiography, Pneumonia Detection, Machine Learning, Bounding Box Annotation, Pulmonary Opacity, NIH CXR8 Dataset, DICOM, Radiologist Annotation, Image Classification, Object Detection
Content: CH, IN

Author(s)

George Shih
Carol C. Wu
Safwan S. Halabi
Marc D. Kohli
Luciano M. Prevedello
Tessa S. Cook
Arjun Sharma
Judith K. Amorosa
Veronica Arteaga
Maya Galperin-Aizenberg
Ritu R. Gill
Myrna C.B. Godoy
Stephen Hobbs
Jean Jeudy
Archana Laroia
Palmi N. Shah
Dharshan Vummidi
Kavitha Yaddanapudi
Anouk Stein

Organization(s)

Radiological Society of North America
Society of Thoracic Radiology
MD.ai
Weill Cornell Medical College
University of Texas MD Anderson Cancer Center
Stanford University School of Medicine
University of California–San Francisco
The Ohio State University Wexner Medical Center
Hospital of the University of Pennsylvania
Amita Health
Rutgers Robert Wood Johnson Medical School
University of Arizona College of Medicine
Beth Israel Deaconess Medical Center
University of Kentucky College of Medicine
University of Maryland School of Medicine
University of Iowa Carver College of Medicine
Rush University Medical Center
University of Michigan Health System
Stony Brook School of Medicine

Version

Adjudicated Version

Contact

george@cornellradiology.org

Ethical review

No institutional review board approval was obtained; the examinations were part of a publicly available NIH dataset.

Comments

This dataset augments the NIH CXR8 dataset with expert radiologist annotations, including bounding boxes for pulmonary opacities suggestive of pneumonia. It was used for the RSNA 2018 Machine Learning Challenge.

Dataset

Motivation

To provide an annotated dataset for developing machine learning algorithms that can assist in the diagnosis of pneumonia, especially in areas of the world that lack the requisite expertise.

Sampling

The dataset comprises 30,000 frontal-view chest radiographs drawn from the 112,000-image public NIH CXR8 dataset: 15,000 examinations with pneumonia-like labels, 7500 with a 'no findings' label, and 7500 with neither. All were randomly selected.
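
The selection described above amounts to drawing a fixed quota at random from each label group. The following is a minimal sketch under stated assumptions, not the authors' procedure: the stratum names, the `sample_strata` helper, and the seed are hypothetical, and only the three quotas come from this record.

```python
import random

# Quotas taken from the Sampling description; the stratum names are hypothetical.
QUOTAS = {"pneumonia_like": 15000, "no_findings": 7500, "neither": 7500}


def sample_strata(ids_by_stratum, quotas, seed=0):
    """Draw quotas[name] ids at random, without replacement, from each stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    selected = {}
    for name, quota in quotas.items():
        pool = list(ids_by_stratum[name])
        if quota > len(pool):
            raise ValueError(f"stratum {name!r} has only {len(pool)} ids")
        selected[name] = rng.sample(pool, quota)
    return selected


# The three quotas add up to the stated dataset size.
assert sum(QUOTAS.values()) == 30000
```

Calling `sample_strata(pools, QUOTAS)` with a dictionary of candidate examination ids per stratum returns the selected ids per stratum.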

Partitioning scheme

The RSNA Machine Learning Challenge used a 90:10 training-to-test split, which gave a 3000-case test set. The remaining 27,000 cases formed the training set.
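
The arithmetic of the split can be sketched as follows. This record does not state how test cases were chosen, so the random shuffle, the `split_cases` helper, and the seed are illustrative assumptions only.

```python
import random


def split_cases(case_ids, test_fraction=0.10, seed=0):
    """Shuffle the ids and return (train_ids, test_ids).

    The random shuffle is an assumption; the source gives only the 90:10 ratio.
    """
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


# Placeholder integer ids stand in for the 30,000 examinations.
train_ids, test_ids = split_cases(range(30000))
assert len(test_ids) == 3000
assert len(train_ids) == 27000
```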

Missing information

Only 4527 of the 30,000 cases were read by three radiologists, and only a limited number of examinations were adjudicated. Because of time constraints, no specific pathologic findings other than pneumonia were annotated. Bounding-box intersections were used instead of weighted averages, which resulted in smaller test-set bounding boxes (average area: 51,063 pixels for the test set versus 77,663 pixels for the training set).
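
To illustrate why intersecting boxes yields smaller areas than averaging them, here is a minimal sketch. It assumes several overlapping boxes for the same opacity, written as `(x_min, y_min, x_max, y_max)` in pixels. The coordinates are made up, and a plain unweighted mean stands in for the weighted average because this record does not give the weights.

```python
def intersect_boxes(boxes):
    """Return the region common to all boxes, or None if they do not overlap."""
    x1 = max(b[0] for b in boxes)
    y1 = max(b[1] for b in boxes)
    x2 = min(b[2] for b in boxes)
    y2 = min(b[3] for b in boxes)
    if x2 <= x1 or y2 <= y1:
        return None
    return (x1, y1, x2, y2)


def mean_box(boxes):
    """Unweighted coordinate-wise mean (a stand-in for a weighted average)."""
    n = len(boxes)
    return tuple(sum(b[i] for b in boxes) / n for i in range(4))


def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


# Three hypothetical boxes for the same opacity.
boxes = [(100, 100, 300, 400), (120, 90, 320, 380), (90, 110, 310, 420)]
common = intersect_boxes(boxes)   # (120, 110, 300, 380), area 48600
average = mean_box(boxes)
assert area(common) <= area(average)
```

Whenever the boxes overlap, the intersection is no wider or taller than the mean box, so its area cannot exceed the averaged one.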

Relationships between instances

The 30,000-examination dataset contains 12,274 unique patients (6747 male and 5527 female). Several patients have multiple examinations, with up to 75 radiographs for a single patient.
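
Repeat examinations can be tallied by counting examinations per patient. The sketch below assumes a list of `(examination_id, patient_id)` pairs. That layout, the `exams_per_patient` helper, and the sample ids are hypothetical and do not reflect the dataset's actual file format.

```python
from collections import Counter


def exams_per_patient(records):
    """Count examinations per patient from (examination_id, patient_id) pairs."""
    return Counter(patient_id for _exam_id, patient_id in records)


# Made-up records: patient "p1" has two examinations, "p2" has one.
records = [("e1", "p1"), ("e2", "p1"), ("e3", "p2")]
counts = exams_per_patient(records)
repeat_patients = {p: n for p, n in counts.items() if n > 1}

# The stated sex breakdown adds up to the stated patient total.
assert 6747 + 5527 == 12274
```

Such counts may matter when partitioning: a split by examination rather than by patient can place radiographs from the same patient on both sides.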

Noise

The original NIH dataset contained categorical labels derived automatically from radiology reports by natural language processing, and these labels were not always accurate. This dataset aims to improve label accuracy through expert annotation.

External data

National Institutes of Health (NIH) CXR8 dataset

Confidentiality

Publicly available NIH dataset

Re-identification

Random unique identifiers were generated for each examination, implying de-identification.