Overview

Schema Version

https://atlas.rsna.org/schemas/2025-11/dataset.json

Name

Augmenting the National Institutes of Health Chest Radiograph Dataset with Expert Annotations of Possible Pneumonia

Link

https://rsna.org/challenge-datasets/2018

Indexing

Keywords: Chest Radiography, Pneumonia Detection, Machine Learning, Bounding Box Annotation, Pulmonary Opacity, NIH CXR8 Dataset, DICOM, Radiologist Annotation, Image Classification, Object Detection

Content: CH, IN

Author(s)

George Shih

Carol C. Wu

Safwan S. Halabi

Marc D. Kohli

Luciano M. Prevedello

Tessa S. Cook

Arjun Sharma

Judith K. Amorosa

Veronica Arteaga

Maya Galperin-Aizenberg

Ritu R. Gill

Myrna C.B. Godoy

Stephen Hobbs

Jean Jeudy

Archana Laroia

Palmi N. Shah

Dharshan Vummidi

Kavitha Yaddanapudi

Anouk Stein

Organization(s)

Radiological Society of North America

Society of Thoracic Radiology

MD.ai

Weill Cornell Medical College

University of Texas MD Anderson Cancer Center

Stanford University School of Medicine

University of California–San Francisco

The Ohio State University Wexner Medical Center

Hospital of the University of Pennsylvania

Amita Health

Rutgers Robert Wood Johnson Medical School

University of Arizona College of Medicine

Beth Israel Deaconess Medical Center

University of Kentucky College of Medicine

University of Maryland School of Medicine

University of Iowa Carver College of Medicine

Rush University Medical Center

University of Michigan Health System

Stony Brook School of Medicine

Version

Adjudicated Version

Contact

george@cornellradiology.org

Ethical review

No institutional review board approval was obtained; the examinations were part of a publicly available NIH dataset.

Comments

This dataset augments the NIH CXR8 dataset with expert radiologist annotations, including bounding boxes for pulmonary opacities suggestive of pneumonia. It was used for the RSNA 2018 Machine Learning Challenge.

Dataset

Motivation

To provide an annotated dataset to help develop machine learning algorithms that can assist in diagnosis of pneumonia, especially for areas of the world lacking the requisite expertise.

Sampling

The dataset comprises 30,000 frontal-view chest radiographs drawn from the 112,000-image public NIH CXR8 dataset: 15,000 examinations with pneumonia-like labels, 7500 with a 'no findings' label, and 7500 with neither, all randomly selected.
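
The sampling quotas above can be illustrated with a minimal sketch of stratified random sampling. The stratum names, the `stratified_sample` helper, the seed, and the synthetic pool sizes are assumptions made for illustration; the source does not describe the selection code that was actually used.

```python
import random

# Quotas stated in the Sampling section; the stratum names are hypothetical.
QUOTAS = {"pneumonia_like": 15_000, "no_findings": 7_500, "neither": 7_500}


def stratified_sample(pools, quotas, seed=0):
    """Draw quotas[s] items without replacement from each pool pools[s]."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return {stratum: rng.sample(pools[stratum], quota)
            for stratum, quota in quotas.items()}


# Synthetic pools of placeholder IDs (NOT the real NIH label counts).
pools = {
    "pneumonia_like": [f"p{i}" for i in range(20_000)],
    "no_findings": [f"n{i}" for i in range(20_000)],
    "neither": [f"o{i}" for i in range(20_000)],
}

sample = stratified_sample(pools, QUOTAS)
total = sum(len(ids) for ids in sample.values())
print(total)  # 30000
```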

Partitioning scheme

A 90:10 training-to-test split was used for the RSNA Machine Learning Challenge, resulting in 3000 cases for the test set. The remaining cases were incorporated into the training set.
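
As a rough illustration of the arithmetic, a 10% test fraction of 30,000 examinations yields 3000 test cases. The sketch below assumes a simple random split over examination IDs; the `split_ids` helper, the seed, and the integer IDs are hypothetical, and the source does not state how the challenge split was actually drawn.

```python
import random


def split_ids(ids, test_fraction=0.10, seed=0):
    """Shuffle IDs and cut off the first test_fraction of them as the test set."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


# Placeholder integer IDs standing in for the 30,000 examinations.
train_ids, test_ids = split_ids(range(30_000))
print(len(train_ids), len(test_ids))  # 27000 3000
```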

Missing information

Only 4527 of the 30,000 cases were read by three radiologists, and only a limited number of examinations were adjudicated. Because of time constraints, no pathologic findings other than pneumonia were annotated. Bounding box intersections were used instead of weighted averages, which produced smaller bounding boxes in the test set (average area 51,063 pixels, versus 77,663 pixels in the training set).
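
To see why taking an intersection shrinks boxes relative to averaging, consider the sketch below. It assumes boxes are stored as `(x_min, y_min, x_max, y_max)` in pixels and that averaging is applied per coordinate; the box layout, the helper names, the example coordinates, and the uniform default weights are all assumptions, since the source does not specify the weighting scheme.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def intersect_boxes(boxes: Sequence[Box]) -> Optional[Box]:
    """Return the region common to all boxes, or None if they do not overlap."""
    x_min = max(b[0] for b in boxes)
    y_min = max(b[1] for b in boxes)
    x_max = min(b[2] for b in boxes)
    y_max = min(b[3] for b in boxes)
    if x_min >= x_max or y_min >= y_max:
        return None
    return (x_min, y_min, x_max, y_max)


def average_boxes(boxes: Sequence[Box], weights=None) -> Box:
    """Per-coordinate weighted mean of the boxes (uniform weights by default)."""
    if weights is None:
        weights = [1.0] * len(boxes)
    total = sum(weights)
    return tuple(
        sum(w * b[i] for w, b in zip(weights, boxes)) / total for i in range(4)
    )


def area(box: Box) -> float:
    return (box[2] - box[0]) * (box[3] - box[1])


# Three made-up reader boxes for the same opacity.
readers = [(100, 100, 300, 300), (120, 90, 320, 310), (110, 110, 290, 280)]

inter = intersect_boxes(readers)
avg = average_boxes(readers)
print(area(inter))              # 28900
print(area(inter) < area(avg))  # True
```

With non-negative weights, the intersection is always contained in the averaged box, because the largest `x_min` is at least the mean `x_min` and the smallest `x_max` is at most the mean `x_max` (and likewise for `y`).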

Relationships between instances

The 30,000-examination dataset covers 12,274 unique patients (6747 male and 5527 female). Several patients have multiple examinations, with up to 75 radiographs for one patient.

Noise

The original NIH dataset contained categorical labels derived automatically from radiology reports using natural language processing, and these labels were not always accurate. This dataset aims to improve label accuracy through expert radiologist annotation.

External data

National Institutes of Health (NIH) CXR8 dataset

Confidentiality

Publicly available NIH dataset

Re-identification

Random unique identifiers were generated for each examination, implying de-identification.