Curation of the CANDID-PTX Dataset with Free-Text Reports
2025-11-29
https://doi.org/10.1148/atlas.1764458274807
Overview
Schema Version
https://atlas.rsna.org/schemas/2025-11/dataset.json
Name
Curation of the CANDID-PTX Dataset with Free-Text Reports
Link
https://doi.org/10.1148/ryai.2021210136
Indexing
Keywords: Conventional Radiography, Thorax, Trauma, Ribs, Catheters, Segmentation, Diagnosis, Classification, Supervised Learning, Machine Learning
Content: CH
RadLex: RID5352, RID5573
Author(s)
Sijing Feng
Damian Azzollini
Ji Soo Kim
Cheng-Kai Jin
Simon P. Gordon
Jason Yeoh
Eve Kim
Mina Han
Andrew Lee
Aakash Patel
Joy Wu
Martin Urschler
Amy Fong
Cameron Simmers
Gregory P. Tarr
Stuart Barnard
Ben Wilson
Organization(s)
Department of Radiology, Dunedin Hospital
Eastern Health
Auckland District Health Board
Waitemata District Health Board
Waikato District Health Board
The University of Auckland Faculty of Medical and Health Sciences
University of Otago Medical School
IBM Almaden Research Center
School of Computer Science, University of Auckland
Department of Radiology, Auckland City Hospital
Department of Radiology, Middlemore Hospital
Contact
sijingfeng@gmail.com
Funding
Royal Australian and New Zealand College of Radiologists (RANZCR) research grant, 2020.
Ethical review
Approved by the University of Otago Human Ethics Committee, which waived individual patient informed consent because the study was low risk and used anonymized data.
Comments
This large chest radiograph dataset has segmentation annotations for pneumothoraces, acute rib fractures, and intercostal chest tubes. The dataset, which can be used to train and test machine learning algorithms that identify these features on chest radiographs, includes the corresponding anonymized, free-text radiology reports. The temporal relationship between images of the same patient has been preserved.
Dataset
Motivation
To curate the first large pneumothorax segmentation dataset from a New Zealand population, addressing the bottleneck in developing large, well-annotated datasets for training and testing AI algorithms on adult chest radiographs.
Sampling
A total of 295,613 chest radiographs and their radiology reports were retrospectively acquired from Dunedin Hospital PACS between January 2010 and April 2020. Inclusion criteria: frontal chest radiographs (including bedside images) from patients over 16 years of age.
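The inclusion criteria can be read as a simple per-image filter. The following is a minimal sketch under stated assumptions, not the curators' actual pipeline: the record layout (`id`, `view`, `age_years`), the function name `meets_inclusion_criteria`, and the view codes are all hypothetical and chosen only for illustration.

```python
# Assumed view codes for frontal projections; bedside images are
# typically acquired AP, so they pass the same check.
FRONTAL_VIEWS = {"PA", "AP"}


def meets_inclusion_criteria(record):
    """Return True for a frontal radiograph of a patient over 16 years.

    `record` is a hypothetical dict with 'view' and 'age_years' keys.
    """
    return record["view"] in FRONTAL_VIEWS and record["age_years"] > 16


# Made-up example records, not taken from the dataset.
records = [
    {"id": "img1", "view": "PA", "age_years": 54},
    {"id": "img2", "view": "LATERAL", "age_years": 54},
    {"id": "img3", "view": "AP", "age_years": 17},
    {"id": "img4", "view": "AP", "age_years": 16},
]

included = [r for r in records if meets_inclusion_criteria(r)]
```

In this sketch a 16-year-old is excluded, since the criterion is stated as "over 16 years of age".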
Missing information
Patient sex could not be extracted from the original DICOM metadata for 30 images.
Relationships between instances
The temporal relationship between images of the same patient has been preserved.
Noise
80 images were excluded because they lacked a clear 'pneumothorax' or 'no pneumothorax' label, likely owing to human error. The dataset's single-institution origin reduces heterogeneity, potentially limiting AI algorithm generalizability.
External data
MD.ai, a Health Insurance Portability and Accountability Act–compliant, commercial, cloud-based image annotation platform, was used for image upload and annotation.
Confidentiality
Image metadata and reports were de-identified using algorithm-based tools to remove protected health information (PHI) and satisfy New Zealand Ministry of Health standards. Manual review ensured quality of anonymization.
Re-identification
File names containing acquisition dates were anonymized consistently and irreversibly by translating them into future dates, while preserving temporal relationships between images.
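One common way to shift dates into the future while keeping the intervals between a patient's images intact is to add a single random offset per patient and then discard it. The sketch below illustrates that idea only; the function name `shift_dates`, the offset range, and the per-patient keying are assumptions, and the method actually used for this dataset is not described here.

```python
import secrets
from datetime import date, timedelta


def shift_dates(dates_by_patient, min_days=7300, max_days=14600):
    """Shift all acquisition dates of each patient by one random offset.

    Hypothetical helper: the offset range (roughly 20 to 40 years) is an
    assumption chosen so that dates from 2010 to 2020 land in the future.
    """
    shifted = {}
    for patient_id, dates in dates_by_patient.items():
        # One offset per patient keeps the intervals between that
        # patient's images unchanged. The offset is never stored, so the
        # original dates are not recoverable from the shifted ones alone.
        days = min_days + secrets.randbelow(max_days - min_days + 1)
        offset = timedelta(days=days)
        shifted[patient_id] = [d + offset for d in dates]
    return shifted


# Made-up example dates, not taken from the dataset.
original = {"patient_a": [date(2015, 3, 1), date(2015, 3, 4), date(2016, 1, 10)]}
shifted = shift_dates(original)
```

Because every date for a given patient moves by the same amount, the ordering and spacing of that patient's studies are unchanged after shifting.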
Sensitive data
Protected health information (PHI) was removed from images and reports during de-identification.