UT Southwestern Adult Chest Radiograph Reports (2020-2021) with Tube/Line Annotations
2026-01-24
https://doi.org/10.1148/atlas.1769275462656
122
Overview
Schema Version
https://atlas.rsna.org/schemas/2025-11/dataset.json
Name
UT Southwestern Adult Chest Radiograph Reports (2020-2021) with Tube/Line Annotations
Link
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9344209/
Indexing
Keywords: radiology reports, natural language processing, BERT, PubMedBERT, RoBERTa, DeBERTa, DistilBERT, chest radiograph, endotracheal tube, nasogastric tube, central venous catheter, Swan-Ganz catheter, UT Southwestern, dataset annotation
Content: CH, IN
RadLex: RID5557, RID49587, RID5566, RID5584, RID49768
Author(s)
Ali S. Tejani
Yee S. Ng
Yin Xi
Julia R. Fielding
Travis G. Browning
Jesse C. Rayan
Organization(s)
University of Texas Southwestern Medical Center
Contact
Corresponding author: Jesse C. Rayan (jesse.rayan@utsouthwestern.edu)
Funding
Authors declared no funding for this work.
Ethical review
IRB approved with exempt status; informed consent waived; HIPAA compliant. Reports were de-identified with pseudoanonymization.
Comments
Retrospective study of chest radiograph text reports used to train and evaluate pretrained transformer NLP models to identify the presence or absence of four devices: endotracheal tube (ETT), nasogastric tube (NGT), central venous catheter (CVC), and Swan-Ganz catheter (SGC). Of the 69,095 adult reports in the overall cohort, 1004 were manually annotated.
Date
Published: 2022-06-29
Created: 2020-04-01
References
[1] Tejani AS; Ng YS; Xi Y; Fielding JR; Browning TG; Rayan JC. "Performance of Multiple Pretrained BERT Models to Automate and Accelerate Data Annotation for Large Datasets". Radiology: Artificial Intelligence. 2022-06-29. doi:10.1148/ryai.220007. PMID: 35923377. PMCID: PMC9344209.
Dataset
Motivation
Automate and accelerate annotation of large datasets for downstream computer vision tasks using pretrained transformer NLP models.
Sampling
From 69,095 adult chest radiograph reports (April 2020–March 2021), 1004 reports were randomly selected for manual annotation and modeling.
Partitioning scheme
Fivefold cross-validation on the 1004 annotated reports, with a 60%/20%/20% train/validation/test split per fold (runs 1–5). Runs 6–10 used reduced training and validation set sizes while keeping a fixed test set of 208 reports.
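As a hedged illustration of this scheme, the sketch below deals reports into five folds grouped by patient identifier (the record notes that folds were assigned by pseudoanonymized MRN) and rotates them so that each run uses three folds for training, one for validation, and one for testing, which gives roughly 60%/20%/20%. The function name `grouped_five_fold`, the round-robin fold assignment, the seed, and the synthetic IDs are assumptions made for illustration; the source does not describe the authors' actual splitting code.

```python
import random
from collections import defaultdict


def grouped_five_fold(report_ids, patient_ids, n_folds=5, seed=0):
    """Yield (train, val, test) lists of report IDs, one triple per fold.

    Hypothetical helper: all reports from one patient land in the same
    fold, so no patient appears in more than one of train/val/test.
    """
    # Group reports by patient so a patient's reports always move together.
    by_patient = defaultdict(list)
    for rid, pid in zip(report_ids, patient_ids):
        by_patient[pid].append(rid)

    # Shuffle patients reproducibly (the seed value is an assumption).
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    # Deal patients round-robin into n_folds buckets of report IDs.
    buckets = [[] for _ in range(n_folds)]
    for i, pid in enumerate(patients):
        buckets[i % n_folds].extend(by_patient[pid])

    # Rotate: fold k is the test set, the next fold is validation,
    # and the remaining folds form the training set.
    for k in range(n_folds):
        val_k = (k + 1) % n_folds
        train = [r for j, b in enumerate(buckets)
                 if j not in (k, val_k) for r in b]
        yield train, buckets[val_k], buckets[k]


# Synthetic placeholder IDs (not real data): 1004 reports, 2 per patient.
report_ids = list(range(1004))
patient_ids = [i // 2 for i in report_ids]
for run, (train, val, test) in enumerate(
        grouped_five_fold(report_ids, patient_ids), start=1):
    print(run, len(train), len(val), len(test))
```

Because the buckets are balanced by patient count rather than report count, the split is only approximately 60%/20%/20% when patients contribute different numbers of reports.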
Missing information
No external site validation reported; no public data repository link provided.
Relationships between instances
Each radiology report was unique to a study instance/accession, with no repeated reports. Folds were assigned by pseudoanonymized medical record number (MRN) to avoid patient overlap across folds.
Noise
Reports included both structured and unstructured formats; potential negative-set bias since 64.4% of reports had no listed lines/tubes; relatively few Swan-Ganz–positive cases.
Confidentiality
Reports were de-identified and pseudoanonymized (PHI replaced with unique identifiers; original data encrypted).
Re-identification
Low risk due to de-identification and pseudoanonymization; single academic site.
Sensitive data
Adult clinical radiology reports; de-identified.