UT Southwestern Adult Chest Radiograph Reports (2020-2021) with Tube/Line Annotations
Type: dataset
Date: 2026-01-24
DOI: https://doi.org/10.1148/atlas.1769275462656

Overview

Schema Version

https://atlas.rsna.org/schemas/2025-11/dataset.json

Name

UT Southwestern Adult Chest Radiograph Reports (2020-2021) with Tube/Line Annotations

Link

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9344209/

Indexing

Keywords: radiology reports, natural language processing, BERT, PubMedBERT, RoBERTa, DeBERTa, DistilBERT, chest radiograph, endotracheal tube, nasogastric tube, central venous catheter, Swan-Ganz catheter, UT Southwestern, dataset annotation
Content: CH, IN
RadLex: RID5557, RID49587, RID5566, RID5584, RID49768

Author(s)

Ali S. Tejani
Yee S. Ng
Yin Xi
Julia R. Fielding
Travis G. Browning
Jesse C. Rayan

Organization(s)

University of Texas Southwestern Medical Center

Contact

Corresponding author: Jesse C. Rayan (email: jesse.rayan@utsouthwestern.edu)

Funding

The authors declared no funding for this work.

Ethical review

IRB approved with exempt status; informed consent waived; HIPAA compliant. Reports were de-identified and pseudonymized.

Comments

Retrospective study of chest radiograph text reports used to train and evaluate pretrained transformer NLP models that identify the presence or absence of support devices: endotracheal tube (ETT), nasogastric tube (NGT), central venous catheter (CVC), and Swan-Ganz catheter (SGC). Of the 69,095 adult reports in the overall cohort, 1004 were manually annotated.

Date

Published: 2022-06-29
Created: 2020-04-01

References

[1] Tejani AS; Ng YS; Xi Y; Fielding JR; Browning TG; Rayan JC. "Performance of Multiple Pretrained BERT Models to Automate and Accelerate Data Annotation for Large Datasets". Radiology: Artificial Intelligence. 2022-06-29. doi:10.1148/ryai.220007. PMID: 35923377. PMCID: PMC9344209.

Dataset

Motivation

Automate and accelerate annotation of large datasets for downstream computer vision tasks using pretrained transformer NLP models.

Sampling

From 69,095 adult chest radiograph reports (April 2020–March 2021), 1004 reports were randomly selected for manual annotation and modeling.

Partitioning scheme

Fivefold cross-validation on the 1004 annotated reports with a 60%/20%/20% train/validation/test split per fold (runs 1–5). Additional runs (6–10) used reduced training/validation set sizes while keeping a fixed test set of 208 reports.
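The per-fold 60%/20%/20% split can be sketched as follows. This is a minimal illustration only; the authors' exact shuffling, seeding, and rounding are not described in this record.

```python
import random

def split_60_20_20(report_ids, seed):
    """Shuffle report IDs and split 60%/20%/20% into train/validation/test.

    Integer arithmetic is used for the cut points so the split sizes are
    deterministic; any remainder falls into the test partition.
    """
    ids = list(report_ids)
    random.Random(seed).shuffle(ids)  # reproducible shuffle per run/seed
    n = len(ids)
    n_train = n * 6 // 10
    n_val = n * 2 // 10
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test

# One split per run, e.g. for run 1 over the 1004 annotated reports:
# train, val, test = split_60_20_20(range(1004), seed=1)
```

With 1004 reports this yields 602/200/202 reports per fold; the fixed 208-report test set used for runs 6–10 would be held out separately.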

Missing information

No external site validation reported; no public data repository link provided.

Relationships between instances

Each radiology report corresponds to a unique study instance/accession; no repeated reports. Folds were assigned by pseudonymized MRN to avoid patient overlap across folds.
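Assigning folds by pseudonymized MRN ensures every report from a given patient lands in the same fold. One way to sketch this (an assumption for illustration; the record does not state how fold assignment was implemented) is a deterministic hash of the MRN:

```python
import hashlib

def fold_for_mrn(pseudo_mrn: str, n_folds: int = 5) -> int:
    """Deterministically map a pseudonymized MRN to a fold index.

    Because the mapping depends only on the MRN, all reports from the
    same patient are placed in the same fold, preventing patient-level
    leakage between train/validation/test partitions.
    """
    digest = hashlib.sha256(pseudo_mrn.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds
```

Two reports sharing a pseudonymized MRN always receive the same fold index, regardless of processing order.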

Noise

Reports included both structured and unstructured formats; potential negative-class bias, since 64.4% of reports had no listed lines/tubes; relatively few Swan-Ganz–positive cases.

Confidentiality

Reports were de-identified and pseudonymized (PHI replaced with unique identifiers; original data encrypted).
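Replacing PHI with stable unique identifiers can be sketched as a salted hash. This is an assumed illustration, not the authors' pipeline: the `make_pseudonymizer` helper and `PT-` prefix are hypothetical, and a real deployment would also scrub free-text PHI, not just identifiers.

```python
import hashlib

def make_pseudonymizer(salt: bytes):
    """Return a function mapping a real identifier (e.g., an MRN) to a
    stable pseudonymous ID.

    The same identifier plus the same salt always yields the same
    pseudonym, preserving patient-level linkage across reports. The salt
    must be stored securely (e.g., encrypted with the original data) so
    the mapping cannot be reversed from the released pseudonyms alone.
    """
    def pseudonymize(identifier: str) -> str:
        digest = hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()
        return "PT-" + digest[:12]  # short, stable pseudonymous ID
    return pseudonymize
```

Deterministic pseudonyms keep folds assignable by patient while removing the original MRN from the working dataset.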

Re-identification

Low risk owing to de-identification and pseudonymization; data are from a single academic site.

Sensitive data

Adult clinical radiology reports; de-identified.