Hurdles to Artificial Intelligence Deployment: Noise in Schemas and “Gold” Labels
2026-01-24
https://doi.org/10.1148/atlas.1769271856794
Overview
Schema Version
https://atlas.rsna.org/schemas/2025-11/model.json
Name
Hurdles to Artificial Intelligence Deployment: Noise in Schemas and “Gold” Labels
Link
https://dx.doi.org/10.1148/ryai.220056
Indexing
Keywords: Radiology AI, Dataset creation, Noise in datasets, Schema noise, Label noise, Chest radiograph, CheXpert, ChestX-ray14
Content: CH, RS, IN
RadLex: RID5557, RID43255, RID5352
SNOMED: 36118008, 233604007
Author(s)
Mohamed Abdalla
Benjamin Fine
Organization(s)
Institute for Better Health, Trillium Health Partners
University of Toronto
Centre for Information Technology, Department of Computer Science, University of Toronto
Department of Medical Imaging, University of Toronto
Version
1.0
License
Text: © 2023 by the Radiological Society of North America, Inc.
Contact
Mohamed Abdalla: msa@cs.toronto.edu
Funding
Acknowledgments note support for M.A. from a Vanier Scholarship (Government of Canada) and the Vector Institute; B.F. and the AI Deployment and Evaluation Laboratory at Trillium Health Partners are supported by TD Bank, Canada’s Supercluster, and the Trillium Health Partners Foundation.
Ethical review
No human subjects research was performed; the study was exempt from institutional review board review (as stated in the Label Noise Demonstration section).
Date
Updated: 2023-03-01
Published: 2023-01-11
Created: 2022-03-24
Model
Clinical benefit
Not a deployable model; the study characterizes schema and label noise that affect the evaluation and deployment of chest radiograph AI classifiers.
Clinical workflow phase
Evaluation/validation considerations prior to deployment; guidance for dataset design and external testing.
Input
Chest radiograph annotations (CheXpert test set; eight annotators; 14 classes; 500 images); comparison of class schemas across CheXpert, ChestX-ray14, and one proprietary classifier.
Limitations
Limited to the CheXpert test set; does not analyze the underlying images; annotator uncertainty levels unavailable; image-only context (no clinical history or prior examinations); focuses on pairwise agreement summaries, though Fleiss kappa is also reported.
Output
CDEs: RDE1401, RDE1402
Description: Study outputs include quantified schema overlaps between datasets/classifiers and agreement metrics (percent agreement, Cohen’s kappa, Fleiss kappa) across simulated gold label sets.
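The agreement metrics named above (percent agreement, Cohen kappa, Fleiss kappa) follow their standard definitions. A minimal pure-Python sketch of those definitions, not code from the study, might look like this (function names and data shapes are illustrative assumptions):

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two annotators assign the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators' label lists."""
    n = len(a)
    p_o = percent_agreement(a, b)                      # observed agreement
    ca, cb = Counter(a), Counter(b)
    # expected agreement under independent labeling with each rater's marginals
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

def fleiss_kappa(ratings):
    """Fleiss kappa; `ratings` is a list of per-item label lists, one label
    per rater, with the same number of raters m for every item."""
    n, m = len(ratings), len(ratings[0])
    totals = Counter()
    p_items = []
    for row in ratings:
        c = Counter(row)
        totals.update(c)
        # per-item agreement: proportion of agreeing rater pairs
        p_items.append((sum(v * v for v in c.values()) - m) / (m * (m - 1)))
    p_bar = sum(p_items) / n
    p_e = sum((totals[l] / (n * m)) ** 2 for l in totals)
    return (p_bar - p_e) / (1 - p_e)
```

For example, `cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])` discounts the 75% raw agreement by the 50% expected by chance, yielding 0.5.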
Recommendation
Report schema justification and label noise metrics; consider ontology-anchored schemas; prefer soft labels where appropriate; standardize use cases (e.g., ACR DSI) to reduce noise.
Reproducibility
Analyses are based on the publicly described CheXpert test annotations and combinatorial re-sampling of annotator panels; supplemental tables and figures provide details.
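The combinatorial re-sampling of annotator panels can be sketched as enumerating annotator subsets and taking a majority vote per image to form each simulated "gold" label set. This is an assumed reconstruction of the general idea, not the study's exact procedure; the tie-breaking rule and data layout here are hypothetical:

```python
from collections import Counter
from itertools import combinations

def majority_label(labels):
    """Most common label in a panel; ties broken by smallest label (assumed rule)."""
    counts = Counter(labels)
    top = max(counts.values())
    return min(l for l, c in counts.items() if c == top)

def gold_sets_from_panels(annotations, panel_size):
    """annotations: dict image_id -> list of labels, one per annotator.
    Yields (panel, gold_labels) for every annotator panel of the given size,
    where gold_labels maps each image to the panel's majority-vote label."""
    n_annotators = len(next(iter(annotations.values())))
    for panel in combinations(range(n_annotators), panel_size):
        gold = {img: majority_label([labels[i] for i in panel])
                for img, labels in annotations.items()}
        yield panel, gold
```

Agreement metrics computed between gold sets from different panels then quantify how much the "gold" standard itself shifts with the choice of annotators.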