Hurdles to Artificial Intelligence Deployment: Noise in Schemas and “Gold” Labels
Type: model
Date: 2026-01-24
DOI: https://doi.org/10.1148/atlas.1769271856794

Overview

Schema Version

https://atlas.rsna.org/schemas/2025-11/model.json

Name

Hurdles to Artificial Intelligence Deployment: Noise in Schemas and “Gold” Labels

Link

https://dx.doi.org/10.1148/ryai.220056

Indexing

Keywords: Radiology AI, Dataset creation, Noise in datasets, Schema noise, Label noise, Chest radiograph, CheXpert, ChestX-ray14
Content: CH, RS, IN
RadLex: RID5557, RID43255, RID5352
SNOMED: 36118008, 233604007

Author(s)

Mohamed Abdalla
Benjamin Fine

Organization(s)

Institute for Better Health, Trillium Health Partners
University of Toronto
Centre for Information Technology, Department of Computer Science, University of Toronto
Department of Medical Imaging, University of Toronto

Version

1.0

License

Text: © 2023 by the Radiological Society of North America, Inc.

Contact

Mohamed Abdalla: msa@cs.toronto.edu

Funding

M.A. is supported by a Vanier Scholarship (Government of Canada) and the Vector Institute. The AI Deployment and Evaluation Laboratory at Trillium Health Partners and B.F. are supported by TD Bank, Canada’s Supercluster, and the Trillium Health Partners Foundation.

Ethical review

No human subjects research was performed; the study was exempt from institutional review board review (as stated in the Label Noise Demonstration section).

Date

Updated: 2023-03-01
Published: 2023-01-11
Created: 2022-03-24

Model

Clinical benefit

Not a deployable model; the study characterizes schema and label noise that affect the evaluation and deployment of chest radiograph AI classifiers.

Clinical workflow phase

Evaluation/validation considerations prior to deployment; guidance for dataset design and external testing.

Input

Chest radiograph annotations (CheXpert test set; eight annotators; 14 classes; 500 images); comparison of class schemas across CheXpert, ChestX-ray14, and one proprietary classifier.
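The schema comparison described above can be sketched as a set overlap between class lists. A minimal illustration follows; the class names are placeholders chosen for the example, not the datasets' actual schemas, whose differing granularity and naming are part of the noise the study characterizes.

```python
# Sketch: quantify overlap between two dataset label schemas as set overlap.
# Class names are illustrative placeholders, not the real CheXpert /
# ChestX-ray14 schemas.
schema_a = {"Atelectasis", "Cardiomegaly", "Consolidation", "Edema", "Pneumonia"}
schema_b = {"Atelectasis", "Cardiomegaly", "Effusion", "Emphysema", "Pneumonia"}

shared = schema_a & schema_b                      # classes both schemas name
jaccard = len(shared) / len(schema_a | schema_b)  # overlap as a single score
print(sorted(shared), round(jaccard, 3))
```

In practice, name-level matching understates the mismatch, since nominally shared classes (e.g., differing definitions of consolidation) may still be annotated differently.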

Limitations

Limited to the CheXpert test set; does not analyze the underlying images; annotator uncertainty levels unavailable; image-only context (no clinical history or prior examinations); focuses on pairwise agreement summaries, though Fleiss kappa is also reported.

Output

CDEs: RDE1401, RDE1402
Description: Study outputs include quantified schema overlaps between datasets/classifiers and agreement metrics (percent agreement, Cohen’s kappa, Fleiss kappa) across simulated gold label sets.
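The pairwise agreement metrics named above can be sketched directly from their definitions. The toy labels below are invented for illustration; only the formulas (observed vs. chance-corrected agreement) reflect the metrics the study reports.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum((ca[c] / n) * (cb[c] / n) for c in ca)      # chance agreement
    return (po - pe) / (1 - pe)

# Toy binary labels (1 = finding present) for 10 images -- illustrative only.
r1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
r2 = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]
percent = sum(x == y for x, y in zip(r1, r2)) / len(r1)  # raw percent agreement
print(percent, round(cohens_kappa(r1, r2), 3))
```

Note how the kappa value falls well below raw percent agreement once chance agreement is subtracted, which is why the study reports both.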

Recommendation

Report schema justification and label noise metrics; consider ontology-anchored schemas; prefer soft labels where appropriate; standardize use cases (e.g., ACR DSI) to reduce noise.
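The soft-label recommendation can be made concrete: instead of collapsing annotator votes into a single hard "gold" label, retain the vote distribution. A minimal sketch with invented votes:

```python
# Sketch: soft labels keep the annotator vote distribution instead of a
# majority-vote hard label. Votes are invented for illustration
# (1 = finding present), with eight annotators per image.
votes = [1, 1, 1, 0, 0, 1, 0, 1]

hard_label = int(sum(votes) > len(votes) / 2)  # majority vote discards dissent
soft_label = sum(votes) / len(votes)           # fraction of positive votes
print(hard_label, soft_label)
```

A soft label of 0.625 records that three of eight annotators disagreed, information a hard label of 1 throws away.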

Reproducibility

Analyses are based on publicly described CheXpert test annotations and combinatorial resampling of annotator panels; supplemental tables and figures provide details.
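The combinatorial resampling idea can be sketched as follows: every fixed-size subset of the annotator pool yields its own majority-vote "gold" label, and disagreement across subsets exposes label noise. The votes below are invented for illustration; panel size and vote patterns are assumptions, not the study's actual data.

```python
from itertools import combinations

# Invented votes for one image from eight annotators (1 = finding present).
votes = [1, 1, 1, 0, 0, 1, 0, 1]

def majority(panel):
    """Majority vote over a list of binary labels."""
    return int(sum(panel) > len(panel) / 2)

full_gold = majority(votes)
# Every 3-annotator panel produces its own candidate "gold" label.
golds = [majority([votes[i] for i in idx])
         for idx in combinations(range(len(votes)), 3)]
flip_rate = sum(g != full_gold for g in golds) / len(golds)
print(len(golds), round(flip_rate, 3))
```

Even with a clear 5-of-8 majority, a nontrivial fraction of small panels would have produced the opposite gold label, which is the kind of instability the resampling analysis quantifies.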