Generalizability of Machine Learning Models: Quantitative Evaluation of Three Methodological Pitfalls
model
2026-01-24
https://doi.org/10.1148/atlas.1769272531729

Overview

Schema Version

https://atlas.rsna.org/schemas/2025-11/model.json

Name

Generalizability of Machine Learning Models: Quantitative Evaluation of Three Methodological Pitfalls

Link

https://dx.doi.org/10.1148/ryai.220028

Indexing

Keywords: generalizability, independence assumption, data leakage, oversampling, data augmentation, feature selection, batch effect, performance metrics, F1 score, Dice, IoU, radiomics, head and neck cancer, lung adenocarcinoma, pneumonia detection
Content: RS, CT, CH
RadLex: RID10323, RID10311, RID12722, RID10321

Author(s)

Farhad Maleki
Katie Ovens
Rajiv Gupta
Caroline Reinhold
Alan Spatz
Reza Forghani

Organization(s)

Department of Computer Science, University of Calgary, Calgary, Canada
Department of Radiology, Massachusetts General Hospital, Boston, MA, USA
Augmented Intelligence & Precision Health Laboratory (AIPHL), Department of Radiology and the Research Institute of the McGill University Health Centre, McGill University, Montreal, Canada
Montreal Imaging Experts, Montreal, Canada
Division of Pathology, Jewish General Hospital, Montreal, Canada
Radiomics and Augmented Intelligence Laboratory (RAIL), Department of Radiology and the Norman Fixel Institute for Neurologic Diseases, University of Florida College of Medicine, UF Health Shands Hospital, Gainesville, FL, USA

Version

1.0

License

Text: CC BY 4.0
URL: https://creativecommons.org/licenses/by/4.0/

Contact

Corresponding author: Reza Forghani (email: r.forghani@ufl.edu)

Funding

R.F. supported by the Fonds de recherche en santé du Québec (FRQS) and by an operating grant jointly funded by FRQS and the Fondation de l'Association des radiologistes du Québec (FARQ). R.G. supported by NIH grants 5R01CA212382-05, 5R01EB024343-04, and 1R03EB032038-01. C.R. supported by a grant from Imagia-Medteq. Additional disclosures: grants to institution from the McGill University Health Centre Foundation, TD Bank, GE Healthcare, and Intel; Canadian Cancer Society/CIHR/Brain Canada SPARK-21; FRQS and FARQ research salary and operating grant.

Ethical review

Retrospective, institutional review board–exempt study.

Date

Updated: 2023-01-01
Published: 2022-11-16
Created: 2022-02-13

References

[1] Maleki F, Ovens K, Gupta R, Reinhold C, Spatz A, Forghani R. "Generalizability of Machine Learning Models: Quantitative Evaluation of Three Methodological Pitfalls". Radiology: Artificial Intelligence. 2023;5(1):e220028. Published 2022-11-16. doi:10.1148/ryai.220028. PMID: 36721408. PMCID: PMC9885377.

Model

Architecture

Multiple illustrative models: conventional radiomics-based machine learning classifiers, deep learning CNN-based classifiers, and a simple threshold-based segmentation baseline.

Availability

Not stated.

Clinical benefit

Methodological demonstration to identify pitfalls that impair model generalizability; not intended for clinical deployment.

Clinical workflow phase

Research methodology and model evaluation best practices.

Degree of automation

Not specified; the study demonstrates the effects of methodological choices on automated model performance.

Indications for use

Not a clinical device; the models were built to illustrate methodological pitfalls across diagnosis, prognosis, and segmentation tasks.

Input

CT images (head and neck; lung), chest radiographs, histopathology whole-slide image patches; derived radiomics features for some tasks.

Limitations

The authors note that only well-established architectures were used; the impact of sample size and image resolution was not fully explored; the analysis focuses on violations of the independence assumption; and the datasets may not fully represent real-world deployment distributions.

Output

CDEs: RDE2459, RDE1703, RDE744, RDE339, RDE2622
Description: Illustrative model outputs included binary and multiclass classifications (e.g., local recurrence, 3-year overall survival, histopathologic pattern, pneumonia vs. normal) and lung segmentation masks.
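The performance metrics indexed for this record (F1 score, Dice, IoU) can be made concrete with a minimal sketch. The masks and confusion-matrix counts below are hypothetical illustrations chosen only to show how the metrics are computed and why a trivial baseline matters on imbalanced data; they are not results from the study.

```python
def f1(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)

def dice(a, b):
    """Dice coefficient between two binary masks, given as sets of pixel indices."""
    return 2 * len(a & b) / (len(a) + len(b))

def iou(a, b):
    """Intersection over union (Jaccard index) between two binary masks."""
    return len(a & b) / len(a | b)

# Hypothetical segmentation: predicted vs. reference pixel sets.
pred = {1, 2, 3, 4}
ref = {3, 4, 5, 6}
print(round(dice(pred, ref), 3), round(iou(pred, ref), 3))  # 0.5 0.333

# A trivial "predict everything positive" baseline on a hypothetical test set
# with 90 positives and 10 negatives still reaches F1 ~ 0.947, which is why
# metrics should always be compared against a simple baseline.
print(round(f1(tp=90, fp=10, fn=0), 3))  # 0.947
```

Note that Dice and IoU are monotonically related (Dice = 2·IoU / (1 + IoU)), so they rank models identically but report different absolute values.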

Recommendation

Follow the methodological guidelines provided in the paper: avoid violations of the independence assumption (e.g., oversampling, data augmentation, or feature selection applied before the train-test split), select appropriate performance metrics and baselines, and detect and mitigate batch effects.
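The independence-assumption pitfall named above (oversampling before the train-test split) can be sketched with a toy example. The dataset, class labels, round-robin oversampler, and deterministic holdout split below are hypothetical simplifications for illustration, not the paper's experimental setup.

```python
# Toy dataset: majority ("maj") and minority ("min") samples, each carrying a
# unique id so that duplicated copies can be traced across partitions.
samples = [("min" if i % 5 == 0 else "maj", i) for i in range(25)]

def oversample(data):
    """Balance classes by appending round-robin copies of minority samples."""
    minority = [s for s in data if s[0] == "min"]
    majority = [s for s in data if s[0] == "maj"]
    copies = [minority[k % len(minority)]
              for k in range(len(majority) - len(minority))]
    return data + copies

def holdout_split(data, test_frac=0.2):
    """Simplified deterministic holdout: the last test_frac becomes the test set."""
    k = int(len(data) * (1 - test_frac))
    return data[:k], data[k:]

# Pitfall: oversampling BEFORE splitting lets copies of the same sample land
# in both partitions, so test performance is inflated by data leakage.
train, test = holdout_split(oversample(samples))
leaked = set(train) & set(test)

# Correct order: split first, then oversample only the training partition.
train2, test2 = holdout_split(samples)
train2 = oversample(train2)
leaked2 = set(train2) & set(test2)

print(len(leaked), len(leaked2))  # 5 0
```

The same ordering rule applies to data augmentation and feature selection: any step that looks at the data must be fit on the training partition only, after the split.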

Reproducibility

Radiomics model building repeated 100 times for some comparisons; internal evaluation only; no external code release noted.

Use

Intended: Decision support, Detection and diagnosis (illustrative research use only)
Out-of-scope: Clinical decision support
Excluded: Other

User

Intended: Researcher
Out-of-scope: Patient
Excluded: Caregiver