Prediction–Saliency Correlation (PSC) for evaluating saliency methods in radiology AI
2025-11-30
https://doi.org/10.1148/atlas.1764531889856
Overview
Schema Version
https://atlas.rsna.org/schemas/2025-11/model.json
Name
Prediction–Saliency Correlation (PSC) for evaluating saliency methods in radiology AI
Link
https://dx.doi.org/10.1148/ryai.220221
Indexing
Keywords: saliency maps, explainability, trustworthiness, prediction-saliency correlation (PSC), CheXpert, DenseNet-121, ResNet-152, brain MRI, adversarial perturbation, SSIM, AUC
Content: CH, MR, NR, RS
RadLex: RID10312
Author(s)
Jiajin Zhang
Hanqing Chao
Giridhar Dasegowda
Ge Wang
Mannudeep K. Kalra
Pingkun Yan
Organization(s)
Rensselaer Polytechnic Institute, Department of Biomedical Engineering, Center for Biotechnology and Interdisciplinary Studies
Massachusetts General Hospital, Department of Radiology, Harvard Medical School
Version
1.0
License
Text: © 2023 by the Radiological Society of North America, Inc.
Funding
Supported by National Science Foundation (2046708) and National Institutes of Health (R01EB032716).
Ethical review
Retrospective study using fully de-identified public datasets; exempt from institutional review board approval and HIPAA compliant.
Date
Published: 2023-11-08
References
[1] Zhang J, Chao H, Dasegowda G, Wang G, Kalra MK, Yan P. Revisiting the Trustworthiness of Saliency Methods in Radiology AI. Radiology: Artificial Intelligence. 2024;6(1):e220221. Published online 2023 Nov 8. doi:10.1148/ryai.220221. PMID: 38166328. PMCID: PMC10831523.
Model
Architecture
Evaluation framework using standard CNN classifiers (DenseNet-121, ResNet-152, ResNet-50) and seven saliency methods (vanilla backpropagation, vanilla backpropagation × image, Grad-CAM, guided Grad-CAM, integrated gradients, SmoothGrad, XRAI). Also assesses a commercial black-box prototype.
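For illustration, a minimal sketch of the simplest listed method, vanilla backpropagation, assuming a PyTorch image classifier; the model, its weights, and the input tensor are placeholders rather than the study's actual code.

```python
import torch
import torchvision.models as models

# Untrained stand-in classifier; a real evaluation would load trained weights.
model = models.densenet121(weights=None).eval()

def vanilla_saliency(model: torch.nn.Module, image: torch.Tensor, target_class: int) -> torch.Tensor:
    """Gradient of the target-class logit with respect to the input pixels."""
    image = image.clone().requires_grad_(True)   # leaf tensor, so .grad is populated
    logits = model(image.unsqueeze(0))           # shape (1, num_classes)
    logits[0, target_class].backward()
    # Collapse color channels by taking the max absolute gradient per pixel.
    return image.grad.abs().max(dim=0).values

x = torch.rand(3, 224, 224)                                 # placeholder radiograph tensor
saliency_map = vanilla_saliency(model, x, target_class=0)   # shape (224, 224)
```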
Availability
Data used: CheXpert (https://stanfordmlgroup.github.io/competitions/chexpert) and a brain tumor MRI dataset (https://www.kaggle.com/datasets/masoudnickparvar/brain-tumor-mri-dataset).
Clinical benefit
Provides a quantitative method (PSC) to assess trustworthiness of saliency explanations in medical AI, potentially informing safer interpretation and deployment of AI outputs.
Clinical workflow phase
Research and validation; methodology to support evaluation of AI explainability prior to clinical deployment.
Degree of automation
Analytical/evaluation tool; does not automate clinical decisions; quantifies agreement between model prediction changes and saliency map changes.
Indications for use
Quantitative evaluation of the sensitivity and robustness of saliency-based explanations for AI models using medical images (e.g., chest radiographs and brain MRI) in research settings.
Input
Medical images (frontal chest radiographs from CheXpert; brain MRI images from a public dataset); model predictions and saliency maps from various saliency methods or a commercial prototype.
Instructions
Apply the defined perturbation strategies to create prediction-changing adversarial images (for the sensitivity analysis) or saliency-changing adversarial images (for the robustness analysis); compute the PSC as the Pearson correlation between prediction changes and saliency-map changes, each quantified via Jensen–Shannon divergence; report the accompanying AUC and SSIM measurements as in the study.
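A minimal sketch of the PSC computation as described above, assuming that both the class-probability vectors and the normalized saliency maps are compared as distributions via Jensen–Shannon divergence; all array and function names are illustrative, not the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import pearsonr

def as_distribution(saliency_map: np.ndarray) -> np.ndarray:
    """Flatten a saliency map and normalize it into a probability distribution."""
    s = np.abs(saliency_map).ravel()
    return s / s.sum()

def psc(pred_pairs, saliency_pairs) -> float:
    """PSC over a set of (original, perturbed) image pairs.

    pred_pairs: (p_orig, p_pert) class-probability vectors per image pair.
    saliency_pairs: (s_orig, s_pert) saliency maps per image pair.
    """
    # scipy's jensenshannon returns the JS *distance* (the square root of the
    # divergence), so squaring recovers the divergence itself.
    pred_change = [jensenshannon(p, q) ** 2 for p, q in pred_pairs]
    sal_change = [jensenshannon(as_distribution(a), as_distribution(b)) ** 2
                  for a, b in saliency_pairs]
    r, _ = pearsonr(pred_change, sal_change)
    return r  # in [-1, +1]; higher means the explanation tracks the prediction
```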
Limitations
Study focused on attribution-based saliency methods; other explainability techniques (e.g., counterfactual explanations) were not evaluated; generalization beyond the evaluated datasets and models is not established; the commercial prototype was evaluated without access to its internal architecture; details such as image file formats and resolutions are not specified.
Output
CDEs: RDE28, RDE17
Description: PSC coefficient (−1 to +1) quantifying correlation between prediction changes and saliency map changes; accompanying AUC and SSIM measurements for sensitivity and robustness analyses.
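The accompanying measurements can be reproduced with standard library routines; a sketch with placeholder inputs follows (skimage's structural_similarity and scikit-learn's roc_auc_score are real functions, but the data shown are illustrative).

```python
import numpy as np
from skimage.metrics import structural_similarity
from sklearn.metrics import roc_auc_score

# SSIM between original and perturbed saliency maps (robustness analysis).
sal_orig = np.random.rand(224, 224)   # placeholder saliency maps
sal_pert = np.random.rand(224, 224)
ssim_value = structural_similarity(sal_orig, sal_pert, data_range=1.0)

# AUC of classifier outputs on perturbed images (sensitivity analysis).
y_true = np.array([0, 1, 1, 0, 1])             # placeholder ground-truth labels
y_score = np.array([0.2, 0.8, 0.6, 0.3, 0.9])  # placeholder predicted probabilities
auc_value = roc_auc_score(y_true, y_score)
```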
Recommendation
Use PSC to validate saliency methods before relying on them to interpret medical AI outputs; exercise caution, as commonly used saliency maps showed low sensitivity and robustness under subtle perturbations.
Regulatory information
Comment: Research method; no regulatory authorization applicable.
Reproducibility
Model-agnostic and generalizable across the tested CNN architectures and both datasets; a human reader study showed that the perturbed images are difficult for experts to detect.
Use
Intended: Detection and diagnosis
Out-of-scope: Decision support
Excluded: Decision support
User
Intended: Researcher
Out-of-scope: Patient
Excluded: Layperson