VISTA3D: A Unified Segmentation Foundation Model For 3D Medical Imaging
2025-11-22
https://doi.org/10.1148/atlas.1763846318923
Overview
Schema Version
https://atlas.rsna.org/schemas/2025-11/model.json
Name
VISTA3D: A Unified Segmentation Foundation Model For 3D Medical Imaging
Link
https://github.com/Project-MONAI/VISTA
Indexing
Keywords: VISTA3D, Foundation Model, 3D Medical Imaging, Image Segmentation, Automatic Segmentation, Interactive Segmentation, Zero-shot Learning, Computed Tomography, Deep Learning, Human-in-the-loop, Multi-organ Segmentation, Tumor Segmentation, Lesion Segmentation, Transfer Learning
Content: CT, MR, NM, US
Author(s)
Yufan He
Pengfei Guo
Yucheng Tang
Andriy Myronenko
Vishwesh Nath
Ziyue Xu
Dong Yang
Can Zhao
Benjamin Simon
Mason Belue
Stephanie Harmon
Baris Turkbey
Daguang Xu
Wenqi Li
Organization(s)
NVIDIA
University of Arkansas for Medical Sciences
National Institutes of Health
University of Oxford
Version
v3
Comments
VISTA3D is a unified foundation model for 3D medical image segmentation, supporting a full human-in-the-loop workflow. It achieves state-of-the-art performance in both 3D automatic (127 classes) and 3D interactive segmentation, including zero-shot capabilities for novel structures and efficient human correction. The model is built on a SegResNet backbone with shared encoder and separate automatic and interactive branches, utilizing a novel 3D supervoxel method for zero-shot performance.
Date
Published: 2024-11-22
Model
Architecture
VISTA3D is built on a SegResNet (U-net type) convolutional neural network backbone from MONAI, adapted for 3D medical image segmentation using patch-based training (128-voxel cubic patch) and sliding window inference. It features a shared image encoder and two distinct branches: an automatic branch with an MLP layer and learnable class embedding for 127 supported classes, and an interactive branch based on SAM's point prompt encoder, modified for 3D inputs. A novel 3D supervoxel method distills 2D pre-trained backbones for enhanced zero-shot performance.
Availability
Code and weights are publicly available at https://github.com/Project-MONAI/VISTA.
Clinical benefit
Reduces human effort in 3D medical image annotation and segmentation by providing highly accurate automatic segmentation for common structures, interactive correction capabilities, and zero-shot ability for novel structures. Facilitates surgical planning, large cohort analysis, and diagnosis by enabling efficient and accurate segmentation.
Clinical workflow phase
Diagnosis, Treatment planning, Follow-up
Degree of automation
High degree of automation for 127 supported classes, with human-in-the-loop interactive capabilities for refinement and zero-shot segmentation of novel structures. The model can operate fully automatically or with user guidance.
Indications for use
Segmentation of 3D medical images (primarily CT) for common anatomical structures (127 classes), rare pathologies, and novel structures. Intended for both automatic segmentation and interactive segmentation with human correction, applicable in tasks such as organ segmentation, tumor detection, and lesion segmentation.
Input
3D CT volumetric images; Class prompts (integer index for supported classes); 3D point click prompts (coordinates and labels for interactive segmentation).
Instructions
For supported classes (127 classes), provide a class prompt for automatic segmentation. For novel or rare classes, use 3D point click prompts for interactive zero-shot segmentation. Interactive segmentation can also be used to efficiently refine automatic segmentation results by providing positive/negative click points in false positive/negative regions.
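The two prompt types described above can be sketched as plain tensors. This is a hypothetical illustration of the data a caller would assemble, not the actual VISTA3D prompt API; the variable names, shapes, and the class index are assumptions.

```python
# Hypothetical sketch of the two prompt types, assuming a tensor-based
# interface; names and shapes are illustrative, not the VISTA3D API.
import torch

# Automatic branch: a class prompt is an integer index into the 127
# supported classes (the index value here is a placeholder).
class_prompt = torch.tensor([1])

# Interactive branch: 3D click prompts are (x, y, z) voxel coordinates
# with one label per point: 1 = positive click inside the structure,
# 0 = negative click on a false-positive region to be removed.
point_coords = torch.tensor([[[120.0, 88.0, 45.0],    # positive click
                              [130.0, 95.0, 50.0]]])  # negative click
point_labels = torch.tensor([[1, 0]])

# One prompt set per batch item: (batch, num_points, 3) coordinates
# and (batch, num_points) labels.
assert point_coords.shape == (1, 2, 3)
assert point_labels.shape == (1, 2)
```

Mixing positive and negative clicks in one prompt set is what enables the efficient refinement workflow: positive clicks grow the mask into false-negative regions, negative clicks carve out false positives.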
Limitations
The current model primarily supports CT imaging; future work aims to include MRI and PET. While demonstrating state-of-the-art zero-shot capabilities, interactive input is still required for novel structures. In ambiguous cases, the model may default to segmenting common organs unless the zero-shot embedding is explicitly selected.
Output
Description: Binary segmentation masks for 3D medical images, delineating anatomical structures, lesions, or tumors. Output can be generated automatically or interactively refined.
Recommendation
Recommended for researchers and clinicians requiring versatile, accurate, and efficient 3D medical image segmentation across a broad range of anatomical structures and pathologies, particularly for applications involving large datasets, surgical planning, or the analysis of novel/rare findings.
Reproducibility
Code and weights are publicly available, enabling reproducibility of the model and its results.
Sustainability
Training required 64 32GB NVIDIA V100 GPUs for approximately 20,000 total GPU hours. Inference on a 16GB V100 GPU for 118-class automatic segmentation ranges from 1m07s (333x333x603 image size) to 9m20s (1024x1024x512 image size). Single-click interactive inference takes 3.2s on a 12GB GPU.
Use
Intended: Image segmentation, Automatic segmentation, Interactive segmentation, Zero-shot segmentation, Image annotation, Surgical planning, Large cohort analysis, Tumor detection, Organ segmentation
Out-of-scope: 2D image segmentation, Diagnosis without human oversight
User
Intended: Diagnostic radiologist, Physician, Researcher, Medical annotator, AI developer
Out-of-scope: General public