VISTA3D: A Unified Segmentation Foundation Model For 3D Medical Imaging
2025-11-22
https://doi.org/10.1148/atlas.1763846318923
Overview
Schema Version
https://atlas.rsna.org/schemas/2025-11/model.json
Name
VISTA3D: A Unified Segmentation Foundation Model For 3D Medical Imaging
Link
https://github.com/Project-MONAI/VISTA
Indexing
Keywords: VISTA3D, Foundation Model, 3D Medical Imaging, Image Segmentation, Automatic Segmentation, Interactive Segmentation, Zero-shot Learning, Computed Tomography, Deep Learning, Human-in-the-loop, Multi-organ Segmentation, Tumor Segmentation, Lesion Segmentation, Transfer Learning
Content: CT, MR, NM, US
Author(s)
Yufan He
Pengfei Guo
Yucheng Tang
Andriy Myronenko
Vishwesh Nath
Ziyue Xu
Dong Yang
Can Zhao
Benjamin Simon
Mason Belue
Stephanie Harmon
Baris Turkbey
Daguang Xu
Wenqi Li
Organization(s)
NVIDIA
University of Arkansas for Medical Sciences
National Institutes of Health
University of Oxford
Version
v3
Comments
VISTA3D is a unified foundation model for 3D medical image segmentation, supporting a full human-in-the-loop workflow. It achieves state-of-the-art performance in both 3D automatic (127 classes) and 3D interactive segmentation, including zero-shot capabilities for novel structures and efficient human correction. The model is built on a SegResNet backbone with shared encoder and separate automatic and interactive branches, utilizing a novel 3D supervoxel method for zero-shot performance.
Date
Published: 2024-11-22
Model
Architecture
VISTA3D is built on a SegResNet (U-net type) convolutional neural network backbone from MONAI, adapted for 3D medical image segmentation using patch-based training (128-voxel cubic patch) and sliding window inference. It features a shared image encoder and two distinct branches: an automatic branch with an MLP layer and learnable class embedding for 127 supported classes, and an interactive branch based on SAM's point prompt encoder, modified for 3D inputs. A novel 3D supervoxel method distills 2D pre-trained backbones for enhanced zero-shot performance.
Availability
Code and weights are publicly available at https://github.com/Project-MONAI/VISTA.
Clinical benefit
Reduces human effort in 3D medical image annotation and segmentation by providing highly accurate automatic segmentation for common structures, interactive correction capabilities, and zero-shot ability for novel structures. Facilitates surgical planning, large cohort analysis, and diagnosis by enabling efficient and accurate segmentation.
Clinical workflow phase
Diagnosis, Treatment planning, Follow-up
Degree of automation
High degree of automation for 127 supported classes, with human-in-the-loop interactive capabilities for refinement and zero-shot segmentation of novel structures. The model can operate fully automatically or with user guidance.
Indications for use
Segmentation of 3D medical images (primarily CT) for common anatomical structures (127 classes), rare pathologies, and novel structures. Intended for both automatic segmentation and interactive segmentation with human correction, applicable in tasks such as organ segmentation, tumor detection, and lesion segmentation.
Input
3D CT volumetric images; Class prompts (integer index for supported classes); 3D point click prompts (coordinates and labels for interactive segmentation).
Instructions
For supported classes (127 classes), provide a class prompt for automatic segmentation. For novel or rare classes, use 3D point click prompts for interactive zero-shot segmentation. Interactive segmentation can also be used to efficiently refine automatic segmentation results by providing positive/negative click points in false positive/negative regions.
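The two prompt types described above can be sketched as plain tensors. This is a hypothetical illustration of the data a caller would assemble, not the actual VISTA3D prompt API; the variable names, shapes, and the class index are assumptions.

```python
# Hypothetical sketch of the two prompt types, assuming a tensor-based
# interface; names and shapes are illustrative, not the VISTA3D API.
import torch

# Automatic branch: a class prompt is an integer index into the 127
# supported classes (the index value here is a placeholder).
class_prompt = torch.tensor([1])

# Interactive branch: 3D click prompts are (x, y, z) voxel coordinates
# with one label per point: 1 = positive click inside the structure,
# 0 = negative click on a false-positive region to be removed.
point_coords = torch.tensor([[[120.0, 88.0, 45.0],    # positive click
                              [130.0, 95.0, 50.0]]])  # negative click
point_labels = torch.tensor([[1, 0]])

# One prompt set per batch item: (batch, num_points, 3) coordinates
# and (batch, num_points) labels.
assert point_coords.shape == (1, 2, 3)
assert point_labels.shape == (1, 2)
```

Mixing positive and negative clicks in one prompt set is what enables the efficient refinement workflow: positive clicks grow the mask into false-negative regions, negative clicks carve out false positives.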
Limitations
The current model primarily supports CT imaging; future work aims to include MRI and PET. While demonstrating state-of-the-art zero-shot capabilities, interactive input is still required for novel structures. In ambiguous cases, the model may default to segmenting common organs unless the zero-shot embedding is explicitly selected.
Output
Description: Binary segmentation masks for 3D medical images, delineating anatomical structures, lesions, or tumors. Output can be generated automatically or interactively refined.
Recommendation
Recommended for researchers and clinicians requiring versatile, accurate, and efficient 3D medical image segmentation across a broad range of anatomical structures and pathologies, particularly for applications involving large datasets, surgical planning, or the analysis of novel/rare findings.
Reproducibility
Code and weights are publicly available, enabling reproducibility of the model and its results.
Sustainability
Training required 64 32GB NVIDIA V100 GPUs for approximately 20,000 total GPU hours. Inference on a 16GB V100 GPU for 118-class automatic segmentation ranges from 1m07s (333x333x603 image size) to 9m20s (1024x1024x512 image size). Single-click interactive inference takes 3.2s on a 12GB GPU.
Use
Intended: Image segmentation, Automatic segmentation, Interactive segmentation, Zero-shot segmentation, Image annotation, Surgical planning, Large cohort analysis, Tumor detection, Organ segmentation
Out-of-scope: 2D image segmentation, Diagnosis without human oversight
User
Intended: Diagnostic radiologist, Physician, Researcher, Medical annotator, AI developer
Out-of-scope: General public