ChatGPT (GPT-3.5 and GPT-4) for Brazilian Radiology Board Exam Question Answering
Type: model
Date: 2025-11-30
DOI: https://doi.org/10.1148/atlas.1764531770889

Overview

Schema Version

https://atlas.rsna.org/schemas/2025-11/model.json

Name

ChatGPT (GPT-3.5 and GPT-4) for Brazilian Radiology Board Exam Question Answering

Link

https://dx.doi.org/10.1148/ryai.230103

Indexing

Keywords: ChatGPT, Artificial Intelligence, Board Examinations, Radiology and Diagnostic Imaging, Mammography, Neuroradiology
Content: ED, IN, OT
RadLex: RID13060, RID10357

Author(s)

Leonardo C. Almeida
Eduardo M. J. M. Farina
Paulo E. A. Kuriki
Nitamar Abdala
Felipe C. Kitamura

Organization(s)

Universidade Federal de São Paulo (UNIFESP) – Department of Artificial Intelligence and Management; Graduate Program in Medicine (Clinical Radiology)
AI Lab, Dasa

Version

1.0

License

Text: © 2023 by the Radiological Society of North America, Inc.

Contact

leonardo.canela@unifesp.br

Funding

The authors declared no funding for this work.

Ethical review

This prospective exploratory study did not involve human subjects or patient data and therefore did not require institutional review board approval.

Date

Updated: 2024-01-01
Published: 2023-11-08
Created: 2023-04-01

References

[1] Almeida LC, Farina EMJM, Kuriki PEA, Abdala N, Kitamura FC. "Performance of ChatGPT on the Brazilian Radiology and Diagnostic Imaging and Mammography Board Examinations". Radiology: Artificial Intelligence. 2024 Jan;6(1):e230103. Published online 2023-11-08. doi:10.1148/ryai.230103. PMID: 38294325; PMCID: PMC10831524.

Model

Architecture

Large language model based on transformer architecture (OpenAI GPT-3.5 and GPT-4).

Availability

Accessed through the official OpenAI chat completions API; maximum tokens were set to 2048 and temperature to 0.5 (as used in the study).
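The API configuration above can be sketched as a request payload. This is a minimal illustration: only the `max_tokens` and `temperature` values come from the study; the function name, default model string, and sample question are assumptions, and no network call is made.

```python
import json

def build_request(question: str, model: str = "gpt-4") -> dict:
    """Assemble a chat-completion payload with the parameters reported in the study."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": 2048,   # as reported in the study
        "temperature": 0.5,   # as reported in the study
    }

payload = build_request("Sample multiple-choice question text (options A-E)...")
print(json.dumps(payload, indent=2))
```

In practice this payload would be sent to the chat completions endpoint once per question; only the payload construction is shown here.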

Clinical benefit

Educational/assessment use: evaluates ability to answer radiology-related multiple-choice questions; not a clinical diagnostic tool.

Clinical workflow phase

Education; knowledge assessment/benchmarking.

Decision threshold

Passing threshold defined by the examinations: score ≥ 60%.

Degree of automation

Fully automated question answering given text prompts; no human-in-the-loop for answer generation.

Indications for use

Answer multiple-choice questions from Brazilian College of Radiology (CBR) theoretical board examinations (radiology and diagnostic imaging, mammography, and neuroradiology) in a research/benchmarking setting.

Input

Text-only multiple-choice questions (Portuguese) from 2022 CBR theoretical board examinations; five options per question; no image inputs.

Instructions

Zero-shot prompting; five prompt styles were tested: raw, brief instruction, long instruction, chain-of-thought ("Let us think about this step by step"), and question-specific automatic prompt generation (QAPG). Each exam was run five times per prompt style and model, and the median score was reported.
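The five-repetition protocol reduces to a median over repeated exam scores, compared against the 60% passing threshold stated above. A minimal sketch, with made-up score values (not from the paper):

```python
from statistics import median

# Hypothetical percent-correct scores for one model/prompt-style pair
# over five repeated runs of the same exam (illustrative values only).
runs = [62.0, 58.0, 64.0, 60.0, 61.0]

exam_score = median(runs)      # the study reports the median of the five runs
passed = exam_score >= 60.0    # passing threshold defined by the examinations
print(exam_score, passed)
```

Using the median rather than the mean makes the reported score robust to a single outlier run caused by sampling randomness at temperature 0.5.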

Limitations

Evaluation excluded image-based questions; prompting was zero-shot only (no few-shot examples or added context); exams were the 2022 editions available online; results depend on prompt style; randomness in LLM outputs was mitigated by five repetitions; the study is not a validation for clinical use; questions were in Portuguese only.

Output

CDEs: RDE448, RDE442
Description: For each question, the model outputs a selected option (A–E); study aggregates to exam scores (percentage correct) and related statistics.

Recommendation

Use for research/educational benchmarking and prompt engineering exploration; not recommended for clinical decision-making or certification purposes.

Regulatory information

Comment: Study assesses performance; no regulatory submission reported.
Authorization status: Not a regulated medical device; research use in question-answering benchmark.

Reproducibility

Each model/prompt-style combination was executed five times per exam and the median score used; statistical tests included the Wilcoxon signed-rank, Friedman, and Nemenyi tests; observed agreement was assessed; temperature was fixed at 0.5.
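The Friedman and Wilcoxon comparisons across prompt styles can be sketched with SciPy (an assumption; the paper does not state its software stack). The score vectors below are fabricated for illustration, one value per repeated run for three hypothetical prompt styles; the Nemenyi post hoc test is omitted because it is not in SciPy.

```python
from scipy import stats

# Illustrative per-run scores for three hypothetical prompt styles
# (values are made up, not from the paper).
raw   = [50, 51, 52, 53, 54]
brief = [56, 57, 58, 59, 60]
cot   = [61, 63, 65, 67, 69]

# Friedman test across the three related samples (runs are paired across styles)
stat, p = stats.friedmanchisquare(raw, brief, cot)
print("Friedman:", round(stat, 2), round(p, 4))

# Paired Wilcoxon signed-rank test between two styles
w, pw = stats.wilcoxon(raw, cot)
print("Wilcoxon:", w, round(pw, 4))
```

With five paired runs per style, the Wilcoxon test has limited power (its smallest attainable two-sided p value is 0.0625), which is one reason an omnibus Friedman test plus a post hoc procedure is a sensible design.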

Sustainability

Runtime and energy consumption were not reported; API parameters were max tokens 2048 and temperature 0.5.

Use

Intended: Other
Out-of-scope: Decision support, Detection and diagnosis
Excluded: Detection and diagnosis, Other

User

Intended: Other, Researcher
Out-of-scope: Patient
Excluded: Patient