VLM3D Challenge – Task 2: Multi‑Abnormality Classification

Welcome to Task 2 of the Vision‑Language Modeling in 3D Medical Imaging (VLM3D) Challenge. In this task, participants develop algorithms that label 3D chest CT volumes with 18 clinically significant abnormalities.


Contents

  1. Overview
  2. Dataset
  3. Task Objective
  4. Participation Rules
  5. Evaluation & Ranking
  6. Prizes & Publication
  7. Citation
  8. Contact

Overview

Radiologists often screen chest CT scans for multiple co‑occurring pathologies. Automating this step can

  • Accelerate triage in busy emergency or outpatient workflows
  • Standardize reporting across institutions
  • Enable downstream AI tasks such as localization and report generation

Task 2 uses CT‑RATE, currently the largest public chest‑CT dataset with per‑scan abnormality labels, to benchmark multi‑label classification models for 3D data.


Dataset

Split           Patients   CT Volumes   Labels/Scan   Source
Train           20 000     ≈ 47 k       18            Istanbul Medipol University
Validation      1 304      ≈ 3 k        18            Istanbul Medipol University
Internal Test   2 000      2 000        hidden        Istanbul Medipol University
External Test   1 024      1 024        hidden        Boston University Hospital

Each scan is paired with a binary vector indicating the presence/absence of 18 thoracic findings (e.g., pleural effusion, lung nodule). Raw NIfTI volumes are provided together with full DICOM metadata.
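
For orientation, here is a minimal loading sketch using nibabel and pandas. The file name and the label CSV name/columns below are placeholders, not the official CT‑RATE layout; check the dataset release for the actual paths.

import nibabel as nib
import numpy as np
import pandas as pd

# Load one CT volume (placeholder file name; see the CT-RATE release for real paths).
volume = nib.load("train_1_a_1.nii.gz")
array = volume.get_fdata()                       # 3D intensity array
print(array.shape, volume.header.get_zooms())    # voxel grid and spacing in mm

# Load the per-scan abnormality labels (placeholder CSV name and column layout).
labels = pd.read_csv("multi_abnormality_labels.csv")
label_cols = [c for c in labels.columns if c != "VolumeName"]
y = labels[label_cols].to_numpy(dtype=np.int64)  # shape: (num_scans, 18), values 0/1
print(y.shape)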


Task Objective

Given a 3D chest CT volume, predict a binary label for each of the following 18 abnormalities:

  • Medical material, Arterial wall calcification, Cardiomegaly, Pericardial effusion, Coronary artery wall calcification, Hiatal hernia, Lymphadenopathy, Emphysema, Atelectasis, Lung nodule, Lung opacity, Pulmonary fibrotic sequela, Pleural effusion, Mosaic attenuation pattern, Peribronchial thickening, Consolidation, Bronchiectasis, Interlobular septal thickening

A valid submission must output one 18‑element prediction vector per scan, one value per abnormality.
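
To make the required output concrete, the sketch below writes one 18‑element vector per scan to a CSV file. The "VolumeName" column and the CSV layout are illustrative assumptions, not the official submission specification; follow the format published on the challenge platform.

import numpy as np
import pandas as pd

ABNORMALITIES = [
    "Medical material", "Arterial wall calcification", "Cardiomegaly",
    "Pericardial effusion", "Coronary artery wall calcification", "Hiatal hernia",
    "Lymphadenopathy", "Emphysema", "Atelectasis", "Lung nodule", "Lung opacity",
    "Pulmonary fibrotic sequela", "Pleural effusion", "Mosaic attenuation pattern",
    "Peribronchial thickening", "Consolidation", "Bronchiectasis",
    "Interlobular septal thickening",
]

def write_predictions(scan_ids, scores, path="predictions.csv"):
    # scores: array of shape (num_scans, 18), one prediction per abnormality.
    scores = np.asarray(scores)
    assert scores.shape == (len(scan_ids), len(ABNORMALITIES))
    df = pd.DataFrame(scores, columns=ABNORMALITIES)
    df.insert(0, "VolumeName", list(scan_ids))   # illustrative identifier column
    df.to_csv(path, index=False)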


Participation Rules

  • Method type: Fully automatic – no human input at inference.
  • Training data: Use CT‑RATE plus any publicly available data or models.
  • Team limits: Max 1 submission / day; the last valid run counts.
  • Organizer teams: May appear on the leaderboard but cannot win prizes.

Evaluation & Ranking

Classification Metrics

Metric      What it Measures
AUROC       Threshold‑free separability
F1 Score    Balance of precision and recall
CRG Score   Distribution‑aware, clinically weighted performance

Metrics are aggregated per abnormality, then macro‑averaged over all 18 classes.
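
As a rough sketch of that aggregation (not the official evaluation code), per‑class AUROC and F1 can be computed with scikit‑learn and then macro‑averaged; the 0.5 threshold for F1 is an assumption here, and each class is assumed to have both positive and negative cases in the evaluation set.

import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

def macro_auroc_f1(y_true, y_score, threshold=0.5):
    # y_true, y_score: arrays of shape (num_scans, 18).
    aurocs, f1s = [], []
    for k in range(y_true.shape[1]):                      # metric per abnormality
        aurocs.append(roc_auc_score(y_true[:, k], y_score[:, k]))
        f1s.append(f1_score(y_true[:, k], (y_score[:, k] >= threshold).astype(int)))
    return float(np.mean(aurocs)), float(np.mean(f1s))    # macro average over 18 classes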

Final Ranking

A point‑based scheme (VerSe / BraTS style):

  1. For each metric, run a two‑sided permutation test (10 000 samples) between every pair of teams.
  2. Award 1 point for each significant win.
  3. Rank by total points (higher = better). Ties share the same place.

Missing predictions receive the minimum score for that scan.
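
For intuition, here is a minimal sketch of one pairwise comparison, assuming each team is represented by its per‑scan values of a given metric; the organizers' actual test statistic and implementation may differ.

import numpy as np

def paired_permutation_test(scores_a, scores_b, n_perm=10_000, seed=0):
    # scores_a, scores_b: per-scan metric values of two teams on the same test cases.
    rng = np.random.default_rng(seed)
    diff = np.asarray(scores_a) - np.asarray(scores_b)
    observed = abs(diff.mean())
    hits = 0
    for _ in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=diff.shape)  # randomly swap team labels per case
        if abs((signs * diff).mean()) >= observed:
            hits += 1
    return hits / n_perm                                  # two-sided p-value

# A team earns 1 point on this metric if the difference is significant (e.g., p < 0.05)
# and its mean score is the higher of the two.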


Prizes & Publication

  • Awards – details TBA.
  • Every team with a valid submission will be invited to co‑author the joint challenge paper (MedIA / IEEE TMI).
  • An overview manuscript describing baseline results will appear on arXiv before the test phase closes.

Citation

If you use CT‑RATE or participate in VLM3D, please cite:

@article{hamamci2024developing,
  title   = {Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography},
  author  = {Hamamci, Ibrahim Ethem and Er, Sezgin and Almas, Furkan and others},
  journal = {arXiv preprint arXiv:2403.17834},
  year    = {2024}
}


@inproceedings{hamamci2025crg,
  title     = {CRG Score: A Distribution-Aware Clinical Metric for Radiology Report Generation},
  author    = {Hamamci, Ibrahim Ethem and Er, Sezgin and Shit, Suprosanna and Reynaud, Hadrien and Kainz, Bernhard and Menze, Bjoern},
  booktitle = {Medical Imaging with Deep Learning - Short Papers},
  year      = {2025}
}

Contact

For technical issues, open an issue or post on the challenge forum. For all other inquiries, use the “Help → Email organizers” link on the challenge site.