A Multimodal Class-Incremental Learning Benchmark for Classification Tasks

2024·

Marco D'Alessandro

Enrique Calabrés

Mikel Elkano

David Vázquez

Abstract

Continual learning has made significant progress in addressing catastrophic forgetting in vision and language domains, yet the majority of research has treated these modalities separately. The exploration of multimodal continual learning remains sparse, with a few existing works focused on specific applications like VQA, text-to-vision retrieval, and incremental multi-tasking. These efforts lack a general benchmark to standardize the evaluation of models in multimodal continual learning settings. In this paper, we introduce a novel benchmark for Multimodal Class-Incremental Learning (MCIL), designed specifically for multimodal classification tasks. The benchmark comprises a curated selection of multimodal datasets tailored to classification challenges. We further adapt a widely used Vision-Language model to multiple existing continual learning strategies, providing crucial insights into the behavior of vision-language models in incremental classification tasks. This work represents the first comprehensive framework for MCIL, establishing a foundation for future research in multimodal continual learning.

Type

Manuscript

Publication

Workshop at Advances in Neural Information Processing Systems (NeurIPS)

Last updated on Apr 15, 2026

← Improved Training Set Selection for Semi-Supervised Learning Dec 1, 2024

CADet: Fully Self-Supervised Out-of-Distribution Detection with Contrastive Learning Jan 1, 2024 →

No results found

A Multimodal Class-Incremental Learning Benchmark for Classification Tasks