UnIVAL: Unified Model for Image, Video, Audio and Language Tasks

a) Sorbonne University     b) Valeo.ai  

UnIVAL model. Our sequence-to-sequence model unifies the architecture, tasks, input/output format, and training objective (next token prediction). UnIVAL is pretrained on image and video-text tasks and can be finetuned to tackle new modalities (audio-text) and tasks (text-to-image generation) that were not used during pretraining.


Large Language Models (LLMs) have made the ambitious quest for generalist agents significantly less of a fantasy. A key hurdle in building such general models is the diversity and heterogeneity of tasks and modalities. A promising solution is unification: a single model that supports a myriad of tasks and modalities while scaling easily. While a few large models (e.g., Flamingo (Alayrac et al., 2022)), trained on massive datasets, can support more than two modalities, current small- to mid-scale unified models are still limited to two modalities (e.g., image-text or video-text). The question we ask is: is it possible to efficiently build a unified model that can support all modalities? To answer this, we propose UnIVAL, a step further towards this ambitious goal. Without relying on large dataset sizes or models with billions of parameters, the ~0.25B parameter UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model. Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning. UnIVAL shows performance competitive with existing state-of-the-art approaches across image- and video-text tasks. The representation learned from image- and video-text modalities allows the model to reach performance competitive with SoTA when finetuned on audio-text tasks, despite never being pretrained on audio. Thanks to the unified model, we propose a novel study of multimodal model merging via weight interpolation of models trained on different multimodal tasks, showing its benefits for out-of-distribution generalization. Finally, we motivate unification by showing the synergy between tasks.

Findings and Contributions

  • To the best of our knowledge, UnIVAL is the first model, with a unified architecture, vocabulary, input/output format, and training objective, that is able to tackle image, video, and audio language tasks without relying on large-scale training or large model sizes. Our 0.25B parameter model achieves performance competitive with existing modality-customized work. At comparable model sizes, we achieve new SoTA on some tasks (e.g., +1.4/+0.98/+0.46 points accuracy on RefCOCO/RefCOCO+/RefCOCOg Visual Grounding, +3.4 CIDEr on Audiocaps).
  • We show the benefits of multimodal curriculum learning with task balancing for efficiently training the model beyond two modalities.
  • We show the importance of multitask pretraining compared to the standard single-task one, and study the synergy and knowledge transfer between pretrained tasks and modalities. In addition, we find that pretraining on more modalities helps the model generalize better to new ones. In particular, without any audio pretraining, UnIVAL attains performance competitive with SoTA when finetuned on audio-text tasks.
  • We propose a novel study on multimodal model merging via weight interpolation. We show that, thanks to our unified pretraining and model, when the model is finetuned on different multimodal tasks, weight interpolation can effectively combine the skills of different finetuned weights and improve generalization, creating more robust multitask models without any inference overhead. Thus, in addition to multitask pretraining, averaging differently finetuned weights is another way to leverage and recycle the diversity of multimodal tasks, enabling their collaboration. This is the first study of weight interpolation showing its effectiveness with multimodal foundation models.

Unification of UnIVAL

UnIVAL is unified along 4 axes:

Unified model/architecture: we use the same model during pretraining and finetuning of all tasks, without any task-specific heads. Our model's core is an LM designed to process abstract representations. We employ an encoder-decoder transformer as the LM, due to its effectiveness for multimodal tasks and its zero-shot generalization after multitask training. The LM is enhanced with lightweight modality-specific projections/encoders that map the different modalities to a shared, more abstract representation space. Each encoder extracts a feature map, which is then flattened to generate a sequence of tokens. These tokens are linearly projected to match the input dimension of the LM. In our approach, we opt for CNN encoders as they scale effectively with high-resolution inputs, minimize the number of output tokens, and exhibit improved efficiency during both inference and training compared to transformers.
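The mapping from an encoder's feature map to LM input tokens can be sketched as follows. This is a minimal illustration in plain Python (no deep-learning framework); the shapes and function names are illustrative assumptions, not the actual UnIVAL implementation.

```python
def flatten_feature_map(feature_map):
    """Turn a (C, H, W) feature map into a list of H*W feature vectors of size C."""
    C = len(feature_map)
    H = len(feature_map[0])
    W = len(feature_map[0][0])
    tokens = []
    for i in range(H):
        for j in range(W):
            # one token per spatial position, gathering all channels
            tokens.append([feature_map[c][i][j] for c in range(C)])
    return tokens

def linear_project(tokens, weight, bias):
    """Project each C-dim token to the LM input dimension D via a (D, C) weight."""
    projected = []
    for tok in tokens:
        projected.append([
            sum(w * x for w, x in zip(row, tok)) + b
            for row, b in zip(weight, bias)
        ])
    return projected
```

In practice this corresponds to a flatten followed by a single learned linear layer per modality, which keeps the modality-specific machinery lightweight relative to the shared LM.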

Unified input/output format: the input/output of all tasks consists of a sequence of tokens, where we use a unified vocabulary that contains text, location, and discrete image tokens.

Unified pretraining tasks: to train a single model on many tasks, a unified representation of these tasks is necessary. We transform all tasks into a sequence-to-sequence format, where each task is specified by a textual prompt (e.g., "What does the video describe?" for video captioning). For pretraining tasks, we pretrain on many relatively small public datasets.
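The task unification above amounts to rendering every task as a (prompt, target) pair of token sequences. A minimal sketch, where the prompt strings and task names are illustrative examples rather than UnIVAL's exact prompts:

```python
# Hypothetical prompt templates; only the video-captioning prompt is
# quoted from the text above.
PROMPTS = {
    "image_captioning": "What does the image describe?",
    "video_captioning": "What does the video describe?",
    "vqa": "{question}",
}

def to_seq2seq(task, target, **fields):
    """Render a task example as a (prompt, target) pair of strings."""
    prompt = PROMPTS[task].format(**fields)
    return prompt, target
```

With this formatting, a single sequence-to-sequence model can be trained on all tasks at once, since every example looks the same to the model: an input token sequence and a target token sequence.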

Unified training objective: We optimize the model for conditional next token prediction. Specifically, we only use a cross-entropy loss.
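Concretely, the objective is the average cross-entropy of the target tokens under the model's next-token distribution. A self-contained numerical sketch (names are illustrative):

```python
import math

def next_token_cross_entropy(logits_per_step, target_ids):
    """Mean cross-entropy loss for next-token prediction.

    logits_per_step: one list of unnormalized vocabulary scores per target position.
    target_ids: the ground-truth token index at each position.
    """
    total = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        # negative log-softmax probability of the target token
        log_z = math.log(sum(math.exp(x) for x in logits))
        total += -(logits[target] - log_z)
    return total / len(target_ids)
```

Because every task is cast into the same sequence-to-sequence format, this single loss suffices for all pretraining and finetuning tasks.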


We evaluate the model on several multimodal tasks, such as Image/Video/Audio Captioning, Image/Video QA, Visual Grounding, Visual Entailment, and Text-to-Image Generation. In the following we present only a few results, including some on multimodal model merging.

Finetuning for Visual Grounding on RefCOCO, RefCOCO+, and RefCOCOg datasets. UnIVAL achieves the new SoTA results among comparable model sizes.

Finetuning for Video Captioning on MSRVTT dataset. UnIVAL is competitive with other task/modality-customized SoTA that are trained on larger datasets.

Finetuning on the new audio-text modality for audio-captioning. We compare UnIVAL to other audio-text models on Audiocaps and Clotho v1 datasets. Despite not using audio-text during pretraining, UnIVAL is very competitive with other customized SoTA. We compare with models that rely only on audio as input. The best and next best scores are bolded and underlined respectively.

We also evaluate UnIVAL without finetuning on seen and unseen datasets:

Evaluation without finetuning. UnIVAL outperforms OFA and is competitive with Unified-IO, which is trained on more data.

Zero-Shot Evaluation. Scores in gray mean the dataset was used during pretraining. UnIVAL is competitive with modality-specific models.

We now present results on multimodal model merging. Specifically, we average, and interpolate between, models trained on different multimodal tasks.

Weight interpolation between models trained on different multimodal tasks.
Finetuning for OOD. We uniformly average the models finetuned on 4 image-text tasks and evaluate the resulting model on the same (ID) and new (OOD) tasks.
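The two merging operations used here, uniform averaging and pairwise interpolation, can be sketched on state dicts represented as plain {name: list-of-floats} mappings. This is a simplified illustration, not the experimental code; it works precisely because all finetuned models start from the same unified pretrained model, so their weights live in a shared space.

```python
def uniform_average(state_dicts):
    """Average N finetuned models element-wise (the OOD experiment above)."""
    n = len(state_dicts)
    return {
        name: [sum(sd[name][i] for sd in state_dicts) / n
               for i in range(len(state_dicts[0][name]))]
        for name in state_dicts[0]
    }

def interpolate(sd_a, sd_b, lam):
    """(1 - lam) * sd_a + lam * sd_b, element-wise, for two finetuned models."""
    return {
        name: [(1 - lam) * a + lam * b
               for a, b in zip(sd_a[name], sd_b[name])]
        for name in sd_a
    }
```

The merged weights are loaded back into the single shared architecture, so combining skills this way adds no inference overhead compared to any one finetuned model.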

Qualitative Results

Some qualitative results on image-text tasks.


This work was supported by HPC resources of CINES and GENCI. The authors would like to thank the staff of CINES for technical support in managing the Adastra GPU cluster, in particular: Jean-Christophe Penalva, Johanne Charpentier, Mathieu Cloirec, Jerome Castaings, Gérard Vernou, Bertrand Cirou and José Ricardo Kouakou. This work was also partly supported by ANR grant VISA DEEP (ANR-20-CHIA-0022).


@article{shukor2023unival,
	title={Un{IVAL}: Unified Model for Image, Video, Audio and Language Tasks},
	author={Mustafa Shukor and Corentin Dancette and Alexandre Rame and Matthieu Cord},
	journal={Transactions on Machine Learning Research},
	year={2023}
}