Multimodal Pathway: Enhancing Transformers with Irrelevant Data from Other Modalities

Models & Research

The Engineer

29 Jan 2024 · 3 min read

Researchers at The Chinese University of Hong Kong and Tencent AI Lab propose using irrelevant data from other modalities to enhance transformers, improving performance on a target modality even though the extra data is unrelated to the target task's inputs.

CVPR 2024

Authors: Yiyuan Zhang, Xiaohan Ding, Kaixiong Gong, Yixiao Ge, Ying Shan, Xiangyu Yue
Affiliations: Multimedia Lab, The Chinese University of Hong Kong; Tencent AI Lab

Transformers have become a cornerstone of deep learning, from natural language processing to computer vision. One question remains open: can data from other modalities (e.g., images, audio, point clouds) improve performance on a task in a specific modality? The paper "Multimodal Pathway" introduces an approach that uses irrelevant data from other modalities, meaning data that is not paired with the target data, to enhance a transformer for the target modality. The method improves model performance and fits a broader vision of multimodal learning.

Key Contributions

Enhancement via Irrelevant Data: The paper demonstrates how data from one modality can be used to improve the performance of a transformer trained on another, even if the data is irrelevant (i.e., not paired or interleaved).
Cross-Modal Re-parameterization: A technique that allows the model to utilize weights from an auxiliary transformer without incurring additional inference costs.
Consistent Performance Improvements: The method shows significant gains across multiple modalities, including image, point cloud, video, and audio datasets.

Technical Details

Architecture Overview

The Multimodal Pathway (M2PT) framework consists of two main components:

Target Transformer: A transformer designed for the target modality (e.g., a Vision Transformer for ImageNet classification).
Auxiliary Transformer: A transformer trained on data from a different modality (e.g., point cloud or audio).

The key idea is to construct pathways that connect the layers of these two transformers, allowing the target model to benefit from the auxiliary model's learned representations.
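As a rough illustration of what connecting layers could look like in code, the sketch below pairs each layer of a target model with the same-named layer of an auxiliary model and keeps only the pairs whose weight shapes match. It assumes both transformers share one architecture so that layers correspond one-to-one; the layer names, the shapes, and the `build_pathways` helper are hypothetical and not taken from the paper.

```python
# Illustrative sketch only: layer names, shapes, and this helper are hypothetical.

def build_pathways(target_shapes, aux_shapes):
    """Pair same-named layers whose weight shapes agree.

    Both arguments map a layer name to its weight shape, e.g. (768, 768).
    Layers missing from the auxiliary model, or with a different shape,
    get no pathway.
    """
    return [name for name, shape in target_shapes.items()
            if aux_shapes.get(name) == shape]

target_shapes = {
    "tokenizer.proj": (768, 768),       # modality-specific input layer
    "blocks.0.attn.qkv": (768, 2304),
    "blocks.0.mlp.fc1": (768, 3072),
    "head": (768, 1000),                # task-specific output layer
}
aux_shapes = {
    "tokenizer.proj": (3, 768),         # different input format, so no pathway
    "blocks.0.attn.qkv": (768, 2304),
    "blocks.0.mlp.fc1": (768, 3072),
    "head": (768, 40),                  # different label set, so no pathway
}
pathways = build_pathways(target_shapes, aux_shapes)
```

In this toy setup only the two block layers are paired, since the input and output layers differ between the two models.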

Cross-Modal Re-parameterization

To integrate the auxiliary model without additional inference costs, the authors propose Cross-Modal Re-parameterization (CMR). This technique involves:

Weight Sharing: The target transformer reuses the weights of the auxiliary transformer alongside its own.
Pathway Construction: Pathways are created to connect specific layers of the two models, enabling the target model to leverage the auxiliary model's learned features.

For example, in an ImageNet task, a point cloud-trained auxiliary transformer can be connected to an MAE-pretrained Vision Transformer (ViT) via CMR. This connection allows the ViT to process image data using both its own weights and those from the point cloud model, leading to improved performance.
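One way to see why such a pathway need not cost anything at inference: if a pathway is a parallel linear branch whose output is scaled and added to the target layer's output, the two branches fold into a single weight matrix, because xW + lam * (xW') = x(W + lam * W'). The sketch below checks that identity in plain Python. The scaled-sum form, the scalar `lam`, and the helper names are assumptions made for illustration, not the authors' code.

```python
# Illustrative sketch only: assumes each pathway is a parallel linear branch
# whose output is scaled by a scalar `lam` and added to the target branch.
# All names here are hypothetical.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def two_branch_forward(x, w_target, w_aux, lam):
    """Training-time view: run both branches, then combine their outputs."""
    y_target = matmul(x, w_target)
    y_aux = matmul(x, w_aux)
    return [[t + lam * a for t, a in zip(rt, ra)] for rt, ra in zip(y_target, y_aux)]

def merge_weights(w_target, w_aux, lam):
    """Inference-time view: fold the auxiliary weights into one matrix."""
    return [[t + lam * a for t, a in zip(rt, ra)] for rt, ra in zip(w_target, w_aux)]

x = [[1.0, 2.0], [3.0, 4.0]]            # two tokens, feature dimension 2
w_target = [[0.5, -1.0], [2.0, 0.0]]    # target-modality weights
w_aux = [[1.0, 1.0], [-2.0, 4.0]]       # auxiliary-modality weights, same shape
lam = 0.25                              # pathway scale

y_two_branch = two_branch_forward(x, w_target, w_aux, lam)
y_merged = matmul(x, merge_weights(w_target, w_aux, lam))
```

Both paths produce the same outputs, so under this assumption the merged matrix can replace the two branches once training is done, leaving a layer the same size as the original.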

Implementation Details

Tokenizer: A modality-specific tokenizer is used to convert input data into token sequences.
Task-Specific Head: The final layer of the target transformer is tailored to the specific task (e.g., classification for ImageNet).
Pathway Construction: Pathways are constructed by adding skip connections or residual blocks between the layers of the two transformers.
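The three pieces above can be arranged as in the following minimal sketch. It assumes a ViT-style patch tokenizer for images and stands in for the transformer blocks with a single linear map; `patchify`, `linear`, and `classify` are hypothetical names, and nothing here reproduces the authors' implementation.

```python
# Illustrative sketch only; every name below is hypothetical.

def patchify(image, patch):
    """Image tokenizer: split an H x W grid into flattened patch x patch tokens."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens

def linear(tokens, weight):
    """Apply one weight matrix (rows = input features) to every token."""
    return [[sum(x * w for x, w in zip(tok, col)) for col in zip(*weight)]
            for tok in tokens]

def classify(tokens, head_weight):
    """Task-specific head: mean-pool the tokens, then pick the top-scoring class."""
    pooled = [sum(col) / len(tokens) for col in zip(*tokens)]
    scores = linear([pooled], head_weight)[0]
    return scores.index(max(scores))

image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
tokens = patchify(image, patch=2)                  # 4 tokens of length 4
block_weight = [[1, 0], [0, 1], [1, 0], [0, 1]]    # stand-in for the transformer blocks
features = linear(tokens, block_weight)            # 4 tokens of length 2
head_weight = [[1, 0, -1], [0, 1, -1]]             # 2 features -> 3 classes
predicted_class = classify(features, head_weight)
```

Swapping `patchify` for a tokenizer suited to another modality, or `classify` for a different head, leaves the middle stage untouched, which is the part a pathway to an auxiliary model would act on.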

Experimental Results

The authors conducted experiments on various datasets and tasks, including:

ImageNet-1K: A point cloud-trained auxiliary transformer improved the accuracy of an MAE-pretrained ViT by 0.6%.
Point Cloud Datasets: Similar improvements were observed when using image or audio data as the auxiliary modality.
Video and Audio Tasks: The method also showed consistent gains across these modalities.

Why It Matters

The Multimodal Pathway approach addresses a fundamental challenge in deep learning: how to make effective use of diverse data sources. By drawing on irrelevant data from other modalities, the method improves model performance and opens new avenues for research in multimodal learning. Training models that can "do many things" and draw on multiple senses aligns with the broader vision of general AI described by researchers like Jeff Dean.

Conclusion

The Multimodal Pathway framework represents a significant step forward in leveraging