
Researchers at The Chinese University of Hong Kong and Tencent AI Lab propose using irrelevant data from other modalities to enhance transformers, improving performance on a target task even though the extra data has no direct relevance to the input.
Authors: Yiyuan Zhang, Xiaohan Ding, Kaixiong Gong, Yixiao Ge, Ying Shan, Xiangyu Yue
Affiliations: Multimedia Lab, The Chinese University of Hong Kong; Tencent AI Lab
Links: arXiv, Paper, Code
Transformers have become a cornerstone of deep learning, from natural language processing to computer vision. One challenge remains: how can data from different modalities (e.g., images, audio, point clouds) improve performance on a specific task? The paper "Multimodal Pathway" by researchers from The Chinese University of Hong Kong and Tencent AI Lab introduces an approach that uses irrelevant data from other modalities, that is, data that is not paired with the target-modality samples, to enhance transformers for a target modality.
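The Multimodal Pathway (M2PT) framework consists of two main components: a target transformer for the modality of interest and an auxiliary transformer trained on data from another modality.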
The key idea is to construct pathways that connect the layers of these two transformers, allowing the target model to benefit from the auxiliary model's learned representations.
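One way to picture the pathways is as a pairing between matching layers of the two models. The sketch below is illustrative only: the layer names and shapes are hypothetical and are not taken from the M2PT code. It assumes both transformers share the same block architecture, so that a layer can be paired with its counterpart when the name and weight shape agree, while modality-specific input layers are left unpaired.

```python
# Hypothetical layer names and weight shapes for two transformers that
# share a block architecture. These are illustrative assumptions, not
# names or sizes from the M2PT repository.
target_shapes = {
    "patch_embed": (768, 768),          # image-specific input layer
    "blocks.0.attn.qkv": (768, 2304),
    "blocks.0.attn.proj": (768, 768),
    "blocks.0.mlp.fc1": (768, 3072),
    "blocks.0.mlp.fc2": (3072, 768),
}
auxiliary_shapes = {
    "point_embed": (384, 768),          # point-cloud-specific input layer
    "blocks.0.attn.qkv": (768, 2304),
    "blocks.0.attn.proj": (768, 768),
    "blocks.0.mlp.fc1": (768, 3072),
    "blocks.0.mlp.fc2": (3072, 768),
}


def build_pathways(target, auxiliary):
    """Return target layer names that have a same-named, same-shaped auxiliary layer."""
    return [name for name, shape in target.items() if auxiliary.get(name) == shape]


pathways = build_pathways(target_shapes, auxiliary_shapes)
for name in pathways:
    print(f"pathway: target[{name}] <-> auxiliary[{name}]")
```

In this toy setup only the transformer-block layers are paired; the input layers differ by modality and stay separate.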

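To integrate the auxiliary model without additional inference costs, the authors propose Cross-Modal Re-parameterization (CMR). During training, each linear layer in the target model's transformer blocks also uses the weights of the corresponding auxiliary layer, scaled by a learnable factor. After training, the scaled auxiliary weights are merged into the target weights, so the deployed model has the same size and cost as the original.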
For example, in an ImageNet task, a point cloud-trained auxiliary transformer can be connected to an MAE-pretrained Vision Transformer (ViT) via CMR. This connection allows the ViT to process image data using both its own weights and those from the point cloud model, leading to improved performance.
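The reason such a connection can be free at inference time is linearity: a layer that adds a scaled auxiliary branch computes the same output as a single layer with merged weights. The NumPy sketch below illustrates this under simplifying assumptions: one bias-free linear layer, random stand-in weights, and a fixed `scale` in place of a learned scalar. The function and variable names are hypothetical, not the authors' API.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 8, 4
W_target = rng.normal(size=(d_in, d_out))  # stand-in for a target-model layer weight
W_aux = rng.normal(size=(d_in, d_out))     # stand-in for the matching auxiliary layer
scale = 0.1                                # assumed value; learned during training in practice


def two_branch_forward(x, w_target, w_aux, scale):
    """Training-time view: target branch plus a scaled auxiliary branch."""
    return x @ w_target + scale * (x @ w_aux)


def merge_weights(w_target, w_aux, scale):
    """Inference-time view: fold the auxiliary branch into one weight matrix."""
    return w_target + scale * w_aux


x = rng.normal(size=(2, d_in))             # a toy batch of two token vectors

y_train = two_branch_forward(x, W_target, W_aux, scale)
W_merged = merge_weights(W_target, W_aux, scale)
y_infer = x @ W_merged                     # single matmul, same cost as the original layer

print(np.allclose(y_train, y_infer))
```

The merged matrix has the same shape as the original target weight, which is why the deployed model is no larger or slower than a model trained without the auxiliary weights.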
The authors conducted experiments on image, point cloud, video, and audio recognition tasks.
The Multimodal Pathway approach addresses a fundamental challenge in deep learning: how to effectively leverage diverse data sources. By introducing irrelevant data from other modalities, this method not only improves model performance but also opens up new avenues for research in multimodal learning. The ability to train models that can "do many things" and utilize multiple senses aligns with broader visions of general AI, as envisioned by researchers like Jeff Dean.
The Multimodal Pathway framework represents a significant step forward in leveraging data from unrelated modalities.
Original Sources
↗ https://ailab-cvc.github.io/M2PT/?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
29 January 2024