
Researchers introduce Norton, a technique for handling multi-granularity noisy correspondence in long-term video-language learning, aiming to capture the temporal dependencies that short-clip methods miss.
Video-language research often focuses on short clips because modeling extended sequences is computationally expensive, which leaves long-term temporal dependencies largely unexplored. The recent paper by Yijie Lin et al., "Multi-granularity Correspondence Learning from Long-term Noisy Videos," introduces NOise Robust Temporal Optimal traNsport (Norton), a method that addresses the multi-granularity noisy correspondence (MNC) problem. MNC covers both coarse-grained clip-caption misalignment and fine-grained frame-word misalignment, either of which can hinder temporal learning and video understanding.
Norton employs a multi-step process to address MNC in long-term video-language learning:
Video-Paragraph Contrastive Learning:
Alignable Prompt Bucket:
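To make the optimal-transport idea concrete, the sketch below aligns clip and caption embeddings with entropy-regularised Sinkhorn iterations and appends one extra "bucket" row and column, so that clips or captions with no good match have somewhere to send their mass. It is an illustrative assumption about how such a bucket can work, not the authors' implementation: the function names, the fixed `bucket_score`, the uniform marginals, and the toy embeddings are all hypothetical.

```python
import math


def cosine_similarity(clips, captions):
    """Return the clip-by-caption cosine similarity matrix as nested lists."""

    def unit(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # guard against zero vectors
        return [x / norm for x in vec]

    clip_units = [unit(v) for v in clips]
    caption_units = [unit(v) for v in captions]
    return [[sum(a * b for a, b in zip(c, t)) for t in caption_units]
            for c in clip_units]


def sinkhorn(similarity, epsilon=0.1, n_iters=50):
    """Entropy-regularised optimal transport with uniform marginals (assumed)."""
    n_rows, n_cols = len(similarity), len(similarity[0])
    # Higher similarity means lower transport cost, so the kernel is exp(sim / eps).
    kernel = [[math.exp(s / epsilon) for s in row] for row in similarity]
    row_mass, col_mass = 1.0 / n_rows, 1.0 / n_cols
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(n_iters):
        # Alternately rescale rows and columns toward the target marginals.
        u = [row_mass / sum(kernel[i][j] * v[j] for j in range(n_cols))
             for i in range(n_rows)]
        v = [col_mass / sum(kernel[i][j] * u[i] for i in range(n_rows))
             for j in range(n_cols)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(n_cols)]
            for i in range(n_rows)]


def transport_with_bucket(clips, captions, bucket_score=0.3):
    """Hypothetical helper: run Sinkhorn with an extra bucket row and column.

    bucket_score acts as a similarity threshold: a clip whose best caption
    match scores well below it sends most of its mass to the bucket instead.
    """
    sim = cosine_similarity(clips, captions)
    padded = [row + [bucket_score] for row in sim]        # bucket column
    padded.append([bucket_score] * (len(captions) + 1))   # bucket row
    plan = sinkhorn(padded)
    # Drop the bucket row and column to keep only clip-to-caption mass.
    return [row[:-1] for row in plan[:-1]]


if __name__ == "__main__":
    # Toy 2-D embeddings: the third clip matches neither caption.
    clips = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
    captions = [[1.0, 0.0], [0.0, 1.0]]
    for row in transport_with_bucket(clips, captions):
        print([round(p, 3) for p in row])
```

In this toy input the third clip points away from both captions, so nearly all of its mass lands in the bucket and its row in the returned plan stays close to zero, while the first two clips keep most of their mass on their matching captions. A learned or tuned bucket value would replace the fixed `bucket_score` in any realistic setting.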

Norton was trained on the HowTo100M dataset, which is widely used for video-language tasks. The authors provide preprocessed data features that can be downloaded from Baidu Cloud Disk (password: nk6e). Detailed instructions for processing the data are available in the project's GitHub repository.
The authors evaluate Norton on video retrieval, video question answering (videoQA), and action segmentation. They report that the method improves how well long-term temporal dependencies are captured and that it handles noisy correspondences more robustly.
Norton tackles the multi-granularity noisy correspondence problem in long-term video-language learning by combining optimal transport with contrastive learning. The approach targets both better alignment accuracy and more precise temporal modeling.
Original Sources
↗ https://lin-yijie.github.io/projects/Norton/?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
1 February 2024