Norton: Addressing Multi-Granularity Noisy Correspondence in Long-Term Video-Language Learning

The Engineer

1 Feb 2024 · 3 min read

Researchers introduce Norton, a method for handling multi-granularity noisy correspondence in long-term video-language learning, targeting the long-range temporal dependencies that short-clip methods leave unmodeled.

Introduction

Video-language learning on long videos is limited by the computational cost of modeling extended sequences, so most existing work focuses on short clips. That focus leaves long-term temporal dependencies largely unexplored. The recent paper by Yijie Lin et al., titled "Multi-granularity Correspondence Learning from Long-term Noisy Videos," introduces NOise Robust Temporal Optimal traNsport (Norton), a method that addresses the multi-granularity noisy correspondence (MNC) problem in long-term video-language learning. MNC covers two kinds of misalignment: coarse-grained misalignment between clips and captions, and fine-grained misalignment between frames and words. Both hinder temporal learning and video understanding.

Key Contributions

Unified Optimal Transport Framework: Norton leverages optimal transport (OT) to handle both coarse-grained and fine-grained misalignments.
Contrastive Losses: Uses video-paragraph and clip-caption contrastive losses to capture long-term dependencies.
Alignable Prompt Bucket: Filters out irrelevant clips and captions to improve alignment accuracy.
Soft-Maximum Operator: Identifies crucial words and key frames for fine-grained alignment.
Faulty Negative Sample Rectification: Ensures precise temporal modeling by rectifying the alignment target with OT assignment.

Method Overview

Norton employs a multi-step process to address MNC in long-term video-language learning:

Video-Paragraph Contrastive Learning:
- Fine-to-Coarse Perspective: Norton first computes a frame-word similarity matrix for each clip-caption pair and aggregates it with the log-sum-exp operator, a smooth approximation of the maximum, into a single fine-grained similarity score.
- Clip-Caption Similarity Matrix: Collecting these scores over all clips and captions of a video gives the clip-caption similarity matrix, which captures coarse-grained similarity.
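The fine-to-coarse step can be illustrated with a small sketch. It is not the authors' implementation: the function names, the temperature-like parameter `alpha`, and the choice to take a soft maximum over words for each frame and then average over frames are assumptions made for illustration, and the similarity values are made up.

```python
import math

def soft_max(values, alpha=10.0):
    """Log-sum-exp soft maximum: (1 / alpha) * log(sum(exp(alpha * v)))."""
    m = max(values)  # subtract the max before exponentiating for stability
    return m + math.log(sum(math.exp(alpha * (v - m)) for v in values)) / alpha

def clip_caption_similarity(frame_word_sim, alpha=10.0):
    """Aggregate a frames x words similarity matrix into one score.

    Assumption: soft maximum over words for each frame, then a mean over frames.
    """
    per_frame = [soft_max(row, alpha) for row in frame_word_sim]
    return sum(per_frame) / len(per_frame)

# Toy clip with 2 frames and a caption with 3 words (made-up similarities).
frame_word_sim = [
    [0.1, 0.9, 0.2],
    [0.3, 0.2, 0.7],
]
score = clip_caption_similarity(frame_word_sim, alpha=50.0)
```

With a large `alpha` the score approaches the mean of each frame's best-matching word; a smaller `alpha` lets the other words contribute more.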
Alignable Prompt Bucket:
- Filtering Irrelevant Pairs: Norton appends an alignable prompt bucket to the clip-caption similarity matrix, which helps filter out irrelevant clips or captions.
- Realignment Based on Transport Distance: By applying Sinkhorn iterations, Norton realigns asynchronous clip-caption pairs and calculates the optimal transport distance as the video-paragraph similarity.
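The bucket and realignment steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the paper's code: the helper names, the bucket value `p`, the regularization `eps`, the uniform marginals, and the decision to sum the score over non-bucket entries only are all choices made here for the example.

```python
import math

def append_bucket(sim, p):
    """Return a copy of sim with one extra row and column filled with p."""
    padded = [row + [p] for row in sim]
    padded.append([p] * (len(sim[0]) + 1))
    return padded

def sinkhorn(sim, eps=0.05, n_iters=200):
    """Entropy-regularized OT favoring high similarity (uniform marginals assumed)."""
    n, m = len(sim), len(sim[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m
    K = [[math.exp(s / eps) for s in row] for row in sim]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def video_paragraph_similarity(clip_caption_sim, p=0.3, eps=0.05):
    """Transport-weighted similarity, ignoring mass sent to the bucket."""
    n, m = len(clip_caption_sim), len(clip_caption_sim[0])
    Q = sinkhorn(append_bucket(clip_caption_sim, p), eps)
    Q_core = [row[:m] for row in Q[:n]]  # drop the bucket row and column
    score = sum(Q_core[i][j] * clip_caption_sim[i][j]
                for i in range(n) for j in range(m))
    return score, Q_core

# Made-up example: clips 0 and 1 match captions 1 and 0 (asynchronous order);
# clip 2 and caption 2 are irrelevant and score below the bucket value.
clip_caption_sim = [
    [0.1, 0.9, 0.1],
    [0.8, 0.1, 0.2],
    [0.1, 0.1, 0.1],
]
score, Q_core = video_paragraph_similarity(clip_caption_sim)
```

In this toy case most of the mass of clip 0 goes to caption 1 and of clip 1 to caption 0, while clip 2 sends most of its mass to the bucket because none of its similarities exceed `p`.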

Fine-Grained Misalignment Handling:
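- Soft-Maximum Operator: The log-sum-exp aggregation acts as a soft maximum, so the most similar frame-word pairs (crucial words and key frames) dominate the clip-caption score, while weakly related frames and words contribute little.
- Faulty Negative Sample Rectification: In clip-caption contrast, every non-matching pair is treated as a negative, even when the clip and caption are in fact semantically related. Norton rectifies the alignment target with the OT assignment so that such faulty negatives are not fully penalized.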
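One way to picture the rectification is as a contrastive loss with soft targets. The sketch below assumes the rectified target is a blend of the usual one-hot (diagonal) target with a row-normalized OT assignment, weighted by a hypothetical mixing coefficient `beta`. The function names, `beta`, the temperature `tau`, and the example numbers are illustrative assumptions, not values taken from the paper.

```python
import math

def rectified_targets(assignment, beta=0.3):
    """Blend one-hot diagonal targets with a row-normalized OT assignment."""
    targets = []
    for i, row in enumerate(assignment):
        total = sum(row)
        targets.append([
            (1.0 - beta) * (1.0 if i == j else 0.0) + beta * q / total
            for j, q in enumerate(row)
        ])
    return targets

def soft_target_contrastive_loss(sim, targets, tau=0.07):
    """Cross-entropy between soft targets and softmax(sim / tau), clip-to-caption."""
    loss = 0.0
    for sim_row, target_row in zip(sim, targets):
        logits = [s / tau for s in sim_row]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss -= sum(t * (l - log_z) for t, l in zip(target_row, logits))
    return loss / len(sim)

# Made-up batch of 3 clip-caption pairs: clip 0 and caption 1 are semantically
# close, so treating them as a pure negative would be a faulty negative.
sim = [
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.1],
    [0.1, 0.1, 0.9],
]
# Hypothetical OT assignment for this batch (e.g. from Sinkhorn iterations).
assignment = [
    [0.20, 0.13, 0.00],
    [0.13, 0.20, 0.00],
    [0.00, 0.00, 0.33],
]
targets = rectified_targets(assignment, beta=0.3)
loss = soft_target_contrastive_loss(sim, targets)
# beta = 0 recovers the standard one-hot contrastive target.
baseline = soft_target_contrastive_loss(sim, rectified_targets(assignment, beta=0.0))
```

Setting `beta` to 0 gives the ordinary contrastive loss; a larger `beta` moves more of the target mass onto pairs that the OT assignment considers related.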

Dataset

Norton was trained on the HowTo100M dataset, which is widely used for video-language tasks. The authors provide preprocessed data features that can be downloaded from Baidu Cloud Disk (password: nk6e). Detailed instructions for processing the data are available in the project's GitHub repository.

Training Dataset: HowTo100M
Data Feature Download: Baidu Cloud Disk
Processing Instructions: GitHub Repository

Experimental Results

The authors evaluate Norton on video retrieval, video question answering (VideoQA), and action segmentation. They report that the method improves on these tasks, both in capturing long-term temporal dependencies and in robustness to noisy correspondence.

Conclusion

Norton addresses the multi-granularity noisy correspondence problem in long-term video-language learning by combining optimal transport with contrastive learning. Optimal transport realigns asynchronous clips and captions and filters out irrelevant ones, the soft-maximum operator handles frame-word misalignment, and the rectified alignment targets reduce the effect of faulty negatives in clip-caption contrast.