
Single-Model and Any-Modality for Video Object Tracking (CVPR 2024): Reading Notes

news 2026/2/10 0:47:36

Single-Model and Any-Modality for Video Object Tracking

Abstract
Related Work
Innovations
Method
- Shared embedding
- Modal prompting
- RGB Tracker based on Transformer
- Overall
Experiment
- Dataset
  - RGB-D samples are sourced from DepthTrack
  - RGB-T samples are extracted from LasHeR
  - RGB-E samples are obtained from VisEvent
- Comparative Experiments
- Model Generalization
- Key Component Analysis
- Ablation Studies
Conclusion and Future Work

This paper was published at CVPR 2024. Its research area is object tracking with auxiliary modalities.
The model can be summarized in one sentence:

Our primary focus is on multimodal tracking, with the constraint that only one modality is available at a time

Paper link
Reading notes

Abstract

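In video object tracking, auxiliary modalities such as depth, thermal, or event data have become valuable assets that complement RGB trackers. In practice, most existing RGB trackers learn a single set of parameters that is used across datasets and applications. For multi-modal tracking, however, a similar single-model unification faces several challenges. They stem from the inherent heterogeneity of the inputs (each has a modality-specific representation), the scarcity of multi-modal datasets, and the fact that not all modalities are present at all times. In this work, we introduce Un-Track, a unified tracker with a single set of parameters for any modality. To handle any modality, our method learns their common latent space through low-rank factorization and reconstruction techniques. More importantly, we use only RGB-X pairs to learn the common latent space. This unique shared representation seamlessly binds all modalities together, enabling effective unification and accommodating any missing modality, all within a single transformer-based architecture. On the DepthTrack dataset, our Un-Track achieves a +8.1 absolute F-score gain while introducing only +2.14 (over 21.50) GFLOPs and +6.6M (over 93M) parameters, through a simple yet efficient prompting strategy. Extensive comparisons on five benchmark datasets with different modalities show that Un-Track surpasses both state-of-the-art unified trackers and modality-specific counterparts, validating our effectiveness and practicality. The source code is publicly available at https://github.com/Zongwei97/UnTrack.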

Innovations

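A factorization prior is used to learn a common (shared) embedding from a low-rank latent space, which converts the heterogeneous modality representations into a unified one.
Learning one low-dimensional latent space for several modalities extracts part of their common semantics, but it may lose what is unique to each modality. To make full use of the auxiliary input, the method therefore adds external modal prompting.
This prompting strategy is simple, efficient, and lightweight, so it adds very few parameters and very little computation.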

Method

Shared embedding

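[Figure omitted]
There are seven of them in three categories. Judging from the code, each one is an MLP defined with a single linear layer.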
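
To make the low-rank idea concrete, here is a minimal NumPy sketch of "project down to a small latent, then reconstruct". It is not taken from the Un-Track repository: the helper names `linear` and `low_rank_embed`, the sizes `N`, `C`, `r`, and the random weights are all assumptions chosen for illustration, and the real model learns these layers during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Hypothetical single-linear-layer 'MLP': returns (weight, bias)."""
    w = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)
    b = np.zeros(out_dim)
    return w, b

def low_rank_embed(tokens, down, up):
    """Project tokens (N, C) to a rank-r latent (N, r), then reconstruct to (N, C)."""
    latent = tokens @ down[0] + down[1]
    recon = latent @ up[0] + up[1]
    return latent, recon

# Illustrative sizes only: token count, channel width, latent rank.
N, C, r = 16, 64, 8
down, up = linear(C, r), linear(r, C)

# Stand-in for the features of any auxiliary modality X (depth, thermal, event).
tokens = rng.standard_normal((N, C))
latent, recon = low_rank_embed(tokens, down, up)

print(latent.shape, recon.shape)
```

Because every token passes through the `r`-dimensional bottleneck, the reconstructed features can have rank at most `r`, which is what forces different modalities to share a compact representation.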

Modal prompting

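[Figure omitted]

Why are the features drawn separately like this? The whole RGB tracker is Transformer-based (essentially a ViT), so the image fed in for external modal prompting is split into patches, and a score function then classifies the resulting tokens.
As in the first module, the channels are fused first and then projected into a low-rank space by an MLP.
The module mainly deals with the uncertain tokens through token fusion: each uncertain token is fused with its neighboring reliable tokens, while the reliable tokens themselves are kept as they are.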
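
The fusion step can be sketched as follows. This is only one possible reading of the notes, under loud assumptions: neighbors are taken to be adjacent positions in a 1-D token sequence, fusion is a plain mean, and `fuse_uncertain` with its `window` parameter is a hypothetical helper. The paper's actual fusion operator may differ.

```python
import numpy as np

def fuse_uncertain(tokens, reliable, window=1):
    """Fuse each unreliable token with its reliable neighbours (plain mean).

    tokens:   (N, C) array of token features
    reliable: (N,) boolean mask, True for tokens treated as reliable
    window:   how many positions to look left and right (hypothetical parameter)
    """
    fused = tokens.copy()
    n = len(tokens)
    for i in np.flatnonzero(~reliable):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        idx = [j for j in range(lo, hi) if reliable[j]]
        if idx:  # fuse only when at least one reliable neighbour exists
            fused[i] = (tokens[i] + tokens[idx].sum(axis=0)) / (1 + len(idx))
    return fused  # reliable tokens are returned unchanged

tokens = np.array([[1.0], [5.0], [3.0], [9.0]])
reliable = np.array([True, False, True, False])
fused = fuse_uncertain(tokens, reliable)
print(fused.ravel())
```

In this toy input, token 1 is averaged with its two reliable neighbors and token 3 with its single reliable neighbor, while tokens 0 and 2 pass through untouched.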

RGB Tracker based on Transformer

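[Figure omitted]
To mitigate overfitting on the sparse downstream multimodal datasets, we adopt a Transformer-based RGB tracker whose pretrained parameters are frozen, and we fine-tune it for multimodal tracking.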
This leads to the replacement of the frozen attention mechanism h = W0x with the new LoRA attention:
[Equation image omitted]
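
The equation image did not survive in these notes. In the standard LoRA formulation, the frozen weight W0 is kept and a trainable low-rank update is added, giving h = W0x + BAx with B of shape d×r and A of shape r×d. The sketch below illustrates that standard form in NumPy; the sizes `d` and `r` and the initialization scale are illustrative assumptions, and whether Un-Track uses exactly this parameterization is not confirmed by the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 32, 4                             # feature width and LoRA rank (illustrative)

W0 = rng.standard_normal((d, d))         # frozen pretrained weight, never updated
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-initialised

def lora_forward(x, W0, A, B):
    """h = W0 x + B A x; only A and B would receive gradients during fine-tuning."""
    return W0 @ x + B @ (A @ x)

x = rng.standard_normal(d)
h_frozen = W0 @ x
h_lora = lora_forward(x, W0, A, B)

trainable = A.size + B.size              # 2 * d * r parameters
print(trainable, W0.size)
```

With B initialized to zero, the adapted layer starts out identical to the frozen one, and the number of trainable parameters (2·d·r) stays far below the d·d parameters of the frozen weight.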

Overall

[Figure omitted]
During training, our model learns the shared embedding from samples in the mixed dataset M, effectively binding all modalities together.
As for inference, our model can accommodate any modal input X, thanks to the emergent alignment.

Experiment

Dataset

RGB-D samples are sourced from DepthTrack

DepthTrack: Unveiling the Power of RGBD Tracking (ICCV 2021)
[Figure omitted]

RGB-T samples are extracted from LasHeR

LasHeR: A Large-scale High-diversity Benchmark for RGBT Tracking (TIP 2021)
From Chenglong Li's group at Anhui University.
[Figure omitted]
In addition, we release the unaligned version of LasHeR to attract research interest in alignment-free RGBT tracking.

RGB-E samples are obtained from VisEvent

VisEvent: Reliable Object Tracking via Collaboration of Frame and Event Flows (IEEE Transactions on Cybernetics, 2021)
This is the first large-scale visible-event benchmark dataset collected from the real world for single object tracking.
Propose a cross-modality transformer to achieve more effective feature fusion between visible and event data.
Construct more than 30 baseline methods by extending current single-modality trackers into dual-modality versions.

Comparative Experiments

[Figure omitted]

Model Generalization

Assess the versatility by evaluating performance on datasets that differ from the training ones
[Figure omitted]

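In addition, the authors test the case where the depth (D) input is missing, with the model trained on DepthTrack.
[Figure omitted]
In practical scenarios, challenges arise when there are no modal clues available. A typical case is when the auxiliary sensor fails to work properly.
To handle this:
We address this demanding case in our study by substituting the modal input with dummy values. The paper does not describe the exact procedure; it may simply rely on the prior learned during training.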

Key Component Analysis

[Figure omitted]
We perform all experiments on the DepthTrack testing set under a single parameter set setting.
The results are then compared with the original:

Ablation Studies

[Figure omitted]
This shows that low-rank approximation plays a vital role in our model.
All of these low-rank variants outperform the SOTA under the single-parameter-set setting, which shows the model's resilience.
For the fourth figure:
Modal Prompting
The score function categorizes tokens based on their confidence scores. So we explore different percentiles for the number of positive tokens, which is the same as the number of negative tokens, leaving the rest as uncertain tokens.
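
The percentile split described above can be sketched like this. The function name `split_by_percentile`, the string labels, and the example scores are assumptions made for illustration; the notes only state that the positive and negative groups have the same size and that the remainder is uncertain.

```python
import numpy as np

def split_by_percentile(scores, pct):
    """Label the top pct% of tokens positive, the bottom pct% negative, the rest uncertain."""
    if not 0 <= pct <= 50:
        raise ValueError("pct must be in [0, 50] so the two groups cannot overlap")
    n = len(scores)
    k = int(n * pct / 100)            # same count for positive and negative tokens
    order = np.argsort(scores)        # indices sorted by ascending confidence
    labels = np.full(n, "uncertain", dtype=object)
    if k > 0:
        labels[order[:k]] = "negative"
        labels[order[-k:]] = "positive"
    return labels

scores = np.array([0.9, 0.1, 0.5, 0.7, 0.3, 0.2, 0.8, 0.4, 0.6, 0.05])
labels = split_by_percentile(scores, pct=30)
print(list(labels))
```

Sweeping `pct` changes how many tokens are trusted outright versus passed on as uncertain, which is the knob the ablation varies.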

Conclusion and Future Work

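Achieves a shared embedding that binds all modalities together and overcomes their heterogeneous representations.
This unification is achieved through lightweight modal prompting and internal fine-tuning, inheriting the benefits of large-scale pretrained trackers without introducing a heavy computational burden.
Surpasses both SOTA unified trackers and modality-specific counterparts.
Introduce only +2.14 (over 21.50) GFLOPs with +6.6M (over 93M) parameters.
Model with a single set of parameters already achieves very competitive performance compared to the thermal-specific ViPT version.
Ensure a training-friendly pipeline that can be efficiently employed end-to-end on a single 24G GPU.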
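
As a quick check on how small the reported overhead is, the quoted figures can be turned into percentages. This assumes that "over 21.50" and "over 93M" denote the baseline tracker's cost, which is how the phrase reads in the abstract.

```python
# Figures quoted from the paper; "over" is read here as the baseline cost.
base_gflops, extra_gflops = 21.50, 2.14
base_params_m, extra_params_m = 93.0, 6.6   # in millions of parameters

flops_overhead = extra_gflops / base_gflops * 100
params_overhead = extra_params_m / base_params_m * 100

print(f"GFLOPs overhead: +{flops_overhead:.2f}%")
print(f"Parameter overhead: +{params_overhead:.2f}%")
```

Under that reading, the added compute is roughly 10% and the added parameters roughly 7% of the baseline.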