
Video Understanding with Large Models

news 2026/5/10 18:00:58

Datasets

Learning Video Context as Interleaved Multimodal Sequences

Motivation:
The focus is on narrative videos, like movie clips, TV series, etc., because they are comparatively complex.
Most top-performing video perception models study atomic actions, people, or objects.
Understanding video contexts covers many tasks, and the models that solve these tasks are too specific and not general enough.
=> Can we develop a general solution that handles these diverse contexts and needs in videos?

Our work
Similar models exist, but when applied to narrative videos, which encompass informative contexts, these models with a pre-defined visual-textual template still exhibit limitations due to inflexibility. On this basis, the paper makes the following contributions:

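It proposes a new multimodal model for this kind of video. Because these videos have a complex structure, the core idea is to embed the videos as interleaved multi-modal sequences.
It aims to unify multimodal contexts and tasks in a user-friendly way.
It collects an instruction-tuning dataset, using a package of solutions to convert existing datasets, in an interleaved multimodal instruction-following format. A decoder-only model is trained on this dataset.
In addition, as an application, the model lets users interact with videos in a more free-form way.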

Model
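Overall the model is not complicated: each frame is just a single token. The authors hope that encoding interleaved multimodal information this way helps the model answer questions better.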
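To make the "one token per frame, interleaved with text" idea concrete, here is a minimal sketch. It is not the paper's implementation: the `FrameToken` and `TextSpan` classes, the `interleave` function, and the choice to order segments by timestamp are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class FrameToken:
    """Hypothetical stand-in for the single embedding that represents one frame."""
    frame_index: int


@dataclass
class TextSpan:
    """Hypothetical stand-in for a piece of text (e.g. a subtitle) to be tokenized."""
    text: str


Segment = Union[FrameToken, TextSpan]


def interleave(frame_times: List[float],
               subtitles: List[Tuple[float, str]]) -> List[Segment]:
    """Merge frames and timed text into one sequence ordered by time.

    Assumption: a frame is placed before text that shares its timestamp.
    """
    events = [(t, 0, FrameToken(i)) for i, t in enumerate(frame_times)]
    events += [(t, 1, TextSpan(s)) for t, s in subtitles]
    events.sort(key=lambda e: (e[0], e[1]))
    return [segment for _, _, segment in events]


def sequence_length(sequence: List[Segment]) -> int:
    """Count positions: one per frame, one per whitespace-separated word of text.

    Word counting is a rough proxy; a real tokenizer would give different numbers.
    """
    total = 0
    for segment in sequence:
        if isinstance(segment, FrameToken):
            total += 1
        else:
            total += len(segment.text.split())
    return total


sequence = interleave([0.0, 2.0], [(1.0, "Hello there")])
print(sequence)
print(sequence_length(sequence))
```

Because each frame costs only one position under this assumption, long clips with many frames and subtitles can still fit in a language model's context window.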
Data
Several templates are built; the main focus is how to collect the corresponding tuning data for each type of interleaved prompt.
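As a rough illustration of converting an existing annotation into an interleaved instruction-following sample, consider the sketch below. The template, the `<frame>` placeholder, the `to_instruction_sample` function, and the annotation fields (`question`, `answer`, `subtitles`) are hypothetical; the paper's actual templates may look quite different.

```python
from typing import Dict, List

FRAME = "<frame>"  # hypothetical placeholder, later replaced by one frame token


def to_instruction_sample(annotation: Dict, num_frames: int) -> Dict[str, str]:
    """Turn a QA annotation into an interleaved prompt/response pair.

    Assumption: `annotation["subtitles"]` holds at most one subtitle per frame,
    with an empty string where a frame has no subtitle.
    """
    subtitles: List[str] = annotation.get("subtitles", [])
    lines: List[str] = []
    for i in range(num_frames):
        lines.append(FRAME)
        # Attach the subtitle right after the frame it belongs to.
        if i < len(subtitles) and subtitles[i]:
            lines.append(f"Subtitle: {subtitles[i]}")
    lines.append(f"Question: {annotation['question']}")
    return {"prompt": "\n".join(lines), "response": annotation["answer"]}


sample = to_instruction_sample(
    {"question": "Who speaks?", "answer": "Anna", "subtitles": ["Hi", ""]},
    num_frames=2,
)
print(sample["prompt"])
print(sample["response"])
```

Other prompt types would swap the subtitle line for whatever context the source dataset provides, while keeping the same frame-then-text ordering.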
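Experiments
The experiments cover many tasks, all among the most popular in video understanding, and the model reaches SOTA on nearly all of them. The section opens with several meaningful questions and works through them in depth. In addition, easily confused settings are marked with small icons, which makes the presentation clearer.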

Multi-task learning enhances individual capabilities.
This highlights the language model’s ability to acquire commonsense across diverse objectives and contexts.
Different kinds of interleaved multimodal instruction.
