Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding

1 Meta Reality Labs

2 FAIR, Meta

3 National University of Singapore

ICCV 2025

Abstract

We introduce ProVideLLM, an end-to-end framework for real-time procedural video understanding. ProVideLLM integrates a multimodal cache that stores two types of tokens: verbalized text tokens, which provide compressed textual summaries of long-term observations, and visual tokens, encoded with DETR-QFormer to capture fine-grained details from short-term observations. This design reduces the token count for representing one hour of long-term observations by 22× over existing methods while preserving the fine-grained details of the present. By interleaving these tokens in our multimodal cache, ProVideLLM achieves sub-linear scaling of memory and compute with video length, supporting per-frame streaming inference at 10 FPS and streaming dialogue at 25 FPS with a minimal 2 GB GPU memory footprint. ProVideLLM also sets new state-of-the-art results on six procedural tasks across four datasets.
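To make the cache design concrete, below is a minimal Python sketch of an interleaved multimodal cache, assuming fixed budgets for both token stores. All names (`MultimodalCache`, `on_new_clip`, the budget sizes, and the toy `verbalize` stand-in) are illustrative assumptions, not the authors' implementation; the key idea it demonstrates is that demoting old clips to short verbalized summaries keeps the context, and hence memory and compute, bounded as the video grows.

```python
from collections import deque
from typing import List


class MultimodalCache:
    """Sketch of an interleaved multimodal token cache (hypothetical).

    Long-term observations live as compact verbalized text summaries;
    short-term observations live as fine-grained visual tokens. Both
    budgets are fixed, so cache size (and attention cost over it) stays
    bounded regardless of video length.
    """

    def __init__(self, max_text_summaries: int = 64, max_visual_clips: int = 4):
        # Budgets are assumptions chosen for illustration.
        self.text_summaries: deque = deque(maxlen=max_text_summaries)
        self.visual_clips: deque = deque(maxlen=max_visual_clips)

    def on_new_clip(self, visual_tokens: List[int]) -> None:
        """Admit fine-grained visual tokens for the newest clip.

        When a clip falls out of the short-term window, it is verbalized
        into a few text tokens and demoted to the long-term store.
        """
        if len(self.visual_clips) == self.visual_clips.maxlen:
            oldest = self.visual_clips[0]
            self.text_summaries.append(self.verbalize(oldest))
        self.visual_clips.append(visual_tokens)

    def verbalize(self, visual_tokens: List[int]) -> List[int]:
        # Placeholder: a real system would caption/summarize the clip
        # with the LLM itself; here we keep a tiny fixed-size stand-in.
        return visual_tokens[:4]

    def interleaved_context(self) -> List[int]:
        """Concatenate long-term text tokens before short-term visual
        tokens -- the prompt order a streaming LLM would consume."""
        ctx: List[int] = []
        for summary in self.text_summaries:
            ctx.extend(summary)
        for clip in self.visual_clips:
            ctx.extend(clip)
        return ctx
```

Because every evicted clip collapses to a constant number of text tokens, the context length grows far more slowly than the raw frame count, which is the property that permits per-frame streaming inference at a fixed memory footprint.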

Framework overview

Citation

@article{chatterjee2025memory,
  title={Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding},
  author={Chatterjee, Dibyadip and Remelli, Edoardo and Song, Yale and Tekin, Bugra
    and Mittal, Abhay and Bhatnagar, Bharat and Camg{\"o}z, Necati Cihan
    and Hampali, Shreyas and Sauser, Eric and Ma, Shugao and others},
  journal={arXiv preprint arXiv:2504.13915},
  year={2025}
}