Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding

1 Meta Reality Labs

2 FAIR, Meta

3 National University of Singapore

ICCV 2025

Abstract

We introduce ProVideLLM, an end-to-end framework for real-time procedural video understanding. ProVideLLM integrates a multimodal cache that stores two types of tokens: verbalized text tokens, which provide compressed textual summaries of long-term observations, and visual tokens, encoded with DETR-QFormer to capture fine-grained details from short-term observations. This design reduces the token count for representing one hour of long-term observations by 22× over existing methods while preserving the fine-grained details of the present. By interleaving these tokens in our multimodal cache, ProVideLLM achieves sub-linear scaling of memory and compute with video length, supporting per-frame streaming inference at 10 FPS and streaming dialogue at 25 FPS with a minimal 2 GB GPU memory footprint. ProVideLLM also sets new state-of-the-art results on six procedural tasks across four datasets.
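To make the cache design concrete, below is a minimal Python sketch of an interleaved multimodal cache, assuming fixed budgets for both token stores. All names (`MultimodalCache`, `on_new_clip`, the budget sizes, and the toy `verbalize` stand-in) are illustrative assumptions, not the authors' implementation; the key idea it demonstrates is that demoting old clips to short verbalized summaries keeps the context, and hence memory and compute, bounded as the video grows.

```python
from collections import deque
from typing import List


class MultimodalCache:
    """Sketch of an interleaved multimodal token cache (hypothetical).

    Long-term observations live as compact verbalized text summaries;
    short-term observations live as fine-grained visual tokens. Both
    budgets are fixed, so cache size (and attention cost over it) stays
    bounded regardless of video length.
    """

    def __init__(self, max_text_summaries: int = 64, max_visual_clips: int = 4):
        # Budgets are assumptions chosen for illustration.
        self.text_summaries: deque = deque(maxlen=max_text_summaries)
        self.visual_clips: deque = deque(maxlen=max_visual_clips)

    def on_new_clip(self, visual_tokens: List[int]) -> None:
        """Admit fine-grained visual tokens for the newest clip.

        When a clip falls out of the short-term window, it is verbalized
        into a few text tokens and demoted to the long-term store.
        """
        if len(self.visual_clips) == self.visual_clips.maxlen:
            oldest = self.visual_clips[0]
            self.text_summaries.append(self.verbalize(oldest))
        self.visual_clips.append(visual_tokens)

    def verbalize(self, visual_tokens: List[int]) -> List[int]:
        # Placeholder: a real system would caption/summarize the clip
        # with the LLM itself; here we keep a tiny fixed-size stand-in.
        return visual_tokens[:4]

    def interleaved_context(self) -> List[int]:
        """Concatenate long-term text tokens before short-term visual
        tokens -- the prompt order a streaming LLM would consume."""
        ctx: List[int] = []
        for summary in self.text_summaries:
            ctx.extend(summary)
        for clip in self.visual_clips:
            ctx.extend(clip)
        return ctx
```

Because every evicted clip collapses to a constant number of text tokens, the context length grows far more slowly than the raw frame count, which is the property that permits per-frame streaming inference at a fixed memory footprint.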

Framework overview

Citation

@article{chatterjee2025memory,
  title={Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding},
  author={Chatterjee, Dibyadip and Remelli, Edoardo and Song, Yale and Tekin, Bugra
    and Mittal, Abhay and Bhatnagar, Bharat and Camg{\"o}z, Necati Cihan
    and Hampali, Shreyas and Sauser, Eric and Ma, Shugao and others},
  journal={arXiv preprint arXiv:2504.13915},
  year={2025}
}