SPOT-Bench

Abstract

Streaming video models should respond the moment an event unfolds, not after the moment has passed. Yet existing online VideoQA benchmarks remain largely retrospective. They pause the video at fixed timestamps, pose questions about current or past events, and score models only at those moments. This protocol leaves streaming predictions untested. To close this gap, we introduce SPOT-Bench, featuring multi-turn proactive queries that evaluate general streaming perception and assistive capabilities required by an always-on, real-time assistant. SPOT-Bench comes with Timeliness-F1, a consolidated metric that measures streaming predictions by their temporal precision and balanced coverage across the entire video. Our benchmark reveals: (i) offline models detect events reliably but spam predictions unprompted; (ii) post-training for silence reduces spamming but induces unresponsiveness; (iii) half of the streaming video expects no response, which we term dead-time — compute spent here does not affect response latency. These findings motivate AsynKV, a training-free streaming adaptation of offline models, that retains their event perception while improving their streaming behavior. AsynKV features a long-short term memory, utilized efficiently by scaling compute during dead-time. It serves as a strong baseline on SPOT-Bench, outperforming existing streaming models, and achieves state-of-the-art on retrospective benchmarks.

SPOT-Bench: Streaming Perception Over Time Benchmark

SPOT-Bench features six proactive streaming tasks grouped into three broad categories.

Detection

Detection evaluates a model's general event perception in a streaming setting: can it identify the exact moment an event occurs or transitions, given only partial observations up to that point?

Action Boundary Detection (ABD)

Respond only when the queried action boundary begins or ends.

Illustration for Action Boundary Detection

Point-of-No-Return Detection (PNR)

Respond only when the queried visual state change occurs.

Illustration for Point-of-No-Return Detection

Interaction

Interaction evaluates a model’s ability to engage in streaming interactions, producing temporally grounded and context-aware responses.

Streaming Question Answering (SQA)

Respond once sufficient visual evidence is available to make a correct decision.

Illustration for Streaming Question Answering

Streaming Procedure Guidance (SPG)

Track procedural progress and proactively cue the next step as the current one nears completion.

Illustration for Streaming Procedure Guidance

Intervention

Intervention probes proactive behavior, where the model assists a user without explicit requests by anticipating potential errors or failures.

Solicited Intervention (SI)

Intervene during a known task or procedure only when the user hesitates or makes an error.

Unsolicited Intervention (UI)

Detect imminent risk and intervene early enough to prevent failure, yet restrained enough to avoid disruptive responses.

Illustration for Unsolicited Intervention

Do existing Online VideoQA benchmarks incentivize streaming?

Existing Online VideoQA benchmarks such as StreamingBench and OVO-Bench mainly adopt a "pause-ask-play" evaluation protocol, which tests whether a model can answer retrospective queries about a video after it has happened. In Table 1, we analyze if such benchmarks incentivize streaming model development. After comparing state-of-the-art streaming models with closed- and open-source offline MLLMs, we observe the following:

Blind baselines perform better than random.
One frame is nearly enough.
Four recent frames outperform streaming models.

Conclusion. Together, these findings reveal a fundamental limitation: OVO-Bench and StreamingBench measure descriptive understanding, evaluating whether a model can describe observed events. Their MCQ format suffers from language bias; correctness-only evaluation provides no signal for timeliness or frequency. Their short durations remove the need for long-term temporal modeling — an offline model processing a single frame can match one that streams every frame. Consequently, these benchmarks offer little incentive to develop streaming models with proactive, timely intervention capabilities; the core requirement for real-world assistive systems.

@article{chatterjee2026don, title={Don't Pause! Every prediction matters in a streaming video}, author={Chatterjee, Dibyadip and Pang, Zhanzhong and Sener, Fadime and Song, Yale and Yao, Angela}, journal={arXiv preprint arXiv:2604.24317}, year={2026} }

Don't Pause Every prediction matters in a streaming video

2026

Abstract

Streaming Evaluation Protocol

Proactive QA

SPOT-Bench: Streaming Perception Over Time Benchmark

Detection

Interaction

Intervention

Do existing Online VideoQA benchmarks incentivize streaming?

BibTeX

⏸︎ Don't Pause Every prediction matters in a streaming video

2026

Abstract

Streaming Evaluation Protocol

Proactive QA

SPOT-Bench: Streaming Perception Over Time Benchmark

Detection

Interaction

Intervention

Do existing Online VideoQA benchmarks incentivize streaming?

BibTeX

Don't Pause Every prediction matters in a streaming video