Dependence on a Single Base Model: The sample filtering mechanism is built entirely on Qwen2-VL. Although this choice was technically necessary, it limits generalizability, since different models may define "difficult" samples somewhat differently. In future work, we plan to extend the framework to additional base models with strong dynamic-resolution support.
Single Criterion for Filtering: We use only downsampling as the criterion for identifying "difficult" samples. Although this is effective, ensemble filtering over a more diverse set of baseline methods could yield a more robust definition. We leave the exploration of such a mechanism to a subsequent version of VTC-Bench.
Lack of a Formal Theoretical Definition: Our identification of "difficult" and "simple" samples is based primarily on experimental results. A formal theoretical framework, with accompanying mathematical proof, for a more explainable sample filtering mechanism has yet to be established.
Limited Coverage of Methods: Owing to the rapid evolution of the field and adaptation constraints, our experimental comparison does not include all visual token compression methods. Nevertheless, the selected methods are sufficient to support our core claims, and the framework remains open and extensible.
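
To make the ensemble filtering idea raised under "Single Criterion for Filtering" more concrete, the following is a minimal sketch and not the filtering procedure used in this work. It assumes that a sample counts as "difficult" when a baseline answers it incorrectly, and that an ensemble labels a sample "difficult" when at least `min_failures` baselines fail on it. The function name `label_difficult`, the baseline names, and the correctness values are all hypothetical and serve only as illustration.

```python
from typing import Dict, List


def label_difficult(
    correct_by_baseline: Dict[str, List[bool]],
    min_failures: int = 1,
) -> List[bool]:
    """Flag each sample as difficult (True) or simple (False).

    Assumption (not from the paper): a sample is difficult when at least
    `min_failures` baselines answered it incorrectly.
    """
    runs = list(correct_by_baseline.values())
    if not runs:
        raise ValueError("at least one baseline is required")
    num_samples = len(runs[0])
    if any(len(run) != num_samples for run in runs):
        raise ValueError("all baselines must cover the same samples")

    flags = []
    for i in range(num_samples):
        # Count how many baselines got sample i wrong.
        failures = sum(1 for run in runs if not run[i])
        flags.append(failures >= min_failures)
    return flags


# Hypothetical per-sample correctness for three baselines.
results = {
    "downsample": [True, False, False, True],
    "baseline_b": [True, False, True, True],
    "baseline_c": [True, True, True, False],
}

# Single criterion: only the downsampling baseline decides.
single = label_difficult({"downsample": results["downsample"]})

# Ensemble criterion: at least two of the three baselines must fail.
ensemble = label_difficult(results, min_failures=2)
```

In this toy input the single-criterion and ensemble labels disagree on the third sample, where only the downsampling baseline fails. An ensemble is meant to reduce exactly this kind of sensitivity to a single baseline.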