Are We Using the Right Benchmark:
An Evaluation Framework for Visual Token Compression Methods

Chenfei Liao1,2,6, Wensong Wang3,2, Zichen Wen2,5, Xu Zheng1,4,6, Yiyu Wang2,
Haocong He2, Yuanhuiyi Lyu1,6, Lutao Jiang1,6, Xin Zou1,6, Yuqian Fu4, Bin Ren7,8,4,
Linfeng Zhang2,*, Xuming Hu1,6,*
1HKUST (Guangzhou) 2SJTU 3Northeastern University 4INSAIT
5Shanghai AI Laboratory 6HKUST 7University of Pisa 8University of Trento

Abstract

Recent efforts to accelerate inference in Multimodal Large Language Models (MLLMs) have largely focused on visual token compression. The effectiveness of these methods is commonly evaluated by measuring the accuracy drop on existing MLLM benchmarks before and after compression. However, these benchmarks were originally designed to assess general perception and reasoning abilities, not the specific challenges posed by visual token compression, leading to a fundamental task mismatch.

In this work, we uncover a counterintuitive yet consistent phenomenon: simple image downsampling outperforms many advanced visual token compression methods across multiple widely used benchmarks.

Through a comprehensive empirical study spanning eight popular benchmarks and multiple state-of-the-art compression techniques, we show that (i) current benchmarks contain substantial noise (task-irrelevant samples) for evaluating visual token compression, and (ii) downsampling can act as an effective data filter that distinguishes between simple and difficult samples with respect to compression sensitivity.

Motivated by these findings, we propose VTC-Bench, an evaluation framework that explicitly leverages downsampling as a discriminator to denoise existing benchmarks, enabling a fairer and more meaningful assessment of visual token compression methods.

Motivation

Some recent MLLMs, such as Qwen2-VL and Qwen2.5-VL, natively support inputs of varying resolutions. A trivial yet efficient way to handle high-resolution images is simply to downsample them to a lower resolution. However, most token compression methods for MLLMs instead adaptively drop useless tokens or merge similar tokens rather than directly downsampling the original image; in principle, these adaptive approaches should be more intelligent than naive downsampling.
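To make the baseline concrete, below is a minimal sketch of downsampling an image to a target visual-token budget. It assumes a Qwen2-VL-style tokenizer in which each 28×28-pixel block (14-pixel patches followed by 2×2 spatial merging) yields one visual token; the helper names and the constant are illustrative assumptions, not the paper's exact implementation.

```python
from PIL import Image

PIXELS_PER_TOKEN = 28  # assumption: 14-px patches + 2x2 merge (Qwen2-VL-style)

def visual_token_count(width: int, height: int) -> int:
    """Approximate number of visual tokens for a dynamic-resolution MLLM."""
    return (width // PIXELS_PER_TOKEN) * (height // PIXELS_PER_TOKEN)

def downsample_for_budget(image: Image.Image, keep_ratio: float) -> Image.Image:
    """Shrink the image so its token count drops to roughly keep_ratio.

    Token count scales with area, so each side is scaled by sqrt(keep_ratio).
    """
    scale = keep_ratio ** 0.5
    new_w = max(PIXELS_PER_TOKEN, round(image.width * scale))
    new_h = max(PIXELS_PER_TOKEN, round(image.height * scale))
    return image.resize((new_w, new_h), Image.BICUBIC)

# Example: keep 25% of visual tokens (i.e., 75% compression).
img = Image.open("example.jpg")  # placeholder path
small = downsample_for_budget(img, keep_ratio=0.25)
print(visual_token_count(*img.size), "->", visual_token_count(*small.size))
```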

Surprisingly, we find that image downsampling consistently exceeds these more sophisticated methods across a wide range of settings. Based on comprehensive experiments, we propose a bold hypothesis:

Some data in existing benchmarks is overly simplistic and irrelevant to evaluating visual token compression methods, which leads to the unreasonable phenomenon that even naive downsampling is sufficient for the visual token compression task.

Motivation: Downsampling Anomaly
Figure 1. The Anomaly. (a) Average Performance Retention Ratio (APRR) of five visual token compression methods on eight benchmarks (Model: Qwen2-VL-7B). (b) Comparison of advanced token compression methods and downsampling on Qwen2-VL-7B by groups at 75% compression.

To validate this, we design a data-centric analysis using downsampling as a discriminator. We identify two crucial findings:

  1. Current benchmarks are noisy for the visual token compression task. Many samples can be answered correctly even with significant downsampling, indicating they do not test fine-grained visual understanding.
  2. Downsampling can serve as a data filter. By separating samples into "simple" (Group B) and "difficult" (Group A) based on whether downsampling succeeds, we can effectively distinguish samples that truly require advanced compression.

VTC-Bench Framework

Based on these findings, we propose VTC-Bench, a new evaluation framework specifically designed to denoise existing benchmarks. By explicitly distinguishing between “simple” and “difficult” samples through downsampling, VTC-Bench adaptively selects the "difficult" samples that meet the requirements for evaluating visual token compression methods.

VTC-Bench Framework Pipeline
Figure 2. VTC-Bench Overview. VTC-Bench is a simple yet effective framework that transforms any existing benchmark into a subset on which visual token compression methods can be evaluated fairly.

The pipeline consists of three critical steps:

Step 1: Inference & Compression. Given a sample and a target token compression ratio, we run two inference pipelines on the target MLLM: (1) a downsampling baseline (the filter) and (2) the advanced visual token compression method under evaluation (e.g., FastV, VisionZip, DART).

Step 2: Grouping. We use the performance of the downsampling method as a binary discriminator to categorize samples:

  • Group A (Difficult Samples): Samples that are answered incorrectly by the downsampling method.
  • Group B (Simple Samples): Samples that are answered correctly by the downsampling method.

This step filters the existing benchmarks, removing noisy samples that are uninformative for evaluating visual token compression methods.

Step 3: Result Aggregation. We compute accuracy statistics over the "difficult" samples (Group A) to obtain an indicator that truly reflects the capability of visual token compression methods.
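The three steps can be summarized in a short sketch; here `downsample_fn`, `compress_fn`, `run_mllm`, and `is_correct` are hypothetical stand-ins for the actual preprocessing, inference, and scoring code, which this page does not specify.

```python
from statistics import mean

def vtc_bench(samples, downsample_fn, compress_fn, run_mllm, is_correct):
    """Minimal sketch of the VTC-Bench pipeline (hypothetical helpers).

    downsample_fn and compress_fn prepare visual inputs at the same target
    token budget; run_mllm performs inference on the target MLLM.
    """
    # Steps 1-2: run the downsampling filter and group samples by its outcome.
    group_a = []  # "difficult": the downsampling baseline answers incorrectly
    for s in samples:
        ds_answer = run_mllm(downsample_fn(s.image), s.question)
        if not is_correct(ds_answer, s.answer):
            group_a.append(s)  # Group B ("simple") samples are filtered out

    # Step 3: aggregate the compression method's accuracy on Group A only.
    scores = [
        is_correct(run_mllm(compress_fn(s.image), s.question), s.answer)
        for s in group_a
    ]
    return mean(scores) if scores else float("nan")
```

Since Group B is discarded in Step 2, the compression method only ever needs to be scored on Group A, which is exactly what the Step 3 aggregate reflects.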

Experiments & Findings

When evaluated using VTC-Bench (focusing on "Difficult Samples"), the landscape of visual token compression changes completely. Advanced methods prove their worth where it matters most.

Is downsampling all you need? Across many benchmarks, simple image downsampling often beats more advanced compression methods. VTC-Bench overturns this impression: when we restrict evaluation to the compression-relevant difficult samples (Group A), the trend reverses. By filtering out easy samples, VTC-Bench reveals that for truly challenging instances, advanced visual token compression methods are not only effective but necessary.

1. Standard Benchmark Results (Table 1)

Performance comparison on standard benchmarks (Qwen2-VL-7B); overall results for LLaVA-OV-7B are not reported in the paper. Results are shown under five compression settings, ordered from the lowest (Setting 1) to the highest (Setting 5) compression ratio.
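As a working definition consistent with the reported numbers, APRR (Average Performance Retention Ratio) averages the per-benchmark ratio between the compressed and vanilla scores over the N = 8 benchmarks:

    APRR = (100 / N) · Σᵢ (sᵢ_method / sᵢ_vanilla)

For example, the Downsample row of Setting 1 gives a mean ratio of (59.2/62.3 + 75.0/78.9 + ... + 65.0/81.6) / 8 ≈ 0.910, matching its reported APRR of 91.0.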

Setting 1 (lowest compression ratio)

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | OCR | ChartQA | APRR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla | 62.3 | 78.9 | 78.0 | 2306 | 88.4 | 57.1 | 80.7 | 81.6 | 100.0 |
| Downsample | 59.2 | 75.0 | 73.8 | 2259 | 86.2 | 50.1 | 64.9 | 65.0 | 91.0 |
| FastV | 57.0 | 73.7 | 73.1 | 2083 | 84.5 | 44.6 | 42.0 | 58.1 | 83.2 |
| VisionZip | 58.6 | 71.1 | 70.5 | 2062 | 87.1 | 47.2 | 42.1 | 66.9 | 84.9 |
| PruMerge+ | 59.4 | 72.1 | 72.0 | 2044 | 87.2 | 48.0 | 33.9 | 56.2 | 82.7 |
| DART | 56.9 | 72.5 | 70.2 | 2066 | 84.7 | 47.2 | 52.5 | 52.7 | 83.9 |

Setting 2

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | OCR | ChartQA | APRR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla | 62.3 | 78.9 | 78.0 | 2306 | 88.4 | 57.1 | 80.7 | 81.6 | 100.0 |
| Downsample | 55.5 | 69.0 | 70.2 | 2127 | 82.9 | 44.0 | 48.8 | 24.8 | 77.6 |
| FastV | 52.3 | 65.0 | 65.5 | 1854 | 77.4 | 40.3 | 25.9 | 32.9 | 70.2 |
| VisionZip | 53.3 | 62.9 | 63.0 | 1820 | 83.6 | 40.2 | 25.1 | 48.4 | 72.5 |
| PruMerge+ | 54.8 | 62.2 | 61.3 | 1806 | 84.3 | 38.4 | 22.2 | 44.2 | 71.0 |
| DART | 51.9 | 61.3 | 61.8 | 1915 | 80.5 | 39.8 | 41.0 | 30.8 | 71.6 |

Setting 3

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | OCR | ChartQA | APRR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla | 62.3 | 78.9 | 78.0 | 2306 | 88.4 | 57.1 | 80.7 | 81.6 | 100.0 |
| Downsample | 52.6 | 66.4 | 66.8 | 1994 | 79.5 | 40.9 | 40.3 | 12.7 | 71.0 |
| FastV | 49.0 | 57.1 | 57.9 | 1684 | 74.9 | 37.5 | 18.7 | 20.6 | 62.1 |
| VisionZip | 49.0 | 54.8 | 54.0 | 1704 | 80.2 | 35.2 | 15.9 | 28.0 | 62.2 |
| PruMerge+ | 48.7 | 48.4 | 48.1 | 1679 | 79.2 | 33.2 | 14.4 | 30.0 | 59.5 |
| DART | 49.2 | 53.4 | 54.0 | 1786 | 78.1 | 33.6 | 33.7 | 19.2 | 63.2 |

Setting 4

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | OCR | ChartQA | APRR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla | 62.3 | 78.9 | 78.0 | 2306 | 88.4 | 57.1 | 80.7 | 81.6 | 100.0 |
| Downsample | 50.1 | 62.0 | 61.4 | 1938 | 78.8 | 37.5 | 32.3 | 11.7 | 66.4 |
| FastV | 46.1 | 43.9 | 46.6 | 1589 | 72.4 | 33.6 | 14.4 | 15.8 | 54.5 |
| VisionZip | 46.4 | 49.5 | 50.0 | 1628 | 77.8 | 33.4 | 12.0 | 19.4 | 57.1 |
| PruMerge+ | 45.0 | 39.1 | 40.9 | 1544 | 74.0 | 30.5 | 10.5 | 20.9 | 52.1 |
| DART | 45.6 | 47.9 | 48.2 | 1701 | 74.7 | 31.7 | 29.3 | 16.6 | 58.3 |

Setting 5 (highest compression ratio)

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | OCR | ChartQA | APRR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla | 62.3 | 78.9 | 78.0 | 2306 | 88.4 | 57.1 | 80.7 | 81.6 | 100.0 |
| Downsample | 43.5 | 51.6 | 51.9 | 1589 | 72.8 | 33.8 | 13.2 | 12.1 | 55.4 |
| FastV | 38.2 | 23.9 | 24.5 | 1189 | 55.0 | 26.1 | 5.8 | 11.9 | 38.0 |
| VisionZip | 41.9 | 40.5 | 40.5 | 1335 | 65.5 | 30.8 | 4.9 | 12.8 | 47.3 |
| PruMerge+ | 39.0 | 23.7 | 24.4 | 1165 | 51.6 | 25.7 | 3.5 | 13.9 | 37.4 |
| DART | 40.5 | 30.8 | 30.7 | 1346 | 60.0 | 28.8 | 23.2 | 11.8 | 45.4 |

2. Group Comparison (Table 2)

The accuracy gap between Group A (Difficult) and Group B (Simple) under the same five compression settings as Table 1. MME is reported here as accuracy (%) rather than as a raw score. By construction, the downsampling filter scores 100 on Group B and 0 on Group A. The Average column includes OCRBench, which is omitted from the per-benchmark columns here (the Group A OCRBench numbers appear in the VTC-Bench tables below).

Setting 1 (lowest compression ratio)

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | ChartQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Group B (Simple)* | | | | | | | | |
| FastV | 87.6 | 95.9 | 95.8 | 96.7 | 94.8 | 76.0 | 78.1 | 85.3 |
| VisionZip | 91.2 | 93.8 | 93.6 | 95.3 | 96.8 | 81.4 | 87.3 | 87.2 |
| PruMerge+ | 91.9 | 95.1 | 94.6 | 95.9 | 97.5 | 82.3 | 73.6 | 84.6 |
| DART | 88.1 | 94.9 | 94.6 | 94.9 | 94.5 | 77.7 | 69.0 | 85.5 |
| Downsample | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| *Group A (Difficult)* | | | | | | | | |
| FastV | 57.8 | 45.2 | 56.5 | 78.9 | 65.4 | 41.0 | 35.0 | 51.1 |
| VisionZip | 59.3 | 42.4 | 42.2 | 54.9 | 72.5 | 45.9 | 51.2 | 49.8 |
| PruMerge+ | 57.7 | 51.2 | 52.6 | 62.0 | 72.1 | 48.1 | 40.5 | 50.7 |
| DART | 58.9 | 54.8 | 52.2 | 67.6 | 69.4 | 47.0 | 39.0 | 53.6 |
| Downsample | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

Setting 2

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | ChartQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Group B (Simple)* | | | | | | | | |
| FastV | 82.5 | 90.3 | 90.8 | 94.0 | 88.7 | 73.0 | 61.7 | 77.8 |
| VisionZip | 83.4 | 89.0 | 88.1 | 92.2 | 92.3 | 73.0 | 74.4 | 78.6 |
| PruMerge+ | 85.8 | 87.2 | 86.4 | 91.9 | 94.2 | 71.6 | 73.8 | 78.0 |
| DART | 81.2 | 87.7 | 86.9 | 91.7 | 90.9 | 70.0 | 57.6 | 78.6 |
| Downsample | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| *Group A (Difficult)* | | | | | | | | |
| FastV | 44.5 | 39.2 | 44.1 | 59.4 | 46.8 | 31.0 | 28.4 | 38.9 |
| VisionZip | 49.4 | 33.2 | 44.4 | 48.1 | 70.0 | 30.3 | 49.7 | 43.4 |
| PruMerge+ | 50.4 | 36.9 | 38.4 | 42.9 | 71.5 | 28.8 | 43.5 | 41.3 |
| DART | 47.5 | 40.5 | 40.9 | 49.6 | 57.7 | 35.4 | 27.3 | 41.3 |
| Downsample | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

Setting 3

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | ChartQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Group B (Simple)* | | | | | | | | |
| FastV | 81.4 | 85.7 | 86.6 | 91.5 | 88.1 | 74.5 | 74.8 | 77.0 |
| VisionZip | 79.0 | 81.9 | 82.2 | 88.4 | 89.4 | 69.8 | 71.3 | 73.4 |
| PruMerge+ | 76.7 | 76.9 | 76.1 | 87.8 | 87.6 | 65.5 | 68.9 | 70.2 |
| DART | 78.8 | 81.8 | 80.4 | 88.9 | 88.5 | 61.8 | 67.1 | 75.6 |
| Downsample | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| *Group A (Difficult)* | | | | | | | | |
| FastV | 35.7 | 31.9 | 35.3 | 48.8 | 37.4 | 22.8 | 14.8 | 30.0 |
| VisionZip | 41.0 | 34.5 | 33.3 | 43.5 | 66.3 | 24.3 | 26.1 | 35.4 |
| PruMerge+ | 43.0 | 29.6 | 34.1 | 43.0 | 67.7 | 25.5 | 29.4 | 35.6 |
| DART | 41.9 | 33.8 | 38.4 | 46.9 | 57.0 | 26.2 | 14.5 | 35.5 |
| Downsample | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

Setting 4

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | ChartQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Group B (Simple)* | | | | | | | | |
| FastV | 79.2 | 74.5 | 74.4 | 90.1 | 85.8 | 68.6 | 75.3 | 71.9 |
| VisionZip | 75.6 | 80.3 | 80.2 | 87.4 | 86.5 | 69.6 | 69.2 | 71.2 |
| PruMerge+ | 72.5 | 69.3 | 69.1 | 83.9 | 82.2 | 61.4 | 70.3 | 65.9 |
| DART | 74.5 | 77.3 | 75.1 | 84.9 | 84.3 | 63.5 | 71.1 | 73.1 |
| Downsample | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| *Group A (Difficult)* | | | | | | | | |
| FastV | 29.6 | 24.7 | 33.1 | 35.6 | 35.6 | 21.5 | 9.3 | 25.0 |
| VisionZip | 38.6 | 31.2 | 32.6 | 37.9 | 60.0 | 24.5 | 15.1 | 31.4 |
| PruMerge+ | 38.8 | 26.0 | 29.3 | 37.0 | 56.4 | 22.6 | 17.0 | 29.5 |
| DART | 36.4 | 33.7 | 36.4 | 37.9 | 53.2 | 22.3 | 10.5 | 31.9 |
| Downsample | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

Setting 5 (highest compression ratio)

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | ChartQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Group B (Simple)* | | | | | | | | |
| FastV | 75.5 | 55.6 | 53.2 | 75.3 | 59.3 | 61.0 | 73.3 | 59.1 |
| VisionZip | 79.4 | 78.3 | 76.9 | 80.5 | 70.2 | 69.4 | 70.3 | 67.8 |
| PruMerge+ | 73.9 | 55.9 | 52.6 | 73.9 | 49.7 | 57.0 | 69.5 | 55.7 |
| DART | 76.1 | 63.2 | 59.0 | 73.2 | 67.4 | 63.7 | 70.7 | 64.6 |
| Downsample | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| *Group A (Difficult)* | | | | | | | | |
| FastV | 18.3 | 18.3 | 21.5 | 21.5 | 44.3 | 15.0 | 3.8 | 18.4 |
| VisionZip | 23.4 | 28.8 | 32.2 | 28.5 | 53.6 | 19.4 | 5.5 | 24.4 |
| PruMerge+ | 20.7 | 17.8 | 21.1 | 22.9 | 52.6 | 17.1 | 7.1 | 20.2 |
| DART | 24.5 | 26.5 | 28.1 | 30.6 | 41.5 | 19.2 | 4.2 | 25.0 |
| Downsample | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Group Analysis
Figure 4. Comparison of advanced token compression methods and downsampling on Qwen2-VL-7B by groups at 75% compression.

3. VTC-Bench Results (Qwen2-VL-7B & LLaVA-OV-7B)

Comparison of advanced token compression methods on VTC-Bench (the "difficult" Group A samples) under the same five compression settings. The best results are bolded, and the second-best results are italicized.

VTC-Bench Results on Qwen2-VL-7B
Figure 3. VTC-Bench results on Qwen2-VL-7B. Advanced methods show significant advantages on difficult samples.

Setting 1 (lowest compression ratio)

Qwen2-VL-7B Results

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | OCR | ChartQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FastV | 57.8 | 45.2 | **56.5** | **78.9** | 65.4 | 41.0 | 29.1 | 35.0 | *51.1* |
| VisionZip | **59.3** | 42.4 | 42.2 | 54.9 | **72.5** | 45.9 | *29.6* | **51.2** | 49.8 |
| PruMerge+ | 57.7 | *51.2* | *52.6* | 62.0 | *72.1* | **48.1** | 21.2 | *40.5* | 50.7 |
| DART | *58.9* | **54.8** | 52.2 | *67.6* | 69.4 | *47.0* | **40.2** | 39.0 | **53.6** |

LLaVA-OV-7B Results

| Method | GQA | MMB | MMB-CN | POPE | MMStar | Average |
| --- | --- | --- | --- | --- | --- | --- |
| FastV | 54.3 | *70.5* | 69.1 | 63.8 | **48.6** | 61.3 |
| VisionZip | *59.0* | 67.7 | *71.3* | **80.8** | 44.8 | *64.7* |
| PruMerge+ | **60.4** | **74.2** | **73.5** | *75.6* | **48.6** | **66.5** |

Setting 2

Qwen2-VL-7B Results

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | OCR | ChartQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FastV | 44.5 | *39.2* | *44.1* | **59.4** | 46.8 | *31.0* | 17.8 | 28.4 | 38.9 |
| VisionZip | *49.4* | 33.2 | **44.4** | 48.1 | *70.0* | 30.3 | *22.0* | **49.7** | **43.4** |
| PruMerge+ | **50.4** | 36.9 | 38.4 | 42.9 | **71.5** | 28.8 | 18.1 | *43.5* | *41.3* |
| DART | 47.5 | **40.5** | 40.9 | *49.6* | 57.7 | **35.4** | **31.5** | 27.3 | *41.3* |

LLaVA-OV-7B Results

| Method | GQA | MMB | MMB-CN | POPE | MMStar | Average |
| --- | --- | --- | --- | --- | --- | --- |
| FastV | 45.3 | 64.6 | 66.4 | 39.1 | 42.4 | 51.6 |
| VisionZip | *56.6* | **71.9** | *71.2* | *69.6* | *43.5* | *62.6* |
| PruMerge+ | **57.4** | *68.8* | **71.5** | **76.0** | **45.8** | **63.9** |

Setting 3

Qwen2-VL-7B Results

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | OCR | ChartQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FastV | 35.7 | 31.9 | *35.3* | **48.8** | 37.4 | 22.8 | 13.3 | 14.8 | 30.0 |
| VisionZip | 41.0 | **34.5** | 33.3 | 43.5 | *66.3* | 24.3 | *14.0* | *26.1* | 35.4 |
| PruMerge+ | **43.0** | 29.6 | 34.1 | 43.0 | **67.7** | *25.5* | 12.6 | **29.4** | **35.6** |
| DART | *41.9* | *33.8* | **38.4** | *46.9* | 57.0 | **26.2** | **25.6** | 14.5 | *35.5* |

LLaVA-OV-7B Results

| Method | GQA | MMB | MMB-CN | POPE | MMStar | Average |
| --- | --- | --- | --- | --- | --- | --- |
| FastV | 36.7 | 51.2 | 53.3 | 29.7 | 32.6 | 40.7 |
| VisionZip | *49.1* | *64.3* | *62.4* | *53.6* | **36.6** | *53.2* |
| PruMerge+ | **50.2** | **66.6** | **65.3** | **59.9** | *34.8* | **55.4** |

Setting 4

Qwen2-VL-7B Results

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | OCR | ChartQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FastV | 29.6 | 24.7 | *33.1* | 35.6 | 35.6 | 21.5 | *11.0* | 9.3 | 25.0 |
| VisionZip | *38.6* | *31.2* | 32.6 | **37.9** | **60.0** | **24.5** | *11.0* | *15.1* | *31.4* |
| PruMerge+ | **38.8** | 26.0 | 29.3 | *37.0* | *56.4* | *22.6* | 9.2 | **17.0** | 29.5 |
| DART | 36.4 | **33.7** | **36.4** | **37.9** | 53.2 | 22.3 | **24.7** | 10.5 | **31.9** |

LLaVA-OV-7B Results

| Method | GQA | MMB | MMB-CN | POPE | MMStar | Average |
| --- | --- | --- | --- | --- | --- | --- |
| FastV | 31.4 | 37.4 | 43.1 | 24.5 | 28.6 | 33.0 |
| VisionZip | *42.6* | *55.4* | *56.9* | *45.4* | *30.8* | *46.2* |
| PruMerge+ | **42.7** | **57.8** | **59.6** | **49.9** | **31.3** | **48.3** |

Setting 5 (highest compression ratio)

Qwen2-VL-7B Results

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | OCR | ChartQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FastV | 18.3 | 18.3 | 21.5 | 21.5 | 44.3 | 15.0 | *4.2* | 3.8 | 18.4 |
| VisionZip | *23.4* | **28.8** | **32.2** | *28.5* | **53.6** | **19.4** | 3.7 | *5.5* | *24.4* |
| PruMerge+ | 20.7 | 17.8 | 21.1 | 22.9 | *52.6* | 17.1 | 2.5 | **7.1** | 20.2 |
| DART | **24.5** | *26.5* | *28.1* | **30.6** | 41.5 | *19.2* | **25.6** | 4.2 | **25.0** |

LLaVA-OV-7B Results

| Method | GQA | MMB | MMB-CN | POPE | MMStar | Average |
| --- | --- | --- | --- | --- | --- | --- |
| FastV | *25.7* | *25.8* | *29.6* | 39.3 | 21.9 | 28.5 |
| VisionZip | **28.3** | **28.1** | **32.8** | **42.1** | *24.7* | **31.2** |
| PruMerge+ | 25.3 | 25.5 | 28.5 | *40.4* | **25.2** | *29.0* |

Conclusion

This paper systematically analyzes the task mismatch present in current MLLM benchmarks when they are used to evaluate visual token compression methods. Starting from a surprising and counterintuitive finding, namely that simple image downsampling consistently outperforms many advanced compression methods across multiple widely used benchmarks, we conduct a comprehensive empirical study spanning several state-of-the-art visual token compression methods.

This study yields two crucial findings:

  1. Current benchmarks are noisy for the visual token compression task.
  2. Downsampling can serve as a data filter that gauges sample difficulty with respect to the visual token compression task.

Furthermore, we propose VTC-Bench, a new evaluation framework specifically designed to denoise existing benchmarks by explicitly distinguishing between “simple” and “difficult” samples through downsampling. Through this work, we hope not only to advance the field of visual token compression but also to inspire broader discussion within the community on "how to properly evaluate efficient MLLMs."

Limitations

Dependence on a Single Base Model: The sample filtering mechanism is built entirely on Qwen2-VL. While this choice was driven by technical necessity (the filter requires native dynamic-resolution support), the reliance limits generalizability, as different models may define "difficult" samples slightly differently. In the future, this framework will be extended to more base models with strong dynamic-resolution support.

Single Criterion for Filtering: We use only downsampling as the criterion for identifying "difficult" samples. Although this is effective, employing a more diverse set of baseline methods for ensemble filtering could yield a more robust definition of difficulty. We will explore such a filtering mechanism in a subsequent version of VTC-Bench.

Lack of a Formal Theoretical Definition: Our identification of "difficult" and "simple" samples is primarily based on experimental results. A formal theoretical framework and a mathematical proof for a more explainable sample filtering mechanism are yet to be established.

Limited Coverage of Methods: Due to the rapidly evolving field and adaptation constraints, our experimental comparison does not include all visual token compression methods. However, the selected methods are sufficient to support our core claims, and the framework remains open and extensible.

Appendix

Concurrent Similar Research

VisionThink: VisionThink identifies that downsampling shows surprising effectiveness on most general VQA tasks except for OCR and detail-sensitive benchmarks. It employs a reinforcement learning framework to dynamically decide if high-resolution images are needed. Our work further investigates this observation through extensive experiments, clarifying the task mismatch in current benchmarks and proving downsampling's potential as a data filter.

EffiVLM-Bench: EffiVLM-Bench offers a unified evaluation framework for assessing training-free acceleration techniques. Unlike EffiVLM-Bench, which aims to provide a toolkit, VTC-Bench specifically targets the task mismatch problem in visual token compression evaluation.

Experiment Details

All experiments are conducted on a single A800 GPU. Code environment: Python 3.10, torch 2.6.0, torchvision 0.21.0. Bicubic interpolation is used for downsampling.
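For reference, bicubic downsampling in this environment can be reproduced with torchvision; this is a sketch, with the scale factor left as a parameter that depends on the target compression ratio.

```python
import torch
from torchvision.transforms import InterpolationMode
from torchvision.transforms import functional as F

def bicubic_downsample(image: torch.Tensor, scale: float) -> torch.Tensor:
    """Downsample a CHW image tensor by `scale` with bicubic interpolation."""
    h, w = image.shape[-2:]
    new_size = [max(1, int(h * scale)), max(1, int(w * scale))]
    return F.resize(image, new_size,
                    interpolation=InterpolationMode.BICUBIC, antialias=True)
```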

Benchmark Details

  • GQA: 22M reasoning questions with strict distribution control.
  • MMBench: 3,217 multiple-choice questions across 20 dimensions.
  • MME: Evaluates perceptual and cognitive abilities across 14 subtasks.
  • POPE: Evaluates object hallucination under different sampling strategies.
  • MMStar: 1,500 samples covering six core abilities.
  • OCRBench: 1,000 manually verified samples for OCR capabilities.
  • ChartQA: Evaluates visual and logical reasoning over charts.

Visualization between Groups

We visualize "difficult" and "simple" samples to observe their characteristics. "Difficult" samples (Group A) tend to require perceiving and comparing multiple fine details (e.g., finding extreme values in charts or complex counting). "Simple" samples (Group B) usually rely on medium- or large-scale patterns or simple comparisons.

Difficult Samples Visualization
Figure A1. Visualization of "difficult" samples with downsampling ratio set to 2.
Simple Samples Visualization
Figure A2. Visualization of "simple" samples with downsampling ratio set to 2.

Statistical Analysis of Image Properties

We investigated whether sample difficulty could be attributed to basic low-level visual properties (entropy, brightness, contrast, colorfulness, size). Results show no statistically significant difference in these properties between Group A and Group B, confirming that "difficulty" cannot be predicted from these elementary features alone and justifying the use of downsampling as a filter.
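A sketch of how such a check can be run is shown below, given image paths for the two groups. The colorfulness metric follows the common Hasler & Süsstrunk opponent-channel formulation, and the Mann-Whitney U test is one reasonable choice of significance test; the paper's exact statistical procedure may differ.

```python
import numpy as np
from PIL import Image
from scipy.stats import mannwhitneyu

def low_level_properties(path: str) -> dict:
    """Compute simple low-level visual properties of an image."""
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)
    gray = img.mean(axis=2)
    hist, _ = np.histogram(gray, bins=256, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins before taking logs
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    rg, yb = r - g, 0.5 * (r + g) - b  # opponent color channels
    colorfulness = (np.hypot(rg.std(), yb.std())
                    + 0.3 * np.hypot(rg.mean(), yb.mean()))
    return {
        "entropy": float(-(p * np.log2(p)).sum()),
        "brightness": float(gray.mean()),
        "contrast": float(gray.std()),
        "colorfulness": float(colorfulness),
        "size": img.shape[0] * img.shape[1],
    }

def compare_groups(paths_a, paths_b, key):
    """Test whether one property differs between Group A and Group B."""
    a = [low_level_properties(p)[key] for p in paths_a]
    b = [low_level_properties(p)[key] for p in paths_b]
    return mannwhitneyu(a, b)  # large p-value => no significant difference
```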

Statistical Visualization
Figure A3. Visualization of low-level visual properties of two groups based on MMStar and POPE.

Citation

@article{liao2025vtcbench,
  title={Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods},
  author={Liao, Chenfei and Wang, Wensong and Wen, Zichen and Zheng, Xu and Wang, Yiyu and He, Haocong and Lyu, Yuanhuiyi and Jiang, Lutao and Zou, Xin and Fu, Yuqian and Ren, Bin and Zhang, Linfeng and Hu, Xuming},
  journal={arXiv preprint},
  year={2025}
}