Are We Using the Right Benchmark:
An Evaluation Framework for Visual Token Compression Methods

Chenfei Liao1,2,6, Wensong Wang3,2, Zichen Wen2,5, Xu Zheng1,4,6, Yiyu Wang2,
Haocong He2, Yuanhuiyi Lyu1,6, Lutao Jiang1,6, Xin Zou1,6, Yuqian Fu4, Bin Ren7,8,4,
Linfeng Zhang2,*, Xuming Hu1,6,*
1HKUST (Guangzhou) 2SJTU 3Northeastern University 4INSAIT
5Shanghai AI Laboratory 6HKUST 7University of Pisa 8University of Trento

Abstract

Recent efforts to accelerate inference in Multimodal Large Language Models (MLLMs) have largely focused on visual token compression. The effectiveness of these methods is commonly evaluated by measuring the accuracy drop on existing MLLM benchmarks before and after compression. However, these benchmarks were originally designed to assess general perception and reasoning abilities, not the specific challenges posed by visual token compression, leading to a fundamental task mismatch.

In this work, we uncover a counterintuitive yet consistent phenomenon: simple image downsampling outperforms many advanced visual token compression methods across multiple widely used benchmarks.

Through a comprehensive empirical study spanning eight popular benchmarks and multiple state-of-the-art compression techniques, we show that (i) current benchmarks contain substantial noise (task-irrelevant samples) for evaluating visual token compression, and (ii) downsampling can act as an effective data filter that distinguishes between simple and difficult samples with respect to compression sensitivity.

Motivated by these findings, we propose VTC-Bench, an evaluation framework that explicitly leverages downsampling as a discriminator to denoise existing benchmarks, enabling a fairer and more meaningful assessment of visual token compression methods.

Motivation

Some recent MLLMs, such as Qwen2-VL and Qwen2.5-VL, natively support inputs of varying resolutions. A trivial yet efficient way to handle high-resolution images is simply to downsample them to a lower resolution. However, most token compression methods for MLLMs instead adaptively drop useless tokens or merge similar tokens rather than directly downsampling the original image; in principle, these adaptive approaches should be more intelligent than naive downsampling.
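To make the baseline concrete, below is a minimal sketch of downsampling an image to a target visual-token budget. It assumes a Qwen2-VL-style tokenizer in which each 28×28-pixel block (14-pixel patches followed by 2×2 spatial merging) yields one visual token; the helper names and the constant are illustrative assumptions, not the paper's exact implementation.

```python
from PIL import Image

PIXELS_PER_TOKEN = 28  # assumption: 14-px patches + 2x2 merge (Qwen2-VL-style)

def visual_token_count(width: int, height: int) -> int:
    """Approximate number of visual tokens for a dynamic-resolution MLLM."""
    return (width // PIXELS_PER_TOKEN) * (height // PIXELS_PER_TOKEN)

def downsample_for_budget(image: Image.Image, keep_ratio: float) -> Image.Image:
    """Shrink the image so its token count drops to roughly keep_ratio.

    Token count scales with area, so each side is scaled by sqrt(keep_ratio).
    """
    scale = keep_ratio ** 0.5
    new_w = max(PIXELS_PER_TOKEN, round(image.width * scale))
    new_h = max(PIXELS_PER_TOKEN, round(image.height * scale))
    return image.resize((new_w, new_h), Image.BICUBIC)

# Example: keep 25% of visual tokens (i.e., 75% compression).
img = Image.open("example.jpg")  # placeholder path
small = downsample_for_budget(img, keep_ratio=0.25)
print(visual_token_count(*img.size), "->", visual_token_count(*small.size))
```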

Surprisingly, we find that image downsampling consistently exceeds these more sophisticated methods across a wide range of settings. Based on comprehensive experiments, we propose a bold hypothesis:

Some data in existing benchmarks is overly simplistic and irrelevant to evaluating visual token compression methods, which leads to the unreasonable phenomenon that even naive downsampling is sufficient for the visual token compression task.

Motivation: Downsampling Anomaly
Figure 1. The Anomaly. (a) Average Performance Retention Ratio (APRR) of five visual token compression methods on eight benchmarks (Model: Qwen2-VL-7B). (b) Comparison of advanced token compression methods and downsampling on Qwen2-VL-7B by groups at 75% compression.

To validate this, we design a data-centric analysis using downsampling as a discriminator. We identify two crucial findings:

  1. Current benchmarks are noisy for the visual token compression task. Many samples can be answered correctly even with significant downsampling, indicating they do not test fine-grained visual understanding.
  2. Downsampling can serve as a data filter. By separating samples into "simple" (Group B) and "difficult" (Group A) based on whether downsampling succeeds, we can effectively distinguish samples that truly require advanced compression.

VTC-Bench Framework

Based on these findings, we propose VTC-Bench, a new evaluation framework specifically designed to denoise existing benchmarks. By explicitly distinguishing between “simple” and “difficult” samples through downsampling, VTC-Bench adaptively selects the "difficult" samples that meet the requirements for evaluating visual token compression methods.

VTC-Bench Framework Pipeline
Figure 2. VTC-Bench Overview. VTC-Bench is a simple yet effective framework that transforms any existing benchmark into a subset on which visual token compression methods can be evaluated fairly.

The pipeline consists of three critical steps:

Step 1: Inference & Compression. Given a sample and a target token compression ratio, we run two inference pipelines on the target MLLM: (1) a downsampling baseline (the filter) and (2) the advanced visual token compression method under evaluation (e.g., FastV, VisionZip, DART).

Step 2: Grouping. We use the performance of the downsampling method as a binary discriminator to categorize samples:

  • Group A (Difficult Samples): Samples that are answered incorrectly by the downsampling method.
  • Group B (Simple Samples): Samples that are answered correctly by the downsampling method.

This step filters the existing benchmarks, removing noisy samples that are uninformative for evaluating visual token compression methods.

Step 3: Result Aggregation. We compute accuracy statistics over the "difficult" samples (Group A) to obtain an indicator that truly reflects the capability of visual token compression methods.
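The three steps can be summarized in a short sketch; here `downsample_fn`, `compress_fn`, `run_mllm`, and `is_correct` are hypothetical stand-ins for the actual preprocessing, inference, and scoring code, which this page does not specify.

```python
from statistics import mean

def vtc_bench(samples, downsample_fn, compress_fn, run_mllm, is_correct):
    """Minimal sketch of the VTC-Bench pipeline (hypothetical helpers).

    downsample_fn and compress_fn prepare visual inputs at the same target
    token budget; run_mllm performs inference on the target MLLM.
    """
    # Steps 1-2: run the downsampling filter and group samples by its outcome.
    group_a = []  # "difficult": the downsampling baseline answers incorrectly
    for s in samples:
        ds_answer = run_mllm(downsample_fn(s.image), s.question)
        if not is_correct(ds_answer, s.answer):
            group_a.append(s)  # Group B ("simple") samples are filtered out

    # Step 3: aggregate the compression method's accuracy on Group A only.
    scores = [
        is_correct(run_mllm(compress_fn(s.image), s.question), s.answer)
        for s in group_a
    ]
    return mean(scores) if scores else float("nan")
```

Since Group B is discarded in Step 2, the compression method only ever needs to be scored on Group A, which is exactly what the Step 3 aggregate reflects.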

Experiments & Findings

When evaluated using VTC-Bench (focusing on "Difficult Samples"), the landscape of visual token compression changes completely. Advanced methods prove their worth where it matters most.

Is downsampling all you need? Across many benchmarks, simple image downsampling often beats more advanced compression methods. VTC-Bench overturns this impression: when we restrict evaluation to the compression-relevant difficult samples (Group A), the trend reverses. By filtering out easy samples, VTC-Bench reveals that for truly challenging instances, advanced visual token compression methods are not only effective but necessary.

1. Standard Benchmark Results (Table 1)

Performance comparison on standard benchmarks (Qwen2-VL-7B); overall results for LLaVA-OV-7B are not reported in the paper. Results are shown under five compression settings, ordered from the lowest (Setting 1) to the highest (Setting 5) compression ratio.
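As a working definition consistent with the reported numbers, APRR (Average Performance Retention Ratio) averages the per-benchmark ratio between the compressed and vanilla scores over the N = 8 benchmarks:

    APRR = (100 / N) · Σᵢ (sᵢ_method / sᵢ_vanilla)

For example, the Downsample row of Setting 1 gives a mean ratio of (59.2/62.3 + 75.0/78.9 + ... + 65.0/81.6) / 8 ≈ 0.910, matching its reported APRR of 91.0.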

Setting 1 (lowest compression ratio)

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | OCR | ChartQA | APRR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla | 62.3 | 78.9 | 78.0 | 2306 | 88.4 | 57.1 | 80.7 | 81.6 | 100.0 |
| Downsample | 59.2 | 75.0 | 73.8 | 2259 | 86.2 | 50.1 | 64.9 | 65.0 | 91.0 |
| FastV | 57.0 | 73.7 | 73.1 | 2083 | 84.5 | 44.6 | 42.0 | 58.1 | 83.2 |
| VisionZip | 58.6 | 71.1 | 70.5 | 2062 | 87.1 | 47.2 | 42.1 | 66.9 | 84.9 |
| PruMerge+ | 59.4 | 72.1 | 72.0 | 2044 | 87.2 | 48.0 | 33.9 | 56.2 | 82.7 |
| DART | 56.9 | 72.5 | 70.2 | 2066 | 84.7 | 47.2 | 52.5 | 52.7 | 83.9 |

Setting 2

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | OCR | ChartQA | APRR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla | 62.3 | 78.9 | 78.0 | 2306 | 88.4 | 57.1 | 80.7 | 81.6 | 100.0 |
| Downsample | 55.5 | 69.0 | 70.2 | 2127 | 82.9 | 44.0 | 48.8 | 24.8 | 77.6 |
| FastV | 52.3 | 65.0 | 65.5 | 1854 | 77.4 | 40.3 | 25.9 | 32.9 | 70.2 |
| VisionZip | 53.3 | 62.9 | 63.0 | 1820 | 83.6 | 40.2 | 25.1 | 48.4 | 72.5 |
| PruMerge+ | 54.8 | 62.2 | 61.3 | 1806 | 84.3 | 38.4 | 22.2 | 44.2 | 71.0 |
| DART | 51.9 | 61.3 | 61.8 | 1915 | 80.5 | 39.8 | 41.0 | 30.8 | 71.6 |

Setting 3

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | OCR | ChartQA | APRR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla | 62.3 | 78.9 | 78.0 | 2306 | 88.4 | 57.1 | 80.7 | 81.6 | 100.0 |
| Downsample | 52.6 | 66.4 | 66.8 | 1994 | 79.5 | 40.9 | 40.3 | 12.7 | 71.0 |
| FastV | 49.0 | 57.1 | 57.9 | 1684 | 74.9 | 37.5 | 18.7 | 20.6 | 62.1 |
| VisionZip | 49.0 | 54.8 | 54.0 | 1704 | 80.2 | 35.2 | 15.9 | 28.0 | 62.2 |
| PruMerge+ | 48.7 | 48.4 | 48.1 | 1679 | 79.2 | 33.2 | 14.4 | 30.0 | 59.5 |
| DART | 49.2 | 53.4 | 54.0 | 1786 | 78.1 | 33.6 | 33.7 | 19.2 | 63.2 |

Setting 4

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | OCR | ChartQA | APRR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla | 62.3 | 78.9 | 78.0 | 2306 | 88.4 | 57.1 | 80.7 | 81.6 | 100.0 |
| Downsample | 50.1 | 62.0 | 61.4 | 1938 | 78.8 | 37.5 | 32.3 | 11.7 | 66.4 |
| FastV | 46.1 | 43.9 | 46.6 | 1589 | 72.4 | 33.6 | 14.4 | 15.8 | 54.5 |
| VisionZip | 46.4 | 49.5 | 50.0 | 1628 | 77.8 | 33.4 | 12.0 | 19.4 | 57.1 |
| PruMerge+ | 45.0 | 39.1 | 40.9 | 1544 | 74.0 | 30.5 | 10.5 | 20.9 | 52.1 |
| DART | 45.6 | 47.9 | 48.2 | 1701 | 74.7 | 31.7 | 29.3 | 16.6 | 58.3 |

Setting 5 (highest compression ratio)

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | OCR | ChartQA | APRR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla | 62.3 | 78.9 | 78.0 | 2306 | 88.4 | 57.1 | 80.7 | 81.6 | 100.0 |
| Downsample | 43.5 | 51.6 | 51.9 | 1589 | 72.8 | 33.8 | 13.2 | 12.1 | 55.4 |
| FastV | 38.2 | 23.9 | 24.5 | 1189 | 55.0 | 26.1 | 5.8 | 11.9 | 38.0 |
| VisionZip | 41.9 | 40.5 | 40.5 | 1335 | 65.5 | 30.8 | 4.9 | 12.8 | 47.3 |
| PruMerge+ | 39.0 | 23.7 | 24.4 | 1165 | 51.6 | 25.7 | 3.5 | 13.9 | 37.4 |
| DART | 40.5 | 30.8 | 30.7 | 1346 | 60.0 | 28.8 | 23.2 | 11.8 | 45.4 |

2. Group Comparison (Table 2)

The accuracy gap between Group A (Difficult) and Group B (Simple) under the same five compression settings as Table 1. MME is reported here as accuracy (%) rather than as a raw score. By construction, the downsampling filter scores 100 on Group B and 0 on Group A. The Average column includes OCRBench, which is omitted from the per-benchmark columns here (the Group A OCRBench numbers appear in the VTC-Bench tables below).

Setting 1 (lowest compression ratio)

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | ChartQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Group B (Simple)* | | | | | | | | |
| FastV | 87.6 | 95.9 | 95.8 | 96.7 | 94.8 | 76.0 | 78.1 | 85.3 |
| VisionZip | 91.2 | 93.8 | 93.6 | 95.3 | 96.8 | 81.4 | 87.3 | 87.2 |
| PruMerge+ | 91.9 | 95.1 | 94.6 | 95.9 | 97.5 | 82.3 | 73.6 | 84.6 |
| DART | 88.1 | 94.9 | 94.6 | 94.9 | 94.5 | 77.7 | 69.0 | 85.5 |
| Downsample | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| *Group A (Difficult)* | | | | | | | | |
| FastV | 57.8 | 45.2 | 56.5 | 78.9 | 65.4 | 41.0 | 35.0 | 51.1 |
| VisionZip | 59.3 | 42.4 | 42.2 | 54.9 | 72.5 | 45.9 | 51.2 | 49.8 |
| PruMerge+ | 57.7 | 51.2 | 52.6 | 62.0 | 72.1 | 48.1 | 40.5 | 50.7 |
| DART | 58.9 | 54.8 | 52.2 | 67.6 | 69.4 | 47.0 | 39.0 | 53.6 |
| Downsample | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

Setting 2

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | ChartQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Group B (Simple)* | | | | | | | | |
| FastV | 82.5 | 90.3 | 90.8 | 94.0 | 88.7 | 73.0 | 61.7 | 77.8 |
| VisionZip | 83.4 | 89.0 | 88.1 | 92.2 | 92.3 | 73.0 | 74.4 | 78.6 |
| PruMerge+ | 85.8 | 87.2 | 86.4 | 91.9 | 94.2 | 71.6 | 73.8 | 78.0 |
| DART | 81.2 | 87.7 | 86.9 | 91.7 | 90.9 | 70.0 | 57.6 | 78.6 |
| Downsample | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| *Group A (Difficult)* | | | | | | | | |
| FastV | 44.5 | 39.2 | 44.1 | 59.4 | 46.8 | 31.0 | 28.4 | 38.9 |
| VisionZip | 49.4 | 33.2 | 44.4 | 48.1 | 70.0 | 30.3 | 49.7 | 43.4 |
| PruMerge+ | 50.4 | 36.9 | 38.4 | 42.9 | 71.5 | 28.8 | 43.5 | 41.3 |
| DART | 47.5 | 40.5 | 40.9 | 49.6 | 57.7 | 35.4 | 27.3 | 41.3 |
| Downsample | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

Setting 3

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | ChartQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Group B (Simple)* | | | | | | | | |
| FastV | 81.4 | 85.7 | 86.6 | 91.5 | 88.1 | 74.5 | 74.8 | 77.0 |
| VisionZip | 79.0 | 81.9 | 82.2 | 88.4 | 89.4 | 69.8 | 71.3 | 73.4 |
| PruMerge+ | 76.7 | 76.9 | 76.1 | 87.8 | 87.6 | 65.5 | 68.9 | 70.2 |
| DART | 78.8 | 81.8 | 80.4 | 88.9 | 88.5 | 61.8 | 67.1 | 75.6 |
| Downsample | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| *Group A (Difficult)* | | | | | | | | |
| FastV | 35.7 | 31.9 | 35.3 | 48.8 | 37.4 | 22.8 | 14.8 | 30.0 |
| VisionZip | 41.0 | 34.5 | 33.3 | 43.5 | 66.3 | 24.3 | 26.1 | 35.4 |
| PruMerge+ | 43.0 | 29.6 | 34.1 | 43.0 | 67.7 | 25.5 | 29.4 | 35.6 |
| DART | 41.9 | 33.8 | 38.4 | 46.9 | 57.0 | 26.2 | 14.5 | 35.5 |
| Downsample | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

Setting 4

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | ChartQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Group B (Simple)* | | | | | | | | |
| FastV | 79.2 | 74.5 | 74.4 | 90.1 | 85.8 | 68.6 | 75.3 | 71.9 |
| VisionZip | 75.6 | 80.3 | 80.2 | 87.4 | 86.5 | 69.6 | 69.2 | 71.2 |
| PruMerge+ | 72.5 | 69.3 | 69.1 | 83.9 | 82.2 | 61.4 | 70.3 | 65.9 |
| DART | 74.5 | 77.3 | 75.1 | 84.9 | 84.3 | 63.5 | 71.1 | 73.1 |
| Downsample | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| *Group A (Difficult)* | | | | | | | | |
| FastV | 29.6 | 24.7 | 33.1 | 35.6 | 35.6 | 21.5 | 9.3 | 25.0 |
| VisionZip | 38.6 | 31.2 | 32.6 | 37.9 | 60.0 | 24.5 | 15.1 | 31.4 |
| PruMerge+ | 38.8 | 26.0 | 29.3 | 37.0 | 56.4 | 22.6 | 17.0 | 29.5 |
| DART | 36.4 | 33.7 | 36.4 | 37.9 | 53.2 | 22.3 | 10.5 | 31.9 |
| Downsample | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

Setting 5 (highest compression ratio)

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | ChartQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Group B (Simple)* | | | | | | | | |
| FastV | 75.5 | 55.6 | 53.2 | 75.3 | 59.3 | 61.0 | 73.3 | 59.1 |
| VisionZip | 79.4 | 78.3 | 76.9 | 80.5 | 70.2 | 69.4 | 70.3 | 67.8 |
| PruMerge+ | 73.9 | 55.9 | 52.6 | 73.9 | 49.7 | 57.0 | 69.5 | 55.7 |
| DART | 76.1 | 63.2 | 59.0 | 73.2 | 67.4 | 63.7 | 70.7 | 64.6 |
| Downsample | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| *Group A (Difficult)* | | | | | | | | |
| FastV | 18.3 | 18.3 | 21.5 | 21.5 | 44.3 | 15.0 | 3.8 | 18.4 |
| VisionZip | 23.4 | 28.8 | 32.2 | 28.5 | 53.6 | 19.4 | 5.5 | 24.4 |
| PruMerge+ | 20.7 | 17.8 | 21.1 | 22.9 | 52.6 | 17.1 | 7.1 | 20.2 |
| DART | 24.5 | 26.5 | 28.1 | 30.6 | 41.5 | 19.2 | 4.2 | 25.0 |
| Downsample | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Group Analysis
Figure 4. Comparison of advanced token compression methods and downsampling on Qwen2-VL-7B by groups at 75% compression.

3. VTC-Bench Results (Qwen2-VL-7B & LLaVA-OV-7B)

Comparison of advanced token compression methods on VTC-Bench (the "difficult" Group A samples) under the same five compression settings. The best results are bolded, and the second-best results are italicized.

VTC-Bench Results on Qwen2-VL-7B
Figure 3. VTC-Bench results on Qwen2-VL-7B. Advanced methods show significant advantages on difficult samples.

Setting 1 (lowest compression ratio)

Qwen2-VL-7B Results

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | OCR | ChartQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FastV | 57.8 | 45.2 | **56.5** | **78.9** | 65.4 | 41.0 | 29.1 | 35.0 | *51.1* |
| VisionZip | **59.3** | 42.4 | 42.2 | 54.9 | **72.5** | 45.9 | *29.6* | **51.2** | 49.8 |
| PruMerge+ | 57.7 | *51.2* | *52.6* | 62.0 | *72.1* | **48.1** | 21.2 | *40.5* | 50.7 |
| DART | *58.9* | **54.8** | 52.2 | *67.6* | 69.4 | *47.0* | **40.2** | 39.0 | **53.6** |

LLaVA-OV-7B Results

| Method | GQA | MMB | MMB-CN | POPE | MMStar | Average |
| --- | --- | --- | --- | --- | --- | --- |
| FastV | 54.3 | *70.5* | 69.1 | 63.8 | **48.6** | 61.3 |
| VisionZip | *59.0* | 67.7 | *71.3* | **80.8** | 44.8 | *64.7* |
| PruMerge+ | **60.4** | **74.2** | **73.5** | *75.6* | **48.6** | **66.5** |

Setting 2

Qwen2-VL-7B Results

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | OCR | ChartQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FastV | 44.5 | *39.2* | *44.1* | **59.4** | 46.8 | *31.0* | 17.8 | 28.4 | 38.9 |
| VisionZip | *49.4* | 33.2 | **44.4** | 48.1 | *70.0* | 30.3 | *22.0* | **49.7** | **43.4** |
| PruMerge+ | **50.4** | 36.9 | 38.4 | 42.9 | **71.5** | 28.8 | 18.1 | *43.5* | *41.3* |
| DART | 47.5 | **40.5** | 40.9 | *49.6* | 57.7 | **35.4** | **31.5** | 27.3 | *41.3* |

LLaVA-OV-7B Results

| Method | GQA | MMB | MMB-CN | POPE | MMStar | Average |
| --- | --- | --- | --- | --- | --- | --- |
| FastV | 45.3 | 64.6 | 66.4 | 39.1 | 42.4 | 51.6 |
| VisionZip | *56.6* | **71.9** | *71.2* | *69.6* | *43.5* | *62.6* |
| PruMerge+ | **57.4** | *68.8* | **71.5** | **76.0** | **45.8** | **63.9** |

Setting 3

Qwen2-VL-7B Results

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | OCR | ChartQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FastV | 35.7 | 31.9 | *35.3* | **48.8** | 37.4 | 22.8 | 13.3 | 14.8 | 30.0 |
| VisionZip | 41.0 | **34.5** | 33.3 | 43.5 | *66.3* | 24.3 | *14.0* | *26.1* | 35.4 |
| PruMerge+ | **43.0** | 29.6 | 34.1 | 43.0 | **67.7** | *25.5* | 12.6 | **29.4** | **35.6** |
| DART | *41.9* | *33.8* | **38.4** | *46.9* | 57.0 | **26.2** | **25.6** | 14.5 | *35.5* |

LLaVA-OV-7B Results

| Method | GQA | MMB | MMB-CN | POPE | MMStar | Average |
| --- | --- | --- | --- | --- | --- | --- |
| FastV | 36.7 | 51.2 | 53.3 | 29.7 | 32.6 | 40.7 |
| VisionZip | *49.1* | *64.3* | *62.4* | *53.6* | **36.6** | *53.2* |
| PruMerge+ | **50.2** | **66.6** | **65.3** | **59.9** | *34.8* | **55.4** |

Setting 4

Qwen2-VL-7B Results

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | OCR | ChartQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FastV | 29.6 | 24.7 | *33.1* | 35.6 | 35.6 | 21.5 | *11.0* | 9.3 | 25.0 |
| VisionZip | *38.6* | *31.2* | 32.6 | **37.9** | **60.0** | **24.5** | *11.0* | *15.1* | *31.4* |
| PruMerge+ | **38.8** | 26.0 | 29.3 | *37.0* | *56.4* | *22.6* | 9.2 | **17.0** | 29.5 |
| DART | 36.4 | **33.7** | **36.4** | **37.9** | 53.2 | 22.3 | **24.7** | 10.5 | **31.9** |

LLaVA-OV-7B Results

| Method | GQA | MMB | MMB-CN | POPE | MMStar | Average |
| --- | --- | --- | --- | --- | --- | --- |
| FastV | 31.4 | 37.4 | 43.1 | 24.5 | 28.6 | 33.0 |
| VisionZip | *42.6* | *55.4* | *56.9* | *45.4* | *30.8* | *46.2* |
| PruMerge+ | **42.7** | **57.8** | **59.6** | **49.9** | **31.3** | **48.3** |

Setting 5 (highest compression ratio)

Qwen2-VL-7B Results

| Method | GQA | MMB | MMB-CN | MME | POPE | MMStar | OCR | ChartQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FastV | 18.3 | 18.3 | 21.5 | 21.5 | 44.3 | 15.0 | *4.2* | 3.8 | 18.4 |
| VisionZip | *23.4* | **28.8** | **32.2** | *28.5* | **53.6** | **19.4** | 3.7 | *5.5* | *24.4* |
| PruMerge+ | 20.7 | 17.8 | 21.1 | 22.9 | *52.6* | 17.1 | 2.5 | **7.1** | 20.2 |
| DART | **24.5** | *26.5* | *28.1* | **30.6** | 41.5 | *19.2* | **25.6** | 4.2 | **25.0** |

LLaVA-OV-7B Results

| Method | GQA | MMB | MMB-CN | POPE | MMStar | Average |
| --- | --- | --- | --- | --- | --- | --- |
| FastV | *25.7* | *25.8* | *29.6* | 39.3 | 21.9 | 28.5 |
| VisionZip | **28.3** | **28.1** | **32.8** | **42.1** | *24.7* | **31.2** |
| PruMerge+ | 25.3 | 25.5 | 28.5 | *40.4* | **25.2** | *29.0* |

Conclusion

This paper systematically analyzes the task mismatch present in current MLLM benchmarks when they are used to evaluate visual token compression methods. Starting from a surprising and counterintuitive finding, namely that simple image downsampling consistently outperforms many advanced compression methods across multiple widely used benchmarks, we conduct a comprehensive empirical study spanning several state-of-the-art visual token compression methods.

This study yields two crucial findings:

  1. Current benchmarks are noisy for the visual token compression task.
  2. Downsampling can serve as a data filter that gauges sample difficulty with respect to the visual token compression task.

Furthermore, we propose VTC-Bench, a new evaluation framework specifically designed to denoise existing benchmarks by explicitly distinguishing between “simple” and “difficult” samples through downsampling. Through this work, we hope not only to advance the field of visual token compression but also to inspire broader discussion within the community on "how to properly evaluate efficient MLLMs."

Limitations

Dependence on a Single Base Model: The sample filtering mechanism is built entirely on Qwen2-VL. While this choice was driven by technical necessity (the filter requires native dynamic-resolution support), the reliance limits generalizability, as different models may define "difficult" samples slightly differently. In the future, this framework will be extended to more base models with strong dynamic-resolution support.

Single Criterion for Filtering: We use only downsampling as the criterion for identifying "difficult" samples. Although this is effective, employing a more diverse set of baseline methods for ensemble filtering could yield a more robust definition of difficulty. We will explore such a filtering mechanism in a subsequent version of VTC-Bench.

Lack of a Formal Theoretical Definition: Our identification of "difficult" and "simple" samples is primarily based on experimental results. A formal theoretical framework and a mathematical proof for a more explainable sample filtering mechanism are yet to be established.

Limited Coverage of Methods: Due to the rapidly evolving field and adaptation constraints, our experimental comparison does not include all visual token compression methods. However, the selected methods are sufficient to support our core claims, and the framework remains open and extensible.

Appendix

Concurrent Similar Research

VisionThink: VisionThink identifies that downsampling shows surprising effectiveness on most general VQA tasks except for OCR and detail-sensitive benchmarks. It employs a reinforcement learning framework to dynamically decide if high-resolution images are needed. Our work further investigates this observation through extensive experiments, clarifying the task mismatch in current benchmarks and proving downsampling's potential as a data filter.

EffiVLM-Bench: EffiVLM-Bench offers a unified evaluation framework for assessing training-free acceleration techniques. Unlike EffiVLM-Bench, which aims to provide a toolkit, VTC-Bench specifically targets the task mismatch problem in visual token compression evaluation.

Experiment Details

All experiments are conducted on a single A800 GPU. Code environment: Python 3.10, torch 2.6.0, torchvision 0.21.0. Bicubic interpolation is used for downsampling.
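For reference, bicubic downsampling in this environment can be reproduced with torchvision; this is a sketch, with the scale factor left as a parameter that depends on the target compression ratio.

```python
import torch
from torchvision.transforms import InterpolationMode
from torchvision.transforms import functional as F

def bicubic_downsample(image: torch.Tensor, scale: float) -> torch.Tensor:
    """Downsample a CHW image tensor by `scale` with bicubic interpolation."""
    h, w = image.shape[-2:]
    new_size = [max(1, int(h * scale)), max(1, int(w * scale))]
    return F.resize(image, new_size,
                    interpolation=InterpolationMode.BICUBIC, antialias=True)
```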

Benchmark Details

  • GQA: 22M reasoning questions with strict distribution control.
  • MMBench: 3,217 multiple-choice questions across 20 dimensions.
  • MME: Evaluates perceptual and cognitive abilities across 14 subtasks.
  • POPE: Evaluates object hallucination under different sampling strategies.
  • MMStar: 1,500 samples covering six core abilities.
  • OCRBench: 1,000 manually verified samples for OCR capabilities.
  • ChartQA: Evaluates visual and logical reasoning over charts.

Visualization between Groups

We visualize "difficult" and "simple" samples to observe their characteristics. "Difficult" samples (Group A) tend to require perceiving and comparing multiple fine details (e.g., finding extreme values in charts or complex counting). "Simple" samples (Group B) usually rely on medium- or large-scale patterns or simple comparisons.

Difficult Samples Visualization
Figure A1. Visualization of "difficult" samples with downsampling ratio set to 2.
Simple Samples Visualization
Figure A2. Visualization of "simple" samples with downsampling ratio set to 2.

Statistical Analysis of Image Properties

We investigated whether sample difficulty could be attributed to basic low-level visual properties (entropy, brightness, contrast, colorfulness, size). Results show no statistically significant difference in these properties between Group A and Group B, confirming that "difficulty" cannot be predicted from these elementary features alone and justifying the use of downsampling as a filter.
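A sketch of how such a check can be run is shown below, given image paths for the two groups. The colorfulness metric follows the common Hasler & Süsstrunk opponent-channel formulation, and the Mann-Whitney U test is one reasonable choice of significance test; the paper's exact statistical procedure may differ.

```python
import numpy as np
from PIL import Image
from scipy.stats import mannwhitneyu

def low_level_properties(path: str) -> dict:
    """Compute simple low-level visual properties of an image."""
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)
    gray = img.mean(axis=2)
    hist, _ = np.histogram(gray, bins=256, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins before taking logs
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    rg, yb = r - g, 0.5 * (r + g) - b  # opponent color channels
    colorfulness = (np.hypot(rg.std(), yb.std())
                    + 0.3 * np.hypot(rg.mean(), yb.mean()))
    return {
        "entropy": float(-(p * np.log2(p)).sum()),
        "brightness": float(gray.mean()),
        "contrast": float(gray.std()),
        "colorfulness": float(colorfulness),
        "size": img.shape[0] * img.shape[1],
    }

def compare_groups(paths_a, paths_b, key):
    """Test whether one property differs between Group A and Group B."""
    a = [low_level_properties(p)[key] for p in paths_a]
    b = [low_level_properties(p)[key] for p in paths_b]
    return mannwhitneyu(a, b)  # large p-value => no significant difference
```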

Statistical Visualization
Figure A3. Visualization of low-level visual properties of two groups based on MMStar and POPE.

Citation

@article{liao2025vtcbench,
  title={Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods},
  author={Liao, Chenfei and Wang, Wensong and Wen, Zichen and Zheng, Xu and Wang, Yiyu and He, Haocong and Lyu, Yuanhuiyi and Jiang, Lutao and Zou, Xin and Fu, Yuqian and Ren, Bin and Zhang, Linfeng and Hu, Xuming},
  journal={arXiv preprint},
  year={2025}
}