Providers

Qwen-VL (Patch Grid)

How Alibaba Qwen2-VL / Qwen2.5-VL tokenizes images on a 28px grid and how VisionSqueezer snaps to it.

Experimental connector. The 28px effective grid and the patch-count formula are stable (qwen_vl_utils.smart_resize). The token clamp defaults shown here are the library defaults; a DashScope endpoint may cap lower via a per-request max_pixels. Verify against your deployment.

How Qwen-VL bills images

Alibaba Qwen2-VL / Qwen2.5-VL use a 14px ViT patch with a 2×2 spatial merge, giving a 28px effective grid (image_factor = 14 × 2 = 28). smart_resize rounds each side to a multiple of 28; the image token count is the number of merged patches:

tokens = (W/28) · (H/28)

bounded by [IMAGE_MIN_TOKEN_NUM, IMAGE_MAX_TOKEN_NUM] = [4, 16384] tokens (the qwen_vl_utils defaults: max_pixels = 16384 × 28²). Production DashScope tiers often set a lower max_pixels per request.

Does snapping help?

The patch is small (28px), so spill-over per step is modest — but two effects still pay off:

Snapping to the 28px grid removes fractional rows/columns that round up.
Padding strip lowers the pixel area, which directly lowers the patch count until the max_pixels clamp is hit.

For high-resolution inputs the image is first scaled under max_pixels, so the win is often capped — Qwen's value is more about file size + context budget than dramatic token cuts.

Terminal

vision-squeezer image.png --model qwen

CLI aliases: qwen, qwen-vl. MCP target_model: "qwen".

Token savings

The 28px patch is small, so the lever here is area (padding strip), not boundary snapping.

Scenario	Before	After	Saved
1024×1024 → strip solid border to 896×896	1,369 tok	1,024 tok	−25%
1024×1024 → 28px grid snap only (1008×1008)	1,369 tok	1,296 tok	−5%

Cropping wasted area pays off; pure grid snapping is incremental. For very large inputs the count is also capped at 16,384 tokens, so savings there come from staying under that ceiling.

Source

Grid and clamp taken from the official qwen_vl_utils.smart_resize reference implementation (QwenLM/Qwen2.5-VL) — image_factor = patch_size(14) × spatial_merge(2) = 28, bounds [IMAGE_MIN_TOKEN_NUM, IMAGE_MAX_TOKEN_NUM] = [4, 16384]. The same 28px grid carries across Qwen2-VL, Qwen2.5-VL, and Qwen3-VL (identical vision_process.py). Background: Qwen2.5-VL technical report (arXiv:2502.13923, Feb 2025). Verified 2026-06-11.

Llama Vision (Tiles)

How Meta Llama 3.2/4 Vision tiles images and how VisionSqueezer snaps to the 560px canvas.

DeepSeek-VL (Open Weights)

How DeepSeek-VL2 tiles images at 384px and why the win is local-inference context, not API billing.