Providers

DeepSeek-VL (Open Weights)

How DeepSeek-VL2 tiles images at 384px and why the win is local-inference context, not API billing.

Open-weights, not a billed API. At time of writing DeepSeek's public API (deepseek-chat / deepseek-reasoner) is text-only — DeepSeek-VL2 ships as open weights. So optimizing here saves your local-inference context budget and latency, not API dollars. Verify if/when a vision endpoint ships.

How DeepSeek-VL2 processes images

DeepSeek-VL2 uses a 384×384 global view plus dynamic local tiles on an anyres canvas of (m·384, n·384) with m·n ≤ 9. The encoder is SigLIP-SO400M-384 (14px patch) with a 2× downsample, giving a per-view grid side of h = ⌈(384/14)/2⌉ = 14.

The exact token count (from tokenize_with_images):

h          = 14
global view = h·(h+1) = 210          # +1 per row = line separator
separator   = 1
local tiles = (nh·h)·(nw·h + 1)      # nw·nh ≤ 9
tokens      = 210 + 1 + local

The 384px boundary

Image	nw × nh	Tokens
≤ 384×384	1×1	211 + 210 = 421
768×768	2×2	211 + 28·29 = 1,023
1152×1152	3×3	211 + 42·43 = 2,017

Snapping each side down to the 384px grid keeps nw·nh (and the token bill) minimal.

Optimization strategy

≤384px stays a single tile; otherwise snap each side down to the 384px grid.

Terminal

vision-squeezer image.png --model deepseek

CLI aliases: deepseek, deepseek-vl. MCP target_model: "deepseek".

Token savings

Crossing a 384px boundary adds a whole tile row or column — snapping back undoes it.

Scenario	Before	After	Saved
800×768 → snap to 768×768	3×2 tiles · 1,415 tok	2×2 tiles · 1,023 tok	−28%
385×384 (1px over a tile) → 384×384	617 tok	421 tok	−32%

A single pixel past a 384px edge can cost ~30% more. Snapping each side down to the grid keeps nw·nh minimal.

Source

Formula taken verbatim from the DeepSeek-VL2 technical report, §2 Model Architecture (arXiv:2412.10302, submitted 13 Dec 2024) and the reference implementation processing_deepseek_vl_v2.py. Grid constants (patch_size 14, siglip_so400m_patch14_384, candidate_resolutions up to 1152×1152) cross-checked against the model config.json. Verified 2026-06-11.

Qwen-VL (Patch Grid)

How Alibaba Qwen2-VL / Qwen2.5-VL tokenizes images on a 28px grid and how VisionSqueezer snaps to it.

Python Bindings

Use VisionSqueezer from Python via native pyo3 wheels.