DeepSeek-VL (Open Weights)
The DeepSeek API (deepseek-chat / deepseek-reasoner) is text-only; DeepSeek-VL2 ships as open weights. Optimizing here therefore saves local-inference context budget and latency, not API dollars. Verify if/when a vision endpoint ships.
How DeepSeek-VL2 processes images
DeepSeek-VL2 uses a 384×384 global view plus dynamic local tiles on an anyres canvas of (m·384, n·384) with m·n ≤ 9. The encoder is SigLIP-SO400M-384 (14px patch) with a 2× downsample, giving a per-view grid side of h = ⌈(384/14)/2⌉ = 14.
The exact token count (from tokenize_with_images):
h = 14
global = h·(h+1) = 210 # +1 per row = line separator
separator = 1
local = (nh·h)·(nw·h + 1) # nw·nh ≤ 9
tokens = global + separator + local = 211 + local
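The formula above can be checked with a few lines of Python. This is a minimal sketch rather than the reference implementation: the function names `grid_side` and `vl2_tokens` are hypothetical, and the constants are the ones quoted in the text (384px view, 14px patch, 2× downsample, at most 9 local tiles).

```python
import math

# Constants quoted in the text above; names are illustrative only.
IMAGE_SIZE = 384   # side of one view, in pixels
PATCH_SIZE = 14    # SigLIP patch size
DOWNSAMPLE = 2     # 2x downsample after the encoder
MAX_TILES = 9      # nw * nh must not exceed this


def grid_side() -> int:
    """Per-view grid side h = ceil((384 / 14) / 2) = 14."""
    return math.ceil((IMAGE_SIZE / PATCH_SIZE) / DOWNSAMPLE)


def vl2_tokens(nw: int, nh: int) -> int:
    """Token count for an nw x nh local-tile grid plus the global view."""
    if nw < 1 or nh < 1 or nw * nh > MAX_TILES:
        raise ValueError("tile grid must satisfy 1 <= nw * nh <= 9")
    h = grid_side()
    global_view = h * (h + 1)        # +1 per row for the line separator
    separator = 1                    # between global view and local tiles
    local = (nh * h) * (nw * h + 1)  # rows x (columns + line separator)
    return global_view + separator + local


if __name__ == "__main__":
    for nw, nh in [(1, 1), (2, 2), (3, 3)]:
        print(f"{nw}x{nh}: {vl2_tokens(nw, nh)} tokens")
```

Under these assumptions the 1×1, 2×2 and 3×3 grids reproduce the 421, 1,023 and 2,017 figures used in the tables on this page.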
The 384px boundary
| Image | nw × nh | Tokens |
|---|---|---|
| ≤ 384×384 | 1×1 | 211 + 210 = 421 |
| 768×768 | 2×2 | 211 + 28·29 = 1,023 |
| 1152×1152 | 3×3 | 211 + 42·43 = 2,017 |
Snapping each side down to the 384px grid keeps nw·nh (and the token bill) minimal.
Optimization strategy
An image that fits within 384×384 stays a single tile; otherwise, snap each side longer than 384px down to the nearest multiple of 384px.
vision-squeezer image.png --model deepseek
CLI aliases: deepseek, deepseek-vl. MCP target_model: "deepseek".
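The snapping rule can be sketched as a small size calculation. This is an illustration of the rule as stated on this page, not the vision-squeezer implementation: `snap_down` is a hypothetical helper, it only computes the target dimensions (the actual resize is left to whatever image library you use), and it does not enforce the nw·nh ≤ 9 tile budget.

```python
TILE = 384  # tile side in pixels, as described above


def snap_side(side: int) -> int:
    """Leave a side of at most 384px alone; otherwise round it down
    to the nearest multiple of 384px."""
    if side <= TILE:
        return side
    return (side // TILE) * TILE


def snap_down(width: int, height: int) -> tuple:
    """Target (width, height) after snapping each side to the 384px grid."""
    if width < 1 or height < 1:
        raise ValueError("image dimensions must be positive")
    return snap_side(width), snap_side(height)


if __name__ == "__main__":
    for size in [(800, 768), (385, 384), (300, 200)]:
        print(size, "->", snap_down(*size))
```

Rounding down rather than to the nearest multiple is the point of the rule: it never adds a tile row or column, at the cost of a small downscale.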
Token savings
Crossing a 384px boundary adds a whole tile row or column — snapping back undoes it.
| Scenario | Before | After | Saved |
|---|---|---|---|
| 800×768 → snap to 768×768 | 3×2 tiles · 1,415 tok | 2×2 tiles · 1,023 tok | −28% |
| 385×384 (1px over a tile) → 384×384 | 617 tok | 421 tok | −32% |
A single pixel past a 384px edge can cost roughly 47% more tokens (617 vs. 421); snapping back recovers that as a ~32% saving. Snapping each side down to the grid keeps nw·nh minimal.
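The rows in the savings table can be recomputed from pixel dimensions. The sketch below assumes that the tile grid is the ceiling of each side divided by 384, which is what the table rows imply (385×384 gives 2×1, 800×768 gives 3×2); the function names are hypothetical and the real preprocessor may choose its grid differently.

```python
import math

TILE = 384  # tile side in pixels
H = 14      # per-view grid side from the formula above


def tile_grid(width: int, height: int) -> tuple:
    """Assumed mapping from pixel size to (nw, nh): ceiling division by 384."""
    return max(1, math.ceil(width / TILE)), max(1, math.ceil(height / TILE))


def tokens_for(width: int, height: int) -> int:
    """Global view + separator + local tiles for an image of this size."""
    nw, nh = tile_grid(width, height)
    return H * (H + 1) + 1 + (nh * H) * (nw * H + 1)


def saving(before: tuple, after: tuple) -> tuple:
    """Return (tokens before, tokens after, fraction saved)."""
    b = tokens_for(*before)
    a = tokens_for(*after)
    return b, a, (b - a) / b


if __name__ == "__main__":
    for before, after in [((800, 768), (768, 768)), ((385, 384), (384, 384))]:
        b, a, frac = saving(before, after)
        print(f"{before} -> {after}: {b} -> {a} tokens, saved {frac:.0%}")
```

Note that the saving is measured against the larger "before" count; measured against the snapped size, the same gap is a larger percentage increase.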
Source
Formula taken verbatim from the DeepSeek-VL2 technical report, §2 Model Architecture (arXiv:2412.10302, submitted 13 Dec 2024) and the reference implementation processing_deepseek_vl_v2.py. Grid constants (patch_size 14, siglip_so400m_patch14_384, candidate_resolutions up to 1152×1152) cross-checked against the model config.json. Verified 2026-06-11.
