Providers

Llama Vision (Tiles)

How Meta Llama 3.2/4 Vision tiles images and how VisionSqueezer snaps to the 560px canvas.

Experimental connector. The 560px tile grid, 14px patch and 4-tile cap are architectural constants (transformers MllamaVisionConfig). The per-tile token cost (~1601) can vary by host (Together, Amazon Bedrock, Fireworks, Groq) — verify against your provider's billing. The tile count is what drives savings and is stable.

How Llama Vision bills images

Meta Llama 3.2 / 3.3 Vision (Mllama) tile images on a 560×560 grid:

The image is fit into an aspect-ratio canvas built from 560px tiles.
The canvas is capped at max_num_tiles = 4 (e.g. 2×2 / 1×4 / 4×1).

There is no separate global-thumbnail tile — the canvas is the full representation. With a 14px ViT patch, each 560px tile is 40×40 = 1600 patches (+1 CLS) ≈ 1601 tokens/tile.

tiles  = min(⌈W/560⌉ · ⌈H/560⌉, 4)
tokens ≈ tiles × 1601

The spill-over trap

Like Gemini, the tiles are large — crossing a 560px boundary by a few pixels adds an entire tile:

Image	Tiles	Tokens
560×560	1	~1,601
561×560	2	~3,202

32 extra pixels of width ≈ 1,600 extra tokens.

Optimization strategy

Snap each side down to the 560px grid within the 2×2 (1120px) max canvas, eliminating spill-over tiles.

Terminal

vision-squeezer image.png --model llama

CLI aliases: llama, llama-vision. MCP target_model: "llama".

Token savings

Each 560px tile is ~1,601 tokens, so removing a single tile row or column is a big proportional win.

Scenario	Before	After	Saved
2400×1670 screenshot → trim padding to 2400×1200	4 tiles · 6,404 tok	2 tiles · 3,202 tok	−50%

Dropping the image from a 2×2 canvas to a 2×1 canvas halves the bill. Savings depend entirely on how much padding pushes you into an extra tile — images already snug inside their tiles save little.

Source

Grid constants from the transformers MllamaVisionConfig (image_size 560, patch_size 14, max_num_tiles 4) — the architecture behind Meta's Llama 3.2 Vision (released Sep 2024) and Llama 3.3. The ~1601 tokens/tile is the model's own footprint (560/14)² + 1; hosted APIs (Together, Bedrock, Fireworks, Groq) may bill a different per-tile rate, so treat absolute tokens as indicative. Llama 4 uses a different native-multimodal vision encoder (Llama4ForConditionalGeneration, not Mllama) and is not modeled by this connector. Verified 2026-06-11.

Gemini (Large Tiles)

How Google Gemini tiles images and why snapping to 768px boundaries halves the cost.

Qwen-VL (Patch Grid)

How Alibaba Qwen2-VL / Qwen2.5-VL tokenizes images on a 28px grid and how VisionSqueezer snaps to it.