Llama Vision (Tiles)
MllamaVisionConfig). The per-tile token cost (~1601) can vary by host (Together, Amazon Bedrock, Fireworks, Groq), so verify against your provider's billing. The tile count is what drives savings and is stable.
How Llama Vision bills images
Meta Llama 3.2 / 3.3 Vision (Mllama) tile images on a 560×560 grid:
- The image is fit into an aspect-ratio canvas built from 560px tiles.
- The canvas is capped at max_num_tiles = 4 (e.g. 2×2 / 1×4 / 4×1).
There is no separate global-thumbnail tile — the canvas is the full representation. With a 14px ViT patch, each 560px tile is 40×40 = 1600 patches (+1 CLS) ≈ 1601 tokens/tile.
tiles = min(⌈W/560⌉ · ⌈H/560⌉, 4)
tokens ≈ tiles × 1601
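The two formulas above can be sketched in a few lines of Python. The function names are illustrative and not part of vision-squeezer; the constants are the ones quoted on this page (560px tiles, 14px patches, at most 4 tiles). The sketch applies the formula to the dimensions it is given. For images larger than the maximum canvas, the fit-to-canvas step may rescale the image first, so the raw formula may overestimate the tile count.

```python
# Constants as quoted on this page (image_size 560, patch_size 14, max_num_tiles 4).
TILE_PX = 560
PATCH_PX = 14
MAX_TILES = 4

# (560 / 14)^2 patches plus one CLS token = 1601 tokens per tile.
TOKENS_PER_TILE = (TILE_PX // PATCH_PX) ** 2 + 1


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return (a + b - 1) // b


def estimate_tiles(width: int, height: int) -> int:
    """tiles = min(ceil(W/560) * ceil(H/560), 4), applied to the given size."""
    return min(ceil_div(width, TILE_PX) * ceil_div(height, TILE_PX), MAX_TILES)


def estimate_tokens(width: int, height: int) -> int:
    """Approximate token footprint; hosted APIs may bill a different rate."""
    return estimate_tiles(width, height) * TOKENS_PER_TILE
```

For example, `estimate_tokens(560, 560)` gives 1601, while `estimate_tokens(561, 560)` gives 3202.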
The spill-over trap
Like Gemini, the tiles are large — crossing a 560px boundary by a few pixels adds an entire tile:
| Image | Tiles | Tokens |
|---|---|---|
| 560×560 | 1 | ~1,601 |
| 561×560 | 2 | ~3,202 |
One extra pixel of width ≈ 1,601 extra tokens.
Optimization strategy
Snap each side down to the 560px grid within the 2×2 (1120px) max canvas, eliminating spill-over tiles.
vision-squeezer image.png --model llama
CLI aliases: llama, llama-vision. MCP target_model: "llama".
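As a rough illustration of the snapping idea, the sketch below computes a per-side target size on the 560px grid. This is an assumption about how such snapping could work, not vision-squeezer's actual implementation. `snap_side` and `snap_size` are hypothetical names, and the sketch does not decide whether the target is reached by trimming padding or by resizing.

```python
TILE_PX = 560            # tile edge quoted on this page
MAX_SIDE = 2 * TILE_PX   # 1120px, the side of the 2x2 max canvas


def snap_side(px: int) -> int:
    """Snap one side down to a multiple of 560px, capped at 1120px.

    Sides that already fit inside a single tile are left unchanged.
    """
    if px <= TILE_PX:
        return px
    return min((px // TILE_PX) * TILE_PX, MAX_SIDE)


def snap_size(width: int, height: int) -> tuple[int, int]:
    """Hypothetical helper: target (width, height) with no spill-over tiles."""
    return snap_side(width), snap_side(height)
```

With this sketch, a 561×560 image gets a 560×560 target (one tile instead of two), and a 600×1150 image gets a 560×1120 target.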
Token savings
Each 560px tile is ~1,601 tokens, so removing a single tile row or column is a big proportional win.
| Scenario | Before | After | Saved |
|---|---|---|---|
| 2400×1670 screenshot → trim padding to 2400×1200 | 4 tiles · 6,404 tok | 2 tiles · 3,202 tok | −50% |
Dropping the image from a 2×2 canvas to a 2×1 canvas halves the bill. Savings depend entirely on how much padding pushes you into an extra tile — images already snug inside their tiles save little.
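The arithmetic behind the table can be checked with a short sketch. It takes the before and after tile counts as inputs (as listed in the table) and uses the ~1601 tokens/tile footprint quoted on this page. The function name is illustrative, and a host's actual billing rate may differ.

```python
TOKENS_PER_TILE = 1601  # (560 / 14)^2 + 1, the model's own per-tile footprint


def tile_savings(tiles_before: int, tiles_after: int) -> tuple[int, int, float]:
    """Return (tokens_before, tokens_after, percent_saved) for two tile counts."""
    before = tiles_before * TOKENS_PER_TILE
    after = tiles_after * TOKENS_PER_TILE
    percent_saved = 100.0 * (before - after) / before
    return before, after, percent_saved


# The 2x2 -> 2x1 scenario from the table: 4 tiles down to 2 tiles.
before, after, saved = tile_savings(4, 2)
print(f"{before} tok -> {after} tok ({saved:.0f}% saved)")
```

Because the per-tile cost cancels out of the percentage, the proportional saving depends only on the tile counts, even when a provider bills a different per-tile rate.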
Source
Grid constants from the transformers MllamaVisionConfig (image_size 560, patch_size 14, max_num_tiles 4) — the architecture behind Meta's Llama 3.2 Vision (released Sep 2024) and Llama 3.3. The ~1601 tokens/tile is the model's own footprint (560/14)² + 1; hosted APIs (Together, Bedrock, Fireworks, Groq) may bill a different per-tile rate, so treat absolute tokens as indicative. Llama 4 uses a different native-multimodal vision encoder (Llama4ForConditionalGeneration, not Mllama) and is not modeled by this connector. Verified 2026-06-11.
