A Picture is Worth 170 Tokens: How Does GPT-4o Encode Images?
Here’s a fact: GPT-4o charges 170 tokens to process each 512x512 tile used in high-res mode. At ~0.75 tokens/word, this suggests a picture is worth about 227 words—only a factor of four off from the traditional saying.
(There’s also an 85-token charge for a low-res ‘master thumbnail’ of each picture, and higher-resolution images are broken into many such 512x512 tiles, but let’s just focus on a single high-res tile.)
OK, but why 170? It’s an oddly specific number, isn’t it? OpenAI uses round numbers like “\$20” or “\$0.50” in their pricing, or powers of 2 and 3 for their internal dimensions. Why choose a number like 170 in this instance?
Numbers that are just dropped into a codebase without explanation are called “magic numbers” in programming, and 170 is a pretty glaring magic number.
And why are image costs even being converted to token counts anyway? If it were just for billing purposes, wouldn’t it be less confusing to simply list the cost per tile?
What if OpenAI chose 170, not as part of some arcane pricing strategy, but simply because it’s literally true? What if image tiles are in fact represented as 170 consecutive embedding vectors? And if so, how?
Embeddings
The first thing to recall about the transformer model is that it operates on vectors, not discrete tokens. The inputs have to be vectors, or the dot product similarity at the heart of the transformer wouldn’t make any sense. The whole concept of tokens is a pre-processing step: text is converted to tokens and tokens are converted to embedding vectors by an embedding model before they even hit the first layer of the transformer model.
For example, Llama 3 uses 4,096 feature dimensions internally. Consider the sentence, “My very educated mother just served us nine pizzas.” It gets converted into 10 integer tokens (counting the period) by BPE, then those are each converted into 4,096-dimensional vectors by an embedding model, resulting in a 10x4096 matrix. That’s the “real” input into a transformer model.
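To make those shapes concrete, here’s a toy version of that pre-processing step. The token ids and the embedding table below are random stand-ins for a real BPE tokenizer and learned weights; only the shapes matter:

```python
import numpy as np

# Toy stand-ins for a real tokenizer and learned embedding weights.
# (Llama 3's actual vocabulary has ~128K entries; 1,000 keeps the demo small.)
VOCAB_SIZE = 1_000
D_MODEL = 4_096  # Llama 3's internal feature dimension

rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((VOCAB_SIZE, D_MODEL), dtype=np.float32)

# Pretend BPE produced these 10 token ids for our sentence (the ids are made up).
token_ids = [366, 163, 270, 669, 112, 104, 603, 118, 894, 13]

# The embedding step is just a table lookup: one row per token id.
inputs = embedding_table[token_ids]
print(inputs.shape)  # (10, 4096) -- the "real" input to the transformer
```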
But there’s no law that says that these vectors must come from a text embedding model. It’s a strategy that works well for text data, but if we have data in a different format that we want to feed into a transformer then we can simply use a different embedding strategy.
We know that OpenAI has been thinking along these lines because in 2021 they released the CLIP embedding model. CLIP embeds both text and images into the same semantic vector space, allowing you to use cosine similarity to find images related to text strings, or images which are semantically similar to other images. You can try the demo on Hugging Face to get a feel for how it works.
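If you’d rather poke at CLIP locally, the Hugging Face transformers library exposes the same model. This sketch scores an image against a few candidate captions (the file name is a placeholder):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("some_photo.jpg")  # placeholder path
captions = ["a photo of a cat", "a photo of a dog", "a diagram of a transformer"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Scaled cosine similarities between the image and each caption, softmaxed into scores.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(captions, probs[0].tolist())))
```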
However, CLIP embeds the entire image as a single vector, not 170 of them. GPT-4o must be using a different, more advanced strategy internally to represent images (and likewise video, voice, and other kinds of data; that’s why it’s “omnimodal.”)
Let’s see if we can’t deduce what that strategy might be for image data in particular.
Number of Feature Dimensions
Let’s start by guesstimating the number of dimensions used internally by GPT-4o to represent embedding vectors. We can’t know the real number for certain because it’s proprietary, but we can make some reasonable assumptions.
OpenAI seems to like powers of 2, sometimes with a single factor of 3 mixed in. For example, they used 1,536 dimensions for ada-002 embeddings and 3,072 for text-embedding-3-large. GPT-3 is known to use 12,288 dimensions throughout. It’s probable that GPT-4o either kept or increased that parameter.
It doesn’t seem likely that the number of embedding dimensions would have gone down from GPT-3 to GPT-4o, but it’s possible. Releases like GPT-4 Turbo were actually faster and cheaper than earlier versions, and a reduction in embedding dimension may have been part of that, if the developers had benchmarks showing that the smaller size was just as good in terms of quality.
“Interest rates may go up, they may go down, or they may stay the same. I’m sorry, but I really can’t be any more vague than that.” —Alan Greenspan
Given all that, it’s likely that the number of feature dimensions used inside of GPT-4o is one of these:
| Dimension | Prime Factors |
|---|---|
| $1{,}536$ | $3 \cdot 2^9$ |
| $2{,}048$ | $2^{11}$ |
| $3{,}072$ | $3 \cdot 2^{10}$ |
| $4{,}096$ | $2^{12}$ |
| $12{,}288$ | $3 \cdot 2^{12}$ |
| $16{,}384$ | $2^{14}$ |
| $24{,}576$ | $3 \cdot 2^{13}$ |
For the sake of argument, I’ll assume that GPT-4o is using 12,288 for the dimension of its embedding vectors. It doesn’t really matter if we’re off by a factor of 2 or 4; the same arguments will work.
Embedding Images
Image tiles are square, so they are likely represented by a square grid of tokens. 170 is very close to $13 \times 13 = 169$. The extra token could be a single embedding vector which encodes a kind of gestalt impression of the entire image, exactly as CLIP does (and similar to their strategy of using an 85-token “master thumbnail” for each image).
So, the question is, how do we go from 512x512x3 to 13x13x12288?
Strategy 1: Raw Pixels
Here’s an extremely simple way to stuff an image into a vector space:
- Divide the 512x512 image into an 8x8 grid of “mini-tiles.”
- Each mini-tile is 64x64x3; flatten it into a vector of dimension 12,288.
- Each mini-tile becomes a single embedding vector.
- The entire image tile is represented as 64 consecutive embedding vectors (see the sketch below).
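Here’s what that looks like in a few lines of NumPy, using a random array as a stand-in for a real tile:

```python
import numpy as np

# A random 512x512 RGB tile stands in for a real image.
tile = np.random.randint(0, 256, size=(512, 512, 3), dtype=np.uint8)

# Split into an 8x8 grid of 64x64x3 mini-tiles, then flatten each one.
mini = tile.reshape(8, 64, 8, 64, 3)        # (grid_row, y, grid_col, x, channel)
mini = mini.transpose(0, 2, 1, 3, 4)        # (grid_row, grid_col, y, x, channel)
embeddings = mini.reshape(64, 64 * 64 * 3)  # 64 vectors of dimension 12,288

print(embeddings.shape)  # (64, 12288)
```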
There are two problems with this approach:
- 64 ≠ 170, and
- it’s extremely stupid.
By “extremely stupid” I mean that it doesn’t make any sense to embed using raw RGB values and then just cross your fingers and hope the transformer will sort it out. Transformers aren’t really designed to handle the spatial structure of 2D images, especially not when it’s embedded in such a braindead way as this.
To see why, imagine the image is shifted a few pixels to the left. The dot product between the embedding vectors of the original and shifted images would drop sharply, even though the semantic content is unchanged. The same would happen if we resized the image.
Ideally we’d want a model that was robust to these kinds of transforms—we’d like it to have translational and scale invariance, to use the technical jargon.
Strategy 2: CNN
Luckily, there already exists a model with those characteristics, with over a decade-long track record of successfully handling image data: the Convolutional Neural Network. (Here, I’m using the term to describe the broad family of deep learning models which use convolution layers somewhere inside them.)
Just to get a sense of what the options are, let’s take a look at a classic CNN architecture introduced in 2012, AlexNet:
The basic building blocks are:
- Convolution Layer. These scan over the image in $k \times k$ blocks, applying the same small, learned filter at each position to produce a feature map.
- Max Pool Layer. These also look at $k \times k$ blocks, but simply take the maximum value from each.
You should spot two key trends as we move into the deeper layers of the network: the height and width get smaller, while the number of “channels” (sometimes called “filters”) gets larger. That means we’re incrementally digesting many low-level features into fewer high level concepts until, at the very end, AlexNet has turned the entire image into a single categorical concept representing something like a “cat” or “dog.” CNNs are essentially funnels that squeeze the lemons of raw pixels into the lemonade of semantic vectors.
If you’re following my somewhat strained analogy, you should see how a CNN can turn an image into a single embedding vector. To see how (and why) a CNN can turn an image into many embedding vectors, let’s take a look at a slightly newer (circa 2018) CNN architecture, one that’s a little closer in spirit to what we’ll need for GPT-4o. It’s called YOLO, short for “You Only Look Once.”
Here, the notation “xN” means that the entire block is repeated N times. YOLOv3 is 10 times as deep as AlexNet but is still very similar in some regards. It has a somewhat more modern design: stride 2 convolutional layers instead of max pooling layers to reduce dimensionality, residual layers to preserve good gradients in very deep networks, etc.
But the key difference is that it doesn’t reduce the image to a single flat vector, but stops at 13x13. There are no fully connected layers after that; the output of YOLOv3 is in fact 169 different vectors, laid out in a 13x13 grid, each of dimension 1,024, and each representing the class (and some bounding box data we’ll ignore) of the object found in or near a particular cell of the grid. This means that YOLO doesn’t see just one object in the image—it can see many in a single pass. That’s why it’s said to “only look once.”
These examples give us a rough sense of what GPT-4o’s (hypothetical) image embedding CNN might be shaped like. All we have to do now is play a little game of connect the dots: how do we go from 512x512x3 to 13x13x12288 using standard CNN layers?
The moves in this game are the standard building blocks we’ve seen in the above CNN architectures. We can choose the layer types and play around with hyperparameters like kernel size, stride length, padding strategy, etc. Note that we ignore things like residual layers, repeated blocks, batch/layer normalization, or 1x1 convolutional layers, as these don’t affect the spatial dimensions of the tensors.
The goal is to suggest a workable CNN architecture that connects the known input size (512x512 images with 3 RGB color channels) to the assumed output shape (13x13 embedding vectors with 12,288 dimensions each).
I tried several different variations, but most of them required special cases on one or more layers to “fit.” Then I found this one, which steps down elegantly with no special cases at all:
It’s very neat, isn’t it? It’s almost identical to AlexNet, and it steps down from 512 to 13 in five identical repeating blocks, while simultaneously quadrupling the number of channels with each block to hit 12,288 on the bottom layer. Unfortunately, it also feels a little outdated due to the 5x5 kernels and max pool layers. AlexNet was a breakthrough in 2012, but I would be surprised if OpenAI was using something similar in 2024.
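For the record, here’s a PyTorch sketch of the kind of stack I have in mind. The 5x5 convolutions and 2x2 max pools do the spatial stepping (512 → 254 → 125 → 61 → 29 → 13, with the pools rounding odd sizes up); the channel widths are purely my own guess, chosen to land on 12,288 at the bottom:

```python
import torch
import torch.nn as nn

def block(c_in, c_out):
    # 5x5 "valid" convolution, then a 2x2 max pool with ceil_mode so odd
    # spatial sizes round up instead of down.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=5),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True),
    )

# Guessed channel schedule: x16 in the first block, then x4 per block.
# (The last block alone is ~1B parameters; treat this purely as a shape sketch.)
encoder = nn.Sequential(
    block(3, 48),
    block(48, 192),
    block(192, 768),
    block(768, 3072),
    block(3072, 12288),
)

with torch.no_grad():
    x = torch.randn(1, 3, 512, 512)  # one 512x512 RGB tile
    features = encoder(x)
print(features.shape)  # torch.Size([1, 12288, 13, 13])
```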
Here’s an alternative that almost worked (it got to 12x12 instead of 13x13) while staying closer to the more modern YOLO design:
While impossible to prove, these speculative designs demonstrate that there are plausible CNN architectures that could represent an image as a $k \times k$ grid of embedding vectors.
Experimental Validation
Does GPT-4o really see a 13x13 grid of embedding vectors? I invented a task, loosely inspired by Zener cards, to test this. The task is to identify the color and shape of every symbol on a grid in an image. A simple program generates test grids that look like this:
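The generator is nothing fancy; a minimal version of it looks something like this (the exact colors, shapes, and sizes here are illustrative):

```python
import random
from PIL import Image, ImageDraw

COLORS = ["red", "green", "blue", "orange", "purple"]
SHAPES = ["circle", "square", "triangle", "diamond", "plus"]

def draw_shape(draw, shape, color, x0, y0, size):
    x1, y1 = x0 + size, y0 + size
    cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
    if shape == "circle":
        draw.ellipse([x0, y0, x1, y1], fill=color)
    elif shape == "square":
        draw.rectangle([x0, y0, x1, y1], fill=color)
    elif shape == "triangle":
        draw.polygon([(cx, y0), (x1, y1), (x0, y1)], fill=color)
    elif shape == "diamond":
        draw.polygon([(cx, y0), (x1, cy), (cx, y1), (x0, cy)], fill=color)
    elif shape == "plus":
        arm = size // 3
        draw.rectangle([cx - arm // 2, y0, cx + arm // 2, y1], fill=color)
        draw.rectangle([x0, cy - arm // 2, x1, cy + arm // 2], fill=color)

def make_grid(k, image_size=512):
    image = Image.new("RGB", (image_size, image_size), "white")
    draw = ImageDraw.Draw(image)
    cell = image_size // k
    answers = []
    for row in range(k):
        answers.append([])
        for col in range(k):
            color, shape = random.choice(COLORS), random.choice(SHAPES)
            pad = cell // 6
            draw_shape(draw, shape, color, col * cell + pad, row * cell + pad, cell - 2 * pad)
            answers[-1].append(f"{color} {shape}")
    return image, answers

image, answers = make_grid(7)
image.save("zener_grid_7x7.png")
print(answers)  # ground truth to score GPT-4o's reply against
```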
I then used this prompt to obtain comparison data:
"""Divide this image into a {k}x{k} grid and describe the shape and color of
each cell. Report your results as a JSON array of arrays (no objects or row
labels) and use all lowercase."""
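The comparison data can be gathered with calls along these lines (a sketch using the OpenAI Python SDK; retries, scoring, and JSON parsing are omitted):

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def describe_grid(image_path, k):
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    prompt = (
        f"Divide this image into a {k}x{k} grid and describe the shape and color of "
        "each cell. Report your results as a JSON array of arrays (no objects or row "
        "labels) and use all lowercase."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(describe_grid("zener_grid_7x7.png", 7))
```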
If the 13x13 hypothesis is correct, GPT-4o should do well on this task up to about 13, and then performance should degrade. The CNN isn’t literally dividing the image up into a grid, so it might start to degrade before then, and it might struggle more with even grid sizes than odd ones because the cells won’t be aligned.
However, that is not in fact what happens. Instead, performance is perfect for 5x5 grids and smaller (e.g. 2x2, 3x3, 4x4), but begins to degrade rapidly after that. By the time we reach 13x13, it is no better than chance.
For example, here is a 7x7 grid:
For which GPT-4o returned this:
[
["blue diamond", "purple triangle", "red plus", "purple circle", "orange square", "purple circle", "blue circle"],
["blue diamond", "red circle", "green plus", "red square", "orange diamond", "blue triangle", "purple plus"],
["blue diamond", "orange triangle", "red square", "orange square", "blue plus", "purple circle", "blue circle"],
["green diamond", "blue circle", "orange circle", "green circle", "purple diamond", "green triangle", "orange diamond"],
["purple square", "purple circle", "green plus", "green diamond", "green circle", "blue triangle", "purple triangle"],
["red plus", "red triangle", "purple circle", "blue triangle", "orange triangle", "red diamond", "orange diamond"],
["orange plus", "blue diamond", "green triangle", "green plus", "green triangle", "purple diamond", "purple square"]
]
It got 38/49 correct—an accuracy of about 78%. The exact pattern of hits and misses looks like this (yellow is correct, purple incorrect):
Performance continues to degrade as the grid size increases, and by the time we get to the 13x13 grid:
The results are no better than chance:
Does that mean I was wrong about 169 tokens representing a 13x13 grid?
Yes. Yes it does. My disappointment is immeasurable and my day is ruined.
“The great tragedy of science: the slaying of a beautiful hypothesis by an ugly fact.” —Thomas Huxley
But the 5x5 grid results are suggestive. GPT-4o really can keep track of 25 distinct objects and their absolute positions within an image. Maybe the basic concept is right; I just got the dimension wrong. It would be easy to tack on another layer or two to our CNN to get down to 5x5 instead of 13x13 (for instance, a single extra 5x5, stride-2 convolution maps 13 down to $\lfloor (13-5)/2 \rfloor + 1 = 5$):
How could we structure the output to reach 170 tokens if we assume we only use 5x5 grids and smaller?
Pyramid Strategy
One way to get close to both 85 and 170 is to assume that we encode the image in a series of increasingly granular levels, like a pyramid. We start with one embedding vector to capture a gestalt impression of the whole image, add a 3x3 grid to capture left/middle/right and top/middle/bottom, then add a 5x5, a 7x7, and so on.
This strategy gets us very close to 85 tokens for the ‘master thumbnail’ if we stop at 7x7:
$1^2 + 3^2 + 5^2 + 7^2 = 1 + 9 + 25 + 49 = 84$
And very close to 170 if we add one final 9x9 grid:
$1^2 + 3^2 + 5^2 + 7^2 + 9^2 = 1 + 9 + 25 + 49 + 81 = 165$
If we throw in an ad hoc 2x2 grid for the 512x512 tile and assume one special <|image start|> token for each, we can get a perfect match:
$1 + 1^2 + 3^2 + 5^2 + 7^2 = 1 + 1 + 9 + 25 + 49 = 85$
$1 + 1^2 + 2^2 + 3^2 + 5^2 + 7^2 + 9^2 = 1 + 1 + 4 + 9 + 25 + 49 + 81 = 170$
This scheme lacks any sort of delimiters for the start and end of a row, but I think that could be handled with positional encoding similar to the way RoPE is used to encode position information for text tokens, but in 2D.
The above takes only odd grid sizes and goes past 5x5; given that the Zener grid performance starts to fall off after 5x5, this does not entirely concord with the evidence.
As an alternative, we could try taking all the grids (even and odd) up to 5x5:
This approach gives us 55 tokens:
$1^2 + 2^2 + 3^2 + 4^2 + 5^2 = 55$
If we assume 3 tokens per mini-tile and one delimiter token per grid level, we can get to 170:
$3 \times (1^2 + 2^2 + 3^2 + 4^2 + 5^2) + 5 = 170$
This isn’t fully satisfactory on numerological grounds, but it does jibe well with the empirical results. The pyramid strategy has a lot of intuitive appeal—it feels like an almost “obvious” way to encode spatial information at different zoom levels—and may explain why GPT-4o does so well with 5x5 grids and below and so poorly on 6x6 and above.
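For what it’s worth, the bookkeeping behind these pyramid schemes is easy to sanity-check in a few lines of Python (the groupings simply mirror the equations above; nothing about them is confirmed):

```python
def pyramid_tokens(grid_sizes, tokens_per_cell=1, extra=0):
    """Total tokens for a stack of k x k grids, plus any special/delimiter tokens."""
    return tokens_per_cell * sum(k * k for k in grid_sizes) + extra

# Odd grids up to 7x7, plus one <|image start|> token: the 85-token thumbnail.
assert pyramid_tokens([1, 3, 5, 7], extra=1) == 85
# Odd grids up to 9x9, plus an ad hoc 2x2 grid and one <|image start|> token: 170.
assert pyramid_tokens([1, 2, 3, 5, 7, 9], extra=1) == 170
# All grids up to 5x5, at 3 tokens per cell, plus 5 delimiters: also 170.
assert pyramid_tokens([1, 2, 3, 4, 5], tokens_per_cell=3, extra=5) == 170
```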
It’s maddening that every hypothesis seems to come tantalizingly close to explaining everything but the numbers never quite seem to work out neatly… Still, these pyramid strategies are the best I’ve been able to come up with.
Optical Character Recognition
The one thing that none of the above hypotheses explain is how GPT-4o is doing OCR. CLIP can’t natively do OCR very well, at least not for big blocks of text. (The fact that it can do it at all is actually pretty amazing - a clear example of an emergent ability!) And yet GPT-4o patently can do high-quality OCR: it can transcribe long blocks of text, read handwritten text, or read text which has been shifted, rotated, projected, or partially occluded.
It’s important to keep in mind that state-of-the-art OCR engines do a great deal of work to clean up images, find bounding boxes and strips of characters, and then run specialized character recognition models along those strips, one character or word at a time. They aren’t just big CNNs.
I guess in theory OpenAI could have built a model that really is just that good, but that doesn’t concord with its relatively weak performance on the Zener grid task. I mean, if it can’t read off 36 symbols in a neat 6x6 grid from an image, it certainly can’t read off several hundred text characters flawlessly.
I have a simple theory to explain this discrepancy: I think OpenAI is running an off-the-shelf OCR tool like Tesseract (or more likely some proprietary, state-of-the-art tool) and feeding the identified text into the transformer alongside the image data. I mean, that’s what I would do.
This would explain why the early versions were so easily confused by text hidden in images: from its POV, that text was part of the prompt. (This is fixed now; GPT-4o is good at ignoring malicious prompts hidden inside images.)
However, this does not explain why there’s no charge per token for the text found in an image.
Interestingly enough, it’s actually more efficient to send text as images: a 512x512 image with a small but readable font can easily fit 400-500 tokens’ worth of text, yet you’re only charged for 170 input tokens plus the 85 for the ‘master thumbnail,’ for a grand total of 255 tokens—far less than the number of words in the image.
This theory also explains why there is additional latency when processing images. The CNN pass would be essentially instantaneous, but a third-party OCR pass would add extra time. By the way (and I’m not saying this proves anything), the Python environment used by the OpenAI code interpreter has PyTesseract installed. You can literally just ask it to run PyTesseract on any image you’ve uploaded to get a second opinion.
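If you want the same second opinion locally, the PyTesseract call is a one-liner (this assumes the Tesseract binary itself is installed; the file name is a placeholder):

```python
from PIL import Image
import pytesseract

# Requires the Tesseract binary to be installed and on the PATH.
text = pytesseract.image_to_string(Image.open("some_screenshot.png"))
print(text)
```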
Conclusion
Well, we’ve made a lot of speculative hay out of what is essentially only one morsel of hard fact: that OpenAI used the magic number 170.
However, there does seem to be a completely plausible approach—very much in line with other CNN architectures such as YOLO—for mapping from image tiles to embedding vectors.
As such, I don’t think 170 tokens is just an approximation used to bill for roughly the amount of compute it takes to process an image. And I don’t think they’re concatenating layers to join image and text data the way some other multi-modal models do.
No, I think GPT-4o is literally representing 512x512 images as 170 embedding vectors, using a CNN architecture that’s a mixture of CLIP and YOLO to embed the image directly into the transformer’s semantic vector space.
When I started this article, I was entirely convinced that I had cracked it, that I was going to find that the 170 tokens were for a 13x13 grid plus one additional “gestalt impression” token. That got blown out of the water when performance on the Zener task started to degrade after 5x5—whatever they’re doing internally, it seems to be a lot smaller than 13x13.
Still, the analogy to YOLO is compelling, and the performance on the 5x5 Zener task all but confirms that they’re doing some kind of grid. This theory has a lot of predictive power in other areas as well: it explains how GPT-4o is able to handle multiple images, and tasks like comparing two images, for example. It explains how it’s able to see multiple objects in the same image, but gets overwhelmed when there are too many objects in a busy scene. It explains why GPT-4o seems extremely vague about the absolute and relative positions of separate objects within the scene, and why it can’t count objects accurately in images: when an object spans two adjacent grid cells, the same classes are activated in both, so it’s not sure if it’s one object or two.
Ironically, the one thing this theory can’t cleanly explain is the question which motivated this article in the first place: why 170 tokens in particular? The pyramid theory (1x1 + 2x2 + 3x3 + 4x4 + 5x5) was the best I was able to come up with, and it’s not particularly neat.
I’d love to hear from anyone who has a theory that fits a little better (or even actual knowledge, assuming it doesn’t run afoul of an NDA!)
Postscript: Alpha Channel Shenanigans
One other thing I noticed while working on this project is that GPT-4o ignores the alpha channel, resulting in somewhat counter-intuitive behavior.
When I say, “ignores”, I don’t mean that it gets rid of transparency by compositing it onto some default background, the way an image editor might when converting PNG to JPG. No, I mean it literally just grabs the RGB channels and ignores the alpha channel.
We can illustrate this with four carefully prepared images. For convenience, I’ve used HTML and CSS to display these images on top of a checkerboard pattern—the images themselves have flat, transparent backgrounds. However, half have transparent black backgrounds, and half have transparent white backgrounds.
What do I mean by “transparent black” or “transparent white”? Well, when we represent an RGBA color with four bytes, the RGB bytes are still there even when the pixel is fully transparent. Thus, (0, 0, 0, 0) and (255, 255, 255, 0) are in some sense different colors, even though there’s no situation where a correct renderer would display them differently, since they’re both 100% transparent.
Let’s ask GPT-4o what it “sees” on these four images:
What’s going on here? The pattern that emerges is that GPT-4o can read the text if and only if the text color is different than the “color” of the transparent background.
This tells us that GPT-4o disregards the alpha channel and only looks at the RGB channels. To it, transparent black is black, transparent white is white.
We can see this even more clearly if we mess with an image to preserve the three RGB channels while setting the alpha channel of every pixel to zero (fully transparent). Here’s a little Pillow function to do that:
```python
from PIL import Image

def set_alpha(image, alpha_value):
    # convert() returns a copy and ensures the image is RGBA
    image = image.convert("RGBA")
    # set the alpha channel of every pixel to the given value
    pixels = image.getdata()
    new_pixels = [(r, g, b, alpha_value) for r, g, b, a in pixels]
    image.putdata(new_pixels)
    return image
```
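Calling it with an alpha of 0 hides the RGB data from any normal viewer; calling it again with 255 brings the image back (the output file names here are arbitrary):

```python
img = Image.open("platypus.png")

hidden = set_alpha(img, 0)         # RGB values preserved, every pixel fully transparent
hidden.save("hidden_platypus.png")

restored = set_alpha(hidden, 255)  # make every pixel opaque again; the platypus reappears
restored.save("platypus_restored.png")
```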
I used that to make the two images below; they have identical RGB data, and only differ in the alpha channel:
GPT-4o has no trouble seeing the hidden platypus:
You can try downloading the hidden_platypus.png image and dropping it into ChatGPT yourself; it will correctly describe it.
You may also note that the image is 39.3 KB, the same size as platypus.png, even though PNG compression should have made it much smaller if it were really a perfectly blank, transparent image. Or you can use the above function to set the alpha channel back to 255, recovering the original image.
I’m not sure if this is a bug, but it’s certainly surprising behavior; in fact, it feels like something a malicious user could use to smuggle information past humans and directly to GPT-4o. However, GPT-4o is much better at detecting and ignoring malicious prompts hidden in images than GPT-4v was:
(You can find other examples of GPT-4o successfully detecting and ignoring malicious prompts hidden in images in my gallery of GPT-4o test images generated by my image_tagger utility.)
So, even if it is a bug, it’s not obvious it can be exploited. Still, it would be less surprising if GPT-4o “saw” the same thing that a human would in a browser.