How is a Cloud Vision unit actually counted?

One unit equals one feature applied to one image. If you call annotate on an image with LABEL_DETECTION + TEXT_DETECTION + SAFE_SEARCH_DETECTION, that one image bills three units. The free tier gives 1,000 units per feature per month, not 1,000 units total. Teams that assume 'per-image pricing' under-budget by exactly the number of features they stack.

Which features are the budget killers?

WEB_DETECTION at $3.50 per 1,000 units and OBJECT_LOCALIZATION at $2.25 per 1,000 units in the 1,001-5M band. Most other features (LABEL_DETECTION, TEXT_DETECTION, DOCUMENT_TEXT_DETECTION, FACE_DETECTION, LANDMARK_DETECTION, LOGO_DETECTION, IMAGE_PROPERTIES, SAFE_SEARCH_DETECTION) are $1.50. CROP_HINTS is $0.60. The default 'enable all features just in case' code path stacks WEB and OBJECT on top of the cheap features and silently 3x's the bill. Only enable features you actually parse downstream.

What's the right way to process multi-page PDFs?

DOCUMENT_TEXT_DETECTION via v1/files:asyncBatchAnnotate, not TEXT_DETECTION via v1/images:annotate in a page loop. The async files endpoint takes up to 2,000 pages per document, accepts both input and output as GCS URIs, returns a long-running operation name, and writes JSON output to the GCS prefix you specify (one file per batch of up to 100 pages). The synchronous v1/files:annotate caps at 5 pages. Looping TEXT_DETECTION over rasterized PDF pages costs the same money but loses the paragraph and line-break structure that DOCUMENT_TEXT_DETECTION returns.

Can Cloud Vision do face recognition?

No. FACE_DETECTION returns landmarks (eye, nose, mouth coordinates) and emotion likelihoods (joy, sorrow, anger, surprise) on a VERY_UNLIKELY-to-VERY_LIKELY scale. It does not return an identity, and there is no SearchFacesByImage equivalent. Google has explicitly excluded face identification from the public Cloud Vision product on policy grounds. For 1:1 or 1:N face matching you have to switch to AWS Rekognition or run a self-hosted model such as ArcFace or FaceNet.

Inline base64 or GCS URI for image input?

GCS URI for anything production. Inline base64 caps the entire HTTP payload at 10MB, which means batch size in images:annotate is constrained more by encoded payload than by the 16-image batch limit. GCS URIs let you reference files up to 20MB each and avoid the base64 33% size expansion entirely. The HTTP/HTTPS URL input mode exists in the API but is for demos only. Google explicitly recommends against it for production because there is no retry / caching contract on the fetch.

Tool Reviews

Google Cloud Vision API in 2026: Why Your Bill Is 3x Your Image Count and the PDF Pattern That Saves It

Cloud Vision bills per feature per image, not per image. Stack three features on 1M photos and you owe for 3M units. Plus the multi-page PDF mistake that turns a $40 job into $400. The pricing model, the cliffs, and what to actually do.

May 28, 2026 | Read time: 12 min

Contents

How "one unit" is actually counted
The eleven standard features and their prices
The 5M-units tier cliff
The multi-page PDF mistake
Input shape: base64, GCS, or URL
Quotas, batching, and what 1,800 RPM actually means
The face recognition wall
A working LABEL_DETECTION in fifteen lines
What to do before you ship 1M images a day

The bill that made me actually read the Cloud Vision pricing page came from a moderation pipeline for a marketplace listings product. The plan looked safe: 800K user-uploaded product photos per month, run SAFE_SEARCH_DETECTION to gate them, run LABEL_DETECTION to auto-tag, run OBJECT_LOCALIZATION because the PM thought bounding boxes would be useful for the gallery crop. I budgeted $1,200 based on the back-of-envelope "800K images at $1.50 per 1,000."

The actual bill was $4,320.

If you got here from a query like "cloud vision api," the Cloud Vision API is probably already on your shortlist for OCR, image moderation, or visual tagging. This piece is about the part of the pricing and architecture that the docs cover but very few people internalize until the first invoice lands: the per-feature-per-image billing model, the WEB and OBJECT feature surcharge, the multi-page PDF mistake, and the face recognition wall that disappoints everyone on day one.

How "one unit" is actually counted

Cloud Vision's pricing page is explicit on the billing unit, but the wording is easy to skim past:

"Pricing is calculated by the number of Feature units billed."

A feature unit is one feature applied to one image. If your annotate request enables three features on one image, you bill three units. The free tier is 1,000 units per feature per month, billed independently across features.

For my moderation pipeline, I had three features per image (SAFE_SEARCH + LABEL + OBJECT_LOCALIZATION). 800,000 images at three units each = 2.4M units. After the 3,000 free units (1,000 per feature), I was billing 2,397,000 paid units, or 799,000 per feature. At the three feature prices ($1.50 + $1.50 + $2.25 = $5.25 per 1,000 images), those units alone come to about $4,195, which accounts for nearly all of the $4,320 invoice. Linear "per image" math gave me $1,200 because I had unconsciously assumed each photo cost one unit total, not three.

The rule generalizes. For any annotate request with N features on M images, you bill N × M units, with each feature's price applied to the M units it contributes.
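The N × M rule is easy to sanity-check in code. The following is a minimal sketch, assuming the tier-1 prices quoted in this article, that every feature runs on every image, and that no feature crosses the 5M tier; `PRICE_PER_1K`, `FREE_UNITS`, and `estimate_monthly_cost` are illustrative names, not part of any Google library.

```python
# Tier-1 prices in USD per 1,000 units, as quoted in this article (assumptions).
PRICE_PER_1K = {
    "SAFE_SEARCH_DETECTION": 1.50,
    "LABEL_DETECTION": 1.50,
    "OBJECT_LOCALIZATION": 2.25,
}
FREE_UNITS = 1_000  # free units per feature per month


def estimate_monthly_cost(images: int, features: list[str]) -> tuple[int, float]:
    """Return (total units, tier-1 cost in USD) for `features` applied to `images` images."""
    total_units = images * len(features)  # N features x M images
    cost = 0.0
    for feature in features:
        # Each feature contributes `images` units and has its own free allowance.
        paid_units = max(images - FREE_UNITS, 0)
        cost += paid_units / 1_000 * PRICE_PER_1K[feature]
    return total_units, round(cost, 2)


units, cost = estimate_monthly_cost(800_000, list(PRICE_PER_1K))
print(units, cost)
```

For the three-feature, 800,000-image pipeline described above, this prints 2,400,000 units and a list-price estimate of $4,194.75.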

The eleven standard features and their prices

The price table is the load-bearing piece of context for any cost discussion. Below is the 2026 standard pricing per 1,000 units, after the first 1,000 free units per feature per month.

Feature	What it returns	Price per 1,000 units (units 1,001 – 5M per month)	Price per 1,000 units (units above 5M per month)
LABEL_DETECTION	Open-vocabulary tags	$1.50	$1.00
TEXT_DETECTION	Scene text OCR	$1.50	$0.60
DOCUMENT_TEXT_DETECTION	Document/handwriting OCR with block/paragraph/word/symbol structure	$1.50	$0.60
FACE_DETECTION	Face landmarks + emotion likelihood (no identity)	$1.50	$0.60
LANDMARK_DETECTION	Recognizable landmarks	$1.50	$0.60
LOGO_DETECTION	Brand logos	$1.50	$0.60
IMAGE_PROPERTIES	Dominant colors	$1.50	$0.60
SAFE_SEARCH_DETECTION	Adult / spoof / medical / violence / racy likelihood	$1.50	$0.60
OBJECT_LOCALIZATION	Detected objects with bounding boxes	$2.25	$1.50
CROP_HINTS	Smart crop rectangles	$0.60	$0.30
WEB_DETECTION	Reverse image search + best-guess label	$3.50	$2.00

Two specialty features sit outside the standard list:

PRODUCT_SEARCH has its own pricing model: a per-product indexing fee plus a query fee, separate from the standard tiers.
CELEBRITY_RECOGNITION is allow-listed and requires a separate Google Cloud review.

The two outliers in the standard table are WEB_DETECTION ($3.50, more than 2x the typical $1.50) and OBJECT_LOCALIZATION ($2.25, 1.5x). When a code path naively enables "every feature that might be useful," those two dominate the bill. In the marketplace pipeline I described, OBJECT_LOCALIZATION alone was $1,800 of the $4,320, and the PM later admitted nobody had wired up the bounding boxes in the UI yet.

The optimization is mechanical: every feature on every annotate call has to be on the list because some downstream consumer actually reads it. Strip features when they fall out of use.

The 5M-units tier cliff

The second pricing tier kicks in at 5,000,000 units per month per feature. The drop is feature-specific:

LABEL_DETECTION: $1.50 → $1.00 (33% off)
TEXT_DETECTION, DOCUMENT_TEXT_DETECTION, FACE, LANDMARK, LOGO, IMAGE_PROPERTIES, SAFE_SEARCH: $1.50 → $0.60 (60% off)
OBJECT_LOCALIZATION: $2.25 → $1.50 (33% off)
WEB_DETECTION: $3.50 → $2.00 (43% off)
CROP_HINTS: $0.60 → $0.30 (50% off)

The 5M boundary is per feature, not pooled. If your traffic mixes 6M LABEL units with 4M TEXT units, only LABEL gets the discount; TEXT stays on the tier-1 rate. This is what causes the next round of bill confusion at scale: teams that hit 5M total units assume they crossed the cliff, but each feature has its own counter.

The other implication is that spreading traffic across several features can keep each of them below its own 5M threshold, and therefore on the tier-1 rate, for longer than your total volume would suggest. The practical lever is selective routing: if your application can run the cheap features on every image and reserve WEB_DETECTION for the cases that actually need it, the cheap features ride the volume curve toward their discount while the expensive feature is billed on a much smaller denominator.
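To make the per-feature counter concrete, here is a small sketch of tiered cost for a single feature. It assumes the second-tier rate applies only to the units above 5M in a month (marginal pricing) and reuses the prices from the table above; `feature_cost` and the constants are hypothetical names for illustration.

```python
FREE_UNITS = 1_000       # free units per feature per month
TIER_BREAK = 5_000_000   # monthly units per feature at which tier 2 starts


def feature_cost(units: int, tier1_per_1k: float, tier2_per_1k: float) -> float:
    """USD cost of one feature's monthly units, using that feature's own counter."""
    # Units between the free allowance and the tier break bill at the tier-1 rate.
    tier1_units = max(min(units, TIER_BREAK) - FREE_UNITS, 0)
    # Only units above the break bill at the tier-2 rate (assumed marginal pricing).
    tier2_units = max(units - TIER_BREAK, 0)
    total = tier1_units / 1_000 * tier1_per_1k + tier2_units / 1_000 * tier2_per_1k
    return round(total, 2)


# The mixed-traffic example from the text: 6M LABEL units and 4M TEXT units.
label_cost = feature_cost(6_000_000, 1.50, 1.00)  # crosses the break
text_cost = feature_cost(4_000_000, 1.50, 0.60)   # stays entirely on tier 1
print(label_cost, text_cost)
```

Although the project has 10M units in total, only the last 1M LABEL units see the cheaper rate under this model; every TEXT unit is billed at $1.50.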

The multi-page PDF mistake

A pattern I have seen in three separate codebases now: a developer rasterizes a multi-page PDF into per-page PNGs, then loops TEXT_DETECTION over each page via images:annotate. The code works, but the output loses document structure, the loop spends one request of quota per page, and the bill is the same.

The right pattern uses DOCUMENT_TEXT_DETECTION via the asynchronous files endpoint:

from google.cloud import vision_v1

client = vision_v1.ImageAnnotatorClient()

# Source PDF, referenced by GCS URI.
input_config = vision_v1.InputConfig(
    gcs_source=vision_v1.GcsSource(uri="gs://my-bucket/inputs/contract.pdf"),
    mime_type="application/pdf",
)

# Results are written as JSON files under this GCS prefix.
output_config = vision_v1.OutputConfig(
    gcs_destination=vision_v1.GcsDestination(
        uri="gs://my-bucket/outputs/contract-",
    ),
    batch_size=100,  # pages per output JSON file
)

feature = vision_v1.Feature(type_=vision_v1.Feature.Type.DOCUMENT_TEXT_DETECTION)

request = vision_v1.AsyncAnnotateFileRequest(
    features=[feature],
    input_config=input_config,
    output_config=output_config,
)

# One request for the whole document; returns a long-running operation.
operation = client.async_batch_annotate_files(requests=[request])
print(f"Submitted: {operation.operation.name}")

# Block until the operation finishes (raises if the timeout is exceeded).
result = operation.result(timeout=600)
print("Done. Outputs at gs://my-bucket/outputs/contract-*")

The async files endpoint accepts up to 2,000 pages per document, takes the PDF as a GCS URI, returns a long-running operation name, and writes JSON results into the GCS prefix you specify, one file per batch_size pages. Synchronous files:annotate caps at 5 pages.

What you save by using the right endpoint:

Structure. DOCUMENT_TEXT_DETECTION returns a hierarchical block / paragraph / word / symbol breakdown with bounding boxes per element. TEXT_DETECTION returns flat text annotations designed for scene OCR (think street signs) and loses paragraph and line-break information.
Rate-limit headroom. The default 1,800 requests/minute project quota burns fast at one request per page. A 200-page PDF as 200 separate sync requests can spike a meaningful share of your minute budget. One async file request is one request.
Operational simplicity. No retry plumbing per page. The long-running operation handles internal retries.

What you do not save is dollars. Billing is still per page per feature; an async 200-page document bills 200 units of DOCUMENT_TEXT_DETECTION, the same as 200 sync per-page calls. The right pattern saves engineering time, not money.
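Once the operation finishes, the results still have to be read back out of the JSON files. The sketch below shows one way to pull per-page text from the content of a single output file. The field names (`responses`, `fullTextAnnotation.text`, `context.pageNumber`) reflect my understanding of the output shape and should be checked against a real output file; `page_texts` is a hypothetical helper, and downloading the file from GCS is deliberately left out.

```python
import json


def page_texts(output_json: str) -> list[tuple[int, str]]:
    """Return (page_number, text) pairs from the JSON content of one output file."""
    doc = json.loads(output_json)
    pages = []
    for response in doc.get("responses", []):
        # Assumed layout: one response per page, with the page number under "context".
        page_number = response.get("context", {}).get("pageNumber", 0)
        # Pages with no detected text are assumed to omit fullTextAnnotation.
        text = response.get("fullTextAnnotation", {}).get("text", "")
        pages.append((page_number, text))
    return pages


# Synthetic stand-in for a real output file, for illustration only.
sample = json.dumps({
    "responses": [
        {"fullTextAnnotation": {"text": "Page one text\n"}, "context": {"pageNumber": 1}},
        {"context": {"pageNumber": 2}},
    ]
})
print(page_texts(sample))
```

Keeping the parser as a pure function over a JSON string makes it easy to unit-test without touching GCS or the Vision API.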

Input shape: base64, GCS, or URL

Three input modes, each with sharp edges:

Inline base64. Embed the bytes in the request JSON. The whole request is capped by the HTTP payload limit (~10MB, counted after base64 expansion), so each annotate request is constrained by how many encoded bytes fit in 10MB rather than by the 16-image batch limit. Two 5MB photos encode to roughly 13.3MB and already overflow one request. Fine for ad-hoc work, awkward in production.
GCS URI (gs://bucket/path). Single file ceiling is 20MB. Vision pulls the bytes from GCS in-region (assuming the bucket and API call are in compatible regions), which dodges the base64 expansion and lets the 16-image batch limit actually mean 16 images. This is the right production default.
HTTP/HTTPS URL. The API accepts a public URL and fetches the image. There is no retry contract or caching guarantee from Google's side, and the docs are explicit: this is a demo affordance, not a production input. Use it for quickstarts and switch off when you ship.

The cross-region pull cost from GCS to Vision in a different region is non-trivial at volume. Keep the bucket and the API in the same region when you can. The two are configured separately: the API region through the client's endpoint, and the bucket region through its location.
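The inline base64 ceiling described above is simple arithmetic, and it is worth running before picking an input mode. This is a rough sketch that treats the cap as 10 MiB, ignores the surrounding JSON overhead, and assumes equally sized images; the function names and constants are illustrative only.

```python
import math

PAYLOAD_LIMIT_BYTES = 10 * 1024 * 1024  # ~10MB request cap, assumed here to be 10 MiB
MAX_IMAGES_PER_REQUEST = 16             # synchronous images:annotate batch limit


def base64_size(raw_bytes: int) -> int:
    """Length in bytes of the padded base64 encoding of `raw_bytes` bytes."""
    return 4 * math.ceil(raw_bytes / 3)


def images_per_inline_request(raw_bytes_per_image: int) -> int:
    """How many equally sized images fit in one inline request (JSON overhead ignored)."""
    by_payload = PAYLOAD_LIMIT_BYTES // base64_size(raw_bytes_per_image)
    return min(by_payload, MAX_IMAGES_PER_REQUEST)


for size_kib in (400, 500, 5 * 1024):
    print(size_kib, "KiB per image ->", images_per_inline_request(size_kib * 1024), "per request")
```

Under these assumptions, 400 KiB images still reach the full batch of 16, 500 KiB images drop to 15, and a single 5 MiB photo is all that fits in one request.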

Quotas, batching, and what 1,800 RPM actually means

The default project quota is 1,800 requests per minute for the Vision API, plus a 16-image cap on synchronous images:annotate and a 5-page cap on synchronous files:annotate. The async file endpoint counts as one request per submission, regardless of page count, which is one reason it scales better even when latency is fine.

A few practical implications:

Batch synchronously up to 16 images per request. Each request still counts as one against the RPM cap, so 16x batching = effectively 28,800 images/minute headroom on the default quota. Worth doing even when latency-insensitive because it preserves quota.
Request a quota bump before launch, not at the first 429. Cloud Quotas console processes Vision increases in hours to a day. If your steady-state projection puts you above ~1,200 RPM, file the request a week ahead.
The 1,800 RPM cap is project-wide across all Vision endpoints, including async submissions. Submitting 200 async PDFs in a burst still consumes the project's per-minute budget.
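The batching arithmetic can be sketched without touching the API. The helper below splits a list of image URIs into groups of at most 16 and estimates how long a backlog takes at a given request quota, assuming every request is a full batch and nothing else in the project consumes quota. `chunked` and `minutes_at_quota` are hypothetical helper names.

```python
import math
from typing import Iterator, Sequence

MAX_BATCH = 16        # images per synchronous images:annotate request
DEFAULT_RPM = 1_800   # default project quota quoted above


def chunked(items: Sequence[str], size: int = MAX_BATCH) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def minutes_at_quota(image_count: int, rpm: int = DEFAULT_RPM) -> float:
    """Minimum minutes to submit `image_count` images at `rpm` requests per minute."""
    requests = math.ceil(image_count / MAX_BATCH)
    return requests / rpm


uris = [f"gs://my-bucket/photos/{i}.jpg" for i in range(40)]
batch_sizes = [len(batch) for batch in chunked(uris)]
print(batch_sizes)
print(round(minutes_at_quota(1_000_000), 1))
```

Each slice from `chunked` would become one batch request. At the default quota, a one-million-image backlog needs 62,500 requests, or roughly 35 minutes of quota if nothing else is running.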

The face recognition wall

This one is not a pricing surprise, it is a capability surprise, and it comes up on every project that lists "facial recognition" in the requirements.

Cloud Vision's FACE_DETECTION feature returns the things needed for face analysis: bounding box, landmarks (eye, nose, mouth, ear coordinates), and likelihoods for joy, sorrow, anger, surprise, headwear, blur, under-exposure, and a few others. It does not return any identity. There is no SearchFacesByImage style endpoint, no face collection to enroll users into, no embedding vector you can compare across calls.

This is a deliberate policy choice. AWS Rekognition exposes a face identification API; Google does not. If your requirement is genuinely 1:1 ("is this the same person as the one in this other photo") or 1:N ("which of these enrolled users does this photo match"), Vision is the wrong service. The realistic options are:

AWS Rekognition CompareFaces or SearchFacesByImage
Azure Face API (with the caveat that Microsoft has been narrowing its public access too)
Self-hosted models like ArcFace, FaceNet, or InsightFace, served behind your own gRPC endpoint

Worth confirming this on the requirements call before you start. "Detect faces" and "recognize faces" are two different products, and getting them confused is the most common Day 1 disappointment with Cloud Vision.

A working LABEL_DETECTION in fifteen lines

The shortest path to a useful Vision response, in Python:

from google.cloud import vision_v1

client = vision_v1.ImageAnnotatorClient()

# Reference the image by GCS URI instead of sending inline bytes.
image = vision_v1.Image(source=vision_v1.ImageSource(
    image_uri="gs://my-bucket/photos/cat.jpg"
))

# Two features on one image: this call bills two units.
response = client.annotate_image({
    "image": image,
    "features": [
        {"type_": vision_v1.Feature.Type.LABEL_DETECTION, "max_results": 10},
        {"type_": vision_v1.Feature.Type.SAFE_SEARCH_DETECTION},
    ],
})

# Per-image failures are reported in the response body.
if response.error.message:
    raise RuntimeError(response.error.message)

for label in response.label_annotations:
    print(f"{label.score:.2f}  {label.description}")

ss = response.safe_search_annotation
print(f"adult={ss.adult.name}  racy={ss.racy.name}  violence={ss.violence.name}")

Three things in this snippet worth flagging on first integration:

max_results is per feature. Default is around 10 for LABEL_DETECTION; bump it if you want long-tail labels.
SafeSearch returns enum-typed likelihood, not a number. The six values are UNKNOWN, VERY_UNLIKELY, UNLIKELY, POSSIBLE, LIKELY, VERY_LIKELY. Pick the threshold based on your tolerance, but anchor on POSSIBLE for fail-closed UGC gates.
annotate_image is synchronous and single-image. For real batch work, batch_annotate_images takes up to 16 images per request and still counts as one request against the RPM quota. Billing is unchanged: each image still bills one unit per feature.
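The POSSIBLE threshold mentioned above can be turned into a small gate. The sketch below works on likelihood names (the strings produced by `.name` in the snippet above) rather than on the client library's enum, so it runs without credentials. Treating UNKNOWN as a block is my own fail-closed assumption, and `should_block` is a hypothetical helper, not a library function.

```python
# Likelihood names in ascending order, mirroring the six values listed above.
LIKELIHOOD_ORDER = [
    "UNKNOWN", "VERY_UNLIKELY", "UNLIKELY", "POSSIBLE", "LIKELY", "VERY_LIKELY",
]


def should_block(likelihoods: dict[str, str], threshold: str = "POSSIBLE") -> bool:
    """Fail-closed gate: block if any category is UNKNOWN or at/above `threshold`."""
    limit = LIKELIHOOD_ORDER.index(threshold)
    for name in likelihoods.values():
        # UNKNOWN means the service gave no verdict; a fail-closed gate rejects it.
        if name == "UNKNOWN" or LIKELIHOOD_ORDER.index(name) >= limit:
            return True
    return False


clean = {"adult": "VERY_UNLIKELY", "racy": "UNLIKELY", "violence": "VERY_UNLIKELY"}
borderline = {"adult": "VERY_UNLIKELY", "racy": "POSSIBLE", "violence": "UNLIKELY"}
print(should_block(clean), should_block(borderline))
```

With the response from the snippet above, the input would be built as `{"adult": ss.adult.name, "racy": ss.racy.name, "violence": ss.violence.name}`. Raising `threshold` to "LIKELY" loosens the gate for products that can tolerate more borderline content.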

What to do before you ship 1M images a day

Concrete checklist for taking a Cloud Vision integration from prototype to production volume:

Audit the feature list. Every feature on every annotate call has to be wired to a downstream consumer. Strip features that nobody parses; each one is a hidden multiplier on the bill.
Plan budget against feature × image, not image. Multiply your image volume by the count of features per call to get your unit projection. If you stack three features, your unit count is 3x your image count.
Decide PDF path before writing the loop. Multi-page PDFs go through DOCUMENT_TEXT_DETECTION + files:asyncBatchAnnotate, not TEXT_DETECTION in a per-page loop.
Move input to GCS URIs for production. Drop base64 once you exit prototype. The 33% size expansion plus the 10MB payload cap stack badly at scale.
Co-locate the bucket with the API region to avoid cross-region pull charges and added latency.
Set a Cloud Billing budget alert at 50% and 90%. The per-feature-per-image surprise hits hardest in the first full cycle.
File a quota increase if projected RPM > 1,200. Leave headroom for spikes.
Confirm "face recognition" actually means face detection in the requirements doc. If it means identification, Vision is the wrong service and the project's architecture is wrong.

If you are still comparing image APIs, the photography category in our API directory lists alternatives, complements, and adjacent services worth comparing against Cloud Vision before committing.
