Google Cloud Vision API

Google Cloud's image-analysis API: 11 features (labels, OCR, document OCR, face attributes, landmarks, logos, object localization, image properties, safe search, crop hints, web detection), billed per feature per image

Use it when

Eleven standard features stack in a single request: LABEL_DETECTION, TEXT_DETECTION, DOCUMENT_TEXT_DETECTION, FACE_DETECTION, LANDMARK_DETECTION, LOGO_DETECTION, OBJECT_LOCALIZATION, IMAGE_PROPERTIES, SAFE_SEARCH_DETECTION, CROP_HINTS, WEB_DETECTION. Each feature is billed separately, but the whole request costs only one network round trip.
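
A minimal sketch of what stacking looks like on the wire, assuming the public REST JSON shape for images:annotate. The helper name `build_annotate_body` is hypothetical and the bytes are a placeholder, not a real image.

```python
import base64
import json


def build_annotate_body(image_bytes: bytes, feature_types: list) -> dict:
    """Build one images:annotate body that requests several features on one image."""
    return {
        "requests": [
            {
                # Inline images are sent as base64 text in the "content" field.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                # One entry per feature; each entry is billed as its own unit.
                "features": [{"type": t} for t in feature_types],
            }
        ]
    }


body = build_annotate_body(b"placeholder image bytes", ["LABEL_DETECTION", "TEXT_DETECTION"])
billable_units = len(body["requests"][0]["features"])
print(json.dumps(body, indent=2))
print("billable units for this image:", billable_units)
```

One request, one round trip, two billable units.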

Watch for

No face identification. The API returns face landmarks and emotion likelihoods (joy, sorrow, anger, surprise) but never an identity. For 1:1 or 1:N face matching you need Amazon Rekognition or a self-hosted model such as ArcFace.

First check

Open cloud.google.com/vision, enable the Cloud Vision API on your GCP project, and link a billing account. Authenticate locally with `gcloud auth application-default login` (ADC) or supply a service account JSON key for CI/CD. Install @google-cloud/vision (Node.js) or google-cloud-vision (Python) and call LABEL_DETECTION first to confirm the setup works and to see how per-feature-per-image billing counts units. For PDFs and scanned documents, go straight to DOCUMENT_TEXT_DETECTION via files:asyncBatchAnnotate; do not use TEXT_DETECTION for multi-page docs.

Auth: OAuth
CORS: No
HTTPS: Yes
Signup: Required
Latency: 166 ms
Protocol: REST, gRPC
Pricing: freemium

Uptime · 30-day window

Probes: 30 · Uptime: 93% · Avg latency: 493 ms
01

About this API

Cloud Vision API is Google Cloud's core image-understanding service, exposing eleven standard features. LABEL_DETECTION returns open-vocabulary object and scene tags. TEXT_DETECTION runs scene OCR for short strings in natural images. DOCUMENT_TEXT_DETECTION targets documents and handwriting, returning a hierarchical block / paragraph / word / symbol structure. FACE_DETECTION returns face landmarks and emotion likelihoods (joy, sorrow, anger, surprise) but never an identity. LANDMARK_DETECTION, LOGO_DETECTION, and OBJECT_LOCALIZATION cover landmarks, brand logos, and bounded objects. IMAGE_PROPERTIES returns dominant colors. SAFE_SEARCH_DETECTION scores adult / spoof / medical / violence / racy on a five-step VERY_UNLIKELY-to-VERY_LIKELY scale. CROP_HINTS suggests smart crops. WEB_DETECTION runs a reverse-image lookup and returns a best-guess label.

Billing is feature-times-image: one image with LABEL_DETECTION + TEXT_DETECTION bills 2 units. The free tier covers the first 1,000 units per month per feature. From 1,001 to 5,000,000 units per month, most features run $1.50 per 1,000 units; OBJECT_LOCALIZATION is $2.25, WEB_DETECTION is $3.50, and CROP_HINTS is $0.60. Above 5M units per month, TEXT_DETECTION, DOCUMENT_TEXT_DETECTION, FACE_DETECTION, LANDMARK_DETECTION, and LOGO_DETECTION drop to $0.60, and LABEL_DETECTION drops to $1.00.

There are four endpoint shapes: v1/images:annotate (synchronous, up to 16 images per batch), v1/images:asyncBatchAnnotate (async with output to GCS), v1/files:annotate (synchronous PDF/TIFF, 5 pages max), and v1/files:asyncBatchAnnotate (async PDF/TIFF up to 2,000 pages per document). Image input takes three forms: inline base64 (HTTP payload capped at 10MB), GCS URI (single file capped at 20MB), or HTTP/HTTPS URL (demo-only). The default project quota is 1,800 requests per minute; higher concurrency needs a Cloud Quotas adjustment. Authentication follows the Google Cloud stack: ADC, service account JSON, OAuth bearer, or a limited API key.

Seven official client libraries are maintained: C#, Go, Java, Node.js, PHP, Python, Ruby.

Two gotchas dominate field experience. First, teams used to per-request pricing under-budget because Vision bills per feature per image: stacking three features triples the unit count. Decide which features you actually need before estimating cost. Second, multi-page PDFs should not be looped through synchronous TEXT_DETECTION one page at a time. Use DOCUMENT_TEXT_DETECTION via files:asyncBatchAnnotate and let Google process the whole document in one async job that writes JSON output to GCS. For 1:N face identification, Vision is the wrong tool; switch to Amazon Rekognition or a self-hosted face model.
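
The feature-times-image arithmetic can be sketched as a small estimator. This is an illustration only: the rates are copied from the text above (first 1,000 units free per feature, then the 1,001 to 5,000,000 tier), the function and constant names are hypothetical, and the tier above 5M units is deliberately not modeled. Check the current price list before budgeting.

```python
FREE_UNITS = 1_000            # free units per feature per month (from the text above)
TIER1_CEILING = 5_000_000     # the first paid tier ends here
DEFAULT_RATE = 1.50           # USD per 1,000 units for most features
RATE_OVERRIDES = {            # features the text prices differently in this tier
    "OBJECT_LOCALIZATION": 2.25,
    "WEB_DETECTION": 3.50,
}


def estimate_monthly_cost(images: int, features: list) -> float:
    """Estimate one month's bill when every image gets every listed feature."""
    if images > TIER1_CEILING:
        raise ValueError("the tier above 5M units is not modeled in this sketch")
    total = 0.0
    for feature in features:
        units = images                        # one unit per feature per image
        billable = max(0, units - FREE_UNITS)  # free quota applies per feature
        rate = RATE_OVERRIDES.get(feature, DEFAULT_RATE)
        total += billable / 1000 * rate
    return round(total, 2)


label_only = estimate_monthly_cost(101_000, ["LABEL_DETECTION"])
stacked = estimate_monthly_cost(
    101_000, ["LABEL_DETECTION", "TEXT_DETECTION", "WEB_DETECTION"]
)
print(label_only, stacked)
```

With 101,000 images a month, labels alone come to 150.0, while stacking text and web detection on top brings the same traffic to 650.0.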

02

What you can build

  • Document OCR pipelines: DOCUMENT_TEXT_DETECTION extracts structured text from scanned contracts, invoices, and handwritten notes. PDFs and TIFFs go through files:asyncBatchAnnotate with output written to GCS for downstream parsing
  • UGC safety moderation: SAFE_SEARCH_DETECTION returns five-axis likelihood scores (adult, spoof, medical, violence, racy) on a VERY_UNLIKELY-to-VERY_LIKELY scale, ideal for fail-closed upload gates
  • E-commerce product tagging: LABEL_DETECTION plus OBJECT_LOCALIZATION returns category labels and bounding boxes, with PRODUCT_SEARCH layered on top for reverse-image lookup against an indexed catalog
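
A fail-closed upload gate for the moderation case might look like the sketch below. It assumes the `safeSearchAnnotation` object from a REST response is passed in as a plain dict; the function name, the default threshold, and the choice of axes are hypothetical. The likelihood enum also has an UNKNOWN value, which this sketch treats as a rejection.

```python
from typing import Optional

# Ordered from least to most likely; UNKNOWN is intentionally absent.
LIKELIHOOD_ORDER = ["VERY_UNLIKELY", "UNLIKELY", "POSSIBLE", "LIKELY", "VERY_LIKELY"]


def allow_upload(
    safe_search: Optional[dict],
    block_at: str = "LIKELY",
    axes: tuple = ("adult", "violence", "racy"),
) -> bool:
    """Return True only when every checked axis is confidently below the threshold."""
    if not safe_search:
        return False  # no annotation at all: fail closed
    threshold = LIKELIHOOD_ORDER.index(block_at)
    for axis in axes:
        value = safe_search.get(axis)
        if value not in LIKELIHOOD_ORDER:
            return False  # missing axis or UNKNOWN: fail closed
        if LIKELIHOOD_ORDER.index(value) >= threshold:
            return False  # at or above the blocking threshold
    return True


clean = {"adult": "VERY_UNLIKELY", "violence": "UNLIKELY", "racy": "POSSIBLE"}
flagged = {"adult": "VERY_UNLIKELY", "violence": "UNLIKELY", "racy": "LIKELY"}
print(allow_upload(clean), allow_upload(flagged), allow_upload(None))
```

Rejecting on a missing or UNKNOWN value is the design choice that makes the gate fail closed: an API error or partial response never lets an upload through.
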
03

Strengths & limitations

Strengths

  • Eleven standard features stack in a single request: LABEL_DETECTION, TEXT_DETECTION, DOCUMENT_TEXT_DETECTION, FACE_DETECTION, LANDMARK_DETECTION, LOGO_DETECTION, OBJECT_LOCALIZATION, IMAGE_PROPERTIES, SAFE_SEARCH_DETECTION, CROP_HINTS, WEB_DETECTION. Each feature is billed separately, but the whole request costs only one network round trip
  • Volume pricing drops at the 5M-units/month tier: LABEL_DETECTION goes from $1.50 to $1.00 per 1,000 units, TEXT_DETECTION and DOCUMENT_TEXT_DETECTION go from $1.50 to $0.60
  • Image input has three shapes: inline base64 (payload capped at 10MB), GCS URI (gs://bucket/path, single file capped at 20MB), or an HTTP/HTTPS URL (demo-only, not for production). PDFs and TIFFs go through the files: endpoints with a 2,000-page ceiling per document

Limitations

  • No face identification. The API returns face landmarks and emotion likelihoods (joy, sorrow, anger, surprise) but never an identity. For 1:1 or 1:N face matching you need Amazon Rekognition or a self-hosted model such as ArcFace
  • Synchronous annotate caps at 16 images per request; files:annotate (PDF/TIFF) caps at 5 pages per request. Anything larger has to switch to files:asyncBatchAnnotate with results written to GCS
  • WEB_DETECTION and OBJECT_LOCALIZATION are the most expensive features at tier 1 ($3.50 and $2.25 per 1,000 units respectively). When stacking features, these two are the first to blow the budget, so turn them off when not needed
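
The request-size limits above translate into a simple routing rule. The sketch below is one way to encode it, using the caps quoted in this section (16 images, 5 pages, 2,000 pages); the function name and the `kind` values are hypothetical.

```python
MAX_SYNC_IMAGES = 16      # images per synchronous images:annotate request
MAX_SYNC_PAGES = 5        # PDF/TIFF pages per synchronous files:annotate request
MAX_ASYNC_PAGES = 2_000   # PDF/TIFF pages per document in files:asyncBatchAnnotate


def choose_endpoint(kind: str, count: int) -> str:
    """Pick an endpoint for `count` images, or for one document of `count` pages."""
    if count < 1:
        raise ValueError("count must be at least 1")
    if kind == "image":
        if count <= MAX_SYNC_IMAGES:
            return "v1/images:annotate"
        return "v1/images:asyncBatchAnnotate"
    if kind == "document":
        if count <= MAX_SYNC_PAGES:
            return "v1/files:annotate"
        if count <= MAX_ASYNC_PAGES:
            return "v1/files:asyncBatchAnnotate"
        raise ValueError("document exceeds the 2,000-page ceiling; split it first")
    raise ValueError("kind must be 'image' or 'document'")


print(choose_endpoint("image", 10))
print(choose_endpoint("image", 40))
print(choose_endpoint("document", 3))
print(choose_endpoint("document", 120))
```

For more than 16 images, splitting the work into several synchronous batches is an alternative to the async endpoint when results are needed inline rather than in GCS.
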
04

Official quickstart

Read the official quickstart at cloud.google.com.

05

Getting started

Open cloud.google.com/vision, enable the Cloud Vision API on your GCP project, and link a billing account. Authenticate locally with `gcloud auth application-default login` (ADC) or supply a service account JSON key for CI/CD. Install @google-cloud/vision (Node.js) or google-cloud-vision (Python) and call LABEL_DETECTION first to confirm the setup works and to see how per-feature-per-image billing counts units. For PDFs and scanned documents, go straight to DOCUMENT_TEXT_DETECTION via files:asyncBatchAnnotate; do not use TEXT_DETECTION for multi-page docs.
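
As a rough sketch of that first LABEL_DETECTION call without the client library, the snippet below builds (but does not send) an authenticated REST request using only the standard library. It assumes an OAuth access token obtained separately, for example from `gcloud auth application-default print-access-token`, and a quota project ID; `build_label_request`, the token string, and `my-project` are placeholders.

```python
import base64
import json
import urllib.request

VISION_URL = "https://vision.googleapis.com/v1/images:annotate"


def build_label_request(
    image_bytes: bytes, access_token: str, quota_project: str
) -> urllib.request.Request:
    """Build a POST request asking for up to five labels on one inline image."""
    body = {
        "requests": [
            {
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": "LABEL_DETECTION", "maxResults": 5}],
            }
        ]
    }
    return urllib.request.Request(
        VISION_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            # Names the project to bill when using user credentials.
            "x-goog-user-project": quota_project,
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


req = build_label_request(b"placeholder image bytes", "ACCESS_TOKEN", "my-project")
print(req.get_method(), req.full_url)
# To actually send it (needs a real token and real image bytes):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["responses"][0].get("labelAnnotations", []))
```

This single-feature request on one image counts as one unit.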

06

FAQ

Does Google Cloud Vision API have a free tier?

Yes. Each of the 11 standard features has its own monthly free quota of 1,000 units, counted independently. A unit equals one feature applied to one image, so an annotate request using LABEL_DETECTION + TEXT_DETECTION on one image draws from both free quotas. PRODUCT_SEARCH and CELEBRITY_RECOGNITION (allow-listed) follow separate pricing models.

Can Vision API do face recognition?

No. FACE_DETECTION returns face landmarks (eye, nose, mouth coordinates) and emotion likelihoods (joy, sorrow, anger, surprise) but never an identity, and the API does not support 1:1 or 1:N face matching. Google explicitly excludes face identification from its public Cloud product. For identity matching, use Amazon Rekognition (CompareFaces, SearchFacesByImage) or a self-hosted model such as ArcFace or FaceNet.
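
To make the boundary concrete, the sketch below reads the fields a FACE_DETECTION result does carry. The sample dict is hand-written for illustration in the shape of the REST `faceAnnotations` field, not captured API output, and `summarize_faces` is a hypothetical helper.

```python
EMOTION_FIELDS = {
    "joy": "joyLikelihood",
    "sorrow": "sorrowLikelihood",
    "anger": "angerLikelihood",
    "surprise": "surpriseLikelihood",
}


def summarize_faces(annotate_response: dict) -> list:
    """Collect emotion likelihoods and landmark types for each detected face."""
    summaries = []
    for face in annotate_response.get("faceAnnotations", []):
        summaries.append(
            {
                "emotions": {
                    name: face.get(field, "UNKNOWN")
                    for name, field in EMOTION_FIELDS.items()
                },
                "landmark_types": [lm["type"] for lm in face.get("landmarks", [])],
            }
        )
    return summaries


# Illustrative response fragment (made-up values).
sample_response = {
    "faceAnnotations": [
        {
            "joyLikelihood": "VERY_LIKELY",
            "sorrowLikelihood": "VERY_UNLIKELY",
            "angerLikelihood": "VERY_UNLIKELY",
            "surpriseLikelihood": "UNLIKELY",
            "landmarks": [
                {"type": "LEFT_EYE", "position": {"x": 120.5, "y": 88.0, "z": 0.0}},
                {"type": "RIGHT_EYE", "position": {"x": 168.2, "y": 87.1, "z": 0.4}},
                {"type": "NOSE_TIP", "position": {"x": 144.9, "y": 118.3, "z": -20.1}},
            ],
        }
    ]
}

faces = summarize_faces(sample_response)
print(faces)
```

Nothing in the per-face record names or matches a person; it only describes geometry and expression.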

Which endpoint should I use for multi-page PDFs?

Use DOCUMENT_TEXT_DETECTION with v1/files:asyncBatchAnnotate. That endpoint accepts up to 2,000 pages per document, takes both input and output as GCS URIs, returns a long-running operation name, and writes results asynchronously to the GCS output prefix (one JSON file per batch of up to 100 pages). The synchronous files:annotate caps at five pages. TEXT_DETECTION is designed for scene text in natural images and loses paragraph and line-break structure on document pages.
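
A minimal sketch of the request body for that call, assuming the REST JSON shape for files:asyncBatchAnnotate. The helper names and the `gs://` URIs are placeholders, and the output-file count simply applies the "one JSON file per batch" rule described above.

```python
import json


def build_pdf_ocr_body(input_uri: str, output_prefix: str, batch_size: int = 100) -> dict:
    """Build a files:asyncBatchAnnotate body for document OCR on one PDF in GCS."""
    if not 1 <= batch_size <= 100:
        raise ValueError("batch_size must be between 1 and 100 pages")
    return {
        "requests": [
            {
                "inputConfig": {
                    "gcsSource": {"uri": input_uri},
                    "mimeType": "application/pdf",
                },
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
                "outputConfig": {
                    "gcsDestination": {"uri": output_prefix},
                    "batchSize": batch_size,  # pages per output JSON file
                },
            }
        ]
    }


def expected_output_files(pages: int, batch_size: int = 100) -> int:
    """Number of JSON files to expect under the output prefix (ceiling division)."""
    return -(-pages // batch_size)


body = build_pdf_ocr_body("gs://example-bucket/in/contract.pdf", "gs://example-bucket/out/")
print(json.dumps(body, indent=2))
print(expected_output_files(250))
```

A 250-page document at the default batch size should therefore produce three output files to collect once the long-running operation finishes.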

Does stacking features in one request reduce cost?

Only the network round trip is saved, not the bill. Requesting N features on one image bills N units, each at its own feature-specific rate. The "I am already calling it, might as well enable a few more features" instinct burns money, so only enable the features your application actually consumes.

07

Technical details

CORS: No
HTTPS: Yes
Signup: Yes
Open source: No
Auth type: OAuth
Pricing: freemium
Rate limit: Default project quota: 1,800 requests/minute; synchronous annotate accepts up to 16 images per request, files:annotate (PDF/TIFF) up to 5 pages per request; async batch supports up to 2,000 pages per document with output written to GCS
Free tier quota: Per-feature monthly free units: first 1,000 units/month free for each feature. A unit = one feature applied to one image, so an annotate request with LABEL_DETECTION + TEXT_DETECTION on one image bills 2 units. PRODUCT_SEARCH and CELEBRITY_RECOGNITION (allow-listed) have separate pricing models.
Protocols: REST, gRPC
SDKs: C#, Go, Java, Node.js, PHP, Python, Ruby
Response time: 166 ms
Last health check: 6/26/2026, 6:23:30 AM
08

Endpoints

Parsed from the OpenAPI spec. Showing 8 of 8 non-deprecated endpoints.

POST /v1p1beta1/{parent}/files:annotate (tag: projects; path parameter: parent, required)
POST /v1p1beta1/{parent}/files:asyncBatchAnnotate (tag: projects; path parameter: parent, required)
POST /v1p1beta1/{parent}/images:annotate (tag: projects; path parameter: parent, required)
POST /v1p1beta1/{parent}/images:asyncBatchAnnotate (tag: projects; path parameter: parent, required)
POST /v1p1beta1/files:annotate (tag: files)
POST /v1p1beta1/files:asyncBatchAnnotate (tag: files)
POST /v1p1beta1/images:annotate (tag: images)
POST /v1p1beta1/images:asyncBatchAnnotate (tag: images)
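
The eight paths differ only in the method name and in whether a `{parent}` segment is present. The sketch below builds either form, assuming the parent takes the shape `projects/{project}` or `projects/{project}/locations/{location}`; the helper name and the example project and location values are hypothetical.

```python
from typing import Optional

API_ROOT = "https://vision.googleapis.com"
METHODS = {
    "files:annotate",
    "files:asyncBatchAnnotate",
    "images:annotate",
    "images:asyncBatchAnnotate",
}


def annotate_url(
    method: str,
    version: str = "v1p1beta1",
    project: Optional[str] = None,
    location: Optional[str] = None,
) -> str:
    """Build the URL for one of the annotate methods, with or without a parent."""
    if method not in METHODS:
        raise ValueError(f"unknown method: {method}")
    if project is None:
        if location is not None:
            raise ValueError("a location requires a project")
        return f"{API_ROOT}/{version}/{method}"
    parent = f"projects/{project}"
    if location is not None:
        parent += f"/locations/{location}"
    return f"{API_ROOT}/{version}/{parent}/{method}"


print(annotate_url("images:annotate"))
print(annotate_url("files:asyncBatchAnnotate", project="my-project", location="eu"))
```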