Google Cloud DLP API in 2026: Inspect vs Deidentify, Custom InfoTypes, and the Per-GB Pricing Trap

Cloud DLP (now Sensitive Data Protection) docs cover the API surface but skip the half-day every dev loses to custom infoType regex, quota 429s, and the moment the per-GB bill hits a BigQuery scan.

May 25, 2026

Last sprint I was wiring up a PII scrubbing layer for a fintech team feeding support tickets into a fine-tuning pipeline, and Google Cloud DLP was the only GCP-native option that would scan and transform inline without forcing the data out of the project. The first day I shipped a working content.inspect call. The second day I tried to add a custom infoType for the client's internal account-ID format and watched it return zero findings against a sample I knew contained ten matches. The third day I learned what RE2 was.

If you arrived here from a query like "cloud data loss prevention api," that custom-regex tar pit is probably the first wall you are about to hit. This piece walks through the Cloud DLP API (rebranded as Sensitive Data Protection, though the API itself was never renamed) at the layer the docs skip: when to use inspect vs deidentify vs a storage job, how to write custom infoTypes that actually match, how the pricing math gets ugly at scale, and which transformations you can reverse.

The rebrand is a docs problem, not a product problem

Google rebranded Cloud Data Loss Prevention to Sensitive Data Protection sometime in 2023, and the rollout is still uneven in 2026. The product page in the Console says "Sensitive Data Protection," but every artifact you touch as a developer still says "DLP":

| Surface | Name in 2026 |
| --- | --- |
| Marketing page, Console UI | Sensitive Data Protection |
| API endpoint | dlp.googleapis.com |
| gcloud commands | gcloud dlp ... |
| Python client library | google-cloud-dlp |
| Service account roles | roles/dlp.user, roles/dlp.admin |
| Pricing line item | "Sensitive Data Protection" |
| Stack Overflow tag | google-cloud-dlp |

Practical consequence: search the new name when you want the high-level product overview or the pricing page; search the old name when you want a code example. The two product names lead to the same docs eventually, but the first-page Google results split between the two and you can lose a half hour bouncing if you do not know they are the same thing.

The four API surfaces and when each one fits

DLP has more endpoints than most teams realize, and picking the wrong one costs latency or money or both. The decision tree is mostly about whether your data is inline or at rest:

| Surface | Endpoint | Sync? | Use when |
| --- | --- | --- | --- |
| Content inspect | projects.content.inspect | yes | Scanning a single payload < 0.5MB inline (a chat message, a ticket, a form submission) |
| Content deidentify | projects.content.deidentify | yes | Same as above, but transforming the data on the way through |
| Content reidentify | projects.content.reidentify | yes | Reversing a deterministic or FPE transformation |
| Storage job | projects.dlpJobs.create (type INSPECT_JOB) | no | Auditing a whole GCS bucket / BigQuery table / Datastore kind |
| Discovery | projects.discoveryConfigs.create | no | Continuous data profiling across many datasets |

The split that catches teams off guard is the inline vs at-rest pricing model. Content endpoints are billed per GB of content sent in the request. Storage jobs are billed per GB of source data they scan in place. Discovery is billed at a separate, lower rate for periodic profiling. A team that uses content.inspect in a loop to scan a million BigQuery rows can pay many times what it would pay using a single dlpJobs.create over the same table, and burns through the per-minute request quota while doing it, because the content path counts every request whereas the job amortizes over the table scan.
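The decision table above can be written down as a small routing helper. This is only a sketch of the table's logic: the Workload class and choose_surface function are hypothetical names of mine, and the 0.5 MB ceiling is the inline limit quoted in the table, not something the code reads from the API.

```python
from dataclasses import dataclass

# Inline payload ceiling quoted in the table above (0.5 MB per request).
MAX_INLINE_BYTES = 512 * 1024


@dataclass(frozen=True)
class Workload:
    at_rest: bool             # data already lives in GCS / BigQuery / Datastore
    payload_bytes: int = 0    # size of one inline payload; ignored when at_rest
    transform: bool = False   # True if the data must be rewritten, not just scanned
    continuous: bool = False  # True for ongoing profiling across many datasets


def choose_surface(w: Workload) -> str:
    """Return the DLP method name the decision table suggests for a workload."""
    if w.at_rest:
        if w.continuous:
            return "projects.discoveryConfigs.create"
        return "projects.dlpJobs.create"
    if w.payload_bytes > MAX_INLINE_BYTES:
        raise ValueError("payload exceeds the inline limit; split it or stage it in storage")
    if w.transform:
        return "projects.content.deidentify"
    return "projects.content.inspect"
```

A support ticket of a few KB that needs scrubbing routes to projects.content.deidentify; a BigQuery table routes to projects.dlpJobs.create no matter how short the loop over its rows would have been to write.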

A working content.inspect in twenty lines

Here is the smallest useful inspect call, using the Python client.

from google.cloud import dlp_v2

PROJECT_ID = "your-project"

client = dlp_v2.DlpServiceClient()
parent = f"projects/{PROJECT_ID}/locations/global"

inspect_config = {
    "info_types": [
        {"name": "EMAIL_ADDRESS"},
        {"name": "PHONE_NUMBER"},
        {"name": "CREDIT_CARD_NUMBER"},
    ],
    "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
    "include_quote": True,
    "limits": {"max_findings_per_request": 100},
}

response = client.inspect_content(
    request={
        "parent": parent,
        "inspect_config": inspect_config,
        "item": {"value": "Reach me at user@example.com or 415-555-0142."},
    }
)

for f in response.result.findings:
    print(f"{f.info_type.name}: {f.quote} ({f.likelihood.name})")

Output:

EMAIL_ADDRESS: user@example.com (LIKELY)
PHONE_NUMBER: 415-555-0142 (LIKELY)

Three things to internalize from this:

  1. min_likelihood is the single most important knob. DLP reports findings with a five-level confidence scale: VERY_UNLIKELY, UNLIKELY, POSSIBLE, LIKELY, VERY_LIKELY. Setting POSSIBLE catches most real PII at the cost of some false positives (random numbers tagged as phone numbers). Setting LIKELY cuts noise but misses partial matches. Default is POSSIBLE.
  2. include_quote is off by default; the example turns it on explicitly. The quote field returns the matched substring, which is essential for debugging your infoType selection but is also itself sensitive data that now lives in your logs. Leave it off in production paths where you only need the count, not the value.
  3. The parent uses locations/global. DLP also supports regional endpoints (e.g. locations/us-central1) which keep the request inside a single region for data-residency requirements. The global endpoint is faster for ad-hoc scans but does not give you the regional guarantee.
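Those three knobs can be folded into one request builder for a production path. This is a sketch rather than a reference implementation: build_inspect_request and count_findings are hypothetical helper names, the infoType list is just the one from the example, and count_findings assumes google-cloud-dlp is installed and application-default credentials are available.

```python
def build_inspect_request(project_id, text, location="global", include_quote=False):
    """Build an inspect_content request dict; pure Python, no API call."""
    return {
        # Swap "global" for a region such as "us-central1" for data residency.
        "parent": f"projects/{project_id}/locations/{location}",
        "inspect_config": {
            "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
            "min_likelihood": "LIKELY",  # enum name; resolved to the enum below
            "include_quote": include_quote,  # off: findings carry no matched text
        },
        "item": {"value": text},
    }


def count_findings(request):
    """Send the request and return {info_type_name: count}. Needs GCP credentials."""
    from google.cloud import dlp_v2  # imported lazily so the builder works offline

    config = dict(request["inspect_config"])
    config["min_likelihood"] = dlp_v2.Likelihood[config["min_likelihood"]]
    client = dlp_v2.DlpServiceClient()
    response = client.inspect_content(request={**request, "inspect_config": config})
    counts = {}
    for finding in response.result.findings:
        name = finding.info_type.name
        counts[name] = counts.get(name, 0) + 1
    return counts
```

Keeping the builder separate from the call means the request shape can be unit-tested without touching the API, and only counts, never quotes, reach your logs.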

Custom infoTypes: regex (RE2), dictionary, hotword

The 150+ built-in infoTypes cover the common identifiers (CC numbers, government IDs, names, addresses, healthcare codes), but every project ends up needing at least one custom infoType. Three flavors:

Regex. A pattern that DLP matches against the content. The trap is that DLP uses RE2, not PCRE or Python re. RE2 has no lookbehind, no lookahead, no backreferences. If your pattern uses any of those, it will silently return zero findings — no syntax error, no warning, just empty results. Symptom: your test sample has obvious matches and the API insists there are none.

custom_info_types = [
    {
        "info_type": {"name": "ACCOUNT_ID"},
        "regex": {"pattern": r"ACC-[0-9]{8}"},
        "likelihood": dlp_v2.Likelihood.LIKELY,
    }
]

The trick to debugging: paste your pattern into regex101.com with flavor set to RE2 (Go), or compile against the re2 package locally. If it does not compile there, it will not match in DLP.
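A cheap pre-flight check is to scan the pattern for the constructs RE2 rejects before it ever reaches DLP. The sketch below uses only the standard library and is a heuristic, not an RE2 parser: re2_incompatibilities is a hypothetical helper, and it can produce false positives on escaped sequences such as a literal backslash followed by a digit.

```python
import re

# Probes for constructs that RE2 does not support. Heuristic only: this scans
# the pattern text and does not parse it the way RE2 would.
_UNSUPPORTED = {
    "lookahead": re.compile(r"\(\?[=!]"),
    "lookbehind": re.compile(r"\(\?<[=!]"),
    "backreference": re.compile(r"\\[1-9]|\(\?P=\w+\)"),
}


def re2_incompatibilities(pattern: str) -> list:
    """Return the names of RE2-unsupported constructs found in the pattern."""
    return [name for name, probe in _UNSUPPORTED.items() if probe.search(pattern)]
```

Calling re2_incompatibilities(r"(?<=ACC-)[0-9]{8}") returns ["lookbehind"], while the plain r"ACC-[0-9]{8}" pattern returns an empty list. An empty result does not prove the pattern is valid RE2, so the regex101 or local re2 compile step is still worth doing.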

Dictionary. A list of exact-match words/phrases. DLP can take the list inline (word_list) or as a reference to a GCS-hosted file (cloud_storage_path). Inline is capped at ~50 KB total dictionary size. For larger lists (employee names, product SKUs, customer IDs) put a one-word-per-line text file in GCS and reference it.
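The inline-or-GCS choice can be made mechanically. In this sketch, build_dictionary_info_type is a hypothetical helper, and the 50 KB threshold is simply the approximate cap mentioned above, so check it against the current limits before relying on it.

```python
# Approximate inline cap quoted above; an assumption, verify against current limits.
INLINE_LIMIT_BYTES = 50 * 1024


def build_dictionary_info_type(name, words=None, gcs_path=None):
    """Build a dictionary custom infoType from an inline list or a GCS file."""
    if (words is None) == (gcs_path is None):
        raise ValueError("pass exactly one of words or gcs_path")
    if gcs_path is not None:
        # One word or phrase per line in the referenced text file.
        dictionary = {"cloud_storage_path": {"path": gcs_path}}
    else:
        size = sum(len(word.encode("utf-8")) for word in words)
        if size > INLINE_LIMIT_BYTES:
            raise ValueError(f"{size} bytes of words is too large to inline; use gcs_path")
        dictionary = {"word_list": {"words": list(words)}}
    return {"info_type": {"name": name}, "dictionary": dictionary}
```

The returned dict goes into the custom_info_types list of an inspect_config, alongside any regex-based entries.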

Hotword (proximity rule). Modifies the likelihood of another infoType when a "hot word" appears within a configurable window around the finding. Useful for catching things like a 9-digit number near the word "SSN": the number alone is POSSIBLE, the number near "SSN" is VERY_LIKELY. The window is set with the proximity fields window_before and window_after, which the API reference describes as a number of characters, so size it against the longest label you expect to sit next to the value.
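A hotword rule for the SSN case might look like the following. Treat it as a sketch: build_ssn_hotword_config is a hypothetical helper, the 50-unit window is an arbitrary starting point, and the likelihood is given by enum name (substitute dlp_v2.Likelihood.VERY_LIKELY if your client version rejects the string).

```python
def build_ssn_hotword_config(window_before=50):
    """Build an inspect_config that boosts SSN findings preceded by the word 'SSN'."""
    hotword_rule = {
        # RE2 syntax: case-insensitive flag and word boundaries are supported.
        "hotword_regex": {"pattern": r"(?i)\bssn\b"},
        # How far before the finding to look for the hotword.
        "proximity": {"window_before": window_before},
        # Enum given by name here; see the note above.
        "likelihood_adjustment": {"fixed_likelihood": "VERY_LIKELY"},
    }
    target = [{"name": "US_SOCIAL_SECURITY_NUMBER"}]
    return {
        "info_types": target,
        "rule_set": [{"info_types": target, "rules": [{"hotword_rule": hotword_rule}]}],
    }
```

The rule only adjusts findings for the infoTypes listed in its own rule_set entry, so a custom infoType such as ACCOUNT_ID would need its own entry.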

Deidentify: choose reversibility before you scale

Deidentify transformations split into one-way and reversible. Picking wrong is expensive to fix later because all the historic transformed data uses the choice you made on day one.

| Transformation | Reversible? | Use when |
| --- | --- | --- |
| ReplaceValueConfig | no | "Replace EMAIL with [REDACTED_EMAIL]" (simplest) |
| RedactConfig | no | Remove the value entirely, no placeholder |
| CharacterMaskConfig | no | "Replace each digit with X"; preserves shape but not value |
| CryptoHashConfig (HMAC) | no | Stable pseudonym across runs, but cannot recover original |
| CryptoDeterministicConfig | yes | Tokenization for join keys; same input always produces same token |
| CryptoReplaceFfxFpeConfig (FPE) | yes | Preserve format (16-digit input → 16-digit output) for downstream systems that validate format |
| DateShiftConfig | partial | Shift dates by a per-context random offset; the offset is the key |
| BucketingConfig | no | Replace ages with brackets like 30-39 |

The CryptoReplaceFfxFpeConfig choice is the one teams revisit most often. FPE keeps the output in the same alphabet and length as the input: a credit-card number deidentified with FPE is still 16 digits and still fits a VARCHAR(16) column. The config exposes no option for preserving a Luhn checksum, though, so do not assume the output will pass one. The cost is operational: FPE needs a key wrapper (typically a Cloud KMS-wrapped AES key) and the key rotation story is something you have to design.

If the only downstream consumer of the deidentified data is a model or an analytics dashboard, CryptoDeterministicConfig is usually enough — it produces a stable but format-free token, which is fine if your schema can hold it.
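For the reversible path, the deidentify and reidentify configs are near mirror images. The sketch below assumes a data key already wrapped with Cloud KMS; build_tokenize_config and build_detokenize_parts are hypothetical helpers, and the surrogate name (EMAIL_TOKEN in the usage note) is a label you choose, not a built-in infoType.

```python
def build_tokenize_config(info_type, surrogate, kms_key_name, wrapped_key):
    """deidentify_config that swaps findings of info_type for deterministic tokens."""
    return {
        "info_type_transformations": {
            "transformations": [
                {
                    "info_types": [{"name": info_type}],
                    "primitive_transformation": {
                        "crypto_deterministic_config": {
                            "crypto_key": {
                                "kms_wrapped": {
                                    "wrapped_key": wrapped_key,       # bytes
                                    "crypto_key_name": kms_key_name,  # KMS resource path
                                }
                            },
                            # Tokens are emitted tagged with this surrogate name.
                            "surrogate_info_type": {"name": surrogate},
                        }
                    },
                }
            ]
        }
    }


def build_detokenize_parts(surrogate, kms_key_name, wrapped_key):
    """Return (reidentify_config, inspect_config) for reversing the tokens."""
    # Reidentification targets the surrogate itself, with the same key.
    reidentify_config = build_tokenize_config(surrogate, surrogate, kms_key_name, wrapped_key)
    inspect_config = {
        "custom_info_types": [{"info_type": {"name": surrogate}, "surrogate_type": {}}]
    }
    return reidentify_config, inspect_config
```

The first dict is passed as deidentify_config to deidentify_content (together with an inspect_config naming the infoTypes to find); the pair from build_detokenize_parts("EMAIL_TOKEN", ...) goes to reidentify_content. Both directions need the same wrapped key, which is why the key rotation plan has to exist before the first row is tokenized.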

Pricing math at scale

Cloud DLP is one of those services where the per-unit cost is low and the volume math is brutal. Per Google's published pricing, the rates as of writing are below (verify on the Sensitive Data Protection pricing page before locking your architecture):

| Operation | Rate (per Google pricing) |
| --- | --- |
| content.inspect (after the 1GB free monthly tier) | ~$1.00 per GB of content sent |
| content.deidentify / reidentify | ~$0.50 per GB |
| Storage job inspect | ~$1.00 per GB of source data scanned |
| Discovery profile | ~$0.03 per GB profiled (first 1TB), tiered down |

[community-verified, exact tier rates change — pull current numbers from the pricing page before sizing]

Three rough-cut scenarios to anchor the math:

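  • A chat app scrubbing every message inline. Average message 200 bytes, 1M messages/day. Daily volume is 200MB, which adds up to roughly 6GB over a 30-day month. The 1GB free tier is monthly, not daily, so about 5GB is billable: around $5/month at the rate above. Cheap, but not free.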
  • Daily audit of a 100GB BigQuery table. Using a storage job: ~$100/day, ~$3,000/month. Using Discovery for the same dataset: ~$3 for the initial profile, then incremental costs as the table grows. Discovery wins by two orders of magnitude.
  • One-time scan of a 10TB historical log archive. Storage job: ~$10,240. Cannot avoid; budget it explicitly before kicking off.
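The arithmetic behind these scenarios fits in a few lines. The constants below are the approximate rates from the table above, not live pricing, and the function names are mine; swap in current numbers from the pricing page before using it for a budget.

```python
# Approximate rates quoted in the table above (USD per GB). Assumptions, not live pricing.
CONTENT_INSPECT_PER_GB = 1.00
STORAGE_INSPECT_PER_GB = 1.00
DISCOVERY_PER_GB = 0.03
CONTENT_FREE_GB_PER_MONTH = 1.0


def content_inspect_monthly_cost(bytes_per_item, items_per_day, days=30):
    """Monthly cost of inline inspection, with the monthly free tier applied once."""
    gb = bytes_per_item * items_per_day * days / 1e9
    return max(0.0, gb - CONTENT_FREE_GB_PER_MONTH) * CONTENT_INSPECT_PER_GB


def storage_job_cost(gb_scanned, runs=1):
    """Cost of running a storage inspect job over gb_scanned, repeated `runs` times."""
    return gb_scanned * runs * STORAGE_INSPECT_PER_GB


def discovery_profile_cost(gb_profiled):
    """Cost of an initial Discovery profile at the first-tier rate."""
    return gb_profiled * DISCOVERY_PER_GB
```

With these inputs, storage_job_cost(100, runs=30) gives the $3,000/month daily-audit figure and storage_job_cost(10 * 1024) gives $10,240 for the archive, while content_inspect_monthly_cost(200, 1_000_000) shows what is left of the chat workload once the monthly free tier is subtracted.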

The mistake almost every team makes once is running content.inspect in a loop over BigQuery rows, because the code is shorter than wiring up a job. A 100GB table scanned row-by-row through the content endpoint can produce a bill an order of magnitude higher than a single storage job, because you also pay the per-request overhead, not just the GB volume.
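For comparison, wiring up the job is not much longer than the loop. This is a sketch under assumptions: build_bigquery_inspect_job and submit_job are hypothetical helpers, findings are saved to a BigQuery table in the same dataset purely for illustration, and submit_job needs google-cloud-dlp plus credentials with permission to create jobs.

```python
def build_bigquery_inspect_job(project_id, dataset_id, table_id,
                               findings_table_id, rows_limit_percent=None):
    """Build a create_dlp_job request that scans one BigQuery table in place."""
    bq_options = {
        "table_reference": {
            "project_id": project_id,
            "dataset_id": dataset_id,
            "table_id": table_id,
        }
    }
    if rows_limit_percent is not None:
        # Scanning a sample of rows is the main cost lever on large tables.
        bq_options["rows_limit_percent"] = rows_limit_percent
    findings_table = {
        "project_id": project_id,
        "dataset_id": dataset_id,
        "table_id": findings_table_id,
    }
    return {
        "parent": f"projects/{project_id}/locations/global",
        "inspect_job": {
            "storage_config": {"big_query_options": bq_options},
            "inspect_config": {
                "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
            },
            "actions": [{"save_findings": {"output_config": {"table": findings_table}}}],
        },
    }


def submit_job(request):
    """Submit the job and return its resource name. Needs GCP credentials."""
    from google.cloud import dlp_v2  # imported lazily so the builder works offline

    client = dlp_v2.DlpServiceClient()
    return client.create_dlp_job(request=request).name
```

The job is asynchronous, so the returned name is what you poll or attach a notification to; the findings land in the output table rather than in the response.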

Quota 429s and how to back off

DLP enforces project-level quotas on each surface. The common ones you will hit in real workloads:

| Quota | Default (per Google docs, as of writing) | Symptom when hit |
| --- | --- | --- |
| content.inspect requests per minute per project | 600 | 429 RESOURCE_EXHAUSTED |
| dlpJobs.create per minute per project | low (single digits) | 429 on job submission |
| Content size per request | 0.5 MB raw / 1MB structured | 400 INVALID_ARGUMENT |
| Findings per request | 3000 hard cap | Findings truncated, no error |

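The right response to a 429 is exponential backoff with jitter. The Google Cloud client libraries will do this for you if you let them: each call accepts a retry policy through its retry= argument (in Python, a google.api_core.retry.Retry object), but check that the policy you pass actually treats RESOURCE_EXHAUSTED as retryable rather than assuming the default does. The wrong response is to retry immediately on the same thread, which racks up the per-request count without making progress.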
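Backoff with full jitter is short enough to write out, which also makes it clear what the client library is doing on your behalf. In this sketch call_with_backoff is a hypothetical, library-agnostic helper, and dlp_retry_policy shows the equivalent built on google.api_core.retry; the delay values are starting points I picked, not recommendations from Google.

```python
import random
import time


def call_with_backoff(fn, is_retryable, max_attempts=5, base_delay=1.0,
                      max_delay=32.0, sleep=time.sleep, rng=random.random):
    """Call fn(); on retryable errors wait a random slice of a doubling ceiling."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if not is_retryable(exc) or attempt == max_attempts - 1:
                raise
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * ceiling)  # full jitter: uniform in [0, ceiling)


def dlp_retry_policy():
    """Retry policy for the Python client that treats 429s as retryable."""
    from google.api_core import exceptions, retry  # lazy: needs google-api-core

    return retry.Retry(
        predicate=retry.if_exception_type(
            exceptions.ResourceExhausted, exceptions.ServiceUnavailable
        ),
        initial=1.0,
        maximum=32.0,
        multiplier=2.0,
    )
```

With the client library, the policy is passed per call, for example client.inspect_content(request=request, retry=dlp_retry_policy()). The hand-rolled helper is for code paths that wrap the call in their own logic and need the same behavior.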

When the 600 RPM ceiling is actually a problem (it usually is not, but high-volume real-time scrubbing can hit it), the lever to pull is batching: send multiple items in a single inspect_content call via the table field, which counts as one request against the quota but processes up to the per-request size limit of structured content.
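Packing messages into table items can be sketched as a greedy batcher. batch_as_tables is a hypothetical helper, the single "text" column name is my choice, and the byte budget counts only the message text, so leave headroom below the structured-content limit for the request envelope.

```python
def batch_as_tables(texts, max_bytes=900_000, column="text"):
    """Group strings into DLP table items, each holding at most max_bytes of text."""
    batches, current, size = [], [], 0
    for text in texts:
        n = len(text.encode("utf-8"))
        if n > max_bytes:
            raise ValueError("single message exceeds the per-request budget")
        if current and size + n > max_bytes:
            batches.append(current)  # budget reached: start a new request
            current, size = [], 0
        current.append(text)
        size += n
    if current:
        batches.append(current)
    return [
        {
            "table": {
                "headers": [{"name": column}],
                "rows": [{"values": [{"string_value": t}]} for t in batch],
            }
        }
        for batch in batches
    ]
```

Each returned dict is the item of one inspect_content request. Findings on table items should carry a row index in their location metadata, which is how you map a finding back to the message it came from; confirm the exact field path in the API reference before depending on it.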

What to do after this article

Open the Cloud DLP API entry for the endpoint catalog, then run the twenty-line inspect_content example above against your own GCP project. If you get findings back, the rest of the API is a matter of mixing and matching infoTypes and transformations — the surface is wide but the patterns repeat. The next thing to design is your custom-infoType strategy: dictionary file in GCS, RE2-compatible regexes, and hotword proximity rules for the values that depend on context. After that, decide whether your real workload is inline (content endpoints) or at-rest (jobs and Discovery), and route the spend accordingly before the bill teaches you.
