Is Cloud DLP the same product as Sensitive Data Protection?

Yes. Google rebranded Cloud Data Loss Prevention to Sensitive Data Protection in 2023, but the API namespace stayed `dlp.googleapis.com`, the Python client is still `google-cloud-dlp`, and gcloud commands still use `gcloud dlp`. Marketing pages and the Console UI use the new name; everything you actually type in code uses the old one. Search results for both names land on the same docs.

Why does my custom regex infoType not match anything?

DLP custom infoTypes use Google's RE2 engine, not Python's `re` or PCRE. RE2 does not support lookbehind, lookahead, or backreferences. Patterns that work in your local Python regex tester will fail silently inside DLP — silently meaning the API returns 200 OK with zero findings, not an error. Rewrite the regex against the RE2 syntax reference, and validate by running an inspect against a sample that contains exactly one match before scaling up.

How much does it cost to scan a 1TB BigQuery table with DLP?

Per Google's published pricing, inspect runs at $1.00 per GB after the free tier, so a 1TB table is roughly $1,024 for a single scan [community-verified, verify current rates on the Sensitive Data Protection pricing page]. The first GB per month is free. If you intend to run DLP across BigQuery on a schedule, switch to DLP Discovery profiles instead of `dlpJobs` — Discovery is priced separately per GB profiled and is cheaper for whole-dataset reconnaissance.

Can I reverse a deidentified value back to the original?

Only if you used a cryptographic transformation that preserves reversibility. `CryptoDeterministicConfig` and Format-Preserving Encryption (`CryptoReplaceFfxFpeConfig`) both support `content.reidentify`. `CryptoHashConfig`, `RedactConfig`, `ReplaceConfig`, `MaskConfig`, `BucketingConfig`, and `DateShiftConfig` are one-way. If you need tokenization for a workflow that later rejoins the original value (analytics on PII, debugging a fraud signal), pick FPE or deterministic encryption up front.

What's the difference between content.inspect and a DLP job?

`content.inspect` is synchronous and capped at 0.5MB of content per request (raw bytes), so it fits a single document, a row, or a chat message. DLP jobs (`dlpJobs.create`) are asynchronous and target whole storage objects — GCS buckets, BigQuery tables, Datastore kinds. For inline scrubbing on the data plane, use `content.inspect`; for periodic audit scans on stored data, use a storage-typed job.

Google Cloud DLP API in 2026: Inspect vs Deidentify, Custom InfoTypes, and the Per-GB Pricing Trap

Cloud DLP (now Sensitive Data Protection) docs cover the API surface but skip the half-day every dev loses to custom infoType regex, quota 429s, and the moment the per-GB bill hits a BigQuery scan.

May 25, 2026 · Read time: 12 min

Contents

The rebrand is a docs problem, not a product problem
The four API surfaces and when each one fits
A working content.inspect in twenty lines
Custom infoTypes: regex (RE2), dictionary, hotword
Deidentify: choose reversibility before you scale
Pricing math at scale
Quota 429s and how to back off
What to do after this article

Last sprint I was wiring up a PII scrubbing layer for a fintech team feeding support tickets into a fine-tuning pipeline, and Google Cloud DLP was the only GCP-native option that would scan and transform inline without forcing the data out of the project. The first day I shipped a working content.inspect call. The second day I tried to add a custom infoType for the client's internal account-ID format and watched it return zero findings against a sample I knew contained ten matches. The third day I learned what RE2 was.

If you arrived here from a query like "cloud data loss prevention api," that custom-regex tar pit is probably the first wall you are about to hit. This piece walks through the Cloud DLP API (rebranded as Sensitive Data Protection, but never renamed in the API itself) at the layer the docs skip: when to use inspect vs deidentify vs a storage job, how to write custom infoTypes that actually match, how the pricing math gets ugly at scale, and which transformations you can reverse.

The rebrand is a docs problem, not a product problem

Google rebranded Cloud Data Loss Prevention to Sensitive Data Protection sometime in 2023, and the rollout is still uneven in 2026. The product page in the Console says "Sensitive Data Protection," but every artifact you touch as a developer still says "DLP":

Surface	Name in 2026
Marketing page, Console UI	Sensitive Data Protection
API endpoint	`dlp.googleapis.com`
gcloud commands	`gcloud dlp ...`
Python client library	`google-cloud-dlp`
Service account roles	`roles/dlp.user`, `roles/dlp.admin`
Pricing line item	"Sensitive Data Protection"
Stack Overflow tag	`google-cloud-dlp`

Practical consequence: search the new name when you want the high-level product overview or the pricing page; search the old name when you want a code example. The two product names lead to the same docs eventually, but the first-page Google results split between the two and you can lose a half hour bouncing if you do not know they are the same thing.

The four API surfaces and when each one fits

DLP has more endpoints than most teams realize, and picking the wrong one costs latency or money or both. The decision tree is mostly about whether your data is inline or at rest:

Surface	Endpoint	Sync?	Use when
Content inspect	`projects.content.inspect`	yes	Scanning a single payload < 0.5MB inline (a chat message, a ticket, a form submission)
Content deidentify	`projects.content.deidentify`	yes	Same as above, but transforming the data on the way through
Content reidentify	`projects.content.reidentify`	yes	Reversing a deterministic or FPE transformation
Storage job	`projects.dlpJobs.create` (type `INSPECT_JOB`)	no	Auditing a whole GCS bucket / BigQuery table / Datastore kind
Discovery	`projects.locations.discoveryConfigs.create`	no	Continuous data profiling across many datasets

The split that catches teams off guard is inline vs at-rest pricing models. Content endpoints are billed per GB of content sent in the request. Storage jobs are billed per GB of source data they scan in place. Discovery is billed at a separate, lower rate for periodic profiling. A team that uses content.inspect in a loop to scan a million BigQuery rows can pay an order of magnitude more than it would for a single dlpJobs.create over the same table, because the content path pays for every request, whereas the job amortizes its overhead over one table scan.
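The decision tree in the table can be sketched as a small routing helper. This is an illustration only: `Workload` and `choose_surface` are hypothetical names, and the 0.5 MB threshold is taken as 512 KiB, which is an assumption about how the limit is counted.

```python
from dataclasses import dataclass

# The 0.5 MB per-request cap from the table, assumed here to mean 512 KiB.
CONTENT_LIMIT_BYTES = 512 * 1024


@dataclass
class Workload:
    at_rest: bool             # data already lives in GCS / BigQuery / Datastore
    payload_bytes: int = 0    # size of one inline payload (ignored when at rest)
    transform: bool = False   # True if the data must be rewritten, not just scanned
    continuous: bool = False  # True for ongoing profiling across many datasets


def choose_surface(w: Workload) -> str:
    """Return the name of the API surface the table above suggests."""
    if w.at_rest:
        return "discoveryConfigs" if w.continuous else "dlpJobs.create"
    if w.payload_bytes >= CONTENT_LIMIT_BYTES:
        raise ValueError("payload too large for a content call; split it or stage it in storage")
    return "content.deidentify" if w.transform else "content.inspect"


print(choose_surface(Workload(at_rest=False, payload_bytes=2_000)))
print(choose_surface(Workload(at_rest=True, continuous=True)))
```

Encoding the choice once, near the entry point of a pipeline, makes it harder for a row-by-row content loop to creep in where a job belongs.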

A working content.inspect in twenty lines

Here is the smallest useful inspect call, using the Python client.

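from google.cloud import dlp_v2

PROJECT_ID = "your-project"  # replace with your GCP project ID

# The client picks up Application Default Credentials from the environment.
client = dlp_v2.DlpServiceClient()
parent = f"projects/{PROJECT_ID}/locations/global"

# Which built-in infoTypes to look for, and how confident DLP must be to report them.
inspect_config = {
    "info_types": [
        {"name": "EMAIL_ADDRESS"},
        {"name": "PHONE_NUMBER"},
        {"name": "CREDIT_CARD_NUMBER"},
    ],
    "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
    "include_quote": True,  # return the matched text; useful for debugging only
    "limits": {"max_findings_per_request": 100},
}

# Synchronous call: the payload travels inline in the request body.
response = client.inspect_content(
    request={
        "parent": parent,
        "inspect_config": inspect_config,
        "item": {"value": "Reach me at user@example.com or 415-555-0142."},
    }
)

# One line per finding: infoType, matched text, likelihood.
for f in response.result.findings:
    print(f"{f.info_type.name}: {f.quote} ({f.likelihood.name})")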

Output:

EMAIL_ADDRESS: user@example.com (LIKELY)
PHONE_NUMBER: 415-555-0142 (LIKELY)

Three things to internalize from this:

min_likelihood is the single most important knob. DLP reports findings with a five-level confidence scale: VERY_UNLIKELY, UNLIKELY, POSSIBLE, LIKELY, VERY_LIKELY. Setting POSSIBLE catches most real PII at the cost of some false positives (random numbers tagged as phone numbers). Setting LIKELY cuts noise but misses partial matches. Default is POSSIBLE.
include_quote is off by default; the example turns it on explicitly. The quote field returns the matched substring, which is essential for debugging your infoType selection but is also itself sensitive data that now lives in your logs. Leave it off in production paths where you only need the count, not the value.
The parent uses locations/global. DLP also supports regional endpoints (e.g. locations/us-central1) which keep the request inside a single region for data-residency requirements. The global endpoint is faster for ad-hoc scans but does not give you the regional guarantee.
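For production paths that only need counts, a summary that never reads the quote field can be sketched as follows. `summarize_findings` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins shaped like the entries of `response.result.findings`, so the snippet runs without a GCP project.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_findings(findings):
    """Count findings per infoType without reading the quote field."""
    return dict(Counter(f.info_type.name for f in findings))


# Stand-in objects with the same attribute shape as real findings.
fake_findings = [
    SimpleNamespace(info_type=SimpleNamespace(name="EMAIL_ADDRESS"), quote="user@example.com"),
    SimpleNamespace(info_type=SimpleNamespace(name="PHONE_NUMBER"), quote="415-555-0142"),
    SimpleNamespace(info_type=SimpleNamespace(name="EMAIL_ADDRESS"), quote="ops@example.com"),
]

print(summarize_findings(fake_findings))
```

Logging the returned dictionary instead of the findings themselves keeps matched values out of log storage.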

Custom infoTypes: regex (RE2), dictionary, hotword

The 150+ built-in infoTypes cover the common identifiers (CC numbers, government IDs, names, addresses, healthcare codes), but every project ends up needing at least one custom infoType. Three flavors:

Regex. A pattern that DLP matches against the content. The trap is that DLP uses RE2, not PCRE or Python re. RE2 has no lookbehind, no lookahead, no backreferences. If your pattern uses any of those, it will silently return zero findings — no syntax error, no warning, just empty results. Symptom: your test sample has obvious matches and the API insists there are none.

custom_info_types = [
    {
        "info_type": {"name": "ACCOUNT_ID"},
        "regex": {"pattern": r"ACC-[0-9]{8}"},
        "likelihood": dlp_v2.Likelihood.LIKELY,
    }
]

The trick to debugging: paste your pattern into regex101.com with flavor set to RE2 (Go), or compile against the re2 package locally. If it does not compile there, it will not match in DLP.
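As a cheap first pass before any of that, a pattern can be scanned for the constructs RE2 lacks. The sketch below is a text heuristic using only the standard library, not an RE2 compile: `re2_incompatibilities` is a hypothetical helper, and it can misfire on escaped sequences (for example a literal backslash followed by a digit).

```python
import re

# Textual signatures of constructs that RE2 does not support.
_UNSUPPORTED = {
    "lookahead": re.compile(r"\(\?[=!]"),
    "lookbehind": re.compile(r"\(\?<[=!]"),
    "backreference": re.compile(r"\\[1-9]|\(\?P=\w+\)"),
}


def re2_incompatibilities(pattern: str) -> list:
    """Return the names of unsupported constructs that appear in the pattern."""
    return sorted(name for name, rx in _UNSUPPORTED.items() if rx.search(pattern))


for candidate in (r"ACC-[0-9]{8}", r"(?<=ID:)\d+", r"(\d)\1(?=x)"):
    print(candidate, re2_incompatibilities(candidate))
```

An empty list does not prove the pattern is valid RE2; it only rules out the three constructs named above.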

Dictionary. A list of exact-match words/phrases. DLP can take the list inline (word_list) or as a reference to a GCS-hosted file (cloud_storage_path). Inline is capped at ~50 KB total dictionary size. For larger lists (employee names, product SKUs, customer IDs) put a one-word-per-line text file in GCS and reference it.
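A sketch of both dictionary forms as plain request dictionaries. `dictionary_info_type` is a hypothetical helper, the bucket path is a placeholder, and the inline limit reuses the approximate 50 KB figure quoted above rather than a verified quota.

```python
# Approximate inline cap quoted above; treat as an assumption, not a verified quota.
INLINE_LIMIT_BYTES = 50 * 1024


def dictionary_info_type(name, words=None, gcs_path=None):
    """Build a custom infoType backed by an inline word list or a GCS file."""
    if gcs_path is not None:
        dictionary = {"cloud_storage_path": {"path": gcs_path}}
    else:
        words = list(words or [])
        size = sum(len(w.encode("utf-8")) for w in words)
        if size > INLINE_LIMIT_BYTES:
            raise ValueError("word list too large to send inline; host it in GCS instead")
        dictionary = {"word_list": {"words": words}}
    return {"info_type": {"name": name}, "dictionary": dictionary}


inline = dictionary_info_type("PRODUCT_SKU", words=["SKU-RED-01", "SKU-BLUE-02"])
hosted = dictionary_info_type("EMPLOYEE_NAME", gcs_path="gs://your-bucket/employees.txt")
print(inline)
print(hosted)
```

Either result goes into the `custom_info_types` list of an inspect config, alongside any regex-based entries.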

Hotword (proximity rule). Modifies the likelihood of another infoType when a "hot word" appears within N bytes. Useful for catching things like a 9-digit number near the word "SSN" — the number alone is POSSIBLE, the number near "SSN" is VERY_LIKELY. The proximity field is measured in bytes, not characters, which matters for UTF-8 content where Chinese characters consume 3 bytes each.
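The SSN case can be sketched as a rule set attached to the inspect config under `rule_set`. `ssn_hotword_rule_set` is a hypothetical helper and the window of 50 is an arbitrary illustrative value. The caller passes the likelihood (for example `dlp_v2.Likelihood.VERY_LIKELY`) so the snippet itself does not need the client library.

```python
def ssn_hotword_rule_set(likelihood, window_before=50):
    """Raise the likelihood of SSN findings that follow the word 'SSN'."""
    return [
        {
            # The rule applies only to findings of these infoTypes.
            "info_types": [{"name": "US_SOCIAL_SECURITY_NUMBER"}],
            "rules": [
                {
                    "hotword_rule": {
                        # RE2-compatible: case-insensitive flag and word boundaries.
                        "hotword_regex": {"pattern": r"(?i)\bSSN\b"},
                        # How far before the finding to look for the hot word.
                        "proximity": {"window_before": window_before},
                        "likelihood_adjustment": {"fixed_likelihood": likelihood},
                    }
                }
            ],
        }
    ]


# In real use: inspect_config["rule_set"] = ssn_hotword_rule_set(dlp_v2.Likelihood.VERY_LIKELY)
rule_set = ssn_hotword_rule_set("VERY_LIKELY")  # placeholder value for a dry run
print(rule_set[0]["rules"][0]["hotword_rule"]["proximity"])
```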

Deidentify: choose reversibility before you scale

Deidentify transformations split into one-way and reversible. Picking wrong is expensive to fix later because all the historic transformed data uses the choice you made on day one.

Transformation	Reversible?	Use when
`ReplaceConfig`	no	"Replace EMAIL with `[REDACTED_EMAIL]`" — simplest
`RedactConfig`	no	Remove the value entirely, no placeholder
`MaskConfig`	no	"Replace each digit with X" — preserves shape but not value
`CryptoHashConfig` (HMAC)	no	Stable pseudonym across runs, but cannot recover original
`CryptoDeterministicConfig`	yes	Tokenization for join keys — same input always produces same token
`CryptoReplaceFfxFpeConfig` (FPE)	yes	Preserve format (16-digit input → 16-digit output) for downstream systems that validate format
`DateShiftConfig`	partial	Shift dates by a per-context random offset; the offset is the key
`BucketingConfig`	no	Replace ages with brackets like 30-39

The CryptoReplaceFfxFpeConfig choice is the one teams revisit most often. FPE keeps the output in the same alphabet and length as the input: a credit-card number deidentified with FPE is still 16 digits and still fits a VARCHAR(16) column, although the output is not guaranteed to pass a Luhn check, so downstream validators that run one may reject it. The cost is operational: FPE needs a wrapped key (typically an AES key wrapped with Cloud KMS), and the key rotation story is something you have to design.

If the only downstream consumer of the deidentified data is a model or an analytics dashboard, CryptoDeterministicConfig is usually enough — it produces a stable but format-free token, which is fine if your schema can hold it.
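A deterministic-encryption config might look like the sketch below. `deterministic_deidentify_config` is a hypothetical helper, and the wrapped key bytes, KMS key name, and surrogate name are placeholders. The surrogate infoType is included because it tags each token in the output, which is what lets a later reidentify call find the tokens again.

```python
def deterministic_deidentify_config(info_type_names, wrapped_key, kms_key_name, surrogate_name):
    """Build a deidentify_config that tokenizes the given infoTypes reversibly."""
    return {
        "info_type_transformations": {
            "transformations": [
                {
                    "info_types": [{"name": n} for n in info_type_names],
                    "primitive_transformation": {
                        "crypto_deterministic_config": {
                            # AES key wrapped by Cloud KMS; DLP unwraps it per request.
                            "crypto_key": {
                                "kms_wrapped": {
                                    "wrapped_key": wrapped_key,
                                    "crypto_key_name": kms_key_name,
                                }
                            },
                            # Tokens are emitted tagged with this surrogate name.
                            "surrogate_info_type": {"name": surrogate_name},
                        }
                    },
                }
            ]
        }
    }


config = deterministic_deidentify_config(
    ["EMAIL_ADDRESS"],
    wrapped_key=b"placeholder-wrapped-key-bytes",
    kms_key_name="projects/your-project/locations/global/keyRings/your-ring/cryptoKeys/your-key",
    surrogate_name="EMAIL_TOKEN",
)
print(config["info_type_transformations"]["transformations"][0]["info_types"])
```

The result is passed as `deidentify_config` in the request to `client.deidentify_content`, next to the same `inspect_config` and `item` used for inspection.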

Pricing math at scale

Cloud DLP is one of those services where the per-unit cost is low and the volume math is brutal. Per Google's published pricing, the rates as of writing are below (verify on the Sensitive Data Protection pricing page before locking your architecture):

Operation	Rate (per Google pricing)
`content.inspect` after 1GB free monthly tier	~$1.00 per GB of content sent
`content.deidentify` / `reidentify`	~$0.50 per GB
Storage job inspect	~$1.00 per GB of source data scanned
Discovery profile	~$0.03 per GB profiled (first 1TB), tiered down

[community-verified, exact tier rates change — pull current numbers from the pricing page before sizing]

Three rough-cut scenarios to anchor the math:

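A chat app scrubbing every message inline. Average message 200 bytes, 1M messages/day. Daily volume is 200MB, which adds up to about 6GB per month. The free tier is 1GB per month, not per day, so roughly 5GB is billable: about $5/month at the rate above. Small, but not free.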
Daily audit of a 100GB BigQuery table. Using a storage job: ~$100/day, ~$3,000/month. Using Discovery for the same dataset: ~$3 for the initial profile, then incremental costs as the table grows. Discovery wins by two orders of magnitude.
One-time scan of a 10TB historical log archive. Storage job: ~$10,240. Cannot avoid; budget it explicitly before kicking off.
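The storage-job and Discovery figures can be reproduced with a few lines of arithmetic. The rates below are copied from the table in this section and are assumptions, not authoritative pricing; `estimate_cost` is a hypothetical helper.

```python
# Assumed USD-per-GB rates, copied from the table above; verify before relying on them.
RATES = {"content_inspect": 1.00, "storage_job": 1.00, "discovery": 0.03}


def estimate_cost(gb, surface, free_gb=0.0):
    """Estimate cost in USD for scanning `gb` gigabytes on a given surface."""
    billable = max(0.0, gb - free_gb)
    return round(billable * RATES[surface], 2)


daily_audit_month = estimate_cost(100 * 30, "storage_job")  # 100GB job, every day for 30 days
discovery_initial = estimate_cost(100, "discovery")         # one initial profile of the same table
archive_once = estimate_cost(10 * 1024, "storage_job")      # 10TB counted as 10,240GB

print(daily_audit_month, discovery_initial, archive_once)
```

Swapping in current rates from the pricing page turns this into a quick sizing check before an architecture is locked.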

The mistake almost every team makes once is running content.inspect in a loop over BigQuery rows, because the code is shorter than wiring up a job. A 100GB table scanned row-by-row through the content endpoint can produce a bill an order of magnitude higher than a single storage job, because you also pay the per-request overhead, not just the GB volume.
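For comparison, the job-based alternative is not much longer than the loop. The sketch below only builds the request. `bigquery_inspect_job_request` is a hypothetical helper, and the project, dataset, and table names are placeholders. It omits `actions`, so a real job would still need somewhere to save its findings.

```python
def bigquery_inspect_job_request(project_id, dataset_id, table_id, info_type_names):
    """Build a create_dlp_job request that scans one BigQuery table in place."""
    return {
        "parent": f"projects/{project_id}/locations/global",
        "inspect_job": {
            "storage_config": {
                "big_query_options": {
                    "table_reference": {
                        "project_id": project_id,
                        "dataset_id": dataset_id,
                        "table_id": table_id,
                    }
                }
            },
            "inspect_config": {
                "info_types": [{"name": n} for n in info_type_names],
            },
        },
    }


request = bigquery_inspect_job_request(
    "your-project", "support", "tickets", ["EMAIL_ADDRESS", "PHONE_NUMBER"]
)
# With the client library installed and credentials set up:
#   job = dlp_v2.DlpServiceClient().create_dlp_job(request=request)
print(request["inspect_job"]["storage_config"])
```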

Quota 429s and how to back off

DLP enforces project-level quotas on each surface. The common ones you will see in real workloads:

Quota	Default (per Google docs, as of writing)	Symptom when hit
`content.inspect` requests per minute per project	600	429 `RESOURCE_EXHAUSTED`
`dlpJobs.create` per minute per project	low (single digits)	429 on job submission
Content size per request	0.5 MB raw / 1MB structured	400 `INVALID_ARGUMENT`
Findings per request	3000 hard cap	Findings truncated, no error

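The right response to a 429 is exponential backoff with jitter, and the Google Cloud client libraries can do this for you: every call accepts a retry argument that takes a google.api_core.retry.Retry object. Check that the policy you pass actually treats RESOURCE_EXHAUSTED as retryable rather than assuming the default does. The wrong response is to retry immediately on the same thread, which racks up the per-request count without making progress.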
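To make the backoff behavior concrete, here is a library-independent sketch of exponential backoff with full jitter. `call_with_backoff` is a hypothetical helper, and the base, cap, and attempt count are illustrative defaults rather than values recommended by Google.

```python
import random
import time


def call_with_backoff(fn, is_retryable, max_attempts=5, base=1.0, cap=32.0, sleep=time.sleep):
    """Call fn(); on a retryable error wait a jittered, exponentially growing delay."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Give up on non-retryable errors and on the final attempt.
            if not is_retryable(exc) or attempt == max_attempts - 1:
                raise
            # Full jitter: a random wait between 0 and the exponential ceiling.
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))


# Dry run with a function that fails twice, then succeeds.
state = {"calls": 0}


def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("simulated 429")
    return "ok"


print(call_with_backoff(flaky, lambda e: isinstance(e, RuntimeError), sleep=lambda s: None))
```

With the DLP client, the predicate would typically check for `google.api_core.exceptions.ResourceExhausted`, which is the exception the Python library raises for a 429.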

When the 600 RPM ceiling is actually a problem (it usually is not, but high-volume real-time scrubbing can hit it), the lever to pull is batching: send multiple items in a single inspect_content call via the table field, which counts as one request against the quota but processes up to the per-request size limit of structured content.
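Batching through the table field might be sketched like this. `chunk_by_bytes` and `texts_to_table_item` are hypothetical helpers, the single column name `text` is arbitrary, and the byte limit is left as a parameter because the per-request cap for structured content should be checked against current quotas.

```python
def chunk_by_bytes(texts, limit_bytes):
    """Group texts into batches whose combined UTF-8 size stays within limit_bytes."""
    batch, size = [], 0
    for text in texts:
        n = len(text.encode("utf-8"))
        if batch and size + n > limit_bytes:
            yield batch
            batch, size = [], 0
        batch.append(text)
        size += n
    if batch:
        yield batch


def texts_to_table_item(texts, column="text"):
    """Wrap a batch of strings as a one-column table item for inspect_content."""
    return {
        "table": {
            "headers": [{"name": column}],
            "rows": [{"values": [{"string_value": t}]} for t in texts],
        }
    }


messages = ["mail me at user@example.com", "call 415-555-0142", "no PII here"]
for batch in chunk_by_bytes(messages, limit_bytes=40):
    item = texts_to_table_item(batch)
    # Each item would go out as one request: {"parent": ..., "inspect_config": ..., "item": item}
    print(len(item["table"]["rows"]))
```

Findings on table items carry a row index in their location data, which is how a result can be mapped back to the original message.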

What to do after this article

Open the Cloud DLP API entry for the endpoint catalog, then run the twenty-line inspect_content example above against your own GCP project. If you get findings back, the rest of the API is a matter of mixing and matching infoTypes and transformations — the surface is wide but the patterns repeat. The next thing to design is your custom-infoType strategy: dictionary file in GCS, RE2-compatible regexes, and hotword proximity rules for the values that depend on context. After that, decide whether your real workload is inline (content endpoints) or at-rest (jobs and Discovery), and route the spend accordingly before the bill teaches you.
