INTG-013 · Integration & Data · 22 min read · For: Salesforce Architects & Tech Leaders

Data Cloud Unification: Identity Resolution at Scale

Identity resolution is the hardest part of building a Customer Data Platform. In Data Cloud, it is also the part most commonly misconfigured. This tutorial goes deep on how Data Cloud's identity resolution actually works — the algorithms, the data structures, and the failure modes you will encounter at scale.

Vishal Sharma

Salesforce Architecture Specialist · Updated May 2026

What you will learn...
  • How Data Cloud's identity resolution algorithm works under the hood
  • The role of the Individual, Contact Point, and Party Identifier DMOs in matching
  • Ruleset types: deterministic matching vs probabilistic matching, and when to use each
  • The identity graph: how edges are created, how profiles are merged, and what the unified Individual looks like
  • Common misconfiguration patterns that produce over-merge and under-merge failures
  • Measuring resolution quality and the iterative tuning process

The Identity Resolution Problem

A modern consumer interacts with a brand across a dozen channels — website visits as an anonymous cookie, email opens as a hashed email, in-store purchases as a loyalty card, customer service calls as a phone number, and CRM records as a name and company. Each of these interactions generates a separate identifier in a separate system. Identity resolution is the process of determining that all these identifiers belong to the same real person and linking them into a single unified profile.

The challenge is that no single identifier is universal. Email addresses change. People use multiple email addresses. Phone numbers are shared within households. Cookies are cleared and regenerated. Names have variations (legal name vs. preferred name, maiden name vs. married name). A resolution algorithm that is too strict (requires exact matches on multiple identifiers) will produce many separate profiles for the same person. One that is too lenient (matches on weak identifiers like common names) will merge profiles that belong to different people. Calibrating this balance is the core challenge of identity resolution configuration.

Data Cloud approaches this through configurable rulesets — ordered sequences of matching rules that each specify which identifiers to match on, whether matching must be exact or approximate, and the confidence threshold at which a match is accepted. The output is an identity graph — a network of nodes (individual source records) connected by edges (match relationships) that Data Cloud traverses to determine which records should be unified into the same Unified Individual profile.

💡
Identity resolution is not a one-time configuration: The matching rules you deploy in month one will not be optimal at month twelve. As new data sources are onboarded, as the customer base grows, and as you measure resolution quality against known ground truth, rulesets require iterative tuning. Budget for ongoing resolution quality analysis as an operational activity, not just a one-time configuration task.

The Data Model Foundation: Individuals, Contact Points, Party Identifiers

Identity resolution in Data Cloud operates on three foundational DMO types. The Individual DMO represents a single entity in a single source system — one Contact record from Salesforce CRM, one registered user from the e-commerce platform, one loyalty member from the POS system. Each source system contributes Individual records.

The Contact Point DMOs (Contact Point Email, Contact Point Phone, Contact Point Address) represent the specific identifier values associated with an Individual — the email address associated with a CRM Contact, the phone number from a loyalty member record. Contact Points are how the identity resolution engine finds connections between Individual records that share a common identifier across different sources.

The Party Identifier DMO represents system-specific identifiers: a Salesforce Record ID, a loyalty member ID, a cookie value, a device ID. Party Identifiers link Individual records to their unique identifiers in each source system. When two Individuals from different source systems share a Party Identifier value of the same type (e.g., the same loyalty member ID recorded in both the POS system and the CRM), the resolution engine uses this as evidence that they may represent the same person. Email addresses and phone numbers are modelled as Contact Points rather than Party Identifiers.

-- Conceptual view of identity resolution data model
-- Individual: one per source-system entity
Individual_Source: { Id, SourceSystem, SourceRecordId, Name, ... }

-- Contact Point Email: email addresses per Individual
ContactPointEmail: { IndividualId, EmailAddress, isPrimary }

-- Party Identifier: cross-system linking identifiers
-- (emails and phone numbers live in the Contact Point DMOs, not here)
PartyIdentifier: {
  IndividualId,
  IdentifierType: 'loyalty_id' | 'cookie' | 'device_id',
  IdentifierValue: 'LM-000078'   -- illustrative value
}

-- Resolution creates a Unified Individual linking matched records:
UnifiedIndividual: {
  UnifiedId,
  SourceIndividuals: [crm-contact-001, ecomm-user-445, loyalty-mb-78],
  PrimaryEmail: 'john.smith@example.com',
  IdentityGraph: { edges: [...match relationships...] }
}
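
As a rough illustration of how Party Identifiers connect Individuals, the sketch below indexes identifiers by type and value together, so that a loyalty ID and a cookie that happen to share the same string are not confused. The class name, field names, and sample values are assumptions made for this example; they are not Data Cloud API names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PartyIdentifier:
    """Hypothetical, simplified stand-in for a Party Identifier record."""
    individual_id: str
    identifier_type: str
    identifier_value: str


def shared_identifiers(identifiers):
    """Map (type, value) to the Individuals that share it (two or more only)."""
    index = defaultdict(set)
    for pid in identifiers:
        index[(pid.identifier_type, pid.identifier_value)].add(pid.individual_id)
    return {key: ids for key, ids in index.items() if len(ids) > 1}


identifiers = [
    PartyIdentifier("crm-contact-001", "loyalty_id", "78"),
    PartyIdentifier("loyalty-mb-78", "loyalty_id", "78"),
    PartyIdentifier("ecomm-user-445", "cookie", "78"),  # same string, different type
]
shared = shared_identifiers(identifiers)
```

Here only the two records that carry the same loyalty ID are linked; the cookie with the same string value stays separate because its type differs.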

Deterministic vs Probabilistic Matching

Deterministic matching uses exact matches on high-confidence identifiers. Two Individual records match deterministically if they share the same email address, the same phone number, or the same loyalty member ID. Deterministic matches produce high-precision results — when two records share an exact email address, there is high confidence they represent the same person. False matches from deterministic rules are rare but do occur (shared email addresses within families, corporate email addresses used by multiple employees over time).
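
A minimal sketch of exact matching, assuming flat record dictionaries with hypothetical `email`, `phone`, and `loyalty_id` keys. The `max_bucket_size` guard is an assumption added for illustration, not a documented Data Cloud setting: it skips identifier values shared by suspiciously many records, which is one way to limit the shared-address false matches described above.

```python
from collections import defaultdict
from itertools import combinations


def deterministic_matches(records, keys=("email", "phone", "loyalty_id"),
                          max_bucket_size=2):
    """Return pairs of record ids that share an exact value on any key."""
    pairs = set()
    for key in keys:
        buckets = defaultdict(list)
        for rec in records:
            value = rec.get(key)
            if value:  # ignore missing or empty identifiers
                buckets[value].append(rec["id"])
        for ids in buckets.values():
            if len(ids) > max_bucket_size:
                continue  # value is shared too widely to be trusted
            pairs.update(combinations(sorted(ids), 2))
    return pairs


records = [
    {"id": "crm-001", "email": "john.smith@example.com", "phone": "5550101"},
    {"id": "ecomm-445", "email": "john.smith@example.com"},
    {"id": "loyalty-78", "phone": "5550101", "loyalty_id": "LM-000078"},
    {"id": "crm-002", "phone": "8005550000"},  # call-centre number
    {"id": "crm-003", "phone": "8005550000"},
    {"id": "crm-004", "phone": "8005550000"},
]
matches = deterministic_matches(records)
```

With these sample records, the email and the personal phone number each produce one pair, while the number shared by three records is ignored.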

Probabilistic matching uses weighted scoring across multiple weaker signals. Two records score highly if they share a similar name (phonetic match), a similar address (normalised and compared), and a similar phone number (last 7 digits match). No single signal is definitive, but the combination exceeds a configurable confidence threshold. Probabilistic matching captures matches that deterministic rules miss but produces more false positives, particularly at large scale where coincidental attribute similarity increases.
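
The weighted-scoring idea can be sketched as follows. This is an illustration under stated assumptions, not Data Cloud's algorithm: the weights and threshold are invented, and `difflib.SequenceMatcher` (plain character similarity) stands in for the phonetic and address comparisons mentioned above.

```python
from difflib import SequenceMatcher

# Hypothetical weights and threshold, chosen for illustration only.
WEIGHTS = {"name": 0.4, "address": 0.3, "phone": 0.3}
THRESHOLD = 0.8


def text_similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for phonetic matching."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def phone_similarity(a, b):
    """1.0 when the last 7 digits agree, otherwise 0.0."""
    digits_a = "".join(ch for ch in (a or "") if ch.isdigit())
    digits_b = "".join(ch for ch in (b or "") if ch.isdigit())
    if len(digits_a) < 7 or len(digits_b) < 7:
        return 0.0
    return 1.0 if digits_a[-7:] == digits_b[-7:] else 0.0


def match_score(r1, r2):
    """Weighted sum of the three weak signals."""
    return (
        WEIGHTS["name"] * text_similarity(r1.get("name"), r2.get("name"))
        + WEIGHTS["address"] * text_similarity(r1.get("address"), r2.get("address"))
        + WEIGHTS["phone"] * phone_similarity(r1.get("phone"), r2.get("phone"))
    )


crm = {"name": "Jon Smith", "address": "12 High St", "phone": "+1 415 555 0101"}
ecomm = {"name": "John Smith", "address": "12 High Street", "phone": "415-555-0101"}
other = {"name": "Maria Garcia", "address": "9 Elm Road", "phone": "212-555-0199"}

is_match = match_score(crm, ecomm) >= THRESHOLD
is_false_match = match_score(crm, other) >= THRESHOLD
```

No single signal decides the outcome here; the first pair passes because all three agree closely, and the second fails because none do.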

Best practice for ruleset design is deterministic-first, probabilistic-as-supplement. Deterministic rules run first and capture high-confidence matches quickly. Probabilistic rules run on records that deterministic rules did not match, capturing the remaining lower-confidence connections. This ordered approach maximises precision for the easy matches and applies probabilistic scoring only where it is genuinely needed — avoiding the false positive amplification that comes from applying probabilistic matching to records that could be matched deterministically.
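
The ordering described above can be sketched as a list of rules evaluated in sequence, where the first rule that fires wins and the fuzzy rule is only reached when the exact rule did not match. The rule names, the `postcode` field, and the 0.85 similarity cut-off are assumptions made for this example.

```python
from difflib import SequenceMatcher


def exact_email(r1, r2):
    """Deterministic rule: both records carry the same non-empty email."""
    return bool(r1.get("email")) and r1.get("email") == r2.get("email")


def fuzzy_name_same_postcode(r1, r2, min_ratio=0.85):
    """Probabilistic-style rule: similar name within the same postcode."""
    if not r1.get("postcode") or r1.get("postcode") != r2.get("postcode"):
        return False
    ratio = SequenceMatcher(
        None, r1.get("name", "").lower(), r2.get("name", "").lower()
    ).ratio()
    return ratio >= min_ratio


# Order matters: deterministic first, fuzzy as a supplement.
RULESET = [
    ("exact_email", exact_email),
    ("fuzzy_name_same_postcode", fuzzy_name_same_postcode),
]


def first_matching_rule(r1, r2, ruleset=RULESET):
    """Return the name of the first rule that matches, or None."""
    for name, rule in ruleset:
        if rule(r1, r2):
            return name
    return None


crm = {"name": "John Smith", "email": "john.smith@example.com", "postcode": "94105"}
ecomm = {"name": "Jon Smith", "email": "john.smith@example.com", "postcode": "94105"}
pos = {"name": "Jon Smith", "postcode": "94105"}
other = {"name": "Maria Garcia", "postcode": "94105"}
```

Recording which rule produced each match is a design choice worth keeping in a real pipeline, because it lets you audit how many unified profiles depend on the weaker rule.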

The Identity Graph and Profile Merging

When identity resolution runs, it produces an identity graph: an undirected graph where nodes are Individual records and edges represent match relationships (a match between A and B is also a match between B and A). A cluster of connected nodes (nodes connected directly or transitively via match edges) constitutes a single Unified Individual. Because clustering is transitive, a single incorrect edge can join two otherwise unrelated clusters. The resolution engine traverses these clusters and creates a Unified Individual record for each, inheriting attributes from all the source Individual records in the cluster according to survivorship rules.
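
Grouping records into clusters is a connected-components problem. The sketch below uses a union-find structure and treats match edges as symmetric; it illustrates the clustering idea only and makes no claim about how Data Cloud implements it internally.

```python
def cluster_records(record_ids, match_edges):
    """Group record ids into clusters connected directly or transitively."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in match_edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), []).append(rid)
    return sorted(sorted(members) for members in clusters.values())


record_ids = ["crm-001", "ecomm-445", "loyalty-78", "crm-002"]
edges = [("crm-001", "ecomm-445"), ("ecomm-445", "loyalty-78")]
clusters = cluster_records(record_ids, edges)
```

Note that `crm-001` and `loyalty-78` end up in the same cluster without ever matching each other directly, which is exactly why one bad edge can over-merge two profiles.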

Survivorship rules determine which attribute value "wins" when multiple source records have different values for the same attribute: name, address, email. Data Cloud configures these as reconciliation rules and offers three options: Last Updated ("most recent": use the value from the most recently modified source record), Most Frequent (use the value that occurs most often across the source records), and Source Sequence ("source priority": use the value from the highest-ranked source system that has one, regardless of recency). Define survivorship rules explicitly for every key attribute; allowing them to default can produce unexpected profile values.
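
To make the difference between strategies concrete, here is a small sketch of "most recent" and "source priority" applied to the same attribute. The candidate structure and function names are assumptions for this example, and skipping empty values is a simplification chosen here.

```python
from datetime import date


def most_recent(candidates):
    """Pick the non-empty value from the most recently modified source record."""
    filled = [c for c in candidates if c["value"]]
    if not filled:
        return None
    return max(filled, key=lambda c: c["modified"])["value"]


def source_priority(candidates, ranking):
    """Pick the non-empty value from the highest-ranked source in `ranking`."""
    filled = [c for c in candidates if c["value"]]
    if not filled:
        return None
    return min(filled, key=lambda c: ranking.index(c["source"]))["value"]


addresses = [
    {"source": "crm", "value": "12 High Street", "modified": date(2026, 4, 2)},
    {"source": "ecomm", "value": "9 Elm Road", "modified": date(2026, 1, 15)},
]

by_recency = most_recent(addresses)
by_priority = source_priority(addresses, ranking=["ecomm", "crm"])
```

The same two source records yield different unified addresses depending on the rule, so an unexpected value on a unified profile is usually a configuration question rather than a data fault.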

Profile merging is not instantaneous — identity resolution runs as a batch job on a configurable schedule (hourly, every 4 hours, daily, depending on data volume and org configuration). When a new Individual record is ingested, it does not immediately appear in the Unified Individual profile. It participates in resolution at the next scheduled run. This latency window is architectural — design downstream processes that depend on unified profiles to tolerate it.

Over-Merge and Under-Merge Failure Modes

Over-merge occurs when the resolution algorithm incorrectly links records that belong to different people into a single Unified Individual profile. Common causes: matching on email domains rather than full email addresses (merging everyone at the same company), probabilistic matching with a confidence threshold set too low, or matching on non-unique phone numbers (shared household numbers, or a call center number recorded as the customer's contact number). Over-merged profiles contaminate personalisation: a "welcome back" email built from a profile that combines two different customers' purchase histories will carry recommendations that are wrong for both of them.

Under-merge occurs when the resolution algorithm fails to link records that do belong to the same person, leaving multiple separate Unified Individual profiles for the same real customer. Common causes: the same person using different email addresses across systems (john@company.com vs. j.smith@company.com), phone number format differences (+1-555-0101 vs. 5550101), or a missing Party Identifier mapping for a source system. Under-merged profiles cause duplicate communications: the same customer receives the same promotional email twice because two separate profiles were activated.
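
Format differences of the kind shown above can often be removed before matching. The following is a deliberately simple sketch that assumes a single default country code; the function name and parameter are hypothetical, and real phone data usually calls for a dedicated parsing library rather than string handling.

```python
def normalise_phone(raw, default_country_code="1"):
    """Reduce a phone string to bare national digits (simplified)."""
    raw = raw.strip()
    digits = "".join(ch for ch in raw if ch.isdigit())
    # Only strip the country code when the number was written in "+" form.
    if raw.startswith("+") and digits.startswith(default_country_code):
        digits = digits[len(default_country_code):]
    return digits


same_person = normalise_phone("+1-555-0101") == normalise_phone("5550101")
```

After normalisation the two formats from the example compare equal, so an exact-match rule on phone number can link them.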

Measuring resolution quality requires a ground truth dataset: a labelled sample of record pairs known to belong to the same person, ideally alongside pairs known to belong to different people, that you can test the resolution rules against. Resolution quality metrics are precision (what percentage of the pairs the ruleset matched truly belong to the same person) and recall (what percentage of true same-person pairs the ruleset matched). Improving precision reduces over-merges; improving recall reduces under-merges. These metrics trade off against each other: a stricter ruleset improves precision at the cost of recall. Set target thresholds based on the business impact of each error type in your specific use case.
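
The two metrics can be computed pairwise, as sketched below. This assumes the ground truth labels cover every record that appears in the predicted pairs; the function name and sample pairs are invented for illustration.

```python
def pairwise_quality(predicted_pairs, true_pairs):
    """Return (precision, recall) over unordered record pairs."""
    predicted = {frozenset(p) for p in predicted_pairs}
    truth = {frozenset(p) for p in true_pairs}
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


# Ground truth: a, b and c are the same person; d and e are different people.
true_pairs = [("a", "b"), ("a", "c"), ("b", "c")]
# The ruleset linked a-b (correct) and d-e (an over-merge), and missed two pairs.
predicted_pairs = [("b", "a"), ("d", "e")]

precision, recall = pairwise_quality(predicted_pairs, true_pairs)
```

In this sample, one of the two predicted pairs is correct (precision 0.5) and one of the three true pairs was found (recall of one third), which shows an over-merge and an under-merge problem at the same time.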

💡
Email normalisation before ingestion: Email address matching is the highest-confidence deterministic rule in most rulesets. But email addresses are frequently inconsistent in source systems: uppercase characters, leading/trailing spaces, alias variations (john+salesforce@gmail.com vs. john@gmail.com). Normalise email addresses to lowercase and strip whitespace before ingestion into Data Cloud. This single pre-processing step typically improves deterministic match rates by 5-15%. Alias variations are not collapsed by lowercasing and trimming; handle them with a separate, deliberate rule if your use case needs it.
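
A minimal sketch of this pre-processing step, intended to run in whatever transformation layer sits in front of ingestion. The function name is hypothetical, and the optional `strip_plus_alias` flag is an assumption added here: it is off by default because removing plus-aliases is a policy decision rather than simple clean-up.

```python
def normalise_email(raw, strip_plus_alias=False):
    """Lowercase and trim an email; optionally drop a '+alias' suffix."""
    if not raw:
        return None
    cleaned = raw.strip().lower()
    if strip_plus_alias and "@" in cleaned:
        local, _, domain = cleaned.rpartition("@")
        local = local.split("+", 1)[0]
        cleaned = f"{local}@{domain}"
    return cleaned or None


crm_email = normalise_email("  John.Smith@Example.com ")
ecomm_email = normalise_email("john.smith@example.com")
```

Once both source values pass through the same function they compare equal, which is what an exact-match rule on email needs.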

Key Takeaways

  • Identity resolution creates Unified Individual profiles by matching Individual records across source systems using Contact Points and Party Identifiers as matching keys. The output is an identity graph of connected source records.
  • Deterministic matching (exact identifier matches) provides high precision; probabilistic matching (weighted scoring across multiple weak signals) captures additional matches at the cost of more false positives. Use deterministic-first, probabilistic-as-supplement.
  • Survivorship rules determine which attribute value is used in the Unified Individual when source systems disagree. Define survivorship explicitly for all key attributes — defaults produce surprising results.
  • Identity resolution runs as a batch job — newly ingested records do not immediately appear in Unified Individual profiles. Downstream processes must tolerate this latency window.
  • Over-merge (incorrectly combining different people) and under-merge (failing to combine the same person) are the two failure modes. Measure both using a ground truth test dataset, and set precision/recall targets based on the relative business impact of each error type.
  • Email normalisation (lowercase, trim whitespace) before ingestion is the highest-ROI pre-processing step for improving deterministic match rates in Data Cloud.

Test Your Understanding

1. A Data Cloud ruleset uses probabilistic matching as the primary (only) rule, with a confidence threshold of 70%. At scale with 50 million profiles, the team observes many unified profiles that appear to combine different customers. What is the most likely cause?

  • The Data Cloud org has insufficient storage for 50 million profiles; adding storage will improve matching accuracy.
  • At large scale, coincidental attribute similarity across different people becomes statistically significant, so probabilistic matching with a 70% threshold will generate over-merges between different people who happen to share similar names and partial addresses. Add deterministic rules first and reserve probabilistic matching for unresolved records.
  • The identity resolution job is running too frequently; increasing the interval between runs will allow more data to accumulate and improve match accuracy.

2. A retail customer appears twice in marketing campaign activations receiving duplicate emails. Investigation shows two separate Unified Individual profiles for the same person — one from Salesforce CRM and one from the e-commerce platform. The person's email address is "John.Smith@example.com" in CRM and "john.smith@example.com" in the e-commerce system. What is the most likely cause and fix?

  • Email matching is case-sensitive in Data Cloud by default; enable case-insensitive matching in the email Contact Point ruleset configuration.
  • The email addresses are identical when normalised to lowercase but were not normalised before ingestion. Pre-processing email addresses to lowercase before Data Cloud ingestion (in the data transformation layer) would allow the deterministic email match to link these two Individual records.
  • The two source systems need to be reconfigured to use the same email address format before Data Cloud can match them.

3. After a scheduled identity resolution run, an analyst queries a Unified Individual profile and finds the billing address from an e-commerce record rather than the verified address from Salesforce CRM, even though the CRM record was updated more recently. What most likely explains this?

  • Data Cloud always uses the first-ingested source record for address survivorship; there is no way to configure this behavior.
  • The survivorship rule for address is configured as "source priority" with e-commerce set as higher priority than CRM, or it defaulted to a rule that does not match the expectation. Review and update the survivorship rule for the address attribute to use "most recent" or set CRM as higher source priority.
  • The Salesforce CRM source connector has not fully synced the updated record; the address will correct itself at the next full sync.
