- How Salesforce's built-in duplicate management works and its scale limitations
- The matching algorithm: fuzzy matching, weighted fields, and match score thresholds
- Integration-created duplicates: why bulk API loads bypass duplicate management and how to prevent it
- Bulk duplicate identification: running large-scale duplicate scans without timing out
- Merge consequences: what happens to child records, activities, and related objects
- Preventive architecture: designing integrations to prevent duplicates from entering
How Salesforce Duplicate Management Actually Works
Salesforce's duplicate management has two components: Matching Rules (define how to compare records — which fields, what algorithms, what score threshold constitutes a match) and Duplicate Rules (define what action to take when a match is found — alert the user, block the save, or allow with logging). Matching Rules run the comparison logic; Duplicate Rules control the response to matches.
Each field in a Matching Rule is compared with either the Exact method (field values must match exactly) or a Fuzzy method tuned to a specific field type, such as First Name, Last Name, Company Name, or Phone. The fuzzy name methods tolerate common variations such as nicknames, initials, and minor misspellings, while Email is typically matched exactly. These methods handle the most common duplicate-creation scenario at point of entry: a user manually creating a record that already exists with a slightly different name or email.
The fundamental scale limitation is that Matching Rules run synchronously at record save time and check against a sample of records, not the full org dataset. For large orgs with millions of records, the matching algorithm compares the new record against a representative sample — not every record in the org. This means a duplicate that matches a record outside the comparison sample is not detected. The standard duplicate management is a guardrail for interactive users, not a comprehensive deduplication engine.
Integration-Created Duplicates
Integration-created duplicates are the dominant source of duplicates in enterprise Salesforce orgs. The pattern: an ETL job extracts Account records from a source system and loads them into Salesforce via Bulk API. The source system contains duplicate Accounts, as legacy systems almost always do. The Bulk API load creates every source record without deduplication, so Salesforce inherits the source system's duplicates.
Prevention requires deduplication at the source, before data enters Salesforce. The ETL transformation stage should include a deduplication step: group source records by matching key (email, phone, company name + city), select the highest-quality record from each group as the canonical record, and load only canonical records to Salesforce. This is straightforward for deterministic duplicate pairs (exact email match) but requires probabilistic matching logic for the harder cases.
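The deduplication step described above can be sketched in a few lines of Python. This is a minimal illustration for the deterministic case only, not a definitive implementation: the record layout (`id`, `company`, `city`, `email`, `phone`), the `matching_key` rule, and the `completeness` score used to pick the "highest-quality" record are all assumptions made for the example.

```python
from collections import defaultdict

def matching_key(record):
    """Build a deterministic matching key: email if present, else company + city."""
    email = (record.get("email") or "").strip().lower()
    if email:
        return ("email", email)
    company = " ".join((record.get("company") or "").lower().split())
    city = (record.get("city") or "").strip().lower()
    return ("company_city", company, city)

def completeness(record):
    """Hypothetical quality score: the number of non-empty fields."""
    return sum(1 for value in record.values() if value not in (None, ""))

def select_canonical(records):
    """Group records by matching key and keep the most complete record per group."""
    groups = defaultdict(list)
    for record in records:
        groups[matching_key(record)].append(record)
    # max() returns the first record with the top score, so ties keep source order.
    return [max(group, key=completeness) for group in groups.values()]

source = [
    {"id": "S1", "company": "Acme Corporation", "city": "San Francisco",
     "email": "ops@acme.example", "phone": ""},
    {"id": "S2", "company": "ACME Corporation", "city": "San Francisco",
     "email": "OPS@acme.example", "phone": "415-555-0100"},
    {"id": "S3", "company": "Globex", "city": "Springfield", "email": "", "phone": ""},
]

canonical = select_canonical(source)
print([record["id"] for record in canonical])  # ['S2', 'S3']
```

Only the canonical records would then be loaded. Variants that differ in spelling rather than case or whitespace fall outside this key and need the probabilistic matching mentioned above.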
For integration scenarios where pre-deduplication is not feasible, External ID-based upsert provides a form of duplicate prevention. If every source record has a stable, unique External ID (a CRM ID, a customer number), loading via upsert rather than insert means repeated loads of the same source records create or update a single Salesforce record rather than creating additional duplicates. External ID upsert is the minimum required pattern for any recurring integration — never use insert for recurring data loads.
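The difference between insert and upsert on repeated loads can be seen in a small in-memory model. `FakeOrg` below is a stand-in written for this illustration and is not the Salesforce API; the External ID field name `Customer_Number__c` is hypothetical.

```python
import itertools

class FakeOrg:
    """In-memory stand-in for one object's table; not the Salesforce API."""

    def __init__(self):
        self.records = {}                  # record id -> field dict
        self._ids = itertools.count(1)

    def insert(self, fields):
        """Always create a new record, whatever already exists."""
        record_id = f"REC{next(self._ids):04d}"
        self.records[record_id] = dict(fields)
        return record_id

    def upsert(self, external_id_field, fields):
        """Update the record with the same External ID, or create it if absent."""
        key = fields[external_id_field]
        for record_id, existing in self.records.items():
            if existing.get(external_id_field) == key:
                existing.update(fields)
                return record_id
        return self.insert(fields)

nightly_batch = [
    {"Customer_Number__c": "C-1", "Name": "Acme"},
    {"Customer_Number__c": "C-2", "Name": "Globex"},
]

insert_org, upsert_org = FakeOrg(), FakeOrg()
for _ in range(3):                         # three nightly runs of the same batch
    for row in nightly_batch:
        insert_org.insert(row)
        upsert_org.upsert("Customer_Number__c", row)

print(len(insert_org.records), len(upsert_org.records))  # 6 2
```

After three runs the insert path holds three copies of every source record, while the upsert path still holds one record per External ID.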
// SOQL to identify potential Account duplicates
// Run as a batch query job for large orgs
SELECT Name, BillingCity, BillingCountry, Phone, COUNT(Id)
FROM Account
GROUP BY Name, BillingCity, BillingCountry, Phone
HAVING COUNT(Id) > 1
ORDER BY COUNT(Id) DESC
LIMIT 1000
// Follow-up: get the actual duplicate record IDs
SELECT Id, Name, BillingCity, CreatedDate, OwnerId
FROM Account
WHERE Name = 'Acme Corporation'
AND BillingCity = 'San Francisco'
ORDER BY CreatedDate ASC
Bulk Duplicate Identification at Scale
For orgs with millions of records and thousands of existing duplicates, the standard Salesforce "Find Duplicates" button and the duplicate record reports are insufficient for systematic identification and remediation. At scale, duplicate identification requires batch queries that group records by matching attributes and surface clusters of likely duplicates for steward review.
The DuplicateRecordSet and DuplicateRecordItem objects (standard Salesforce objects) store the results of duplicate rule evaluations. These can be queried to identify records that Salesforce's matching rules have already flagged as potential duplicates. For orgs with active duplicate rules, this is the starting point for bulk remediation — work through the existing DuplicateRecordSet inventory before launching additional scanning.
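Once DuplicateRecordItem rows have been exported, grouping them into reviewable clusters is straightforward. The sketch below assumes the rows arrive as a list of dictionaries carrying the `DuplicateRecordSetId` and `RecordId` fields; the export step itself is not shown, and the ID values are placeholders.

```python
from collections import defaultdict

def build_clusters(items):
    """Group exported DuplicateRecordItem rows by their parent DuplicateRecordSet."""
    clusters = defaultdict(list)
    for item in items:
        clusters[item["DuplicateRecordSetId"]].append(item["RecordId"])
    # Only sets with two or more records are actionable; review the largest first.
    actionable = [(set_id, ids) for set_id, ids in clusters.items() if len(ids) > 1]
    return sorted(actionable, key=lambda pair: len(pair[1]), reverse=True)

exported_items = [
    {"DuplicateRecordSetId": "SET-A", "RecordId": "001A1"},
    {"DuplicateRecordSetId": "SET-B", "RecordId": "001B1"},
    {"DuplicateRecordSetId": "SET-B", "RecordId": "001B2"},
    {"DuplicateRecordSetId": "SET-B", "RecordId": "001B3"},
    {"DuplicateRecordSetId": "SET-C", "RecordId": "001C1"},
    {"DuplicateRecordSetId": "SET-C", "RecordId": "001C2"},
]

for set_id, record_ids in build_clusters(exported_items):
    print(set_id, record_ids)
```

Sorting by cluster size is one possible prioritisation; a stewardship team might equally sort by account value or record age.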
For deeper duplicate identification that Salesforce's standard rules miss — variants that require phonetic matching, cross-field correlation, or probabilistic scoring — DemandTools (covered in INTG-011) is the appropriate tool. DemandTools can run comprehensive duplicate scans against the full Account or Contact population, present matches for steward review in batches, and execute bulk merges. For large-scale duplicate remediation programs, DemandTools reduces a months-long manual effort to weeks.
Merge Consequences and Cascade Effects
Merging two Salesforce records is a significant operation that requires understanding the cascade effects. When two Account records are merged, one Account (the "master") survives and the other (the "loser") is deleted. All child records associated with the loser — Contacts, Opportunities, Cases, Activities, custom related objects — are automatically re-parented to the master Account. This child re-parenting is immediate and cannot be easily undone.
The user (or the automated merge process) selects which field values survive on the merged record, and each field can take its value from either the master or the loser. Default behavior uses the master record's values for all fields, but this is not always correct: the loser record may have a more accurate billing address, a more recent phone number, or a more appropriate record owner. Review field values explicitly during merge, especially for data-quality-critical fields.
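A field-selection policy for automated merges might look like the sketch below. The policy shown (master wins, the loser fills blanks, and named fields prefer the loser) is an assumption chosen for illustration, not Salesforce's own merge logic, and `resolve_fields` is a hypothetical helper.

```python
def resolve_fields(master, loser, prefer_loser=()):
    """Pick surviving values: master by default, loser when selected or when master is blank."""
    merged = {}
    for field in master.keys() | loser.keys():
        master_value, loser_value = master.get(field), loser.get(field)
        if field in prefer_loser and loser_value not in (None, ""):
            merged[field] = loser_value        # steward chose the loser's value
        elif master_value in (None, ""):
            merged[field] = loser_value        # fill a blank on the master
        else:
            merged[field] = master_value       # default: keep the master's value
    return merged

master = {"Name": "Acme Corporation", "Phone": "", "BillingCity": "San Francisco"}
loser = {"Name": "Acme Corp", "Phone": "415-555-0100", "BillingCity": "Oakland"}

survivor = resolve_fields(master, loser, prefer_loser={"BillingCity"})
print(survivor["Name"], survivor["Phone"], survivor["BillingCity"])
# Acme Corporation 415-555-0100 Oakland
```

Writing the policy down as code makes the per-field decisions reviewable before any merge runs.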
The loser Account's Salesforce ID is permanently deleted after merge. Any external system that holds the loser ID as a foreign key must be updated to use the master ID. This is the integration coordination challenge that makes large-scale merge programs operationally complex. For each batch of merges, generate an ID remapping table (loser ID → master ID) and propagate it to all integrated systems that held the loser IDs. Never merge without capturing this ID remapping.
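A remapping table of this kind can be sketched as follows. The input shape (a list of master ID and loser ID list pairs), the CSV layout, and the shortened IDs are assumptions for the example; the `resolve` helper handles the case where a master from one merge becomes a loser in a later one.

```python
import csv
import io

def build_remap(merge_batch):
    """merge_batch: list of (master_id, [loser_ids]) pairs captured before merging."""
    remap = {}
    for master_id, loser_ids in merge_batch:
        for loser_id in loser_ids:
            remap[loser_id] = master_id
    return remap

def resolve(remap, record_id):
    """Follow chained merges (A -> B, then B -> C) to the final surviving ID."""
    seen = set()
    while record_id in remap and record_id not in seen:
        seen.add(record_id)
        record_id = remap[record_id]
    return record_id

def remap_to_csv(remap):
    """Serialise the table with every loser pointing at its final survivor."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["loser_id", "master_id"])
    for loser_id in sorted(remap):
        writer.writerow([loser_id, resolve(remap, loser_id)])
    return buffer.getvalue()

batch = [("001AAA", ["001ABC", "001ABD"]), ("001ZZZ", ["001AAA"])]
remap = build_remap(batch)

print(resolve(remap, "001ABC"))  # 001ZZZ
print(remap_to_csv(remap))
```

Downstream systems can then apply `resolve` to every stored Account ID; IDs that were never merged pass through unchanged.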
Preventive Architecture
The most effective long-term duplicate management strategy is prevention rather than remediation. Three architectural patterns prevent duplicate creation at the integration boundary. First, External ID uniqueness constraints: define External ID fields on Account and Contact objects with the "Unique" checkbox enabled. Integration upserts using these fields automatically prevent duplicate creation for records with matching External IDs — Salesforce enforces uniqueness at the database level.
Second, pre-insert duplicate check in Apex: for integrations that create records via synchronous Apex (rather than Bulk API), implement a pre-insert SOQL query that checks for matching existing records before the insert. If a match is found, update the existing record rather than creating a new one. This adds latency but prevents duplicates for real-time, record-by-record integrations.
Third, data quality gates at the integration boundary: for ETL integrations, implement a data quality validation stage that rejects records with missing or invalid matching keys (no email, no phone, no External ID) before they reach Salesforce. Records without identifying attributes cannot be matched to existing records and will always create duplicates — they should be routed to a data quality exception queue for human review, not loaded to Salesforce blindly.
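A data quality gate of this kind could be sketched as below. The field names (`external_id`, `email`, `phone`) and the validity checks are deliberately simple assumptions; a production gate would apply whatever rules the organisation's matching keys require.

```python
def valid_email(value):
    """Very loose check: non-empty local part, an '@', and a dot in the domain."""
    local, separator, domain = value.strip().partition("@")
    return bool(local and separator and "." in domain)

def identifying_keys(record):
    """Return the names of the usable matching keys present on a record."""
    keys = []
    if (record.get("external_id") or "").strip():
        keys.append("external_id")
    if valid_email(record.get("email") or ""):
        keys.append("email")
    # Assumption: a phone number with at least seven digits is usable.
    if sum(ch.isdigit() for ch in (record.get("phone") or "")) >= 7:
        keys.append("phone")
    return keys

def quality_gate(records):
    """Split records into loadable ones and ones routed to the exception queue."""
    loadable, exceptions = [], []
    for record in records:
        (loadable if identifying_keys(record) else exceptions).append(record)
    return loadable, exceptions

incoming = [
    {"name": "Acme", "external_id": "C-1", "email": "", "phone": ""},
    {"name": "Globex", "external_id": "", "email": "not-an-email", "phone": ""},
    {"name": "Initech", "external_id": None, "email": "info@initech.example", "phone": None},
]

loadable, exceptions = quality_gate(incoming)
print([r["name"] for r in loadable], [r["name"] for r in exceptions])
# ['Acme', 'Initech'] ['Globex']
```

Records in `exceptions` would go to the human review queue rather than to Salesforce.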
Key Takeaways
- Salesforce's built-in duplicate management (Matching Rules + Duplicate Rules) works for interactive UI saves but does not fire for Bulk API operations by default. Integration loads are the primary source of enterprise-scale duplicates.
- The matching algorithm compares against a sample of records — not the full org — limiting effectiveness for large orgs where the matching record may be outside the comparison sample.
- External ID upsert is the minimum pattern for recurring integrations. Never use insert for recurring data loads — only upsert via External ID provides idempotent, duplicate-preventing behavior.
- Merging records re-parents all child objects (Contacts, Opportunities, Cases) to the master and permanently deletes the loser ID. Capture the loser-to-master ID remapping table before each merge batch and propagate it to all integrated systems.
- DemandTools enables systematic bulk duplicate identification and merge at a scale that Salesforce's standard tools cannot handle — comprehensive phonetic matching, probabilistic scoring, and batch merge execution.
- Prevention is more efficient than remediation: External ID uniqueness constraints, pre-insert duplicate checks in real-time Apex integrations, and data quality gates at the ETL boundary are the architectural patterns that prevent duplicates from entering Salesforce.
Test Your Understanding
1. An ETL integration loads 50,000 Contact records from a marketing automation platform into Salesforce using a nightly Bulk API insert operation. After three months, the Salesforce org has 150,000 Contact records for what should be 50,000 unique contacts. What is the root cause?
2. An Account merge eliminates a duplicate. The losing Account ID "001ABC" had 45 associated Opportunities. After the merge, where are those Opportunities?
3. A company has 800,000 Account records with an estimated 15% duplicate rate (approximately 120,000 duplicates). The data stewardship team plans to use the Salesforce standard Find Duplicates tool to identify and merge all duplicates. What is the primary limitation of this approach at this scale?