Real-World Data Migration Case Study: 50M Records from SAP to Salesforce

What you will learn...

How a 50M-record SAP-to-Salesforce migration was structured across a 14-month programme
Why data quality discovery consumed 40% of total project effort and how to budget for it
The technical pipeline: SAP BAPI extraction, Informatica transformation, Bulk API 2.0 load
How delta synchronisation was maintained during the 9-month parallel-run period
Three near-catastrophic decisions that were reversed in time and what replaced them
What the post-go-live data landscape looked like after 90 days — including what was still broken

The Starting Point: What SAP Held and Why It Was Messy

The organisation — a global manufacturing company with operations in 23 countries — had run SAP ECC 6.0 as its system of record for customers, materials, sales orders, and financial postings for 17 years. When the board approved a Salesforce Sales Cloud and Service Cloud implementation, the integration team inherited roughly 50 million records across seven primary object types: Business Partners (KNA1/KNB1), Materials (MARA/MAKT), Sales Orders (VBAK/VBAP), Contracts (VBKD), Service Notifications (QMEL), Equipment master data (EQUI), and Functional Locations (IFLOT).

The data quality picture was grim from the outset. A preliminary profiling exercise, completed in week three, found 2.3 million Business Partner records with no email address, 840,000 with duplicate names differing only in whitespace or punctuation, 1.1 million materials records referencing discontinued product lines, and over 400,000 sales orders with a document currency that did not match the sales area currency. These were not edge cases — they were systemic artefacts of 17 years of manual data entry across dozens of regional SAP clients that had been merged through acquisitions.

Project Budget Warning: The original project plan allocated six weeks to data quality. The actual data quality phase consumed 28 weeks. Any project that does not complete a thorough profiling exercise before committing to a timeline will face this problem. Profiling is not preparatory work — it is the work.

The migration scope was negotiated down three times before an executable plan emerged. The original scope included all historical sales orders going back to 2007. After profiling, the team agreed to migrate only orders with a document date after 2018, reducing the order volume from 31 million to 7.4 million records. Equipment and Functional Location data was descoped entirely and federated via Salesforce Connect to SAP instead — a decision that saved the programme and is now considered one of the best architectural choices made.

Pipeline Architecture: From SAP BAPI to Bulk API 2.0

The extraction layer used SAP BAPIs (Business Application Programming Interfaces) called via RFC (Remote Function Call) connections from Informatica IICS. Custom BAPI wrappers were written in ABAP for object types where standard BAPIs did not return the required fields in a single call: Business Partner with all address roles and Sales Order with line items and pricing conditions both required custom extraction logic. Full extraction of the 50M in-scope records took 72 hours of extraction time in total, run in windows spread across a 4-week period to avoid impacting SAP production performance.

Informatica IICS served as the transformation layer. The mapping logic was substantial: a single Business Partner in SAP mapped to a combination of Account, Contact, and Person Account in Salesforce depending on the partner function (sold-to, ship-to, bill-to, payer). A Business Partner with four partner functions could generate up to four Salesforce records, each requiring a cross-reference maintained in an external ID mapping table stored in an intermediate PostgreSQL database on AWS RDS. This mapping table eventually grew to 9.1 million rows and became the critical artefact for every subsequent delta migration run.
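The fan-out from one Business Partner to several Salesforce records can be sketched as follows. This is a minimal illustration, not the programme's actual mapping: the routing table `TARGET_BY_FUNCTION`, the `MappingRow` layout, and the `external_id` format are all hypothetical, since the text does not give the real rules.

```python
from dataclasses import dataclass

# Hypothetical routing table: which Salesforce object each partner function
# produces. The real rules used in the programme are not given in the text.
TARGET_BY_FUNCTION = {
    "sold_to": "Account",
    "ship_to": "Account",
    "bill_to": "Contact",
    "payer": "Contact",
}


@dataclass(frozen=True)
class MappingRow:
    sap_bp_id: str
    partner_function: str
    sf_object: str
    external_id: str  # value that would be written to the external ID field


def fan_out(sap_bp_id, partner_functions):
    """Return one cross-reference row per distinct partner function."""
    rows = []
    for function in sorted(set(partner_functions)):
        rows.append(MappingRow(
            sap_bp_id=sap_bp_id,
            partner_function=function,
            sf_object=TARGET_BY_FUNCTION[function],
            external_id=f"{sap_bp_id}:{function}",  # illustrative key format
        ))
    return rows
```

Each returned row corresponds to one row of the cross-reference table, which is why a partner with four functions contributes four rows.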

External ID Design: Every migrated record received a custom external ID field populated with the SAP source system key — for example, SAP_ECC_BP_ID__c on Account. This allowed upserts during delta runs without requiring SOQL queries to find existing records, and it provided a permanent audit trail of where each Salesforce record originated.
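As a rough sketch of how the external ID enters the load, the request body for creating a Bulk API 2.0 upsert job names the external ID field explicitly. The helper name `build_upsert_job` is an assumption made for illustration; the body fields follow the Bulk API 2.0 ingest job format, and the API version matches the one used elsewhere in this article.

```python
import json

API_VERSION = "v60.0"  # same version used by the polling snippet in this article


def build_upsert_job(sobject, external_id_field):
    """Return (path, JSON body) for creating a Bulk API 2.0 upsert ingest job."""
    path = f"/services/data/{API_VERSION}/jobs/ingest"
    body = {
        "object": sobject,
        "operation": "upsert",
        "externalIdFieldName": external_id_field,  # match rows on the SAP key
        "contentType": "CSV",
        "lineEnding": "LF",
    }
    return path, json.dumps(body)


path, body = build_upsert_job("Account", "SAP_ECC_BP_ID__c")
print(path)
print(body)
```

The job is then created by POSTing this body to the path on the org's instance URL, after which the CSV data is uploaded and the job is closed for processing.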

The load layer used Salesforce Bulk API 2.0 exclusively. REST API was considered but rejected on the basis that at 50M records even a 3-second per-record call would take 1,736 days of sequential processing. Bulk API 2.0 CSV jobs were structured at a maximum of 150MB per file, parallelised across 4 concurrent jobs per object type, and monitored via a custom Python script that polled the job status endpoint every 60 seconds and posted results to a Slack channel.

# Python snippet: poll Bulk API 2.0 job and report results
import time

import requests

def poll_bulk_job(instance_url, access_token, job_id, interval=60):
    """Poll an ingest job until it reaches a terminal state; return the job info."""
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json"
    }
    url = f"{instance_url}/services/data/v60.0/jobs/ingest/{job_id}"
    while True:
        resp = requests.get(url, headers=headers, timeout=30)
        resp.raise_for_status()  # fail loudly on HTTP errors such as an expired token
        job = resp.json()
        state = job.get("state")
        print(f"Job {job_id}: state={state}, "
              f"processed={job.get('numberRecordsProcessed',0)}, "
              f"failed={job.get('numberRecordsFailed',0)}")
        # Terminal states: stop polling once the job can no longer change
        if state in ("JobComplete", "Failed", "Aborted"):
            return job
        time.sleep(interval)

# After completion, retrieve failed records
def get_failed_results(instance_url, access_token, job_id):
    """Return the failed-records CSV for a finished ingest job."""
    headers = {"Authorization": f"Bearer {access_token}"}
    url = f"{instance_url}/services/data/v60.0/jobs/ingest/{job_id}/failedResults"
    resp = requests.get(url, headers=headers, timeout=60)
    resp.raise_for_status()
    return resp.text  # CSV of failed records with sf__Error column
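The per-file size cap described above implies a splitting step before upload. The following is a minimal sketch of one way to do it, assuming rows arrive as lists of strings; the function name `chunk_csv_rows` is hypothetical and the text does not describe how the team actually split its files.

```python
import csv
import io


def chunk_csv_rows(header, rows, max_bytes):
    """Yield CSV strings, each starting with the header, none larger than
    max_bytes when encoded as UTF-8."""
    def encode(row):
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerow(row)
        return buf.getvalue()

    header_line = encode(header)
    header_size = len(header_line.encode("utf-8"))
    parts, size = [header_line], header_size
    for row in rows:
        line = encode(row)
        line_size = len(line.encode("utf-8"))
        if header_size + line_size > max_bytes:
            raise ValueError("a single row does not fit within max_bytes")
        if size + line_size > max_bytes:
            # Current file is full: emit it and start a new one with the header
            yield "".join(parts)
            parts, size = [header_line], header_size
        parts.append(line)
        size += line_size
    if len(parts) > 1:
        yield "".join(parts)
```

Sizes are measured in encoded bytes rather than characters, which matters for names containing non-ASCII characters. For the 150MB cap in the text, `max_bytes` would be set to `150 * 1024 * 1024` or a lower value that leaves headroom.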

The Delta Synchronisation Problem

The SAP system remained live throughout the migration. Business did not stop for 14 months. New Business Partners were created daily. Existing orders were updated, closed, and cancelled. The migration team needed a mechanism to keep the Salesforce data current during the parallel-run period without running full extractions every night — a full extraction took 72 hours and would have created a perpetual catch-up situation.

The delta synchronisation solution combined three mechanisms. For Business Partners and Materials (master data), SAP Change Pointers were enabled on the relevant message types (DEBMAS for customers, MATMAS for materials). Informatica read these change pointers on a 4-hour schedule and produced delta extract files containing only modified records. The change pointer mechanism had existed in SAP since version 3.1 but had never been activated in this organisation's landscape; enabling it required SAP Basis support and a brief performance impact assessment before production enablement.

For Sales Orders (transactional data), a custom ABAP program used the SAP change document objects (VERKBELEG for sales documents) to identify orders modified since the last extraction timestamp. This timestamp was stored in a custom Z-table in SAP and updated by the ABAP program only after each successful extraction, so a failed run could be repeated from the same starting point without losing changes. For Service Notifications, a similar approach used the SAP notification change history table QMEL_AEND.

Timestamp Management: Never rely on a timestamp managed outside the source system for delta extraction. If the extraction job fails halfway through and the external timestamp has already been updated, you will miss records. Store the high-water mark inside the source system, updated only after a verified successful extraction, and use a slightly overlapping window (e.g., last-modified minus 15 minutes) to handle clock skew.
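The high-water-mark rule can be sketched in a few lines. This is an illustration under stated assumptions: `read_mark`, `write_mark`, and `extract` are hypothetical callables standing in for the Z-table read, the Z-table update, and the extraction job, none of which are specified in the text.

```python
from datetime import datetime, timedelta, timezone

OVERLAP = timedelta(minutes=15)  # overlap window suggested in the text


def run_delta(read_mark, write_mark, extract, now=None):
    """Extract changes since the stored mark; advance the mark only on success."""
    now = now or datetime.now(timezone.utc)
    window_start = read_mark() - OVERLAP   # re-read a little to absorb clock skew
    records = extract(window_start, now)   # may raise; the mark then stays untouched
    write_mark(now)                        # reached only after a successful extract
    return records
```

Because of the overlap, some records are extracted twice. That is harmless when the load is an upsert keyed on the external ID, since reloading an unchanged record leaves it as it was.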

The delta volumes were manageable: approximately 12,000 Business Partner changes per day, 85,000 order changes per day, and 3,200 material changes per day. The four-hourly Bulk API 2.0 upsert jobs consistently completed within 45 minutes, leaving sufficient buffer for reruns if needed. The external ID upsert pattern was essential here: without it, every delta load would have required a SOQL lookup to find the existing Salesforce record before updating it, which would have consumed API calls in proportion to the roughly 100,000 records changed each day.
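The arithmetic behind "manageable" is worth making explicit. The sketch below uses only the figures quoted above and assumes, for simplicity, that changes are spread evenly across the day, which the text does not claim.

```python
# Daily change volumes quoted in the text.
DAILY_CHANGES = {"business_partner": 12_000, "sales_order": 85_000, "material": 3_200}
SCHEDULE_HOURS = 4   # delta schedule from the text
JOB_MINUTES = 45     # upper bound on upsert job duration reported in the text

runs_per_day = 24 // SCHEDULE_HOURS
total_per_day = sum(DAILY_CHANGES.values())
avg_per_run = total_per_day / runs_per_day          # assumes an even spread
buffer_minutes = SCHEDULE_HOURS * 60 - JOB_MINUTES  # slack left for reruns

print(f"{total_per_day:,} changes/day, about {avg_per_run:,.0f} per run, "
      f"{buffer_minutes} minutes of slack per cycle")
```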

Three Near-Catastrophic Decisions That Were Reversed

Every large migration project has decisions that look reasonable in the planning phase but collapse under contact with reality. This one had three of significance.

The first was the decision to use Salesforce Flows for post-load data enrichment. The original plan called for a Flow that would fire on Account creation and populate billing hierarchy fields, territory assignments, and account scoring attributes by calling external services. During the initial load test of 2 million accounts, this triggered 2 million Flow executions, consumed the entire daily API limit in 6 hours, and caused the Salesforce org to begin throttling all API calls — including the Bulk API 2.0 jobs already running. The fix was to move all post-load enrichment to a separate batch Apex job that ran after each load window with explicit API call tracking, and to temporarily deactivate the Flow trigger during migration windows using a custom metadata flag.

The second was the loading of order line items (VBAP) before order headers (VBAK). A junior data engineer, working from the Informatica job sequence documentation, ran the line item load before the team had confirmed that all parent order IDs were present in Salesforce. This resulted in 340,000 orphaned OrderItem (Order Product) records with a null OrderId that triggered a cascade of validation rule failures. Recovery required a targeted Bulk API delete job, a resequenced load, and two days of reconciliation work. The lesson: always load parent objects before child objects, and enforce a dependency check in the job orchestration layer before any load job starts.

Object Load Sequencing: Define a formal load order document and make it an artefact that must be signed off before any load job can be submitted to the job scheduler. In Salesforce, the sequence for sales data is typically: Account → Contact → Pricebook2 → Product2 → PricebookEntry → Order → OrderItem. Any deviation requires explicit justification.
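A dependency check of this kind can be sketched with the standard library. The dependency map below is illustrative and should be replaced by the signed-off load order document; `Pricebook2` is the API name of the price book object, and `load_order` and `assert_ready` are hypothetical helper names. `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Child object -> parent objects that must already be loaded (illustrative).
DEPENDENCIES = {
    "Account": set(),
    "Contact": {"Account"},
    "Pricebook2": set(),
    "Product2": set(),
    "PricebookEntry": {"Pricebook2", "Product2"},
    "Order": {"Account", "Pricebook2"},
    "OrderItem": {"Order", "PricebookEntry"},
}


def load_order():
    """Return one valid load sequence in which parents precede children."""
    return list(TopologicalSorter(DEPENDENCIES).static_order())


def assert_ready(sobject, loaded):
    """Raise if any parent of sobject has not been loaded yet."""
    missing = DEPENDENCIES[sobject] - set(loaded)
    if missing:
        raise RuntimeError(
            f"{sobject} blocked; parents not loaded: {sorted(missing)}")
```

Calling `assert_ready("OrderItem", loaded)` at the start of the line item job would have stopped the out-of-order load described above before any records were written.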

The third was the initial refusal to implement a data reconciliation framework. The team lead believed that the Bulk API job success/failure reports were sufficient to confirm data accuracy. They were sufficient for load confirmation — but not for data accuracy. Two months into the parallel run, a spot-check comparison between SAP and Salesforce showed that 1.2% of Account records had incorrect billing city values due to a character encoding issue with non-ASCII city names (cities with umlauts, accented characters, and Cyrillic script from Eastern European subsidiaries). The records had loaded successfully; the data was simply wrong. A reconciliation framework comparing record counts and key field hash values across both systems was implemented retroactively, but it would have caught this issue three months earlier if deployed at the start.
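A reconciliation check comparing key sets and key-field hashes might look like the sketch below. It assumes both systems have been exported to dictionaries keyed by the external ID; the function names and the choice of SHA-256 are illustrative, not a description of the framework the team built.

```python
import hashlib
import unicodedata


def row_fingerprint(record, fields):
    """Hash the chosen fields after Unicode normalisation and trimming."""
    parts = []
    for field in fields:
        value = record.get(field) or ""
        parts.append(unicodedata.normalize("NFC", str(value)).strip())
    joined = "\x1f".join(parts)  # unit separator keeps field boundaries distinct
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def reconcile(source, target, fields):
    """Compare two {external_id: record} dicts and report the differences."""
    missing = sorted(source.keys() - target.keys())
    extra = sorted(target.keys() - source.keys())
    mismatched = sorted(
        key for key in source.keys() & target.keys()
        if row_fingerprint(source[key], fields) != row_fingerprint(target[key], fields)
    )
    return {"missing": missing, "extra": extra, "mismatched": mismatched}
```

Normalising to NFC before hashing means that two visually identical spellings of a city name compare equal, while a value damaged by an encoding error still shows up as a mismatch even though the record loaded successfully.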

Go-Live Cutover: The 72-Hour Window

The cutover plan was designed around a 72-hour blackout window: SAP would accept no new transactions from Friday 18:00 to Monday 18:00 local time at headquarters. During this window, the final delta loads would run, all outstanding enrichment jobs would complete, and the validation suite would confirm record counts and sample field accuracy before go-live was declared. Users would be given access to Salesforce only once go-live had been declared.

The cutover runbook ran to 47 pages and included 214 individual steps, each with an owner, an estimated duration, and a rollback instruction. A shared Confluence page tracked step completion in real time during the cutover weekend. The migration lead had a war room with representatives from the Salesforce team, the SAP Basis team, Informatica support, and the business super-users who were responsible for spot-checking data in their respective domains (sales, service, finance).

The actual cutover took 68 hours — within the window but with only 4 hours of buffer. The main time overrun came from the Account hierarchy reconciliation step: the SAP Partner hierarchy (KNVH table) encoded parent-child relationships using SAP customer numbers, and the hierarchy rebuild in Salesforce required processing the 9.1 million row mapping table twice to resolve the parent Account IDs correctly. This step had been tested in staging but the staging mapping table was only 2.1 million rows — the performance did not scale linearly.
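One way to resolve parent IDs without reading the mapping table more than once is to load it into an in-memory lookup first. The sketch below is illustrative only and is not the approach the team used: it assumes one Account per SAP customer number, that the lookup fits in memory, and the ID strings shown in the usage are placeholders.

```python
def resolve_parents(mapping, hierarchy):
    """
    mapping:   iterable of (sap_customer_number, salesforce_account_id)
    hierarchy: iterable of (child_sap_number, parent_sap_number), as in KNVH
    Returns ({child_account_id: parent_account_id}, [unresolved pairs]).
    """
    sf_id_by_sap = dict(mapping)  # single pass over the mapping table
    resolved, unresolved = {}, []
    for child, parent in hierarchy:
        child_id = sf_id_by_sap.get(child)
        parent_id = sf_id_by_sap.get(parent)
        if child_id is None or parent_id is None:
            unresolved.append((child, parent))  # keep for manual follow-up
        else:
            resolved[child_id] = parent_id
    return resolved, unresolved


resolved, unresolved = resolve_parents(
    [("C100", "SF-A"), ("C200", "SF-B")],
    [("C200", "C100"), ("C300", "C100")],
)
print(resolved, unresolved)
```

Whatever approach is chosen, its timing should be measured against a mapping table of production size, since behaviour at 2.1 million rows says little about behaviour at 9.1 million.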

Cutover Testing: Always run at least one complete cutover rehearsal using production-scale data volumes, not staging samples. Performance characteristics of hierarchical data processing, external ID resolution, and enrichment jobs change significantly between a 2M-row mapping table and a 9M-row mapping table. Discovering this during the actual cutover leaves no time to optimise.

Post-Go-Live: What Was Still Broken After 90 Days

Go-live on Monday was declared successful. The 47-page runbook had been completed, record counts matched within 0.2%, and the business super-users had signed off on their domain data. But successful go-live is not the same as a clean migration, and the next 90 days revealed several categories of ongoing issues.

Access to Equipment and Functional Location data (federated via Salesforce Connect to SAP) was slower than expected on high-latency connections from the company's Asian offices. The OData adapter response time from Singapore to the SAP system in Germany averaged 4.2 seconds per record retrieval, so users who opened a Service Case and needed to view related Equipment data waited more than 4 seconds for each related record to appear. The latency had been identified in the design review, but the business impact was worse than the review had accepted. Caching with Platform Cache was implemented post-go-live to reduce repeat lookups within a session.

The duplicate management problem emerged at week six. SAP had contained duplicates, and the deduplication rules applied during migration had caught the obvious ones. However, 1.4 million Account pairs that were distinct entities in SAP (different customer numbers) turned out to represent the same legal entity, created in SAP by different regional offices. These were not detectable by name-matching rules because the names were legitimately different (a German office might have used "BASF SE" while a UK office used "BASF United Kingdom Limited"). Resolving this required a manual review workflow built in Salesforce that is still processing records 90 days post-go-live.

Financial data reconciliation between SAP and Salesforce revealed a 0.8% variance in total order value — approximately €4.7M against a total migrated order book of €590M. Investigation traced the discrepancy to orders with retroactive pricing adjustments that were applied in SAP after the final delta extraction window closed. These adjustments had not been captured in the cutover plan. A supplementary delta load of 22,000 order records was performed at day 45 to close the variance.

What Every Architect Should Extract From This

Case studies are only valuable if they produce transferable insight. Several principles from this migration apply regardless of source system, target system, or record volume.

Data quality is not a remediation activity — it is a discovery activity. The goal of data profiling is not to fix data before migration; it is to understand what the data actually contains so that scope, timeline, and budget can be set accurately. Every week spent on profiling before a project commitment is worth four weeks of unplanned work after the commitment is made.

External IDs are not optional. Every record migrated into Salesforce should carry a field that identifies its origin: the source system, the source system ID, and ideally the extraction timestamp. This field has no operational value after go-live in most cases — but it is the only thing that makes delta sync, reconciliation, and post-migration debugging tractable. The cost of adding the field is trivial; the cost of not having it is not.

Parallel run periods are not a safety net — they are a measurement instrument. The value of a parallel run is the data you collect about divergence between systems, not the comfort of having a fallback. If the parallel run data is not being actively reviewed and discrepancies resolved on a weekly basis, the parallel run is providing false confidence rather than real risk reduction.

Key Takeaways

Data quality discovery consumed 40% of total project effort in this migration — budget for it explicitly or the timeline will absorb the cost anyway.
The external ID pattern (SAP source key stored in a custom Salesforce field) was the single most important technical decision for enabling reliable delta sync and reconciliation.
Object load sequencing must be formally documented and enforced in the job orchestration layer — loading child records before parents causes cascading failures that are expensive to recover from.
Post-load enrichment logic (Flows, triggers) must be disabled or rate-controlled during migration windows to prevent API limit exhaustion from overwhelming ongoing load jobs.
Cutover rehearsals must use production-scale data volumes; performance characteristics of hierarchy resolution and enrichment jobs do not scale linearly from staging samples.
Post-go-live issues (latency on federated data, delayed duplicates, pricing adjustment gaps) are normal — plan a 90-day hypercare period with dedicated resources before declaring the migration complete.

Check Your Understanding

During a large SAP-to-Salesforce migration, the team notices that post-load Flows are consuming the daily API limit within hours. What is the recommended fix?

Increase the Salesforce API limit by purchasing additional capacity from Salesforce support.

Switch from Bulk API 2.0 to REST API to reduce the number of API calls during the load.

Disable or gate the post-load Flows using a custom metadata flag during migration windows, and run enrichment as a separate batch job with explicit API tracking after each load window.

Reduce the number of records per Bulk API 2.0 job to spread the load over more days.

A data engineer is designing delta synchronisation for a migration with a 9-month parallel run. Where should the extraction high-water mark timestamp be stored?

In the job orchestration tool's configuration database, updated at the start of each extraction run.

Inside the source system (e.g., a custom Z-table in SAP), updated only after verified successful extraction, with a small overlap window to handle clock skew.

In Salesforce custom metadata, so it can be read by both the extraction tool and the Salesforce validation framework.

In a shared S3 bucket, using the last-modified timestamp of the most recently uploaded extract file.

A migration cutover rehearsal was run successfully on a 2M-row staging dataset. During the actual cutover using production data (9M rows), the Account hierarchy rebuild step takes three times longer than expected and nearly misses the cutover window. What was the root cause?

The production Salesforce org has more active users than staging, causing API throttling during the hierarchy rebuild.

The mapping table query used a full table scan without indexes, which performs adequately at 2M rows but poorly at 9M rows due to I/O constraints.

Performance of hierarchical data processing with external ID resolution does not scale linearly with dataset size — the rehearsal at staging scale did not reveal the production-scale performance profile.

The SAP system was still processing business transactions during the hierarchy rebuild, causing lock contention on the source tables.
