- What ETL means in a Salesforce context versus traditional data warehousing ETL
- How Informatica and Talend compare for Salesforce integration workloads
- The emerging ELT pattern and why it matters for Salesforce data pipelines
- Where simpler alternatives (Fivetran, dbt, custom Python) beat enterprise ETL tools
- Key architectural decisions: change data capture, delta processing, and error handling
- The Salesforce API constraints that every ETL design must respect
ETL in a Salesforce Context
ETL (Extract, Transform, Load) in the Salesforce world covers two distinct problems that are often conflated. The first is operational integration — moving data between Salesforce and other operational systems (ERP, order management, billing) on a scheduled or near-real-time basis. The second is analytical integration — extracting Salesforce data to a data warehouse or data lake for reporting and analytics. The tool choice and design patterns differ substantially between these two use cases.
Operational ETL requires bidirectional data flow, low latency (often sub-hourly), robust error handling with business-logic-aware retry, and conflict resolution when the same record is updated in both systems. Analytical ETL is typically unidirectional (Salesforce to warehouse), tolerant of higher latency, and focused on completeness and consistency rather than real-time availability.
The Salesforce API is the boundary condition for all ETL designs. Whether using Informatica, Talend, Fivetran, or a custom script, data must enter and exit Salesforce through one of its APIs — REST, SOAP, Bulk API 2.0, or the Streaming API. API rate limits, governor limits, and API version compatibility are constraints that apply regardless of what ETL tool sits above them. An ETL design that ignores Salesforce API limits will fail in production regardless of how well the ETL tool itself is configured.
Informatica for Salesforce
Informatica PowerCenter and Informatica Cloud (IICS) are the most widely deployed enterprise ETL tools in Salesforce implementations at large organisations. Informatica's Salesforce connector is mature — it handles the full Salesforce object model, supports bulk API extraction, manages OAuth authentication automatically, and has built-in delta detection using the SystemModstamp field or Change Data Capture.
Informatica IICS (the cloud-hosted version) has largely supplanted PowerCenter for new Salesforce integrations. IICS supports real-time integration through its Application Integration capability (event-triggered mappings) alongside batch ETL through Data Integration. The licensing model is consumption-based in IICS, which can be significantly cheaper than PowerCenter for organisations with moderate data volumes.
Informatica's strengths for Salesforce are its metadata-driven mapping approach (mappings are maintained visually without code), its enterprise-grade lineage and impact analysis, and its handling of Salesforce's relationship model — parent-child objects, lookup fields, and external ID-based upsert operations are handled natively without custom scripting. Its weakness is cost — Informatica licensing is among the most expensive in the ETL market, and the complexity of IICS for simple use cases is often disproportionate to the value it adds.
Talend and the Open Source Alternative
Talend Open Studio is the open-source ETL platform most commonly used as an alternative to Informatica at cost-constrained organisations. Talend's Salesforce connector covers the main API types — SOAP API for small datasets, Bulk API for high volumes — and the component library for transformation is extensive. The Talend Salesforce components handle upsert operations using External ID fields, batch size optimisation for the Bulk API, and OAuth credential management.
Talend's limitation in the Salesforce context is its Java code generation model. Talend compiles each job into executable Java code, which gives it good runtime performance but makes jobs opaque to non-developers during debugging and maintenance. When a Talend job fails during a bulk upsert due to a field validation error on 3% of records, diagnosing and fixing the rejection without Java knowledge is difficult. Organisations using Talend for Salesforce typically need a developer-level resource to maintain the integration, not just an integration administrator.
Talend Cloud (the SaaS-hosted version) has improved the operational experience significantly, adding monitoring, scheduling, and a more modern UI. But even Talend Cloud carries per-user licensing costs that make it non-trivial for smaller Salesforce integrations. The sweet spot for Talend is organisations already standardised on it for non-Salesforce data warehousing that want to extend it to Salesforce without introducing another tool.
// Salesforce Bulk API 2.0 query — what ETL tools use under the hood
// Step 1: Create a query job
POST /services/data/v60.0/jobs/query
{
"operation": "query",
"query": "SELECT Id, AccountId, Name, Amount, CloseDate, StageName,
SystemModstamp FROM Opportunity
WHERE SystemModstamp > 2026-05-18T00:00:00Z",
"contentType": "CSV",
"columnDelimiter": "COMMA",
"lineEnding": "LF"
}
// Step 2: Poll for completion, then download result batches
// This is the pattern Informatica/Talend abstract into their connectors
The ELT Alternative: Fivetran and dbt
ELT (Extract, Load, Transform) inverts the traditional ETL order — data is extracted from Salesforce and loaded raw into a data warehouse (Snowflake, BigQuery, Databricks, Redshift), and transformation happens inside the warehouse using SQL. This pattern has largely replaced traditional ETL for analytical use cases, and for many Salesforce-to-warehouse scenarios it is architecturally simpler and cheaper than Informatica or Talend.
Fivetran is the dominant ELT connector for Salesforce-to-warehouse. It uses the Salesforce Bulk API to extract objects incrementally using SystemModstamp-based delta queries, handles schema discovery automatically (new fields added to Salesforce objects are detected and added to the warehouse schema), and loads raw data to the warehouse typically within 15-60 minutes of a record change. Fivetran's pricing is per monthly active rows — for many organisations, this is significantly cheaper than Informatica for analytical workloads.
dbt (data build tool) handles the transformation layer on top of Fivetran's raw Salesforce data. dbt models are SQL SELECT statements that transform raw Salesforce tables into dimensional models, business metrics, and reporting-ready entities. The combination of Fivetran + dbt has become a standard analytics stack for Salesforce customers that rivals the functionality of Informatica at a fraction of the cost for analytics-only use cases.
Delta Detection and Change Tracking
Every ETL design for Salesforce must solve the delta problem — how to extract only records that have changed since the last extraction, rather than re-extracting the entire dataset. For large orgs with millions of records, full re-extraction is impractical — a full extract of 10 million Opportunity records takes hours and consumes significant API allocations. Delta extraction based on change detection reduces extraction time to minutes and preserves API budget for operational integrations.
Salesforce provides three delta mechanisms: SystemModstamp filtering (extract all records WHERE SystemModstamp > last_run_timestamp), Change Data Capture (subscribe to change events published by Salesforce as records are modified), and the Replication API (a newer capability for efficient bulk extraction with built-in change tracking). SystemModstamp filtering is the most widely supported across ETL tools and the simplest to implement, but it has an edge case: records deleted from Salesforce are not captured by SystemModstamp queries. Deletions require separate handling — either querying the RecycleBin, maintaining a delete-capture mechanism via CDC, or accepting that deletions are not tracked in the warehouse.
Change Data Capture (CDC) is the most comprehensive delta mechanism — it captures creates, updates, deletes, and undeletes as structured events on the Salesforce Streaming API. The challenge is that CDC events are delivered via a long-polling subscription (CometD or Pub/Sub API) that requires a persistent connection, which does not fit the batch-oriented ETL model. Most ETL tools that claim CDC support implement it by polling the event log rather than maintaining a true streaming subscription.
Error Handling and Partial Failure Patterns
Salesforce bulk operations use partial success semantics — if you submit 50,000 records for upsert and 2,000 fail validation, the remaining 48,000 succeed. This is different from all-or-nothing database transaction semantics. ETL designs must handle partial failures explicitly: capturing the failed records, logging the error reasons (which Salesforce returns in the job result files), and routing them to a dead-letter queue or error remediation workflow.
The most common error categories in Salesforce ETL are validation rule failures (data from the source system violates a Salesforce validation rule), duplicate management blocks (a deduplication rule prevents the record from being created), and field-level security failures (the integration user lacks write access to a field that the source record is trying to update). All three categories require different remediation approaches and should be tracked separately in the ETL monitoring layer.
Idempotency is a critical design principle for Salesforce ETL — the ability to re-run a failed job without creating duplicate records. External ID fields are the primary mechanism. If every source record carries a stable External ID value, Salesforce upsert operations using that External ID are idempotent — running the same upsert twice creates one record, not two. Design source-to-Salesforce mappings with an External ID strategy before any ETL implementation begins.
Key Takeaways
- ETL for Salesforce splits into operational integration (bidirectional, low-latency, conflict-aware) and analytical integration (unidirectional to warehouse, latency-tolerant). Tool selection should follow use case, not brand preference.
- Informatica IICS is the most mature enterprise ETL for Salesforce but carries high licensing and complexity costs. Best for large organisations already standardised on Informatica or with complex bidirectional integration requirements.
- Talend is a lower-cost alternative with strong Salesforce connector support, but its Java code generation model requires developer-level maintenance skills — not suitable as a no-code integration tool.
- Fivetran + dbt has become the standard ELT stack for Salesforce analytics. Much cheaper than Informatica for analytics-only workloads but not appropriate for operational bidirectional integration.
- Delta detection must use SystemModstamp filtering, CDC, or the Replication API — with explicit handling for deletes, which SystemModstamp filtering does not capture.
- Idempotency via External ID fields and partial failure handling via dead-letter queues are non-negotiable design requirements for any Salesforce ETL implementation.
Test Your Understanding
1. An organisation needs to sync Salesforce Opportunity data to Snowflake for dashboards. Data freshness within 1 hour is acceptable. They have no existing ETL tooling. What is the most architecturally appropriate and cost-effective approach?
2. An ETL job upserts 100,000 Account records to Salesforce. The job completes but the ETL tool reports 3,200 failures. What is the most important first diagnostic step?
3. A nightly ETL job extracts changed Accounts from Salesforce using SystemModstamp filtering. The data warehouse team notices that Accounts deleted in Salesforce are still appearing in warehouse reports. What is the cause and solution?
Discussion & Feedback