Data Cloud Architecture: Real-Time Data at Salesforce Scale

What you will learn...

How Data Cloud is architecturally different from Salesforce CRM and traditional CDPs
The ingestion layer: connectors, streaming ingestion, and the data lake architecture underneath
Data Model Objects (DMOs) and how they relate to the standard data model
Identity resolution: how Data Cloud unifies profiles across multiple identity graphs
Real-time data activation and the concept of Calculated Insights
The architectural constraints that architects commonly miss before deployment

What Data Cloud Actually Is

Data Cloud (formerly Salesforce CDP, formerly Customer 360 Audiences) is Salesforce's hyperscale data platform. It is architecturally separate from the core Salesforce CRM — it runs on a different infrastructure stack, uses a different data model, has different query semantics, and is licensed separately. Understanding this separation is the foundation of understanding Data Cloud architecture.

The infrastructure underneath Data Cloud is a distributed data lake, not the traditional Salesforce multi-tenant Oracle database. It reflects a newer generation of Salesforce engineering, from the same period in which Salesforce acquired Tableau (2019) and Slack (2021), and it relies on columnar storage, distributed query engines, and streaming data pipelines that the original Salesforce platform was never designed to support. When you query Data Cloud, you are not running SOQL against an Oracle database; you are running queries against a distributed data store designed for analytical workloads.

The use case target for Data Cloud is unified customer profiles at scale. A large retailer might have customer data in Salesforce CRM, a commerce platform, a loyalty system, a mobile app event stream, and a physical POS system. Data Cloud ingests all of these, resolves identity across them (determining that the same human appears in each system under different identifiers), and builds a unified profile that can be segmented and activated. This is a different problem from CRM data management, and the architectural difference reflects that.

💡

Data Cloud is not a replacement for CRM: A common misconception is that Data Cloud replaces the Salesforce CRM data model. It does not — Data Cloud is a parallel data platform that ingests from CRM (and many other sources) and returns enriched data back to CRM through activation. The two platforms are designed to work together, not to substitute for each other.

The Ingestion Layer and Data Lake Architecture

Data Cloud supports multiple ingestion patterns. The Salesforce CRM connector is the native path — it ingests standard and custom Salesforce objects directly using a managed connector that handles schema mapping and incremental sync. External data can be ingested via the Ingestion API (a REST-based streaming endpoint for real-time events), via cloud storage connectors (Amazon S3, Azure Data Lake Storage, Google Cloud Storage), and via direct streaming from platforms like Kafka or MuleSoft using the Streaming Ingestion connector.

The underlying storage architecture uses a time-partitioned data lake. Incoming data is written to partitioned storage organised by ingestion time and Data Model Object (DMO) type. Query execution uses a distributed query engine that scans these partitions efficiently for analytical queries — large-scale aggregations and segment computations that would be prohibitively slow in the CRM database.
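
To make the partitioning idea concrete, here is a minimal sketch. The path layout and the helper names (`partition_key`, `prune`) are hypothetical and are not Data Cloud's actual storage format; the sketch only shows why organising data by DMO type and ingestion time lets a query skip most of the stored data.

```python
from datetime import datetime, timezone

def partition_key(dmo_type: str, ingested_at: datetime) -> str:
    """Build a hypothetical partition path: one folder per DMO type, day, and hour."""
    ts = ingested_at.astimezone(timezone.utc)
    return f"{dmo_type}/dt={ts:%Y-%m-%d}/hour={ts:%H}"

def prune(partitions: list, dmo_type: str, day: str) -> list:
    """Keep only the partitions a query for one DMO type and one day has to scan."""
    prefix = f"{dmo_type}/dt={day}/"
    return [p for p in partitions if p.startswith(prefix)]

# Three illustrative writes landing in three different partitions.
events = [
    ("Engagement", datetime(2026, 5, 19, 14, 32, tzinfo=timezone.utc)),
    ("Engagement", datetime(2026, 5, 18, 9, 5, tzinfo=timezone.utc)),
    ("Individual", datetime(2026, 5, 19, 14, 40, tzinfo=timezone.utc)),
]
partitions = [partition_key(dmo, ts) for dmo, ts in events]

# A query over Engagement data for 19 May only touches one of the three partitions.
to_scan = prune(partitions, "Engagement", "2026-05-19")
```

The same pruning applied across years of engagement data is what keeps large aggregations tractable: the engine reads the partitions that can contain matching rows and ignores the rest.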

Streaming ingestion via the Ingestion API delivers sub-minute latency for profile updates. When a customer takes an action on a website (viewing a product, abandoning a cart), the event can be written to Data Cloud within seconds, updating the unified profile and making it available for real-time segmentation and activation. This is fundamentally different from the batch ETL paradigm that traditional CDPs operated under.

// Data Cloud Ingestion API — streaming event push
POST /api/v1/ingest/sources/{sourceApiName}/{objectApiName}
Authorization: Bearer {access_token}
Content-Type: application/json

{
  "data": [
    {
      "EventId__c": "evt-20260519-001",
      "IndividualId__c": "ind-98765",
      "EventType__c": "ProductView",
      "ProductSku__c": "SKU-4421",
      "Timestamp__c": "2026-05-19T14:32:00Z",
      "ChannelType__c": "Web"
    }
  ]
}
// Response: 202 Accepted — data queued for processing
// Profile update visible in segmentation within ~30-60 seconds

Data Model Objects and Schema Design

Data Cloud organises data into Data Model Objects (DMOs) — a structured schema layer that sits on top of the raw data lake. DMOs are categorised by their role in the data model: Individual (a person entity), Contact Point (email, phone, address associated with an Individual), Engagement (a customer interaction or event), Party Identifier (an identifier like email address or customer ID used to link records across systems), and unified profile objects that result from identity resolution.

The standard Data Cloud data model is based on the industry-standard Customer Data Model, which Salesforce has extended. Incoming data from source systems is mapped to DMOs through a data mapping step that declares which source fields correspond to which DMO fields. This mapping is critical — poor mapping decisions at this stage cascade into incorrect identity resolution and broken segmentation downstream.
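
In Data Cloud the mapping is declared in the product's mapping configuration rather than in code, but the logic is easy to picture as a lookup table. The sketch below is an illustration only: the source field names, the DMO field names, and the `map_to_dmo` helper are all hypothetical.

```python
# Hypothetical mapping: source field name -> DMO field name.
FIELD_MAP = {
    "cust_email": "EmailAddress",
    "cust_id": "CustomerId",
    "first": "FirstName",
}

def map_to_dmo(source_record: dict, field_map: dict) -> dict:
    """Rename mapped source fields to DMO fields; fail loudly if one is absent."""
    missing = sorted(src for src in field_map if src not in source_record)
    if missing:
        raise KeyError(f"source record lacks mapped fields: {missing}")
    # Source fields that are not in the map are dropped.
    return {dmo: source_record[src] for src, dmo in field_map.items()}

row = {"cust_email": "ana@example.com", "cust_id": "C-1001", "first": "Ana", "legacy_flag": 1}
mapped = map_to_dmo(row, FIELD_MAP)
```

Failing on a missing field, instead of silently writing an empty value, is the behaviour you want to mirror when validating mappings: an identifier that never reaches its DMO field is exactly the kind of error that later shows up as poor match rates.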

Custom DMOs allow organisations to extend the standard model with domain-specific objects. A healthcare organisation might add a Clinical Event DMO; a financial services firm might add a Financial Product DMO. Custom DMOs participate in the same identity resolution and segmentation framework as standard DMOs. Schema design for Data Cloud follows different principles than CRM schema design — denormalisation is acceptable (the query engine handles it efficiently), and optimising for segmentation query patterns matters more than relational normalisation.

Identity Resolution Architecture

Identity resolution is the process of determining that multiple records in different source systems represent the same real-world individual. This is Data Cloud's most technically complex capability and the one most commonly underestimated in architecture planning. An individual might appear as a Contact in Salesforce CRM, as a registered user in a commerce platform, as an anonymous cookie in a web analytics system, and as a loyalty card number in a POS system. Identity resolution links these disparate identities into a single unified profile.

Data Cloud's identity resolution uses configurable ruleset-based matching. A ruleset defines the match criteria — exact match on email address, fuzzy match on name plus exact match on phone number, probabilistic match across multiple weak identifiers. Multiple rulesets can be applied in sequence, with higher-confidence rules applied first. The output of identity resolution is a Unified Individual record that aggregates all matched source records into a single profile with a canonical Identity Graph.
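
The following sketch shows the general shape of sequential, rule-based matching. It is not Data Cloud's matching engine: the rule functions, the record fields, and the use of a union-find structure to group matches are all assumptions made for illustration, and the "fuzzy" name match is reduced to a case-insensitive comparison.

```python
from itertools import combinations

def norm(value):
    """Normalise a possibly missing string for comparison."""
    return (value or "").strip().lower()

def exact_email(a, b):
    return norm(a.get("email")) != "" and norm(a.get("email")) == norm(b.get("email"))

def name_and_phone(a, b):
    # Stand-in for "fuzzy name plus exact phone": the name check is only case-insensitive.
    same_phone = norm(a.get("phone")) != "" and norm(a.get("phone")) == norm(b.get("phone"))
    same_name = norm(a.get("name")) != "" and norm(a.get("name")) == norm(b.get("name"))
    return same_phone and same_name

RULESETS = [exact_email, name_and_phone]  # higher-confidence rule first

def resolve(records):
    """Group record IDs that any rule links, directly or through a chain of matches."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for rule in RULESETS:
        for i, j in combinations(range(len(records)), 2):
            if rule(records[i], records[j]):
                parent[find(j)] = find(i)

    groups = {}
    for i, rec in enumerate(records):
        groups.setdefault(find(i), []).append(rec["id"])
    return sorted(groups.values())

records = [
    {"id": "crm-1", "email": "Ana@Example.com", "name": "Ana Ruiz", "phone": "555-0100"},
    {"id": "web-7", "email": "ana@example.com"},
    {"id": "pos-3", "name": "ana ruiz", "phone": "555-0100"},
    {"id": "crm-2", "email": "bo@example.com", "name": "Bo Li", "phone": "555-0199"},
]
unified = resolve(records)
```

Note that `web-7` and `pos-3` share no identifier, yet they end up in the same group because each matches `crm-1`. This transitive linking is what makes resolution powerful, and it is also why a single overly loose rule can chain unrelated people together.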

Identity resolution is not instantaneous — it runs as a batch process on a configurable schedule (typically every few hours for large datasets). Real-time streaming data is ingested immediately but participates in identity resolution at the next scheduled run. This means there is a window where newly ingested records are present in Data Cloud but not yet unified into a resolved profile. Architectures that depend on fully resolved profiles for real-time activation must account for this latency.
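
A quick calculation makes the window tangible. The sketch below assumes a fixed-interval schedule, which is a simplification; the four-hour interval and the `next_resolution_run` helper are illustrative and not product defaults.

```python
from datetime import datetime, timedelta, timezone

def next_resolution_run(ingested_at, first_run, interval):
    """Return the first scheduled run at or after the ingestion time."""
    if ingested_at <= first_run:
        return first_run
    elapsed = ingested_at - first_run
    runs_needed = -(-elapsed // interval)  # ceiling division on timedeltas
    return first_run + runs_needed * interval

first_run = datetime(2026, 5, 19, 0, 0, tzinfo=timezone.utc)
interval = timedelta(hours=4)  # assumed schedule
ingested_at = datetime(2026, 5, 19, 14, 32, tzinfo=timezone.utc)

run_at = next_resolution_run(ingested_at, first_run, interval)
wait = run_at - ingested_at  # time the record exists but is not yet unified
```

With these assumed values the event ingested at 14:32 waits until the 16:00 run, so it sits unresolved for 1 hour 28 minutes before the run even starts. The run's own duration adds to that.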

⚠️

Identity resolution creates merge conflicts: When identity resolution determines two previously separate Individual records are the same person, it merges them. Downstream segments, journeys, and activations that referenced the pre-merge Individual IDs must be evaluated for impact. High-match-rate rulesets can produce unexpected merges — always validate resolution quality with sample data before enabling production resolution rules.
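
One way to evaluate merge impact is to keep a map from pre-merge IDs to surviving IDs and check it against segment membership. The sketch below is a hypothetical helper, not a Data Cloud feature; the segment names and ID format are invented for the example.

```python
def impacted_segments(segments: dict, merges: dict) -> dict:
    """List, per segment, the members whose IDs were retired by a merge.

    merges maps a pre-merge Individual ID to the ID that survived the merge.
    """
    impact = {}
    for name, members in segments.items():
        retired = sorted(m for m in members if m in merges)
        if retired:
            impact[name] = retired
    return impact

segments = {
    "high_value": {"ind-1", "ind-2", "ind-9"},
    "lapsed": {"ind-4"},
}
merges = {"ind-2": "ind-1", "ind-9": "ind-7"}  # illustrative merge results

impact = impacted_segments(segments, merges)
```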

Calculated Insights and Activation

Calculated Insights are pre-computed metrics derived from Data Cloud data — total purchase value in the last 90 days, churn propensity score, product category affinity, days since last engagement. They are defined as SQL-like queries that run against the Data Cloud data model on a scheduled basis, and the results are stored as attributes on the Unified Individual profile. This enables segmentation and personalisation based on complex derived metrics without computing them at query time.
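
In Data Cloud a Calculated Insight is written as a SQL-like query. The Python sketch below only mirrors the logic of one such metric, total purchase value in the last 90 days, to show what "pre-computed" means here. The record shape and the `purchase_total_90d` function name are assumptions.

```python
from collections import defaultdict
from datetime import date, timedelta

def purchase_total_90d(purchases, as_of):
    """Sum purchase amounts per unified profile over the 90 days ending at as_of."""
    cutoff = as_of - timedelta(days=90)
    totals = defaultdict(float)
    for p in purchases:
        if cutoff < p["date"] <= as_of:
            totals[p["unified_id"]] += p["amount"]
    return dict(totals)

purchases = [
    {"unified_id": "u1", "date": date(2026, 5, 1), "amount": 40.0},
    {"unified_id": "u1", "date": date(2026, 3, 10), "amount": 60.0},
    {"unified_id": "u1", "date": date(2025, 12, 25), "amount": 500.0},  # outside window
    {"unified_id": "u2", "date": date(2026, 4, 2), "amount": 25.5},
]

# Computed once per scheduled run, then stored as a profile attribute.
insight = purchase_total_90d(purchases, as_of=date(2026, 5, 19))
```

A segment filter such as "90-day spend above 75" then becomes a simple attribute comparison against the stored value, with no need to rescan the purchase history each time the segment is evaluated.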

Activation is the process of pushing Data Cloud segments and profile attributes back to consuming systems. Salesforce CRM is the primary activation target — a Data Cloud segment can be activated as a Salesforce Contact list, or profile attributes can be written back to custom fields on Contact or Lead records. This is the feedback loop that makes Data Cloud valuable for CRM-based processes: segments built from unified, cross-channel data driving Salesforce sales and service workflows.

External activation targets include Marketing Cloud (Journey Builder audience entry), Google Ads, Facebook Audiences, Amazon DSP, and any system supporting a file-based or API-based activation connector. This makes Data Cloud a central activation layer for cross-channel personalisation — build the segment once in Data Cloud, activate it simultaneously to multiple channels.

🔑

Activation latency is not real-time: Despite the "real-time" positioning, Data Cloud activation to external systems typically has latency of minutes to hours depending on the activation connector and schedule. True real-time activation (sub-second, on individual event) requires a different architecture — typically direct API calls from the event processing layer to the consuming system, with Data Cloud used for profile enrichment rather than real-time triggering.

Architectural Constraints and Common Gotchas

Data Cloud queries are written in SQL through Data Cloud's query interface, not in SOQL. (SAQL, the Salesforce Analytics Query Language, belongs to CRM Analytics and is a different language again.) The syntax looks familiar to anyone who has written SOQL, but SOQL features do not carry over one to one: complex subqueries, relationship traversals, and certain aggregate functions behave differently or are not supported. Architects migrating reporting logic from SOQL to Data Cloud must validate query compatibility explicitly.

Data retention in Data Cloud is governed by retention policies configured per DMO. The default retention for engagement data is typically 2 years, but this is configurable. Unlike CRM records (which exist until deleted), Data Cloud retention policies automatically purge data older than the retention window. Compliance-driven retention requirements must be mapped to DMO-level retention configuration before go-live.
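
The effect of a per-DMO retention window can be sketched as follows. The retention values, the row shape, and the `purge_expired` helper are hypothetical; real policies are set in Data Cloud configuration, not in code.

```python
from datetime import date, timedelta

# Hypothetical policy: days to keep per DMO type; None means keep indefinitely.
RETENTION_DAYS = {"Engagement": 730, "Individual": None}

def purge_expired(rows, today):
    """Return only the rows still inside their DMO's retention window."""
    kept = []
    for row in rows:
        days = RETENTION_DAYS.get(row["dmo"])  # unknown DMO types are kept
        if days is None or row["event_date"] >= today - timedelta(days=days):
            kept.append(row)
    return kept

rows = [
    {"id": "e1", "dmo": "Engagement", "event_date": date(2024, 1, 1)},
    {"id": "e2", "dmo": "Engagement", "event_date": date(2025, 11, 2)},
    {"id": "i1", "dmo": "Individual", "event_date": date(2019, 6, 30)},
]
remaining = purge_expired(rows, today=date(2026, 5, 19))
```

The engagement row from January 2024 falls outside the assumed 730-day window and is dropped. If a regulation requires that row to be kept for seven years, the retention setting has to say so before the purge runs, not after.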

The Data Cloud org relationship to the CRM org matters architecturally. Data Cloud is provisioned as a separate "tenant" connected to the CRM org via a trusted relationship. Multi-org Salesforce configurations — where an enterprise has multiple Salesforce production orgs — require careful consideration of which CRM org Data Cloud is connected to, how data from other orgs is ingested, and how activation data flows back to the correct CRM org.

Key Takeaways

Data Cloud is architecturally separate from Salesforce CRM — it runs on a distributed data lake infrastructure designed for analytical workloads, not the multi-tenant Oracle database underlying standard Salesforce objects.
Ingestion supports CRM connector, streaming Ingestion API (sub-minute latency), cloud storage connectors (S3, ADLS, GCS), and Kafka streaming — covering both real-time event streams and batch data feeds.
Data Model Objects (DMOs) provide the schema layer. Individual, Contact Point, Engagement, and Party Identifier DMOs are foundational — mapping incoming data accurately to these is critical for identity resolution quality.
Identity resolution runs as a scheduled batch process, not in real-time. Newly ingested records participate in resolution at the next scheduled run, creating a latency window for fully resolved profiles.
Calculated Insights are pre-computed derived metrics (purchase totals, propensity scores) stored on the Unified Individual profile — enabling complex segmentation without real-time computation overhead.
Activation to consuming systems (CRM, Marketing Cloud, ad platforms) has latency of minutes to hours. True sub-second real-time activation requires event-driven architectures beyond what Data Cloud natively provides.

Test Your Understanding

1. A retail client wants to use Data Cloud to personalise their website in real-time — showing different product recommendations within 100ms of a customer clicking a category page. Is Data Cloud's activation framework the right tool for this?

Yes — Data Cloud's streaming ingestion and real-time activation support sub-second response times suitable for page-level personalisation

No — Data Cloud activation latency is minutes to hours, not sub-second. The website personalisation layer needs a real-time decision API that queries Data Cloud-enriched profiles, not direct Data Cloud activation

Possibly — it depends on whether the retailer has a Real-Time Identity Resolution licence add-on

2. An organisation ingests customer event data into Data Cloud and then immediately queries for a fully resolved unified profile to use in a Journey Builder entry. The profile is not found. What is the most likely cause?

The Ingestion API failed to process the event — events must be re-submitted if not immediately visible in unified profiles

Identity resolution runs as a scheduled batch process: the event was ingested, but the unified profile will not be updated until the next resolution run, which may be hours away

Journey Builder requires direct CRM Contact records and cannot consume Data Cloud unified profiles directly

3. A client has poor identity resolution quality — many duplicate Unified Individual records exist when the same person appears in multiple source systems. What is the most impactful first step to diagnose this?

Increase the number of identity resolution rulesets to cast a wider matching net across all available identifiers

Rebuild the Data Cloud org from scratch with a fresh data model to eliminate accumulated matching errors

Review the Party Identifier DMO mappings — if source system identifiers (email, customer ID, cookie) are not correctly mapped to Party Identifier records, the resolution engine cannot match across systems regardless of ruleset quality
