Large Data Volumes: How Salesforce Handles 100M+ Records

What you will learn...

What "large data volume" means in Salesforce context and where the thresholds are
How Salesforce's shared database architecture handles data isolation and query performance at scale
Indexing strategy: when standard indexes are sufficient and when custom indexes are needed
The Bulk API and parallel processing patterns for large-scale data operations
Big Objects: the archiving tier for historical data that doesn't need to be queried regularly
Data archiving strategies and the operational patterns for managing record volume over time

What Large Data Volume Actually Means

Salesforce's definition of "large data volume" (LDV) is context-dependent. There is no single threshold above which an implementation is "large data volume" — the relevant scale depends on the object, the query patterns, the relationship depth, and the sharing model complexity.

As a working guideline: objects with more than 1 million records warrant LDV design attention. Objects with more than 10 million records require explicit indexing strategy and query optimisation. Objects approaching 100 million records require data archiving planning and potentially platform-level discussions with Salesforce support.

The distinction that matters architecturally is between records that are actively queried and records that exist. A financial services firm with 500 million historical transaction records is not necessarily operating in an LDV crisis — if those transactions are in a Big Object or an archiving tier and only the last 12 months of data (5 million records) is in the active object, the query performance is manageable. The problem is when all 500 million records are in the active standard object being queried by users, flows, and integrations simultaneously.

💡

Active vs archival separation: The fundamental LDV architecture principle is separating active data from archival data. Active data is what users query regularly and what automation runs against. Archival data is historical records that need to exist for compliance or analysis but don't need sub-second query performance. Designing this separation early changes the scale you have to manage operationally.

How Salesforce Stores Data for Multiple Tenants

Understanding Salesforce's multi-tenant data storage model explains the LDV query performance characteristics. Salesforce does not give each tenant a separate database table for each custom object. Instead, all tenant data is stored in shared tables, with each row tagged with an organisation ID.

When you query Account records in your org, the database engine is executing a query against a shared table that contains Account records from many organisations, filtered by your org ID. This means every SOQL query inherently has an implied AND OrgId = :currentOrgId filter — the platform adds this automatically, but it is always present in the execution plan.

This architecture is why custom indexes in Salesforce are called "two-column indexes" — they index on the combination of OrgId + the indexed field. A standard B-tree index on just the custom field would be useless in a multi-tenant table; the index must include the org ID for the query to use it efficiently.

The implication for LDV design is that index selectivity must be evaluated within the context of your org's data, not the full table. A field that is highly selective across all orgs (like a globally unique external ID) may still be non-selective within your org if your data has many records with the same value.

-- Conceptual representation of the multi-tenant table structure
-- (not actual Salesforce SQL, but illustrative)
SELECT OrgId, Id, AccountId, CloseDate, Amount, StageName
FROM Opportunity_Data
WHERE OrgId = 'YOUR_ORG_ID'   -- always present, added by platform
AND CloseDate > '2025-01-01'   -- your filter
AND Amount > 100000             -- your filter

-- Custom index on (OrgId, CloseDate) enables this query
-- to use the index efficiently within your org's data partition

Indexing Strategy for Large Objects

Salesforce automatically indexes certain fields on every object: the record ID, all fields marked as External ID, all fields marked as Unique, and a limited set of standard fields. For LDV scenarios, these automatic indexes are often insufficient — the fields being queried most frequently are not in the auto-indexed set.

Custom indexes are requested via Salesforce support. The process: identify the queries that are slow or timing out, submit a support case requesting a custom index on the relevant field(s), and Salesforce creates the index in a background maintenance window. This is not an instant process — custom index creation can take days for large tables, and the maintenance windows are off-peak.

Not all fields benefit from indexing. Indexes help selective queries — queries that return a small fraction of total records (typically less than ~5-10%). Non-selective queries will not use the index because the database query planner determines that a full table scan is faster. Common non-selective fields include status fields with few distinct values (e.g., Active/Inactive on a field where 95% of records are Active), boolean fields, and short text fields with many repeated values.

🔑

Selectivity analysis before requesting indexes: Before requesting a custom index via support, calculate the field's selectivity in your org. A field where the query filter would return > 10% of all records in the object will not benefit from an index. Request indexes for high-selectivity fields — date ranges on recent records, external IDs, unique identifiers, and status fields where the queried value represents a minority of records.

Bulk API and Parallel Processing

The Bulk API is the correct mechanism for large-scale data operations — inserts, updates, deletes, and queries against millions of records. Unlike the REST/SOAP API (which is synchronous and operates on small batches), the Bulk API is asynchronous, processes data in parallel, and is optimised for high throughput at the cost of some latency.

Bulk API v2 (the current version) simplifies the v1 job/batch/result model into a simpler upload-and-poll pattern. You upload a CSV file to a job, the platform processes it in parallel chunks, and you poll for results. For queries, Bulk API v2 returns results as CSV files that you download in chunks.

The architectural decision point is when to use Bulk API vs REST API. As a general rule: operations on more than 2,000 records should use Bulk API. Below that threshold, the REST API's synchronous processing is faster because it avoids the overhead of job creation, parallel setup, and polling. Above 2,000 records, Bulk API's parallelism and optimised batch processing delivers significantly better throughput.

# Bulk API v2 data load example (using SFDX CLI)
# Create a Bulk API job for Opportunity insert
sf data import bulk \
  --file opportunities.csv \
  --sobject Opportunity \
  --target-org myorg

# Monitor the job
sf data bulk results \
  --job-id <jobId> \
  --target-org myorg

# The CSV format for bulk operations:
# Name,AccountId,CloseDate,StageName,Amount
# Deal A,001...,2026-12-31,Prospecting,50000
# Deal B,001...,2026-11-30,Qualification,75000

💡

Parallel processing in Batch Apex: Batch Apex can run multiple instances in parallel using Database.executeBatch() with different scope parameters or using the Database.BatchableContext to coordinate. For very large datasets, multiple parallel batch jobs (each processing a date partition of the data) can complete significantly faster than a single sequential batch job, subject to the platform's limit on concurrent batch jobs.

Big Objects: The Archiving Tier

Big Objects are a Salesforce platform feature specifically designed for storing and archiving high-volume historical data. Unlike standard custom objects, Big Objects are optimised for high-volume write and append operations, can store tens of billions of records, and are queryable via SOQL with restrictions.

The key architectural difference between Big Objects and standard custom objects is the index model. Big Objects require you to define a composite index at creation time — the combination of fields that will be used as the query filter. You can only query Big Objects via SOQL using the fields in that composite index, in left-to-right order. This is a significant constraint — it means the access pattern must be designed at schema creation time, not retrofitted later.

Big Objects are immutable — records can be inserted or deleted but not updated. They do not participate in the standard Salesforce sharing model (all users with access to the object can see all records). They do not support triggers, workflows, or flows. They are a pure data storage tier.

The standard LDV architecture pattern using Big Objects: active records live in standard custom objects (the last 12-24 months), and a scheduled batch job archives older records to a Big Object. Reports and dashboards query the standard object for operational use. Historical analysis queries the Big Object directly or via a connected data warehouse that syncs from the Big Object.

Data Archiving Strategy and Operational Patterns

For organisations that do not need the Big Object tier, simpler archiving strategies can manage active record volumes while meeting retention requirements.

Soft delete / status-based archiving: Records are marked with an "Archived" status and excluded from active reports and list views via default filters. The records remain in the standard object but are operationally invisible. This is the simplest approach but provides no query performance benefit — all records (including archived) are scanned in non-selective queries.

Field value migration to attachment/file: For objects where the "heavy" data is in long text fields (notes, descriptions, log entries), migrating field content to files (ContentDocument) and leaving only metadata in the record reduces the row width and improves performance for queries that don't need the full text content.

External data lake archiving: Mature data management strategies move old records out of Salesforce entirely — to a data lake (Snowflake, BigQuery, Databricks) via the Salesforce Bulk API or an ETL pipeline. Salesforce retains only a reference (an External ID pointing to the warehouse record) for compliance traceability. Queries for recent data are fast in Salesforce; historical analysis runs in the warehouse at warehouse scale.

⚠️

Archiving and regulatory requirements: Before archiving records out of Salesforce, verify the regulatory retention requirements. GDPR, HIPAA, SOX, and financial services regulations specify minimum retention periods and access requirements for archived data. Records archived to an external data lake must remain accessible per the regulation's access time requirements — "in the warehouse" is acceptable only if the SLA for data retrieval from the warehouse meets the regulatory access requirement.

Key Takeaways

LDV relevance starts at ~1M records and becomes critical at 10M+. The key design principle is separating actively-queried records from archival records — the volume you must manage operationally should be the active set, not the historical total.
Salesforce stores all tenant data in shared tables tagged by OrgId. Custom indexes are two-column (OrgId + field) and are requested via Salesforce support. Index only selective fields — fields where the query filter returns less than ~5-10% of total org records.
Bulk API is the correct mechanism for data operations on more than 2,000 records. Bulk API v2's asynchronous, parallel processing provides dramatically better throughput than synchronous REST API for large datasets.
Big Objects provide a storage tier for tens of billions of historical records with restricted query access (composite index only, immutable, no sharing model). They are best used as the destination for scheduled batch archiving of records older than the active retention window.
Data archiving strategies range from soft-delete status flags (simplest, no query benefit) to external data lake migration (most scalable, requires warehouse infrastructure). Choice depends on volume, access requirements, and regulatory retention needs.
Regulatory requirements must be validated before archiving — records archived to an external system must still meet the access SLA requirements of applicable regulations.

Test Your Understanding

1. You request a custom index on the "Status__c" field of a 10-million-record object where 90% of records have Status__c = 'Active'. Will this index improve query performance for queries filtering on Status__c = 'Active'?

Yes — an index always improves query performance regardless of the field's data distribution

No — querying for 'Active' returns 90% of records. This is non-selective, and the query planner will ignore the index in favour of a full table scan, which is faster for non-selective queries.

Yes — a compound index on OrgId + Status__c is always used by Salesforce regardless of selectivity

2. An integration team inserts 500,000 Contact records nightly. Which API should they use and why?

REST API — it supports parallel calls from multiple integration nodes and achieves higher throughput through concurrency

Bulk API v2 — it is designed for high-volume asynchronous operations, processes data in parallel chunks, and provides significantly better throughput than synchronous APIs for 500K+ records

SOAP API with batch calls — SOAP handles large XML payloads more efficiently than REST for bulk data

3. An architect designs a Big Object schema with a composite index on (Account_Id__c, Transaction_Date__c, Transaction_Id__c). A query is submitted with only Transaction_Id__c in the WHERE clause. What happens?

The query uses Transaction_Id__c as the rightmost key in the composite index, returning results efficiently

The query fails or returns no results — Big Objects require composite index fields to be used in left-to-right order. A query using only Transaction_Id__c (the rightmost field) without Account_Id__c and Transaction_Date__c cannot use the index.

The query performs a full table scan and returns results, but more slowly than an indexed query

Large Data Volumes: How Salesforce Handles 100M+ Records

What Large Data Volume Actually Means

How Salesforce Stores Data for Multiple Tenants

Indexing Strategy for Large Objects

Bulk API and Parallel Processing

Big Objects: The Archiving Tier

Data Archiving Strategy and Operational Patterns

Key Takeaways

Test Your Understanding

Continue Reading

The Truth About Salesforce Caching: What's Cached, What Isn't

Flow vs Apex: The Architectural Decision Framework

Package Development Model vs Org Development Model

Discussion & Feedback