- What "large data volume" means in Salesforce context and where the thresholds are
- How Salesforce's shared database architecture handles data isolation and query performance at scale
- Indexing strategy: when standard indexes are sufficient and when custom indexes are needed
- The Bulk API and parallel processing patterns for large-scale data operations
- Big Objects: the archiving tier for historical data that doesn't need to be queried regularly
- Data archiving strategies and the operational patterns for managing record volume over time
What Large Data Volume Actually Means
Salesforce's definition of "large data volume" (LDV) is context-dependent. There is no single threshold above which an implementation is "large data volume" — the relevant scale depends on the object, the query patterns, the relationship depth, and the sharing model complexity.
As a working guideline: objects with more than 1 million records warrant LDV design attention. Objects with more than 10 million records require explicit indexing strategy and query optimisation. Objects approaching 100 million records require data archiving planning and potentially platform-level discussions with Salesforce support.
The distinction that matters architecturally is between records that are actively queried and records that exist. A financial services firm with 500 million historical transaction records is not necessarily operating in an LDV crisis — if those transactions are in a Big Object or an archiving tier and only the last 12 months of data (5 million records) is in the active object, the query performance is manageable. The problem is when all 500 million records are in the active standard object being queried by users, flows, and integrations simultaneously.
How Salesforce Stores Data for Multiple Tenants
Understanding Salesforce's multi-tenant data storage model explains the LDV query performance characteristics. Salesforce does not give each tenant a separate database table for each custom object. Instead, all tenant data is stored in shared tables, with each row tagged with an organisation ID.
When you query Account records in your org, the database engine is executing a query against a shared table that contains Account records from many organisations, filtered by your org ID. This means every SOQL query inherently has an implied AND OrgId = :currentOrgId filter — the platform adds this automatically, but it is always present in the execution plan.
This architecture is why custom indexes in Salesforce are called "two-column indexes" — they index on the combination of OrgId + the indexed field. A standard B-tree index on just the custom field would be useless in a multi-tenant table; the index must include the org ID for the query to use it efficiently.
The implication for LDV design is that index selectivity must be evaluated within the context of your org's data, not the full table. A field that is highly selective across all orgs (like a globally unique external ID) may still be non-selective within your org if your data has many records with the same value.
-- Conceptual representation of the multi-tenant table structure
-- (not actual Salesforce SQL, but illustrative)
SELECT OrgId, Id, AccountId, CloseDate, Amount, StageName
FROM Opportunity_Data
WHERE OrgId = 'YOUR_ORG_ID' -- always present, added by platform
AND CloseDate > '2025-01-01' -- your filter
AND Amount > 100000 -- your filter
-- Custom index on (OrgId, CloseDate) enables this query
-- to use the index efficiently within your org's data partition
Indexing Strategy for Large Objects
Salesforce automatically indexes certain fields on every object: the record ID, all fields marked as External ID, all fields marked as Unique, and a limited set of standard fields. For LDV scenarios, these automatic indexes are often insufficient — the fields being queried most frequently are not in the auto-indexed set.
Custom indexes are requested via Salesforce support. The process: identify the queries that are slow or timing out, submit a support case requesting a custom index on the relevant field(s), and Salesforce creates the index in a background maintenance window. This is not an instant process — custom index creation can take days for large tables, and the maintenance windows are off-peak.
Not all fields benefit from indexing. Indexes help selective queries — queries that return a small fraction of total records (typically less than ~5-10%). Non-selective queries will not use the index because the database query planner determines that a full table scan is faster. Common non-selective fields include status fields with few distinct values (e.g., Active/Inactive on a field where 95% of records are Active), boolean fields, and short text fields with many repeated values.
Bulk API and Parallel Processing
The Bulk API is the correct mechanism for large-scale data operations — inserts, updates, deletes, and queries against millions of records. Unlike the REST/SOAP API (which is synchronous and operates on small batches), the Bulk API is asynchronous, processes data in parallel, and is optimised for high throughput at the cost of some latency.
Bulk API v2 (the current version) simplifies the v1 job/batch/result model into a simpler upload-and-poll pattern. You upload a CSV file to a job, the platform processes it in parallel chunks, and you poll for results. For queries, Bulk API v2 returns results as CSV files that you download in chunks.
The architectural decision point is when to use Bulk API vs REST API. As a general rule: operations on more than 2,000 records should use Bulk API. Below that threshold, the REST API's synchronous processing is faster because it avoids the overhead of job creation, parallel setup, and polling. Above 2,000 records, Bulk API's parallelism and optimised batch processing delivers significantly better throughput.
# Bulk API v2 data load example (using SFDX CLI)
# Create a Bulk API job for Opportunity insert
sf data import bulk \
--file opportunities.csv \
--sobject Opportunity \
--target-org myorg
# Monitor the job
sf data bulk results \
--job-id <jobId> \
--target-org myorg
# The CSV format for bulk operations:
# Name,AccountId,CloseDate,StageName,Amount
# Deal A,001...,2026-12-31,Prospecting,50000
# Deal B,001...,2026-11-30,Qualification,75000
Big Objects: The Archiving Tier
Big Objects are a Salesforce platform feature specifically designed for storing and archiving high-volume historical data. Unlike standard custom objects, Big Objects are optimised for high-volume write and append operations, can store tens of billions of records, and are queryable via SOQL with restrictions.
The key architectural difference between Big Objects and standard custom objects is the index model. Big Objects require you to define a composite index at creation time — the combination of fields that will be used as the query filter. You can only query Big Objects via SOQL using the fields in that composite index, in left-to-right order. This is a significant constraint — it means the access pattern must be designed at schema creation time, not retrofitted later.
Big Objects are immutable — records can be inserted or deleted but not updated. They do not participate in the standard Salesforce sharing model (all users with access to the object can see all records). They do not support triggers, workflows, or flows. They are a pure data storage tier.
The standard LDV architecture pattern using Big Objects: active records live in standard custom objects (the last 12-24 months), and a scheduled batch job archives older records to a Big Object. Reports and dashboards query the standard object for operational use. Historical analysis queries the Big Object directly or via a connected data warehouse that syncs from the Big Object.
Data Archiving Strategy and Operational Patterns
For organisations that do not need the Big Object tier, simpler archiving strategies can manage active record volumes while meeting retention requirements.
Soft delete / status-based archiving: Records are marked with an "Archived" status and excluded from active reports and list views via default filters. The records remain in the standard object but are operationally invisible. This is the simplest approach but provides no query performance benefit — all records (including archived) are scanned in non-selective queries.
Field value migration to attachment/file: For objects where the "heavy" data is in long text fields (notes, descriptions, log entries), migrating field content to files (ContentDocument) and leaving only metadata in the record reduces the row width and improves performance for queries that don't need the full text content.
External data lake archiving: Mature data management strategies move old records out of Salesforce entirely — to a data lake (Snowflake, BigQuery, Databricks) via the Salesforce Bulk API or an ETL pipeline. Salesforce retains only a reference (an External ID pointing to the warehouse record) for compliance traceability. Queries for recent data are fast in Salesforce; historical analysis runs in the warehouse at warehouse scale.
Key Takeaways
- LDV relevance starts at ~1M records and becomes critical at 10M+. The key design principle is separating actively-queried records from archival records — the volume you must manage operationally should be the active set, not the historical total.
- Salesforce stores all tenant data in shared tables tagged by OrgId. Custom indexes are two-column (OrgId + field) and are requested via Salesforce support. Index only selective fields — fields where the query filter returns less than ~5-10% of total org records.
- Bulk API is the correct mechanism for data operations on more than 2,000 records. Bulk API v2's asynchronous, parallel processing provides dramatically better throughput than synchronous REST API for large datasets.
- Big Objects provide a storage tier for tens of billions of historical records with restricted query access (composite index only, immutable, no sharing model). They are best used as the destination for scheduled batch archiving of records older than the active retention window.
- Data archiving strategies range from soft-delete status flags (simplest, no query benefit) to external data lake migration (most scalable, requires warehouse infrastructure). Choice depends on volume, access requirements, and regulatory retention needs.
- Regulatory requirements must be validated before archiving — records archived to an external system must still meet the access SLA requirements of applicable regulations.
Test Your Understanding
1. You request a custom index on the "Status__c" field of a 10-million-record object where 90% of records have Status__c = 'Active'. Will this index improve query performance for queries filtering on Status__c = 'Active'?
2. An integration team inserts 500,000 Contact records nightly. Which API should they use and why?
3. An architect designs a Big Object schema with a composite index on (Account_Id__c, Transaction_Date__c, Transaction_Id__c). A query is submitted with only Transaction_Id__c in the WHERE clause. What happens?
Discussion & Feedback