- The three-layer metrics framework for AI feature evaluation: technical, usage, and business outcome
- How to establish baselines before go-live — what measurements must be in place to make post-launch comparison valid
- Feature-specific metrics for the most commonly deployed Salesforce AI features
- The signals that indicate a false positive — high usage metrics masking poor quality outcomes
- How to structure a 30/60/90-day post-go-live review that drives actionable decisions
Why Most AI Evaluations Fail
AI features are routinely declared successful at go-live based on adoption metrics alone — users are engaging with the feature, so it must be working. This is the most common evaluation failure. Adoption tells you whether users are using a feature; it does not tell you whether the feature is producing the outcomes it was designed to produce.
A case triage agent that routes 95% of cases automatically has high adoption. But if 30% of those cases are misrouted and have to be manually corrected, the actual workflow impact can be negative: whenever correcting the misroutes takes more effort than automatic routing saved, the agent has created more work than it saved. This kind of false positive is invisible without outcome measurement, and it is more common than programmes typically acknowledge.
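The arithmetic behind the case triage example can be sketched as follows. This is a minimal illustration rather than a prescribed measurement method: the function name and the per-case handling times (2 minutes to route a case by hand, 10 minutes to correct a misroute) are hypothetical assumptions, not figures from any deployment.

```python
def net_minutes_saved(total_cases, auto_route_rate, misroute_rate,
                      manual_route_minutes, correction_minutes):
    """Return net handling minutes saved by automatic routing.

    A negative result means the agent created more work than it saved.
    Both time inputs are assumed values supplied by the caller.
    """
    auto_routed = total_cases * auto_route_rate
    misrouted = auto_routed * misroute_rate
    minutes_saved = auto_routed * manual_route_minutes
    minutes_lost = misrouted * correction_minutes
    return minutes_saved - minutes_lost


# 1,000 cases, 95% routed automatically, 30% of those misrouted.
# Assumed effort: 2 minutes to route by hand, 10 minutes to fix a misroute.
net = net_minutes_saved(1000, 0.95, 0.30, 2, 10)
print(round(net))  # negative: corrections outweigh the routing time saved
```

Under these assumed timings the result is negative. More generally, at a 30% misroute rate the impact turns negative once a correction takes more than about 3.3 times as long as routing a case manually.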
Establish your baseline metrics before go-live, not after. Post-hoc baselines are unreliable because memory and documentation of pre-AI performance are both imprecise. The measurement framework should be part of the delivery plan, not an afterthought following go-live.
The Three-Layer Metrics Framework
A complete AI evaluation framework has three layers, each measuring something different. All three are required; none is sufficient alone.
Technical metrics measure whether the AI is functioning correctly: API success rate, latency (p50 and p99), error rate, confidence score distribution, and model accuracy (for classification features). These are monitoring metrics — they tell you whether the system is healthy. They should be tracked continuously, not just at go-live.
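As a rough sketch of how the technical layer might be summarised from raw call logs, the snippet below computes success rate, error rate, and p50/p99 latency. The log shape (a list of latency and success pairs) and the function name are assumptions made for illustration; a real deployment would read these values from its own monitoring tooling.

```python
from statistics import quantiles


def technical_summary(calls):
    """Summarise API health from a list of (latency_ms, succeeded) tuples.

    Needs at least two calls, because percentiles are interpolated.
    """
    latencies = [latency for latency, _ in calls]
    failures = sum(1 for _, succeeded in calls if not succeeded)
    # n=100 yields 99 cut points; index 49 is p50 and index 98 is p99.
    cuts = quantiles(latencies, n=100, method="inclusive")
    return {
        "success_rate": 1 - failures / len(calls),
        "error_rate": failures / len(calls),
        "p50_ms": cuts[49],
        "p99_ms": cuts[98],
    }


# Hypothetical log: 100 calls with latencies of 1..100 ms, every tenth one failing.
sample_calls = [(float(i), i % 10 != 0) for i in range(1, 101)]
print(technical_summary(sample_calls))
```

Reporting p99 alongside p50 matters because a healthy median can hide a slow tail that users experience as the feature "hanging".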
Usage metrics measure whether users are engaging with the feature: feature activation rate, acceptance rate for AI suggestions (the proportion of AI-generated drafts or recommendations that users act on without modification), override rate (how often users reject or significantly edit AI output), and time-to-action (how quickly users act after receiving an AI recommendation). Usage metrics are leading indicators of business outcome — if users are consistently overriding AI suggestions, either the AI quality is poor or user trust has not been established.
Business outcome metrics measure whether the deployment is delivering its intended value: case resolution time, first-contact resolution rate, sales pipeline progression rate, forecast accuracy, customer satisfaction scores. These are the metrics that translate into ROI. They typically require 4–8 weeks of post-go-live data to show statistically reliable differences from baseline.
The acceptance rate for AI suggestions is the most revealing usage metric. A low acceptance rate (below 40%) indicates that the AI output quality is insufficient for users to trust it. A very high acceptance rate (above 90%) can indicate rubber-stamping — users accepting suggestions without review — which is a different risk. A healthy acceptance rate in the 60–80% range suggests that users are engaging critically and finding the AI genuinely useful.
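The acceptance rate bands described above can be turned into a simple check. The thresholds for "low", "healthy", and "possible rubber-stamping" come from the text; the "borderline" label for the ranges the text does not cover (40% to 60% and 80% to 90%) is an assumption added here, as is the function name.

```python
def acceptance_band(accepted_unmodified, total_suggestions):
    """Return (rate, label) for AI suggestions accepted without modification."""
    if total_suggestions <= 0:
        raise ValueError("total_suggestions must be positive")
    rate = accepted_unmodified / total_suggestions
    if rate < 0.40:
        label = "low"                       # output quality likely insufficient
    elif rate > 0.90:
        label = "possible rubber-stamping"  # accepted without real review?
    elif 0.60 <= rate <= 0.80:
        label = "healthy"
    else:
        label = "borderline"                # assumed label for uncovered ranges
    return rate, label


for accepted in (30, 70, 94):
    print(accepted, acceptance_band(accepted, 100))
```

A label is only a prompt for investigation. A rate above 90%, for example, is worth pairing with a spot check of accepted output before concluding that users are rubber-stamping.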
Feature-Specific Metrics
Each AI feature type has metrics specific to its purpose. The following are the key measurement points for the most commonly deployed Salesforce AI features.
Agentforce autonomous agents: autonomous resolution rate (conversations fully resolved without human escalation), escalation rate by reason (explicit customer request vs agent uncertainty vs high-stakes action), post-conversation CSAT score, case re-open rate for agent-resolved cases, average conversation duration.
Einstein case triage: classification accuracy (validated by sampling 100 routed cases weekly and checking whether the assigned queue was correct), misroute rate, time from case creation to first queue assignment, volume of manually overridden routings.
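A weekly sample of 100 cases gives an estimate of classification accuracy, not an exact value, so it helps to report an uncertainty range with it. The sketch below uses a Wilson score interval, which is a choice made here for illustration rather than something the framework prescribes; the function name and the queue labels are hypothetical.

```python
import math


def sampled_accuracy(sample, z=1.96):
    """Estimate routing accuracy from manually reviewed cases.

    sample: list of (assigned_queue, correct_queue) pairs.
    Returns (accuracy, lower, upper) using a Wilson score interval;
    z=1.96 corresponds to roughly 95% confidence.
    """
    n = len(sample)
    if n == 0:
        raise ValueError("sample must not be empty")
    correct = sum(1 for assigned, expected in sample if assigned == expected)
    p = correct / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, centre - half, centre + half


# Hypothetical weekly review: 90 of 100 sampled cases were routed correctly.
reviewed = [("billing", "billing")] * 90 + [("billing", "technical")] * 10
accuracy, lower, upper = sampled_accuracy(reviewed)
print(f"accuracy={accuracy:.2f}, interval=({lower:.3f}, {upper:.3f})")
```

With 100 cases and 90% observed accuracy the interval runs from roughly 83% to 94%, so a week-to-week movement of a few points is within sampling noise and should not be read as a trend on its own.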
Einstein Lead Scoring: conversion rate of leads by score tier (high / medium / low), correlation between score and actual close (if high-scored leads are not converting at higher rates than low-scored leads after 90 days, the model is not performing), sales team score utilisation rate (are reps actually prioritising by score?).
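The tier comparison for lead scoring can be sketched in a few lines. The input shape (pairs of score tier and conversion flag) and both function names are assumptions for illustration; in practice the data would come from a report on closed leads.

```python
from collections import Counter


def conversion_by_tier(leads):
    """leads: iterable of (score_tier, converted) pairs, e.g. ("high", True)."""
    totals, wins = Counter(), Counter()
    for tier, converted in leads:
        totals[tier] += 1
        if converted:
            wins[tier] += 1
    return {tier: wins[tier] / totals[tier] for tier in totals}


def score_separates_tiers(rates):
    """True when high-scored leads convert at a higher rate than low-scored ones."""
    return rates["high"] > rates["low"]


# Hypothetical 90-day outcomes: 100 leads per tier.
leads = (
    [("high", True)] * 30 + [("high", False)] * 70
    + [("medium", True)] * 15 + [("medium", False)] * 85
    + [("low", True)] * 5 + [("low", False)] * 95
)
rates = conversion_by_tier(leads)
print(rates, score_separates_tiers(rates))
```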
Generative email drafting: acceptance rate, edit distance (average words changed by users — a proxy for output quality), time saved per draft versus baseline, email response rate for AI-drafted versus manually written outreach.
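One way to approximate the "words changed" measure for email drafts is a word-level diff between the AI draft and the text the user actually sent. The sketch below uses Python's standard difflib for this; it is an approximation rather than a strict edit-distance algorithm, and the function names and sample emails are hypothetical.

```python
from difflib import SequenceMatcher


def words_changed(draft, final):
    """Count words inserted, deleted, or replaced between draft and final text."""
    draft_words, final_words = draft.split(), final.split()
    matcher = SequenceMatcher(None, draft_words, final_words, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Count the longer side of each differing span.
            changed += max(i2 - i1, j2 - j1)
    return changed


def mean_words_changed(pairs):
    """Average words changed across (draft, final) pairs."""
    return sum(words_changed(draft, final) for draft, final in pairs) / len(pairs)


pairs = [
    ("please review the attached proposal", "please review the updated proposal"),
    ("thanks for your time", "thanks again for your time"),
]
print(mean_words_changed(pairs))
```

Because long emails naturally attract more edits, it may be more comparable to divide the count by the draft length, although that normalisation is an extension beyond what the metric list describes.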
Identifying False Positives
A false positive in AI evaluation is a situation where headline metrics look positive but underlying quality or business impact is negative. The most common false positives to watch for:
High autonomous resolution rate, rising re-open rate: the agent is "resolving" conversations without genuinely answering questions. Customers contact again because the issue was not resolved, just closed.
High email acceptance rate, falling response rates: users are accepting AI drafts because they trust the feature, but the AI-generated emails perform worse than manually written ones because they lack personalisation or relevant context.
High classification confidence, high misroute rate: the model reports high confidence but is systematically wrong for a specific input category. Confidence score and accuracy are not the same — validate accuracy by sampling, not by reading confidence scores.
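The confidence-versus-accuracy check can be sketched by comparing, per input category, the average confidence the model reported with the accuracy found in a manually validated sample. The record shape, the function name, and the 0.20 gap threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict


def confidence_gaps(records, max_gap=0.20):
    """Compare reported confidence with sampled accuracy for each category.

    records: (category, confidence, was_correct) tuples from a manually
    checked sample. max_gap is an assumed threshold, not a standard value.
    """
    by_category = defaultdict(list)
    for category, confidence, was_correct in records:
        by_category[category].append((confidence, was_correct))

    report = {}
    for category, rows in by_category.items():
        mean_confidence = sum(conf for conf, _ in rows) / len(rows)
        accuracy = sum(1 for _, ok in rows if ok) / len(rows)
        report[category] = {
            "mean_confidence": mean_confidence,
            "accuracy": accuracy,
            "overconfident": mean_confidence - accuracy > max_gap,
        }
    return report


# Hypothetical sample: "returns" cases are confidently and systematically wrong.
records = (
    [("billing", 0.90, True)] * 9 + [("billing", 0.90, False)]
    + [("returns", 0.95, True)] * 4 + [("returns", 0.95, False)] * 6
)
for category, stats in confidence_gaps(records).items():
    print(category, stats)
```

Breaking the comparison down by category is the important part: an overall accuracy figure can look acceptable while one category is misrouted most of the time.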
Vanity metrics — feature usage counts, total AI interactions, "AI-assisted" case volumes — are the most commonly reported post-go-live metrics and the least useful for evaluating AI quality. Ensure your go-live reporting framework leads with outcome metrics, not activity metrics. A stakeholder presentation full of AI interaction counts obscures whether the AI is helping.
The 30/60/90-Day Review Structure
A structured post-go-live review cadence prevents AI deployments from drifting without visibility. Three review points produce the most actionable signal.
30-day review focuses on technical and usage health: Is the system stable? Are API error rates acceptable? Are users engaging with the feature? Are there obvious quality failures in the output that need immediate action? Decisions at 30 days are typically operational — fix errors, address user training gaps, tune confidence thresholds.
60-day review introduces business outcome data for the first time. Compare case resolution times, routing accuracy, or lead score conversion rates against baseline. This is the first moment at which you have statistically meaningful outcome data. Decisions at 60 days may include feature configuration changes, data quality remediation, or scope adjustments if certain use cases are not performing.
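For rate-based outcomes such as first-contact resolution, one simple way to judge whether a post-go-live figure differs meaningfully from baseline is a two-proportion z-test. This is offered as a hedged sketch, not as the required method: it assumes cases are independent and ignores seasonality and case-mix changes, and the function name and the counts below are hypothetical.

```python
import math


def two_proportion_z_test(base_success, base_total, post_success, post_total):
    """Return (z, two_sided_p_value) for the post-launch rate versus baseline."""
    p_base = base_success / base_total
    p_post = post_success / post_total
    pooled = (base_success + post_success) / (base_total + post_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / post_total))
    if std_err == 0:
        return 0.0, 1.0
    z = (p_post - p_base) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical data: 600 of 1,000 cases resolved at first contact before
# go-live, 660 of 1,000 in the 60 days after.
z, p_value = two_proportion_z_test(600, 1000, 660, 1000)
print(f"z={z:.2f}, p={p_value:.4f}")
```

A small p-value suggests the change is unlikely to be noise, but it does not show that the AI feature caused it. That still depends on a valid pre-go-live baseline and on ruling out other changes made in the same period.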
90-day review is the first substantive ROI assessment. With 90 days of data, the noise from the go-live period is absorbed and the underlying performance trend is visible. This review should produce either confirmation of the business case or a decision to modify the deployment scope, retrain the model, or address the data quality issues that are limiting performance.
Pre-agree with stakeholders that the 90-day review may result in a deployment scope reduction. AI features that are not performing should be switched off or narrowed, not kept live because of sunk cost. A smaller deployment that genuinely works is worth more than a broad deployment with poor performance — and it is less damaging to user trust in AI capability broadly.
Key Takeaways
- Adoption metrics alone are insufficient — AI evaluation requires technical metrics, usage metrics, and business outcome metrics; all three layers are necessary
- Baselines must be established before go-live — post-hoc baselines are unreliable and make outcome comparison invalid
- Acceptance rate (60–80% is healthy) is the most revealing usage metric; very high rates may indicate rubber-stamping rather than genuine quality
- False positives — high usage metrics masking poor outcomes — are common and require explicit outcome measurement to detect
- The 30/60/90-day review structure separates operational stabilisation (30 days), initial outcome data (60 days), and substantive ROI assessment (90 days)
- Pre-agree that the 90-day review can produce scope reduction — a smaller deployment that works is better than a broad deployment with poor quality
Checkpoint: Test Your Understanding
1. An Agentforce deployment shows a 78% autonomous resolution rate, and the programme team reports very high user satisfaction scores. Case re-open rates have increased by 35% since go-live. How should this be interpreted?
2. What does an AI suggestion acceptance rate of 94% most likely indicate?
3. Why must AI feature baselines be established before go-live rather than estimated afterwards?