Core ML Inference Performance in 2026: Benchmarks, Latency, and What the Numbers Mean
Core ML latency benchmarks across device classes and model types in 2026. Classification, generative, cloud comparisons, thermal throttling, and what the numbers mean for architecture decisions.
Most benchmark articles report a number without reporting the conditions. This guide covers what Core ML inference performance actually looks like across device classes, model types, and compute configurations — and what those numbers mean when you are making architecture decisions.
The measurement problem
Core ML inference latency is not a single number. It is a distribution shaped by four independent variables: the hardware, the compute unit targeted, the precision of the model weights, and the thermal state of the device at inference time.
Benchmark articles that report a single figure — "8ms inference" — are describing one point in that space. The number is not wrong, but it is incomplete. Understanding the full distribution is what separates a benchmark from an architecture input.
The constraint that shapes every production decision: you cannot control which device your user runs, and you cannot control thermal state. You can control compute unit selection, model precision, and when you schedule inference.
What Core ML actually runs on
Apple Silicon devices contain three distinct compute resources that Core ML can target: the CPU, the GPU, and the Neural Engine (ANE). Each has a different performance profile and a different power cost.
Neural Engine (ANE): Purpose-built for matrix operations. On A17 Pro and M-series chips it delivers peak throughput for supported operations — typically sub-10ms for classification-scale models. It is also the most power-efficient path for sustained inference workloads. The MLComputeUnits default, .all, lets the runtime decide — the right default for most models.
GPU: Handles operations the ANE cannot execute and outperforms the CPU on parallelisable workloads. Its power draw is higher than the ANE for equivalent tasks.
CPU: The fallback. Deterministic and always available, but the slowest path for neural network operations. Inference routed through the CPU signals that the model contains operations the ANE or GPU cannot handle.
Latency benchmarks: 2026 reference points
Classification models
Image classification models converted to Core ML with INT8 weight quantization, running on A17 Pro or M3 hardware:
| Model | ANE Latency (p50) | ANE Latency (p95) | CPU Fallback (p50) |
|---|---|---|---|
| MobileNetV3-Small | 1–3ms | 4–6ms | 18–30ms |
| EfficientNet-B0 | 3–6ms | 8–12ms | 35–60ms |
| ResNet-50 | 8–14ms | 18–25ms | 80–130ms |
Text classification models (BERT-Mini, DistilBERT) with 4-bit quantization:
| Model | ANE Latency (p50) | ANE Latency (p95) |
|---|---|---|
| BERT-Mini (4-layer) | 4–8ms | 10–16ms |
| DistilBERT classification head | 12–20ms | 25–38ms |
Generative and language models
For a 3B parameter model quantized to 4-bit:
A17 Pro & M3:
- First-token: 80–150ms
- Throughput: 25–45 tokens/sec
- First-token p95 under thermal pressure: 200–320ms
M3 Pro & M4:
- First-token: 40–80ms
- Throughput: 55–90 tokens/sec
First-token latency under 100ms is imperceptible. Above 300ms, users perceive a pause.
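To see how first-token latency and throughput combine into the wait a user actually experiences, the arithmetic can be sketched as follows. The helper name `estimatedResponseTime` and the 100-token response length are assumptions made for illustration; the latency and throughput values are the ends of the A17 Pro & M3 ranges listed above.

```swift
import Foundation

/// Rough end-to-end time for a generated response: the first token arrives
/// after `firstTokenSeconds`, and each remaining token arrives at the
/// sustained decode rate. Hypothetical helper, not a Core ML API.
func estimatedResponseTime(firstTokenSeconds: Double,
                           tokensPerSecond: Double,
                           tokenCount: Int) -> Double {
    precondition(tokensPerSecond > 0 && tokenCount >= 1)
    return firstTokenSeconds + Double(tokenCount - 1) / tokensPerSecond
}

// Ends of the A17 Pro & M3 ranges, for an assumed 100-token response.
let best = estimatedResponseTime(firstTokenSeconds: 0.080, tokensPerSecond: 45, tokenCount: 100)
let worst = estimatedResponseTime(firstTokenSeconds: 0.150, tokensPerSecond: 25, tokenCount: 100)
print(String(format: "best %.2fs, worst %.2fs", best, worst))
```

Under these assumptions the full response takes roughly 2.3 to 4.1 seconds. First-token latency decides whether the response feels immediate; throughput decides how long the user waits for the rest.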
Cloud API round-trip comparison
The on-device numbers above compare against cloud API round-trips of 200–800ms under normal network conditions — and that range assumes a reliable connection. On mobile networks, p95 latency for a cloud inference call frequently exceeds 1 second.
On-device inference is not faster in every scenario. For very large models — 70B+ parameters — cloud inference is still faster in absolute terms. But for models in the 3B–7B range with quantization, on-device inference on current Apple Silicon is competitive with cloud latency on a good connection, and strictly faster on a degraded one.
The variables that move the numbers
Compute unit selection
The default .all configuration works well for models that were converted cleanly and contain only ANE-supported operations. When a model contains unsupported ops, Core ML silently routes those layers to the CPU — and total latency reflects the CPU bottleneck, not ANE throughput.
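The diagnostic: load the model with MLModelConfiguration.computeUnits set to .cpuAndNeuralEngine and compare its latency against .all. If the two are similar, the GPU is contributing little and the model is most likely already running on the ANE; a .cpuOnly run that is much slower confirms it. If .cpuAndNeuralEngine is significantly slower than .all, the GPU was doing meaningful work.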
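A minimal sketch of that comparison, assuming a compiled model at `modelURL` and a prepared `MLFeatureProvider` input. The function names, the 50-iteration count, and the extra .cpuOnly baseline are choices made for this example; .cpuAndNeuralEngine requires iOS 16 or macOS 13.

```swift
import CoreML
import Foundation

/// Median latency in milliseconds over `iterations` calls to `body`
/// (the upper middle sample when the count is even).
func medianLatencyMs(iterations: Int, _ body: () throws -> Void) rethrows -> Double {
    precondition(iterations > 0)
    var samples: [Double] = []
    for _ in 0..<iterations {
        let start = DispatchTime.now().uptimeNanoseconds
        try body()
        samples.append(Double(DispatchTime.now().uptimeNanoseconds - start) / 1_000_000)
    }
    return samples.sorted()[samples.count / 2]
}

/// Loads the model once per compute-unit setting and reports the median latency of each.
func compareComputeUnits(modelURL: URL, input: MLFeatureProvider) throws -> [String: Double] {
    let settings: [(String, MLComputeUnits)] = [
        ("all", .all), ("cpuAndNeuralEngine", .cpuAndNeuralEngine), ("cpuOnly", .cpuOnly)
    ]
    var results: [String: Double] = [:]
    for (name, units) in settings {
        let configuration = MLModelConfiguration()
        configuration.computeUnits = units
        let model = try MLModel(contentsOf: modelURL, configuration: configuration)
        _ = try model.prediction(from: input)  // warm-up: the first call includes setup cost
        results[name] = try medianLatencyMs(iterations: 50) {
            _ = try model.prediction(from: input)
        }
    }
    return results
}
```

Run it on a cool, otherwise idle device, since thermal state shifts all three figures.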
Model precision and quantization
FP32 weights are the starting point. FP16 halves the memory footprint with negligible accuracy loss for most tasks. INT8 halves it again. 4-bit quantization — now well-supported via Core ML Tools 8.x — reduces a 3B parameter model to approximately 1.5GB, which fits comfortably within the memory budget of an iPhone 15 Pro.
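The footprint figures follow from simple arithmetic: parameter count times bits per weight, divided by eight. A short sketch, assuming decimal gigabytes and counting weights only (the helper name `weightFootprintGB` is made up for this example):

```swift
import Foundation

/// Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes).
/// Ignores quantization scales, activations, and KV cache, so treat it as a lower bound.
func weightFootprintGB(parameterCount: Double, bitsPerWeight: Double) -> Double {
    return parameterCount * bitsPerWeight / 8 / 1_000_000_000
}

let parameters = 3_000_000_000.0  // the 3B model discussed above
for (label, bits) in [("FP32", 32.0), ("FP16", 16.0), ("INT8", 8.0), ("4-bit", 4.0)] {
    print(label, weightFootprintGB(parameterCount: parameters, bitsPerWeight: bits), "GB")
}
```

For the 3B model this gives 12, 6, 3, and 1.5 GB respectively. Runtime memory use is higher than the weight footprint alone.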
The latency improvement from FP32 to INT8 on the ANE is typically 30–50% for classification models. For generative models, quantization affects sustained throughput more than first-token latency.
Thermal state and battery pressure
This is the variable most benchmarks ignore. Apple Silicon chips throttle aggressively under sustained thermal load. A model that runs in 8ms on a cold device may run in 18–25ms after 10 minutes of continuous inference.
ProcessInfo.processInfo.thermalState surfaces the current thermal state as one of four levels: .nominal, .fair, .serious, or .critical. Production architectures that run inference in a loop, such as real-time classification or continuous audio processing, need to check it and reduce inference frequency under pressure.
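One way to back off under thermal pressure is to stretch the interval between inferences as the state worsens. The sketch below assumes a loop that asks for its next delay before each inference; the function names and the specific multipliers are illustrative assumptions, not Apple guidance.

```swift
import Foundation

/// Multiplier applied to the base interval between inferences for each thermal state.
/// Returns nil when inference should pause entirely.
func intervalMultiplier(for state: ProcessInfo.ThermalState) -> Double? {
    switch state {
    case .nominal: return 1.0
    case .fair: return 1.5
    case .serious: return 3.0
    case .critical: return nil      // pause inference until the device cools
    @unknown default: return 3.0    // be conservative for states added in future
    }
}

/// Seconds to wait before the next inference, or nil to skip this cycle.
func nextInferenceInterval(
    baseInterval: TimeInterval,
    state: ProcessInfo.ThermalState = ProcessInfo.processInfo.thermalState
) -> TimeInterval? {
    guard let multiplier = intervalMultiplier(for: state) else { return nil }
    return baseInterval * multiplier
}
```

Polling before each inference is the simplest approach. An app can instead observe ProcessInfo.thermalStateDidChangeNotification and update its schedule only when the state changes.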
MLComputeUnits: what each option actually does
| Configuration | Behaviour | When to use |
|---|---|---|
| .all | Runtime selects CPU, GPU, and ANE as needed | Default; correct for most converted models |
| .cpuOnly | Forces CPU execution | Debugging; deterministic testing |
| .cpuAndGPU | Excludes ANE | Models with ANE-incompatible ops that run well on GPU |
| .cpuAndNeuralEngine | Excludes GPU | Power-sensitive workloads on ANE-compatible models |
.cpuAndNeuralEngine is underused. For models that are fully ANE-compatible, it produces the best combination of latency and power draw — the GPU is not loaded, and the ANE handles the full inference path.
Measuring inference in production
Xcode Instruments has a Core ML template. Use it. It shows per-layer execution time, compute unit routing, and memory allocation — the three things you need to diagnose a latency problem.
For production monitoring, the minimal measurement pattern is:
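```swift
import CoreML
import Foundation

let start = DispatchTime.now().uptimeNanoseconds  // monotonic, unaffected by clock changes
let output = try model.prediction(input: input)
let latencyMs = Double(DispatchTime.now().uptimeNanoseconds - start) / 1_000_000
// Log latencyMs alongside the thermal state read at the same moment.
let thermalState = ProcessInfo.processInfo.thermalState
```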
Log thermal state alongside latency. Without it, you cannot distinguish a model performance regression from a thermal throttling event. The p95 figure is what matters for user experience, not the mean.
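A sketch of how logged samples could be reduced to p50 and p95 per thermal state. The `LatencySample` type, the string labels for thermal state, and the nearest-rank percentile method are assumptions chosen for this example.

```swift
import Foundation

/// One logged inference: latency plus the thermal state at the time.
struct LatencySample {
    let milliseconds: Double
    let thermalState: String   // e.g. "nominal" or "serious"; a plain label keeps logs portable
}

/// Nearest-rank percentile (p in 0...100) of a non-empty list of values.
func percentile(_ p: Double, of values: [Double]) -> Double {
    precondition(!values.isEmpty && (0...100).contains(p))
    let sorted = values.sorted()
    let rank = Int((p * Double(sorted.count) / 100).rounded(.up))
    return sorted[max(rank, 1) - 1]
}

/// p50 and p95 latency for each thermal state present in the samples.
func summarize(_ samples: [LatencySample]) -> [String: (p50: Double, p95: Double)] {
    let grouped = Dictionary(grouping: samples, by: { $0.thermalState })
    return grouped.mapValues { group -> (p50: Double, p95: Double) in
        let values = group.map { $0.milliseconds }
        return (p50: percentile(50, of: values), p95: percentile(95, of: values))
    }
}
```

Grouping by thermal state keeps a hot-device p95 from being read as a model regression: if the nominal-state figures are unchanged, the model has not become slower.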
Apple Foundation Models and the 2026 baseline
Apple Foundation Models, introduced with Apple Intelligence, run entirely on-device via the ANE. On A17 Pro and M3 hardware: first-token latency 60–120ms, sustained throughput 20–35 tokens/sec.
Foundation Models are not accessible via MLModel — they run through the FoundationModels framework with a structured API. Performance characteristics differ from custom Core ML models because the model is Apple's, managed by the OS, with its own scheduling and memory management.
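For comparison with the MLModel path, here is a hedged sketch of timing the first streamed output through the FoundationModels framework. The function name `firstTokenLatency` is made up for this example, and the framework calls (`SystemLanguageModel.default.availability`, `LanguageModelSession`, `streamResponse(to:)`) reflect my understanding of the API at the time of writing; check them against the current documentation before relying on them.

```swift
import Foundation
import FoundationModels

/// Time from sending a prompt to the first streamed partial response.
/// Returns nil when the system model is unavailable or the stream produces nothing.
@available(iOS 26.0, macOS 26.0, *)
func firstTokenLatency(prompt: String) async throws -> Duration? {
    guard case .available = SystemLanguageModel.default.availability else { return nil }
    let session = LanguageModelSession()
    let clock = ContinuousClock()
    let start = clock.now
    // Stop at the first streamed element; its arrival approximates first-token latency.
    for try await _ in session.streamResponse(to: prompt) {
        return clock.now - start
    }
    return nil
}
```

Because the OS schedules the model, repeated measurements may vary more than they do for a custom Core ML model, so collect a distribution rather than a single reading.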
Work With Me
At 3NSOFTS, on-device AI integration targets sub-10ms latency for classification and feature extraction tasks, and under 150ms first-token for generative tasks — figures that hold without any network dependency.