Core ML Inference Performance in 2026: Benchmarks, Latency, and What the Numbers Mean
Most benchmark articles report a number without reporting the conditions. This guide covers what Core ML inference performance actually looks like across device classes, model types, and compute configurations — and what those numbers mean when you are making architecture decisions.
By Ehsan Azish · 3NSOFTS · May 2026
The measurement problem
Core ML inference latency is not a single number. It is a distribution shaped by four independent variables: the hardware, the compute unit targeted, the precision of the model weights, and the thermal state of the device at inference time.
Benchmark articles that report a single figure — “8ms inference” — are describing one point in that space. The number is not wrong, but it is incomplete. Understanding the full distribution is what separates a benchmark from an architecture input.
The constraint that shapes every production decision: you cannot control which device your user runs your app on, and you cannot control its thermal state. You can control compute unit selection, model precision, and when you schedule inference.
What Core ML actually runs on
Apple Silicon devices contain three distinct compute resources that Core ML can target: the CPU, the GPU, and the Neural Engine (ANE). Each has a different performance profile and a different power cost.
Neural Engine (ANE)
Purpose-built for matrix operations. On A17 Pro and M-series chips it delivers peak throughput for supported operations — typically sub-10ms for classification-scale models. It is also the most power-efficient path for sustained inference workloads. The MLComputeUnits default, .all, lets the runtime decide — the right default for most models.
GPU
Handles operations the ANE cannot execute and outperforms the CPU on parallelisable workloads. Its power draw is higher than the ANE for equivalent tasks, making it the second-best path for sustained workloads.
CPU
The fallback. Deterministic and always available, but the slowest path for neural network operations. Inference routed through the CPU signals that the model contains operations the ANE or GPU cannot handle — not a deliberate choice.
Latency benchmarks: 2026 reference points
The figures below reflect measurements on current-generation Apple Silicon. They are representative ranges, not guarantees — device condition, thermal state, and model specifics all shift the numbers.
Classification models
Image classification models converted to Core ML with INT8 weight quantization, running on A17 Pro or M3 hardware:
| Model | ANE Latency (p50) | ANE Latency (p95) | CPU Fallback (p50) |
|---|---|---|---|
| MobileNetV3-Small | 1–3ms | 4–6ms | 18–30ms |
| EfficientNet-B0 | 3–6ms | 8–12ms | 35–60ms |
| ResNet-50 | 8–14ms | 18–25ms | 80–130ms |
Text classification models (BERT-Mini, DistilBERT) with 4-bit quantization:
| Model | ANE Latency (p50) | ANE Latency (p95) |
|---|---|---|
| BERT-Mini (4-layer) | 4–8ms | 10–16ms |
| DistilBERT classification head | 12–20ms | 25–38ms |
Generative and language models
Autoregressive generation is not characterised by a single latency figure: first-token latency and sustained throughput (tokens per second) are separate measurements. For a 3B parameter model quantized to 4-bit:
A17 Pro & M3
- First-token: 80–150ms
- Throughput: 25–45 tokens/sec
- p95 under thermal pressure: 200–320ms
M3 Pro & M4
- First-token: 40–80ms
- Throughput: 55–90 tokens/sec
First-token latency under 100ms is imperceptible. Above 300ms, users perceive a pause.
Cloud API round-trip comparison
The on-device numbers above compare against cloud API round-trips of 200–800ms under normal network conditions — and that range assumes a reliable connection. On mobile networks, p95 latency for a cloud inference call frequently exceeds 1 second.
On-device inference is not faster in every scenario. For very large models — 70B+ parameters — cloud inference is still faster in absolute terms. But for models in the 3B–7B range with quantization, on-device inference on current Apple Silicon is competitive with cloud latency on a good connection, and strictly faster on a degraded one.
At 3NSOFTS, on-device AI integration targets sub-10ms latency for classification and feature extraction tasks, and under 150ms first-token for generative tasks — figures that hold without any network dependency. Full Core ML vs cloud API comparison →
The variables that move the numbers
Compute unit selection
The default .all configuration works well for models that were converted cleanly and contain only ANE-supported operations. When a model contains unsupported ops, Core ML silently routes those layers to the CPU — and total latency reflects the CPU bottleneck, not ANE throughput.
The diagnostic: run MLModel with .cpuAndNeuralEngine and compare against .all. If latency is similar, the model is already running on the ANE. If .cpuAndNeuralEngine is significantly slower, the GPU was doing meaningful work. If both are slow, the model has CPU-bound operations.
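A minimal sketch of that diagnostic, assuming a compiled model URL and a prepared MLFeatureProvider — both are placeholders here, not values from this article: load the model twice with different compute-unit configurations and time one prediction each.

```swift
import CoreML

// Diagnostic sketch: time one prediction under a given compute-unit configuration.
// `modelURL` points at a compiled .mlmodelc; `input` is any prepared MLFeatureProvider.
func measureLatency(modelURL: URL, units: MLComputeUnits,
                    input: MLFeatureProvider) throws -> TimeInterval {
    let config = MLModelConfiguration()
    config.computeUnits = units
    let model = try MLModel(contentsOf: modelURL, configuration: config)
    _ = try model.prediction(from: input)   // warm-up: the first call pays the load/specialisation cost
    let start = CFAbsoluteTimeGetCurrent()
    _ = try model.prediction(from: input)
    return CFAbsoluteTimeGetCurrent() - start
}

// Similar numbers: the model is already ANE-resident.
// .cpuAndNeuralEngine noticeably slower: the GPU was doing meaningful work under .all.
let allUnits = try measureLatency(modelURL: modelURL, units: .all, input: input)
let aneOnly  = try measureLatency(modelURL: modelURL, units: .cpuAndNeuralEngine, input: input)
print(String(format: "all: %.1f ms, cpuAndNeuralEngine: %.1f ms", allUnits * 1000, aneOnly * 1000))
```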
Model precision and quantization
FP32 weights are the starting point. FP16 halves the memory footprint with negligible accuracy loss for most tasks. INT8 halves it again. 4-bit quantization — now well-supported via Core ML Tools 8.x — reduces a 3B parameter model to approximately 1.5GB, which fits comfortably within the memory budget of an iPhone 15 Pro.
The latency improvement from FP32 to INT8 on the ANE is typically 30–50% for classification models. For generative models, quantization affects sustained throughput more than first-token latency. Core ML optimization techniques deep-dive →
Batch size and input shape
Core ML models are compiled for a fixed input shape by default. Sending a batch of 8 images through a model compiled for batch size 1 forces sequential inference — you get 8x the single-image latency, not a batched speedup.
If your use case involves batch inference, convert the model with a flexible or enumerated input shape — defined at conversion time in Core ML Tools — and pass an MLMultiArray with the appropriate batch dimension at prediction time. The ANE handles batched operations efficiently when the model is compiled to expect them.
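A sketch of the runtime side, assuming a model already loaded and converted with a flexible batch dimension; the input feature name and tensor shape are placeholders for your model's actual input description.

```swift
import CoreML

// Assumes the model was converted with a flexible (enumerated or range) batch dimension,
// and that `model` is an MLModel loaded earlier.
// "input" is a placeholder feature name — match it to your model's input description.
let batchSize = 8
let batch = try MLMultiArray(shape: [NSNumber(value: batchSize), 3, 224, 224], dataType: .float32)
// ... fill `batch` with preprocessed image data ...

let provider = try MLDictionaryFeatureProvider(
    dictionary: ["input": MLFeatureValue(multiArray: batch)]
)
let output = try model.prediction(from: provider)
```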
Thermal state and battery pressure
This is the variable most benchmarks ignore. Apple Silicon chips throttle aggressively under sustained thermal load. A model that runs in 8ms on a cold device may run in 18–25ms after 10 minutes of continuous inference.
ProcessInfo.thermalState surfaces the current thermal state. Production architectures that run inference in a loop — real-time classification, continuous audio processing — need to check thermal state and reduce inference frequency under pressure. This is not optional. It is the difference between an app that works in a demo and one that works in production.
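A minimal policy sketch — the intervals below are illustrative values, not tuned recommendations:

```swift
import Foundation

// Scale inference frequency to the current thermal state.
// Returns the delay before the next prediction, or nil to pause inference entirely.
func inferenceInterval() -> TimeInterval? {
    switch ProcessInfo.processInfo.thermalState {
    case .nominal:  return 0.1    // full rate
    case .fair:     return 0.25   // back off
    case .serious:  return 1.0    // heavily reduced frequency
    case .critical: return nil    // pause until the device recovers
    @unknown default: return 1.0
    }
}

// In a continuous-inference loop:
if let interval = inferenceInterval() {
    // schedule the next prediction after `interval`
} else {
    // skip inference until the thermal state recovers
}
```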
MLComputeUnits: what each option actually does
| Configuration | Behaviour | When to use |
|---|---|---|
| .all | Runtime selects CPU, GPU, and ANE as needed | Default; correct for most converted models |
| .cpuOnly | Forces CPU execution | Debugging; deterministic testing |
| .cpuAndGPU | Excludes ANE | Models with ANE-incompatible ops that run well on GPU |
| .cpuAndNeuralEngine | Excludes GPU | Power-sensitive workloads on ANE-compatible models |
.cpuAndNeuralEngine is underused. For models that are fully ANE-compatible, it produces the best combination of latency and power draw — the GPU is not loaded, and the ANE handles the full inference path.
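In code, the choice is one property on MLModelConfiguration. The model class name below is a placeholder for whatever Xcode generates from your .mlmodel or .mlpackage.

```swift
import CoreML

// Power-sensitive configuration for a model known to be fully ANE-compatible.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine   // keep the GPU out of the execution path

// "SentimentClassifier" is a placeholder for your Xcode-generated model class.
let model = try SentimentClassifier(configuration: config)
```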
Measuring inference in production
Xcode Instruments has a Core ML template. Use it. It shows per-layer execution time, compute unit routing, and memory allocation — the three things you need to diagnose a latency problem.
For production monitoring, the minimal measurement pattern is:
```swift
let start = CFAbsoluteTimeGetCurrent()
let output = try model.prediction(input: input)
let latency = CFAbsoluteTimeGetCurrent() - start
// Log latency alongside ProcessInfo.thermalState
```
Log thermal state alongside latency. Without it, you cannot distinguish a model performance regression from a thermal throttling event.
The p95 figure is what matters for user experience, not the mean. A model with a 5ms mean and a 200ms p95 will produce visible stutters. Measure the distribution, not the average.
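A simple way to keep that distribution visible in production is a rolling window of recent latencies with a p95 readout — a sketch, not a full metrics pipeline; the window size is an arbitrary illustrative choice.

```swift
// Track the distribution, not the mean: a rolling window of recent latencies
// with a p95 readout computed from the sorted samples.
struct LatencyWindow {
    private var samples: [Double] = []
    private let capacity = 500   // illustrative window size

    mutating func record(_ latencySeconds: Double) {
        samples.append(latencySeconds)
        if samples.count > capacity { samples.removeFirst() }
    }

    var p95: Double? {
        guard !samples.isEmpty else { return nil }
        let sorted = samples.sorted()
        let index = Int(Double(sorted.count - 1) * 0.95)
        return sorted[index]
    }
}
```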
Apple Foundation Models and the 2026 baseline
Apple Foundation Models, introduced with Apple Intelligence, run entirely on-device via the ANE. They surface through the FoundationModels framework with a structured API — not as raw Core ML models.
Foundation Models latency (2026)
- Summarisation and short-form generation: 200–400ms on A17 Pro and later
- Structured output (classification, extraction): 80–150ms
- Cold-start model loading: 300–800ms on first use
Foundation Models tasks are fast enough for interactive use, but not fast enough for real-time inference in a tight loop. They belong in response to user actions, not in a 60fps render loop.
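For orientation, a minimal sketch of a Foundation Models call, assuming the LanguageModelSession API surface of the FoundationModels framework — verify the exact names against the current documentation before relying on them:

```swift
import FoundationModels

// Minimal sketch: one session, one short-form generation request.
// Assumes the LanguageModelSession / respond(to:) API; check current docs for exact signatures.
let session = LanguageModelSession()

func summarise(_ text: String) async throws -> String {
    let response = try await session.respond(to: "Summarise in one sentence: \(text)")
    return response.content
}
```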
What the numbers mean for architecture decisions
The latency figures above are inputs to architecture decisions, not outputs. The question is not “how fast is Core ML?” — it is “which inference path fits the latency budget of this specific feature?”
Classification and feature extraction
Sub-10ms on the ANE with a properly quantized model. Fast enough to run on every user interaction, every frame, or every audio buffer. No scheduling required.
Generative tasks under 500 tokens
200–400ms end-to-end on A17 Pro and later. Fast enough for conversational UI. Schedule on a background actor to keep the main thread free. Swift concurrency patterns for AI workloads →
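A sketch of that pattern — the model class and its input/output types are placeholders for your generated Core ML interfaces, not a specific API from this article:

```swift
import CoreML

// Actor isolation keeps prediction work off the main thread.
// "TextGenerator" and its Input/Output types stand in for your Xcode-generated model classes.
actor GenerationService {
    private let model: TextGenerator

    init() throws {
        let config = MLModelConfiguration()
        config.computeUnits = .all
        model = try TextGenerator(configuration: config)
    }

    func generate(from input: TextGeneratorInput) throws -> TextGeneratorOutput {
        try model.prediction(input: input)
    }
}
```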
Sustained inference workloads
Thermal state management is non-negotiable. Check ProcessInfo.thermalState before each inference call in a loop. Reduce frequency at .fair and pause at .critical. Ignoring thermal state is the most common cause of performance regressions in production AI features.
FAQs
What is typical Core ML inference latency on an iPhone 15 Pro in 2026?
For classification-scale models (MobileNetV3, EfficientNet-B0) running on the ANE with INT8 quantization, p50 latency is 1–6ms. For generative models at 3B parameters with 4-bit quantization, first-token latency is 80–150ms and sustained throughput is 25–45 tokens per second. Both figures assume the device is not under thermal pressure.
How does Core ML inference compare to cloud API latency?
Cloud API round-trips for equivalent inference tasks run 200–800ms under normal network conditions. On-device Core ML inference for classification tasks is 10–50x faster in absolute terms. For generative tasks, on-device and cloud latency are comparable on a good connection; on-device is strictly faster when connectivity is degraded or absent.
What is the fastest compute unit configuration for Core ML?
.all is the correct default. For models that are fully ANE-compatible, .cpuAndNeuralEngine often produces the best combination of latency and power efficiency by excluding the GPU from the execution path. Use Instruments to verify which compute units are actually being used before optimising.
Does quantization affect Core ML inference accuracy?
For classification and feature extraction tasks, INT8 quantization produces accuracy within 1–2% of FP32 on standard benchmarks. 4-bit quantization introduces more variance — acceptable for most generative tasks, but requires validation on your specific dataset. Core ML Tools 8.x includes post-training quantization with calibration dataset support.
How does thermal throttling affect Core ML performance?
Sustained inference workloads cause thermal throttling on all Apple Silicon devices. A model that runs in 8ms on a cold device may run in 18–25ms after 10 minutes of continuous use. Production architectures need to check ProcessInfo.thermalState and reduce inference frequency at .fair state. Ignoring thermal state is the most common cause of performance regressions in production AI features.
Can Core ML run models while the app is in the background?
Background inference is constrained by iOS background execution limits. Short tasks complete if the app has background processing entitlements. Sustained background inference requires BGProcessingTask with the appropriate entitlement. The ANE remains available in background execution, but thermal and battery constraints apply with greater force.
What model size fits comfortably on an iPhone for on-device inference?
A 3B parameter model at 4-bit quantization occupies approximately 1.5GB. An iPhone 15 Pro has 8GB RAM, making 3B models practical for foreground inference. Models above 7B parameters at 4-bit (approximately 3.5GB) are feasible on Pro hardware but leave limited headroom for the rest of the app. The practical ceiling for most production apps is 3–4B parameters.
Related articles
On-Device AI for Apple Platforms: The Complete Guide
Core ML, Foundation Models, and MLX — the full stack for on-device inference on iOS, macOS, and visionOS.
Core ML Optimization Techniques
Quantization, pruning, palettization, and Neural Engine targeting. Real numbers from production models.
Swift Concurrency for AI Workloads
Actor-isolated inference, AsyncStream for streaming output, and keeping the main thread free during prediction.
SwiftUI + Core ML Architecture Patterns
Service layer isolation, persisting inference results, and the patterns that keep ML features testable and composable.
Optimising Core ML performance in your app?
We've shipped on-device AI in production apps targeting classification, generative, and continuous inference scenarios. Talk to us about your performance requirements.