
Core ML Inference Performance in 2026: Benchmarks, Latency, and What the Numbers Mean

Most benchmark articles report a number without reporting the conditions. This guide covers what Core ML inference performance actually looks like across device classes, model types, and compute configurations — and what those numbers mean when you are making architecture decisions.

By Ehsan Azish · 3NSOFTS · May 2026

The measurement problem

Core ML inference latency is not a single number. It is a distribution shaped by four independent variables: the hardware, the compute unit targeted, the precision of the model weights, and the thermal state of the device at inference time.

Benchmark articles that report a single figure — “8ms inference” — are describing one point in that space. The number is not wrong, but it is incomplete. Understanding the full distribution is what separates a benchmark from an architecture input.

The constraint that shapes every production decision: you cannot control which device your user runs, and you cannot control thermal state. You can control compute unit selection, model precision, and when you schedule inference.

What Core ML actually runs on

Apple Silicon devices contain three distinct compute resources that Core ML can target: the CPU, the GPU, and the Neural Engine (ANE). Each has a different performance profile and a different power cost.

Neural Engine (ANE)

Purpose-built for matrix operations. On A17 Pro and M-series chips it delivers peak throughput for supported operations — typically sub-10ms for classification-scale models. It is also the most power-efficient path for sustained inference workloads. The MLComputeUnits default, .all, lets the runtime decide — the right default for most models.

GPU

Handles operations the ANE cannot execute and outperforms the CPU on parallelisable workloads. Its power draw is higher than the ANE for equivalent tasks, making it the second-best path for sustained workloads.

CPU

The fallback. Deterministic and always available, but the slowest path for neural network operations. Inference routed through the CPU signals that the model contains operations the ANE or GPU cannot handle — not a deliberate choice.

Latency benchmarks: 2026 reference points

The figures below reflect measurements on current-generation Apple Silicon. They are representative ranges, not guarantees — device condition, thermal state, and model specifics all shift the numbers.

Classification models

Image classification models converted to Core ML with INT8 weight quantization, running on A17 Pro or M3 hardware:

| Model | ANE Latency (p50) | ANE Latency (p95) | CPU Fallback (p50) |
| --- | --- | --- | --- |
| MobileNetV3-Small | 1–3ms | 4–6ms | 18–30ms |
| EfficientNet-B0 | 3–6ms | 8–12ms | 35–60ms |
| ResNet-50 | 8–14ms | 18–25ms | 80–130ms |

Text classification models (BERT-Mini, DistilBERT) with 4-bit quantization:

| Model | ANE Latency (p50) | ANE Latency (p95) |
| --- | --- | --- |
| BERT-Mini (4-layer) | 4–8ms | 10–16ms |
| DistilBERT classification head | 12–20ms | 25–38ms |

Generative and language models

Autoregressive generation is characterised by two separate figures, not a single latency number: first-token latency and sustained throughput in tokens per second. For a 3B parameter model quantized to 4-bit:

A17 Pro & M3

  • First-token: 80–150ms
  • Throughput: 25–45 tokens/sec
  • p95 under thermal pressure: 200–320ms

M3 Pro & M4

  • First-token: 40–80ms
  • Throughput: 55–90 tokens/sec

First-token latency under 100ms is imperceptible. Above 300ms, users perceive a pause.
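Those two figures combine into a rough end-to-end estimate. A minimal sketch — the helper name is ours, the arithmetic is the standard first-token-plus-throughput approximation:

```swift
// Rough wall-clock estimate for generating `tokens` tokens:
// first-token latency plus the remaining tokens at sustained throughput.
func estimatedGenerationSeconds(tokens: Int,
                                firstTokenMs: Double,
                                tokensPerSec: Double) -> Double {
    guard tokens > 0 else { return 0 }
    return firstTokenMs / 1000 + Double(tokens - 1) / tokensPerSec
}
```

At the A17 Pro midpoints above (roughly 115ms first token, 35 tokens/sec), a 200-token reply takes close to six seconds end to end — worth knowing before promising "instant" summaries in the UI.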

Cloud API round-trip comparison

For comparison, cloud API round-trips for equivalent inference tasks run 200–800ms under normal network conditions — and that range assumes a reliable connection. On mobile networks, p95 latency for a cloud inference call frequently exceeds 1 second.

On-device inference is not faster in every scenario. For very large models — 70B+ parameters — cloud inference is still faster in absolute terms. But for models in the 3B–7B range with quantization, on-device inference on current Apple Silicon is competitive with cloud latency on a good connection, and strictly faster on a degraded one.

At 3NSOFTS, on-device AI integration targets sub-10ms latency for classification and feature extraction tasks, and under 150ms first-token for generative tasks — figures that hold without any network dependency. Full Core ML vs cloud API comparison →

The variables that move the numbers

Compute unit selection

The default .all configuration works well for models that were converted cleanly and contain only ANE-supported operations. When a model contains unsupported ops, Core ML silently routes those layers to the CPU — and total latency reflects the CPU bottleneck, not ANE throughput.

The diagnostic: load the model with .cpuAndNeuralEngine and compare latency against .all. If the two are similar, the model was already running on the ANE. If .cpuAndNeuralEngine is significantly slower, the GPU was doing meaningful work under .all. If both are slow, the model contains CPU-bound operations.
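One way to run that comparison is a small timing harness — this helper is our own, not a Core ML API. Load the model once per MLComputeUnits configuration outside the closure, so the measurement covers inference rather than model loading:

```swift
import Foundation

// Median wall-clock latency in milliseconds over `runs` invocations of a
// closure. Pass the model's prediction call as the closure, once per
// compute-unit configuration, and compare the medians.
func medianLatencyMs(runs: Int, _ body: () throws -> Void) rethrows -> Double {
    precondition(runs > 0, "need at least one run")
    var samples: [Double] = []
    samples.reserveCapacity(runs)
    for _ in 0..<runs {
        let start = Date()
        try body()
        samples.append(Date().timeIntervalSince(start) * 1000)
    }
    return samples.sorted()[runs / 2]  // median of the collected samples
}
```

Run a handful of warm-up predictions before measuring; the first call after loading includes one-time compilation and caching work that would skew the median.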

Model precision and quantization

FP32 weights are the starting point. FP16 halves the memory footprint with negligible accuracy loss for most tasks. INT8 halves it again. 4-bit quantization — now well-supported via Core ML Tools 8.x — reduces a 3B parameter model to approximately 1.5GB, which fits comfortably within the memory budget of an iPhone 15 Pro.
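The 1.5GB figure is simple arithmetic — weights only, ignoring activations, any KV cache, and runtime overhead. A sketch of the calculation (the helper is ours):

```swift
// Approximate weight footprint in GB: parameter count × bits per weight / 8.
// Covers weights only — activations and KV cache come on top.
func weightFootprintGB(parameters: Double, bitsPerWeight: Double) -> Double {
    parameters * bitsPerWeight / 8 / 1_000_000_000
}
```

weightFootprintGB(parameters: 3e9, bitsPerWeight: 4) returns 1.5 — the figure above. The same model at FP16 needs roughly 6GB, which is why quantization is the enabling step for on-device generative models.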

The latency improvement from FP32 to INT8 on the ANE is typically 30–50% for classification models. For generative models, quantization affects sustained throughput more than first-token latency. Core ML optimization techniques deep-dive →

Batch size and input shape

Core ML models are compiled for a fixed input shape by default. Sending a batch of 8 images through a model compiled for batch size 1 forces sequential inference — you get 8x the single-image latency, not a batched speedup.

If your use case involves batch inference, declare a flexible or enumerated input shape when converting the model — Core ML Tools supports both — and supply an MLMultiArray with the appropriate batch dimension at runtime. (MLModelConfiguration does not control input shapes; shape flexibility is fixed at conversion time.) The ANE handles batched operations efficiently when the model is compiled to expect them.

Thermal state and battery pressure

This is the variable most benchmarks ignore. Apple Silicon chips throttle aggressively under sustained thermal load. A model that runs in 8ms on a cold device may run in 18–25ms after 10 minutes of continuous inference.

ProcessInfo.processInfo.thermalState surfaces the current thermal state. Production architectures that run inference in a loop — real-time classification, continuous audio processing — need to check it and reduce inference frequency under pressure. This is not optional: it is the difference between an app that works in a demo and one that works in production.
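One shape this can take — a back-off policy that stretches the interval between inference calls as the thermal state degrades. The multipliers are our own illustrative choice, not Apple guidance:

```swift
import Foundation

// Map thermal state to an inference interval; nil means "pause inference".
// Multipliers are illustrative — tune them against your own workload.
func inferenceInterval(base: TimeInterval,
                       state: ProcessInfo.ThermalState) -> TimeInterval? {
    switch state {
    case .nominal:  return base          // full rate
    case .fair:     return base * 2      // halve the frequency
    case .serious:  return base * 4
    case .critical: return nil           // stop until the device cools
    @unknown default: return base * 4    // be conservative on future states
    }
}

// At the call site, read the live state before scheduling the next run:
// let interval = inferenceInterval(base: 0.1,
//                                  state: ProcessInfo.processInfo.thermalState)
```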

MLComputeUnits: what each option actually does

| Configuration | Behaviour | When to use |
| --- | --- | --- |
| .all | Runtime selects CPU, GPU, and ANE as needed | Default; correct for most converted models |
| .cpuOnly | Forces CPU execution | Debugging; deterministic testing |
| .cpuAndGPU | Excludes ANE | Models with ANE-incompatible ops that run well on GPU |
| .cpuAndNeuralEngine | Excludes GPU | Power-sensitive workloads on ANE-compatible models |

.cpuAndNeuralEngine is underused. For models that are fully ANE-compatible, it produces the best combination of latency and power draw — the GPU is not loaded, and the ANE handles the full inference path.
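In code, the selection is one property on MLModelConfiguration. A sketch, assuming a compiled model named Classifier.mlmodelc in the app bundle (the model name is hypothetical):

```swift
#if canImport(CoreML)
import CoreML

// Pin a fully ANE-compatible model to CPU + Neural Engine,
// keeping the GPU out of the execution path.
func loadClassifier() throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = .cpuAndNeuralEngine

    guard let url = Bundle.main.url(forResource: "Classifier",
                                    withExtension: "mlmodelc") else {
        fatalError("Classifier.mlmodelc missing from bundle")
    }
    return try MLModel(contentsOf: url, configuration: config)
}
#endif
```

Verify the routing in Instruments afterwards: .cpuAndNeuralEngine only excludes the GPU — layers the ANE cannot execute still fall back to the CPU.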

Measuring inference in production

Xcode Instruments has a Core ML template. Use it. It shows per-layer execution time, compute unit routing, and memory allocation — the three things you need to diagnose a latency problem.

For production monitoring, the minimal measurement pattern is:

import CoreML

let start = CFAbsoluteTimeGetCurrent()
let output = try model.prediction(input: input)
let latency = CFAbsoluteTimeGetCurrent() - start
// Log latency alongside ProcessInfo.processInfo.thermalState

Log thermal state alongside latency. Without it, you cannot distinguish a model performance regression from a thermal throttling event.

The p95 figure is what matters for user experience, not the mean. A model with a 5ms mean and a 200ms p95 will produce visible stutters. Measure the distribution, not the average.
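Computing that distribution from logged samples is a few lines. A sketch using the nearest-rank method (the interpolation choice is ours):

```swift
// Nearest-rank percentile over collected latency samples.
// p is a fraction: 0.5 for the median, 0.95 for p95.
func percentile(_ p: Double, of samples: [Double]) -> Double? {
    guard !samples.isEmpty, (0...1).contains(p) else { return nil }
    let sorted = samples.sorted()
    let rank = Int((Double(sorted.count - 1) * p).rounded())
    return sorted[rank]
}
```

Track p50 and p95 side by side: a stable p50 with a climbing p95 usually points at thermal throttling or contention, not a model regression.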

Apple Foundation Models and the 2026 baseline

Apple Foundation Models, introduced with Apple Intelligence, run entirely on-device via the ANE. They surface through the FoundationModels framework with a structured API — not as raw Core ML models.

Foundation Models latency (2026)

  • Summarisation and short-form generation: 200–400ms on A17 Pro and later
  • Structured output (classification, extraction): 80–150ms
  • Cold-start model loading: 300–800ms on first use

Foundation Models tasks are fast enough for interactive use, but not fast enough for real-time inference in a tight loop. They belong in response to user actions, not in a 60fps render loop.

Foundation Models vs Core ML: when to use each →

What the numbers mean for architecture decisions

The latency figures above are inputs to architecture decisions, not outputs. The question is not “how fast is Core ML?” — it is “which inference path fits the latency budget of this specific feature?”

Classification and feature extraction

Sub-10ms on the ANE with a properly quantized model. Fast enough to run on every user interaction, every frame, or every audio buffer. No scheduling required.

Generative tasks under 500 tokens

200–400ms end-to-end on A17 Pro and later. Fast enough for conversational UI. Schedule on a background actor to keep the main thread free. Swift concurrency patterns for AI workloads →

Sustained inference workloads

Thermal state management is non-negotiable. Check ProcessInfo.thermalState before each inference call in a loop. Reduce frequency at .fair and pause at .critical. Ignoring thermal state is the most common cause of performance regressions in production AI features.
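The control flow for such a loop can be sketched as follows — `runInference` stands in for your model call, and the sleep durations are illustrative, not Apple guidance:

```swift
import Foundation

// Sustained-inference loop with thermal back-off. `shouldContinue`
// lets the caller stop the loop; `runInference` is the model call.
func inferenceLoop(runInference: () -> Void, shouldContinue: () -> Bool) {
    while shouldContinue() {
        switch ProcessInfo.processInfo.thermalState {
        case .critical:
            Thread.sleep(forTimeInterval: 1.0)   // pause, then re-check
            continue                              // skip this inference
        case .serious, .fair:
            Thread.sleep(forTimeInterval: 0.25)  // reduced cadence
        default:
            break                                 // .nominal: full rate
        }
        runInference()
    }
}
```

In an app you would more likely drive this from an async task than a blocking loop, but the shape is the same: re-read the thermal state before every inference, not once at startup.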

FAQs

What is typical Core ML inference latency on an iPhone 15 Pro in 2026?

For classification-scale models (MobileNetV3, EfficientNet-B0) running on the ANE with INT8 quantization, p50 latency is 1–6ms. For generative models at 3B parameters with 4-bit quantization, first-token latency is 80–150ms and sustained throughput is 25–45 tokens per second. Both figures assume the device is not under thermal pressure.

How does Core ML inference compare to cloud API latency?

Cloud API round-trips for equivalent inference tasks run 200–800ms under normal network conditions. On-device Core ML inference for classification tasks is 10–50x faster in absolute terms. For generative tasks, on-device and cloud latency are comparable on a good connection; on-device is strictly faster when connectivity is degraded or absent.

What is the fastest compute unit configuration for Core ML?

.all is the correct default. For models that are fully ANE-compatible, .cpuAndNeuralEngine often produces the best combination of latency and power efficiency by excluding the GPU from the execution path. Use Instruments to verify which compute units are actually being used before optimising.

Does quantization affect Core ML inference accuracy?

For classification and feature extraction tasks, INT8 quantization produces accuracy within 1–2% of FP32 on standard benchmarks. 4-bit quantization introduces more variance — acceptable for most generative tasks, but requires validation on your specific dataset. Core ML Tools 8.x includes post-training quantization with calibration dataset support.

How does thermal throttling affect Core ML performance?

Sustained inference workloads cause thermal throttling on all Apple Silicon devices. A model that runs in 8ms on a cold device may run in 18–25ms after 10 minutes of continuous use. Production architectures need to check ProcessInfo.thermalState and reduce inference frequency at .fair state. Ignoring thermal state is the most common cause of performance regressions in production AI features.

Can Core ML run models while the app is in the background?

Background inference is constrained by iOS background execution limits. Short tasks complete if the app has background processing entitlements. Sustained background inference requires BGProcessingTask with the appropriate entitlement. The ANE remains available in background execution, but thermal and battery constraints apply with greater force.

What model size fits comfortably on an iPhone for on-device inference?

A 3B parameter model at 4-bit quantization occupies approximately 1.5GB. An iPhone 15 Pro has 8GB RAM, making 3B models practical for foreground inference. Models above 7B parameters at 4-bit (approximately 3.5GB) are feasible on Pro hardware but leave limited headroom for the rest of the app. The practical ceiling for most production apps is 3–4B parameters.

Optimising Core ML performance in your app?

We've shipped on-device AI in production apps targeting classification, generative, and continuous inference scenarios. Talk to us about your performance requirements.