On-Device AI on Apple Platforms: Core ML, Apple Foundation Models, and the Neural Engine in 2026
The Apple Neural Engine changes the constraint for mobile AI. Inference that would take 400ms over a network API completes in under 10ms on-device. This guide covers the full stack: Core ML, Apple Foundation Models, battery-aware scheduling, privacy architecture, and when cloud AI is the wrong choice.
The structural shift toward on-device inference
Cloud inference has a latency floor that no amount of infrastructure optimization can eliminate. A round-trip to a remote model adds 200–800ms, requires an active network connection, and sends user data off the device. In consumer apps, that latency is noticeable. In low-connectivity environments, it is a hard failure mode.
The Apple Neural Engine changes the constraint. Modern Apple Silicon (A17 Pro, M4, and their successors) includes a dedicated neural processing unit capable of running trillions of operations per second at a fraction of the power draw of the CPU or GPU. A request that would take 400ms over a network API can complete in under 10ms on-device when the model is small and well quantized.
That is not a marginal improvement. It is a different category of experience.
The Apple Neural Engine: what it is and what it does
The Neural Engine is a dedicated hardware block inside Apple Silicon, separate from the CPU and GPU. Its job is matrix multiplication — the core operation behind neural network inference — and it does that job with far higher throughput and far lower energy consumption than either general-purpose compute unit.
On the A17 Pro, Apple rates the Neural Engine at up to 35 TOPS (trillion operations per second); the M4 is rated at 38 TOPS. These numbers matter because a model's compute cost is counted in operations: a larger, more capable model requires more operations per inference pass.
The constraint that shaped Apple's design: mobile devices cannot sustain the power draw of continuous GPU inference. The Neural Engine solves this by running the same workload at a fraction of the wattage. Battery-aware inference is not a workaround — it is what the hardware was built for.
Core ML: the inference layer
Core ML sits between your Swift code and the Neural Engine. You load a model, pass inputs, receive outputs — the framework handles dispatch to the appropriate compute unit (Neural Engine, GPU, or CPU) based on model type and device state.
Model formats and the .mlpackage container
Core ML models are distributed as .mlpackage bundles. The format encapsulates model weights, the compute graph, and metadata describing input/output types. Models trained in PyTorch or TensorFlow are converted using the coremltools Python library before being bundled into an app.
The .mlpackage format replaced the older .mlmodel format starting with Xcode 13. It supports model encryption, on-device personalization, and the multi-function model structure introduced in later Core ML versions.
Compute unit selection
When initializing an MLModel instance, you specify an MLModelConfiguration that controls compute unit selection:
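import CoreML

// modelURL: location of the compiled model (.mlmodelc), for example inside the app bundle
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine // exclude GPU for battery efficiency
let model = try MLModel(contentsOf: modelURL, configuration: config)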
.cpuAndNeuralEngine is a sound choice for most inference workloads on iPhone. Core ML's own default is .all, so this option has to be set explicitly, and it requires iOS 16 or later. It keeps the GPU available for rendering and routes neural operations to the dedicated hardware. .all adds the GPU into the dispatch pool, which suits models that benefit from GPU parallelism and is less appropriate for sustained background inference.
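The trade-off between the two options can be encoded as a small policy function. The following is a minimal sketch, not a recommended rule: the function name `makeConfiguration(sustainedBackgroundWork:)` and the policy itself are assumptions for illustration, while `MLModelConfiguration`, `MLComputeUnits`, and `ProcessInfo.isLowPowerModeEnabled` are existing system APIs.

```swift
import CoreML
import Foundation

/// Picks compute units for a model load.
/// The policy is hypothetical: long-running work or Low Power Mode excludes the GPU.
func makeConfiguration(sustainedBackgroundWork: Bool) -> MLModelConfiguration {
    let config = MLModelConfiguration()
    let lowPower = ProcessInfo.processInfo.isLowPowerModeEnabled
    // Keep the GPU out of the dispatch pool when the work is sustained
    // or the user has asked the system to save battery.
    config.computeUnits = (sustainedBackgroundWork || lowPower) ? .cpuAndNeuralEngine : .all
    return config
}
```

A short, user-facing prediction would pass `sustainedBackgroundWork: false`, while a batch job that pre-classifies content would pass `true`.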
Latency in practice
On A17 Pro and M-series chips, a well-quantized classification or NLP model runs in 5–15ms using .cpuAndNeuralEngine. Image segmentation models run in 20–60ms depending on resolution and model depth.
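Figures like these are worth reproducing on the target device rather than taking on trust. Below is a rough measurement sketch, assuming a synchronous prediction call; `medianLatencyMs` is a hypothetical helper, and the closure here runs a stand-in workload where an app would call something like `model.prediction(from:)`.

```swift
import Foundation

/// Runs `body` a few times to warm up, then returns the middle sample
/// (upper median) of the wall-clock latency in milliseconds.
func medianLatencyMs(runs: Int = 50, warmup: Int = 5, _ body: () throws -> Void) rethrows -> Double {
    precondition(runs > 0, "runs must be positive")
    // Warm-up calls absorb one-time costs such as model loading.
    for _ in 0..<warmup { try body() }

    var samples: [Double] = []
    samples.reserveCapacity(runs)
    for _ in 0..<runs {
        let start = DispatchTime.now().uptimeNanoseconds
        try body()
        let end = DispatchTime.now().uptimeNanoseconds
        samples.append(Double(end - start) / 1_000_000)
    }
    samples.sort()
    return samples[samples.count / 2]
}

// Stand-in workload; replace the closure body with a real prediction call.
let ms = medianLatencyMs(runs: 20) {
    _ = (0..<10_000).reduce(0, &+)
}
print("median latency: \(ms) ms")
```

The median is used instead of the mean so that a single slow run, for example one interrupted by the scheduler, does not skew the result.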
Model conversion uses the coremltools Python library. You load the trained model, define the input/output types, and call ct.convert() with the target deployment specification. The output is an .mlpackage bundle that Xcode imports directly. Quantization — reducing weight precision from float32 to float16 or int8 — happens at conversion time and significantly affects both model size and Neural Engine throughput. The conversion step is where most production issues originate: input shape mismatches, unsupported ops, and quantization accuracy loss all surface here before they surface in the app.
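To make the quantization accuracy loss concrete, here is a small sketch of symmetric per-tensor int8 quantization written in Swift. It illustrates the general idea only and is not how coremltools implements quantization; the function names and the sample weights are made up for the example.

```swift
import Foundation

/// Symmetric per-tensor linear quantization: weight ≈ scale * q, with q in -127...127.
func quantizeInt8(_ weights: [Float]) -> (values: [Int8], scale: Float) {
    let maxMagnitude = weights.map { abs($0) }.max() ?? 0
    guard maxMagnitude > 0 else {
        // All-zero (or empty) tensors need no scaling.
        return (Array(repeating: 0, count: weights.count), 1)
    }
    let scale = maxMagnitude / 127
    let values = weights.map { Int8(($0 / scale).rounded()) }
    return (values, scale)
}

/// Maps the stored integers back to approximate float weights.
func dequantize(_ values: [Int8], scale: Float) -> [Float] {
    values.map { Float($0) * scale }
}

// Illustrative weights, not taken from any real model.
let weights: [Float] = [0.82, -0.31, 0.004, -1.27, 0.5]
let (quantized, scale) = quantizeInt8(weights)
let restored = dequantize(quantized, scale: scale)
let maxError = zip(weights, restored).map { abs($0 - $1) }.max() ?? 0
print("scale: \(scale), max round-trip error: \(maxError)")
```

Each weight can move by up to half a quantization step (`scale / 2`), which is why accuracy should be validated on real inputs after conversion rather than assumed.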
Apple Foundation Models: structured generation on device
Apple Foundation Models exposes Apple Intelligence capabilities to third-party apps. It provides access to a small, on-device language model — not a general-purpose LLM, but a model optimized for structured generation, summarization, and classification within defined schemas.
What the framework exposes
The primary interface is LanguageModelSession. You create a session, define a prompt, and receive a structured response — either as a stream or as a complete value:
import FoundationModels

@Generable
struct UrgencyClassification {
    @Guide(description: "The urgency level of the message: high, medium, or low")
    var level: String
    @Guide(description: "Brief explanation of the urgency assessment")
    var rationale: String
}

// messageText: the message to classify, defined elsewhere in the app
let session = LanguageModelSession()
let response = try await session.respond(
    to: "Classify the urgency of this message: \(messageText)",
    generating: UrgencyClassification.self
)
print(response.content.level) // expected: "high", "medium", or "low"
The generating: parameter accepts a Generable-conforming type. The model produces output that conforms to the schema — not free text that you parse. That is the design premise: structured generation, not open-ended chat.
The constraint that shapes usage
Apple Foundation Models runs entirely on-device. It is not GPT-4. It is not designed for multi-step reasoning or long-context tasks. That constraint is real, and it shapes what you can build with it.
Where it performs well: classification, extraction, summarization of short text, structured data generation. Where it does not: complex reasoning chains, large-context synthesis, tasks requiring world knowledge beyond its training window.
Working within that constraint produces reliable, fast, private features. Working against it produces inconsistent results that erode user trust.
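Running on-device also means the model may not be usable on a given device at a given moment. The sketch below checks availability before creating a session; `makeSessionIfAvailable()` is a hypothetical helper name, and the logging is a placeholder for whatever fallback the app chooses.

```swift
import FoundationModels

/// Returns a session when the on-device model can run, or nil so the caller can fall back.
func makeSessionIfAvailable() -> LanguageModelSession? {
    switch SystemLanguageModel.default.availability {
    case .available:
        return LanguageModelSession()
    case .unavailable(let reason):
        // Typical reasons: the device is not eligible, Apple Intelligence is
        // turned off, or the model assets are not ready yet.
        print("On-device model unavailable: \(reason)")
        return nil
    }
}
```

Returning an optional keeps the decision at the call site: the feature can be hidden, or routed to a simpler heuristic, when no session is available.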
Battery-aware inference scheduling
The Neural Engine is efficient, but efficiency is relative — a model running every second for an hour still draws meaningful power. Continuous inference drains batteries.
The correct architecture separates inference triggers from inference execution. Inference runs in response to discrete events — a user action, a content update, a background fetch — not on a polling timer.
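One way to express that separation is a small actor that turns bursts of events into a single inference run. This is a sketch under assumptions: `InferenceTrigger`, its 300ms quiet period, and the `runInference` closure are all hypothetical, and `Task.sleep(for:)` requires iOS 16 or macOS 13.

```swift
import Foundation

/// Coalesces bursts of events into one inference run after a quiet period.
actor InferenceTrigger {
    private var pending: Task<Void, Never>?
    private let quietPeriod: Duration
    private let runInference: @Sendable () async -> Void

    init(quietPeriod: Duration = .milliseconds(300),
         runInference: @escaping @Sendable () async -> Void) {
        self.quietPeriod = quietPeriod
        self.runInference = runInference
    }

    /// Call from a user action, content update, or background fetch.
    func eventOccurred() {
        // A newer event supersedes the one that is still waiting.
        pending?.cancel()
        pending = Task { [quietPeriod, runInference] in
            try? await Task.sleep(for: quietPeriod)
            guard !Task.isCancelled else { return }
            await runInference()
        }
    }
}
```

Three keystrokes in quick succession would call `eventOccurred()` three times but run the model once, after the user pauses.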
For background inference, BGProcessingTaskRequest is the appropriate scheduling mechanism. It defers execution to periods when the device is charging and idle, which is exactly when battery cost is irrelevant.
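import BackgroundTasks

// The identifier must also be listed under BGTaskSchedulerPermittedIdentifiers in Info.plist
// and registered with BGTaskScheduler before the app finishes launching.
let request = BGProcessingTaskRequest(identifier: "com.app.inference")
request.requiresNetworkConnectivity = false
request.requiresExternalPower = true // defer to charging periods
try BGTaskScheduler.shared.submit(request)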
requiresExternalPower = true is not always the right call — it depends on how time-sensitive the inference is. For pre-computation tasks (generating embeddings, pre-classifying content), deferring to charging is the correct trade-off. For user-triggered inference, it is not.
Privacy as an architectural property
On-device inference is not a privacy feature — it is a privacy architecture. The distinction matters.
A privacy feature is something you add: an opt-out, an anonymization step, a data retention policy. These are valuable, but they exist within a system that still sends data somewhere.
A privacy architecture is structural. When inference runs on-device, user data cannot reach a server. Not because of a policy — because there is no network call in the inference path. The Neural Engine processes the data. The result surfaces in the app. Nothing leaves the device.
This has concrete implications:
- App Store privacy labels: the AI inference path adds no data collection to disclose
- Compliance: data that never transits a server stays out of the server-side transfer, storage, and retention obligations under GDPR, HIPAA, and CCPA, although those regimes can still govern what the app does with the data on the device
- Trust: users and enterprise customers can verify the privacy claim structurally, not contractually
For apps handling health data, financial data, personal communications, or any regulated category, on-device inference is not a differentiator — it is the minimum viable privacy posture.
When cloud AI is the wrong choice
Cloud AI is the wrong default for most iOS app features, not the wrong choice for all of them. The cases where cloud AI makes sense:
- Tasks that require frontier model capability — complex multi-step reasoning, broad world knowledge retrieval, tasks outside on-device model capabilities
- Low-frequency batch tasks where latency and cost are not concerns and data is not sensitive
- Tasks that require external data that cannot exist on-device
The cases where cloud AI is structurally wrong:
- Real-time interaction — autocomplete, live classification, gesture interpretation
- Sensitive data — health metrics, financial records, personal communications
- Offline-first apps — the feature must work without connectivity
- Cost-sensitive scale — 10,000+ daily active users running multiple inference calls per session
Defaulting to cloud APIs because they are faster to prototype is a decision that creates latency debt, cost scaling problems, and privacy obligations that are expensive to undo.
Building production AI features on this stack
The integration sequence that produces reliable production results:
- Define the task precisely: what input, what output, what accuracy requirement
- Evaluate Apple Foundation Models first: if the task fits within structured generation, use it
- If custom Core ML is needed: select model, convert with coremltools, quantize to Int8, validate accuracy
- Isolate behind an actor: inference never touches the main thread
- Handle availability: check SystemLanguageModel.default.availability for Foundation Models; handle model load failures for Core ML
- Design the fallback: what happens when inference is unavailable or slow?
- Test on physical hardware: Simulator numbers are not real numbers
- Audit the privacy surface: verify no data leaves the device through any path
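The actor-isolation and fallback steps in the sequence above can be sketched together. Everything named here is hypothetical: `TextClassifier` stands in for whichever model backs the feature, and the keyword-based fallback is only a placeholder for a real degraded path.

```swift
import Foundation

/// Hypothetical abstraction over whichever on-device model backs the feature.
protocol TextClassifier: Sendable {
    func classify(_ text: String) async throws -> String
}

/// Serializes access to the model and keeps callers off the main thread.
actor InferenceService {
    private let model: any TextClassifier
    private let fallback: @Sendable (String) -> String

    init(model: any TextClassifier, fallback: @escaping @Sendable (String) -> String) {
        self.model = model
        self.fallback = fallback
    }

    /// Never throws: any model failure degrades to the fallback result.
    func classify(_ text: String) async -> String {
        do {
            return try await model.classify(text)
        } catch {
            return fallback(text)
        }
    }
}

// Stand-in model that always fails, to exercise the fallback path.
struct UnavailableModel: TextClassifier {
    struct Unavailable: Error {}
    func classify(_ text: String) async throws -> String { throw Unavailable() }
}

let service = InferenceService(model: UnavailableModel()) { text in
    text.lowercased().contains("urgent") ? "high" : "low"
}
```

Because `classify(_:)` on the service does not throw, UI code can `await` it without its own error handling, and the decision about degraded behavior lives in one place.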
The stack is production-ready. The patterns are established. The failure modes are known and avoidable.