What is the Apple Neural Engine and what does it do?

The Neural Engine is a dedicated hardware block inside Apple Silicon, separate from the CPU and GPU. Its job is matrix multiplication — the core operation behind neural network inference — at far higher throughput and lower energy consumption than general-purpose compute. On A17 Pro and later chips, it delivers up to 35 TOPS (tera-operations per second).

What is Apple Foundation Models designed for?

Apple Foundation Models is designed for structured generation: classification, extraction, summarization of short text, and structured data generation using @Generable-conforming types. It is not designed for complex reasoning chains, large-context synthesis, or tasks requiring world knowledge beyond its training window.

How do I schedule background inference without draining the battery?

Use BGProcessingTaskRequest with requiresExternalPower = true for pre-computation tasks where timing is flexible. For foreground inference, check ProcessInfo.processInfo.thermalState and ProcessInfo.processInfo.isLowPowerModeEnabled before executing. Do not defer user-triggered inference: respond immediately, and throttle background or ambient inference instead.

Why is on-device inference considered a privacy architecture rather than a privacy feature?

A privacy feature is something added to a system. A privacy architecture is a structural property. When inference runs on-device, user data cannot reach a server — not because of a policy, but because there is no network call in the inference path. The guarantee is structural. No vendor agreement, server configuration, or policy change affects it.

On-Device AI on Apple Platforms: Core ML, Apple Foundation Models, and the Neural Engine in 2026

The Apple Neural Engine changes the constraint for mobile AI. Inference that would take 400ms over a network API completes in under 10ms on-device. This guide covers the full stack: Core ML, Apple Foundation Models, battery-aware scheduling, privacy architecture, and when cloud AI is the wrong choice.

By Ehsan Azish · 3NSOFTS · June 2026 · 14 min read

The structural shift toward on-device inference

Cloud inference has a latency floor that no amount of infrastructure optimization can eliminate. A round-trip to a remote model adds 200–800ms, requires an active network connection, and sends user data off the device. In consumer apps, that latency is noticeable. In low-connectivity environments, it is a hard failure mode.

The Apple Neural Engine changes the constraint. Modern Apple Silicon (A17 Pro, M4, and their successors) includes a dedicated neural processing unit capable of running trillions of operations per second at a fraction of the power draw of the CPU or GPU. Inference that would take 400ms over a network API completes in under 10ms on-device.

That is not a marginal improvement. It is a different category of experience.

The Apple Neural Engine: what it is and what it does

The Neural Engine is a dedicated hardware block inside Apple Silicon, separate from the CPU and GPU. Its job is matrix multiplication — the core operation behind neural network inference — and it does that job with far higher throughput and far lower energy consumption than either general-purpose compute unit.

On A17 Pro and later chips, the Neural Engine delivers up to 35 TOPS (tera-operations per second). The M4 raises that to 38 TOPS. These numbers matter because model complexity is measured in operations: a larger, more capable model requires more operations per inference pass.

The constraint that shaped Apple's design: mobile devices cannot sustain the power draw of continuous GPU inference. The Neural Engine solves this by running the same workload at a fraction of the wattage. Battery-aware inference is not a workaround — it is what the hardware was built for.

Core ML: the inference layer

Core ML sits between your Swift code and the Neural Engine. You load a model, pass inputs, receive outputs — the framework handles dispatch to the appropriate compute unit (Neural Engine, GPU, or CPU) based on model type and device state.

Model formats and the .mlpackage container

Core ML models are distributed as .mlpackage bundles. The format encapsulates model weights, the compute graph, and metadata describing input/output types. Models trained in PyTorch or TensorFlow are converted using the coremltools Python library before being bundled into an app.

The .mlpackage format replaced the older .mlmodel format starting with Xcode 13. It supports model encryption, on-device personalization, and the multi-function model structure introduced in later Core ML versions.

Compute unit selection

When initializing an MLModel instance, you specify an MLModelConfiguration that controls compute unit selection:

import CoreML

let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine // exclude GPU for battery efficiency

// modelURL points at the compiled model (.mlmodelc) inside the app bundle
let model = try MLModel(contentsOf: modelURL, configuration: config)

.cpuAndNeuralEngine is a sound choice for most inference workloads on iPhone, although the framework's own default is .all. It keeps the GPU available for rendering and routes neural operations to the dedicated hardware. .all adds the GPU into the dispatch pool, which is appropriate for models that benefit from GPU parallelism and less appropriate for sustained background inference.

Latency in practice

On A17 Pro and M-series chips, a well-quantized classification or NLP model runs in 5–15ms using .cpuAndNeuralEngine. Image segmentation models run in 20–60ms depending on resolution and model depth.
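Figures like these are worth confirming for your own model on a physical device. A minimal timing sketch, assuming a hypothetical helper named medianLatencyMs that wraps whatever call runs your prediction. The helper is not part of Core ML, and the closure you pass in is up to you:

```swift
/// Hypothetical helper: runs `work` repeatedly and returns the median
/// wall-clock latency in milliseconds. Warm-up runs are discarded because
/// the first calls are often slower than steady state.
@available(macOS 13.0, iOS 16.0, *)
func medianLatencyMs(warmup: Int = 3, runs: Int = 20, _ work: () -> Void) -> Double {
    precondition(runs > 0, "runs must be positive")
    for _ in 0..<warmup { work() }

    let clock = ContinuousClock()
    var samples: [Double] = []
    samples.reserveCapacity(runs)

    for _ in 0..<runs {
        let elapsed = clock.measure { work() }
        let parts = elapsed.components
        // seconds -> ms, attoseconds (1e-18 s) -> ms
        samples.append(Double(parts.seconds) * 1_000 + Double(parts.attoseconds) / 1e15)
    }

    // Middle sample after sorting (the upper median for even counts).
    return samples.sorted()[samples.count / 2]
}
```

In an app, the closure would wrap the prediction call, for example medianLatencyMs { _ = try? model.prediction(from: input) }, where model and input are your own objects. The median is used rather than the mean so that a single slow run does not dominate the result.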

Model conversion uses the coremltools Python library. You load the trained model, define the input/output types, and call ct.convert() with the target deployment specification. The output is an .mlpackage bundle that Xcode imports directly. Quantization — reducing weight precision from float32 to float16 or int8 — happens at conversion time and significantly affects both model size and Neural Engine throughput. The conversion step is where most production issues originate: input shape mismatches, unsupported ops, and quantization accuracy loss all surface here before they surface in the app.

Apple Foundation Models: structured generation on device

Apple Foundation Models exposes Apple Intelligence capabilities to third-party apps. It provides access to a small, on-device language model — not a general-purpose LLM, but a model optimized for structured generation, summarization, and classification within defined schemas.

What the framework exposes

The primary interface is LanguageModelSession. You create a session, define a prompt, and receive a structured response — either as a stream or as a complete value:

import FoundationModels

@Generable
struct UrgencyClassification {
    @Guide(description: "The urgency level of the message: high, medium, or low")
    var level: String
    @Guide(description: "Brief explanation of the urgency assessment")
    var rationale: String
}

let session = LanguageModelSession()
let response = try await session.respond(
    to: "Classify the urgency of this message: \(messageText)",
    generating: UrgencyClassification.self
)
print(response.content.level) // expected: "high", "medium", or "low"

The generating: parameter accepts a Generable-conforming type. The model produces output that conforms to the schema — not free text that you parse. That is the design premise: structured generation, not open-ended chat.
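A String field leaves the exact wording of the value to the model. If the app needs a closed set of values, one option is to model the field as a Generable enum. The sketch below is a variant of the example above, not the author's code. The names UrgencyLevel, TypedUrgencyClassification, and classifyUrgency are made up for illustration, and it assumes a deployment target where FoundationModels is available:

```swift
import FoundationModels

// Hypothetical closed set of urgency levels. Because the schema is an enum,
// the generated value is always one of these three cases.
@Generable
enum UrgencyLevel {
    case high, medium, low
}

// Hypothetical variant of UrgencyClassification with a typed level.
@Generable
struct TypedUrgencyClassification {
    @Guide(description: "The urgency level of the message")
    var level: UrgencyLevel
    @Guide(description: "Brief explanation of the urgency assessment")
    var rationale: String
}

// Hypothetical wrapper around the same respond(to:generating:) call.
func classifyUrgency(of messageText: String) async throws -> TypedUrgencyClassification {
    let session = LanguageModelSession()
    let response = try await session.respond(
        to: "Classify the urgency of this message: \(messageText)",
        generating: TypedUrgencyClassification.self
    )
    return response.content
}
```

Call sites can then switch over level exhaustively instead of comparing strings.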

The constraint that shapes usage

Apple Foundation Models runs entirely on-device. It is not GPT-4. It is not designed for multi-step reasoning or long-context tasks. That constraint is real, and it shapes what you can build with it.

Where it performs well: classification, extraction, summarization of short text, structured data generation. Where it does not: complex reasoning chains, large-context synthesis, tasks requiring world knowledge beyond its training window.

Working within that constraint produces reliable, fast, private features. Working against it produces inconsistent results that erode user trust.
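Running entirely on-device also means the model may not be usable on a given device at a given moment, so it helps to check before building a session. A minimal sketch using SystemLanguageModel.default.availability. The GenerationPath type and currentGenerationPath function are hypothetical names, and the code assumes a deployment target where FoundationModels is available:

```swift
import FoundationModels

/// Hypothetical app-level type: which path a feature should take right now.
enum GenerationPath {
    case onDevice
    case fallback(reason: String)
}

/// Maps the framework's availability report onto the app's own decision.
func currentGenerationPath() -> GenerationPath {
    switch SystemLanguageModel.default.availability {
    case .available:
        return .onDevice
    case .unavailable(let reason):
        // For example: the device is not eligible, Apple Intelligence is
        // turned off, or the model assets are not ready yet.
        return .fallback(reason: String(describing: reason))
    }
}
```

Checking this once at feature entry, rather than catching errors from respond, lets the UI hide or replace the feature before the user reaches it.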

Battery-aware inference scheduling

The Neural Engine is efficient, but efficiency is relative — a model running every second for an hour still draws meaningful power. Continuous inference drains batteries.

The correct architecture separates inference triggers from inference execution. Inference runs in response to discrete events — a user action, a content update, a background fetch — not on a polling timer.
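One way to keep event-driven triggers from turning into de facto polling is a small gate between the event and the model call. The sketch below assumes a hypothetical InferenceThrottle type with an arbitrary minimum interval. User-triggered events always pass, while ambient ones are rate-limited:

```swift
import Foundation

/// Hypothetical gate for ambient (non-user-triggered) inference. Events
/// arrive whenever they arrive; the gate only decides whether enough time
/// has passed since the last run to justify another one.
struct InferenceThrottle {
    let minimumInterval: TimeInterval
    private(set) var lastRun: Date?

    init(minimumInterval: TimeInterval) {
        self.minimumInterval = minimumInterval
    }

    /// Returns true and records the run if the event should trigger
    /// inference. User-triggered events are never deferred.
    mutating func shouldRun(userTriggered: Bool, at now: Date = Date()) -> Bool {
        if !userTriggered, let last = lastRun, now.timeIntervalSince(last) < minimumInterval {
            return false
        }
        lastRun = now
        return true
    }
}
```

A content-update handler would call shouldRun(userTriggered: false) and skip the model call when it returns false. The right interval depends on the feature and is not something the framework prescribes.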

For background inference, BGProcessingTaskRequest is the appropriate scheduling mechanism. With requiresExternalPower set, the system defers execution to periods when the device is charging and idle, which is exactly when battery cost is irrelevant.

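import BackgroundTasks

// The identifier must also be listed under BGTaskSchedulerPermittedIdentifiers in Info.plist
let request = BGProcessingTaskRequest(identifier: "com.app.inference")
request.requiresNetworkConnectivity = false
request.requiresExternalPower = true // defer to charging periods
try BGTaskScheduler.shared.submit(request)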

requiresExternalPower = true is not always the right call — it depends on how time-sensitive the inference is. For pre-computation tasks (generating embeddings, pre-classifying content), deferring to charging is the correct trade-off. For user-triggered inference, it is not.
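For inference that cannot wait for a charger, the device's thermal state and Low Power Mode are the signals to check before running. The sketch below is one possible policy, not a prescribed one. ThermalLevel, shouldRunInference, and currentThermalLevel are hypothetical names, and the cutoff at serious is an assumption:

```swift
import Foundation

/// Hypothetical, platform-neutral mirror of ProcessInfo.ThermalState.
enum ThermalLevel: Int, Comparable {
    case nominal, fair, serious, critical

    static func < (lhs: ThermalLevel, rhs: ThermalLevel) -> Bool {
        lhs.rawValue < rhs.rawValue
    }
}

/// Hypothetical policy: user-triggered work always runs; ambient work runs
/// only when the device is below `serious` and not in Low Power Mode.
func shouldRunInference(userTriggered: Bool, thermal: ThermalLevel, lowPowerMode: Bool) -> Bool {
    if userTriggered { return true }
    return thermal < .serious && !lowPowerMode
}

#if canImport(Darwin)
/// Reads the live thermal state on Apple platforms.
func currentThermalLevel() -> ThermalLevel {
    switch ProcessInfo.processInfo.thermalState {
    case .nominal: return .nominal
    case .fair: return .fair
    case .serious: return .serious
    case .critical: return .critical
    @unknown default: return .critical // treat unknown states as the worst case
    }
}
#endif
```

On device, the remaining input comes from ProcessInfo.processInfo.isLowPowerModeEnabled. Keeping the policy a pure function makes it easy to unit test without heating up a phone.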

Privacy as an architectural property

On-device inference is not a privacy feature — it is a privacy architecture. The distinction matters.

A privacy feature is something you add: an opt-out, an anonymization step, a data retention policy. These are valuable, but they exist within a system that still sends data somewhere.

A privacy architecture is structural. When inference runs on-device, user data cannot reach a server. Not because of a policy — because there is no network call in the inference path. The Neural Engine processes the data. The result surfaces in the app. Nothing leaves the device.

This has concrete implications:

App Store privacy labels — the AI inference path adds no data collection to disclose
Compliance — GDPR, HIPAA, and CCPA obligations tied to transmitting or storing data on a server are far narrower when inference data never leaves the device
Trust — users and enterprise customers can verify the privacy claim structurally, not contractually

For apps handling health data, financial data, personal communications, or any regulated category, on-device inference is not a differentiator — it is the minimum viable privacy posture.

When cloud AI is the wrong choice

Cloud AI is the wrong default for most iOS app features, not the wrong choice for all of them. The cases where cloud AI makes sense:

Tasks that require frontier model capability — complex multi-step reasoning, broad world knowledge retrieval, tasks outside on-device model capabilities
Low-frequency batch tasks where latency and cost are not concerns and data is not sensitive
Tasks that require external data that cannot exist on-device

The cases where cloud AI is structurally wrong:

Real-time interaction — autocomplete, live classification, gesture interpretation
Sensitive data — health metrics, financial records, personal communications
Offline-first apps — the feature must work without connectivity
Cost-sensitive scale — 10,000+ daily active users running multiple inference calls per session

Defaulting to cloud APIs because they are faster to prototype is a decision that creates latency debt, cost scaling problems, and privacy obligations that are expensive to undo.
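The cost-scaling point is easy to make concrete with back-of-envelope arithmetic. In the sketch below every input is a placeholder. The 10,000 daily active users figure is the threshold named above, while the calls per day and the price per call are purely illustrative assumptions, not quotes from any provider:

```swift
/// Hypothetical estimate of the monthly bill for a per-call cloud
/// inference API. Replace every input with your own numbers.
func monthlyCloudCost(
    dailyActiveUsers: Int,
    callsPerUserPerDay: Int,
    costPerCallUSD: Double,
    daysPerMonth: Int = 30
) -> Double {
    let callsPerMonth = dailyActiveUsers * callsPerUserPerDay * daysPerMonth
    return Double(callsPerMonth) * costPerCallUSD
}

// Assumed inputs: 10,000 DAU, 5 calls per user per day, $0.002 per call.
let estimate = monthlyCloudCost(
    dailyActiveUsers: 10_000,
    callsPerUserPerDay: 5,
    costPerCallUSD: 0.002
)
print(estimate)
```

Under these assumed inputs that is 1.5 million calls and roughly $3,000 a month, and the figure grows linearly with users and calls. The marginal cost of the same calls on-device is zero.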

Building production AI features on this stack

The integration sequence that produces reliable production results:

Define the task precisely — what input, what output, what accuracy requirement
Evaluate Apple Foundation Models first — if the task fits within structured generation, use it
If custom Core ML is needed — select model, convert with coremltools, quantize to Int8, validate accuracy
Isolate behind an actor — inference never touches the main thread
Handle availability — check SystemLanguageModel.default.availability for Foundation Models; handle model load failures for Core ML
Design the fallback — what happens when inference is unavailable or slow?
Test on physical hardware — Simulator numbers are not real numbers
Audit the privacy surface — verify no data leaves the device through any path
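The actor isolation and fallback steps above can be sketched together. This is an illustration under stated assumptions, not a definitive implementation. InferenceService is a hypothetical type, primary stands in for a Core ML or Foundation Models call, and fallback stands in for a cheap rule-based path:

```swift
/// Hypothetical wrapper that keeps inference off the main actor and
/// degrades to a fallback when the primary path throws.
actor InferenceService {
    typealias Classifier = @Sendable (String) async throws -> String

    private let primary: Classifier
    private let fallback: @Sendable (String) -> String

    init(primary: @escaping Classifier, fallback: @escaping @Sendable (String) -> String) {
        self.primary = primary
        self.fallback = fallback
    }

    /// Callers await the result instead of blocking. A failure in the
    /// primary path (model missing, load error) yields the fallback value.
    func classify(_ text: String) async -> String {
        do {
            return try await primary(text)
        } catch {
            return fallback(text)
        }
    }
}
```

A view model would hold one InferenceService and call await service.classify(text) from a Task. Handling the slow case would additionally need a timeout, which is left out to keep the sketch short.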

The stack is production-ready. The patterns are established. The failure modes are known and avoidable.
