How to Integrate On-Device AI into Your iOS App Without Sending Data to the Cloud
A step-by-step production guide using Core ML and Apple Foundation Models. The privacy guarantee is architectural — because inference runs locally, there is no network call to intercept and no data to transmit.
Why on-device inference is private by design
A cloud AI API call sends your user’s data to a third-party server to be processed. On-device inference sends nothing — the model runs on the device’s Apple Neural Engine using data that never leaves RAM. This is not a privacy policy or a configuration option. It is the physical architecture.
- Implementation: 6 steps
- Core ML latency: 2–15 ms
- Cloud transmissions: zero
- Minimum deployment: iOS 16+
By Ehsan Azish · 3NSOFTS · May 2026
Why Cloud AI Breaks the Privacy Boundary
Cloud AI APIs (OpenAI, Anthropic, Google, and similar) require your app to send user data over the network to a third-party server for processing. The server processes the data, runs inference, and returns a result. Every step involves data transmission:
| Stage | Cloud AI API | On-Device (Core ML) |
|---|---|---|
| Input data | Sent over HTTPS to third-party server | Stays in device RAM |
| Inference | Runs on cloud GPU/TPU cluster | Runs on Apple Neural Engine |
| Output | Returned over HTTPS, logged server-side | Returned in-process, never transmitted |
| User data exposure | Third-party processor receives and may log | No third party involved |
| Privacy label | Must declare data types sent to provider | No AI-related third-party data to declare |
| Offline support | Fails without network | Full functionality offline |
| Inference cost | Per-token or per-request billing | Zero per call |
The 6-Step Implementation Process
Choose the right on-device framework
Core ML or Foundation Models — the choice is determined by the AI task, not by preference.
Use Core ML when:
- The AI task is classification, detection, segmentation, regression, or prediction
- The feature must run on devices going back to iOS 16 or A12 Bionic
- You have a custom model trained in PyTorch or TensorFlow
- Inference speed must be under 20ms (real-time camera, audio, or typing features)
Use Apple Foundation Models when:
- The AI task involves text: summarization, generation, intent classification, or slot extraction
- The user's device is iPhone 15 Pro or later, or any M-series iPad or Mac
- You need structured output from a language model (typed Swift structs from free text)
- You want generative AI features without bundling a model file or paying API costs
Privacy rule that applies to both:
- Neither framework makes network calls during inference — the data boundary is maintained by design, not by configuration
- If your implementation ever calls a cloud AI API (OpenAI, Anthropic, Google) at inference time, it is no longer on-device AI
Prepare the model for on-device execution
For Core ML: convert, quantize, and bundle. For Foundation Models: add an entitlement.
Core ML model conversion (coremltools):
- Install coremltools: pip install coremltools
- Trace your PyTorch model: traced = torch.jit.trace(model, example_input)
- Convert: mlmodel = ct.convert(traced, inputs=[ct.TensorType(shape=example_input.shape)], compute_units=ct.ComputeUnit.ALL) (traced TorchScript models require an explicit input description)
- Quantize to reduce size: ct.optimize.coreml.palettize_weights(mlmodel, config=...) — Int4 palettization reduces a 100MB model to ~25MB with minimal accuracy loss for most vision and NLP architectures
- Save: mlmodel.save('YourModel.mlpackage') and drag into your Xcode project
Apple Foundation Models (no model file needed):
- Add the FoundationModels framework to your target in Xcode
- Add the com.apple.developer.foundation-models entitlement to your .entitlements file
- The OS provides the model — nothing to download, convert, or bundle
- Check availability at runtime before creating a session: SystemLanguageModel.default.availability
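The setup steps above reduce to a short availability gate. This is a sketch: the SystemLanguageModel and LanguageModelSession names follow Apple's Foundation Models framework, but treat the exact case names as illustrative rather than exhaustive.

```swift
import FoundationModels

// Sketch: only create a session when the on-device model is usable.
func makeSessionIfAvailable() -> LanguageModelSession? {
    switch SystemLanguageModel.default.availability {
    case .available:
        return LanguageModelSession()
    case .unavailable(let reason):
        // Reasons include device eligibility, Apple Intelligence
        // being disabled, or the model still downloading.
        print("On-device model unavailable: \(reason)")
        return nil
    }
}
```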
Implement inference with a Swift actor
Actor isolation ensures inference never blocks the main thread. All prediction calls are async.
Core ML actor pattern:
- Declare an actor (not a class or struct) to hold the loaded MLModel
- Load the model with: self.model = try await MLModel.load(contentsOf:) — this is async and non-blocking
- Expose inference as: func predict(_ input: YourModelInput) async throws -> YourModelOutput
- In your SwiftUI view model, call with: let result = try await inferenceActor.predict(input)
- The Swift concurrency runtime routes the actor method off the main thread automatically
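The Core ML actor pattern above can be sketched as follows. "YourModel" is a placeholder resource name, and raw MLFeatureProvider is used for brevity; in practice you would call the typed wrapper Xcode generates for your model.

```swift
import CoreML

// Actor isolation keeps model loading and prediction off the main thread.
actor InferenceEngine {
    private var model: MLModel?

    private func loadedModel() async throws -> MLModel {
        if let model { return model }
        let config = MLModelConfiguration()
        config.computeUnits = .all  // let Core ML dispatch to the Neural Engine
        // "YourModel" is a placeholder for your compiled model resource.
        guard let url = Bundle.main.url(forResource: "YourModel",
                                        withExtension: "mlmodelc") else {
            throw CocoaError(.fileNoSuchFile)
        }
        let loaded = try await MLModel.load(contentsOf: url, configuration: config)
        model = loaded
        return loaded
    }

    func predict(_ input: MLFeatureProvider) async throws -> MLFeatureProvider {
        try await loadedModel().prediction(from: input)
    }
}
```

From a view model, `let result = try await engine.predict(input)` suspends instead of blocking the UI thread.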
Foundation Models actor pattern:
- Create a LanguageModelSession inside an actor
- Define your output type as a Swift struct conforming to Generable
- For streaming: use session.streamResponse(to: prompt), which returns an async sequence of partial snapshots you can iterate with for try await
- Surface output to SwiftUI with @Published var streamedText = String() updated from the stream
- Wrap the entire session lifecycle in the actor to prevent concurrent session access
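A compact sketch of the typed-output pattern. NoteSummary and its field are hypothetical; the @Generable and @Guide macros and the respond(to:generating:) call follow Apple's Foundation Models API, but verify signatures against the current SDK.

```swift
import FoundationModels

// Hypothetical structured-output type; the model fills in typed fields.
@Generable
struct NoteSummary {
    @Guide(description: "A one-sentence summary of the note")
    var summary: String
}

actor Summarizer {
    private let session = LanguageModelSession()

    func summarize(_ note: String) async throws -> NoteSummary {
        // Typed generation: decoding is constrained to the struct,
        // so you get a NoteSummary back, not free text to parse.
        let response = try await session.respond(
            to: "Summarize this note: \(note)",
            generating: NoteSummary.self
        )
        return response.content
    }
}
```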
What NOT to do:
- Do not call MLModel.prediction() or LanguageModelSession methods from @MainActor context — this blocks the UI thread
- Do not store the MLModel in a global variable or singleton; keep it inside the actor, where isolation guarantees safe concurrent access
- Do not make network calls inside inference functions — if you find yourself adding URLSession calls, you have left on-device AI
Audit the data boundary
Confirm with instrumentation that no user data leaves the device during or after AI inference.
Method 1 — Xcode Network Instrument:
- Open Instruments, select the Network template, attach to your app
- Trigger all AI inference paths: every feature, every input type, every error condition
- Confirm zero connections appear in the Network Instrument timeline during inference
- If any connection fires, identify the source — it is either your code or a third-party SDK
Method 2 — Debug entitlement restriction:
- In a Debug build configuration for macOS or Mac Catalyst targets (where App Sandbox network entitlements apply), enable App Sandbox without the outgoing-connections entitlement (com.apple.security.network.client)
- Any network call in any code path (including SDKs) becomes an immediate runtime error
- This catches accidental cloud calls that silent logging might miss
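As a sketch, a Debug-only entitlements file along these lines enforces the restriction. Note the caveat: App Sandbox network entitlements govern macOS and Mac Catalyst builds; an iOS device build has no equivalent switch, so rely on the Network Instrument there.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <!-- Sandbox on; com.apple.security.network.client deliberately
         absent, so every outgoing connection fails at runtime. -->
    <key>com.apple.security.app-sandbox</key>
    <true/>
</dict>
</plist>
```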
Privacy Nutrition Label review:
- For on-device inference, you do not need to declare data types sent to third-party AI providers
- Document this explicitly in your internal privacy review: 'AI inference input and output are processed on-device. No data transmitted to any third party during AI feature operation.'
- If you later add a cloud fallback path, this declaration must be updated before the next App Store submission
Profile inference on a physical device
The Neural Engine is not available in the Simulator. All performance measurements must use real hardware.
What to measure:
- Inference latency: time from prediction call to result — target under 15ms for real-time features, under 500ms for user-triggered features
- Model load time: first prediction call after app launch includes model compilation — measure and consider preloading
- Peak memory during model load: models expand in memory during loading; a 20MB model may use 80–120MB RAM at peak
- Neural Engine utilization: Instruments Core ML Instrument shows which compute unit handles each layer — confirm ANE is active, not CPU fallback
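For the latency number, a plain monotonic-clock probe is enough. This helper is a hypothetical sketch using the standard library's ContinuousClock (Swift 5.7+); for per-layer detail, use the Core ML Instrument instead.

```swift
// Wraps any synchronous call and prints its wall-clock duration.
// ContinuousClock is monotonic, so it is safe for latency measurement.
func measureLatency<T>(_ label: String, _ work: () throws -> T) rethrows -> T {
    let clock = ContinuousClock()
    let start = clock.now
    let result = try work()
    print("\(label) took \(start.duration(to: clock.now))")
    return result
}

// Usage sketch (classify(_:) is a hypothetical inference call):
// let output = try measureLatency("inference") { try classify(pixelBuffer) }
```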
What to optimize if targets are missed:
- Inference latency too high: apply 4-bit palettization to reduce memory bandwidth, verify computeUnits=.all, reduce input resolution for vision models
- Memory too high: for multi-function models (iOS 18+), set MLModelConfiguration.functionName to load only the prediction function you need, not the full model interface
- Neural Engine not used: check for unsupported operations in your model architecture — ReLU, Conv2D, and linear layers are ANE-compatible; custom ops may force CPU fallback
Deploy with capability gates and staged rollout
Ship the AI feature safely to a subset of users and gate it behind hardware capability checks.
Device capability gates:
- Core ML: add a guard that checks the device model identifier (via utsname) before enabling the feature; iOS 16 still runs on A11 devices such as iPhone 8, while Neural Engine dispatch for third-party models requires A12 Bionic or later
- Foundation Models: always check SystemLanguageModel.default.availability before creating a session; the API reports .available or .unavailable with a specific reason (device not eligible, Apple Intelligence not enabled, model not ready)
- Show a non-AI fallback experience for users on unsupported hardware — never crash or degrade silently
Feature flag and staged rollout:
- Implement a remote feature flag that enables/disables the AI feature independently of the app version — this lets you disable the feature if a memory or performance issue appears post-launch without waiting for an App Store update
- Deploy to 10% of users first, monitor crash rates (Xcode Organizer or Crashlytics), watch memory pressure in analytics, then expand to 50% and 100%
- Log inference success/failure rate (not inference input or output — that would defeat the privacy purpose) to detect silent failures
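Deterministic percentage bucketing can be done client-side against the remote flag's rollout value. A sketch: FNV-1a is used here because Swift's Hasher is seeded per process and would re-bucket users on every launch.

```swift
// Maps a stable user ID to a bucket in 0..<100 using FNV-1a,
// so the same user stays in the same rollout cohort across launches.
func rolloutBucket(userID: String) -> Int {
    var hash: UInt64 = 0xcbf29ce484222325
    for byte in userID.utf8 {
        hash ^= UInt64(byte)
        hash = hash &* 0x100000001b3
    }
    return Int(hash % 100)
}

// Enabled for the first `rolloutPercent` buckets (10, then 50, then 100).
func isAIFeatureEnabled(userID: String, rolloutPercent: Int) -> Bool {
    rolloutBucket(userID: userID) < rolloutPercent
}
```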
Common Mistakes That Break the Privacy Boundary
Cloud fallback that runs silently
A pattern like: if onDeviceInferenceFails { callCloudAPI() } silently transmits data to the cloud on any inference error. This negates the privacy guarantee for a percentage of users without any disclosure. Either drop the feature gracefully on unsupported hardware, or document the cloud fallback explicitly in your privacy nutrition label.
Analytics that log inference input or output
Logging what a user typed, photographed, or spoke — even to a first-party analytics service — defeats the privacy architecture. On-device inference allows you to log inference success/failure rates and latency without logging the content. Keep content processing strictly on-device.
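One way to keep that discipline is to make the telemetry type itself unable to carry content. A minimal sketch:

```swift
// Records only outcome counts; deliberately has no field that could
// hold inference input or output, so content can never be logged.
struct InferenceTelemetry {
    private(set) var total = 0
    private(set) var failures = 0

    mutating func record(succeeded: Bool) {
        total += 1
        if !succeeded { failures += 1 }
    }

    var failureRate: Double {
        total == 0 ? 0 : Double(failures) / Double(total)
    }
}
```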
Third-party SDK network calls during inference
An analytics SDK, A/B testing framework, or crash reporting library may make network calls in the same code path as inference. Use the Xcode Network Instrument to confirm that SDK network calls and inference calls are temporally separate, so no inference context is inadvertently included in an SDK payload.
MLModel loaded on the main thread
Initializing an MLModel synchronously with MLModel(contentsOf:) on @MainActor freezes the UI for 100–500ms during model compilation. Use the async MLModel.load(contentsOf:configuration:) from a Task or background context instead. Actor-isolated loading is the correct pattern for all production apps.
Frequently Asked Questions
How do I integrate on-device AI into my iOS app without sending data to the cloud?
Use Core ML (for classification, detection, and prediction) or Apple Foundation Models (for text generation). Both run entirely on-device using the Apple Neural Engine. The key steps: choose the right framework, prepare the model, implement with a Swift actor pattern, audit the data boundary with Xcode Network Instrument, profile on physical hardware, then deploy with capability gates and a staged rollout.
What frameworks ensure AI inference stays on-device?
Core ML and Apple Foundation Models are both on-device only. Neither framework makes network calls during inference — the privacy boundary is architectural, not configurable. Avoid any framework that routes to a cloud API (OpenAI SDK, Anthropic SDK, LangChain, etc.) unless you intend to send data to the cloud.
How do I verify that my AI feature is not sending data to the cloud?
Use Xcode's Network Instrument to profile all AI inference code paths and confirm zero outbound connections. In Debug builds, add a network entitlement restriction to make any accidental network call a visible runtime error rather than a silent data transmission.
Does on-device AI work without an internet connection?
Yes. Core ML inference uses a model file bundled in the app — no network access required at any point. This also means the AI feature continues to work in airplane mode, on spotty connections, and in any environment where network availability cannot be assumed.
What is the performance difference between on-device and cloud AI?
Core ML inference: 2–15ms for classification tasks. Cloud AI API: 500ms–3s including network round-trip. For tasks Core ML can handle, on-device is therefore one to two orders of magnitude faster end to end. For real-time features (camera, audio, live text analysis), on-device inference is the only option — cloud APIs are too slow.
Do I need to train a custom model?
Not necessarily. Apple's Core ML Model Gallery provides pre-trained models for common vision and NLP tasks. Create ML in Xcode lets you train custom models without Python. For language features, Foundation Models uses Apple's built-in model — no training or model file needed.
Related Technical References
Model types, Swift 6 patterns, privacy architecture, performance benchmarks, and deployment.
Core ML Integration Reference: model conversion with coremltools, actor patterns, Neural Engine optimization, and quantization.
iOS AI Architecture Patterns: how to structure on-device AI in production iOS apps — data flow, actor isolation, and rollout strategy.
On-Device vs Cloud AI Comparison: when to use Core ML, when to use cloud APIs, and when a hybrid approach is appropriate.
Need this built for your iOS app?
The 3NSOFTS On-Device AI Integration sprint implements all six steps for a specific AI feature in your existing iOS app — in 3–4 weeks at a fixed price. Architecture review, Swift 6 implementation, data boundary audit, and production rollout playbook included.
Starting at $5,000. Fixed scope, fixed price. Senior Apple platform delivery throughout.