How-To Guide · Core ML · Foundation Models · iOS

How to Integrate On-Device AI into Your iOS App Without Sending Data to the Cloud

A step-by-step production guide using Core ML and Apple Foundation Models. The privacy guarantee is architectural — because inference runs locally, there is no network call to intercept and no data to transmit.

Why on-device inference is private by design

A cloud AI API call sends your user’s data to a third-party server to be processed. On-device inference sends nothing — the model runs on the device’s Apple Neural Engine using data that never leaves RAM. This is not a privacy policy or a configuration option. It is the physical architecture.

  • Implementation: 6 steps
  • Core ML inference latency: 2–15 ms
  • Cloud transmissions: zero
  • Minimum deployment: iOS 16+

By Ehsan Azish · 3NSOFTS · May 2026

Why Cloud AI Breaks the Privacy Boundary

Cloud AI APIs (OpenAI, Anthropic, Google, and similar) require your app to send user data over the network to a third-party server. The server runs inference on that data and returns a result. Every step involves data transmission:

Stage | Cloud AI API | On-Device (Core ML)
Input data | Sent over HTTPS to third-party server | Stays in device RAM
Inference | Runs on cloud GPU/TPU cluster | Runs on Apple Neural Engine
Output | Returned over HTTPS, logged server-side | Returned in-process, never transmitted
User data exposure | Third-party processor receives and may log | No third party involved
Privacy label | Must declare data types sent to provider | No AI-related third-party data to declare
Offline support | Fails without network | Full functionality offline
Inference cost | Per-token or per-request billing | Zero per call

The 6-Step Implementation Process

Step 01: Choose the right on-device framework

Core ML or Foundation Models — the choice is determined by the AI task, not by preference.

Use Core ML when:

  • The AI task is classification, detection, segmentation, regression, or prediction
  • The feature must run on devices going back to iOS 16 or A12 Bionic
  • You have a custom model trained in PyTorch or TensorFlow
  • Inference speed must be under 20ms (real-time camera, audio, or typing features)

Use Apple Foundation Models when:

  • The AI task involves text: summarization, generation, intent classification, or slot extraction
  • The user's device is iPhone 15 Pro or later, or any M-series iPad or Mac
  • You need structured output from a language model (typed Swift structs from free text)
  • You want generative AI features without bundling a model file or paying API costs

Privacy rule that applies to both:

  • Neither framework makes network calls during inference — the data boundary is maintained by design, not by configuration
  • If your implementation ever calls a cloud AI API (OpenAI, Anthropic, Google) at inference time, it is no longer on-device AI

Step 02: Prepare the model for on-device execution

For Core ML: convert, quantize, and bundle. For Foundation Models: add an entitlement.

Core ML model conversion (coremltools):

  • Install coremltools: pip install coremltools
  • Trace your PyTorch model: traced = torch.jit.trace(model, example_input)
  • Convert: mlmodel = ct.convert(traced, compute_units=ct.ComputeUnit.ALL)
  • Quantize to reduce size: ct.optimize.coreml.palettize_weights(mlmodel, config=...) — 4-bit palettization reduces a 100MB model to roughly 25MB with minimal accuracy loss for most vision and NLP architectures
  • Save: mlmodel.save('YourModel.mlpackage') and drag into your Xcode project

Apple Foundation Models (no model file needed):

  • Add the FoundationModels framework to your target in Xcode
  • Add the com.apple.developer.foundation-models entitlement to your .entitlements file
  • The OS provides the model — nothing to download, convert, or bundle
  • Check availability at runtime before using: LanguageModelSession.availability (a minimal availability gate is sketched below)
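
A minimal sketch of that availability gate. It assumes the check is exposed on the system model object (SystemLanguageModel.default.availability in current SDKs; the article refers to the same check via LanguageModelSession) and that the result is either .available or .unavailable with a reason; exact case names may differ by SDK version.

    import FoundationModels

    /// Gate the generative feature on model availability before creating a session.
    /// Assumption: availability is read from the system model and carries a reason
    /// (unsupported hardware, Apple Intelligence disabled, model still downloading).
    func foundationModelIsAvailable() -> Bool {
        switch SystemLanguageModel.default.availability {
        case .available:
            return true
        case .unavailable(let reason):
            // Log for diagnostics; show the non-AI fallback UI instead of crashing.
            print("Foundation model unavailable: \(reason)")
            return false
        }
    }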

Step 03: Implement inference with a Swift actor

Actor isolation ensures inference never blocks the main thread. All prediction calls are async.

Core ML actor pattern (a minimal sketch follows this list):

  • Declare an actor (not a class or struct) to hold the loaded MLModel
  • Load the model with: self.model = try await MLModel.load(contentsOf:) — this is async and non-blocking
  • Expose inference as: func predict(_ input: YourModelInput) async throws -> YourModelOutput
  • In your SwiftUI view model, call with: let result = try await inferenceActor.predict(input)
  • The Swift concurrency runtime routes the actor method off the main thread automatically
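
A minimal sketch of that actor, assuming a model bundled as YourModel.mlmodelc (the compiled form Xcode produces from an .mlpackage). It uses the generic MLFeatureProvider interface; with Xcode-generated model classes you would use their typed input and output instead.

    import CoreML

    /// Owns the loaded model. Actor isolation serializes access to the MLModel
    /// instance and keeps prediction work off the caller's thread.
    actor InferenceActor {
        private var model: MLModel?

        enum InferenceError: Error { case modelNotLoaded }

        /// Loads (and compiles, on first use) the model exactly once.
        func loadIfNeeded(from modelURL: URL) async throws {
            guard model == nil else { return }
            let configuration = MLModelConfiguration()
            configuration.computeUnits = .all   // let Core ML schedule ANE/GPU/CPU per layer
            model = try await MLModel.load(contentsOf: modelURL, configuration: configuration)
        }

        /// Runs one prediction.
        func predict(_ input: MLFeatureProvider) async throws -> MLFeatureProvider {
            guard let model else { throw InferenceError.modelNotLoaded }
            return try model.prediction(from: input)
        }
    }

From a view model, let result = try await inferenceActor.predict(input) hops onto the actor's executor, so inference work never blocks the main thread.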

Foundation Models actor pattern (a minimal sketch follows this list):

  • Create a LanguageModelSession inside an actor
  • Define your output type as a Swift struct conforming to Generable
  • For streaming: use session.streamResponse(to: prompt) which returns an AsyncStream
  • Surface output to SwiftUI with @Published var streamedText = String() updated from the stream
  • Wrap the entire session lifecycle in the actor to prevent concurrent session access
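
A minimal sketch of a session-owning actor for a one-shot text task. It assumes LanguageModelSession(instructions:) and respond(to:) returning a response whose content property is the generated text, as in current Foundation Models SDKs; treat the exact signatures as assumptions to verify against your SDK.

    import FoundationModels

    /// Owns a single LanguageModelSession; the actor prevents concurrent calls
    /// into the same session from different tasks.
    actor SummarizerActor {
        private lazy var session = LanguageModelSession(
            instructions: "Summarize the user's note in two sentences."
        )

        /// One-shot generation: prompt in, generated text out.
        func summarize(_ note: String) async throws -> String {
            let response = try await session.respond(to: note)
            return response.content
        }

        // For streaming UI, iterate session.streamResponse(to:) with `for try await`
        // and publish each partial result to the view model on the main actor.
    }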

What NOT to do:

  • Do not call MLModel.prediction() or LanguageModelSession methods from @MainActor context — this blocks the UI thread
  • Do not store MLModel as a global singleton — actor isolation makes concurrency safe
  • Do not make network calls inside inference functions — if you find yourself adding URLSession calls, you have left on-device AI

Step 04: Audit the data boundary

Confirm with instrumentation that no user data leaves the device during or after AI inference.

Method 1 — Xcode Network Instrument:

  • Open Instruments, select the Network template, attach to your app
  • Trigger all AI inference paths: every feature, every input type, every error condition
  • Confirm zero connections appear in the Network Instrument timeline during inference
  • If any connection fires, identify the source — it is either your code or a third-party SDK

Method 2 — Debug entitlement restriction (a code-level complement is sketched below):

  • In a Debug build configuration, add an App Sandbox entitlement that disables outgoing connections
  • Any network call in any code path (including SDKs) becomes an immediate runtime error
  • This catches accidental cloud calls that silent logging might miss
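
As a code-level complement to the entitlement restriction, a Debug-only URLProtocol can fail every URLSession request loudly. This is an addition to the article's method rather than an Apple-documented audit step, and it only covers URLSession traffic that goes through registered protocols (for example the shared session), not raw sockets or sessions that exclude them.

    import Foundation

    #if DEBUG
    /// Debug-only guard: any URLSession request through a registered protocol fails
    /// immediately, so an accidental cloud call surfaces during development.
    final class NetworkKillSwitch: URLProtocol {
        override class func canInit(with request: URLRequest) -> Bool { true }
        override class func canonicalRequest(for request: URLRequest) -> URLRequest { request }

        override func startLoading() {
            assertionFailure("Unexpected network call during on-device AI testing: \(request.url?.absoluteString ?? "unknown URL")")
            client?.urlProtocol(self, didFailWithError: URLError(.notConnectedToInternet))
        }

        override func stopLoading() {}
    }

    /// Call once at launch in Debug builds (for example from the App initializer).
    func installNetworkKillSwitch() {
        URLProtocol.registerClass(NetworkKillSwitch.self)
    }
    #endif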

Privacy Nutrition Label review:

  • For on-device inference, you do not need to declare data types sent to third-party AI providers
  • Document this explicitly in your internal privacy review: 'AI inference input and output are processed on-device. No data transmitted to any third party during AI feature operation.'
  • If you later add a cloud fallback path, this declaration must be updated before the next App Store submission

Step 05: Profile inference on a physical device

The Neural Engine does not run in Simulator. All performance measurements must use real hardware.

What to measure (a timing sketch follows this list):

  • Inference latency: time from prediction call to result — target under 15ms for real-time features, under 500ms for user-triggered features
  • Model load time: first prediction call after app launch includes model compilation — measure and consider preloading
  • Peak memory during model load: models expand in memory during loading; a 20MB model may use 80–120MB RAM at peak
  • Neural Engine utilization: the Core ML instrument in Instruments shows which compute unit handles each layer — confirm the ANE is active, not CPU fallback
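
A rough way to capture the load-time and latency numbers, reusing the InferenceActor sketch from Step 3 and ContinuousClock; modelURL and input stand in for your bundled model and a prepared input. For numbers you can line up against the Instruments timeline, pair this with the Core ML instrument and signposts.

    import CoreML

    /// Times model load and a single prediction on a physical device.
    func profileInference(modelURL: URL, input: MLFeatureProvider) async throws {
        let clock = ContinuousClock()
        let inference = InferenceActor()

        // First load includes on-device compilation; later launches hit the cache.
        let loadStart = clock.now
        try await inference.loadIfNeeded(from: modelURL)
        let loadTime = loadStart.duration(to: clock.now)

        // Warm-up call so compilation and caching do not skew the latency number.
        _ = try await inference.predict(input)

        let predictStart = clock.now
        _ = try await inference.predict(input)
        let latency = predictStart.duration(to: clock.now)

        print("Model load: \(loadTime), single inference: \(latency)")
    }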

What to optimize if targets are missed (a configuration sketch follows this list):

  • Inference latency too high: apply 4-bit palettization to reduce memory bandwidth, verify computeUnits=.all, reduce input resolution for vision models
  • Memory too high: use MLModelConfiguration.functionNames to load only the prediction function, not the full model interface
  • Neural Engine not used: check for unsupported operations in your model architecture — ReLU, Conv2D, and linear layers are ANE-compatible; custom ops may force CPU fallback
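
If the Core ML instrument shows CPU fallback, the first knob is the configuration passed at load time. A small sketch; restricting compute units is a diagnostic technique, not a production setting.

    import CoreML

    /// Builds the configuration used when loading the model (see the Step 3 actor).
    func makeModelConfiguration(diagnoseANEFallback: Bool = false) -> MLModelConfiguration {
        let configuration = MLModelConfiguration()
        if diagnoseANEFallback {
            // Restrict to CPU + Neural Engine so the GPU cannot mask a layer that is
            // silently incompatible with the ANE.
            configuration.computeUnits = .cpuAndNeuralEngine
        } else {
            // Production default: let Core ML schedule layers across ANE, GPU, and CPU.
            configuration.computeUnits = .all
        }
        return configuration
    }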

Step 06: Deploy with capability gates and staged rollout

Ship the AI feature safely to a subset of users and gate it behind hardware capability checks.

Device capability gates:

  • Core ML: add a guard that checks ProcessInfo.processInfo.processorCount and device model before enabling the feature — ensure the minimum deployment target actually has A12 Bionic or later
  • Foundation Models: always check LanguageModelSession.availability before creating a session — the API returns .available, .unavailable, or a specific reason code
  • Show a non-AI fallback experience for users on unsupported hardware — never crash or degrade silently

Feature flag and staged rollout (a gating sketch follows this list):

  • Implement a remote feature flag that enables/disables the AI feature independently of the app version — this lets you disable the feature if a memory or performance issue appears post-launch without waiting for an App Store update
  • Deploy to 10% of users first, monitor crash rates (Xcode Organizer or Crashlytics), watch memory pressure in analytics, then expand to 50% and 100%
  • Log inference success/failure rate (not inference input or output — that would defeat the privacy purpose) to detect silent failures
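
A sketch of how the remote flag, the hardware gate, and content-free logging can compose. RemoteFlags, Analytics, and the "on_device_ai_rollout" key are hypothetical stand-ins for whatever flagging and analytics layers your app already uses.

    import Foundation

    /// Hypothetical stand-ins for the app's existing feature-flag and analytics layers.
    protocol RemoteFlags { func isEnabled(_ key: String) -> Bool }
    protocol Analytics { func log(_ event: String, metadata: [String: String]) }

    struct AIFeatureGate {
        let flags: RemoteFlags
        let analytics: Analytics

        /// The AI path runs only when the rollout flag is on AND the hardware supports it.
        func aiFeatureEnabled(deviceSupportsModel: Bool) -> Bool {
            flags.isEnabled("on_device_ai_rollout") && deviceSupportsModel
        }

        /// Log outcome and latency only; never the inference input or output.
        func logInference(succeeded: Bool, latencyMilliseconds: Int) {
            analytics.log("ai_inference_completed", metadata: [
                "success": String(succeeded),
                "latency_ms": String(latencyMilliseconds)
            ])
        }
    }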

Common Mistakes That Break the Privacy Boundary

Cloud fallback that runs silently

A pattern like: if onDeviceInferenceFails { callCloudAPI() } silently transmits data to the cloud on any inference error. This negates the privacy guarantee for a percentage of users without any disclosure. Either drop the feature gracefully on unsupported hardware, or document the cloud fallback explicitly in your privacy nutrition label.

Analytics that log inference input or output

Logging what a user typed, photographed, or spoke — even to a first-party analytics service — defeats the privacy architecture. On-device inference allows you to log inference success/failure rates and latency without logging the content. Keep content processing strictly on-device.

Third-party SDK network calls during inference

An analytics SDK, A/B testing framework, or crash reporting library may make network calls in the same code path as inference. Use the Xcode Network Instrument to confirm that SDK network calls and inference calls are temporally separate, so no inference context is inadvertently included in an SDK payload.

MLModel loaded on the main thread

Calling MLModel.load() synchronously on @MainActor freezes the UI for 100–500ms during model compilation. Use async let or Task to load the model on a background context. Actor-isolated loading is the correct pattern for all production apps.
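
One way to do that preload, assuming the InferenceActor from Step 3 and a bundled model compiled to YourModel.mlmodelc; the app and view names are placeholders. The task starts the load during startup, and the await suspends instead of blocking the main thread while the actor does the work.

    import SwiftUI

    struct ContentView: View {
        var body: some View { Text("Ready") }
    }

    @main
    struct MyAIApp: App {
        // Shared actor instance from the Step 3 sketch.
        private let inference = InferenceActor()

        var body: some Scene {
            WindowGroup {
                ContentView()
                    .task {
                        // Kick off model load early so the first user-triggered
                        // prediction does not pay the compilation cost.
                        guard let modelURL = Bundle.main.url(forResource: "YourModel",
                                                             withExtension: "mlmodelc") else { return }
                        try? await inference.loadIfNeeded(from: modelURL)
                    }
            }
        }
    }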

Frequently Asked Questions

How do I integrate on-device AI into my iOS app without sending data to the cloud?

Use Core ML (for classification, detection, and prediction) or Apple Foundation Models (for text generation). Both run entirely on-device using the Apple Neural Engine. The key steps: choose the right framework, prepare the model, implement with a Swift actor pattern, audit the data boundary with Xcode Network Instrument, profile on physical hardware, then deploy with capability gates and a staged rollout.

What frameworks ensure AI inference stays on-device?

Core ML and Apple Foundation Models are both on-device only. Neither framework makes network calls during inference — the privacy boundary is architectural, not configurable. Avoid any framework that routes to a cloud API (OpenAI SDK, Anthropic SDK, LangChain, etc.) unless you intend to send data to the cloud.

How do I verify that my AI feature is not sending data to the cloud?

Use Xcode's Network Instrument to profile all AI inference code paths and confirm zero outbound connections. In Debug builds, add a network entitlement restriction to make any accidental network call a visible runtime error rather than a silent data transmission.

Does on-device AI work without an internet connection?

Yes. Core ML inference uses a model file bundled in the app — no network access required at any point. This also means the AI feature continues to work in airplane mode, on spotty connections, and in any environment where network availability cannot be assumed.

What is the performance difference between on-device and cloud AI?

Core ML inference: 2–15ms for classification tasks. Cloud AI API: 500ms–3s including network round-trip. For tasks Core ML can handle, on-device inference is one to two orders of magnitude faster. For real-time features (camera, audio, live text analysis), on-device inference is the only option — cloud APIs are too slow.

Do I need to train a custom model?

Not necessarily. Apple's Core ML Model Gallery provides pre-trained models for common vision and NLP tasks. Create ML in Xcode lets you train custom models without Python. For language features, Foundation Models uses Apple's built-in model — no training or model file needed.


Need this built for your iOS app?

The 3NSOFTS On-Device AI Integration sprint implements all six steps for a specific AI feature in your existing iOS app — in 3–4 weeks at a fixed price. Architecture review, Swift 6 implementation, data boundary audit, and production rollout playbook included.

Starting at $5,000. Fixed scope, fixed price. Senior Apple platform delivery throughout.