Machine Learning Integration in iOS Apps: A Technical Decision Framework for 2026
ML integration in iOS is a sequence of constrained decisions, not a feature addition. This framework covers on-device vs cloud inference, Core ML integration, Apple Foundation Models, hybrid architectures, and the failure modes that cost teams weeks in production.
ML integration in iOS is not one decision — it is a sequence, each step constrained by the one before it. The first decision shapes everything: does inference run on the device, or does the data transit to a server?
Everything else flows from that. Which framework you use, how you handle model updates, how you schedule inference, how you handle failures — all of it is downstream of that single architectural choice.
This article is a decision framework for that sequence. It covers the two primary integration paths available in 2026, the constraints that make each appropriate, the failure modes of each, and the implementation details that matter in production.
The assumed reader is a developer or technical decision-maker who has already decided to add ML to an iOS app and needs to understand what that actually requires.
The Two Integration Paths
On-Device Inference
On-device inference means the model runs entirely on the user's device — on the CPU, GPU, or Apple's Neural Engine — using Core ML or Apple Foundation Models. No network request is made. No data leaves the device.
The latency profile is deterministic. Core ML inference on Apple Silicon runs in under 10ms for most classification and regression tasks. Text generation via Apple Foundation Models runs at a cadence set by the Neural Engine — not by server load or network conditions.
The constraints are equally deterministic: the model must fit on the device, and the device must have the hardware to run it efficiently.
Cloud API Inference
Cloud API inference means the app sends data to a remote endpoint — OpenAI, Anthropic, a self-hosted model, or any other provider — and receives a result. The model can be arbitrarily large. Updates happen server-side with no App Store submission required.
The latency profile is not deterministic. Round-trip times typically range from 200ms to 800ms under normal conditions, and degrade further under load or poor connectivity. For a user with unreliable signal, the feature may not function at all.
When On-Device Is the Right Choice
The constraint that makes on-device inference necessary is not performance — it is privacy, reliability, or cost structure.
Privacy as a hard constraint. If your app processes health data, financial records, personal communications, or any data the user has a reasonable expectation of keeping private, sending that data to a third-party inference endpoint is an architectural liability. On-device inference means zero bytes transit to any server. The CalmLedger case study demonstrates this directly — financial transaction data stays on-device through the full inference path.
Offline operation as a hard constraint. If the feature must work without a network connection — emergency scenarios, field work, warehouse environments — cloud inference is not viable. The offgrid:AI architecture is built on this premise: the entire AI capability runs locally, with battery-aware scheduling, because a network dependency would make the app unreliable in the exact scenarios it was built for.
Per-request cost at scale. Cloud inference is billed per token or per request. For features that run frequently — classification on every user action, real-time suggestions, continuous audio analysis — the cost structure compounds quickly. On-device inference has zero marginal cost per inference.
On-device is the right choice when any of these three constraints are present. If none of them apply, the tradeoffs shift.
When Cloud APIs Are the Right Choice
Cloud inference is appropriate when the required model size exceeds what runs efficiently on-device, when the feature requires capabilities Apple Foundation Models does not yet expose, or when inference frequency is low enough that latency and cost are acceptable.
Large language model tasks requiring extensive world knowledge, complex multi-step reasoning, or up-to-date information are reasonable candidates. Code generation, complex document summarisation, and retrieval-augmented tasks fall into this category.
The failure mode to design for is network unavailability. A cloud-dependent AI feature that fails silently or crashes when offline is a production defect. The architecture must account for it — either by degrading gracefully, queuing requests for later execution, or surfacing a clear unavailable state to the user.
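One way to keep that failure mode visible is to model availability as an explicit state the UI can render, rather than letting a failed request surface as an exception. The sketch below is illustrative only; the FeatureAvailability type and CloudFeatureGate are assumptions, not part of any framework, and the connectivity check uses NWPathMonitor from the Network framework.

import Foundation
import Network

// Hypothetical availability state the UI can render directly.
enum FeatureAvailability {
    case available
    case queuedForLater                  // request stored and retried when connectivity returns
    case unavailable(reason: String)
}

final class CloudFeatureGate {
    private let monitor = NWPathMonitor()
    private let queue = DispatchQueue(label: "cloud-feature-gate")
    private(set) var isNetworkReachable = false

    func start() {
        monitor.pathUpdateHandler = { [weak self] path in
            self?.isNetworkReachable = (path.status == .satisfied)
        }
        monitor.start(queue: queue)
    }

    func availability() -> FeatureAvailability {
        isNetworkReachable ? .available : .unavailable(reason: "No network connection")
    }
}

Queuing is the harder path; for many features, surfacing the unavailable state and letting the user retry is the simpler and more honest behaviour.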
The Core ML Integration Path
Core ML is the framework for on-device inference on Apple platforms. It handles model loading, hardware routing (CPU, GPU, or Neural Engine), and the inference call itself. The developer's responsibility is model acquisition, conversion, and integration into the app's data flow.
Model Acquisition and Conversion
Core ML requires models in .mlpackage or .mlmodel format. Models trained in PyTorch or TensorFlow must be converted using coremltools. Apple also provides pre-converted models through the Core ML Models page, and Create ML can train custom models that export to Core ML format without a separate conversion step.
The conversion step is where most integration problems originate. Quantisation decisions made during conversion directly affect model size, inference speed, and output quality. A model that performs well in PyTorch may degrade measurably after int8 quantisation — this must be validated before the model ships.
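One way to make that validation concrete is to run the full-precision and quantised builds of the same model over a shared validation set and measure how often they agree. The sketch below is hypothetical: MyMLModel is the Xcode-generated class used elsewhere in this article, and label is an assumed output feature, not a real API.

import CoreML

// Hypothetical harness: compare a full-precision and a quantised build of the same model.
func agreementRate(full: MyMLModel,
                   quantised: MyMLModel,
                   inputs: [MyMLModelInput]) throws -> Double {
    guard !inputs.isEmpty else { return 0 }
    var matches = 0
    for input in inputs {
        let a = try full.prediction(input: input)
        let b = try quantised.prediction(input: input)
        if a.label == b.label { matches += 1 }   // `label` is an assumed output feature
    }
    return Double(matches) / Double(inputs.count)
}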
The Inference Call
A Core ML inference call is synchronous from the framework's perspective, but should always be dispatched off the main thread. The standard pattern uses an actor to isolate the model and prevent concurrent access:
import CoreML

// MyMLModel is the class Xcode generates from the bundled .mlpackage.
actor InferenceEngine {
    private let model: MyMLModel

    init() throws {
        // Load the compiled model from the app bundle
        self.model = try MyMLModel(configuration: MLModelConfiguration())
    }

    func predict(input: MyMLModelInput) throws -> MyMLModelOutput {
        return try model.prediction(input: input)
    }
}
The MLModelConfiguration object controls hardware routing. Setting .computeUnits = .cpuAndNeuralEngine restricts inference to the Neural Engine and CPU — appropriate for most production use cases, since it avoids GPU contention with the render pipeline.
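In code, using the same hypothetical generated class as above:

import CoreML

let configuration = MLModelConfiguration()
// Keep inference on the CPU and Neural Engine so it does not compete with the render pipeline for the GPU.
configuration.computeUnits = .cpuAndNeuralEngine
let model = try MyMLModel(configuration: configuration)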
Battery-Aware Scheduling
Continuous inference tasks — audio classification, motion analysis, real-time image processing — require explicit battery-aware scheduling. Running inference at full frequency in a background task will drain the battery and trigger iOS thermal throttling, which degrades performance in ways that are difficult to predict or reproduce.
The correct approach is to check ProcessInfo.processInfo.isLowPowerModeEnabled and reduce inference frequency when Low Power Mode is active. For background tasks, BGProcessingTask with requiresExternalPower set appropriately gives the scheduler enough information to defer work intelligently.
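A sketch of both checks follows. The task identifier and the interval values are assumptions; the identifier must also be declared under BGTaskSchedulerPermittedIdentifiers in Info.plist, and the right frequencies depend on the workload.

import BackgroundTasks
import Foundation

// Hypothetical identifier; register it in Info.plist under BGTaskSchedulerPermittedIdentifiers.
let batchInferenceTaskID = "com.example.app.batch-inference"

func scheduleBatchInference() {
    let request = BGProcessingTaskRequest(identifier: batchInferenceTaskID)
    request.requiresExternalPower = true        // defer heavy inference until the device is charging
    request.requiresNetworkConnectivity = false // on-device inference needs no network
    try? BGTaskScheduler.shared.submit(request)
}

// Reduce continuous-inference frequency when Low Power Mode is active.
var inferenceInterval: TimeInterval {
    ProcessInfo.processInfo.isLowPowerModeEnabled ? 5.0 : 1.0
}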
Apple Foundation Models in 2026
Apple Foundation Models is the framework for accessing Apple Intelligence capabilities on-device. It exposes a language model that runs entirely on the Neural Engine — no data leaves the device.
The framework is structured around sessions and instructions:
import FoundationModels

// Each session maintains its own context; instructions can also be supplied at creation.
let session = LanguageModelSession()
let response = try await session.respond(to: "Summarize the following notes: \(userNotes)")
// response.content holds the generated text.
The model is not directly accessible — you interact through the session API. You cannot inspect weights, adjust temperature directly, or fine-tune the model. The tradeoff: the model is maintained and updated by Apple, runs with zero infrastructure cost, and carries Apple's privacy guarantees.
The practical constraint: Apple Foundation Models require Apple Intelligence support — iPhone 15 Pro and later, and M-series iPads and Macs. Apps targeting older hardware must either fall back to a lighter Core ML model or degrade gracefully for unsupported devices.
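A minimal capability gate looks like this; it is the same check the failure-modes section below calls out.

import FoundationModels

// Gate the feature on availability before creating a session.
if SystemLanguageModel.default.isAvailable {
    // Safe to create a LanguageModelSession and call respond(to:).
} else {
    // Unsupported hardware, Apple Intelligence disabled, or model not yet downloaded:
    // fall back to a lighter Core ML model or hide the feature.
}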
For a full implementation walkthrough, the Apple Intelligence integration guide covers the session lifecycle, streaming responses, guided generation, and capability detection in detail.
Hybrid Architectures
On-device and cloud inference are not mutually exclusive. A well-designed architecture uses on-device inference where the constraints demand it — privacy-sensitive classification, offline functionality, low-latency suggestions — and cloud inference where model capability is the binding constraint.
The design premise for a hybrid architecture: on-device handles the default path, cloud handles the exception path. The app is fully functional without a network connection. Cloud inference is an enhancement, not a dependency.
The implementation requires a routing layer that decides which path to take based on the feature, current network state, and the user's privacy preferences. That routing logic must be explicit and testable — not implicit in a series of if statements scattered across the codebase.
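As a sketch of what "explicit and testable" can mean, the types below are hypothetical; the point is that the decision lives in one value type that a unit test can exercise directly.

// Illustrative routing policy; all names are assumptions, not framework API.
enum InferenceRoute {
    case onDevice
    case cloud
    case unavailable
}

struct RoutingPolicy {
    var privacySensitive: Bool
    var requiresLargeModel: Bool
    var networkReachable: Bool
    var cloudAllowedByUser: Bool

    func route() -> InferenceRoute {
        if privacySensitive { return .onDevice }            // privacy is a hard constraint
        if !requiresLargeModel { return .onDevice }         // the default path stays local
        if networkReachable && cloudAllowedByUser { return .cloud }
        return .unavailable                                 // cloud-only capability, currently offline
    }
}

The policy itself can live behind whatever actor owns the two inference paths; a test constructs it with the relevant flags and asserts on route(), with no network or model involved.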
The Swift 6 AI integration guide covers the concurrency model for managing both inference paths safely under Swift 6's strict concurrency rules, including how to structure the routing actor and handle cancellation correctly.
Common Integration Failure Modes
Model loaded on the main thread. Core ML model initialisation is not instantaneous. Loading a .mlpackage on the main thread blocks the UI. Load the model once at app launch, in a background task, and store it in an actor-isolated property.
No fallback for unsupported hardware. Apple Foundation Models require Apple Intelligence-capable hardware. Shipping without a capability check results in a crash on unsupported devices. Check SystemLanguageModel.default.isAvailable before any Foundation Models call.
Inference called on every keystroke. For text-based features, calling inference on every character input creates a backlog of tasks that degrades performance and drains the battery. Debounce the input — 300ms is a reasonable starting point for most text classification tasks. A debounce sketch appears after this list of failure modes.
Model bundled without quantisation. A full-precision model that could be quantised to int8 with acceptable quality loss adds unnecessary binary size. App Store download size limits make this a practical constraint, not just an optimisation.
No error handling on the inference call. Core ML inference can throw. A model that receives input outside its expected range, or encounters a hardware error, will throw rather than return a degraded result. Every inference call needs explicit error handling with a defined fallback behaviour.
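The debounce and fallback patterns combine naturally. A minimal sketch, assuming a view model on the main actor and a classify closure standing in for whatever inference call the feature makes (both hypothetical):

import Foundation

@MainActor
final class SuggestionViewModel {
    private var pendingTask: Task<Void, Never>?
    private let classify: (String) async throws -> [String]   // hypothetical inference call

    init(classify: @escaping (String) async throws -> [String]) {
        self.classify = classify
    }

    func textDidChange(_ text: String) {
        pendingTask?.cancel()
        pendingTask = Task {
            // Debounce: wait 300ms; newer keystrokes cancel this task before inference runs.
            try? await Task.sleep(for: .milliseconds(300))
            guard !Task.isCancelled else { return }
            do {
                let suggestions = try await classify(text)
                // Update the UI with `suggestions`.
                _ = suggestions
            } catch {
                // Defined fallback: keep the previous suggestions rather than surfacing an error.
            }
        }
    }
}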
Decision Summary
ML integration in iOS is a structural decision, not a feature addition. The framework choice, the inference path, the fallback behaviour — each is a consequence of the constraints the app actually operates under. Get the constraints right first. The implementation follows.
| Constraint | On-Device (Core ML / Foundation Models) | Cloud API |
|---|---|---|
| Private user data | Required | Architectural liability |
| Must work offline | Required | Not viable |
| High inference frequency | Zero marginal cost | Cost compounds |
| Large model required | Limited by device | Suitable |
| Complex world knowledge | Limited | Suitable |
| Deterministic latency | Yes (<10ms typical) | No (200–800ms) |
For production-grade ML integration on Apple platforms, see the On-Device AI Integration service for a framework selection audit before writing inference code.