Apple Intelligence Integration in iOS Apps: A 2026 Implementation Guide
Production-quality Apple Intelligence integration for iOS apps in 2026. Covers FoundationModels session lifecycle, streaming, guided generation, Core ML for domain-specific tasks, actor isolation, graceful degradation on older devices, and privacy boundaries.
For most of iOS development history, adding AI to an app meant adding a network dependency. You sent data to a cloud endpoint, waited for a response, and handled the latency and failure modes that came with it.
Apple Intelligence changes that premise. Inference runs on the device. Data never leaves. Latency drops from hundreds of milliseconds to single digits. That is not a performance optimisation — it is an architectural category change. An app built around on-device inference has different failure modes, different privacy properties, and different design constraints than one built around cloud API calls.
This guide covers how to integrate Apple Intelligence into a production iOS app in 2026: which APIs to use, where the real constraints are, and what separates prototype-quality integration from production-quality architecture.
What Apple Intelligence Actually Is (Architecturally)
Apple Intelligence is not a single API. It is a set of system-level capabilities exposed through distinct frameworks, each with its own integration surface.
The three primary integration paths in 2026:
- Apple Foundation Models — direct access to the on-device language model via a Swift API, introduced at WWDC 2025
- Core ML — the inference engine for custom .mlpackage models, including third-party and fine-tuned models
- Writing Tools and system UI extensions — opt-in integration with Apple's system-level text processing features
Each path serves a different integration scenario. Choosing the wrong one creates unnecessary complexity. For a direct comparison of the two inference approaches, Foundation Models and Core ML, see the Foundation Models vs Core ML breakdown.
The Constraint That Shapes Every Decision
Apple Intelligence requires Apple Silicon. On iPhone, that means A17 Pro or later. On iPad and Mac, M-series chips. Older devices do not support it.
Every architectural decision flows from that constraint.
An app that assumes Apple Intelligence availability will fail silently — or crash — on a significant portion of the installed base. Graceful degradation is not optional. It is the first design requirement.
The second constraint: the on-device model is general-purpose. It is not fine-tuned for your domain. For tasks requiring domain-specific knowledge or specialised classification, a custom Core ML model will outperform the Foundation Models API. See the On-Device AI iOS Core ML implementation guide for the custom model path.
Integration Path 1: Apple Foundation Models
The FoundationModels framework gives you direct programmatic access to the on-device language model. No API key. No network request. No data leaving the device.
The entry point is LanguageModelSession. You create a session, optionally configure it with a system prompt, and send prompts. The model responds.
Session Lifecycle
LanguageModelSession is not cheap to initialise. The model loads into memory on first use — on current hardware, that takes 200–400ms. Creating a new session per request wastes that time on every call.
The correct pattern: create one session per logical conversation or task context, reuse it for the duration of that context, and release it when the context ends.
import FoundationModels
import SwiftUI

enum SummaryError: Error {
    case sessionNotReady
}

@MainActor
final class SummaryViewModel: ObservableObject {
    private var session: LanguageModelSession?

    func prepareSession() {
        // System prompt scopes the model's behaviour for this context
        let instructions = "Summarise the provided text concisely. Return plain text only."
        session = LanguageModelSession(instructions: instructions)
    }

    func summarise(_ text: String) async throws -> String {
        guard let session else { throw SummaryError.sessionNotReady }
        let response = try await session.respond(to: text)
        return response.content
    }
}
Streaming Responses with AsyncStream
For longer outputs, waiting for the full response before updating the UI produces a poor experience. The Foundation Models API supports streaming responses; wrapping the stream in an AsyncStream<String> gives the UI layer a plain, non-throwing sequence of partial results to consume.
func streamSummary(_ text: String) -> AsyncStream<String> {
    AsyncStream { continuation in
        let task = Task {
            guard let session = self.session else {
                continuation.finish()
                return
            }
            do {
                // Each element is the latest partial output from the model
                for try await partial in session.streamResponse(to: text) {
                    continuation.yield(partial)
                }
                continuation.finish()
            } catch {
                continuation.finish()
            }
        }
        // Stop generating if the consumer stops iterating
        continuation.onTermination = { _ in task.cancel() }
    }
}
Each yield delivers the latest partial output. The view replaces its text with each update, so the user sees output appearing as it is generated, not after a blank wait.
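A minimal consumption sketch, assuming streamSummary(_:) lives on the SummaryViewModel from the previous section and that a hypothetical SummaryView drives it:

import SwiftUI

struct SummaryView: View {
    @StateObject private var viewModel = SummaryViewModel()
    @State private var summary = ""
    let article: String

    var body: some View {
        Text(summary)
            .task {
                viewModel.prepareSession()
                for await partial in viewModel.streamSummary(article) {
                    summary = partial   // Replace with the latest partial output
                }
            }
    }
}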
Guided Generation and Structured Output
Free-form text output is rarely what a production app needs. If the model needs to return structured data — a JSON object, a classification label, a ranked list — use guided generation.
The Foundation Models framework supports constrained decoding through guided generation: you define the output schema as a Swift type, and the model is constrained to produce output that conforms to it. This eliminates the parsing fragility that comes from prompting for JSON and hoping the model complies.
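A minimal sketch of a guided generation call; the ArticleSummary type and its fields are hypothetical, chosen only to illustrate the shape of the API:

import FoundationModels

// Hypothetical schema; the model's output is constrained to match it
@Generable
struct ArticleSummary {
    @Guide(description: "A one-sentence summary of the article")
    var headline: String

    @Guide(description: "Three to five key points from the article")
    var keyPoints: [String]
}

func summariseStructured(_ text: String, using session: LanguageModelSession) async throws -> ArticleSummary {
    // respond(to:generating:) constrains decoding to the ArticleSummary schema
    let response = try await session.respond(to: text, generating: ArticleSummary.self)
    return response.content
}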
Integration Path 2: Core ML for Custom Inference
The Foundation Models API is general-purpose. For domain-specific tasks — sentiment classification, named entity recognition, image labelling, anomaly detection — a custom Core ML model is the right tool.
Core ML runs inference entirely on-device using the Neural Engine. On Apple Silicon, inference on a quantised classification model runs in under 10ms — fast enough to run synchronously in response to user input without perceptible delay.
Model Selection and Quantisation
The constraint here is model size relative to memory budget. A 7B parameter model in full precision does not fit in the memory envelope available to a foreground app. Quantisation is the mechanism that makes on-device inference practical.
The coremltools Python library handles conversion and quantisation from PyTorch or TensorFlow. For most classification and embedding tasks, 4-bit or 8-bit quantisation produces negligible accuracy loss with a 4–8x reduction in model size.
import coremltools as ct

# traced_model: a PyTorch model already traced with torch.jit.trace (not shown)
# Convert to an ML Program package for on-device deployment
model = ct.convert(
    traced_model,
    inputs=[ct.TensorType(name="input", shape=(1, 512))],
    compute_units=ct.ComputeUnit.ALL,  # Uses Neural Engine where available
)

# Apply 8-bit weight quantisation
op_config = ct.optimize.coreml.OpLinearQuantizerConfig(mode="linear_symmetric")
config = ct.optimize.coreml.OptimizationConfig(global_config=op_config)
compressed_model = ct.optimize.coreml.linear_quantize_weights(model, config=config)
compressed_model.save("Classifier.mlpackage")
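On the app side, the compiled package is loaded through Core ML's standard APIs. A minimal sketch, assuming the Classifier.mlpackage above has been added to the app target (Xcode compiles it into the bundle as Classifier.mlmodelc) and exposes a single "input" feature plus a string-valued "label" output; both feature names are illustrative:

import CoreML
import Foundation

func classify(_ features: MLMultiArray) throws -> String? {
    let config = MLModelConfiguration()
    config.computeUnits = .all   // Prefer the Neural Engine where available

    // Xcode compiles Classifier.mlpackage into Classifier.mlmodelc inside the bundle
    guard let url = Bundle.main.url(forResource: "Classifier", withExtension: "mlmodelc") else {
        return nil
    }
    let model = try MLModel(contentsOf: url, configuration: config)
    let input = try MLDictionaryFeatureProvider(dictionary: ["input": features])
    let output = try model.prediction(from: input)
    return output.featureValue(for: "label")?.stringValue
}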
For a full treatment of model size, binary impact, and performance across Apple Silicon variants, see the Core ML performance benchmarks resource.
Battery-Aware Scheduling
Continuous inference drains the battery. The correct architecture does not run the model on every keystroke or every frame — it schedules inference in response to meaningful state changes and defers background inference when battery state is low.
ProcessInfo.processInfo.isLowPowerModeEnabled surfaces the device's Low Power Mode state. Background inference tasks should check this before executing and defer when it is true.
For foreground inference triggered by user action, no deferral is needed — the user has expressed intent. The scheduling constraint applies to background and proactive inference only.
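A minimal sketch of that gate; the function name and the 20% threshold are illustrative choices, not an Apple API:

import UIKit

@MainActor
func shouldRunBackgroundInference() -> Bool {
    // Defer proactive work when the user has opted into Low Power Mode
    if ProcessInfo.processInfo.isLowPowerModeEnabled {
        return false
    }
    // Optionally skip when the battery is critically low (threshold is arbitrary)
    UIDevice.current.isBatteryMonitoringEnabled = true
    let level = UIDevice.current.batteryLevel   // -1.0 when the level is unknown
    return level < 0 || level > 0.2
}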
Integration Path 3: Writing Tools and System UI Extensions
Any UITextView or TextEditor in SwiftUI automatically participates in Apple's Writing Tools — the system-level proofreading, rewriting, and summarisation features. No integration code is required for the default behaviour.
If your app has a text editing surface where Writing Tools would be disruptive — a code editor, a structured form field, a terminal — opt out explicitly:
TextEditor(text: $content)
    .writingToolsBehavior(.disabled)
Where Writing Tools adds genuine value, the default opt-in is sufficient. The system handles the UI, the model interaction, and the text replacement. You get the feature at zero implementation cost.
Architecture Decisions That Determine Production Quality
Actor Isolation for Inference State
LanguageModelSession is not actor-isolated by default. Calling it from multiple concurrent contexts without isolation produces undefined behaviour — not a crash, which would at least be easy to catch.
The correct pattern: wrap session management in a dedicated Swift actor. All access to the session transits through that actor's serial executor.
import FoundationModels

enum InferenceError: Error {
    case notReady
}

actor InferenceEngine {
    private var session: LanguageModelSession?

    func prepare(instructions: String) {
        session = LanguageModelSession(instructions: instructions)
    }

    func respond(to prompt: String) async throws -> String {
        guard let session else { throw InferenceError.notReady }
        let response = try await session.respond(to: prompt)
        return response.content
    }
}
This is not defensive programming — it is the correct concurrency model for stateful inference. For deeper coverage of Swift 6 concurrency patterns in AI integration contexts, see the Swift 6 AI Integration guide which covers actor isolation, structured concurrency, and the specific failure modes that appear when inference state is not properly isolated.
Graceful Degradation on Unsupported Devices
Apple Intelligence is unavailable on devices predating A17 Pro. It can also be unavailable on supported hardware when the user has not enabled it or the model assets have not finished downloading. The FoundationModels framework exposes this as a runtime state: SystemLanguageModel.default.availability reports whether the on-device model can be used and, if not, why.
The architecture needs to handle this at the feature layer, not the call site. Check availability once at app launch, store the result, and gate AI-dependent UI on that stored state.
import FoundationModels

enum AIAvailability {
    case available
    case unavailable(reason: String)
}

@MainActor
func checkAIAvailability() -> AIAvailability {
    switch SystemLanguageModel.default.availability {
    case .available:
        return .available
    case .unavailable(let reason):
        // reason distinguishes unsupported hardware, Apple Intelligence
        // being switched off, and model assets still downloading
        return .unavailable(reason: String(describing: reason))
    }
}
Features that depend on Apple Intelligence surface a reduced-capability state — not an error screen. The app remains fully functional; the AI features are absent.
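A sketch of that reduced-capability state in SwiftUI, assuming the AIAvailability value from the snippet above is stored in app state and passed into a hypothetical view:

import SwiftUI

struct NoteDetailView: View {
    let availability: AIAvailability
    @State private var note = ""

    var body: some View {
        VStack {
            TextEditor(text: $note)
            // The AI action appears only when the model is available;
            // when it is not, the button is simply absent rather than an error screen
            if case .available = availability {
                Button("Summarise") {
                    // hand off to the actor-isolated inference engine
                }
            }
        }
    }
}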
Privacy Boundaries
On-device inference means the data the model processes never leaves the device. That is the privacy property — but the architecture has to enforce it.
The constraint: do not pass user data to a cloud endpoint as a fallback when on-device inference is unavailable. The fallback for unavailable Apple Intelligence is a non-AI code path, not a cloud AI call. Mixing the two produces an app that claims to be privacy-first but is not.
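A sketch of what that looks like in code, assuming the AIAvailability state and InferenceEngine actor from earlier sections; the extractive fallback is deliberately trivial:

func summary(for text: String, availability: AIAvailability, engine: InferenceEngine) async -> String {
    if case .available = availability,
       let aiSummary = try? await engine.respond(to: text) {
        return aiSummary
    }
    // Non-AI fallback: the first two sentences, computed locally; never a cloud call
    let sentences = text.split(separator: ".", omittingEmptySubsequences: true)
    return sentences.prefix(2).joined(separator: ". ")
}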
This is the distinction that matters in production. The offgrid:AI case study documents how this boundary was enforced in a fully offline AI assistant — zero bytes sent to any server, with the architecture designed from the start to make cloud fallback structurally impossible.
What Production Integration Actually Looks Like
Prototype-quality Apple Intelligence integration uses LanguageModelSession directly in a view model, creates a new session per request, and does not handle device availability. It works on a current device in a demo.
Production-quality integration has a different structure:
- Availability is checked at launch and stored in app state
- Session lifecycle is managed by an actor-isolated engine, not a view model
- Inference is scheduled with awareness of battery state for background tasks
- The fallback code path is a first-class feature, not an afterthought
- Custom Core ML models handle domain-specific tasks where the general model is insufficient
The CalmLedger privacy-first AI case study covers how these decisions were applied in a health data context where the privacy boundary was a hard product requirement. For teams evaluating their existing codebase before adding Apple Intelligence, the AI-Native iOS Checklist covers the readiness criteria across architecture, device targeting, and privacy model.