Apple Intelligence Integration in iOS Apps: A 2026 Implementation Guide
Production-quality Apple Intelligence integration for iOS apps in 2026. Covers FoundationModels session lifecycle, streaming, guided generation, Core ML for domain-specific tasks, actor isolation, graceful degradation on older devices, and privacy boundaries.
For most of iOS development history, adding AI to an app meant adding a network dependency. You sent data to a cloud endpoint, waited for a response, and handled the latency and failure modes that came with it.
Apple Intelligence changes that premise. Inference runs on the device. Data never leaves. Latency drops from hundreds of milliseconds to single digits. That is not a performance optimisation — it is an architectural category change. An app built around on-device inference has different failure modes, different privacy properties, and different design constraints than one built around cloud API calls.
This guide covers how to integrate Apple Intelligence into a production iOS app in 2026: which APIs to use, where the real constraints are, and what separates prototype-quality integration from production-quality architecture.
What Apple Intelligence Actually Is (Architecturally)
Apple Intelligence is not a single API. It is a set of system-level capabilities exposed through distinct frameworks, each with its own integration surface.
The three primary integration paths in 2026:
- Apple Foundation Models — direct access to the on-device language model via a Swift API, introduced at WWDC 2025
- Core ML — the inference engine for custom .mlpackage models, including third-party and fine-tuned models
- Writing Tools and system UI extensions — opt-in integration with Apple's system-level text processing features
Each path serves a different integration scenario. Choosing the wrong one creates unnecessary complexity. For a direct comparison of the two inference approaches, Foundation Models and Core ML, see the Foundation Models vs Core ML breakdown.
The Constraint That Shapes Every Decision
Apple Intelligence requires Apple Silicon. On iPhone, that means A17 Pro or later. On iPad and Mac, M-series chips. Older devices do not support it.
Every architectural decision flows from that constraint.
An app that assumes Apple Intelligence availability will fail silently — or crash — on a significant portion of the installed base. Graceful degradation is not optional. It is the first design requirement.
The second constraint: the on-device model is general-purpose. It is not fine-tuned for your domain. For tasks requiring domain-specific knowledge or specialised classification, a custom Core ML model will outperform the Foundation Models API. See the On-Device AI iOS Core ML implementation guide for the custom model path.
Integration Path 1: Apple Foundation Models
The FoundationModels framework gives you direct programmatic access to the on-device language model. No API key. No network request. No data leaving the device.
The entry point is LanguageModelSession. You create a session, optionally configure it with a system prompt, and send prompts. The model responds.
Session Lifecycle
LanguageModelSession is not cheap to initialise. The model loads into memory on first use — on current hardware, that takes 200–400ms. Creating a new session per request wastes that time on every call.
The correct pattern: create one session per logical conversation or task context, reuse it for the duration of that context, and release it when the context ends.
import FoundationModels
import SwiftUI

enum SummaryError: Error {
    case sessionNotReady
}

@MainActor
final class SummaryViewModel: ObservableObject {
    private var session: LanguageModelSession?

    func prepareSession() {
        // System prompt scopes the model's behaviour for this context
        let instructions = "Summarise the provided text concisely. Return plain text only."
        session = LanguageModelSession(instructions: instructions)
    }

    func summarise(_ text: String) async throws -> String {
        guard let session else { throw SummaryError.sessionNotReady }
        let response = try await session.respond(to: text)
        return response.content
    }
}
Streaming Responses with AsyncStream
For longer outputs, waiting for the full response before updating the UI produces a poor experience. The Foundation Models API supports streaming responses; wrapping the stream in an AsyncStream<String> gives the UI layer a plain, non-throwing sequence of partial results to consume.
func streamSummary(_ text: String) -> AsyncStream<String> {
    AsyncStream { continuation in
        let task = Task {
            guard let session = self.session else {
                continuation.finish()
                return
            }
            do {
                // Each element is the latest partial output from the model
                for try await partial in session.streamResponse(to: text) {
                    continuation.yield(partial)
                }
                continuation.finish()
            } catch {
                continuation.finish()
            }
        }
        // Stop generating if the consumer stops iterating
        continuation.onTermination = { _ in task.cancel() }
    }
}
Each yield delivers the latest partial output. The view replaces its text with each update, so the user sees output appearing as it is generated, not after a blank wait.
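A minimal consumption sketch, assuming streamSummary(_:) lives on the SummaryViewModel from the previous section and that a hypothetical SummaryView drives it:

import SwiftUI

struct SummaryView: View {
    @StateObject private var viewModel = SummaryViewModel()
    @State private var summary = ""
    let article: String

    var body: some View {
        Text(summary)
            .task {
                viewModel.prepareSession()
                for await partial in viewModel.streamSummary(article) {
                    summary = partial   // Replace with the latest partial output
                }
            }
    }
}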
Guided Generation and Structured Output
Free-form text output is rarely what a production app needs. If the model needs to return structured data — a JSON object, a classification label, a ranked list — use guided generation.
The Foundation Models framework supports constrained decoding through guided generation: you define the output schema as a Swift type, and the model is constrained to produce output that conforms to it. This eliminates the parsing fragility that comes from prompting for JSON and hoping the model complies.
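A minimal sketch of a guided generation call; the ArticleSummary type and its fields are hypothetical, chosen only to illustrate the shape of the API:

import FoundationModels

// Hypothetical schema; the model's output is constrained to match it
@Generable
struct ArticleSummary {
    @Guide(description: "A one-sentence summary of the article")
    var headline: String

    @Guide(description: "Three to five key points from the article")
    var keyPoints: [String]
}

func summariseStructured(_ text: String, using session: LanguageModelSession) async throws -> ArticleSummary {
    // respond(to:generating:) constrains decoding to the ArticleSummary schema
    let response = try await session.respond(to: text, generating: ArticleSummary.self)
    return response.content
}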
Integration Path 2: Core ML for Custom Inference
The Foundation Models API is general-purpose. For domain-specific tasks — sentiment classification, named entity recognition, image labelling, anomaly detection — a custom Core ML model is the right tool.
Core ML runs inference entirely on-device using the Neural Engine. On Apple Silicon, inference on a quantised classification model runs in under 10ms — fast enough to run synchronously in response to user input without perceptible delay.
Model Selection and Quantisation
The constraint here is model size relative to memory budget. A 7B parameter model in full precision does not fit in the memory envelope available to a foreground app. Quantisation is the mechanism that makes on-device inference practical.
The coremltools Python library handles conversion and quantisation from PyTorch or TensorFlow. For most classification and embedding tasks, 4-bit or 8-bit quantisation produces negligible accuracy loss with a 4–8x reduction in model size.
import coremltools as ct

# traced_model: a PyTorch model already traced with torch.jit.trace (not shown)
# Convert to an ML Program package for on-device deployment
model = ct.convert(
    traced_model,
    inputs=[ct.TensorType(name="input", shape=(1, 512))],
    compute_units=ct.ComputeUnit.ALL,  # Uses Neural Engine where available
)

# Apply 8-bit weight quantisation
op_config = ct.optimize.coreml.OpLinearQuantizerConfig(mode="linear_symmetric")
config = ct.optimize.coreml.OptimizationConfig(global_config=op_config)
compressed_model = ct.optimize.coreml.linear_quantize_weights(model, config=config)
compressed_model.save("Classifier.mlpackage")
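On the app side, the compiled package is loaded through Core ML's standard APIs. A minimal sketch, assuming the Classifier.mlpackage above has been added to the app target (Xcode compiles it into the bundle as Classifier.mlmodelc) and exposes a single "input" feature plus a string-valued "label" output; both feature names are illustrative:

import CoreML
import Foundation

func classify(_ features: MLMultiArray) throws -> String? {
    let config = MLModelConfiguration()
    config.computeUnits = .all   // Prefer the Neural Engine where available

    // Xcode compiles Classifier.mlpackage into Classifier.mlmodelc inside the bundle
    guard let url = Bundle.main.url(forResource: "Classifier", withExtension: "mlmodelc") else {
        return nil
    }
    let model = try MLModel(contentsOf: url, configuration: config)
    let input = try MLDictionaryFeatureProvider(dictionary: ["input": features])
    let output = try model.prediction(from: input)
    return output.featureValue(for: "label")?.stringValue
}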
For a full treatment of model size, binary impact, and performance across Apple Silicon variants, see the Core ML performance benchmarks resource.
Battery-Aware Scheduling
Continuous inference drains the battery. The correct architecture does not run the model on every keystroke or every frame — it schedules inference in response to meaningful state changes and defers background inference when battery state is low.
ProcessInfo.processInfo.isLowPowerModeEnabled surfaces the device's Low Power Mode state. Background inference tasks should check this before executing and defer when it is true.
For foreground inference triggered by user action, no deferral is needed — the user has expressed intent. The scheduling constraint applies to background and proactive inference only.
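A minimal sketch of that gate; the function name and the 20% threshold are illustrative choices, not an Apple API:

import UIKit

@MainActor
func shouldRunBackgroundInference() -> Bool {
    // Defer proactive work when the user has opted into Low Power Mode
    if ProcessInfo.processInfo.isLowPowerModeEnabled {
        return false
    }
    // Optionally skip when the battery is critically low (threshold is arbitrary)
    UIDevice.current.isBatteryMonitoringEnabled = true
    let level = UIDevice.current.batteryLevel   // -1.0 when the level is unknown
    return level < 0 || level > 0.2
}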
Integration Path 3: Writing Tools and System UI Extensions
Any UITextView or TextEditor in SwiftUI automatically participates in Apple's Writing Tools — the system-level proofreading, rewriting, and summarisation features. No integration code is required for the default behaviour.
If your app has a text editing surface where Writing Tools would be disruptive — a code editor, a structured form field, a terminal — opt out explicitly:
TextEditor(text: $content)
    .writingToolsBehavior(.disabled)
Where Writing Tools adds genuine value, the default opt-in is sufficient. The system handles the UI, the model interaction, and the text replacement. You get the feature at zero implementation cost.
Architecture Decisions That Determine Production Quality
Actor Isolation for Inference State
LanguageModelSession is not actor-isolated by default. Calling it from multiple concurrent contexts without isolation produces undefined behaviour — not a crash, which would at least be easy to catch.
The correct pattern: wrap session management in a dedicated Swift actor. All access to the session transits through that actor's serial executor.
import FoundationModels

enum InferenceError: Error {
    case notReady
}

actor InferenceEngine {
    private var session: LanguageModelSession?

    func prepare(instructions: String) {
        session = LanguageModelSession(instructions: instructions)
    }

    func respond(to prompt: String) async throws -> String {
        guard let session else { throw InferenceError.notReady }
        let response = try await session.respond(to: prompt)
        return response.content
    }
}
This is not defensive programming — it is the correct concurrency model for stateful inference. For deeper coverage of Swift 6 concurrency patterns in AI integration contexts, see the Swift 6 AI Integration guide which covers actor isolation, structured concurrency, and the specific failure modes that appear when inference state is not properly isolated.
Graceful Degradation on Unsupported Devices
Apple Intelligence is unavailable on devices predating A17 Pro. It can also be unavailable on supported hardware when the user has not enabled it or the model assets have not finished downloading. The FoundationModels framework exposes this as a runtime state: SystemLanguageModel.default.availability reports whether the on-device model can be used and, if not, why.
The architecture needs to handle this at the feature layer, not the call site. Check availability once at app launch, store the result, and gate AI-dependent UI on that stored state.
import FoundationModels

enum AIAvailability {
    case available
    case unavailable(reason: String)
}

@MainActor
func checkAIAvailability() -> AIAvailability {
    switch SystemLanguageModel.default.availability {
    case .available:
        return .available
    case .unavailable(let reason):
        // reason distinguishes unsupported hardware, Apple Intelligence
        // being switched off, and model assets still downloading
        return .unavailable(reason: String(describing: reason))
    }
}
Features that depend on Apple Intelligence surface a reduced-capability state — not an error screen. The app remains fully functional; the AI features are absent.
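A sketch of that reduced-capability state in SwiftUI, assuming the AIAvailability value from the snippet above is stored in app state and passed into a hypothetical view:

import SwiftUI

struct NoteDetailView: View {
    let availability: AIAvailability
    @State private var note = ""

    var body: some View {
        VStack {
            TextEditor(text: $note)
            // The AI action appears only when the model is available;
            // when it is not, the button is simply absent rather than an error screen
            if case .available = availability {
                Button("Summarise") {
                    // hand off to the actor-isolated inference engine
                }
            }
        }
    }
}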
Privacy Boundaries
On-device inference means the data the model processes never leaves the device. That is the privacy property — but the architecture has to enforce it.
The constraint: do not pass user data to a cloud endpoint as a fallback when on-device inference is unavailable. The fallback for unavailable Apple Intelligence is a non-AI code path, not a cloud AI call. Mixing the two produces an app that claims to be privacy-first but is not.
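A sketch of what that looks like in code, assuming the AIAvailability state and InferenceEngine actor from earlier sections; the extractive fallback is deliberately trivial:

func summary(for text: String, availability: AIAvailability, engine: InferenceEngine) async -> String {
    if case .available = availability,
       let aiSummary = try? await engine.respond(to: text) {
        return aiSummary
    }
    // Non-AI fallback: the first two sentences, computed locally; never a cloud call
    let sentences = text.split(separator: ".", omittingEmptySubsequences: true)
    return sentences.prefix(2).joined(separator: ". ")
}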
This is the distinction that matters in production. The offgrid:AI case study documents how this boundary was enforced in a fully offline AI assistant — zero bytes sent to any server, with the architecture designed from the start to make cloud fallback structurally impossible.
What Production Integration Actually Looks Like
Prototype-quality Apple Intelligence integration uses LanguageModelSession directly in a view model, creates a new session per request, and does not handle device availability. It works on a current device in a demo.
Production-quality integration has a different structure:
- Availability is checked at launch and stored in app state
- Session lifecycle is managed by an actor-isolated engine, not a view model
- Inference is scheduled with awareness of battery state for background tasks
- The fallback code path is a first-class feature, not an afterthought
- Custom Core ML models handle domain-specific tasks where the general model is insufficient
The CalmLedger privacy-first AI case study covers how these decisions were applied in a health data context where the privacy boundary was a hard product requirement. For teams evaluating their existing codebase before adding Apple Intelligence, the AI-Native iOS Checklist covers the readiness criteria across architecture, device targeting, and privacy model.