On-Device AI Integration Guides
Implementation patterns for Core ML and Apple Foundation Models — from loading a model correctly to structured generation, streaming inference, compute unit optimization, and privacy compliance.
By Ehsan Azish · 3NSOFTS · March 2026
The Apple ML stack has two layers: Core ML for model execution (classification, detection, any .mlmodel), and Foundation Models for on-device LLM inference (iOS 26+). Most apps need to understand both to make the right architecture choice. These guides cover the patterns that produce production-grade AI features — not demo-grade prototypes.
Core ML — Use when:
- Image or video classification, object detection
- Custom model from PyTorch/TensorFlow/ONNX
- iOS 15+ deployment target required
- Structured prediction (tabular, audio, text)
Foundation Models — Use when:
- Language generation, summarization, classification
- Structured output from natural language
- iOS 26+ deployment target acceptable
- No model training or conversion required
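In practice, the choice often reduces to an availability check at the call site. A minimal sketch of that routing, where `summarizeWithFoundationModels` and `summarizeWithCoreML` are hypothetical placeholders for the two inference paths:

```swift
import Foundation

// Hypothetical stand-ins for the two inference paths —
// substitute your own Foundation Models / Core ML code.
func summarizeWithFoundationModels(_ text: String) async throws -> String { "fm: \(text)" }
func summarizeWithCoreML(_ text: String) async throws -> String { "coreml: \(text)" }

// Route by OS availability: Foundation Models on iOS 26+,
// a bundled Core ML model everywhere else (back to iOS 15).
func summarize(_ text: String) async throws -> String {
    if #available(iOS 26, *) {
        return try await summarizeWithFoundationModels(text)
    } else {
        return try await summarizeWithCoreML(text)
    }
}
```

Keeping both paths behind one async function means the rest of the app does not care which framework produced the result.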
Core ML: The Right Way to Load and Run a Model
Intermediate

Core ML model loading is expensive (50–500 ms). The naive pattern — instantiating MLModel at the point of use — blocks the calling thread and reloads the model on every call. The correct pattern uses lazy initialization inside an actor.
actor InferenceService {
    // Lazy initialization: model loads once, on first use
    private lazy var model: SentimentClassifier = {
        let config = MLModelConfiguration()
        config.computeUnits = .cpuAndNeuralEngine // prefer the ANE for low-latency inference
        // Bundled, compile-time-known model: a load failure here is a programmer error
        return try! SentimentClassifier(configuration: config)
    }()

    func classify(text: String) async throws -> String {
        let input = SentimentClassifierInput(text: text)
        // prediction() is non-blocking from the caller's perspective
        // MLModelConfiguration.computeUnits controls hardware dispatch
        let output = try await model.prediction(input: input)
        return output.label
    }
}

- Use .cpuAndNeuralEngine for latency-sensitive real-time inference.
- Use .all for throughput-heavy background workloads.
- Use .cpuOnly only when debugging — it disables the ANE and is 5–10× slower.
- The async prediction() API (iOS 16+) eliminates manual DispatchQueue management.
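One way to consume the actor from SwiftUI (a sketch; the `SentimentView` name and the stored shared instance are illustrative, not part of the pattern above):

```swift
import SwiftUI

struct SentimentView: View {
    // One long-lived service instance so the lazily-loaded model is reused
    static let service = InferenceService()

    @State private var label = "…"
    let reviewText: String

    var body: some View {
        Text(label)
            .task {
                // The first call pays the model-load cost; later calls reuse the model
                label = (try? await Self.service.classify(text: reviewText)) ?? "error"
            }
    }
}
```

Because the service is an actor, concurrent views can call `classify` without racing on the lazy model load.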
Apple Foundation Models: Structured Generation with Generable
Intermediate

Apple Foundation Models (iOS 26+) provides on-device LLM inference via a high-level Swift API. The Generable protocol constrains output to a defined type, eliminating the need to parse freeform text — the correct pattern for structured AI responses.
import FoundationModels
// Define the output shape using @Generable
@Generable
struct ProductSummary {
    @Guide(description: "One-sentence product description, 100 characters max")
    var summary: String

    @Guide(description: "Primary product category")
    var category: String

    @Guide(description: "Sentiment: positive, neutral, or negative")
    var sentiment: String
}
// Session is a conversational context — reuse across turns
let session = LanguageModelSession()
// Structured generation — output is already a typed Swift struct
let result: ProductSummary = try await session.respond(
    to: "Summarize this product: \(productDescription)",
    generating: ProductSummary.self
)
print(result.summary) // "Lightweight iOS keyboard for code snippets"
print(result.category) // "Developer Tools"
print(result.sentiment) // "positive"

- Foundation Models requires iOS 26+ and an A17 Pro chip or Apple Silicon.
- @Generable generates both the Swift type and the JSON schema used for constrained decoding.
- @Guide annotations steer the model's output without requiring prompt engineering.
- For streaming text output, use streamResponse(to:) with for-await-in.
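Since the framework is hardware-gated, it is worth checking model availability before creating a session. A sketch assuming the `SystemLanguageModel.default.availability` API as introduced with the framework (verify the exact case names against current documentation):

```swift
import FoundationModels

// Gate session creation on model availability: the model can be absent on
// ineligible hardware, with Apple Intelligence disabled, or while downloading.
func makeSessionIfAvailable() -> LanguageModelSession? {
    switch SystemLanguageModel.default.availability {
    case .available:
        return LanguageModelSession()
    case .unavailable(let reason):
        // Surface the reason so the UI can fall back or explain the state
        print("Foundation Models unavailable: \(reason)")
        return nil
    }
}
```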
Streaming Inference with AsyncStream
Intermediate

Foundation Models supports token-by-token streaming via an AsyncSequence. Bridging this to SwiftUI's .task modifier produces a progressively-updating UI without blocking.
// Stream tokens from Foundation Models
func stream(prompt: String) -> AsyncThrowingStream<String, Error> {
    AsyncThrowingStream { continuation in
        Task {
            do {
                let session = LanguageModelSession()
                // streamResponse returns an AsyncSequence of partial strings
                for try await partial in session.streamResponse(to: prompt) {
                    continuation.yield(partial)
                }
                continuation.finish()
            } catch {
                continuation.finish(throwing: error)
            }
        }
    }
}
// SwiftUI view consuming the stream
struct StreamView: View {
    @State private var text = ""
    @State private var error: Error?
    let prompt: String

    var body: some View {
        ScrollView {
            Text(text)
                .frame(maxWidth: .infinity, alignment: .leading)
        }
        .task(id: prompt) {
            // Task is automatically cancelled if the view disappears
            // or if prompt changes
            do {
                for try await token in stream(prompt: prompt) {
                    text += token
                }
            } catch {
                self.error = error
            }
        }
    }
}

- Use AsyncThrowingStream when the stream can throw (Foundation Models can fail on unsupported devices).
- The .task(id:) modifier cancels and restarts the task when the id value changes — correct for prompt changes.
- SwiftUI runs .task closures on the @MainActor automatically — no need to dispatch UI updates.
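The bridging pattern itself is independent of Foundation Models. A minimal, framework-free version with a mock token source (`mockTokens` is a stand-in for `session.streamResponse(to:)`) shows the mechanics:

```swift
import Foundation

// Mock token source standing in for session.streamResponse(to:)
func mockTokens() -> AsyncThrowingStream<String, Error> {
    AsyncThrowingStream { continuation in
        Task {
            for token in ["On-", "device ", "inference"] {
                continuation.yield(token)
            }
            continuation.finish()
        }
    }
}

// Consumption mirrors the SwiftUI .task body: accumulate partials in order.
func collect() async throws -> String {
    var text = ""
    for try await token in mockTokens() {
        text += token
    }
    return text
}
```

Swapping the mock for a real session changes nothing on the consumer side, which makes the streaming UI easy to unit-test.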
Compute Unit Selection for Core ML Performance
Advanced

MLModelConfiguration.computeUnits controls which hardware executes inference. Choosing the wrong option is the most common source of Core ML performance problems — the difference between .cpuOnly and .cpuAndNeuralEngine is 10× on some workloads.
// Create a configuration targeting the Neural Engine
let config = MLModelConfiguration()
// .cpuAndNeuralEngine: best for real-time inference (typical production choice)
// .all: best for throughput-heavy batch processing
// .cpuAndGPU: fallback when custom ops are not ANE-compatible
// .cpuOnly: debug only — disables hardware acceleration entirely
config.computeUnits = .cpuAndNeuralEngine
// Load the model with the configuration
let model = try MyModel(configuration: config)
// Profile inference time in a benchmark loop
let iterations = 50
let start = CFAbsoluteTimeGetCurrent()
for _ in 0..<iterations {
    _ = try await model.prediction(input: testInput)
}
let meanSeconds = (CFAbsoluteTimeGetCurrent() - start) / Double(iterations)
print(String(format: "Mean inference: %.1f ms", meanSeconds * 1000))
// Use Core ML Performance Report in Xcode for automated profiling
// Product → Profile → Core ML Performance

- Profile on device, not in the Simulator — the ANE is not emulated.
- Models with custom operators may not be ANE-compatible; use .cpuAndGPU as a fallback.
- The Core ML Performance Report in Xcode (Product → Profile) measures inference time per compute unit.
- A 4-bit quantized 3B-parameter model runs in ~80ms on A17 Pro with .cpuAndNeuralEngine.
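The benchmark loop above generalizes to any async operation. A framework-free sketch of the same measurement (in practice the closure would wrap `model.prediction(input:)`; `meanLatencyMS` is an illustrative helper, not a Core ML API):

```swift
import Foundation

// Measure mean wall-clock latency (ms) of an async operation over N iterations.
func meanLatencyMS(iterations: Int, operation: () async throws -> Void) async rethrows -> Double {
    let start = Date()
    for _ in 0..<iterations {
        try await operation()
    }
    // Total elapsed time divided by iteration count, converted to milliseconds
    return Date().timeIntervalSince(start) / Double(iterations) * 1000
}
```

Running it once per computeUnits option on device gives a direct comparison of ANE, GPU, and CPU dispatch for your specific model.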
Privacy-First AI: What Stays On-Device
Intermediate

On-device AI does not automatically comply with GDPR or the App Store privacy label requirements. Compliance depends on what data the model processes, how results are stored, and whether any analytics leave the device.
// App Store privacy nutrition label requirements for AI features,
// declared via NSPrivacyCollectedDataTypes in PrivacyInfo.xcprivacy
// Items that require disclosure even with on-device processing:
// - User's text input fed to the model
// - Camera frames processed by Vision
// - Health data fed to Core ML
// Example: PrivacyInfo.xcprivacy
/*
<key>NSPrivacyCollectedDataTypes</key>
<array>
  <dict>
    <!-- User text processed locally — not sent to servers -->
    <key>NSPrivacyCollectedDataType</key>
    <string>NSPrivacyCollectedDataTypeUserContent</string>
    <key>NSPrivacyCollectedDataTypeLinked</key>
    <false/>
    <key>NSPrivacyCollectedDataTypeTracking</key>
    <false/>
    <key>NSPrivacyCollectedDataTypePurposes</key>
    <array>
      <string>NSPrivacyCollectedDataTypePurposeAppFunctionality</string>
    </array>
  </dict>
</array>
*/
// Runtime validation: ensure no network calls during inference
// Use Charles Proxy / Proxyman to verify zero network traffic
// while Core ML or Foundation Models inference is active

- GDPR still applies to on-device data if outputs are stored in a user-linked database.
- The App Store privacy nutrition label requires disclosure of processed data categories even if they are not transmitted.
- Foundation Models: Apple's on-device model never sends input to Apple servers. This is verifiable in the framework documentation.
- Core ML models that call out to cloud endpoints (some NLP pipeline models) are not truly on-device.