How to Integrate Core ML Into an Existing iOS App: A Production Checklist for 2026
Integrating Core ML into an app already in production is a different problem from building AI-native from scratch. This checklist covers what actually matters when retrofitting Core ML into an existing iOS app: from model selection through battery-aware scheduling to App Store submission.
Before you write a line of code
Audit your data flow first
The most common integration failure is not a Core ML API problem — it is a data pipeline problem. Inference is only as useful as the data fed into it.
Before touching the model, map the path from user action to model input:
- What type does the input arrive as? (String, CVPixelBuffer, MLMultiArray)
- Where does the conversion happen, and on which thread?
- Is the input always available, or can it be nil at the call site?
If the input requires transformation — text tokenization, image resizing, feature normalization — that transformation code needs to live somewhere explicit. Burying it inside a view model is the wrong choice. It belongs in a dedicated input-preparation layer that the model consumer never has to think about.
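One way to make that layer explicit is a small type that owns every transformation and returns either model-ready input or nil. The sketch below is illustrative only: `TextInputPreparer`, `PreparedText`, and the 512-character cap are assumptions for this example, not part of any Apple API.

```swift
import Foundation

/// Hypothetical model-ready text input produced by the preparation layer.
struct PreparedText: Equatable, Sendable {
    let text: String
}

/// Hypothetical input-preparation layer: every transformation lives here,
/// so the model consumer never sees raw user input.
struct TextInputPreparer: Sendable {
    /// Assumed maximum input length in characters; tune to your model.
    let maxLength: Int

    init(maxLength: Int = 512) {
        self.maxLength = maxLength
    }

    /// Returns nil when there is nothing usable to classify.
    func prepare(_ raw: String?) -> PreparedText? {
        guard let raw else { return nil }
        // Collapse runs of whitespace and newlines into single spaces.
        let collapsed = raw
            .components(separatedBy: .whitespacesAndNewlines)
            .filter { !$0.isEmpty }
            .joined(separator: " ")
        guard !collapsed.isEmpty else { return nil }
        // Truncate, then lowercase (assumes a case-insensitive model).
        return PreparedText(text: String(collapsed.prefix(maxLength)).lowercased())
    }
}
```

The call site then deals with one optional value instead of scattered nil checks and ad hoc string handling.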
Choose the right model format
In 2026, on-device models ship as .mlmodel or .mlpackage files, and Apple's on-device language model is reachable through the separate FoundationModels framework. The choice is not cosmetic.
- .mlpackage is the current model package format. It stores weights separately from the model description and is required for the ML Program model type that recent coremltools versions produce.
- FoundationModels (available on Apple Intelligence-capable devices running iOS 26 or later) gives access to Apple's on-device language model without bundling any weights. The model runs entirely on device: no download, no model management, no size impact on your .ipa.
- .mlmodel (legacy format) still compiles and runs, but lacks the improvements in .mlpackage. Migrate if you have not already.
For text classification, summarization, or structured extraction, FoundationModels is the clear choice on devices that support it: zero bundle size impact, zero cloud exposure, and Apple manages the model lifecycle.
The integration checklist
1. Isolate the model behind a protocol
Never call MLModel directly from a view or view model. Define a protocol that describes what the feature needs from inference, then implement it with the actual Core ML call.
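    import CoreML

    // Abstraction the rest of the app depends on.
    protocol ClassificationService {
        func classify(_ input: String) async throws -> ClassificationResult
    }

    // MyClassifier is the Xcode-generated model class; ClassificationResult is the app's own type.
    actor CoreMLClassifier: ClassificationService {
        private let model: MyClassifier

        init(configuration: MLModelConfiguration = MLModelConfiguration()) throws {
            model = try MyClassifier(configuration: configuration)
        }

        func classify(_ input: String) async throws -> ClassificationResult {
            let features = try MLDictionaryFeatureProvider(dictionary: ["text": input])
            // The generated class exposes the underlying MLModel as `model`.
            let output = try model.model.prediction(from: features)
            return ClassificationResult(output)
        }
    }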
The view model depends on ClassificationService, not on CoreMLClassifier. Swapping the model, mocking in tests, and isolating threading all become straightforward.
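Because the view model only sees the protocol, a test can substitute a stub with no Core ML dependency. A minimal sketch, assuming a simplified `ClassificationResult` with a single `label` field and hypothetical `StubClassifier` and `FeedbackViewModel` types (the protocol is redeclared here, with a `Sendable` requirement added, so the example stands alone):

```swift
/// Simplified stand-in for the app's result type (assumption: one label field).
struct ClassificationResult: Sendable, Equatable {
    let label: String
}

protocol ClassificationService: Sendable {
    func classify(_ input: String) async throws -> ClassificationResult
}

/// Hypothetical test double: returns a canned result, no model involved.
struct StubClassifier: ClassificationService {
    let cannedLabel: String

    func classify(_ input: String) async throws -> ClassificationResult {
        ClassificationResult(label: cannedLabel)
    }
}

/// Hypothetical view model that depends only on the protocol.
@MainActor
final class FeedbackViewModel {
    private let classifier: any ClassificationService
    private(set) var label: String = ""

    init(classifier: any ClassificationService) {
        self.classifier = classifier
    }

    func submit(_ text: String) async {
        // On failure, keep the previous label rather than crash.
        if let result = try? await classifier.classify(text) {
            label = result.label
        }
    }
}
```

In production the same view model would receive the Core ML backed implementation; nothing in it changes.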
2. Run inference on a background actor
MLModel.prediction(from:) is synchronous and CPU/GPU/Neural Engine-bound. Calling it on the main actor blocks the UI. This is not a theoretical concern — on older A-series devices, even a small image classification model can take 15–40ms on first call.
Declare your inference type as an actor. Swift's concurrency model handles the rest: the actor serializes access, inference runs off the main thread, and results surface back to the UI via await.
3. Handle model loading separately from inference
MLModel(contentsOf:) is expensive. Loading a 50MB .mlpackage takes 200–600ms on device depending on hardware. Do not load the model at the call site.
Load once — at app launch or on first feature access — and hold the instance. If the feature is rarely used, load lazily on first call and cache the result. The constraint: never block the UI thread during load.
    actor CoreMLClassifier: ClassificationService {
        private var model: MyClassifier?

        // Loads the model on first use and caches it; runs on the actor, off the main thread.
        private func loadModelIfNeeded() async throws -> MyClassifier {
            if let model { return model }
            let loaded = try MyClassifier(configuration: modelConfiguration())
            self.model = loaded
            return loaded
        }

        func classify(_ input: String) async throws -> ClassificationResult {
            let model = try await loadModelIfNeeded()
            let features = try MLDictionaryFeatureProvider(dictionary: ["text": input])
            // The generated class exposes the underlying MLModel as `model`.
            let output = try model.model.prediction(from: features)
            return ClassificationResult(output)
        }
    }
4. Implement battery-aware scheduling
The Neural Engine is efficient, but sustained inference drains battery faster than most users expect.
Two mechanisms address this:
- ProcessInfo.processInfo.isLowPowerModeEnabled: observe changes via the NSProcessInfoPowerStateDidChange notification and reduce inference frequency or disable non-critical AI features when Low Power Mode is active.
- Debounce input before triggering inference. If the user is typing, wait for 300–500ms of inactivity before calling the model. In text-input scenarios this alone can reduce inference calls by 80–90%.
    import Foundation

    // Returns true when non-critical inference should run right now.
    func shouldRunInference() -> Bool {
        // Thermal state awareness
        let thermalState = ProcessInfo.processInfo.thermalState
        guard thermalState != .critical && thermalState != .serious else {
            return false // defer inference
        }
        // Low power mode check
        guard !ProcessInfo.processInfo.isLowPowerModeEnabled else {
            return false // skip non-critical inference
        }
        return true
    }
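The debounce step described above can be sketched with task cancellation. This is one possible shape, not a prescribed API: `InferenceDebouncer` and its 400ms default are assumptions chosen to sit inside the 300–500ms window.

```swift
/// Hypothetical debouncer: runs the submitted action only after the delay
/// passes with no newer submission.
actor InferenceDebouncer {
    private let delayNanoseconds: UInt64
    private var pending: Task<Void, Never>?

    init(delayMilliseconds: UInt64 = 400) {
        self.delayNanoseconds = delayMilliseconds * 1_000_000
    }

    func submit(_ action: @escaping @Sendable () async -> Void) {
        // A newer input supersedes whatever was waiting.
        pending?.cancel()
        let delay = delayNanoseconds
        pending = Task {
            do {
                try await Task.sleep(nanoseconds: delay)
            } catch {
                return // cancelled while waiting: drop this call
            }
            await action()
        }
    }
}
```

A text field would call `submit` on every change and place the `classify` call inside the closure, so only the last keystroke of a burst reaches the model.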
5. Quantize before shipping
A full-precision Float32 model is almost never necessary on device. coremltools supports 8-bit and 4-bit quantization. The trade-off is model size and inference latency against a small accuracy delta that is often imperceptible in practice.
The production rule: quantize to Int8 first, measure accuracy on your validation set, and fall back to Float16 only if the accuracy loss is unacceptable. Int8 weights take 8 bits instead of 32, so Int8 models are typically about 4x smaller than their Float32 equivalents, and they often run faster on the Neural Engine. Measure on device rather than assume the speedup.
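To make the size and accuracy trade-off concrete, here is the arithmetic behind symmetric linear Int8 quantization on a toy weight vector. It illustrates the idea only: the real conversion happens in coremltools (Python), and `QuantizedWeights` is a hypothetical type for this example.

```swift
/// Symmetric linear Int8 quantization of a weight vector (illustration only).
struct QuantizedWeights {
    let values: [Int8]
    let scale: Float

    init(_ weights: [Float]) {
        // Map the largest magnitude to 127 so every weight fits in Int8.
        let maxMagnitude = weights.map { abs($0) }.max() ?? 0
        let scale: Float = maxMagnitude > 0 ? maxMagnitude / 127 : 1
        self.scale = scale
        self.values = weights.map { weight in
            let rounded = (weight / scale).rounded()
            return Int8(max(-127, min(127, rounded)))
        }
    }

    /// Reconstructs approximate Float weights.
    var dequantized: [Float] { values.map { Float($0) * scale } }
}

let weights: [Float] = [0.5, -1.0, 0.25, 0.0]
let quantized = QuantizedWeights(weights)

let float32Bytes = weights.count * MemoryLayout<Float>.size        // 4 weights x 4 bytes = 16
let int8Bytes = quantized.values.count * MemoryLayout<Int8>.size   // 4 weights x 1 byte = 4
let maxError = zip(weights, quantized.dequantized).map { abs($0 - $1) }.max() ?? 0

print("size ratio:", float32Bytes / int8Bytes, "max reconstruction error:", maxError)
```

The reconstruction error per weight is bounded by roughly half the scale, which is why the accuracy delta is usually small but still has to be checked against a validation set.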
6. Validate model inputs at the boundary
Core ML throws a generic NSError when input shapes or types do not match the model's expected inputs. In production, this surfaces as a crash or a silent failure — neither is acceptable.
Write an input validator that checks shape, type, and range before calling prediction(from:). This validator runs at the boundary between your data pipeline and the model. It catches mismatches during development and prevents undefined behavior in production.
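A validator of this kind can be a plain function with typed errors. The sketch below assumes a flat `[Float]` buffer plus a shape array; `TensorSpec`, `InputValidationError`, and `validate` are hypothetical names, and the expected shape and range are supplied by hand.

```swift
/// Hypothetical description of what the model expects for one tensor input.
struct TensorSpec {
    let shape: [Int]
    let validRange: ClosedRange<Float>
}

enum InputValidationError: Error, Equatable {
    case shapeMismatch(expected: [Int], actual: [Int])
    case elementCountMismatch(expected: Int, actual: Int)
    case valueOutOfRange(index: Int, value: Float)
}

/// Checks shape, element count, and value range before any prediction call.
func validate(values: [Float], shape: [Int], against spec: TensorSpec) throws {
    guard shape == spec.shape else {
        throw InputValidationError.shapeMismatch(expected: spec.shape, actual: shape)
    }
    let expectedCount = spec.shape.reduce(1, *)
    guard values.count == expectedCount else {
        throw InputValidationError.elementCountMismatch(expected: expectedCount, actual: values.count)
    }
    // NaN fails the range check too, because comparisons with NaN are false.
    if let index = values.firstIndex(where: { !spec.validRange.contains($0) }) {
        throw InputValidationError.valueOutOfRange(index: index, value: values[index])
    }
}
```

In a real integration the expected shape could be read from the loaded model's `modelDescription.inputDescriptionsByName` instead of being hard-coded, so the validator stays in sync when the model changes.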
7. Define a fallback path
Every inference call needs a fallback. The model may fail to load — device storage full, corrupted download. The input may be malformed. The device may not support the required compute unit.
Define the fallback explicitly for each feature:
- Text classification: fall back to keyword matching
- Image analysis: surface an "unable to analyze" state in the UI
- Recommendation: return a default or most-recent result
The fallback is not a nice-to-have. It is the behavior your app exhibits when inference is unavailable — which will happen.
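For the text classification case, the fallback can be wired in as a decorator over the same protocol. A sketch under stated assumptions: `ClassificationResult` is simplified to a label plus a `usedFallback` flag, and `KeywordClassifier` and `FallbackClassifier` are hypothetical types written for this example.

```swift
import Foundation

/// Simplified result type (assumption) that records which path produced it.
struct ClassificationResult: Sendable, Equatable {
    let label: String
    let usedFallback: Bool
}

protocol ClassificationService: Sendable {
    func classify(_ input: String) async throws -> ClassificationResult
}

/// Hypothetical keyword matcher used when inference is unavailable.
struct KeywordClassifier: ClassificationService {
    let keywords: [String: String] // keyword -> label
    let defaultLabel: String

    func classify(_ input: String) async throws -> ClassificationResult {
        let lowered = input.lowercased()
        // Sort so the result is deterministic when several keywords match.
        let match = keywords.keys.sorted().first { lowered.contains($0) }
        let label = match.flatMap { keywords[$0] } ?? defaultLabel
        return ClassificationResult(label: label, usedFallback: true)
    }
}

/// Tries the primary (model-backed) service first, then the fallback.
struct FallbackClassifier: ClassificationService {
    let primary: any ClassificationService
    let fallback: any ClassificationService

    func classify(_ input: String) async throws -> ClassificationResult {
        do {
            return try await primary.classify(input)
        } catch {
            // Model failed to load or predict: degrade instead of failing the feature.
            return try await fallback.classify(input)
        }
    }
}
```

The `usedFallback` flag lets analytics distinguish degraded results from real inference, which is useful when judging how often the fallback actually fires in the field.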
8. Specify compute units deliberately
MLModelConfiguration.computeUnits accepts .all, .cpuOnly, .cpuAndGPU, and .cpuAndNeuralEngine. The default is .all, which lets Core ML decide.
For latency-sensitive features, .cpuAndNeuralEngine (available from iOS 16) is the right choice on A12 and later: it routes inference to the Neural Engine when the model supports it, with CPU as fallback. .all includes the GPU, which adds scheduling overhead for small models and is only worth it for large convolutional networks.
    private func modelConfiguration() -> MLModelConfiguration {
        let config = MLModelConfiguration()
        config.computeUnits = .cpuAndNeuralEngine
        return config
    }
Set this explicitly. Do not rely on the default.
9. Test on physical hardware, not simulator
The iOS Simulator does not emulate the Neural Engine. Inference on the Simulator runs on the Mac's CPU. Latency numbers from the Simulator are meaningless for production planning.
Run p95/p99 latency measurements on the oldest device you intend to support. If your minimum deployment target is iOS 16, that means an A11 device (iPhone 8 or iPhone X), the oldest iPhone chip iOS 16 supports. If you are targeting iOS 18 features, test on both an A17 Pro and an A15 to understand the performance spread.
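A small harness is enough to collect those percentiles. The sketch below uses the nearest-rank definition and wall-clock timing; `percentile` and `measureLatencies` are hypothetical helpers, and the three warm-up iterations are an assumed default so that first-call costs do not skew steady-state numbers.

```swift
import Foundation

/// Nearest-rank percentile of a non-empty sample set; `p` is in 1...100.
func percentile(_ p: Int, of samples: [Double]) -> Double {
    precondition(!samples.isEmpty && (1...100).contains(p))
    let sorted = samples.sorted()
    // Nearest-rank: ceil(p * n / 100), computed in integers (1-based rank).
    let rank = (p * sorted.count + 99) / 100
    return sorted[rank - 1]
}

/// Runs `work` repeatedly and returns per-call latency in milliseconds.
func measureLatencies(
    iterations: Int,
    warmup: Int = 3,
    _ work: () throws -> Void
) rethrows -> [Double] {
    // Warm-up calls absorb first-call costs; measure cold start separately.
    for _ in 0..<warmup { try work() }
    var samples: [Double] = []
    samples.reserveCapacity(iterations)
    for _ in 0..<iterations {
        let start = DispatchTime.now().uptimeNanoseconds
        try work()
        let end = DispatchTime.now().uptimeNanoseconds
        samples.append(Double(end - start) / 1_000_000)
    }
    return samples
}

// Usage on device (`model` and `features` are placeholders for your own objects):
// let samples = try measureLatencies(iterations: 200) {
//     _ = try model.prediction(from: features)
// }
// print("p95:", percentile(95, of: samples), "p99:", percentile(99, of: samples))
```

Because the first call is often the slowest, it is worth recording cold-start latency as its own number alongside the warmed-up p95/p99.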
10. Audit bundle size and App Store compliance
A large .mlpackage bundled directly into the app binary increases download size. Apple's App Store cellular download limit is 200MB as of 2026. A single unquantized model can consume that budget entirely.
For models over 50MB:
- Enable App Thinning in Xcode to deliver device-appropriate model variants
- Consider on-demand resources for models that are not required at first launch
- For models over 100MB, evaluate delivering via background download after install
The Swift 6 concurrency consideration
Swift 6 strict concurrency affects Core ML integration in two specific ways.
First, MLModel and its prediction types are not Sendable. Passing them across actor boundaries produces compiler warnings or errors in strict mode. The correct pattern is to hold the model inside a single actor and never pass it.
Second, inference results must cross the actor boundary to update the UI. Ensure result types are Sendable — either by making them value types (struct) or by marking them with @unchecked Sendable when you have verified thread safety manually.
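Both points fit in a few lines. In this sketch `NonSendableModel` is a hypothetical stand-in for a non-Sendable model object, so the example compiles without Core ML; the pattern is what matters: the actor owns the model, and only a Sendable value type leaves it.

```swift
/// Value type with only Sendable members: the compiler verifies the conformance.
struct ClassificationResult: Sendable {
    let label: String
    let confidence: Double
}

/// Hypothetical stand-in for a non-Sendable model object such as MLModel.
final class NonSendableModel {
    private(set) var callCount = 0

    func predict(_ text: String) -> (label: String, confidence: Double) {
        callCount += 1
        return (label: text.isEmpty ? "empty" : "text", confidence: 0.9)
    }
}

/// The actor owns the model; only the Sendable result ever crosses its boundary.
actor InferenceEngine {
    private let model = NonSendableModel()

    func classify(_ text: String) -> ClassificationResult {
        let prediction = model.predict(text)
        return ClassificationResult(label: prediction.label, confidence: prediction.confidence)
    }
}
```

Returning the raw model output instead of `ClassificationResult` is what triggers the strict-concurrency diagnostics, so converting inside the actor is usually the simplest fix.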
Model selection by use case
| Use case | Recommended approach |
|---|---|
| Text classification / sentiment | FoundationModels (Apple Foundation Models) or custom NLP .mlpackage |
| Image classification | Custom .mlpackage via VNCoreMLRequest |
| Object detection | Custom .mlpackage via VNCoreMLRequest |
| Structured extraction | FoundationModels with @Generable schema |
| Summarization (short text) | FoundationModels |
| Structured data prediction | Custom tabular .mlpackage |
| Audio classification | Custom .mlpackage via SoundAnalysis |
Use Vision for image tasks. Vision handles CVPixelBuffer creation, image orientation normalization, and output parsing for standard task types — classification, object detection, saliency. Calling MLModel directly for image inputs requires managing all of that manually. The VNCoreMLRequest + VNImageRequestHandler pipeline is the correct abstraction for image inference.
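The pipeline can be sketched as follows, assuming an already loaded image classification `MLModel`. `classifyImage` and `topLabel` are hypothetical helpers, and the 0.6 confidence threshold is an arbitrary assumption. The Vision part is wrapped in `canImport` so the file still compiles on platforms without Vision.

```swift
#if canImport(Vision) && canImport(CoreML)
import CoreML
import Vision

/// Classifies one frame with a Core ML image classifier through Vision.
func classifyImage(
    _ pixelBuffer: CVPixelBuffer,
    orientation: CGImagePropertyOrientation,
    using model: MLModel
) throws -> [(label: String, confidence: Float)] {
    let request = VNCoreMLRequest(model: try VNCoreMLModel(for: model))
    // Vision scales and crops the image to the model's input size.
    request.imageCropAndScaleOption = .centerCrop
    let handler = VNImageRequestHandler(
        cvPixelBuffer: pixelBuffer,
        orientation: orientation,
        options: [:]
    )
    try handler.perform([request])
    let observations = request.results as? [VNClassificationObservation] ?? []
    return observations.map { (label: $0.identifier, confidence: $0.confidence) }
}
#endif

/// Keeps the most confident label at or above a threshold (0.6 is an assumed default).
func topLabel(
    in results: [(label: String, confidence: Float)],
    minimumConfidence: Float = 0.6
) -> String? {
    results
        .filter { $0.confidence >= minimumConfidence }
        .max(by: { $0.confidence < $1.confidence })?.label
}
```

`perform` is synchronous, so this function belongs on the same background actor as any other inference call.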