Complete Guide · March 2026

Complete Guide to On-Device AI with Core ML and Swift

Everything you need to ship production on-device AI features in iOS and macOS apps. Covers Core ML model types, Swift 6 actor-isolated inference, privacy architecture, performance budgets, and staged rollout strategy — grounded in Apple's documented APIs.

Coverage: Beginner → Production · Read time: ~35 minutes · Code examples: 12 annotated snippets

What is Core ML?

Core ML is Apple's framework for running machine learning models on iOS, macOS, watchOS, tvOS, and visionOS devices. Models run through the Core ML runtime, which automatically dispatches computation to the Neural Engine, GPU, or CPU based on model structure, device capability, and thermal state.

The key distinction from cloud AI APIs: all inference happens locally. No request leaves the device, no user data is transmitted, and inference works offline. For regulated industries (health, finance) or privacy-sensitive features, this is not a nice-to-have — it is a structural design constraint.

Core ML sits at the execution layer. You bring a trained model (converted to .mlpackage format), and Core ML handles runtime selection, hardware acceleration, memory pressure response, and background processing policies.
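The runtime selection described above is configured per model load. A minimal sketch, assuming an Xcode-generated class named `MyModel` (a placeholder, not a real API type):

```swift
import CoreML

// `MyModel` stands in for whatever class Xcode generates from your .mlpackage.
let config = MLModelConfiguration()
config.computeUnits = .all  // let Core ML dispatch to ANE, GPU, or CPU
// Alternatives: .cpuAndNeuralEngine avoids GPU contention with rendering;
// .cpuOnly gives deterministic behavior in unit tests.
let model = try await MyModel.load(configuration: config)
```

`MLModelConfiguration` is where hardware preferences live; the actual layer-by-layer dispatch remains Core ML's decision.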

Core ML vs Foundation Models: Core ML runs any converted model you supply. Foundation Models provides access to Apple's built-in generative language model (Apple Intelligence) without requiring your app to bundle a model. Use Core ML for custom/domain-specific prediction; use Foundation Models for generative text workflows where the OS-level model is appropriate.

Core ML model types

Core ML supports multiple model categories. Choosing the right type before training saves significant conversion and optimization work later.

| Type | Use case | Typical size |
| --- | --- | --- |
| Neural Network Classifier | Image classification, text categorization | 5–200 MB |
| Neural Network Regressor | Price prediction, scoring, ranking | 5–100 MB |
| Vision ML model | Object detection, segmentation, pose | 5–50 MB |
| Natural Language model | Sentiment, NER, embeddings | 10–150 MB |
| Gradient Boosted Tree | Tabular data, feature-based decisions | <5 MB |
| Pipeline model | Multi-stage preprocessing + inference | Combined |

Model size directly impacts App Store binary size, download time, and on-device storage. Quantization using coremltools reduces size considerably: Float16 roughly halves a Float32 model, Int8 halves it again, and Int4 palettization can bring many models down to about a quarter of their original size with minimal accuracy loss for most tasks. Use the Core ML size calculator to estimate bundle impact before training.
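As a back-of-envelope check (an approximation only — real .mlpackage files add metadata and may keep some tensors unquantized), weight storage scales linearly with bit width:

```swift
// Rough estimate: size ≈ parameters × bitsPerWeight / 8 bytes.
// Illustrative arithmetic, not a Core ML API.
func estimatedModelSizeMB(parameters: Double, bitsPerWeight: Double) -> Double {
    parameters * bitsPerWeight / 8 / 1_048_576
}
```

For a 10M-parameter model this gives roughly 19 MB at Float16, 9.5 MB at Int8, and 4.8 MB at Int4.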

Integration workflow

The integration lifecycle has five phases. Most production regressions happen when teams skip phases 3 or 4.

  1. Convert and validate the model. Use coremltools to convert from PyTorch, TensorFlow, or ONNX. Validate input/output schema, expected value ranges, and label mapping before adding to Xcode.
  2. Add to Xcode and generate the interface. Drag the .mlpackage into your Xcode project. Xcode generates a typed Swift interface matching your input/output schema. Review it; any schema mismatch surfaces here.
  3. Wrap in an actor-isolated service. Never call model prediction directly from UI code. Create an actor boundary that owns model lifecycle, handles cancellation, and enforces one inference path per feature. See the Swift 6 setup chapter for a reference implementation.
  4. Benchmark cold-start and warm-path latency. Measure on the oldest device in your support matrix, not the latest iPhone. Cold-start (first prediction after app launch) is typically 3–10× slower than warm-path. Set explicit latency SLOs before writing UI code.
  5. Add fallback behavior. Every production Core ML feature needs a degraded path: a heuristic, a cached result, or silent skip. Thermal pressure, model load failures, or unsupported hardware can all prevent inference. Your UI should never block on a successful prediction.
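Step 5 can be reduced to a small generic wrapper. A sketch — `withFallback` is not a Core ML API, just an illustrative helper:

```swift
// Runs the primary async prediction; on any error returns the fallback.
// Keeps UI code free of do/catch noise around every inference call.
func withFallback<T>(
    _ fallback: @autoclosure () -> T,
    _ operation: () async throws -> T
) async -> T {
    do { return try await operation() }
    catch { return fallback() }
}
```

Usage looks like `let label = await withFallback("unknown") { try await service.predict(input).classLabel }` — the caller always gets a value, never a thrown error.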

Swift 6 integration patterns

Swift 6 strict concurrency changes how model inference must be structured. Actor isolation is not boilerplate; it is the safety boundary that prevents concurrent access to non-Sendable model objects.

Pattern 1: Single-responsibility actor

import CoreML

actor InferenceService {
    private var model: MLModel?
    private var loadTask: Task<MLModel, Error>?

    // Lazy load with deduplication — avoids double-load races
    func model() async throws -> MLModel {
        if let model { return model }
        if let existing = loadTask { return try await existing.value }
        let task = Task<MLModel, Error> {
            let config = MLModelConfiguration()
            config.computeUnits = .all
            // Generated model classes expose an async `load(configuration:)`
            return try await MyModel.load(configuration: config).model
        }
        loadTask = task
        // Clear the task even when loading throws, so a failed load can be retried
        defer { loadTask = nil }
        let result = try await task.value
        model = result
        return result

    func predict(_ input: MyModelInput) async throws -> MyModelOutput {
        let m = try await model()
        let features = try m.prediction(from: input)
        // Generated output wrappers are constructed from an MLFeatureProvider
        return MyModelOutput(features: features)
    }

    func unload() {
        model = nil
        loadTask?.cancel()
        loadTask = nil
    }
}

Pattern 2: Cancellation-aware ViewModel

@MainActor
final class ClassifierViewModel: ObservableObject {
    @Published var result: String = ""
    @Published var isLoading = false

    private let service = InferenceService()
    private var currentTask: Task<Void, Never>?

    func classify(image: CGImage) {
        currentTask?.cancel()   // Cancel any in-flight request
        currentTask = Task {
            isLoading = true
            defer { isLoading = false }
            do {
                let input = try MyModelInput(image: image)
                let output = try await service.predict(input)
                guard !Task.isCancelled else { return }
                result = output.classLabel
            } catch is CancellationError {
                // Silently dropped — UI moved on
            } catch {
                result = "Prediction unavailable"
            }
        }
    }
}

Pattern 3: Streaming inference with AsyncStream

For long-running inference or multi-step pipelines, stream partial progress to keep the UI responsive:

enum InferenceProgress {
    case preparing
    case running(progress: Double)
    case completed(String)
    case failed(Error)
}

func streamPrediction(input: Input) -> AsyncStream<InferenceProgress> {
    AsyncStream { continuation in
        let task = Task {
            continuation.yield(.preparing)
            do {
                continuation.yield(.running(progress: 0.3))
                let result = try await service.predict(input)
                continuation.yield(.running(progress: 0.9))
                continuation.yield(.completed(result.label))
            } catch {
                continuation.yield(.failed(error))
            }
            continuation.finish()
        }
        // Cancel the inference if the consumer stops iterating the stream
        continuation.onTermination = { _ in task.cancel() }
    }
}

For a full six-chapter walkthrough including testing and deployment, see the Swift 6 & AI Integration guide series.

Privacy architecture

On-device inference is privacy-preserving by default, but your surrounding architecture can still introduce violations. These are the patterns that matter for App Privacy Details declarations and GDPR compliance.

Classify data before you touch it

Label every input: is this raw user data, transformed feature, model output, or aggregate metric? Each class needs a different retention and transmission policy. Don't mix them.
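One way to make that classification explicit in code — a sketch; the enum and policy names are illustrative, not an Apple API:

```swift
// Every value touched by the inference path gets one of these labels.
enum DataClass {
    case rawUserData      // never leaves the device, never logged
    case derivedFeature   // device-only, short retention
    case modelOutput      // loggable only in redacted or aggregate form
    case aggregateMetric  // safe for telemetry
}

// Hypothetical policy: only aggregate metrics may leave the device.
func mayTransmit(_ c: DataClass) -> Bool {
    if case .aggregateMetric = c { return true }
    return false
}
```

Forcing call sites to name a `DataClass` turns the retention policy into something a code reviewer can check mechanically.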

Keep inference payloads out of analytics

The most common violation: logging ML inputs (e.g. query text, image metadata) to crash reporters or analytics pipelines. Redact before logging. Add a CI lint rule to catch accidental logging of inference inputs.
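A minimal redaction helper you might run in front of any logger (illustrative only; the blocked key names are assumptions, not a standard list):

```swift
// Drops known inference-input keys from a log payload before emission.
func redactingInferenceInputs(
    _ payload: [String: String],
    blockedKeys: Set<String> = ["query_text", "image_metadata", "ml_input"]
) -> [String: String] {
    payload.filter { !blockedKeys.contains($0.key) }
}
```

Routing every analytics call through a function like this gives the CI lint rule a single choke point to enforce.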

Separate model output from product telemetry

You can log that classification ran and how long it took without logging what the classification result was. Separate event tracking from content tracking.

Add a kill switch for telemetry

App Privacy Details and GDPR require deletion request compliance. A runtime kill switch that can disable all telemetry emission (without an app update) is necessary for regulated domains.
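A simplified sketch of such a gate (in production the flag would come from remote config; this is not the typed PrivacyGate actor from Chapter 3):

```swift
// All telemetry emission funnels through one gate that can be
// switched off at runtime without an app update.
final class TelemetryGate {
    private var enabled: Bool
    init(enabled: Bool = true) { self.enabled = enabled }

    func setEnabled(_ on: Bool) { enabled = on }

    func emit(_ event: String, using send: (String) -> Void) {
        guard enabled else { return }  // kill switch: drop silently
        send(event)
    }
}
```

The important property is that the gate sits in front of every sink, so flipping one flag is provably sufficient to stop all emission.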

For a full implementation with a typed PrivacyGate actor, see Chapter 3: Privacy-Preserving AI Architectures.

Performance budgets

Set latency budgets before you pick a model. It is much harder to compress a 400ms result into a 100ms budget after the architecture is locked.

| Feature type | Acceptable p95 | Strategy |
| --- | --- | --- |
| Inline editor suggestion | < 120 ms | Streaming tokens, cancel on keypress |
| Tap-triggered classification | < 300 ms | Preload model at launch, warm cache |
| Background enrichment | < 2 s | Low-priority Task, batched processing |
| Startup-blocking scan | Avoid entirely | Deferred load, show result asynchronously |
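These budgets can be encoded as a compile-time table so a CI performance test fails when a measurement exceeds its class. A sketch; the enum is illustrative:

```swift
enum FeatureKind {
    case inlineSuggestion, tapClassification, backgroundEnrichment

    // p95 budget in milliseconds, mirroring the table above
    var p95BudgetMillis: Double {
        switch self {
        case .inlineSuggestion: return 120
        case .tapClassification: return 300
        case .backgroundEnrichment: return 2_000
        }
    }
}

func withinBudget(_ kind: FeatureKind, measuredP95Millis: Double) -> Bool {
    measuredP95Millis < kind.p95BudgetMillis
}
```

Asserting `withinBudget` in a benchmark test on the oldest supported device makes the budget an enforced contract rather than a design-doc aspiration.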

For quantization strategy, model variant routing, and thermal-aware scheduling, see Chapter 4: Performance Optimization. To measure model size and binary impact, use the Core ML size calculator.

Production deployment

AI features need rollout discipline that most iOS apps don't apply to non-AI features. The risk profile is different: model output quality is non-deterministic, and silent degradation is harder to detect than a null-pointer crash.

Feature-flag the model path

Deploy code in every release, but gate inference activation. This decouples shipping from rollout and gives you a kill switch independent of App Store review.

Track fallback rate separately

Instrument how often inference falls back to heuristics. A rising fallback rate indicates model pressure or thermal load — it is a leading indicator before latency p99 regresses.
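A minimal counter for that instrumentation (a sketch; feed its rate into whatever telemetry pipeline you already have):

```swift
struct FallbackRateTracker {
    private(set) var total = 0
    private(set) var fallbacks = 0

    mutating func record(usedFallback: Bool) {
        total += 1
        if usedFallback { fallbacks += 1 }
    }

    // Fraction of predictions that degraded to the heuristic path
    var rate: Double { total == 0 ? 0 : Double(fallbacks) / Double(total) }
}
```

Emit the rate as an aggregate metric; no prediction content is needed to alert on it.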

Freeze model artifacts for RC

Lock the .mlpackage checksum for each release candidate. A last-minute model update without re-benchmarking is a common source of launch-day regressions.

Write App Store reviewer notes

For features using on-device inference, include a note describing what the model does and confirming no network calls are made. This reduces review confusion and rejection risk.

Full staged rollout playbook, runtime safety flags, and App Store submission checklist in Chapter 6: Production Deployment Strategies.

Frequently Asked Questions

What is the difference between Core ML and Foundation Models?

Core ML runs any converted model you bundle with your app. Foundation Models provides Apple's built-in OS-level generative model (Apple Intelligence). Use Core ML for custom/domain-specific prediction; use Foundation Models for generative text where the OS-level model is appropriate.

How large is a typical Core ML model?

A Float16 model with 10M parameters is roughly 20MB. Int8 quantization brings it to ~10MB, Int4 to ~5MB. Use the Core ML Tools Python package to measure before bundling, and the size calculator on this site to estimate binary impact.

Can Core ML run on the Neural Engine?

Yes. Set computeUnits to .all in MLModelConfiguration. Core ML routes eligible layers to the ANE automatically, falling back to GPU or CPU for unsupported operations. Profile with Instruments to confirm actual ANE utilization.

What iOS version is required?

Core ML 8 with full Swift 6 async inference support requires iOS 18.0+ and macOS 15.0+. Core ML itself is available from iOS 11+. Feature-gate new API usage behind runtime checks for older deployment targets.

Does on-device inference need internet access?

No. Core ML inference runs offline with bundled model files. No network request is made during inference, which is why on-device AI is privacy-preserving by design.

Need help implementing Core ML in your app?

The iOS AI integration service covers architecture planning, Core ML implementation, performance profiling, and App Store-safe rollout — scoped as a fixed 3–5 week engagement.