
On-Device AI Architecture Audit: What 12–20 Findings Actually Look Like

A codebase can pass compilation and basic testing while carrying structural problems across Core ML integration, Apple Foundation Models, privacy architecture, battery management, offline-first readiness, and App Store compliance simultaneously. This article documents what those 12–20 audit findings actually look like by category.

By Ehsan Azish · 3NSOFTS · June 2026 · 13 min read

Context: why Apple AI integration audits are different in 2026

Apple AI integration in 2026 is not a single API call. It spans Core ML, Apple Foundation Models, the Neural Engine scheduling layer, BackgroundTasks, thermal state management, entitlement configuration, and App Store review policy for AI-generated content. A codebase can pass compilation and basic testing while carrying structural problems across every one of those layers at once.

Typical audits surface 12–20 prioritized findings. That range is not arbitrary — it reflects the number of distinct integration points in a production Apple AI app, each of which can fail independently.


The audit scope

The audit covers six categories:

  • Core ML model integration
  • Apple Foundation Models integration
  • Privacy architecture
  • Battery and thermal management
  • Offline-first readiness
  • App Store compliance for AI features

Each category produces two to four findings on average. The distribution shifts depending on the codebase. Apps that built AI features quickly tend to accumulate more findings in threading and thermal management. Apps that were privacy-first from the start tend to have cleaner data residency posture but often carry compliance gaps around AI content disclosure.


Category 1: Core ML model integration

Model quantization and size budget

The obvious approach: ship the full-precision .mlpackage and let Core ML handle optimization at runtime. The problem: a 400MB float32 model adds 400MB to the app bundle, is not reduced by app thinning because every device receives the same model file, and loads noticeably slower on older A-series chips.

The finding here is almost always the same: quantization was not applied, or was applied without measuring the accuracy delta. The correct approach is to use coremltools to apply either 8-bit linear quantization or, where accuracy permits, 4-bit palettization. Relative to float32 weights, size reduction is typically close to 4x for 8-bit and 8x for 4-bit. The accuracy impact is model-dependent and must be measured against a held-out evaluation set before shipping.

A finding in this category reads: "Model FoodClassifier.mlpackage ships at 312MB float32. 8-bit quantization reduces this to 78MB with less than 0.3% accuracy degradation on the evaluation set. At 312MB the app exceeds the 200MB threshold above which iOS, by default, asks for confirmation before downloading over cellular, which adds friction to installation away from Wi-Fi."
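The arithmetic behind a finding like this can be sketched in a few lines of Swift. It is a rough estimate only: it assumes weights dominate the package size and ignores palettization lookup tables and metadata. estimatedSizeMB is a hypothetical helper, not a coremltools or Core ML API.

```swift
// Rough size estimate for weight-only compression, relative to float32.
// Assumption: weights dominate the package size; real results vary by model.
func estimatedSizeMB(float32SizeMB: Double, bitsPerWeight: Int) -> Double {
    float32SizeMB * Double(bitsPerWeight) / 32.0
}

let float32SizeMB = 312.0
let eightBit = estimatedSizeMB(float32SizeMB: float32SizeMB, bitsPerWeight: 8)
let fourBit = estimatedSizeMB(float32SizeMB: float32SizeMB, bitsPerWeight: 4)
print(eightBit, fourBit) // 78.0 39.0
```

The estimate only sets the size budget. The accuracy delta still has to be measured on the converted model.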

Inference pipeline threading

Core ML inference is synchronous by default. Calling model.prediction(from:) on the main thread blocks the UI. This is the most common finding in this category, and it appears in the majority of codebases that integrated AI features under time pressure.

The correct structure isolates inference work inside a Swift actor:

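import CoreML

actor InferenceEngine {
    private let model: FoodClassifier

    // Load the model once, when the engine is created.
    init(configuration: MLModelConfiguration = MLModelConfiguration()) throws {
        model = try FoodClassifier(configuration: configuration)
    }

    // Runs on the actor's executor, never on the main actor.
    // FoodClassifierInput and FoodClassifierOutput are the Xcode-generated types.
    func classify(_ image: CVPixelBuffer) async throws -> FoodClassifierOutput {
        let input = FoodClassifierInput(image: image)
        return try model.prediction(input: input)
    }
}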

The actor-isolated design guarantees classify never runs on the main actor. Callers await the result. The UI thread stays free.

A finding here reads: "Inference called on MainActor at MealLogViewModel.swift:142. p95 inference latency on iPhone 13 is 340ms — this produces a visible freeze on every classification."
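A p95 figure like the one in that finding comes from a set of on-device latency samples. The sketch below shows one way to compute it, using the nearest-rank method. percentile(_:of:) is a hypothetical helper, and the sample values are illustrative, not measurements.

```swift
// Nearest-rank percentile over latency samples (milliseconds).
// Returns nil for an empty sample set or a percentile outside (0, 100].
func percentile(_ p: Double, of samples: [Double]) -> Double? {
    guard !samples.isEmpty, p > 0, p <= 100 else { return nil }
    let sorted = samples.sorted()
    // Nearest rank: the smallest value with at least p% of samples at or below it.
    let rank = Int((p / 100 * Double(sorted.count)).rounded(.up))
    return sorted[max(rank, 1) - 1]
}

// Illustrative samples: nine fast runs and one slow outlier.
let latenciesMs: [Double] = [120, 135, 128, 340, 131, 126, 140, 133, 129, 137]
let p95 = percentile(95, of: latenciesMs)
let p50 = percentile(50, of: latenciesMs)
```

Reporting p95 rather than the mean keeps the occasional slow inference visible, and that slow inference is the one the user experiences as a freeze.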

Model versioning and update strategy

An app that ships a model with no versioning strategy has no path to update that model without a full App Store release. For simple classifiers, that is acceptable. For apps where model quality is a competitive differentiator, it is a structural problem.

The finding: no version string in the model's metadata (MLModelDescription.metadata under MLModelMetadataKey.versionString), no on-device model swap path, no fallback to a prior model version if a downloaded model fails to load. The audit documents the gap and specifies whether the app's use case warrants a dynamic model delivery strategy via BackgroundTasks and a signed model download endpoint.
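The fallback path can be sketched independently of Core ML. The example below is a minimal illustration under stated assumptions: ModelCandidate, ModelLoadError, and loadFirstUsable are hypothetical names, and the loader closure stands in for whatever the app uses to load a compiled model from disk.

```swift
// A model file on disk plus the version it claims to be (hypothetical type).
struct ModelCandidate {
    let version: String
    let path: String
}

enum ModelLoadError: Error { case noUsableModel }

// Tries candidates in the order given (newest first) and returns the first
// one that loads, so a corrupt download falls back to a prior version.
func loadFirstUsable<Model>(
    from candidates: [ModelCandidate],
    using load: (ModelCandidate) throws -> Model
) throws -> (model: Model, version: String) {
    for candidate in candidates {
        if let model = try? load(candidate) {
            return (model, candidate.version)
        }
    }
    throw ModelLoadError.noUsableModel
}

// Usage with a stub loader: the downloaded 2.1.0 model fails to load,
// so the bundled 2.0.0 model is used instead.
struct CorruptModel: Error {}
let candidates = [
    ModelCandidate(version: "2.1.0", path: "Downloads/FoodClassifier.mlmodelc"),
    ModelCandidate(version: "2.0.0", path: "Bundle/FoodClassifier.mlmodelc"),
]
let loaded = try loadFirstUsable(from: candidates) { candidate -> String in
    if candidate.version == "2.1.0" { throw CorruptModel() }
    return "model at \(candidate.path)"
}
```

Keeping the bundled model as the last candidate means the app always has a working classifier, even if every downloaded version is unusable.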


Category 2: Apple Foundation Models integration

Availability gating

Apple Foundation Models requires Apple Intelligence to be enabled on the device. Apple Intelligence is available on iPhone 15 Pro, iPhone 15 Pro Max, and the iPhone 16 family and later, on iPads with an M-series or A17 Pro chip, and on M-series Macs, but it can be disabled by the user, restricted by MDM policy, or unavailable in certain regions.

The most common finding: the app calls LanguageModelSession without first checking SystemLanguageModel.default.availability. On a device where Apple Intelligence is unavailable, this fails with an error that surfaces as a crash or a silent failure depending on how the call site handles errors (try! versus try?).

The correct pattern:

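import FoundationModels

// makeSummarySession is an illustrative name; nil tells the caller to use the fallback.
func makeSummarySession() -> LanguageModelSession? {
    let model = SystemLanguageModel.default
    guard case .available = model.availability else {
        // Route to fallback: local rule-based logic or Core ML classifier
        return nil
    }
    return LanguageModelSession()
}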

A finding here reads: "No availability check before LanguageModelSession initialization at SummaryService.swift:88. App crashes on iPhone 14, where Apple Intelligence is not available."
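Not every unavailable state calls for the same fallback. The sketch below mirrors the availability states in a local enum so the routing decision can be unit tested without the framework. The case names are modeled on the framework's unavailability reasons as an assumption, and ModelAvailability, SummaryRoute, and route(for:) are hypothetical names.

```swift
// Local mirror of the availability states (assumed names, not the framework's
// own type), so routing logic is testable on any platform.
enum ModelAvailability {
    case available
    case deviceNotEligible
    case appleIntelligenceNotEnabled
    case modelNotReady
}

// Where a summary request should be sent.
enum SummaryRoute: Equatable {
    case foundationModel
    case fallbackClassifier
    case retryLater
}

func route(for availability: ModelAvailability) -> SummaryRoute {
    switch availability {
    case .available:
        return .foundationModel
    case .modelNotReady:
        // Model assets may still be downloading; try again later.
        return .retryLater
    case .deviceNotEligible, .appleIntelligenceNotEnabled:
        // No on-device language model: use rule-based logic or a Core ML classifier.
        return .fallbackClassifier
    }
}
```

A thin adapter at the call site would translate the framework's availability value into this enum, which keeps the framework dependency out of the routing tests.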

Prompt architecture

Apple Foundation Models uses a structured prompt system with Instructions and Prompt types rather than raw string interpolation. Apps that assemble one prompt string by concatenating developer instructions with user input make output harder to constrain and prompt behavior difficult to test.

The finding: prompts constructed as interpolated String values rather than typed Instructions structs. This also weakens the separation between developer instructions and user input that the framework relies on, which makes prompt injection easier. The audit flags every call site where raw string construction feeds into a LanguageModelSession.


Category 3: Privacy architecture

Data residency verification

Zero cloud exposure is a design premise — not a marketing claim. Verifying it requires more than reading the code. It requires checking every network call path that could be triggered during or after inference.

The finding pattern: analytics SDKs that capture feature vectors or inference inputs as part of event payloads. This happens when a third-party analytics framework is initialized before the inference pipeline and intercepts method calls through swizzling. The data never reaches a cloud AI endpoint, but it does leave the device through the analytics channel.
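One mitigation for payloads the app builds itself is an explicit allowlist applied before any event is handed to an analytics SDK. The sketch below is illustrative only: scrub(_:allowing:) and the key names are hypothetical, and it does not address data an SDK captures on its own through swizzling.

```swift
// Keep only explicitly approved keys in an analytics payload.
// Anything not on the allowlist, such as inference inputs, is dropped.
func scrub(_ payload: [String: String], allowing allowedKeys: Set<String>) -> [String: String] {
    payload.filter { allowedKeys.contains($0.key) }
}

let allowedKeys: Set<String> = ["event", "screen"]
let rawEvent = [
    "event": "meal_logged",
    "screen": "MealLog",
    "featureVector": "0.12,0.88,0.03", // inference input that must stay on device
]
let safeEvent = scrub(rawEvent, allowing: allowedKeys)
```

An allowlist fails closed: a newly added payload field stays on the device until someone approves it.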

Entitlement hygiene

Some on-device AI features depend on specific entitlements, for example com.apple.developer.kernel.increased-memory-limit for apps that load large models. The finding: apps requesting entitlements not required by their actual feature set, or missing entitlements for features that are deployed. Both trigger App Store review issues.


Category 4: Battery and thermal management

Inference scheduling against thermal state

The finding: inference runs continuously regardless of device thermal state. ProcessInfo.processInfo.thermalState reports one of four levels: .nominal, .fair, .serious, .critical. An app running inference in the .critical thermal state drains the battery, degrades overall system performance, and may trigger OS-level throttling.

The correct pattern checks thermal state before scheduling inference and defers or throttles when the device is constrained:

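import Foundation

// Inside an async function: pick an inference mode from the current thermal state.
switch ProcessInfo.processInfo.thermalState {
case .nominal, .fair:
    await inferenceEngine.runFullClassification(input)
case .serious:
    await inferenceEngine.runLightweightClassification(input)
case .critical:
    return // defer entirely
@unknown default:
    // Unknown future state: stay conservative rather than assume headroom.
    await inferenceEngine.runLightweightClassification(input)
}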

Background task budgeting

Background inference tasks must be registered via BGTaskScheduler with appropriate constraints. The finding: background inference running inside URLSession completion handlers or NotificationCenter observers rather than properly budgeted BGProcessingTask entries. Outside a granted background task the system can suspend the app at any point, so this produces unpredictable execution behavior, and it fails App Store review for apps in categories where background processing limits are enforced.


Category 5: Offline-first readiness

The finding pattern in this category is predictable: the AI feature assumes connectivity. Inference results are fetched from a cloud endpoint and cached — but the cache strategy does not cover cold launch, extended offline periods, or cache invalidation edge cases.

For on-device inference apps, offline readiness is architectural — the model is on the device, the inference runs locally. The audit checks that every inference call path has a defined behavior when the device is offline, and that no UI state depends on a network response for AI-driven content.


Category 6: App Store compliance for AI features

Privacy manifest completeness

Apple's PrivacyInfo.xcprivacy file must declare required reason APIs accessed by the app and its dependencies. AI apps commonly access:

  • NSPrivacyAccessedAPICategoryFileTimestamp — model file access timestamps
  • NSPrivacyAccessedAPICategoryUserDefaults — model configuration persistence
  • NSPrivacyAccessedAPICategorySystemBootTime — inference timing calculations

The finding: these accesses are present in the binary but undeclared in the privacy manifest. Apple's automated scanning catches this at submission — the result is an App Store rejection.
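A check like this can run in CI before submission. The sketch below parses a privacy manifest as a property list and reports which required categories are undeclared. missingCategories(in:required:) is a hypothetical helper, the sample manifest is illustrative, and the sketch checks category names only, not whether the declared reasons are valid.

```swift
import Foundation

// Returns the required API categories that the manifest does not declare.
func missingCategories(in manifest: Data, required: Set<String>) throws -> Set<String> {
    let plist = try PropertyListSerialization.propertyList(from: manifest, options: [], format: nil)
    guard let root = plist as? [String: Any] else { return required }
    let entries = root["NSPrivacyAccessedAPITypes"] as? [[String: Any]] ?? []
    let declared = Set(entries.compactMap { $0["NSPrivacyAccessedAPIType"] as? String })
    return required.subtracting(declared)
}

// Illustrative manifest that declares only the UserDefaults category.
let sampleManifest = """
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>NSPrivacyAccessedAPITypes</key>
    <array>
        <dict>
            <key>NSPrivacyAccessedAPIType</key>
            <string>NSPrivacyAccessedAPICategoryUserDefaults</string>
            <key>NSPrivacyAccessedAPITypeReasons</key>
            <array><string>CA92.1</string></array>
        </dict>
    </array>
</dict>
</plist>
"""

let requiredCategories: Set<String> = [
    "NSPrivacyAccessedAPICategoryFileTimestamp",
    "NSPrivacyAccessedAPICategoryUserDefaults",
    "NSPrivacyAccessedAPICategorySystemBootTime",
]
let missing = try missingCategories(in: Data(sampleManifest.utf8), required: requiredCategories)
```

The required set has to come from an inventory of what the app and its dependencies actually call; the three categories here are simply the ones listed above.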

AI-generated content disclosure

Apps that surface AI-generated content to users must disclose this in accordance with App Store Review Guideline 1.4. The finding: generated content displayed without any indicator that it is AI-generated, in categories where disclosure is required.


What the 12–20 finding range means in practice

The finding count is not a quality score. An app with 20 findings may be architecturally sound with a collection of configuration and compliance gaps. An app with 12 findings may have one structural issue that blocks production use.

Findings are prioritized by impact:

  • P0 — blocks App Store submission or causes crashes in production
  • P1 — degrades user experience measurably under real device conditions
  • P2 — technical debt that compounds as the feature set grows
  • P3 — compliance gaps and best-practice deviations

Most audits surface 2–4 P0 findings, 4–8 P1 findings, and the remainder at P2/P3. The P0 findings are what matter for shipping.