
Battery-Aware AI Scheduling in iOS Apps: Architecture Patterns for On-Device Inference

On-device AI inference is not free. Every Core ML call draws from the battery, heats the SoC, and competes with the OS scheduler. Battery-aware scheduling is not a performance optimisation you add later — it is a design constraint that shapes how inference requests enter the system.

By Ehsan Azish · 3NSOFTS · May 2026

The constraint that shapes everything

On-device inference runs on the Neural Engine, the GPU, or the CPU — depending on the model, the runtime, and device state. The Neural Engine is the most efficient path, but the OS controls access to it. When the device enters low-power mode, background processing budgets shrink. When the SoC heats up, the OS throttles clock speeds. When battery drops below a threshold, BGProcessingTask requests are deferred or denied outright.

In offgrid:AI, that constraint was explicit: inference must remain useful across the full battery range, not just when conditions are ideal. Every architectural decision flows from that. See also the complete guide to on-device AI for Apple platforms for the broader framework-selection picture.

Why naive inference scheduling fails

The naive approach is to call inference directly from the view model — user taps a button, a Task fires, the model runs. This works on a bench device. In production, it fails in three ways.

  • It gives the OS no signal about the relative importance of the work. A low-priority background classification runs at the same priority as a foreground response the user is actively waiting on.
  • There is no mechanism to defer non-urgent inference when battery or thermal conditions make running it expensive. The call either succeeds or it doesn't — no middle path.
  • Inference calls accumulate without coalescing. A user scrolling through a feed can trigger dozens of classification requests in seconds. Without a queue, each runs independently, preventing any batching optimisation the Neural Engine could otherwise apply.
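For contrast, the naive call site looks something like this. A compressed sketch, with illustrative names and the same placeholder types used in the scheduler code later in this piece:

@MainActor
final class FeedViewModel {
    private let model: SomeMLModel
    var label = ""

    init(model: SomeMLModel) { self.model = model }

    func classifyTapped(_ request: InferenceRequest) {
        Task {
            // Fires regardless of battery, thermal, or power state,
            // at the same priority as everything else.
            let result = try await model.perform(request)
            label = String(describing: result)
        }
    }
}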

Reading battery state before scheduling

UIDevice.current.batteryState and UIDevice.current.batteryLevel are the entry points. Battery monitoring must be enabled explicitly:

UIDevice.current.isBatteryMonitoringEnabled = true

The relevant states map to four operating conditions:

  • .charging or .full — full inference budget available
  • .unplugged at level > 0.20 — standard budget, no deferral
  • .unplugged at level ≤ 0.20 — reduced budget; non-critical inference defers
  • ProcessInfo.processInfo.isLowPowerModeEnabled — hard signal to suspend all non-foreground inference

Low-power mode is the clearest signal. When it is active, the user has explicitly told the OS to conserve energy. Inference that is not directly serving a foreground interaction should not run.
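A compact way to read all three signals together; the helper name is illustrative:

import UIKit

// Reads the power signals the scheduler consumes. Without battery
// monitoring enabled, batteryLevel reports -1.0 and batteryState
// reports .unknown.
func currentPowerSignals() -> (state: UIDevice.BatteryState,
                               level: Float,
                               lowPower: Bool) {
    UIDevice.current.isBatteryMonitoringEnabled = true
    return (
        UIDevice.current.batteryState,
        UIDevice.current.batteryLevel,
        ProcessInfo.processInfo.isLowPowerModeEnabled
    )
}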

Scheduling architecture

The inference queue

The inference scheduler sits between the call site and the Core ML model. Nothing calls the model directly. Every request transits through the scheduler, which evaluates battery state, thermal state, and request priority before deciding whether to run immediately, defer, or drop.

actor InferenceScheduler {
    private let model: SomeMLModel
    private var pendingTasks: [InferenceRequest] = []

    func enqueue(_ request: InferenceRequest) async throws -> InferenceResult {
        let budget = BatteryBudget.current()
        guard budget.allows(request.priority) else {
            // Park the request for re-evaluation when power
            // conditions change (see "Deferral and coalescing").
            pendingTasks.append(request)
            throw InferenceError.deferred(reason: budget.deferralReason)
        }
        return try await model.perform(request)
    }
}

The actor isolation here is not cosmetic. Inference requests from multiple call sites — a view model, a background sync handler, a widget timeline provider — all transit through a single actor-isolated queue. Contention is serialised by the Swift concurrency runtime, not by manual locking. See the Swift concurrency patterns for AI workloads for the broader actor-isolation approach.

Priority tiers

Not all inference is equal. A three-tier model covers most production cases:

  • .critical — foreground, user-initiated, blocking UI. Runs regardless of battery state. Example: a user waiting on a response in a chat interface.
  • .standard — foreground but not blocking. Runs unless low-power mode is active. Example: pre-classifying content as the user scrolls.
  • .background — non-user-visible. Defers when battery is below 20% or low-power mode is active. Example: indexing, pre-computation, cache warming.
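These rules assume a comparable tier type. A minimal sketch, Comparable so the budget code later in this piece can express a priority floor:

enum InferencePriority: Int, Comparable {
    case background = 0  // defers below 20% battery or in low-power mode
    case standard = 1    // defers only in low-power mode
    case critical = 2    // always runs

    static func < (lhs: Self, rhs: Self) -> Bool {
        lhs.rawValue < rhs.rawValue
    }
}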

Deferral and coalescing

Deferred requests do not disappear. They accumulate in pendingTasks and are re-evaluated when power conditions change. The scheduler observes UIDevice.batteryLevelDidChangeNotification and the NSProcessInfoPowerStateDidChange notification to trigger re-evaluation:

NotificationCenter.default
    .publisher(for: UIDevice.batteryLevelDidChangeNotification)
    .merge(with: NotificationCenter.default
        .publisher(for: .NSProcessInfoPowerStateDidChange))
    .sink { [weak self] _ in
        // Power conditions changed: re-check the deferred queue.
        Task { await self?.drainDeferredQueue() }
    }
    .store(in: &cancellables)

Coalescing applies to background tasks with identical input signatures. If five requests to classify the same content type arrive within a 500ms window, the scheduler runs one and fans the result out to all five callers. This is the batching optimisation the Neural Engine benefits from — and it only becomes possible when a queue exists.
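A minimal sketch of that fan-out, assuming InferenceRequest exposes a stable signature string for its input. This version coalesces while a matching request is in flight rather than over an explicit 500ms window:

actor RequestCoalescer {
    private var inFlight: [String: Task<InferenceResult, Error>] = [:]

    func run(_ request: InferenceRequest,
             on model: SomeMLModel) async throws -> InferenceResult {
        // A matching request is already running: await its result.
        if let existing = inFlight[request.signature] {
            return try await existing.value
        }
        let task = Task { try await model.perform(request) }
        inFlight[request.signature] = task
        // Only the caller that created the task clears the entry.
        defer { inFlight[request.signature] = nil }
        return try await task.value
    }
}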

Model selection as a runtime decision

Many production apps ship more than one model variant — a full-precision model for high-accuracy tasks and a quantized INT4 or INT8 variant for constrained conditions. With a scheduler in place, model selection at runtime becomes a direct consequence of battery state.

The scheduler holds references to both variants. When battery drops below the threshold, it routes requests to the quantized model. Inference accuracy may decrease marginally. Inference speed and energy cost decrease substantially.
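A sketch of that routing decision, using the BatteryBudget type shown later; parameter names are illustrative:

// Any raised priority floor sends work to the quantized variant;
// a full budget gets full precision.
func selectModel(for budget: BatteryBudget,
                 fullPrecision: SomeMLModel,
                 quantized: SomeMLModel) -> SomeMLModel {
    budget.minimumPriority == .background ? fullPrecision : quantized
}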

Core ML Tools supports post-training quantization. A model quantized to INT8 typically runs 2–4x faster on the Neural Engine than its FP32 equivalent, with accuracy loss that is often below the threshold of user perception for classification tasks. See the Core ML optimization techniques guide for quantization details and the Core ML inference performance benchmarks for real latency and energy numbers by device class.

Thermal state as a secondary signal

Battery level is the primary signal. Thermal state is the secondary one. ProcessInfo.processInfo.thermalState surfaces four levels: .nominal, .fair, .serious, .critical.

At .serious, the OS has already begun throttling. Running full-precision inference at that point actively worsens the situation — the model runs slower, generates more heat, and extends the time the device stays throttled. The scheduler maps thermal state to the same priority floor logic as battery state:

struct BatteryBudget {
    let minimumPriority: InferencePriority
    var deferralReason: String { "priority floor is \(minimumPriority)" }

    // A request runs only if its priority clears the current floor.
    func allows(_ priority: InferencePriority) -> Bool { priority >= minimumPriority }

    static func current() -> BatteryBudget {
        let lowPower = ProcessInfo.processInfo.isLowPowerModeEnabled
        let thermal = ProcessInfo.processInfo.thermalState
        let level = UIDevice.current.batteryLevel

        return BatteryBudget(
            minimumPriority: Self.floor(
                lowPower: lowPower,
                thermal: thermal,
                level: level
            )
        )
    }

    // batteryLevel reads -1.0 when monitoring is disabled.
    private static func floor(
        lowPower: Bool, thermal: ProcessInfo.ThermalState, level: Float
    ) -> InferencePriority {
        if lowPower || thermal == .serious || thermal == .critical { return .critical }
        if level >= 0, level <= 0.20 { return .standard }
        return .background
    }
}
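Thermal transitions should trigger the same re-evaluation as power-state changes. A sketch mirroring the observer in the deferral section:

NotificationCenter.default
    .publisher(for: ProcessInfo.thermalStateDidChangeNotification)
    .sink { [weak self] _ in
        // Heat subsided (or worsened): re-check the deferred queue.
        Task { await self?.drainDeferredQueue() }
    }
    .store(in: &cancellables)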

Background inference with BGProcessingTask

Some inference workloads are genuinely background — model fine-tuning steps, large-batch classification runs, index updates. These belong in BGProcessingTask, not in foreground task groups.

A BGProcessingTaskRequest can set requiresExternalPower = true and requiresNetworkConnectivity = false. For inference tasks, requiring external power is the right default: the OS then schedules the task when the device is plugged in and idle, exactly the conditions under which a large inference workload carries no user-visible cost.
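Submitting such a request is a few lines. A sketch using the same identifier as the registration handler below:

import BackgroundTasks

let request = BGProcessingTaskRequest(
    identifier: "com.yourapp.inference.batch"
)
// Only run while plugged in; the batch does not need the network.
request.requiresExternalPower = true
request.requiresNetworkConnectivity = false

do {
    try BGTaskScheduler.shared.submit(request)
} catch {
    // submit(_:) can throw, e.g. when background tasks are
    // unavailable on the current device.
}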

BGTaskScheduler.shared.register(
    forTaskWithIdentifier: "com.yourapp.inference.batch",
    using: nil
) { task in
    guard let processingTask = task as? BGProcessingTask else { return }
    let batch = Task {
        await InferenceScheduler.shared.runDeferredBatch()
        processingTask.setTaskCompleted(success: !Task.isCancelled)
    }
    // If the OS reclaims the budget, stop promptly; leaving the task
    // incomplete past expiration risks termination.
    processingTask.expirationHandler = { batch.cancel() }
}

The task identifier must be declared in Info.plist under BGTaskSchedulerPermittedIdentifiers. Omitting this causes silent scheduling failures — the task registers but never executes. This is exactly the class of issue that Xcode Doctor surfaces before submission.

What this looks like in production

The offgrid:AI app demonstrates this architecture in a context where battery constraints are not theoretical. The app operates in offline emergency scenarios — exactly the conditions where the device is most likely to be at low battery and where inference must still function. The architecture routes between a full model and a quantized variant based on battery state, defers non-critical inference when low-power mode is active, and uses BGProcessingTask for any pre-computation that can wait for a charging window.

The CalmLedger app applies the same pattern in a different context — privacy-first finance classification where inference runs entirely on-device and the scheduling layer ensures background categorisation does not drain battery during active use.

For teams building AI features into iOS apps from scratch, the Swift 6 AI integration guide covers the concurrency primitives — AsyncStream, actor isolation, structured task groups — that make a scheduler like this implementable without data races. The ML integration decision framework covers when to use Core ML vs Apple Foundation Models in a battery-constrained context.

FAQs

What is battery-aware AI scheduling in iOS?

Battery-aware AI scheduling is an architectural pattern where on-device inference requests are evaluated against current battery level, low-power mode status, and thermal state before executing. The scheduler routes requests to appropriate model variants, defers non-critical work, and coalesces duplicate requests — rather than running every inference call immediately regardless of device conditions.

How do I detect low-power mode in a Swift iOS app?

Use ProcessInfo.processInfo.isLowPowerModeEnabled to read the current state synchronously. Observe the NSProcessInfoPowerStateDidChange notification to react when the user toggles low-power mode. Note that this notification is posted on a global dispatch queue, not the main thread, so hop to the main actor (or to your scheduler actor) before updating state in the handler.

When should I use BGProcessingTask for Core ML inference?

Use BGProcessingTask for large-batch or non-time-sensitive inference workloads — index updates, pre-computation, model warm-up passes. Set requiresExternalPower: true so the OS schedules the task when the device is charging and idle. Do not use BGProcessingTask for foreground inference; the task may not execute for hours.

What is the difference between Core ML model quantization options for battery performance?

INT8 quantization typically reduces model size by 4x compared to FP32 and runs 2–4x faster on the Neural Engine, with proportionally lower energy consumption. INT4 reduces size further but introduces more accuracy loss. The right choice depends on the task — classification tasks tolerate INT8 well; generative tasks may require FP16 for acceptable output quality.

How do I handle thermal throttling during on-device inference?

Read ProcessInfo.processInfo.thermalState and observe ProcessInfo.thermalStateDidChangeNotification. At .serious or .critical, suspend non-foreground inference and queue requests for later execution. Running inference through a throttled SoC extends the throttled period — deferring work is the faster path back to nominal performance.

Should the inference scheduler be actor-isolated in Swift 6?

Yes. An actor-isolated scheduler serialises access to the pending task queue and model references without manual locking. In Swift 6's strict concurrency model, a non-isolated scheduler requires explicit Sendable conformances and synchronisation primitives that are harder to reason about. The actor boundary makes the data flow deterministic.

How does battery-aware scheduling interact with Apple Intelligence APIs?

Apple Foundation Models manage their own scheduling internally, but you still control when you call them. The same priority-tier pattern applies: gate calls to Foundation Models behind the battery budget check, defer non-critical requests when low-power mode is active, and avoid calling generative APIs in background contexts where the OS may suspend the process mid-inference.

Building on-device AI into your iOS app?

The On-Device AI Integration engagement covers Core ML model selection, actor-isolated inference, battery-aware scheduling, and App Store compliance — delivered in 3–5 weeks at a fixed price.