
offgrid:AI Case Study: Building a Fully Offline AI Assistant on Apple Silicon

offgrid:AI is a fully offline AI assistant that runs on Apple Silicon with zero network dependency. This case study covers the architecture decisions, the inference layer using Core ML and Apple Foundation Models, battery-aware scheduling, privacy boundary enforcement, and the results.

By Ehsan Azish · 3NSOFTS · May 2026 · 10 min read · iOS 18.1+ for Apple Foundation Models, Core ML for broader device support

The Structural Problem with Cloud-Dependent AI

Most AI assistants on iOS are thin clients. The app handles the UI. The intelligence lives on a server. That architecture holds under three conditions: the network is available, the API is responsive, and the user's data is acceptable collateral for the round-trip.

Remove any one of those conditions and the app stops working.

This is not a fringe scenario. Emergency responders, field workers, travellers in low-connectivity regions, anyone with a genuine privacy requirement — they all hit this wall. The app surfaces a spinner. The spinner never resolves. The user has no recourse.

offgrid:AI was built as a direct response to this structural constraint.


The Design Premise

The design premise: an AI assistant that operates at full capability regardless of network state. Not degraded capability. Not a fallback mode. Full inference, on-device, on Apple Silicon, with zero bytes transiting any server.

Every architectural decision flows from that.


Constraints

The constraints that shaped the architecture:

  • Network connectivity cannot be assumed — the app must function identically online and offline
  • Zero data egress — no user input, no conversation history, no inference request may leave the device
  • No API costs — cloud inference at scale introduces per-token costs that break the economics of a consumer app
  • Battery impact must be bounded — sustained LLM inference on a mobile device can drain a battery in under two hours if unmanaged
  • Latency must be perceived as fast — a response that takes 4–6 seconds feels broken to a user in a stressful scenario
  • Apple Intelligence availability cannot be required — the architecture must degrade gracefully on devices without the Neural Engine tier that supports Foundation Models

Architecture

Inference Layer: Core ML and Apple Foundation Models

The naive approach is to call an external API. Two lines of code, and it works perfectly — until the network is gone.

Core ML is the correct layer for on-device inference on Apple platforms. It routes computation to the most efficient available hardware — Neural Engine, GPU, or CPU — without the developer managing that dispatch manually. On Apple Silicon, the Neural Engine handles transformer inference with substantially lower power draw than GPU execution.
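
Compute-unit preference is set through MLModelConfiguration. A minimal sketch, with a hypothetical generated model class named QuantizedAssistantModel standing in for the app's bundled model:

import CoreML

let config = MLModelConfiguration()

// .all lets Core ML dispatch across Neural Engine, GPU, and CPU.
// .cpuAndNeuralEngine excludes the GPU, trading peak throughput for
// lower power draw on sustained workloads.
config.computeUnits = .cpuAndNeuralEngine

// QuantizedAssistantModel is illustrative, not the shipping class name.
let model = try QuantizedAssistantModel(configuration: config)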

For devices running iOS 18.1 and later with Apple Intelligence enabled, Apple Foundation Models provides direct access to the on-device language model through a structured Swift API. LanguageModelSession handles context management, token streaming, and safety guardrails with no network dependency.

import FoundationModels

let session = LanguageModelSession()

// Streaming response: each element is a cumulative snapshot of the
// response so far, so assign rather than append.
let stream = session.streamResponse(to: prompt)
for try await partial in stream {
    await MainActor.run {
        self.responseText = partial.content
    }
}

For devices below the Apple Intelligence tier, the app falls back to a quantized Core ML model. This ensures the app is fully functional across the supported device range — not just on the latest hardware.

The Fallback Architecture

Apple Foundation Models requires:

  • iOS 18.1+
  • Apple Intelligence enabled (user opt-in)
  • Sufficient on-device storage (Apple Intelligence models are downloaded on demand)

Any of these conditions may not be met. The fallback chain:

  1. Apple Foundation Models (LanguageModelSession) — preferred, highest quality
  2. Quantized Core ML model — all devices, iOS 17+, no Apple Intelligence required
  3. Static response templates — lowest quality, never blank, always functional

enum InferenceBackend {
    case foundationModels
    case coreML(model: MLModel)
    case staticTemplates
}

func resolveBackend() -> InferenceBackend {
    // Availability reflects whether Apple Intelligence is enabled and
    // the on-device model has been downloaded.
    if #available(iOS 18.1, *),
       SystemLanguageModel.default.isAvailable {
        return .foundationModels
    }
    // MyQuantizedModel is the app's bundled, quantized Core ML model.
    if let model = try? MyQuantizedModel(configuration: MLModelConfiguration()).model {
        return .coreML(model: model)
    }
    return .staticTemplates
}

SwiftData Persistence

Conversation history, user preferences, and session state are stored locally using SwiftData. No iCloud sync — data stays on the device and never transits any cloud infrastructure.
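
A minimal sketch of the local-only container setup, using the Conversation model defined below; passing cloudKitDatabase: .none makes the no-sync decision explicit in code rather than implicit:

import SwiftData

// Local-only store: no CloudKit mirroring, no shared group container.
let configuration = ModelConfiguration(cloudKitDatabase: .none)
let container = try ModelContainer(
    for: Conversation.self,
    configurations: configuration
)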

// Message is a companion @Model type, omitted here for brevity.
@Model
class Conversation {
    var id: UUID
    var title: String
    var createdAt: Date
    var messages: [Message]

    init(title: String) {
        self.id = UUID()
        self.title = title
        self.createdAt = Date()
        self.messages = []
    }
}

The choice of SwiftData over Core Data reflects the iOS 17+ minimum deployment target. SwiftData's @Query property wrapper binds conversation history directly to the view lifecycle without manual fetch request configuration.
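
A minimal sketch of that binding, with an illustrative view name:

import SwiftUI
import SwiftData

struct ConversationListView: View {
    // Fetches every Conversation, newest first; the view re-renders
    // automatically as records are inserted or deleted.
    @Query(sort: \Conversation.createdAt, order: .reverse)
    private var conversations: [Conversation]

    var body: some View {
        List(conversations) { conversation in
            Text(conversation.title)
        }
    }
}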

Battery-Aware Scheduling

Sustained LLM inference drains a battery. The inference scheduler monitors thermal state and enforces session limits:

import Foundation

enum InferenceThrottle {
    case none                  // run at full context and speed
    case reduceContextWindow   // shrink the context to cut compute per token
    case pauseGeneration       // stop generating until the device cools
}

func checkThermalState() -> InferenceThrottle {
    switch ProcessInfo.processInfo.thermalState {
    case .nominal, .fair:
        return .none
    case .serious:
        return .reduceContextWindow
    case .critical:
        return .pauseGeneration
    @unknown default:
        return .none
    }
}

The app presents an explicit warning when a session exceeds 15 minutes of sustained inference, prompting the user to take a break. This is both a battery protection measure and a UX decision — extended inference sessions on consumer hardware produce noticeable device warmth.
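
A minimal sketch of that session limit, assuming a hypothetical onLimitExceeded hook that the UI layer would use to present the prompt:

import Foundation

final class InferenceSessionMonitor {
    private var sessionStart: Date?
    private let limit: TimeInterval = 15 * 60  // 15 minutes

    // Hypothetical hook; the real app routes this to the UI layer.
    var onLimitExceeded: (() -> Void)?

    func inferenceDidStart() {
        if sessionStart == nil { sessionStart = Date() }
    }

    func inferenceDidProgress() {
        guard let start = sessionStart,
              Date().timeIntervalSince(start) > limit else { return }
        sessionStart = nil  // reset so the warning fires once per session
        onLimitExceeded?()
    }
}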

Privacy Boundary

The privacy boundary is architectural, not policy-based. The inference layer has no network stack:

  • No URLSession in the inference module
  • No analytics SDK
  • No telemetry path
  • No crash reporting that includes user input

User conversations are stored in SwiftData on the device. When the app is deleted, the data is deleted with it. There is no server-side backup, no synchronisation service, and no way for 3NSOFTS to access user conversations.
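
One way to make that boundary mechanical rather than procedural is target separation. A sketch with illustrative target names: the inference target has no dependency that can open a socket, so adding one would surface in review as a manifest change.

// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "OffgridAI",
    targets: [
        // No networking, analytics, or telemetry dependencies here.
        .target(name: "InferenceCore"),
        .target(name: "OffgridApp", dependencies: ["InferenceCore"]),
    ]
)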


Results

  • Offline capability: 100% of features available with no network connection
  • Inference latency (Apple Foundation Models, iPhone 16 Pro): first token in under 1.5 seconds; full response in 4–8 seconds depending on length
  • Inference latency (Core ML fallback, iPhone 14): first token in 3–5 seconds; full response in 8–18 seconds
  • Battery impact: approximately 12% battery per hour of active inference on iPhone 16 Pro, 18% on iPhone 14
  • App binary size impact: the Core ML model bundle adds 85MB to the download; the Foundation Models path adds nothing, since the on-device model ships with iOS, not with the app
  • App Store review: approved on first submission with a complete privacy manifest


Explore offgrid:AI

offgrid:AI is available on the App Store for iPhone. It runs offline by design — no account, no subscription required to use the core inference features.

View offgrid:AI on the App Store →

offgrid:AI product page →

