offgrid:AI Case Study: Building a Fully Offline AI Assistant on Apple Silicon
offgrid:AI is a fully offline AI assistant that runs on Apple Silicon with zero network dependency. This case study covers the architecture decisions, the inference layer using Core ML and Apple Foundation Models, battery-aware scheduling, privacy boundary enforcement, and the results.
The Structural Problem with Cloud-Dependent AI
Most AI assistants on iOS are thin clients. The app handles the UI. The intelligence lives on a server. That architecture holds only when three conditions are met: the network is available, the API is responsive, and the user's data is acceptable collateral for the round-trip.
Remove any one of those conditions and the app stops working.
This is not a fringe scenario. Emergency responders, field workers, travellers in low-connectivity regions, anyone with a genuine privacy requirement — they all hit this wall. The app surfaces a spinner. The spinner never resolves. The user has no recourse.
offgrid:AI was built as a direct response to this structural constraint.
The Design Premise
The design premise: an AI assistant that operates at full capability regardless of network state. Not degraded capability. Not a fallback mode. Full inference, on-device, on Apple Silicon, with zero bytes transiting any server.
Every architectural decision flows from that.
Constraints
The constraints that shaped the architecture:
- Network connectivity cannot be assumed — the app must function identically online and offline
- Zero data egress — no user input, no conversation history, no inference request may leave the device
- No API costs — cloud inference at scale introduces per-token costs that break the economics of a consumer app
- Battery impact must be bounded — sustained LLM inference on a mobile device can drain a battery in under two hours if unmanaged
- Latency must feel fast — a response that takes 4–6 seconds feels broken to a user in a stressful scenario
- Apple Intelligence availability cannot be required — the architecture must degrade gracefully on devices without the Neural Engine tier that supports Foundation Models
Architecture
Inference Layer: Core ML and Apple Foundation Models
The naive approach is to call an external API. Two lines of code, and it works perfectly — until the network is gone.
Core ML is the correct layer for on-device inference on Apple platforms. It routes computation to the most efficient available hardware — Neural Engine, GPU, or CPU — without the developer managing that dispatch manually. On Apple Silicon, the Neural Engine handles transformer inference with substantially lower power draw than GPU execution.
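That dispatch can also be constrained explicitly when loading a Core ML model, via MLModelConfiguration.computeUnits. A minimal sketch — the QuantizedAssistant class name is illustrative, standing in for whatever compiled model the app bundles:

```swift
import CoreML

// Prefer the Neural Engine (with CPU fallback) and avoid GPU execution,
// which draws more power for transformer workloads on Apple Silicon.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

// "QuantizedAssistant" is a hypothetical generated model class;
// Xcode generates urlOfModelInThisBundle for every bundled .mlmodel.
let model = try MLModel(
    contentsOf: QuantizedAssistant.urlOfModelInThisBundle,
    configuration: config
)
```

Leaving computeUnits at its default (.all) lets Core ML choose; pinning it is mainly useful when profiling power draw per backend.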
For devices running iOS 18.1 and later with Apple Intelligence enabled, Apple Foundation Models provides direct access to the on-device language model through a structured Swift API. LanguageModelSession handles context management, token streaming, and safety guardrails with no network dependency.
import FoundationModels

let session = LanguageModelSession()

// Streaming response — each partial is a cumulative snapshot of the
// response so far, so the text is replaced rather than appended.
let stream = session.streamResponse(to: prompt)
for try await partial in stream {
    await MainActor.run {
        self.responseText = partial.text
    }
}
For devices below the Apple Intelligence tier, the app falls back to a quantized Core ML model. This ensures the app is fully functional across the supported device range — not just on the latest hardware.
The Fallback Architecture
Apple Foundation Models requires:
- iOS 18.1+
- Apple Intelligence enabled (user opt-in)
- Sufficient on-device storage (Apple Intelligence models are downloaded on demand)
Any of these conditions may not be met. The fallback chain:
- Apple Foundation Models (LanguageModelSession) — preferred, highest quality
- Quantized Core ML model — all devices, iOS 17+, no Apple Intelligence required
- Static response templates — lowest quality, never blank, always functional
enum InferenceBackend {
    case foundationModels
    case coreML(model: MLModel)
    case staticTemplates
}

func resolveBackend() async -> InferenceBackend {
    // Foundation Models availability is surfaced via SystemLanguageModel.
    if #available(iOS 18.1, *),
       SystemLanguageModel.default.isAvailable {
        return .foundationModels
    }
    if let model = try? MyQuantizedModel(configuration: MLModelConfiguration()).model {
        return .coreML(model: model)
    }
    return .staticTemplates
}
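Once resolved, the backend drives a single dispatch point for generation. A sketch of how that fan-out might look — generateWithCoreML and templateResponse are hypothetical helpers, not APIs from the shipping app:

```swift
func generate(prompt: String) async throws -> String {
    switch await resolveBackend() {
    case .foundationModels:
        // One-shot (non-streaming) request against the system model.
        let session = LanguageModelSession()
        return try await session.respond(to: prompt).content
    case .coreML(let model):
        // Hypothetical helper wrapping tokenization + MLModel prediction.
        return try await generateWithCoreML(model: model, prompt: prompt)
    case .staticTemplates:
        // Hypothetical helper returning a canned, never-blank response.
        return templateResponse(for: prompt)
    }
}
```

Keeping the switch in one place means the rest of the app never needs to know which tier it is running on.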
SwiftData Persistence
Conversation history, user preferences, and session state are stored locally using SwiftData. No iCloud sync — data stays on the device and never transits any cloud infrastructure.
@Model
class Conversation {
    var id: UUID
    var title: String
    var createdAt: Date
    var messages: [Message]

    init(title: String) {
        self.id = UUID()
        self.title = title
        self.createdAt = Date()
        self.messages = []
    }
}
The choice of SwiftData over Core Data reflects the iOS 17+ minimum deployment target. SwiftData's @Query property wrapper binds conversation history directly to the view lifecycle without manual fetch request configuration.
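That binding looks roughly like this in a SwiftUI view — a sketch assuming the Conversation model above:

```swift
import SwiftUI
import SwiftData

struct ConversationListView: View {
    // @Query keeps this array in sync with the store across the view
    // lifecycle — no manual fetch request configuration.
    @Query(sort: \Conversation.createdAt, order: .reverse)
    private var conversations: [Conversation]

    var body: some View {
        List(conversations) { conversation in
            Text(conversation.title)
        }
    }
}
```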
Battery-Aware Scheduling
Sustained LLM inference drains a battery. The inference scheduler monitors thermal state and enforces session limits:
func checkThermalState() -> InferenceThrottle {
    switch ProcessInfo.processInfo.thermalState {
    case .nominal, .fair:
        return .none
    case .serious:
        return .reduceContextWindow
    case .critical:
        return .pauseGeneration
    @unknown default:
        return .none
    }
}
The app presents an explicit warning when a session exceeds 15 minutes of sustained inference, prompting the user to take a break. This is both a battery protection measure and a UX decision — extended inference sessions on consumer hardware produce noticeable device warmth.
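The session limit itself reduces to a small piece of bookkeeping. A minimal sketch — SessionBudget is a hypothetical helper, not part of the app's actual code:

```swift
import Foundation

/// Accumulates sustained-inference time and reports when the
/// 15-minute break warning should be surfaced.
struct SessionBudget {
    let limit: TimeInterval = 15 * 60
    private(set) var accumulated: TimeInterval = 0

    // Call once per completed generation with its wall-clock duration.
    mutating func recordInference(duration: TimeInterval) {
        accumulated += duration
    }

    var shouldWarn: Bool { accumulated >= limit }

    // Reset after the user acknowledges the warning and takes a break.
    mutating func reset() { accumulated = 0 }
}

var budget = SessionBudget()
budget.recordInference(duration: 14 * 60)
print(budget.shouldWarn)   // false — 14 minutes is under the limit
budget.recordInference(duration: 90)
print(budget.shouldWarn)   // true — 15.5 minutes exceeds it
```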
Privacy Boundary
The privacy boundary is architectural, not policy-based. The inference layer has no network stack:
- No URLSession in the inference module
- No analytics SDK
- No telemetry path
- No crash reporting that includes user input
User conversations are stored in SwiftData on the device. On device deletion, the data is gone. There is no server-side backup, no synchronisation service, and no way for 3NSOFTS to access user conversations.
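The no-sync guarantee can also be stated in code at container setup. A sketch assuming the Conversation model above:

```swift
import SwiftData

// cloudKitDatabase: .none makes the local-only policy explicit:
// SwiftData will never attach this store to a CloudKit container.
let configuration = ModelConfiguration(cloudKitDatabase: .none)
let container = try ModelContainer(
    for: Conversation.self,
    configurations: configuration
)
```

The default (.automatic) would let SwiftData adopt CloudKit if the app's entitlements allowed it; opting out explicitly keeps the privacy boundary enforced even if entitlements change later.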
Results
Offline capability: 100% of features available with no network connection
Inference latency (Apple Foundation Models, iPhone 16 Pro): first token in under 1.5 seconds, full response in 4–8 seconds depending on length
Inference latency (Core ML fallback, iPhone 14): first token in 3–5 seconds, full response in 8–18 seconds
Battery impact: approximately 12% battery per hour of active inference on iPhone 16 Pro, 18% on iPhone 14
App binary size impact: Core ML model bundle adds 85MB to the app download. The Foundation Models path adds zero — the on-device model is part of iOS, not the app bundle.
App Store first-submission approval: passed on first submission with complete privacy manifest
Explore offgrid:AI
offgrid:AI is available on the App Store for iPhone. It runs offline by design — no account, no subscription required to use the core inference features.
View offgrid:AI on the App Store →