iOS · On-Device AI · Case Study

offgrid:AI: Shipping Fully Offline LLM Inference on iOS

Building an AI assistant that runs entirely on-device — no cloud API, no server costs, no data transmission — required solving model storage, memory constraints, inference speed, and battery life simultaneously.

Stack

SwiftUI · llama.cpp · Core ML

Platform

iOS · On App Store

Performance

18–22% battery/hr sustained

Data sent

0 bytes to any server

Context

In 2024, virtually every AI assistant app on iOS required an active internet connection and transmitted user prompts to cloud infrastructure. The prevailing market assumption was that language-model inference was too compute-intensive to run on a mobile device. Yet the use cases for a genuinely offline AI assistant were real and unserved: field workers without reliable connectivity, travelers in areas with high data costs, privacy-conscious users who would not send prompts to a cloud API, and emergency-preparedness scenarios where connectivity cannot be assumed.

Problem

The technical barriers to production-viable on-device LLM inference on iOS in 2024 were not theoretical — they were real constraints that had to be solved simultaneously:

  • Model size: a usable language model is 3–16 GB. That's a significant portion of a device's storage.
  • Memory: LLM inference requires holding model weights and the KV cache in memory simultaneously. The iPhone's unified memory architecture helps, but context window size is directly limited by available RAM.
  • Battery: sustained inference draws significant CPU and Neural Engine power. An app that drains 50% battery per hour is not useful.
  • Apple Foundation Models framework: not available until iOS 26. A cross-version strategy was required.
  • App Store: Apple's guidelines restrict some model hosting patterns. Approval required deliberate preparation.

Architecture

Inference Engine: llama.cpp

llama.cpp was the only production-viable path for local LLM inference on iOS prior to Apple Foundation Models. It provides a C/C++ implementation of LLaMA inference with GGUF format support, optimized for the NEON instruction set used by Apple Silicon. The Swift integration layer wraps the C API, manages model lifecycle (load on first use, unload to free memory when backgrounded), and bridges llama.cpp's token callback to Swift's AsyncStream<String>.
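The load-on-first-use / unload-on-backgrounding policy can be sketched as a small state machine. This is an illustrative sketch, not offgrid:AI's actual types — `ModelLifecycle` and `AppEvent` are hypothetical names, and in the app the events would be driven by `UIApplication.didEnterBackgroundNotification` and memory-warning callbacks:

```swift
// Illustrative sketch of the load/unload lifecycle policy described above.
// ModelLifecycle and AppEvent are hypothetical names.
enum ModelState { case unloaded, loaded }
enum AppEvent { case inferenceRequested, didEnterBackground, memoryWarning }

struct ModelLifecycle {
    private(set) var state: ModelState = .unloaded

    /// Returns true when the event requires an expensive load or unload.
    mutating func handle(_ event: AppEvent) -> Bool {
        switch (state, event) {
        case (.unloaded, .inferenceRequested):
            state = .loaded      // load weights on first use
            return true
        case (.loaded, .didEnterBackground), (.loaded, .memoryWarning):
            state = .unloaded    // free several GB when backgrounded
            return true
        default:
            return false         // nothing to do
        }
    }
}
```

Keeping the policy as a pure value type means it can be unit-tested off-device, with the UIKit notification wiring reduced to a thin adapter.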

Quantization Strategy

The quantization-quality trade-off defines the user experience. Models below Q4 produce noticeably degraded output — users perceive the quality drop. Models above Q5 exceed practical on-device storage for most users. The app ships with Q4_K_M (approximately 4.5 GB) as the primary model and Q5_K_M (approximately 5.5 GB) as an optional higher-quality variant.

Quantization   Size      Output quality   Battery/hr
Q4_K_M         ~4.5 GB   Good             18–22%
Q5_K_M         ~5.5 GB   Very good        20–25%
Q8_0           ~8.5 GB   Near-full        28–35%
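A variant picker built on these trade-offs can be sketched as follows. The sizes come from the table above; the 10 GB OS headroom is an assumed value, and the function name is illustrative:

```swift
// Sketch: pick the highest-quality quantization that fits free storage,
// leaving headroom for the OS. The 10 GB headroom is an assumed default.
struct QuantVariant { let name: String; let sizeGB: Double }

let variantsByQuality: [QuantVariant] = [
    QuantVariant(name: "Q8_0",   sizeGB: 8.5),
    QuantVariant(name: "Q5_K_M", sizeGB: 5.5),
    QuantVariant(name: "Q4_K_M", sizeGB: 4.5),
]

func pickVariant(freeGB: Double, headroomGB: Double = 10.0) -> QuantVariant? {
    variantsByQuality.first { $0.sizeGB + headroomGB <= freeGB }
}
```

Returning nil when nothing fits lets the download UX surface a clear "not enough storage" message instead of failing mid-download.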

Battery-Aware Scheduling

LLM inference is not interruptible at arbitrary points — a token generation in progress must complete. The battery scheduler observes two signals: UIDevice.current.batteryLevel and ProcessInfo.processInfo.thermalState. If battery drops below 15% during generation, the current response completes and a UI warning is shown before the next request. If thermal state is .serious or .critical, inference CPU thread count is halved — reducing throughput but preventing the device from throttling the processor mid-generation.
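A minimal sketch of that policy, with a local `ThermalState` enum standing in for `ProcessInfo.ThermalState` so the logic stays testable off-device (the thresholds are the ones stated above; the type names are illustrative):

```swift
// Sketch of the battery/thermal scheduling policy described above.
// ThermalState mirrors ProcessInfo.ThermalState's cases.
enum ThermalState { case nominal, fair, serious, critical }

struct InferencePolicy {
    let baseThreads: Int

    /// Halve the inference thread count under serious/critical thermal load.
    func threadCount(for thermal: ThermalState) -> Int {
        switch thermal {
        case .serious, .critical: return max(1, baseThreads / 2)
        case .nominal, .fair:     return baseThreads
        }
    }

    /// The in-flight response completes; warn before the next request.
    /// UIDevice reports batteryLevel as -1.0 when monitoring is disabled.
    func shouldWarnBeforeNextRequest(batteryLevel: Float) -> Bool {
        batteryLevel >= 0 && batteryLevel < 0.15
    }
}
```

In the app the inputs would come from `UIDevice.current.batteryLevel` and `ProcessInfo.processInfo.thermalState`, sampled between token generations.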

Model Storage & Download UX

Models are stored in the app's documents directory using FileManager — they survive app updates, are excluded from iCloud backup (to avoid consuming the user's iCloud storage quota), and are not purged by the system's storage reclamation. The download UX is a first-run flow, not a gate: the user sees the exact download size before committing. Downloads use URLSession background download tasks with progress tracking and automatic resume on failure.
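A sketch of the storage setup — the `Models/` subdirectory name is illustrative, and the backup exclusion uses Apple's documented `isExcludedFromBackup` resource value (guarded so the pure path logic compiles anywhere):

```swift
import Foundation

// Sketch: create a Models/ subdirectory (name is illustrative) under the
// given base URL — in the app this would be the documents directory.
func modelsDirectory(under base: URL) throws -> URL {
    let dir = base.appendingPathComponent("Models", isDirectory: true)
    try FileManager.default.createDirectory(at: dir, withIntermediateDirectories: true)
    return dir
}

// Exclude a downloaded model file from iCloud backup so a ~4.5 GB model
// does not consume the user's backup quota. Apple-platform-only API.
func excludeFromBackup(_ url: URL) throws {
    #if os(iOS) || os(macOS)
    var mutableURL = url
    var values = URLResourceValues()
    values.isExcludedFromBackup = true
    try mutableURL.setResourceValues(values)
    #endif
}
```

Exclusion must be reapplied per file after each download completes — it is a property of the file on disk, not of the directory.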

Implementation: Token Streaming to SwiftUI

llama.cpp produces tokens via a C callback. Bridging that to SwiftUI's reactive update model requires an AsyncStream that emits each token as it's generated:

import Foundation
import SwiftUI

// Bridge llama.cpp token callback to Swift AsyncStream
actor InferenceEngine {
    private var model: OpaquePointer?
    private var context: OpaquePointer?

    func generate(prompt: String) -> AsyncStream<String> {
        AsyncStream { continuation in
            let task = Task.detached(priority: .userInitiated) { [weak self] in
                guard let self else { return }

                let tokens = await self.tokenize(prompt)

                for token in await self.generateTokens(from: tokens) {
                    // Check thermal state before each token
                    if ProcessInfo.processInfo.thermalState == .critical {
                        await self.throttleInference()
                    }
                    continuation.yield(await self.tokenToPiece(token))
                }

                continuation.finish()
            }
            // Stop generating if the consumer cancels the stream
            continuation.onTermination = { _ in task.cancel() }
        }
    }
}

// In SwiftUI
struct ChatView: View {
    let engine: InferenceEngine
    let userMessage: String
    @State private var response = ""

    var body: some View {
        Text(response)
            .task {
                for await token in await engine.generate(prompt: userMessage) {
                    response += token
                }
            }
    }
}

Outcome

Shipped on the App Store with full offline inference. Users install a 4–5 GB model once and run open-ended conversations, document summarization, and code explanation entirely on-device — without an internet connection, without paying per API call, without their prompts being transmitted anywhere.

  • Live on the App Store — approved in standard review time with the offline inference architecture
  • Battery consumption on iPhone 15 Pro: 18–22% per hour at sustained Q4_K_M inference
  • 0 bytes transmitted to any server during inference — zero network entitlement required at inference time
  • Resumable model download: interruptions don't require starting over from scratch
  • Zero cloud infrastructure costs — no API, no server, no rate limits
  • Architecture reserves a migration path to Apple Foundation Models (iOS 26+) — the llama.cpp layer is swappable

"The technical constraint that defined the architecture: you cannot trade inference quality for model size beyond a threshold — below Q4, users notice degraded output. The solution lives in the quantization-quality curve, not at the extremes."

Key Technical Learnings

KV cache size is the real memory constraint

The model weights load once and stay largely static. The KV cache grows with context length — a 4K token context at Q4 can add 500 MB of memory pressure. Limit context window aggressively for chat use cases; rolling summarization is more practical than unlimited context.
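The trimming step can be sketched as keeping the system prompt intact and only the most recent history tokens; generating the rolling summary of the dropped prefix is elided here, and the function name is illustrative:

```swift
// Sketch: enforce a hard token budget by keeping the system prompt and
// only the most recent history tokens. Token IDs are llama.cpp-style Int32s.
func trimmedContext(systemTokens: [Int32], historyTokens: [Int32], budget: Int) -> [Int32] {
    let room = max(0, budget - systemTokens.count)
    return systemTokens + historyTokens.suffix(room)
}
```

Note the system prompt is never dropped, even if it alone exceeds the budget — that case should be caught at configuration time, not at trim time.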

Background download tasks, not foreground

A 4 GB model download in the foreground blocks the app and fails if the user switches away. URLSession background download tasks continue even when the app is backgrounded, and resume automatically if the connection drops. This is the only viable model for large asset downloads.

Thermal state is more actionable than battery level

Battery level tells you about future capacity; thermal state tells you about current load. When ProcessInfo.thermalState is .serious, the device is already throttling. Reducing inference threads before reaching .critical produces better sustained throughput than waiting for iOS to forcibly throttle the process.

Design for Foundation Models migration from day one

The inference interface is abstracted behind a protocol. llama.cpp is one concrete implementation. When Apple Foundation Models became available on iOS 26, adding a Foundation Models implementation required changing only the protocol conformance — the rest of the app was inference-engine agnostic.
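The abstraction can be sketched as a single protocol. `InferenceBackend` and `EchoBackend` are illustrative names — the echo backend is a trivial stand-in for testing, while llama.cpp and Foundation Models conformances would sit behind the same interface:

```swift
// Sketch of the engine-agnostic inference protocol described above.
// EchoBackend is a stand-in test double, not one of the app's real backends.
protocol InferenceBackend {
    func generate(prompt: String) -> AsyncStream<String>
}

struct EchoBackend: InferenceBackend {
    func generate(prompt: String) -> AsyncStream<String> {
        AsyncStream { continuation in
            // Emit one "token" per word, then finish — mimics streaming.
            for word in prompt.split(separator: " ") {
                continuation.yield(String(word))
            }
            continuation.finish()
        }
    }
}
```

Because consumers only see `AsyncStream<String>`, swapping llama.cpp for Foundation Models (or a mock in tests) never touches view code.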

Technical FAQ

What's the difference between llama.cpp and Apple Foundation Models for iOS?
llama.cpp is an open-source C/C++ library that runs arbitrary GGUF-format models. Any model in this format can be used — Llama, Mistral, Phi, Gemma. Apple Foundation Models is a first-party framework (iOS 26+) that provides access to Apple's on-device models through a high-level Swift API. Foundation Models is simpler to integrate and benefits from hardware optimizations, but you cannot swap the model — you use Apple's model. llama.cpp gives you full model choice but requires more engineering effort.
How does the app handle the App Store's large asset size concerns?
The model is not bundled in the app binary — it is downloaded post-install as a user-initiated action. The app binary is under 50 MB; the model download is separate and optional. This approach is consistent with Apple's guidelines for apps that require large supplemental data (audio apps, AR apps, etc.) as long as the app clearly communicates the download size before initiating it.
Could this be built using Apple Foundation Models instead of llama.cpp today?
Yes, for iOS 26+ targets. Apple Foundation Models provides on-device inference through a high-level API with no model management required — the system model is always available. For apps targeting releases before iOS 26, llama.cpp via Swift bindings remains the only option. offgrid:AI's architecture abstracts the inference layer behind a protocol, so the migration path to Foundation Models is a protocol conformance addition, not an architectural rewrite.
SwiftUI · llama.cpp · GGUF · On-Device AI · Core ML · iOS AI · Privacy-First

Adding on-device AI to an existing iOS app?

The On-Device AI Integration service covers model selection, Swift integration, inference architecture, and production deployment — the same stack used in offgrid:AI.