On-Device Email Classification with Core ML: How Sorto Processes Zero Server-Side Data

How Sorto implements on-device email classification using Core ML with zero message content transiting to any server. Covers the classification pipeline, feature extraction, actor-isolated inference, confidence thresholds, and incremental classification inside a notification service extension.

By Ehsan Azish · 3NSOFTS · May 2026 · 11 min read

The Structural Problem with Server-Side Email Classification

Email carries more personal context than almost any other data type on a device. Sender relationships, financial notices, health correspondence, legal documents — it is all there, in plaintext, in a single store.

Most email classification products solve the sorting problem the same way: send message content to a remote server, run inference there, return a label. The classification works. The privacy trade-off is structural and non-negotiable — the moment message content leaves the device, the user has no meaningful control over what happens to it.

For users who accept that trade-off, the cloud approach is fine. For users who do not — and for any product targeting privacy-conscious markets or regulated industries — it is disqualifying.

The constraint that shaped everything in Sorto: classification must run entirely on-device, with zero message content transiting to any server.


What Sorto Does

Sorto is an on-device email intelligence app built by 3Nsofts. It sorts incoming mail into eight categories using Core ML inference — sender intent, topic, priority signal — without any message content leaving the device.

Zero bytes of email content are sent to a server. The model runs locally. Classification results are stored locally. The only network activity Sorto performs is the standard mail fetch iOS handles natively.


The Constraints That Shaped the Architecture

Four hard constraints shaped every decision:

  • Email content is private data — no message body or sender metadata may transit to any external server.
  • Inference must complete before the user opens the message — classification latency above ~50ms is perceptible in list-scroll contexts.
  • The model must run on the full supported device range — A12 Bionic through current Apple Silicon, without requiring the Neural Engine exclusively.
  • The app must function without a network connection — classification cannot depend on a remote API being reachable.

Every architectural decision below flows from these four constraints.


Architecture: The Classification Pipeline

Model Selection and Preparation

The instinct is to reach for a large general-purpose language model. The problem: large models carry inference costs that violate the latency constraint on older hardware, and they require far more memory than a focused classification model needs.

Sorto uses a fine-tuned text classification model exported to .mlpackage format. The base architecture is a distilled transformer trained on a labeled email corpus, then quantized to INT8 weights using Core ML Tools. Quantization brings model size down from ~180MB to ~45MB while keeping classification accuracy within 2 percentage points of the full-precision version on the held-out test set.
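The roughly 4x size reduction is what one would expect when 32-bit float weights are stored as 8-bit integers. Core ML Tools performs the quantization at export time in Python; the Swift sketch below only illustrates the arithmetic of symmetric linear quantization on a single weight array. The function names and the per-tensor scale are assumptions for illustration, not Sorto's export script.

```swift
import Foundation

/// Symmetric linear INT8 quantization of one weight array.
/// Each Float (4 bytes) becomes one Int8 (1 byte) plus a shared scale.
func quantizeInt8(_ weights: [Float]) -> (values: [Int8], scale: Float) {
    let maxAbs = weights.map { abs($0) }.max() ?? 0
    // Map the largest magnitude to 127; avoid dividing by zero for all-zero input.
    let scale: Float = maxAbs > 0 ? maxAbs / 127 : 1
    let values = weights.map { weight -> Int8 in
        let scaled = (weight / scale).rounded()
        return Int8(max(-127, min(127, scaled)))  // clamp before converting
    }
    return (values, scale)
}

/// Approximate reconstruction used at inference time.
func dequantize(_ values: [Int8], scale: Float) -> [Float] {
    values.map { Float($0) * scale }
}
```

The rounding error per weight is bounded by about half the scale, which is where the small accuracy loss comes from.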

Core ML supports on-device personalization through MLUpdateTask, provided the model is exported as updatable. That means the classification model can be retrained on-device using user corrections without shipping a new binary.

For a detailed breakdown of quantization trade-offs and what INT8 costs in accuracy terms, see Core ML Optimization Techniques.

Feature Extraction

The model does not receive raw email body text. It receives a structured feature vector derived from the message.

Feature extraction runs in a dedicated actor, EmailFeatureExtractor, that processes:

  • Subject line tokens (normalized, stopwords removed)
  • Sender domain and display name tokens
  • Message length bucket (short / medium / long — not the actual length)
  • Thread depth signal (reply count, capped at 5)
  • Temporal features (time of day, day of week)

Body content is not included in the feature vector. This is a deliberate design decision, not a performance shortcut. The model never sees the message body, only subject and sender tokens plus structural signals about the message.

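import CoreML

actor EmailFeatureExtractor {
    // Builds the model input from subject tokens, sender domain, and structural
    // signals. The MessageSummary property names and feature keys below are
    // assumed for illustration; message.body is never accessed.
    func extract(from message: MessageSummary) throws -> MLFeatureProvider {
        try MLDictionaryFeatureProvider(dictionary: [
            "subject_tokens": message.subjectTokens.joined(separator: " "),
            "sender_domain": message.senderDomain,
            "thread_depth": min(message.replyCount, 5),
        ])
    }
}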
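The individual signals listed above can be sketched as small pure functions. This is a minimal illustration, not Sorto's implementation: the bucket cut-offs, the stopword list, and every function name here are hypothetical, and the message size is assumed to come from mail metadata rather than from reading the body.

```swift
import Foundation

enum LengthBucket: String { case short, medium, long }

// Hypothetical cut-offs; the article does not state the real ones.
func lengthBucket(forByteCount count: Int) -> LengthBucket {
    switch count {
    case ..<500: return .short
    case ..<5000: return .medium
    default: return .long
    }
}

// Reply count capped at 5, as described in the list above.
func threadDepthSignal(replyCount: Int) -> Int {
    min(max(replyCount, 0), 5)
}

// Illustrative stopword list only.
let stopwords: Set<String> = ["a", "an", "the", "re", "fwd", "of", "to", "for", "your", "is"]

// Lowercase, split on non-alphanumerics, drop empty tokens and stopwords.
func subjectTokens(_ subject: String) -> [String] {
    subject.lowercased()
        .components(separatedBy: CharacterSet.alphanumerics.inverted)
        .filter { !$0.isEmpty && !stopwords.contains($0) }
}

// Hour of day (0-23) and weekday (1 = Sunday) in the given calendar.
func temporalFeatures(for date: Date, calendar: Calendar) -> (hour: Int, weekday: Int) {
    let parts = calendar.dateComponents([.hour, .weekday], from: date)
    return (parts.hour ?? 0, parts.weekday ?? 1)
}
```

Bucketing the length instead of passing the exact value keeps the feature coarse, which matches the stated intent of exposing structure rather than content.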

The Inference Layer

MLModel inference runs synchronously inside a dedicated actor, off the main actor. The call is straightforward: pass the feature provider, receive an MLFeatureProvider containing the predicted label and a confidence dictionary.

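import CoreML

actor ClassificationEngine {
    private let model: MLModel

    init(model: MLModel) { self.model = model }

    func classify(_ features: MLFeatureProvider) throws -> ClassificationResult {
        let prediction = try model.prediction(from: features)
        // Output names "classLabel" and "classLabelProbs" are assumed; use the
        // names declared by the model. The ClassificationResult init is assumed too.
        let label = prediction.featureValue(for: "classLabel")?.stringValue ?? ""
        let probs = prediction.featureValue(for: "classLabelProbs")?.dictionaryValue ?? [:]
        return ClassificationResult(label: label, confidence: probs[label]?.doubleValue ?? 0)
    }
}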

The actor isolation matters. Classification runs off the main thread by construction — no DispatchQueue.global() calls, no Task.detached workarounds. The Swift concurrency model handles scheduling.

Label Mapping and Confidence Thresholds

The model outputs a probability distribution across the label set. Sorto applies a confidence threshold before committing a label — messages below the threshold go into an "Unsorted" bucket rather than being forced into a category.

The threshold is configurable per label. Precision-sensitive labels (Finance, Legal) use a higher threshold than general ones (Newsletters, Updates). The asymmetry is intentional: a false positive on a Finance label is more disruptive than an unclassified message.
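The per-label threshold logic can be sketched in a few lines. The threshold values, the fallback, and the ThresholdPolicy type are hypothetical; only the label names and the "Unsorted" bucket come from the text above.

```swift
struct ThresholdPolicy {
    var perLabel: [String: Double]  // label -> minimum confidence to commit
    var fallback: Double            // used for labels without an explicit entry

    /// Returns the top label if it clears its own threshold, else "Unsorted".
    func resolve(_ probabilities: [String: Double]) -> String {
        guard let best = probabilities.max(by: { $0.value < $1.value }) else {
            return "Unsorted"
        }
        let threshold = perLabel[best.key] ?? fallback
        return best.value >= threshold ? best.key : "Unsorted"
    }
}

// Illustrative values: stricter for precision-sensitive labels.
let policy = ThresholdPolicy(
    perLabel: ["Finance": 0.90, "Legal": 0.90, "Newsletters": 0.60, "Updates": 0.60],
    fallback: 0.75
)
```

With these example values, a message scored 0.80 for Finance stays in "Unsorted", while a message scored 0.70 for Newsletters is committed to that category.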

Incremental Classification on New Mail

New messages arrive via a UNNotificationServiceExtension. The extension instantiates the classification pipeline, runs inference on the incoming message summary, and writes the result to a shared App Group container before the notification surfaces to the user.

The binding constraint is the roughly 30-second execution budget the system allocates to notification service extensions. Feature extraction and inference complete in under 10ms on A15 hardware, well within budget. On A12, measured end-to-end time is 28ms, still within budget by a wide margin.
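The hand-off between the extension and the main app can be sketched as a small file-backed store. This is an assumption-laden illustration: the StoredClassification and SharedResultStore types are hypothetical, message IDs are assumed to be filesystem-safe, and the directory is passed in so the sketch runs anywhere. In an extension the directory would typically come from FileManager's containerURL(forSecurityApplicationGroupIdentifier:).

```swift
import Foundation

struct StoredClassification: Codable, Equatable {
    let messageID: String
    let label: String
    let confidence: Double
}

struct SharedResultStore {
    // In the app and the extension, this would point into the App Group container.
    let directory: URL

    private func fileURL(for messageID: String) -> URL {
        directory.appendingPathComponent("\(messageID).json")
    }

    /// Atomic write, so the app never reads a half-written result.
    func save(_ result: StoredClassification) throws {
        let data = try JSONEncoder().encode(result)
        try data.write(to: fileURL(for: result.messageID), options: .atomic)
    }

    func load(messageID: String) throws -> StoredClassification {
        let data = try Data(contentsOf: fileURL(for: messageID))
        return try JSONDecoder().decode(StoredClassification.self, from: data)
    }
}
```

One file per message keeps writes from the extension independent of reads in the app; a shared database would work as well but needs more care around concurrent access.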


Why the Naive Approach Fails

The standard cloud classification architecture is straightforward: message arrives, content is sent to an API endpoint, the endpoint returns a label, the app displays it.

Three failure modes make this unacceptable for Sorto's design premise.

Privacy. Message content transits to a third-party server. The user has no visibility into retention, logging, or downstream use.

Latency. A round-trip to a classification API — even a fast one — adds 200–800ms. That latency is perceptible when classifying a batch of messages on first launch or when processing a notification. For a comparison of measured figures, see Core ML vs Cloud AI APIs for iOS.

Availability. The app becomes non-functional when the API is unreachable. Offline use — on a plane, in a low-signal environment — breaks the core feature entirely.

On-device inference eliminates all three failure modes. The trade-off is model size and the engineering cost of the on-device pipeline. That trade-off is worth making.


Performance Characteristics

Measured on production hardware, classifying a single message summary (subject + sender + structural signals):

| Device | End-to-End Latency |
|---|---|
| A12 Bionic | 28ms |
| A15 Bionic | 9ms |
| M2 (iPad) | 6ms |

Batch classification on first launch — processing an existing inbox of 500 messages — runs as a background Task with .background priority. On A15, 500 messages classify in under 8 seconds. The UI remains fully responsive; classification results populate the list incrementally as they complete.
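One way to get incremental results out of a background batch is an AsyncStream fed by a Task at .background priority. The sketch below assumes a synchronous classify closure standing in for the real pipeline; the Classified type and classifyBatch function are hypothetical names.

```swift
struct Classified: Sendable {
    let messageID: String
    let label: String
}

/// Classifies messages one by one at background priority and yields each
/// result as soon as it is ready, so a list can update incrementally.
func classifyBatch(
    messageIDs: [String],
    classify: @escaping @Sendable (String) -> String
) -> AsyncStream<Classified> {
    AsyncStream { continuation in
        let task = Task(priority: .background) {
            for id in messageIDs {
                if Task.isCancelled { break }
                continuation.yield(Classified(messageID: id, label: classify(id)))
            }
            continuation.finish()
        }
        // Stop work if the consumer goes away (for example, the view closes).
        continuation.onTermination = { _ in task.cancel() }
    }
}
```

A consumer would iterate with `for await result in classifyBatch(...)` and apply each result to its list model as it arrives.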

Memory footprint for the loaded model is 47MB — within the budget for a notification service extension, which is the tightest constraint in the pipeline.

For full device-by-device inference benchmarks across model types and compute units, see Core ML Inference Performance Benchmarks 2026.


Privacy as a Structural Property, Not a Feature

Privacy in Sorto is not a setting or a policy statement. It is a structural property of the architecture — the feature extraction step never accesses message body content, and the inference layer never opens a network socket.

The same design principle applies in ECHO Survival AI, where all LLM inference runs on-device with zero cloud dependency, and in the CalmLedger privacy-first finance tracking architecture, where financial data never leaves the device.

The pattern is consistent: when privacy is the constraint, the architecture must enforce it — not rely on policy to maintain it.

For teams building AI features under similar constraints, the Swift 6 AI Integration guide covers actor isolation patterns, MLModel lifecycle management, and concurrency-safe inference pipelines in detail.

