Foundation Models · Updated June 2026

Fix exceededContextWindowSize in Foundation Models: Production Recovery Pattern

Author: Ehsan Azish · 3NSOFTS
Updated: June 2026
Read time: 16 min
Level: Intermediate → Senior
Platform: iOS 26+, Foundation Models, async/await

Implementation Notes

~/ What broke: The session transcript filled the on-device model context window.
~/ What to do: Trim session history, summarize state, and recover without losing the user's current task.

If you've shipped anything chat-like on Apple's Foundation Models, you've hit this:

LanguageModelSession.GenerationError.exceededContextWindowSize

The on-device model runs in a small, fixed context window (4096 tokens through iOS 26.3). System instructions, every prior prompt, and every prior response all accumulate in the session transcript. In any multi-turn flow the window fills, and the next respond(to:) throws.
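To get a feel for how quickly that happens, here is a back-of-envelope sketch. It assumes roughly four characters per token for English text, which is a planning heuristic and not the model's tokenizer. `estimatedTokens` and `turnsUntilOverflow` are hypothetical helpers, not framework API.

```swift
// Rough token estimate for planning only. The 4-characters-per-token ratio
// is an assumed rule of thumb for English text, not the model's tokenizer.
func estimatedTokens(_ text: String) -> Int {
    (text.count + 3) / 4   // integer division, rounded up
}

/// Hypothetical helper: returns the zero-based index of the first turn whose
/// running total (instructions + all prompts + all responses so far) passes
/// the window, or nil if every turn fits.
func turnsUntilOverflow(instructions: String,
                        turns: [(prompt: String, response: String)],
                        window: Int = 4096) -> Int? {
    var used = estimatedTokens(instructions)
    for (index, turn) in turns.enumerated() {
        used += estimatedTokens(turn.prompt) + estimatedTokens(turn.response)
        if used > window { return index }   // this is the turn that overflows
    }
    return nil
}
```

With 400 characters of instructions and turns of about 1,600 characters each (roughly 400 estimated tokens per turn), `turnsUntilOverflow` returns 9: the tenth turn is the one that pushes the estimate past 4096.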

The trap most guides walk you into: they tell you to catch the error and trim the transcript. You can't. By the time the catch block runs, the request already failed and the session is unusable. There is no in-place trim. This guide covers what actually works in production.

Why the obvious fix doesn't work

The common advice looks like this:

do {
    let answer = try await session.respond(to: prompt)
} catch LanguageModelSession.GenerationError.exceededContextWindowSize {
    // "Just trim the transcript here" — there is nothing to trim.
    // The session is dead. You cannot mutate its way back to working.
}

The session's transcript is not a mutable buffer you prune mid-flight. Once the window is exceeded, the only path forward is a new session. The real engineering question is: how do you start a new session without the user losing their conversation?

Two strategies, depending on whether continuity matters.

Strategy 1: Hard reset (continuity doesn't matter)

For one-shot tools — a summarizer, a classifier, a "generate tags for this note" feature — there is no conversation to preserve. Each call is independent. The fix is trivial: catch the error, spin up a fresh session, retry once.

import FoundationModels

final class GenerationService {
    private var session = LanguageModelSession()

    func respond(to prompt: String) async throws -> String {
        do {
            return try await session.respond(to: prompt).content
        } catch LanguageModelSession.GenerationError.exceededContextWindowSize {
            // Fresh session, single retry. Stateless tasks lose nothing.
            session = LanguageModelSession()
            return try await session.respond(to: prompt).content
        }
    }
}

The single-retry guard matters: don't loop. If a single prompt plus instructions exceeds the window on a clean session, retrying forever won't help — that's a prompt-size problem (see pre-flight checks below), not a transcript-accumulation problem.
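One way to make that distinction explicit is to turn a second overflow into its own error, so callers can tell "transcript filled up" apart from "this prompt never fits". The sketch below is framework-agnostic on purpose. `RecoveryError`, `respondWithSingleRetry`, and the closure parameters are all hypothetical names; in a real service `isContextOverflow` would check for `LanguageModelSession.GenerationError.exceededContextWindowSize` and `reset` would assign a fresh session.

```swift
// Hypothetical names throughout; this is a shape, not framework API.
enum RecoveryError: Error, Equatable {
    /// The prompt overflowed even on a freshly reset session.
    case promptTooLargeForCleanSession
}

/// Runs `attempt`; on a context overflow, resets once and tries again.
/// A second overflow is reported as a prompt-size problem and never retried.
/// Any other error propagates unchanged.
func respondWithSingleRetry(
    attempt: () async throws -> String,
    reset: () -> Void,
    isContextOverflow: (any Error) -> Bool
) async throws -> String {
    do {
        return try await attempt()
    } catch let error where isContextOverflow(error) {
        reset()  // e.g. session = LanguageModelSession()
        do {
            return try await attempt()
        } catch let error where isContextOverflow(error) {
            throw RecoveryError.promptTooLargeForCleanSession
        }
    }
}
```

The caller can then handle `promptTooLargeForCleanSession` by shortening or chunking the input instead of retrying.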

Strategy 2: Summarize-and-carry (continuity matters)

For a real assistant, dropping history is jarring — the model "forgets" mid-conversation. The production pattern is to summarize the transcript so far, then seed a new session with that summary as instructions. You trade verbatim history for a compressed memory, which is exactly the right tradeoff on a 4096-token budget.

import FoundationModels

final class ConversationService {
    private var session: LanguageModelSession
    private let instructions: String

    init(instructions: String) {
        self.instructions = instructions
        self.session = LanguageModelSession(instructions: instructions)
    }

    func respond(to prompt: String) async throws -> String {
        do {
            return try await session.respond(to: prompt).content
        } catch LanguageModelSession.GenerationError.exceededContextWindowSize {
            try await compactAndReset()
            return try await session.respond(to: prompt).content
        }
    }

    /// Summarize the dead session's transcript, then rebuild a fresh
    /// session whose instructions carry that summary forward.
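    private func compactAndReset() async throws {
        // The old session is unusable for respond(), but its transcript
        // is still readable. Use a SEPARATE throwaway session to summarize it.
        //
        // The transcript just overflowed the window, so pasting all of it into
        // the summarizer's prompt would overflow the summarizer too. Keep only
        // the most recent slice. 8,000 characters is an assumed budget (very
        // roughly half the window for English text); tune it for your content.
        let maxHistoryCharacters = 8_000
        let history = String(
            transcriptText(from: session.transcript).suffix(maxHistoryCharacters)
        )

        let summarizer = LanguageModelSession()
        let summary = try await summarizer.respond(
            to: "Summarize this conversation in under 200 words, "
              + "preserving facts, decisions, and the user's intent:\n\n\(history)"
        ).content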

        session = LanguageModelSession(
            instructions: """
            \(instructions)

            Conversation so far (summarized):
            \(summary)
            """
        )
    }

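    private func transcriptText(from transcript: Transcript) -> String {
        transcript.compactMap { entry -> String? in
            switch entry {
            case .prompt(let p):   return "User: \(plainText(p.segments))"
            case .response(let r): return "Assistant: \(plainText(r.segments))"
            default:               return nil  // instructions, tool calls, tool output
            }
        }.joined(separator: "\n")
    }

    /// Transcript.Segment is an enum, not a struct with a `text` property,
    /// so pull the string out of the `.text` case and skip structured segments.
    private func plainText(_ segments: [Transcript.Segment]) -> String {
        segments.compactMap { segment -> String? in
            if case .text(let text) = segment { return text.content }
            return nil
        }.joined()
    }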
}

Note on the Transcript API surface. Apple has been adjusting Transcript entry types across point releases. Treat the transcriptText(from:) extraction above as a shape, not a contract — verify the case names against the SDK version you target and keep this isolated in one function so a single edit absorbs API churn.

The summary itself costs tokens in the new session's instructions, so it's not free — but a 150–200 word summary buys you far more headroom than a verbatim transcript, and the model stays coherent across the seam.
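A long conversation can compact more than once, and that exposes a gap: the summary from the first compaction lives in the new session's instructions, and transcriptText(from:) skips instruction entries. A second compaction would therefore summarize only the turns since the first one, and the earlier memory would be lost. A minimal sketch of one fix is to feed the previous summary back into the summarizer alongside the recent turns. `summarizerInput` is a hypothetical helper, and the character cap is an assumed budget, not a framework limit.

```swift
/// Hypothetical helper: builds the text handed to the summarizer session.
/// - previousSummary: what the last compaction produced, if any.
/// - recentHistory: flattened "User:/Assistant:" lines since then.
/// - maxCharacters: assumed approximate cap so the summarizer's own prompt
///   stays well inside the window (labels add a few characters on top).
func summarizerInput(previousSummary: String?,
                     recentHistory: String,
                     maxCharacters: Int = 8_000) -> String {
    var parts: [String] = []
    if let previousSummary, !previousSummary.isEmpty {
        parts.append("Earlier summary:\n\(previousSummary)")
    }
    // Spend whatever room is left on the newest part of the history; the
    // oldest lines are the ones the earlier summary is most likely to cover.
    let used = parts.joined().count
    let room = max(0, maxCharacters - used)
    parts.append("Recent turns:\n\(recentHistory.suffix(room))")
    return parts.joined(separator: "\n\n")
}
```

To use it, keep the latest summary in a stored property (say, a hypothetical `lastSummary`) after each compaction and pass it in on the next one, so each new summary is built from the old summary plus the new turns.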

Pre-flight checks: don't hit the error at all (iOS 26.4+)

Recovery is the safety net. The better posture is to not blow the window in the first place. As of iOS 26.4, Apple exposes a token-count API so you can measure instructions and prompts before sending them, and proactively compact when you're approaching the ceiling.

// Available iOS 26.4+. Gate with #available — do NOT assume it exists
// on 26.0–26.3 devices still in the wild.
func shouldCompact(before prompt: String) -> Bool {
    guard #available(iOS 26.4, *) else { return false }
    // Pseudocode for the shape; confirm the exact API name in your SDK.
    let used = session.estimatedTokenCount      // current transcript cost
    let incoming = SystemLanguageModel.default.tokenCount(for: prompt)
    let budget = 4096
    return (used + incoming) > Int(Double(budget) * 0.85) // 15% safety margin
}

Calling compactAndReset() proactively when you cross ~85% of budget turns a hard failure into an invisible background operation. The user never sees an error; they just keep talking.
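It can help to keep the threshold policy separate from the measurement, so the same rule applies whether the counts come from the SDK on 26.4+ or from your own estimate on older releases. The sketch below is plain Swift with no framework calls; `ContextBudget` and its `Action` cases are hypothetical names, and the 4096 and 0.85 values are simply the ones used in this article.

```swift
/// Hypothetical policy type: decides what to do before sending a prompt,
/// given token counts from whatever source is available.
struct ContextBudget {
    let windowTokens: Int          // e.g. 4096
    let compactAtFraction: Double  // e.g. 0.85 leaves a 15% safety margin

    /// Token count above which we stop appending to the current session.
    var threshold: Int { Int(Double(windowTokens) * compactAtFraction) }

    enum Action: Equatable {
        case send          // fits comfortably, send as-is
        case compactFirst  // transcript is the problem: compact, then send
        case rejectPrompt  // prompt alone is too big: compaction won't help
    }

    func action(usedTokens: Int, incomingTokens: Int) -> Action {
        if incomingTokens > threshold { return .rejectPrompt }
        if usedTokens + incomingTokens > threshold { return .compactFirst }
        return .send
    }
}
```

With a 4096-token window and a 0.85 fraction the threshold is 3481 tokens. A session at 3,200 tokens with a 400-token prompt gets `.compactFirst`, while a 3,600-token prompt gets `.rejectPrompt` no matter how empty the session is.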

Production checklist

Single retry, never a loop. A clean-session failure is a prompt-size bug, not a transcript bug.
Isolate transcript extraction in one function so SDK churn touches one place.
Use a throwaway session to summarize — never the dead one.
Gate the token-count API behind #available(iOS 26.4, *); 26.0–26.3 devices won't have it.
Compact proactively at ~85% budget rather than reactively at 100%.
Surface nothing to the user. Context recovery is plumbing; it should be invisible.

Why this matters for shipped apps

This is the failure mode that separates a demo from a product. Every "build a chat app with Foundation Models" tutorial works on turn three and never sees turn thirty. A real conversational feature lives or dies on whether it survives a long session gracefully — and the framework gives you no in-place trim, so you have to architect the recovery yourself.

If you're integrating on-device AI into a production iOS or macOS app and want the architecture audited before you ship, that's exactly the kind of work we do at 3NSOFTS.
