Lab Findings / AI Systems
AI-Native OS Experiments: What We Built and Where It Broke
We prototyped an AI-native OS layer — shell command translation, local LLM file management, and context-aware workflows. Here is what worked, what we abandoned, and the specific problem that determined the prototype's limits.
Lab Finding · 3NSOFTS · February 2026 · Status: Prototype (paused)

What we were testing
The premise: a traditional OS is organized around file hierarchies and explicit commands. An AI-native OS would organize around intent — you describe what you want, and the system determines how to execute it. The experiment was not about building a full operating system. It was about testing whether the AI translation layer was reliable enough to be useful.
The prototype was a macOS layer built in Python over a local Ollama instance. It intercepted natural language input, translated it to shell commands using a locally running model (Llama 3.1 8B), and presented the translated command for approval before execution. The key constraint: the system never executed a command without explicit user confirmation.
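That translate-then-confirm loop can be sketched as follows. This is an illustration under stated assumptions, not the prototype's actual code: it assumes Ollama's default local endpoint (`http://localhost:11434/api/generate`), the `llama3.1:8b` model tag, and a hypothetical prompt wording.

```python
import json
import subprocess
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "llama3.1:8b"  # illustrative model tag

def build_prompt(request: str) -> str:
    """Wrap the user's natural-language request in a translation instruction.
    (Prompt wording is hypothetical.)"""
    return (
        "Translate the following request into a single POSIX shell command. "
        "Reply with the command only, no explanation.\n\n"
        f"Request: {request}"
    )

def translate(request: str) -> str:
    """Ask the local model for a shell command (blocking, non-streaming)."""
    payload = json.dumps({
        "model": MODEL,
        "prompt": build_prompt(request),
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()

def confirm_and_run(command: str, ask=input) -> bool:
    """The key constraint: never execute without explicit confirmation."""
    answer = ask(f"Run this command? [y/N]\n  {command}\n> ")
    if answer.strip().lower() == "y":
        subprocess.run(command, shell=True, check=False)
        return True
    return False
```

The `ask` parameter is injected so the confirmation gate can be tested without a terminal; in interactive use it defaults to `input`.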
What worked
- ✓ Common file operations translated reliably. “Find all PDFs modified in the last 7 days and move them to ~/Documents/recent” consistently produced the correct `find` + `mv` pipeline on the first attempt. For well-defined file operations in a bounded vocabulary, the translation was reliable enough to use.
- ✓ Latency was acceptable. On an M2 MacBook Pro, Llama 3.1 8B via Ollama returned command translations in 1.5–3 seconds. For non-interactive workflows (batch operations, scheduled tasks) this was acceptable. For interactive typing, it adds friction.
- ✓ Privacy held. No network traffic during inference. All prompts stayed on device. For workflows involving sensitive file paths or credentials, on-device translation is a real advantage over cloud-based alternatives.
- ✓ Context from prior commands helped accuracy. After the system observed that the user was working in a particular project directory, translations for relative paths were more accurate. Context-aware history improved precision meaningfully.
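The context mechanism can be sketched as a rolling window of recent commands prepended to the translation prompt. This is a minimal illustration of the idea, not the prototype's implementation; the class and field names are hypothetical.

```python
from collections import deque

class CommandContext:
    """Keep a short rolling window of recent commands plus the working
    directory, so relative-path requests can be grounded in what the
    user was just doing."""

    def __init__(self, max_items: int = 5):
        self.history = deque(maxlen=max_items)  # oldest entries drop off
        self.cwd = "~"

    def record(self, command: str, cwd: str) -> None:
        """Call after each approved command executes."""
        self.history.append(command)
        self.cwd = cwd

    def as_prompt_prefix(self) -> str:
        """Render the context as text to prepend to the model prompt."""
        prefix = f"Current directory: {self.cwd}\n"
        if self.history:
            recent = "\n".join(f"  $ {c}" for c in self.history)
            prefix += f"Recent commands:\n{recent}\n"
        return prefix
```

A bounded `deque` keeps the prompt from growing without limit, which matters when every extra token adds latency on an 8B local model.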
Where it broke
- ✗ Ambiguous file references failed non-trivially. “Archive the old project” produced commands targeting plausible but wrong directories when multiple similarly named directories existed. The error was syntactically correct — the command would have executed without error while operating on the wrong target. This is a different failure mode than a syntax error; it requires contextual knowledge of user intent that the model did not have.
- ✗ Destructive operations required additional guardrails that added friction. `rm`, `chmod`, and permission-modifying operations were the hardest to present confidently. The confirmation flow added necessary safety but broke the “just tell it what to do” mental model that made the prototype compelling in the first place.
- ✗ Multi-step operations could not be composed reliably. Single-step translations worked well. Workflows requiring 3+ coordinated commands were underspecified — the model often translated the first step correctly and approximated or failed the subsequent steps. Handling stateful workflows with branching conditions was not feasible without a more structured planning layer.
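One mitigation for the ambiguity failure is to refuse to act whenever a fuzzy reference matches more than one candidate, and ask a clarifying question instead. A minimal sketch; the matching heuristic here is hypothetical and deliberately crude:

```python
from pathlib import Path
from typing import Optional

STOPWORDS = {"the", "a", "an"}

def resolve_target(phrase: str, root: Path) -> Optional[Path]:
    """Map a fuzzy reference like 'the old project' onto exactly one
    directory under `root`. Return None when zero or multiple
    directories match: executing a plausible-but-wrong match is the
    failure mode described above, so ambiguity must halt, not guess."""
    words = [w for w in phrase.lower().split() if w not in STOPWORDS]
    candidates = [
        d for d in root.iterdir()
        if d.is_dir() and any(w in d.name.lower() for w in words)
    ]
    return candidates[0] if len(candidates) == 1 else None
```

The interesting property is the failure behavior, not the matcher: `None` forces the UX into a clarifying question, which trades fluency for correctness on exactly the cases that would otherwise execute silently against the wrong target.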
The UX trust problem that stopped the prototype
The fundamental issue is not accuracy — it is trust calibration. When a shell command executes incorrectly, the user knows immediately. When an AI-translated command executes correctly but on the wrong target, the user may not know until the damage is done.
The confirmation step (show the translated command before executing) is the correct safety mechanism, but it forces the user to evaluate the generated command anyway. If the user has to read and verify the command, they need to understand shell syntax — which is the same skill the system was designed to eliminate. Users who already understand shell syntax are the wrong audience; the intended audience (non-technical users) cannot verify the generated commands.
This is not a solvable problem with a better model. It is a structural property of the use case: the safer you make the execution (by requiring verification), the less useful the abstraction becomes for the users who need it most. The prototype was paused at this finding.
What this informed
The findings directly informed offgrid:AI's UX design for AI interaction: the AI layer is conversational (generating text output, not executable system commands), which removes the trust calibration problem entirely. They also informed our evaluation of on-device AI for iOS — the privacy advantage is real and relevant for specific use cases, but the accuracy bar for action-executing AI is higher than for text-generating AI.
The experiment is paused, not abandoned. The reliability threshold for non-destructive operations was cleared. The remaining problem is the UX layer for destructive operations and multi-step workflows — both requiring more structured planning architecture than a single-shot LLM prompt.
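What a more structured planning layer might look like, sketched as an explicit plan object rather than a single-shot prompt. The names and structure are illustrative assumptions, not a design we shipped: each step carries its own confirmation gate, and a failed step halts the plan instead of letting later steps run against bad state.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Step:
    description: str
    command: str
    destructive: bool = False  # rm/chmod-class steps can get a stricter gate

@dataclass
class Plan:
    """A multi-step workflow as an inspectable sequence of steps."""
    steps: List[Step] = field(default_factory=list)

    def run(self,
            execute: Callable[[str], int],
            confirm: Callable[[Step], bool]) -> int:
        """Run steps in order; return how many completed. A declined
        confirmation or a non-zero exit code stops the plan."""
        done = 0
        for step in self.steps:
            if not confirm(step):
                break
            if execute(step.command) != 0:
                break
            done += 1
        return done
```

Injecting `execute` and `confirm` keeps the planner testable and keeps the never-execute-without-confirmation constraint structural rather than conventional.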
