Almost every AI agent in production reaches the outside world through a clean side door. It calls an API, gets back structured JSON, and never sees a button. That works beautifully right up to the moment the job needs a tool with no API: a 2009 internal app, a state government portal, a desktop accounting package, a vendor dashboard that will never ship a developer interface. There are a lot of those. Most software a human touches at work was never built for a machine to drive.
Computer use is the attempt to reach all of it. Instead of an API, the agent gets a screenshot. It looks at the pixels, decides where to click, and issues a mouse move, a click, or a keystroke. Then it takes another screenshot to see what happened and goes again. The agent operates the computer the way a person does, through the same screen, which means it can in principle use any software a person can. That is the promise. The reality in 2026 is more interesting and more uneven, and it comes with a security problem serious enough that Gartner has told security leaders to block these tools for now.
Origin: from the chat box to the cursor
The idea is old. Screen-scraping and robotic process automation have driven other people's user interfaces for two decades, and the RPA market got large. But classic RPA is brittle in a specific way: it follows recorded coordinates and fixed selectors, so a redesigned button or a moved menu breaks the script. It does not understand what it is looking at. It replays.
The change came when a multimodal model could look at an arbitrary screenshot and reason about it. In October 2024 Anthropic shipped computer use in public beta with an upgraded Claude 3.5 Sonnet, the first frontier model offered this way. The framing in the API docs is plain: the model sees a screen, moves a cursor, clicks, and types. The loop is the whole trick. Screenshot in, reasoning, an action out, screenshot in again. Because the model interprets the image fresh each turn rather than replaying coordinates, a moved button is not fatal. It can find the button.
OpenAI followed in January 2025 with Operator, a browser agent built on a Computer-Using Agent model. Google DeepMind had shown Project Mariner, a Chrome agent, in December 2024. For a moment in early 2025 it looked like a clean three-way race. It did not stay that way, and the reasons matter for anyone deciding whether to build on this.
Present: a real capability jump, and a real ceiling
Start with the honest good news. The capability jump is genuine, and the cleanest way to see it is OSWorld, a benchmark of 369 real tasks across Ubuntu, Windows, and macOS apps, file operations, and multi-app workflows. It is not a toy. The human success rate is about 72 percent. When OSWorld launched in 2024, the best agent scored 12.24 percent. Anthropic's first computer-use model managed 14.9 percent on screenshot-only tasks. By 2026, top scores on the independently run OSWorld board sit in the low 70s, with Claude Opus 4.6 around 72.7 percent, and entries on the tightened OSWorld-Verified leaderboard reach into the high 70s and low 80s. In roughly eighteen months the field went from a tenth of human performance to drawing level with it on this benchmark. That is fast.
It is also where you should slow down. A benchmark score is a single number averaged over a fixed task set, and computer-use agents have a reliability problem that the average hides. A 2026 study, On the Reliability of Computer Use Agents, ran identical OSWorld tasks repeatedly and found the obvious uncomfortable thing: an agent that succeeds once may fail the next time on the exact same task, same model, same instructions. The paper traces it to stochasticity during execution, ambiguity in how the task is worded, and plain behavioral variance run to run. A score of 72 percent does not mean the agent does your task 72 percent of the time. It means that across a fixed set, on the runs that were measured, it cleared 72 percent. Real work is long-horizon and multi-step, and that is exactly where these agents are weakest. Reported figures put agents at 68 to 87 percent of human performance on simple tasks but only 15 to 32 percent on complex workflows. Small errors compound across dozens of clicks, and a single misfire can derail the rest.
Then there is speed and cost. Anthropic's own docs are direct about this: computer-use latency may be too slow for human-facing interaction, so the recommended uses are ones where speed does not matter, like background information gathering and automated software testing. Each step is a full multimodal model call on a fresh screenshot. A task that is one API call for an integrated tool can be forty screenshot-reason-act cycles for a computer-use agent, and you pay tokens for every screenshot. It is the most expensive and slowest way to make software do something. It earns its place only when no cheaper path exists.
The market has already started sorting by that logic. Project Mariner was shut down in May 2026, its technology folded into Gemini and a Chrome feature called auto-browse. OpenAI's standalone Operator was deprecated within roughly seven months of launch and folded into ChatGPT Agent, after struggling to reliably complete purchases on sites with complex JavaScript, CAPTCHAs, and session handling. The pattern is not that screen-driving failed. It is that an API-first integration beats screen-driving whenever the API exists, so the durable use case for computer use is the long tail where it does not. Anthropic's design reflects this. Its Claude Cowork and Claude Code computer use, in preview across macOS and Windows in 2026, tries a connector first, then a browser, and only drives the raw screen as a last resort. Screen control is the fallback, not the default.
Computer use is one half of the story. The other is the browser, where most knowledge work actually happens.
Present: the agentic browser landscape
An agentic browser puts the agent inside the browser itself, where it can read the page, fill forms, click through flows, and carry a multi-step task across tabs. Perplexity shipped Comet first, in July 2025, and by March 2026 it ran on desktop, Android, and iOS with an enterprise edition. OpenAI launched Atlas in October 2025 with an Agent Mode for autonomous tasks. Google built agentic features straight into Chrome, which given Chrome's roughly three billion users makes it the largest deployment of this technology by far. Microsoft added Copilot Mode to Edge. There are independents too: BrowserOS is open source, Opera Neon runs multiple specialized agents, and The Browser Company's Dia was acquired by Atlassian.
Two models are competing here. One is the standalone AI-native browser, Comet and Atlas. The other is an agentic layer bolted onto an incumbent, Chrome and Edge. The incumbents have distribution; the standalones move faster. The functional gap between them is narrowing, and the honest summary is that they all do the same demo well, a flight comparison or a research roundup, and all of them get less reliable as the task gets longer and the sites get less standard. That is the same ceiling as desktop computer use, viewed from the browser.
Future and impact: the security problem nobody has solved
The agentic browser does not just inherit the reliability problem. It adds a security problem that may be worse, and it is the single most important thing a practitioner should understand before deploying one.
The mechanism is indirect prompt injection. A computer-use agent and a browser agent both read whatever is on the page, and they do not reliably separate the user's instruction from text that happens to be on a website. To the model, both are just tokens. So an attacker writes instructions into a page, hidden from the human in white-on-white text, an HTML comment, or a forum spoiler tag, and the agent reads them and may obey. This is not theoretical. In August 2025 Brave's security team demonstrated an indirect prompt injection against Comet: a "summarize this page" request let hidden instructions extract the user's email and a one-time passcode from a logged-in Gmail session, then post both as a Reddit reply. Brave's point is the one that should land. Same-origin policy and CORS, the rules that keep a malicious site from reading your bank tab, are useless here, because the agent operates with the user's own privileges across every authenticated session at once.
That is the part that makes browser agents uniquely exposed. The agent usually runs inside your real browser profile. It is already logged into your email, your bank, your company's tools, your cloud storage. A successful injection does not need to steal a password. It already has the session. Wiz's year-end review cataloged a steady run of named exploits through 2025 and into 2026: CometJacking, a one-click hijack via crafted URL parameters; Tainted Memories, a CSRF flaw that poisoned Atlas's long-term memory with instructions that persisted across sessions; HashJack, which hid instructions in URL fragments; and Scamlexity, which showed AI browsers lack the basic skepticism to spot a phishing storefront and will buy from a fake shop. As Dark Reading put it, AI agents are reopening browser-security ground the industry spent twenty years securing.
The vendors are not in denial, and that is itself telling. OpenAI stated plainly that prompt injection, like scams and social engineering, is unlikely ever to be fully solved. Its mitigations are layered and probabilistic: a reinforcement-learning attacker that hunts for new exploits, faster patching, confirmation prompts before payments, and limits on logged-in access. Anthropic runs classifiers that flag suspected injections in screenshots and steer the model to ask for confirmation, and its docs still tell developers to use a sandboxed VM, withhold sensitive credentials, and allowlist domains. Those are real defenses. None of them is a fix. They reduce the rate; they do not close the hole.
This is why the deployment advice is conservative. Gartner has recommended that CISOs block agentic browsers for now, and that is not a fringe position. The technology is genuinely useful and genuinely promising. It is also early, and the failure modes are not cosmetic. If you build on computer use or an agentic browser today, treat it as an untrusted process: a dedicated low-privilege environment, no standing access to sensitive accounts, a tight allowlist, and a human in the loop on anything that moves money or sends data. That is the same defense-in-depth posture that good guardrails demand of any agentic system, and the reasoning behind it is laid out in our piece on prompt injection. The pattern is the same one that makes ordinary tool use safe: the more an agent can touch, the more carefully you fence what it is allowed to reach.
The honest read on computer use and agentic browsers is that they are a real capability, not a demo trick, and they solve a real problem: the long tail of software that will never get an API. The benchmark progress is fast and not faked. But these agents are slow, expensive, still below human reliability on real multi-step work, and they ship with an unsolved security surface. That is a promising early technology, which is a perfectly good thing to be, as long as nobody mistakes it for a finished one.
Council summary
This post argues that computer use and agentic browsers are a genuine capability rather than a demo trick, because they reach the long tail of software that will never expose an API, and the benchmark evidence backs that up: OSWorld agents climbed from a tenth of human performance to roughly level inside eighteen months. But the post is careful to separate the headline from the reality. Benchmark averages hide a run-to-run reliability problem, the agents stay slow and expensive next to API-first integration, and they ship with an indirect prompt injection surface that vendors openly admit may never be fully closed. The reader's takeaway is practical and load-bearing: build on this only with a low-privilege sandbox, no standing access to sensitive accounts, a tight allowlist, and a human in the loop on anything that moves money or data. Treat it as a promising early technology, and do not mistake it for a finished one.
Comments