Best practices for computer and browser use with Claude
Practical guidance for developers building computer and browser use integrations with the Claude model family.
Category: Agents
Product: Claude Platform
Date: May 13, 2026
Reading time: 5 min
Claude's latest models represent a significant step forward in computer and browser use capabilities. With these capabilities, LLMs can now power increasingly complex agentic systems that do real work, like building software applications and automating workflows across multiple, disparate technologies.
In this blog post, we share best practices for using Claude with computer and browser use, ranging from simple configuration changes to more advanced integration patterns. We hope this piece helps as you start integrating Claude's computer and browser use capabilities into your product. We are also releasing a new demo implementation which encapsulates some of these best practices and provides additional tools useful for developing on top of Claude's computer use capabilities.
Note that these recommendations apply to the Claude 4.6 family (Opus 4.6, Sonnet 4.6, Haiku 4.5) and Claude Opus 4.7 unless otherwise noted. Where guidance differs between the 4.6 family and Opus 4.7, we call it out inline. Our findings are based on internal experimentation and may be updated in the future as new models and techniques emerge.
Getting started: resolution and scaling
Click accuracy is the foundation of any computer use integration. If clicks don't land where they should, nothing downstream works: forms don't get filled, buttons don't get pressed, and workflows fail. The single highest-impact optimization is also one of the simplest: downscale your screenshots before sending them to the API.
Ensure proper scaling
When you send a screenshot to Claude’s Computer Use API, the model sees it and returns click coordinates in the display_width_px / display_height_px coordinate space you specified. But there's an important constraint: the API has internal processing limits on image size. Images that exceed these limits get downscaled before the model sees them, which means the model is clicking based on a degraded version of the image while your harness expects coordinates aligned to the original resolution.
For our Claude 4.6 model family, the API's limits are:
Max long edge: 1568 pixels
Max total pixels: 1.15 megapixels
Images exceeding either limit get internally downscaled
Our Opus 4.7 model supports higher resolution. The limits are:
Max long edge: 2576 pixels
Max total pixels: 3.75 megapixels
Images exceeding either limit get internally downscaled
When the coordinate space and the model's perceived image don't match, the model predicts clicks against the downscaled image it actually sees, while your harness interprets those coordinates at a different scale. This is the primary cause of click inaccuracy at high resolutions. The fix is straightforward: always downscale your screenshots to fit within these limits before sending them to the API. We consistently observe significant accuracy degradation when images exceed the limits, and this single change is worth more than almost any other optimization.
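To make the limits concrete, the sketch below checks whether a screenshot fits a model family's limits and estimates how much it would need to shrink. The `LIMITS` table copies the figures listed above; the function names and the family keys (`"4.6"`, `"opus-4.7"`) are illustrative assumptions, and the API's internal resizing may differ in detail.

```python
import math

# Limits as listed in this post: (max long edge in px, max total pixels).
# The dictionary keys are illustrative labels, not API identifiers.
LIMITS = {
    "4.6": (1568, 1_150_000),
    "opus-4.7": (2576, 3_750_000),
}


def fits_limits(width: int, height: int, family: str = "4.6") -> bool:
    """Return True if an image of this size is within both limits."""
    max_long_edge, max_pixels = LIMITS[family]
    return max(width, height) <= max_long_edge and width * height <= max_pixels


def required_scale(width: int, height: int, family: str = "4.6") -> float:
    """Return the factor (<= 1.0) to apply to both dimensions so the image
    fits both limits; 1.0 means no downscaling is needed."""
    max_long_edge, max_pixels = LIMITS[family]
    edge_scale = max_long_edge / max(width, height)
    pixel_scale = math.sqrt(max_pixels / (width * height))
    return min(1.0, edge_scale, pixel_scale)
```

For example, `fits_limits(1920, 1080, "4.6")` is false because 1080p exceeds both the long edge and the pixel budget of the 4.6 family, while the same size fits under the Opus 4.7 limits.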
Recommended resolutions
Start with 1280x720. This is a safe, practical default for most use cases. It uses about 80% of the pixel budget, stays well within both the long edge and total pixel limits, and is a standard resolution that models have seen during training. It works well for both modern web UIs and legacy desktop applications.
If you are using Opus 4.7, we recommend starting with 1080p, as this brings a meaningful quality lift over 720p and provides a good balance between token use and performance.
For developers who want to maximize the visual information the model receives, we also recommend a "max API fit" approach: computing the optimal resolution per-image based on the source's native aspect ratio:
```python
import math

# 1568 for 4.6 family, 2576 for Opus 4.7
MAX_LONG_EDGE = 1568
# 1.15MP for 4.6 family, 3.75MP for Opus 4.7
MAX_PIXELS = 1_150_000


def compute_max_api_fit(native_w, native_h):
    """Compute the largest resolution that fits API limits while preserving aspect ratio."""
    aspect = native_w / native_h

    # Compute max dimensions from pixel budget
    h_from_pixels = math.sqrt(MAX_PIXELS / aspect)
    w_from_pixels = h_from_pixels * aspect

    # Apply long edge constraint
    if native_w >= native_h:
        w = min(w_from_pixels, MAX_LONG_EDGE)
        h = w / aspect
    else:
        h = min(h_from_pixels, MAX_LONG_EDGE)
        w = h * aspect

    # Never upscale beyond native
    w = min(w, native_w)
    h = min(h, native_h)

    return int(w), int(h)
```
This approach is slightly more complex, but it preserves the source aspect ratio and uses the full pixel budget available for each image. The accuracy improvement over a fixed 1280x720 is modest, but the implementation is straightforward and avoids the distortion that occurs when a source with a different aspect ratio (for example, 4:3 or 16:10) is forced into a fixed 16:9 display resolution.
Resolutions to avoid:
Native resolution (unscaled): Unless your source images happen to be below the resolution limits, sending native resolution screenshots is the most common cause of poor click accuracy.
Very low resolutions (below 960x540): With low resolution images, too much detail is lost for the model to accurately identify small UI elements.
Unscaled macOS captures: A common issue for browser use is that screenshots on macOS are often captured with a device pixel ratio of 2, which means you can end up with images that are 2x the resolution of the screen coordinates.
If you are on the 4.6 family, avoid 1920x1080 and above: These exceed the pixel limit and will be silently downscaled. On Opus 4.7 the ceiling is higher (3.75 MP), so 1080p and 1440p are within budget; still avoid sending native 4K without downscaling.
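The macOS device pixel ratio issue above can be handled by comparing the captured image size with the logical screen size. This is a minimal sketch, assuming you can obtain the logical screen size (in points) from your automation library and that your input tool clicks in logical points; both function names are hypothetical.

```python
def device_pixel_ratio(image_w: int, image_h: int,
                       logical_w: int, logical_h: int) -> float:
    """Estimate the device pixel ratio from a screenshot and the logical
    screen size. Assumes the screenshot covers the full logical screen."""
    ratio_x = image_w / logical_w
    ratio_y = image_h / logical_h
    if abs(ratio_x - ratio_y) > 0.01:
        raise ValueError("screenshot and logical screen have different aspect ratios")
    return ratio_x


def display_to_logical(api_x: int, api_y: int,
                       display_w: int, display_h: int,
                       logical_w: int, logical_h: int) -> tuple[int, int]:
    """Map coordinates returned in display space to logical screen points
    (the space this sketch assumes your input tool expects)."""
    return (round(api_x * logical_w / display_w),
            round(api_y * logical_h / display_h))
```

With a 2880x1800 capture of a 1440x900 logical screen, the ratio comes out as 2.0, so clicks should be scaled against 1440x900 rather than against the capture size.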
Coordinate scaling
When you resize a screenshot before sending it, the model returns click coordinates in the display resolution you specified. You must scale these back to your actual screen resolution before executing the click:
```python
# Your screen is screen_w x screen_h
# You sent a screenshot resized to display_w x display_h
scale_x = screen_w / display_w
scale_y = screen_h / display_h

screen_x = int(api_returned_x * scale_x)
screen_y = int(api_returned_y * scale_y)
```
This is straightforward but critical: if you forget to scale, or if display_width_px / display_height_px don't match the actual dimensions of the image you sent, every click will be consistently offset.
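One edge case worth guarding against: if the model returns a coordinate at or beyond the edge of the display space, the scaled value can fall outside the screen. The sketch below scales and then clamps; the function name and the clamping step are additions for illustration, not part of the original snippet.

```python
def scale_and_clamp(api_x: int, api_y: int,
                    display_w: int, display_h: int,
                    screen_w: int, screen_h: int) -> tuple[int, int]:
    """Scale display-space coordinates to screen space and clamp the
    result to valid pixel indices."""
    screen_x = int(api_x * screen_w / display_w)
    screen_y = int(api_y * screen_h / display_h)
    # Clamp so that out-of-range predictions still map to an on-screen pixel.
    screen_x = max(0, min(screen_x, screen_w - 1))
    screen_y = max(0, min(screen_y, screen_h - 1))
    return screen_x, screen_y
```

Whether to clamp or to reject an out-of-range prediction is a design choice; rejecting and re-prompting may be preferable when a misplaced click is costly.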
Content ordering in the messages array
When constructing your messages content array, place the text instruction before the image, as depicted in the code snippet below. This lets the model know what it's looking for as it processes the screenshot, which improves click accuracy.
```python
# RECOMMENDED: text instruction first, then screenshot
content = [
    {"type": "text", "text": "Click on the Submit button"},
    {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": screenshot_b64}},
]

# NOT RECOMMENDED: image first, then text
content = [
    {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": screenshot_b64}},
    {"type": "text", "text": "Click on the Submit button"},
]
```
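If several parts of your harness build messages, it can help to centralize the ordering in one helper and assert it in tests. A small sketch follows; `build_user_content` and `is_text_first` are hypothetical names, not SDK functions.

```python
def build_user_content(instruction: str, screenshot_b64: str,
                       media_type: str = "image/png") -> list[dict]:
    """Build a content array with the text instruction before the image."""
    return [
        {"type": "text", "text": instruction},
        {
            "type": "image",
            "source": {"type": "base64", "media_type": media_type,
                       "data": screenshot_b64},
        },
    ]


def is_text_first(content: list[dict]) -> bool:
    """Return True if a text block appears before the first image block."""
    for block in content:
        if block.get("type") == "text":
            return True
        if block.get("type") == "image":
            return False
    return False
```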
Diagnosing click issues
If clicks are missing their targets, the cause is usually one of the following:
| Symptom | Likely causes | Try this |
| --- | --- | --- |
| Clicks consistently offset in one direction | (1) display_width_px / display_height_px don't match the actual image dimensions sent; (2) Screenshot exceeds API limits and is being silently downscaled; (3) Content ordering is image-first instead of text-first | (1) Ensure display dimensions exactly match your resized screenshot, not your native resolution; (2) Pre-downscale to 1280x720 or use compute_max_api_fit; (3) Move the text instruction before the image in the content array |
| Clicks land in roughly the right area but miss the target | (1) Target is very small (checkbox, icon, toggle); (2) Source image was very high resolution (4K+) and detail was lost during downscaling; (3) Aspect ratio distortion from forcing a non-native aspect ratio | (1) Set enable_zoom: True for dense UIs; (2) Capture at a lower DPI or crop to the relevant screen region before downscaling; (3) Preserve the source aspect ratio when resizing |
| Model clicks the wrong element entirely | (1) Ambiguous instruction ("click Submit" when multiple submit-like buttons exist); (2) Visually similar elements near the target; (3) UI is too complex for a single instruction | (1) Use more specific prompts with positional context ("click the blue Submit button in the bottom-right of the form"); (2) Break complex interactions into smaller steps; (3) Provide additional context about the page layout |
| Accuracy is poor across the board | (1) Screenshots are being sent above API limits; (2) Source images are from very high-resolution displays (4K+) with extreme compression ratios; (3) Resolution is too low, losing critical detail | (1) Pre-downscale all screenshots to fit within limits; (2) For 4K+ sources on the 4.6 family, Sonnet is more robust to heavy downscaling than Opus 4.6. On Opus 4.7 this gap largely closes; use the 4.7 pixel budget (up to 3.75 MP) so less downscaling is needed in the first place; (3) Try 1280x720 as a baseline; if too lossy, use compute_max_api_fit |
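Several of the checks in the table above are mechanical and can run before each request. The following is a rough sketch rather than an exhaustive diagnostic: the function name and warning strings are made up for illustration, the default limits are the 4.6 family figures from this post, and the low-resolution check compares total pixels against 960x540.

```python
def diagnose(image_size: tuple[int, int],
             display_size: tuple[int, int],
             content_types: list[str],
             max_long_edge: int = 1568,
             max_pixels: int = 1_150_000) -> list[str]:
    """Return human-readable warnings for common misconfigurations."""
    warnings = []
    img_w, img_h = image_size

    if image_size != display_size:
        warnings.append("display_width_px / display_height_px do not match the image sent")
    if max(img_w, img_h) > max_long_edge or img_w * img_h > max_pixels:
        warnings.append("image exceeds API limits and will be downscaled")
    if img_w * img_h < 960 * 540:
        warnings.append("image is below 960x540; small UI elements may be lost")
    if "image" in content_types and "text" in content_types:
        if content_types.index("image") < content_types.index("text"):
            warnings.append("image precedes text; put the instruction first")
    return warnings
```

Here `content_types` is the list of block types in the order they appear in the content array, for example `["text", "image"]`.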
Model selection for clicking tasks
Based on our internal testing, Claude Sonnet 4.6 tends to be more mechanically precise at clicking (better spatial accuracy, fewer near misses) while Claude Opus 4.6 brings stronger reasoning. Sonnet 4.6 is also more robust when source images require heavy downscaling.
Opus 4.7 narrows this gap: Through testing, we have found its clicking precision is roughly on par with Sonnet 4.6, and its higher resolution budget reduces the amount of downscaling needed in the first place, making it a strong choice when you want Opus-level reasoning paired with strong click accuracy.
For most tasks, we recommend starting with Sonnet 4.6, which provides the best balance of clicking accuracy, reasoning, and cost. Choose Opus 4.7 when you want stronger reasoning, particularly if using high-resolution source images. Haiku 4.5 remains an excellent option when latency is the priority. Advanced workflows may still benefit from an orchestrator + sub-agent pattern where a reasoning model handles planning and decision-making while Sonnet or Haiku executes the mechanical clicking steps.
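The orchestrator + sub-agent pattern can be sketched as a loop that asks a planner for steps and hands each step to an executor. In the skeleton below, `plan` and `execute_step` are hypothetical callables: in practice `plan` might wrap a call to a reasoning model and `execute_step` a computer use loop running on Sonnet or Haiku, but neither is implemented here.

```python
from typing import Callable


def run_task(task: str,
             plan: Callable[[str], list[str]],
             execute_step: Callable[[str], bool]) -> list[tuple[str, bool]]:
    """Orchestrate a task: get steps from the planner, run each with the
    executor, and stop at the first failed step."""
    results = []
    for step in plan(task):
        succeeded = execute_step(step)
        results.append((step, succeeded))
        if not succeeded:
            # Stop here so the caller can re-plan from the current UI state.
            break
    return results
```

Stopping at the first failure is one possible policy; a real orchestrator might instead pass the failure back to the planner and request a revised plan.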
Handling small targets
Click accuracy degrades as targets get smaller. Large and medium UI elements (buttons, input fields, and standard menu items) are reliable across all resolutions within the safe zone. The challenge is with small and tiny targets, like checkboxes, system tray icons, dropdown arrows, small toggle switches, and tree view expand/collapse buttons.
If your application involves clicking small targets frequently, consider these strategies:
Use zoom for dense UIs. Claude 4.6 and 4.7 models support a zoom capability that lets the model inspect specific screen regions at higher resolution before clicking. Enable it in your tool configuration:
```python
{
    "type": "computer_20251124",
    "name": "computer",
    "display_width_px": 1280,
    "display_height_px": 720,
    "enable_zoom": True
}
```
Make targets larger. If you control the UI being automated, increasing the size of click targets (even modestly) has a disproportionate impact on reliability. This might mean using a lower system DPI, zooming in the browser, or adjusting UI scaling settings.
Use keyboard alternatives for tiny targets. For very small elements (such as system tray icons or tiny checkboxes), keyboard shortcuts or tab-based navigation can be more reliable than clicking. If your workflow allows it, prompting the model to use keyboard interactions for specific steps can improve success rates.
Consider source image resolution. Screenshots from 4K+ displays that get compressed down to 720p lose significant detail. For example, a 16px checkbox at 3840x2160 native becomes roughly 5px at 1280x720 display resolution, which makes the target much smaller and therefore harder to hit. If you're working with very high-resolution displays, consider using Opus 4.7, which has a higher resolution limit than previous models. If using 4.6 models, consider capturing at a lower DPI, using display scaling to enlarge UI elements, or focusing the screenshot on the relevant portion of the screen rather than the full display. Because heavier downscaling forces the same information into fewer pixels, we've observed that performance degrades as the source image resolution increases.
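The checkbox arithmetic above generalizes to any target. The sketch below estimates a target's on-image size after downscaling and the smallest display width that keeps it above a chosen size; it assumes uniform scaling, and the minimum size you pass in is your own threshold, not a documented limit.

```python
import math


def scaled_target_px(target_px: float, native_w: int, display_w: int) -> float:
    """Approximate size of a UI target in the downscaled image, assuming
    the aspect ratio is preserved."""
    return target_px * display_w / native_w


def min_display_width(target_px: float, native_w: int, min_px: float) -> int:
    """Smallest display width at which the target is still at least
    min_px wide in the downscaled image."""
    return math.ceil(min_px * native_w / target_px)
```

For instance, `scaled_target_px(16, 3840, 1280)` gives about 5.3, matching the 16px checkbox example in the text.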
Approaches we tested that didn't help
We experimented on internal evaluations with several popular optimization techniques and did not find consistent uplift from these approaches, though results may vary depending on the specific situation:
Breaking the image into smaller tiles: Splitting a screenshot into quadrants or regions and sending them separately did not improve click accuracy.
Overlaying a grid pattern with coordinates: Adding a visual coordinate grid to screenshots to help the model localize targets did not produce reliable gains.
Resize algorithm choice: PIL LANCZOS, sips, and other common resize algorithms produced identical results. Use whatever is convenient for your stack.
Inspecting failures
If the model acts unpredictably after trying the fixes above, log the full transcripts and overlay the predicted clicks on the source screenshots to understand what the model is actually seeing and deciding.
Some failures aren't about click accuracy at all. For example, certain dropdown menus may invoke system-level UI that the browser viewport doesn't capture—the model appears to be failing the task, but it simply can't see the menu it needs to interact with. In cases like these, the model should rely on alternative methods such as JavaScript execution, keyboard navigation, or direct document object model (DOM) manipulation rather than clicking.
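For the logging side of failure inspection, one lightweight option is to append a JSON line per click so that predicted and executed coordinates can later be replayed or drawn over the saved screenshots. The record fields below are an assumed schema for illustration, not a required format.

```python
import json
from dataclasses import dataclass, asdict
from typing import TextIO


@dataclass
class ClickRecord:
    step: int
    instruction: str
    display_size: tuple[int, int]
    predicted: tuple[int, int]   # coordinates returned by the model (display space)
    executed: tuple[int, int]    # coordinates actually clicked (screen space)
    screenshot_path: str         # where the screenshot for this step was saved


def log_click(record: ClickRecord, stream: TextIO) -> None:
    """Append one record to the stream as a JSON line."""
    stream.write(json.dumps(asdict(record)) + "\n")


def load_clicks(stream: TextIO) -> list[dict]:
    """Read records back; tuples come back as lists after the JSON round trip."""
    return [json.loads(line) for line in stream if line.strip()]
```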
Quick reference
How to scale and prepare an image for computer use
```python
import math
from PIL import Image
import base64
import io

# 1568 for 4.6 family, 2576 for Opus 4.7
MAX_LONG_EDGE = 1568
# 1.15MP for 4.6 family, 3.75MP for Opus 4.7
MAX_PIXELS = 1_150_000


def prepare_screenshot(screenshot: Image.Image, native_w: int, native_h: int) -> tuple[str, int, int]:
    """Resize a screenshot to fit API limits and return base64 + display dimensions."""
    # Option A: Fixed 720p (simple, reliable)
    display_w, display_h = 1280, 720

    # Option B: Max API fit (maximizes fidelity)
    # display_w, display_h = compute_max_api_fit(native_w, native_h)

    resized = screenshot.resize((display_w, display_h), Image.LANCZOS)
    buffer = io.BytesIO()
    resized.save(buffer, format="PNG")
    b64 = base64.standard_b64encode(buffer.getvalue()).decode()
    return b64, display_w, display_h


def scale_coordinates(api_x: int, api_y: int, display_w: int, display_h: int,
                      screen_w: int, screen_h: int) -> tuple[int, int]:
    """Scale API-returned coordinates back to native screen space."""
    screen_x = int(api_x * (screen_w / display_w))
    screen_y = int(api_y * (screen_h / display_h))
    return screen_x, screen_y


def compute_max_api_fit(native_w: int, native_h: int) -> tuple[int, int]:
    """Compute the largest resolution that fits API limits while preserving aspect ratio."""
    aspect = native_w / native_h
    h_from_pixels = math.sqrt(MAX_PIXELS / aspect)
    w_from_pixels = h_from_pixels * aspect

    if native_w >= native_h:
        w = min(w_from_pixels, MAX_LONG_EDGE)
        h = w / aspect
    else:
        h = min(h_from_pixels, MAX_LONG_EDGE)
        w = h * aspect

    w = min(w, native_w)
    h = min(h, native_h)
    return int(w), int(h)
```
Usage:
```python
import anthropic
from PIL import Image

client = anthropic.Anthropic()

# Capture screenshot (your method here)
screenshot = Image.open("screenshot.png")
native_w, native_h = screenshot.size

# Prepare for API
b64, display_w, display_h = prepare_screenshot(screenshot, native_w, native_h)

# Send to Claude: text before image
response = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    betas=["computer-use-2025-11-24"],
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Click on the Submit button"},
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
        ]
    }],
    tools=[{
        "type": "computer_20251124",
        "name": "computer",
        "display_width_px": display_w,
        "display_height_px": display_h,
    }],
)

# Scale coordinates back for execution
api_x, api_y = extract_click_coords(response)  # your parsing logic
screen_x, screen_y = scale_coordinates(api_x, api_y, display_w, display_h, native_w, native_h)
```
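The usage example leaves `extract_click_coords` to the reader. One possible shape is sketched below, under two assumptions that you should verify against your SDK version: that the response content has been converted to plain dicts (for example by calling `model_dump()` on each content block, if available), and that click actions arrive as a `tool_use` block whose `input` carries a two-element `coordinate` list. Unlike the call in the usage example, this version takes the list of content blocks rather than the response object.

```python
from typing import Optional


def extract_click_coords(content_blocks: list[dict]) -> Optional[tuple[int, int]]:
    """Return (x, y) from the first tool_use block that carries a
    "coordinate" input, or None if no such block exists.

    Assumes each block is a plain dict, not an SDK object.
    """
    for block in content_blocks:
        if block.get("type") != "tool_use":
            continue
        coordinate = block.get("input", {}).get("coordinate")
        if coordinate is not None and len(coordinate) == 2:
            x, y = coordinate
            return int(x), int(y)
    return None
```

Returning `None` instead of raising lets the caller distinguish "the model chose a non-click action" (typing, key presses, a screenshot request) from a parsing error.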
Tuning thinking effort for computer use
Claude's latest models support adaptive thinking, a setting which lets Claude decide how much to reason through intermediate steps before acting. Instead of manually setting a thinking token budget, adaptive thinking lets Claude dynamically determine when and how much to use extended thinking based on the complexity of each request. For computer use, this means Claude can think through what it's seeing on screen, plan multi-step interactions, and self-correct before committing to a click or keystroke.
With adaptive thinking, Claude's thinking depth is controlled via the thinking parameter with an effort level: low, medium, high, xhigh (with Opus 4.7), and max. More thinking means more reasoning per action, but also more output tokens, higher latency, and higher cost.
The natural question: depending on the model, how much thinking is optimal for computer use?
Claude Opus 4.7
We tested each thinking effort level across a suite of end-to-end UI automation tasks spanning desktop applications, browsers, and multi-application workflows.
Opus 4.7 outperforms the 4.6 family. On the OSWorld Verified benchmark, we find that Opus 4.7 outperforms all 4.6 family models at equivalent token usage and effort settings. Opus 4.7 on low effort scores similarly to Sonnet 4.6 on max while using roughly one-tenth of the tokens per task. For difficult tasks, Opus 4.7 is the clear choice.
Setting effort to high achieves close to the highest task success rate while using roughly half the output tokens of max. Compared to Opus 4.6, low, medium, and high all use approximately the same number of tokens while improving the OSWorld score. In our internal testing, max effort used more tokens than high and achieved the best score. The table below outlines our recommendations for when to use each thinking effort level.
Recommendations for effort levels
| Scenario | Thinking effort | Why |
| --- | --- | --- |
| Default for most use cases | high | Opus 4.7 is best for difficult tasks. Using high will give the model enough reasoning to plan over complex multi-step interactions without significantly increasing token usage. |
| High-throughput / cost-sensitive | low | Lower token usage while providing quality between Opus 4.6's high and max effort settings. |
| Simple, well-defined workflows / fastest | Suggest trying Sonnet 4.6 | Use if low latency is the highest priority. Adequate for short, predictable tasks where the UI is consistent and the workflow is known. |
| Complex, one-shot tasks | max | Use when tasks are highly challenging and you need to get it right on the first attempt. |
Claude 4.6 models
We tested each thinking effort level across a suite of end-to-end UI automation tasks spanning desktop applications, browsers, and multi-application workflows.
Two patterns stand out:
Medium effort is the sweet spot. Setting effort to medium achieves close to the highest task success rate while using roughly half the output tokens of high. Beyond medium, performance somewhat plateaus. Notably, when tasks are retried, medium and high converge to the same success rate. This means high effort may help the model get a difficult task right on the first try, but given multiple attempts, medium may get there as reliably at lower cost.
A little thinking goes a long way. low effort is a surprisingly strong option. It actually uses fewer total output tokens than disabling thinking entirely (the model makes fewer mistakes and needs fewer retry cycles), while matching or slightly exceeding no-thinking accuracy. This makes it the best option for cost-sensitive, high-throughput workloads. The table below outlines our effort recommendations.
Recommendations for effort levels
| Scenario | Thinking effort | Why |
| --- | --- | --- |
| Default for most use cases | medium | Best accuracy-to-cost ratio. Gives the model enough reasoning to plan multi-step interactions without overthinking. With retries, matches high performance at half the token cost. |
| High-throughput / cost-sensitive | low | More accurate than no thinking, but with lower token usage due to fewer errors and retries. |
| Simple, well-defined workflows / fastest | Thinking disabled | Use if low latency is the highest priority. Adequate for short, predictable tasks where the UI is consistent and the workflow is known. |
| Complex, one-shot tasks | high | Use when tasks are challenging and you need to get it right on the first attempt. If your system supports retries, medium may achieve the same eventual success rate. |
For the 4.6 models, we don't recommend max effort for computer use. In our testing, it provides no accuracy benefit over high while further increasing output token cost. UI tasks are primarily perceptual rather than deeply logical, and the additional reasoning budget goes unused or leads to overthinking. This advice may change as models evolve.
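The retry trade-off between medium and high effort can be reasoned about with a simple model. The sketch below assumes each attempt succeeds independently with probability `p`, which is a simplification, and any numbers you plug in are your own measurements or placeholders; none are taken from our testing.

```python
def eventual_success(p: float, max_attempts: int) -> float:
    """Probability that at least one of max_attempts attempts succeeds."""
    return 1.0 - (1.0 - p) ** max_attempts


def expected_attempts(p: float, max_attempts: int) -> float:
    """Expected number of attempts when retrying until the first success
    or until max_attempts is reached."""
    if p == 0:
        return float(max_attempts)
    return (1.0 - (1.0 - p) ** max_attempts) / p


def expected_tokens(tokens_per_attempt: float, p: float, max_attempts: int) -> float:
    """Expected output tokens per task under the retry policy above."""
    return tokens_per_attempt * expected_attempts(p, max_attempts)
```

Comparing `expected_tokens` and `eventual_success` for two effort levels, using per-attempt token counts and success rates measured on your own tasks, shows whether a cheaper setting with retries reaches the same reliability at lower cost.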
Example configuration with effort set to medium
```python
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    betas=["computer-use-2025-11-24"],
    thinking={"type": "adaptive"},
    output_config={"effort": "medium"},
    messages=[...],
    tools=[
        {
            "type": "computer_20251124",
            "name": "computer",
            "display_width_px": 1280,
            "display_height_px": 720,
        }
    ],
)
```
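The two recommendation tables can be folded into a small lookup that produces the request arguments used in the example configuration. The family keys, scenario names, and the `thinking_kwargs` helper are hypothetical. For the "thinking disabled" case the sketch simply sends no thinking parameters; whether that leaves thinking off by default for your model is an assumption to verify.

```python
# Effort recommendations transcribed from the two tables in this post.
# None stands for "thinking disabled". Opus 4.7 has no "fastest" entry
# because the post suggests trying Sonnet 4.6 for that scenario instead.
RECOMMENDED_EFFORT = {
    "opus-4.7": {"default": "high", "high_throughput": "low", "one_shot": "max"},
    "4.6": {"default": "medium", "high_throughput": "low",
            "fastest": None, "one_shot": "high"},
}


def thinking_kwargs(family: str, scenario: str = "default") -> dict:
    """Return keyword arguments for the request, mirroring the example
    configuration. Returns {} when no thinking parameters should be sent."""
    effort = RECOMMENDED_EFFORT[family][scenario]
    if effort is None:
        return {}
    return {"thinking": {"type": "adaptive"}, "output_config": {"effort": effort}}
```

The result can be unpacked into the call, for example `client.beta.messages.create(..., **thinking_kwargs("4.6"))`.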
Why more thinking doesn't always help
UI automation tasks are fundamentally different from coding or math problems. Most computer use actions are perceptual and mechanical (identifying the right element, clicking in the right place) rather than deeply logical. Thinking helps most when the model needs to:
Plan a multi-step sequence before starting (e.g., "I need to open Settings, navigate to Privacy, then disable tracking")
Recover from an unexpected UI state (e.g., a dialog appeared that wasn't anticipated)
Cross-reference information between what's on screen and the task instructions
Complete challenging projects on professional software
Improving safety: leveraging prompt injection classifiers
This section covers prompt injection protection, which is offered by default and for free if you use our official computer use tool header. However, if you are interested in enabling this on custom computer or browser use tools, please fill out our Prompt Injection Classifiers Interest Form.
Computer use agents interact with untrusted content by design. Every screenshot, webpage, or application UI that Claude processes could contain adversarial instructions, including hidden text, manipulated images, deceptive UI elements, or social engineering attempts that try to hijack the agent's behavior. This attack surface is fundamentally different from a typical API integration where you control the inputs. With computer use, the inputs to the model are the open internet and whatever software the agent is navigating.
As computer use agents become more capable and more widely deployed, prompt injection becomes a correspondingly more serious risk. An agent that can click, type, and navigate can be manipulated into taking real-world actions such as filling out forms, downloading files, or navigating to malicious URLs. Building robust defenses against these attacks is essential for any production deployment.
How we approach prompt injection defense
We've written in detail about our approach to prompt injection defenses for browser and computer use. Our defense strategy operates at multiple layers:
Training-time robustness. We use reinforcement learning to build prompt injection resistance directly into Claude's capabilities. During training, Claude is exposed to injected content embedded in simulated web pages and application UIs, and rewarded when it correctly identifies and refuses to follow malicious instructions. This means Claude's first line of defense is the model itself, which has learned to distinguish between legitimate user instructions and adversarial content encountered during task execution.
Real-time classifiers. We run probes that scan content entering Claude's context window and flag potential prompt injection attempts. These probes detect adversarial commands across multiple modalities, such as text hidden in page content, instructions embedded in images, and deceptive UI elements designed to trick the agent. When they identify an attack, they adjust Claude's behavior accordingly.
Continuous red teaming. Our security researchers continuously probe these defenses, and we participate in external adversarial evaluations to benchmark robustness against evolving attack techniques.
We've continued to invest heavily in all three layers since our initial computer use research preview. Each new model generation incorporates stronger training-time defenses and more capable classifiers, and we've expanded the range of attack techniques our red team evaluates against.
Using Claude’s built-in classifiers
When you use Claude's official computer use tool via the API, prompt injection classifiers run automatically on every request. These classifiers operate in parallel with the main model inference, adding negligible latency and no additional cost to your requests.
There is nothing you need to configure to enable this protection. It's on by default when you use the official computer_20251124 tool type. The classifiers evaluate screenshots and other content for signs of prompt injection and influence Claude's responses accordingly.
```python
# Classifiers run automatically when using the official CU tool; no extra config needed
tools = [
    {
        "type": "computer_20251124",
        "name": "computer",
        "display_width_px": 1280,
        "display_height_px": 720,
    }
]
```
If you're not using the official computer use tool
Many developers build computer use integrations using custom tool definitions rather than the official computer_20251124 tool type (for example, by defining their own screenshot and click tools). If this describes your setup, the built-in classifiers described above don't currently run on your requests.
We're actively exploring how to extend prompt injection protection to these custom implementations. If you're building a computer use or browser use integration without the official tool type and are interested in prompt injection classifiers, fill out this interest form and we'll follow up as this capability becomes available.
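A harness can check at startup which of the two cases it falls into. The sketch below is a minimal illustration rather than part of the API: `has_official_computer_tool` and `OFFICIAL_COMPUTER_TOOL_TYPES` are hypothetical names, and the set assumes the only covered tool type is the computer_20251124 type named in this post.

```python
# Hypothetical startup check. The helper name and the set below are
# assumptions for illustration, based only on the tool type named in this post.
OFFICIAL_COMPUTER_TOOL_TYPES = {"computer_20251124"}


def has_official_computer_tool(tools):
    """Return True if any tool definition uses an official computer use type."""
    return any(
        isinstance(tool, dict) and tool.get("type") in OFFICIAL_COMPUTER_TOOL_TYPES
        for tool in tools
    )


official = [{"type": "computer_20251124", "name": "computer",
             "display_width_px": 1280, "display_height_px": 720}]
# A custom tool definition of the kind described above (illustrative only).
custom = [{"name": "click", "description": "Click at (x, y)",
           "input_schema": {"type": "object", "properties": {}}}]

for label, tools in (("official", official), ("custom", custom)):
    if not has_official_computer_tool(tools):
        print(f"{label}: built-in classifiers will not run; rely on your own safeguards")
```

Only the custom configuration triggers the warning, which is a cue to lean more heavily on the practices in the next section.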
Best practices regardless of classifier use
Classifiers are one layer of defense, not a complete solution. We recommend the following practices for any computer use deployment:
Implement human-in-the-loop for high-stakes actions. Have the agent pause and request user confirmation before performing irreversible actions such as submitting forms, making purchases, sending messages, or modifying data. This is the single most effective mitigation against prompt injection regardless of classifier performance.
Scope the agent's permissions. Limit what the agent can do. If your workflow doesn't require file downloads, don't give the agent access to download files. If it doesn't need to send emails, don't give it access to an email client. Reducing the blast radius of a successful injection is as important as preventing the injection itself.
Monitor and log agent actions. Log the full sequence of actions the agent takes, including screenshots at each step. This allows you to detect anomalous behavior, audit what happens when something goes wrong, and build a feedback loop to improve your system's robustness over time.
Treat all web content as untrusted. Design your agent's system prompt to clearly distinguish between the user's instructions and content encountered during task execution. Remind the model that text found on web pages, in emails, or in application UIs is not from the user and should not be treated as instructions.
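The first three practices can be combined in a single checkpoint that every proposed action passes through before it is executed. This is a minimal sketch under stated assumptions: the action names, `ALLOWED_ACTIONS`, `HIGH_STAKES_ACTIONS`, and `gate_action` are all hypothetical and should be mapped onto whatever action vocabulary your harness uses.

```python
import json
import time

# All names below are hypothetical; adapt them to your harness.
ALLOWED_ACTIONS = {"screenshot", "click", "type", "scroll", "submit_form"}
HIGH_STAKES_ACTIONS = {"submit_form", "purchase", "send_message", "delete"}


def gate_action(action, confirm, log):
    """Decide whether `action` (a dict with a "name" key) may run.

    `confirm` is a callable that asks the user and returns True or False.
    `log` is a list that receives one JSON line per decision.
    """
    name = action.get("name")
    if name not in ALLOWED_ACTIONS:
        decision = "blocked"   # outside the agent's scoped permissions
    elif name in HIGH_STAKES_ACTIONS and not confirm(action):
        decision = "denied"    # the user declined an irreversible action
    else:
        decision = "allowed"
    log.append(json.dumps({"ts": time.time(), "action": action, "decision": decision}))
    return decision == "allowed"


def always_decline(action):
    """Stand-in for a real confirmation prompt."""
    return False


audit_log = []
results = [
    gate_action({"name": "click", "x": 10, "y": 20}, always_decline, audit_log),
    gate_action({"name": "submit_form"}, always_decline, audit_log),
    gate_action({"name": "send_message"}, always_decline, audit_log),
]
print(results)  # [True, False, False]
```

The click runs without confirmation, the form submission is held until the user approves it, and the message send is rejected outright because it is not in the allowlist. A production log would also store the screenshot captured at each step.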
Context management for computer use
When building computer use agents, screenshots accumulate fast. Every action generates a new image, and each image consumes roughly 1,000–1,800 tokens depending on resolution. After accounting for the system prompt, tool definitions, and text content, a 200k context window can fill up in well under 100 screenshots.
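As a rough worked example of that arithmetic, the sketch below estimates how many screenshot turns fit in the window. Only the 200k window and the 1,800-token upper bound come from the text; the fixed overhead and per-turn text figures are assumptions chosen for illustration, so substitute numbers measured from your own trajectories.

```python
def max_screenshots(context_window, fixed_overhead, tokens_per_screenshot,
                    text_tokens_per_turn):
    """Estimate how many screenshot turns fit before the window is full."""
    available = context_window - fixed_overhead
    per_turn = tokens_per_screenshot + text_tokens_per_turn
    return max(available // per_turn, 0)


turns = max_screenshots(
    context_window=200_000,
    fixed_overhead=20_000,        # assumed: system prompt + tool definitions
    tokens_per_screenshot=1_800,  # upper end of the range quoted above
    text_tokens_per_turn=400,     # assumed: tool calls and model text per action
)
print(turns)  # 81
```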
Managing this context well has two goals: 1) keeping total tokens bounded and 2) keeping prompt caching effective so you don't repeatedly pay full price for the same prefix. We've found that effective context management has more impact on the cost and latency of long-running agents than almost any other optimization. This section covers three layers that compose cleanly: placing cache breakpoints, pruning old screenshots without breaking the cache, and summarizing history when pruning isn't enough.
Placing cache breakpoints
Prompt caching only helps if breakpoints land on content that will recur across turns. The API supports four cache breakpoints in total. Putting all four on a stable prefix (system prompt, tool definitions) wastes three of them: that prefix is cached once and rarely invalidates, so one breakpoint is enough. The other three are better spent on recent history, where invalidation risk is highest and savings compound over long sessions.
We recommend:
One breakpoint on the system prompt or trailing tool definitions. This prefix rarely changes within a session.
Up to three additional breakpoints on the most recent tool results, advancing each turn and clearing the previous iteration's breakpoints so you don't overrun the four-breakpoint limit.
Spreading breakpoints across recent positions gives you graceful degradation. If your most recent breakpoint is invalidated (for example, by an image prune, a compaction, or a tool-definition change), an earlier breakpoint can still hit, and the tokens up to that breakpoint are still billed at 10% of the full input cost instead of 100%.
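To make the graceful-degradation argument concrete, the sketch below compares input cost for three outcomes on a hypothetical 100,000-token prompt. It uses the 10% cache-read figure from the text and, as a simplifying assumption, ignores any additional charge for writing new cache entries; `relative_input_cost` is an illustrative helper, not an API call.

```python
def relative_input_cost(total_tokens, cached_tokens, cache_read_percent=10):
    """Input cost expressed in full-price token units.

    Cached tokens are billed at `cache_read_percent` of the full price.
    Cache-write pricing is ignored here (a simplifying assumption).
    """
    if not 0 <= cached_tokens <= total_tokens:
        raise ValueError("cached_tokens must be between 0 and total_tokens")
    uncached = total_tokens - cached_tokens
    return (cached_tokens * cache_read_percent + uncached * 100) / 100


total = 100_000
latest_hit = relative_input_cost(total, cached_tokens=98_000)   # newest breakpoint hits
fallback_hit = relative_input_cost(total, cached_tokens=90_000) # only an earlier one hits
no_hit = relative_input_cost(total, cached_tokens=0)            # every breakpoint missed

print(latest_hit, fallback_hit, no_hit)  # 11800.0 19000.0 100000.0
```

In this example, falling back to an earlier breakpoint costs about 1.6 times the best case, while missing every breakpoint costs more than 8 times as much.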
Example of cache control and setting breakpoints:
def set_trailing_cache_control(messages, max_breakpoints=3):
    """Place up to `max_breakpoints` ephemeral cache_control markers on the
    most recent tool_result blocks, after clearing any existing markers."""
    # Clear markers left over from the previous turn.
    for msg in messages:
        for block in msg.get("content", []):
            if isinstance(block, dict):
                block.pop("cache_control", None)

    # Walk backwards and mark the most recent tool_result blocks.
    placed = 0
    for msg in reversed(messages):
        for block in reversed(msg.get("content", [])):
            if placed >= max_breakpoints:
                return
            if isinstance(block, dict) and block.get("type") == "tool_result":
                block["cache_control"] = {"type": "ephemeral"}
                placed += 1
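The function above covers the trailing breakpoints; the remaining one goes on the stable prefix. The sketch below shows one way to assemble the request, assuming the trailing markers have already been set on `messages`. `build_request` and `count_breakpoints` are hypothetical helpers, and the guard simply enforces the four-breakpoint limit described earlier.

```python
MAX_BREAKPOINTS = 4  # total limit stated above


def count_breakpoints(blocks):
    """Count blocks in a list that carry a cache_control marker."""
    return sum(1 for b in blocks if isinstance(b, dict) and "cache_control" in b)


def build_request(system_text, tools, messages):
    """Assemble request kwargs with one breakpoint on the system prompt.

    Assumes trailing breakpoints on `messages` were set beforehand.
    """
    system = [{
        "type": "text",
        "text": system_text,
        "cache_control": {"type": "ephemeral"},  # the single stable-prefix breakpoint
    }]
    total = count_breakpoints(system) + count_breakpoints(tools)
    for msg in messages:
        content = msg.get("content", [])
        if isinstance(content, list):
            total += count_breakpoints(content)
    if total > MAX_BREAKPOINTS:
        raise ValueError(f"{total} cache breakpoints set; the limit is {MAX_BREAKPOINTS}")
    return {"system": system, "tools": tools, "messages": messages}


def marked_result(tool_use_id):
    """Build an illustrative tool_result block that already has a marker."""
    return {"type": "tool_result", "tool_use_id": tool_use_id,
            "content": "ok", "cache_control": {"type": "ephemeral"}}


history = [{"role": "user", "content": [marked_result(f"toolu_{i}")]} for i in range(3)]
request = build_request("You are a computer use agent.", [], history)
print(sorted(request))  # ['messages', 'system', 'tools']
```

The returned dictionary is meant to be passed as keyword arguments to your Messages API call alongside the model name and token limit.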
Approach 1: Rolling buffer (cache-aware)
The simplest way to keep token counts bounded is to keep only the N most recent screenshots and drop the rest. Before each API call, walk the message array and replace older image blocks with a short placeholder (e.g., a text block saying “[Image omitted]”).
The naive version of this pattern drops screenshots one at a time as they age out, which changes the prefix on every turn and continuously invalidates the prompt cache. This is how rolling buffers got their reputation for breaking caching. The fix is to prune in batches, so the prefix stays byte-identical for several turns at a time, invalidates once, and then stays stable again.
A concrete pattern that we have tested is to:
Keep the most recent keep_n screenshots in full resolution.
Once the total screenshot count exceeds keep_n + interval, replace the oldest interval screenshots with placeholders in a single pass.
Between pruning events, the message array is byte-identical across turns, so your cache breakpoints keep hitting.
Reasonable defaults to start with: keep_n = 3, interval = 25. These are tunable, and a higher interval means fewer prune events (better cache efficiency) but a larger tail of full-resolution screenshots in context (more tokens). Measure cache hit rate and total input tokens on a representative trajectory and adjust.
Example of pruning previous screenshots while keeping cache breakpoints:
def prune_old_screenshots(messages, keep_n=3, interval=25):
    """Replace older screenshots with text placeholders in batches.

    Only prunes when the total count exceeds keep_n + interval, so the
    message prefix stays byte-stable for `interval` turns between prunes.
    Mutates `messages` in place and also returns it.
    """
    # Assumes screenshots are top-level image blocks. If your harness nests
    # them inside tool_result content, walk that nested list as well.
    image_positions = [
        (msg_idx, block_idx)
        for msg_idx, msg in enumerate(messages)
        for block_idx, block in enumerate(msg.get("content", []))
        if isinstance(block, dict) and block.get("type") == "image"
    ]
    if len(image_positions) <= keep_n + interval:
        return messages

    # Prune the oldest `interval` screenshots in a single pass, as described
    # above. Placeholders are text blocks, so they are not counted again.
    to_prune = image_positions[:interval]
    for msg_idx, block_idx in to_prune:
        messages[msg_idx]["content"][block_idx] = {
            "type": "text",
            "text": "[Image omitted]",
        }
    return messages
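The effect of batching on cache stability can be checked without touching the API. The sketch below is a simplified simulation that assumes one new screenshot per turn and a prune check before every call; it counts how many turns trigger a prune, and each prune is one cache invalidation. `count_prune_events` is an illustrative helper, and setting `interval=1` approximates the naive one-at-a-time buffer.

```python
def count_prune_events(total_turns, keep_n, interval):
    """Count prune events (cache invalidations) over a simulated trajectory."""
    images = 0
    events = 0
    for _ in range(total_turns):
        images += 1                      # one new screenshot per turn
        if images > keep_n + interval:   # same threshold as the batched pruner
            images -= interval           # the oldest `interval` images are replaced
            events += 1
    return events


batched = count_prune_events(total_turns=100, keep_n=3, interval=25)
naive = count_prune_events(total_turns=100, keep_n=3, interval=1)
print(batched, naive)  # 3 96
```

Over 100 turns the batched settings invalidate the cache 3 times, while the one-at-a-time variant invalidates it on 96 of them. The trade-off is that up to keep_n + interval screenshots stay in context between prunes.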
Rolling buffers still have one real limitation: anything outside the buffer is gone. The original instructions, what the agent already tried, and where it is in the task all disappear with the pruned screenshots. For short tasks (under ~50 actions), that's fine. For anything longer, combine this with compaction.
Approach 2: LLM-based compaction
Instead of silently dropping old images, summarize the full conversation before discarding it. The summary preserves what happened, what the user asked for, what's been completed, and where to resume. A few recent screenshots are kept alongside it so the agent can see what it's currently looking at.
Compaction and the cache-aware rolling buffer are complementary. Use the rolling buffer turn-to-turn to keep token growth manageable; use compaction occasionally to reclaim the rest of the window without losing earlier context. Each compaction event is a cache invalidation by design, so you want it to happen rarely, not every few turns.
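One simple way to keep compaction rare is to trigger it only when the prompt approaches the window limit. The sketch below reads the token counts from the usage data of the previous response; the 75% threshold and both helper names are assumptions for illustration, not recommendations from our experiments. With caching enabled, the prompt size is the sum of the uncached, cache-write, and cache-read token counts.

```python
def total_prompt_tokens(usage):
    """Sum uncached, cache-write, and cache-read input tokens from a usage dict."""
    return (
        (usage.get("input_tokens") or 0)
        + (usage.get("cache_creation_input_tokens") or 0)
        + (usage.get("cache_read_input_tokens") or 0)
    )


def should_compact(usage, context_window=200_000, threshold=0.75):
    """Return True once the last prompt used `threshold` of the window.

    The default threshold is an assumed starting point; tune it for your tasks.
    """
    return total_prompt_tokens(usage) >= context_window * threshold


early = {"input_tokens": 1_500, "cache_creation_input_tokens": 2_500,
         "cache_read_input_tokens": 36_000}
late = {"input_tokens": 2_000, "cache_creation_input_tokens": 3_000,
        "cache_read_input_tokens": 150_000}

print(should_compact(early), should_compact(late))  # False True
```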
The summarization prompt
This example prompt provides a scaffold in which each section targets a specific failure mode. The prompt must capture everything the agent needs to continue the task without re-reading the original conversation, as depicted in the example below:
COMPACT_PROMPT = """Your task is to create a detailed summary of this conversation that will REPLACE the conversation history. The agent will continue working with only this summary and a few recent screenshots as context.

CRITICAL: Preserve ALL user instructions verbatim. User instructions are the most critical element. If they are lost, the agent will deviate from the task.

Before providing your summary, analyze the conversation in tags:
1. Extract every user instruction, requirement, and constraint
2. Identify if this is a repeatable workflow (e.g., processing N items)
3. Chronologically trace what actions were taken and what happened

Your summary MUST include these sections:

1. USER INSTRUCTIONS:
   - Complete initial task definition (verbatim when possible)
   - ALL specific requirements and criteria
   - Every "DO NOT", "ALWAYS", "MUST" instruction
   - Any corrections or feedback that changed the approach

2. TASK TEMPLATE (if this is a repeatable workflow):
   - The pattern being repeated
   - Decision criteria for each iteration
   - Standard workflow steps
   - Example of one completed iteration

3. CONSTRAINTS AND RULES:
   - All user-specified rules and restrictions
   - Edge cases and exceptions discovered

4. ACTIONS TAKEN:
   - Pages visited and elements interacted with
   - Forms filled and buttons clicked

5. ERRORS AND FIXES:
   - What went wrong and how it was resolved
   - Approaches that failed (so they aren't retried)

6. PROGRESS TRACKING:
   - Items completed vs. remaining
   - Current position in the workflow

7. CURRENT STATE:
   - Current application, URL and domain (optional)
   - Important page state (logged in, form progress, etc.)

8. NEXT STEP:
   - Exactly what should be done next to continue
"""
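Once a summary exists, the message array has to be rebuilt around it. The sketch below shows one possible shape for that step: it collapses the history into a single user message containing the summary plus the last few screenshots. `compact_history` and `recent_screenshots` are hypothetical helpers, and `summarize` stands in for whatever function sends the conversation and the summarization prompt to the model and returns the summary text.

```python
def recent_screenshots(messages, keep=2):
    """Collect the last `keep` image blocks, top-level or nested in tool_result."""
    images = []
    for msg in messages:
        content = msg.get("content", [])
        if not isinstance(content, list):
            continue
        for block in content:
            if not isinstance(block, dict):
                continue
            if block.get("type") == "image":
                images.append(block)
            elif block.get("type") == "tool_result" and isinstance(block.get("content"), list):
                images.extend(
                    b for b in block["content"]
                    if isinstance(b, dict) and b.get("type") == "image"
                )
    kept = images[-keep:] if keep > 0 else []
    # Drop stale cache markers so breakpoints can be placed fresh afterwards.
    return [{k: v for k, v in img.items() if k != "cache_control"} for img in kept]


def compact_history(messages, summarize, keep=2):
    """Replace the history with one user message: summary text + recent screenshots."""
    summary = summarize(messages)  # hypothetical callable returning a string
    content = [{"type": "text", "text": "Summary of the session so far:\n" + summary}]
    content.extend(recent_screenshots(messages, keep))
    return [{"role": "user", "content": content}]


def fake_image(tag):
    """Build a placeholder image block (the data is not a real screenshot)."""
    return {"type": "image",
            "source": {"type": "base64", "media_type": "image/png", "data": tag}}


history = [
    {"role": "user", "content": "Book a table for two."},
    {"role": "user", "content": [{"type": "tool_result", "tool_use_id": "toolu_1",
                                  "content": [fake_image("a")]}]},
    {"role": "user", "content": [{"type": "tool_result", "tool_use_id": "toolu_2",
                                  "content": [fake_image("b")]}]},
    {"role": "user", "content": [fake_image("c")]},
]
compacted = compact_history(history, summarize=lambda msgs: "Task: book a table.")
print(len(compacted), len(compacted[0]["content"]))  # 1 3
```

Because the rebuilt array shares no prefix with the old message history, the next call after compaction pays for a fresh cache write on the messages, which is why compaction should be triggered sparingly.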