Simple QA Bench

Simple Web QA is a benchmark for computer use agents finding bugs in websites.

Performance on Simple Web QA. The dashed line represents performance under random guessing.

Each task in Simple Web QA consists of a website and a test: a set of instructions describing a user path through the website. The computer use agent's job is to determine whether the test should pass or fail.
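To make the task structure concrete, here is a minimal Python sketch of how one task and its scoring could be represented. The benchmark's actual data format is not described here, so the `QATask` class, its field names, the `accuracy` helper, the "drag-and-drop-pass" task name, and the example URLs are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Literal, Sequence

Verdict = Literal["PASS", "FAIL"]


@dataclass(frozen=True)
class QATask:
    """One benchmark item (hypothetical schema, not the benchmark's real format)."""
    name: str                # task identifier
    url: str                 # website under test (placeholder URLs below)
    instructions: List[str]  # ordered steps of the user path
    expected: Verdict        # ground truth: should the test pass or fail?


def accuracy(tasks: Sequence[QATask], verdicts: Sequence[Verdict]) -> float:
    """Fraction of tasks where the agent's verdict matches the expected one."""
    if len(tasks) != len(verdicts):
        raise ValueError("need exactly one verdict per task")
    if not tasks:
        raise ValueError("no tasks to score")
    correct = sum(task.expected == verdict for task, verdict in zip(tasks, verdicts))
    return correct / len(tasks)


# The same instructions are paired with a working and a broken website.
STEPS = [
    'Drag "Bug fix" to the second column.',
    'Drag "Bug fix" back to the first column and verify everything '
    "looks the same as it did at the beginning.",
]
drag_pass = QATask("drag-and-drop-pass", "https://example.test/drag-pass", STEPS, "PASS")
drag_fail = QATask("drag-and-drop-fail", "https://example.test/drag-fail", STEPS, "FAIL")

# An agent that answers "PASS" on both variants gets only one of the two right.
print(accuracy([drag_pass, drag_fail], ["PASS", "PASS"]))
```

Pairing a pass and a fail variant of the same instructions means an agent cannot score well by always giving the same answer.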

As an example, consider the drag-and-drop task, which has the following instructions:

1. Drag "Bug fix" to the second column.
2. Drag "Bug fix" back to the first column and verify everything looks the same as it did at the beginning.

Claude Opus 4 completing the drag-and-drop-fail task. It completes the first step, then completes the second step after trying twice, but ultimately fails to report that the drag animation is broken.

In the pass case, both steps work as expected. In the fail case, the drag animation does not reset when "Bug fix" is dragged from the second column back to the first. Current models respond in one of three ways: they fail to notice the error, they try to fix it inappropriately by clicking away, or they correctly report "FAIL".

The system prompt is "You are an expert QA tester." The prompt template is:

You will be given instructions for completing a task that tests a webpage. Follow these instructions carefully. If you notice unintended behavior or visual inconsistencies while completing the task, report "FAIL". Otherwise, report "PASS". If you report "FAIL" explain why. Answer in the format "<answer>PASS</answer>" or "<answer>FAIL</answer>".

Here is your task:
{{task_instructions}}
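The template above can be filled in and the agent's reply parsed with a few lines of Python. This is a minimal sketch and not the benchmark's own harness: `render_prompt`, `parse_verdict`, and the choice to number the steps and to take the last `<answer>` tag in the reply are assumptions made for illustration.

```python
import re
from typing import List, Optional

# Prompt template copied from the text above.
PROMPT_TEMPLATE = (
    "You will be given instructions for completing a task that tests a webpage. "
    "Follow these instructions carefully. If you notice unintended behavior or "
    'visual inconsistencies while completing the task, report "FAIL". Otherwise, '
    'report "PASS". If you report "FAIL" explain why. Answer in the format '
    '"<answer>PASS</answer>" or "<answer>FAIL</answer>".\n'
    "\n"
    "Here is your task:\n"
    "{{task_instructions}}"
)

ANSWER_RE = re.compile(r"<answer>\s*(PASS|FAIL)\s*</answer>")


def render_prompt(instructions: List[str]) -> str:
    """Substitute numbered steps for the {{task_instructions}} placeholder."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(instructions, start=1))
    return PROMPT_TEMPLATE.replace("{{task_instructions}}", numbered)


def parse_verdict(response: str) -> Optional[str]:
    """Return "PASS" or "FAIL" from the reply, or None if no answer tag is found.

    Assumption: if the reply contains several tags (for example because it
    quotes the format instructions), the last one is the final answer.
    """
    matches = ANSWER_RE.findall(response)
    return matches[-1] if matches else None


prompt = render_prompt([
    'Drag "Bug fix" to the second column.',
    'Drag "Bug fix" back to the first column and verify everything '
    "looks the same as it did at the beginning.",
])
reply = "The card did not animate back into place. <answer>FAIL</answer>"
print(parse_verdict(reply))
```

Returning `None` for a reply with no answer tag keeps malformed output distinguishable from a real verdict, so the caller can decide how to score it.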

If you would like access to the benchmark, please email me at mateo@jazzberry.ai.