Each task in Simple Web QA pairs a website with a test: a set of instructions describing a user path through the site. The job of the computer use agent is to determine whether the test should pass or fail.
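To make the task structure concrete, here is a minimal sketch of how one task could be represented in Python. The names `QATask` and `score`, the field names, and the example URL are all assumptions for illustration, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QATask:
    """Hypothetical container for one task: a website plus a test."""
    name: str                # e.g. a task identifier
    url: str                 # the website under test
    instructions: List[str]  # the user path the agent should follow
    expected: str            # ground-truth verdict: "PASS" or "FAIL"

    def __post_init__(self) -> None:
        # Reject anything other than the two verdicts a test can have.
        if self.expected not in ("PASS", "FAIL"):
            raise ValueError(f"expected must be PASS or FAIL, got {self.expected!r}")


def score(task: QATask, reported: str) -> bool:
    """Return True when the agent's reported verdict matches the ground truth."""
    return reported.strip().upper() == task.expected


# Placeholder task; the instruction text is deliberately not a real one.
demo = QATask(
    name="demo-fail",
    url="https://example.com",
    instructions=["<step 1>", "<step 2>"],
    expected="FAIL",
)
```

Under this sketch, an agent is credited only when its verdict matches `expected`, so reporting "PASS" on a deliberately broken site counts as a miss.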
As an example, consider the drag-and-drop task, which has the instructions:
Claude Opus 4 completing the drag-and-drop-fail task. It completes the first step, then completes the second step after trying twice, but ultimately fails to report that the drag animation is broken.
In the pass case, this works, but in the fail case, the drag animation does not reset when dragging from the second to the first column. With current models, one of three things happens: the error goes unnoticed, the model tries to work around it inappropriately by clicking away, or the model correctly reports "FAIL".
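One way these outcomes could be graded is to pull the final verdict out of the agent's report and compare it with the expected label. The sketch below assumes the agent writes an uppercase "PASS" or "FAIL" somewhere in its report and that the last such token is its final answer. The function names `extract_verdict` and `grade` are hypothetical and not taken from the benchmark's code.

```python
import re
from typing import Optional

# Match standalone uppercase verdict tokens only, so words such as
# "failed" or "FAILED" are not mistaken for a verdict.
_VERDICT = re.compile(r"\b(PASS|FAIL)\b")


def extract_verdict(report: str) -> Optional[str]:
    """Return the last PASS/FAIL token in the report, or None if there is none."""
    matches = _VERDICT.findall(report)
    return matches[-1] if matches else None


def grade(report: str, expected: str) -> bool:
    """True only when the agent's final verdict equals the expected label."""
    return extract_verdict(report) == expected
```

With this scheme, a run on the fail variant where the broken animation goes unnoticed and the agent reports "PASS" is graded as incorrect, as is a run that never states a verdict at all.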
Here is the prompt template used. The system prompt is "You are an expert QA tester."
If you would like access, please email me at mateo@jazzberry.ai.