Recently, the OSWorld-MCP team officially released OSWorld-MCP, described as the first benchmark for comprehensively evaluating computer-use agents. The benchmark aims to give developers and users an assessment of agent capabilities in real-world settings, making evaluations more realistic, fair, and comparable.


OSWorld-MCP evaluates three things together: an agent's ability to invoke Model Context Protocol (MCP) tools, its skill at operating a graphical user interface (GUI), and its decision-making. The benchmark includes 158 verified MCP tools covering seven common applications: LibreOffice Writer, Calc, Impress, VS Code, Google Chrome, VLC, and operating system utilities. Of these tools, 25 are used specifically for robustness testing, which helps keep the evaluation comprehensive and reliable.
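For readers unfamiliar with MCP tool invocation, the following is a minimal sketch of what a single tool call looks like on the wire. MCP messages use JSON-RPC 2.0, and a tool is invoked with the `tools/call` method. The helper `build_tool_call`, the tool name `calc_set_cell_value`, and its arguments are hypothetical, chosen only for illustration; they are not taken from the benchmark's actual tool list.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Return an MCP-style JSON-RPC 2.0 request that invokes one tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool name and arguments, for illustration only.
request = build_tool_call(1, "calc_set_cell_value", {"cell": "B2", "value": "42"})
print(json.dumps(request, indent=2))
```

An agent under evaluation has to decide at each step whether to emit a request like this or to act on the GUI directly.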

In addition, OSWorld-MCP includes 250 tasks to which tools are applicable, and 69% of the benchmark's tasks benefit from MCP tools. Because tools may need to be invoked over multiple rounds, agents face realistic decisions about when to call a tool and when to operate the GUI, which makes the results more informative. According to the reported data, models that use MCP tools are markedly more accurate and efficient: OpenAI's o3 raised its accuracy from 8.3% to 20.4% in the 15-step setting. Claude-4-Sonnet recorded the highest tool invocation rate in the tests, at only 36.3%, which leaves considerable room for improvement.
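To put the o3 figures in perspective, the short sketch below works out the absolute and relative gain from the two accuracy values quoted above. The function name `accuracy_gain` is made up for this example; only the 8.3% and 20.4% inputs come from the reported results.

```python
def accuracy_gain(before_pct, after_pct):
    """Return (absolute gain in percentage points, relative gain as a ratio)."""
    absolute = after_pct - before_pct
    relative = after_pct / before_pct
    return absolute, relative


# Accuracy of OpenAI o3 without and with MCP tools, as quoted in the article.
absolute, relative = accuracy_gain(8.3, 20.4)
print(f"absolute gain: {absolute:.1f} percentage points")
print(f"relative gain: {relative:.2f}x")
```

In other words, the reported accuracy rises by about 12.1 percentage points, or roughly 2.46 times the baseline.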

The project is open source, giving developers access to its resources and documentation and encouraging shared, collaborative development. Detailed information and resources are available on the project's official website and GitHub page.

The release of OSWorld-MCP provides a powerful benchmark for evaluating computer-use agents and lays a foundation for the future development of related technologies.

GitHub: https://github.com/X-PLUG/OSWorld-MCP

Key Points:  

🌟 **OSWorld-MCP, described as the first benchmark for comprehensively evaluating computer-use agents, is officially released.**  

🛠️ **It includes 158 verified MCP tools covering seven common applications.**  

📈 **Models that use MCP tools show clear gains in accuracy and efficiency, while multi-round tool invocation poses real decision-making challenges.**