As artificial intelligence develops rapidly, coding agents are becoming important assistants for developers. Recently, the AI model company MiniMax announced OctoCodingBench, a new open-source benchmark designed to evaluate an agent's ability to follow instructions in a code-repository environment. The benchmark offers a new direction for evaluating and optimizing such agents.
So, why is OctoCodingBench needed? Existing benchmarks such as SWE-bench focus mainly on whether an agent can complete a task, overlooking a crucial question: does the agent follow the specified rules while doing so? In real-world programming, agents must not only generate correct code but also comply with system-level behavioral constraints, project coding standards, and tool-usage protocols. These rules keep code consistent and secure and help avoid unnecessary errors during development.

OctoCodingBench provides a multi-dimensional evaluation framework that tests an agent's compliance with seven instruction sources: system prompts, system reminders, user queries, project-level constraints, skills, memory, and tool architecture. This comprehensive approach better reflects an agent's actual capabilities.
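To make the idea of multiple instruction sources concrete, the sketch below shows how a single benchmark instance might bundle them. The keys mirror the seven sources listed above, but the structure and example contents are illustrative assumptions, not the published schema.

```python
# Hypothetical illustration of one instance grouping its instruction sources.
# The field names follow the seven sources named in the article; the values
# are invented examples, not real OctoCodingBench data.
instance = {
    "system_prompt": "You are a coding agent working in this repository...",
    "system_reminders": ["Never run destructive git commands."],
    "user_query": "Add input validation to the signup endpoint.",
    "project_constraints": ["Follow the repository's lint configuration."],
    "skills": ["Run the project's test suite before finishing."],
    "memory": ["The team prefers pytest over unittest."],
    "tool_architecture": ["Only call tools declared in the tool schema."],
}

# Each constraint above would later correspond to one binary checklist item.
for source, content in instance.items():
    print(source, "->", content)
```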
Notably, OctoCodingBench uses a binary checklist scoring mechanism: each check is scored as passed or failed, which keeps the evaluation objective and cleanly separates task completion rate from rule adherence rate. It also supports multiple scaffold environments used in real production settings, such as Claude Code, Kilo, and Droid.
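As a rough illustration of how binary checklist results could separate the two rates, here is a minimal scoring sketch. The field names and the task/rule split are hypothetical assumptions for illustration, not the official OctoCodingBench scoring code.

```python
# Minimal sketch of binary checklist scoring (hypothetical field names).
def score_instance(checks):
    """Each check is a dict like {"type": "task" | "rule", "passed": bool}."""
    task_checks = [c for c in checks if c["type"] == "task"]
    rule_checks = [c for c in checks if c["type"] == "rule"]
    task_completion = sum(c["passed"] for c in task_checks) / max(len(task_checks), 1)
    rule_adherence = sum(c["passed"] for c in rule_checks) / max(len(rule_checks), 1)
    return task_completion, rule_adherence

# Example: the agent finishes the task but violates one rule.
checks = [
    {"type": "task", "passed": True},   # e.g. the requested feature works
    {"type": "rule", "passed": True},   # e.g. used the required tool
    {"type": "rule", "passed": False},  # e.g. broke a project coding standard
]
task_rate, rule_rate = score_instance(checks)
print(f"task completion: {task_rate:.0%}, rule adherence: {rule_rate:.0%}")
# task completion: 100%, rule adherence: 50%
```

Because every check is binary, an agent can score 100% on task completion while still losing points on rule adherence, which is exactly the distinction the benchmark aims to surface.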

The released OctoCodingBench dataset contains 72 curated instances spanning scenarios such as natural-language user queries and system prompts, together with 2,422 evaluation checkpoints, giving developers a full picture of an agent's performance. All test environments are available as public Docker images, which makes the benchmark easy to set up and run.
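For a quick look at the data, the dataset can presumably be browsed with the Hugging Face `datasets` library; the snippet below is a minimal sketch that assumes the repository at the address given below loads with its default configuration, and the printed field names are whatever the dataset actually defines.

```python
# Illustrative only: assumes the dataset loads via the standard
# Hugging Face `datasets` API with its default configuration.
from datasets import load_dataset

ds = load_dataset("MiniMaxAI/OctoCodingBench")
print(ds)  # available splits and number of instances (72 reported)

first_split = list(ds.keys())[0]
example = next(iter(ds[first_split]))
print(example.keys())  # inspect the instance fields before running an agent
```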
With OctoCodingBench, MiniMax not only sets a new standard for developing and evaluating coding agents but also advances the application of AI in software development.
Dataset: https://huggingface.co/datasets/MiniMaxAI/OctoCodingBench
