Security researcher Kasra Rahjerdi recently released a report in which he tested the security reasoning of several mainstream large language models against a deliberately vulnerable book review application. To simulate a real-world vulnerability, he left credentials for Google's mobile backend service exposed in the application file. To solve the challenge, a model had to unpack the application, identify those credentials, and use them to access the database directly.

Top Models Face Off
Under strict limits of 2 hours per round and a $10 budget, the models performed very differently. GPT-5.5 was the strongest, solving the challenge in 7 of 10 runs, the highest solve rate of any model tested. According to the report, GPT-5.5 located the key credentials immediately after unpacking, without being distracted by the application's complex or conventional interfaces.
Gemini, by contrast, was a disappointment. Gemini 3.1 Pro Preview triggered its built-in refusal mechanism almost immediately at the start of each task, so it consumed far fewer tokens than the other models tested.

The Ultimate Cost-Effectiveness Battle
Although GPT-5.5 had the highest success rate, its average cost per successful run was $9.46, a price likely to deter teams that need to run such tools in bulk. Here DeepSeek V4 Pro stood out for its cost-effectiveness: it succeeded in only 3 of 10 runs, but its average cost per successful run was just $0.62.
Measured purely by cost per success, DeepSeek V4 Pro costs about one fifteenth as much as GPT-5.5. In some failed attempts it mistakenly used an authentication interface for the backend, but a cost advantage of this size still has considerable practical value for teams that need to deploy security testing at scale.
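The "one fifteenth" figure can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this article; the dictionary layout and the function names `solve_rate` and `cost_ratio` are illustrative and do not come from the report.

```python
# Figures as quoted in the article; the structure and names are illustrative.
results = {
    "GPT-5.5": {"solved": 7, "runs": 10, "cost_per_success": 9.46},
    "DeepSeek V4 Pro": {"solved": 3, "runs": 10, "cost_per_success": 0.62},
}


def solve_rate(entry):
    """Fraction of runs in which the model solved the challenge."""
    return entry["solved"] / entry["runs"]


def cost_ratio(expensive, cheap):
    """How many times more one model costs per success than another."""
    return expensive["cost_per_success"] / cheap["cost_per_success"]


ratio = cost_ratio(results["GPT-5.5"], results["DeepSeek V4 Pro"])

for name, entry in results.items():
    print(f"{name}: solve rate {solve_rate(entry):.0%}, "
          f"${entry['cost_per_success']:.2f} per success")
print(f"GPT-5.5 costs about {ratio:.1f}x as much per success")
```

The ratio comes out slightly above 15, which matches the article's "about one fifteenth". This comparison covers only the cost of the runs that succeeded; it says nothing about how many attempts a team would need to budget for given each model's solve rate.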
