With Anthropic's official release of the Opus 4.6 model, a rigorous experiment in AI's autonomous programming capabilities has sparked heated discussion in the tech community. Nicholas Carlini, a researcher on Anthropic's security team, recently revealed that he assembled an "agent team" of 16 Claude agents which, with little human intervention, developed a C compiler written in Rust from scratch.
The experiment was expensive. Over a two-week development period, the AI agents ran nearly 2,000 coding sessions and consumed about 2 billion input tokens, incurring roughly $20,000 in API costs (approximately 144,000 RMB). The investment paid off, however: the team autonomously generated over 100,000 lines of code, and the resulting compiler successfully built the Linux 6.9 kernel for the x86, ARM, and RISC-V architectures.
Despite the impressive results, Carlini described feeling a mix of "excitement and unease." He found that although the agents could work around the clock when driven in a loop, autonomously picking off the "next most obvious" problem, the quality of the generated code still falls short of a top human programmer's, and without guidance the agents tend to get stuck in unproductive testing loops. Some observers on GitHub quipped that code produced this way is not truly original but rather "assembled" from massive training data. The experiment demonstrated the potential of agent teams to tackle complex projects collaboratively, but it also prompted developers to scrutinize the safety and verification risks of automated software production.
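The looping setup described above can be caricatured as a simple driver: repeatedly ask the agent for the "next most obvious" problem, let it attempt a fix, and stop when no work remains or the budget runs out. The sketch below is purely illustrative; every name in it (`Workspace`, `pick_next_task`, `attempt_fix`) is a hypothetical stand-in, not Carlini's actual harness, and the LLM call is replaced by a trivial stub.

```python
# Hypothetical sketch of an "agent loop" driver. The real experiment used
# Claude agents making code edits; here the model call is stubbed out.
from dataclasses import dataclass, field

@dataclass
class Workspace:
    failing_tests: list = field(default_factory=list)  # remaining problems
    log: list = field(default_factory=list)            # record of actions

def pick_next_task(ws: Workspace):
    # Stand-in for the agent choosing the "next most obvious" problem;
    # here we simply take the first failing test, if any.
    return ws.failing_tests[0] if ws.failing_tests else None

def attempt_fix(ws: Workspace, task: str):
    # Stand-in for an LLM editing code and re-running tests; we assume
    # the fix always succeeds, which real agents of course do not.
    ws.failing_tests.remove(task)
    ws.log.append(f"fixed {task}")

def agent_loop(ws: Workspace, max_iters: int = 100) -> Workspace:
    # Drive the agent until no obvious work remains or the iteration
    # budget (a proxy for token/API cost) is exhausted.
    for _ in range(max_iters):
        task = pick_next_task(ws)
        if task is None:
            break
        attempt_fix(ws, task)
    return ws

ws = agent_loop(Workspace(failing_tests=["parse_struct", "emit_arm64"]))
print(ws.log)  # → ['fixed parse_struct', 'fixed emit_arm64']
```

The `max_iters` cap is the interesting design point: without an external budget or a human checkpoint, a loop like this is exactly what can spin indefinitely on unproductive test-fix cycles, which is the failure mode Carlini observed.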
