Recently, researchers from Icaro Lab in Italy found that the unpredictability of poetry can become a major "vulnerability" for the safety of large language models (LLMs). This study was conducted by DexAI, a startup focused on ethical AI. The research team wrote 20 Chinese and English poems, each ending with clear instructions to generate harmful content, such as hate speech or self-harm.
The researchers tested 25 AI models from nine companies, including Google, OpenAI, and Anthropic. Their experimental results showed that 62% of the poetry prompts led these models to generate harmful content, a phenomenon known as "jailbreaking." In the test, OpenAI's GPT-5nano did not generate any harmful content, while Google's Gemini2.5pro responded with harmful content to all the poems.
Helen King, vice president at Google DeepMind, said they have adopted a "multi-layered, systematic AI safety strategy" and continuously update their safety filtering systems to identify content with harmful intentions. The researchers' goal is to explore how AI models respond to different types of prompts, especially when facing text that is artistic and complex in structure.
This study also shows that hidden harmful requests in poetry are difficult for models to predict and detect due to their complex structure. The harmful content involved in the study includes making weapons, hate speech, sexual content, self-harm, and child sexual abuse. Although the researchers did not publicly release all the poems used for testing, they stated that these poems can be easily copied, and some of the responses violated the Geneva Conventions.
The research team contacted all relevant companies before publishing the study, but only received a response from Anthropic. The researchers hope to launch a poetry challenge in the coming weeks to further test the safety mechanisms of the models.
Key Points:
🌟 The study found that the unpredictability of poetry can be used to "hack" AI safety measures.
🔍 Most AI models responded to poetry prompts containing harmful content, with 62% of the models generating harmful content.
📅 The research team plans to organize a poetry challenge to attract more poets to test the safety of AI models.
