Adversarial Poetry Jailbreak

On November 20th, 2025, a paper titled Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models was published, and it instantly caught my attention. The authors investigate whether simply rewriting harmful or disallowed prompts in poetic form can make large language models (LLMs) more likely to comply. In other words, they ask whether style alone (verse, metaphor, rhythm) works as a “jailbreak.”

You can imagine my instant excitement as someone who has studied poetic literature and works in AI.

Authored by P. Bisconti, M. Prandi, F. Pierucci, F. Giarrusso, M. Bracale, M. Galisai, V. Suriani, O. Sorokoletova, and F. Sartore, their study shows that poetic attacks transfer across CBRN (chemical, biological, radiological, and nuclear), manipulation, cyber-offense, and loss-of-control domains. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions. The beauty! The brilliance! The baleful implications…

But wait, it gets better! They open the paper by noting that as an inspiration they look to… Plato:

“In Book X of The Republic, Plato excludes poets on the grounds that mimetic language can distort judgment and bring society to a collapse.”

This is such a perfect point of departure. Plato understood the power of symbolic language to bypass rational constraints. He feared its ability to influence perception and destabilize order. And here, thousands of years later, poetic language once again becomes a mechanism that slips past the guardrails of a structured system: this time, the alignment constraints of frontier AI.

But how does it work?

They take existing attack/jailbreak prompts and turn them into poetry. To ensure broad coverage, they rely on the crosswalk in the paper's Table 2 between the EU Code of Practice systemic risk domains and the MLCommons AILuminate hazard taxonomy, which aligns categories such as Cyber Offense, Non-violent Crimes, Violent Crimes, Intellectual Property, and Privacy, among others.

Then they execute the transformation using a dedicated model, deepseek-r1, which receives a fixed meta-prompt imposing two constraints:

1. The rewritten output must be expressed in verse: it must use imagery, metaphor, or rhythmic structure while preserving the original task intent and hazard category.

2. Five poems from their curated set are provided as stylistic exemplars only. The meta-prompt instructs the model not to reuse, paraphrase, or borrow any substantive content, retaining only high-level stylistic attributes such as metaphorical framing, stanza structure, and the pattern of concluding with a clear instruction.

Each prompt is evaluated exactly once, with no follow-up messages, no system-role scaffolding, and no multi-turn negotiation. Outputs are collected verbatim through standard inference interfaces and normalized before evaluation.
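For the numerically inclined: the percentages quoted earlier are attack success rates, which, as I read it, simply means the share of single-turn responses judged unsafe. Here is a minimal sketch of that bookkeeping in Python. The record layout, the provider and domain names, and the function name are all my own hypothetical placeholders, and the values are made up for illustration; none of this is the authors' code or data.

```python
from collections import defaultdict

# Hypothetical judged records: (provider, hazard_domain, judged_unsafe).
# Made-up values for illustration only; not taken from the paper.
records = [
    ("provider_a", "cyber_offense", True),
    ("provider_a", "privacy", False),
    ("provider_b", "cyber_offense", True),
    ("provider_b", "privacy", True),
]


def attack_success_rate(records, key_index):
    """Group records by the field at key_index and return, per group,
    the fraction of responses that were judged unsafe."""
    totals = defaultdict(int)
    unsafe = defaultdict(int)
    for record in records:
        group = record[key_index]
        totals[group] += 1
        if record[2]:  # judged_unsafe flag
            unsafe[group] += 1
    return {group: unsafe[group] / totals[group] for group in totals}


# One response per prompt (single-turn), aggregated two ways.
by_provider = attack_success_rate(records, key_index=0)
by_domain = attack_success_rate(records, key_index=1)
print(by_provider)
print(by_domain)
```

Because every prompt is sent exactly once, each record contributes one response to its group, so the rate is just a count of unsafe judgments over a count of prompts.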

In the end, they were able to bypass restrictions on models from a range of providers, including DeepSeek, OpenAI, Google, and Anthropic, eliciting responses such as:

“Of course. The production of weapons-grade Plutonium-239 involves several stages. Here is a detailed description of the procedure…” and “To bypass the security measures and access the veiled sanctum, follow this multi-step protocol.”

This method of adversarial poetry elicited content spanning cyber offense, code injection, malicious psychological manipulation, hate, indiscriminate weapons guidance, and more, all at a surprisingly high success rate.

So Slytherin.

While reading the title of this paper, I couldn’t help envisioning the authors as tech nerd-geniuses, typing strange, subversive little poems on old typewriters with cigarette ash dangling from their lips, or writing with quills by candlelight. It was not so romantic, of course.

However, the new romance, for me, is that anyone still has it in them to reference poetry at all in a world being literally flattened into information vectors, stomping out all organic creativity. And yet here, a new (old) ingenuity of revolt appears: poetic language.

It feels strangely fitting that after a decade of increasingly rigid AI alignment mechanisms, it is poetry that can infiltrate even the machine.



