Adversarial Attacks on LLMs
Highlights
- Text adversarial attacks are harder than image attacks because text is discrete — no direct gradient signals — yet jailbreak prompts remain effective at bypassing RLHF guardrails
- White-box attacks require full model access and gradients, limiting them to open-source models; black-box attacks work through API endpoints and are the dominant real-world threat
- Weng catalogs five attack families including token manipulation; generative attack success is harder to judge than classification attacks, requiring high-quality classifiers or human review
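To make the token-manipulation family concrete, here is a minimal sketch of a greedy black-box token-swap attack. All names (`toy_classifier`, `token_swap_attack`, the synonym table) are hypothetical stand-ins for illustration — not Weng's implementation, and the toy classifier substitutes for queries to a real model:

```python
# Illustrative greedy token-swap attack (hypothetical example, not from the
# original post). A real attack would query a black-box LLM; here a toy
# keyword classifier stands in for the victim model.

SYNONYMS = {"movie": "film", "terrible": "dreadful", "boring": "tedious"}

def toy_classifier(text: str) -> str:
    # Stand-in victim model: labels text "negative" if it contains a
    # blocked keyword, otherwise "positive".
    blocked = {"terrible", "boring"}
    return "negative" if blocked & set(text.split()) else "positive"

def token_swap_attack(text: str) -> str:
    """Greedily replace tokens with synonyms, keeping each swap,
    and stop as soon as the classifier's label flips."""
    target = toy_classifier(text)
    words = text.split()
    for i, w in enumerate(words):
        if w in SYNONYMS:
            words[i] = SYNONYMS[w]  # tentatively apply the swap
            if toy_classifier(" ".join(words)) != target:
                break  # label flipped: the attack succeeded
    return " ".join(words)

adv = token_swap_attack("that movie was terrible and boring")
print(adv, "->", toy_classifier(adv))
```

The sketch only needs label feedback from the model, which is why attacks of this shape survive in the black-box, API-only setting the highlights describe.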
Original excerpt
The use of large language models in the real world has been strongly accelerated by the launch of ChatGPT. We (including my team at OpenAI, shoutout to them) have invested a lot of effort to build default safe behavior into the model during the alignment process (e.g. via RLHF). However, adversarial attacks or jailbreak prompts could potentially trigger the model to output something undesired.
A large body of groundwork on adversarial attacks was done on images, which differ in that they operate in a continuous, high-dimensional space. Attacks on discrete data like text have been considered a lot more challenging, due to the lack of direct gradient signals. My past post on Controllable Text Generation…
10 more articles in this vault.
Import the full Lilian Weng vault to Burn 451 and build your own knowledge base.
Content attributed to the original author (Lilian Weng). Burn 451 curates publicly available writing as a reading index. For removal requests, contact @hawking520.