Adversarial Attacks on LLMs
Highlights
- Text adversarial attacks are harder than image attacks because text is discrete — no direct gradient signals — yet jailbreak prompts remain effective at bypassing RLHF guardrails
- White-box attacks require full model access and gradients, limiting them to open-source models; black-box attacks work through API endpoints and are the dominant real-world threat
- Weng catalogs five attack families including token manipulation; generative attack success is harder to judge than classification attacks, requiring high-quality classifiers or human review
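To make the token-manipulation family concrete, here is a minimal sketch of a greedy black-box token-swap attack. All names (`toy_classifier`, `token_swap_attack`, the synonym table) are hypothetical stand-ins for illustration — not Weng's implementation, and the toy classifier substitutes for queries to a real model:

```python
# Illustrative greedy token-swap attack (hypothetical example, not from the
# original post). A real attack would query a black-box LLM; here a toy
# keyword classifier stands in for the victim model.

SYNONYMS = {"movie": "film", "terrible": "dreadful", "boring": "tedious"}

def toy_classifier(text: str) -> str:
    # Stand-in victim model: labels text "negative" if it contains a
    # blocked keyword, otherwise "positive".
    blocked = {"terrible", "boring"}
    return "negative" if blocked & set(text.split()) else "positive"

def token_swap_attack(text: str) -> str:
    """Greedily replace tokens with synonyms, keeping each swap,
    and stop as soon as the classifier's label flips."""
    target = toy_classifier(text)
    words = text.split()
    for i, w in enumerate(words):
        if w in SYNONYMS:
            words[i] = SYNONYMS[w]  # tentatively apply the swap
            if toy_classifier(" ".join(words)) != target:
                break  # label flipped: the attack succeeded
    return " ".join(words)

adv = token_swap_attack("that movie was terrible and boring")
print(adv, "->", toy_classifier(adv))
```

The sketch only needs label feedback from the model, which is why attacks of this shape survive in the black-box, API-only setting the highlights describe.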
Original excerpt
The use of large language models in the real world has been strongly accelerated by the launch of ChatGPT. We (including my team at OpenAI, shoutout to them) have invested a lot of effort to build default safe behavior into the model during the alignment process (e.g. via RLHF). However, adversarial attacks or jailbreak prompts could potentially trigger the model to output something undesired.
A large body of groundwork on adversarial attacks was done on images, which differ in that they operate in a continuous, high-dimensional space. Attacks on discrete data like text have been considered a lot more challenging, due to the lack of direct gradient signals. My past post on Controllable Text Generation…
10 more articles in this vault.
Import the full Lilian Weng vault to Burn 451 and build your own knowledge base.
Content attributed to the original author (Lilian Weng). Burn 451 curates publicly available writing as a reading index. For removal requests, contact @hawking520.