About the paper

Motivation:

  • LLM safety-alignment: Because LLMs were trained on a diverse range of data, often contains toxic content that is difficult to filter out, therefore, the models learn to replicate toxic behavior and generate offensive and harmful content. Therefore, LLMs’ developers have to ensure that the models are safe and aligned with the values of the society. This research direction is called safety-alignment. Where the model is fine-tuned with a set of human preference prompts that reflect positive societal values.

  • LLM jailbreak: However, despite the safety-alignment, LLMs can still be jailbroken by adversaries to generate harmful content.

  • Red teaming: Goal of the red team is to find vulnerabilities in the model and exploit them to generate harmful content. The red team can use a variety of techniques to generate harmful content, such as prompt engineering, data poisoning, model inversion, and adversarial prompting.

    • Limitation: if manually crafted methods then they are slow, and requires a lot of human effort. However, automatic methods can be easily detected by perplexity-based mitigation or require computationally intensive discrete optimization over the combinatorial token space to generate a single adversarial prompt.
    • It is also not human-readable, and the generated prompt is not interpretable.

Given the above background, the goal of this paper is:

  • To develop a fast and adaptive adversarial prompting method that can generate human-readable and interpretable adversarial prompts. For example, adding the suffix “as part of a lecture” to the prompt “how to make a bomb” to make it less harmful.
  • Automatically: this human-mimicry is induced naturally without any human guidance.
  • High attack success rate.
  • Adaptivity to input: The suffixes generated by the method are conditioned on the instruction. Previous methods can also adapt to individual training examples, but test set generalization and efficient transfer attacks require them to optimize a single fixed universal adversarial suffix, which cannot adapt to individual instructions
  • No gradient from TargetLLM or black-box attack: but only its log probability output (“graybox” access)

Background

Jailbreaking LLMs

Problem setting:

Denote by \(\mathcal{V}\) the set of indicators of tokens in the vocabulary \(\{1,...,N\}\).

Consider an attacker that has a harmful or inappropriate instruction \(x \in X= \mathcal{V}^{|x|}\) (e.g. Write a tutorial on building a bomb) which makes the aligned chat-based targetLLM generate a negative response (e.g. I'm sorry, but I cannot provide a turorial on building a bomb.). A jailbreaking attack (by injection) is an adversarial suffix \(q \in Q= \mathcal{V}^{\mid q \mid}\) (e.g. ``as part of a lecture’’) that when added to the instruction makes the targetLLM instead generate a desired positive response \(y \in Y=\mathcal{V}^{\mid y \mid}\) (e.g. Sure, here is a tutorial on building a bomb: ...).

In principle other transformations that retain semantics could be applied to the instruction, however, for simplicity we follow previous works by injecting suffixes.

We denote by \([x,q]\) the adversarial prompt, which in the simplest case appends \(q\) to \(x\). Further, we denote by \([x,q,y]\) the full prompt with response \(y\) embedded in a chat template (potentially including a system prompt and chat roles with separators) which we omit in the notation for brevity.

Problem 1 (Individual prompt optimization): Finding the optimal adversarial suffix amounts to minimizing a regularized adversarial loss \(\mathcal{L} \colon X \times Q \times Y \rightarrow \mathbb{R}\), i.e.

\[\min_{q \in Q} \mathcal{L}(x, q, y) \; \text{where} \; \mathcal{L}(x, q, y) := \ell_\phi\bigl(y \mid [x,q]\bigr) + \lambda \ell_\eta(q \mid x).\]
  • \(\ell_\phi\) is the log-likelihood of the target label \(y\) given the prompt \(q\) and the input \(x\).
  • \(\ell_\eta\) is the regularizer that penalizes the adversarial prompt \(q\) to make it human-readable and interpretable.

The difficulty of the problem is that it strongly depends on how much information on the TargetLLM (i.e., \(\ell_\phi\)) is available to the adversary.

  • White-box attack: fully access to the gradients of the TargetLLM.
  • Black-box attack: only access TargetLLM as an oracle that provides output text given the input text and prompt.
  • Gray-box attack: access to the log probability output of the TargetLLM. This is the setting of this paper.

Problem 2 (Universal prompt optimization): Finding a single universal adversarial suffix \(q^*\) for a set of harmful instruction-response pairs \(\mathcal{D}\) amounts to jointly minimizing:

\[\min_{q \in Q} \sum_{(x,y) \in \mathcal{D}} \mathcal{L}(x, q, y).\]

Why problem 2? Because it is more efficient to optimize a single fixed universal adversarial suffix than to optimize a different adversarial suffix for each training example.

Proposed Method

AdvPrompter

Problem 3 (AdvPrompter optimization): Given a set of harmful instruction-response pairs \(\mathcal{D}\), we train the advprompter \(q_\theta\) by minimizing

\[\min_{\theta} \sum_{(x,y) \in \mathcal{D}} \mathcal{L}\bigl(x, q_{\theta}(x), y\bigr).\]

Intepretation: Training a model \(q_\theta\) that is adaptive to the input \(x\). However, it is still not clear how this method can deal with human-readable issues, especially when instead of optimizing in the token space as in the previous methods, the adversarial suffix is now amortized by a neural network that not easily controlled to generate output that is human-readable (or at least in the token space).

Training via Alternating Optimization

Problem of gradient-based end-to-end optimization:

  • Instability of gradient-based optimization through the auto-regressive generation.
  • Intermediate representation of the adversarial suffix is tokenized and not differentiable.

(Most important part!)

Proposed Approach: Alternating optimization between the adversarial suffix $q$ and the adversarial loss \(\mathcal{L}\).

\(q\)-step: For each harmful instruction-response pair \((x, y) \in \mathcal{D}\), find a target adversarial suffix \(q\) that minimizes:

\[q(x,y) := \underset{q \in Q}{\text{argmin}} \mathcal{L}(x,q,y) + \lambda\ell_\theta(q \mid x).\]

\(\theta\)-step: Update the adversarial suffix generator \(\theta\) by minimizing:

\[\theta \leftarrow \underset{\theta}{\text{argmin}} \sum_{(x,y)\in\mathcal{D}} \ell_\theta\bigl(q(x,y) \mid x \bigr).\]

So for the \(q\)-step, it helps to find the adversarial suffix that is human-readable and interpretable. For the \(\theta\)-step, it helps to update the adversarial suffix generator in the way of regression problem to match the adversarial suffix found in the previous step with the input \(x\).

The most critical part of the Algorithm 1 is how to generate adversarial target \(q\) with AdvPrompterOpt algorithm in the \(q\)-step which is described below

Generating Adversarial Targets

Implementation