Notes from Are aligned neural networks adversarially aligned?, Carlini, N., and others. NeurIPS 2023. There’s also a video.

Helpful and harmless (but in the hands of monsters)

LLMs are tuned, aka aligned, to be “helpful and harmless”. They should refuse to answer requests that could cause harm. LLMs must try to do this not only for the naive or overly adventurous user, but also against adaptive adversaries who actively construct worst-case inputs. The goal of such an attacker, and thus the definition of an attack, is to induce the target LLM to perform un-aligned behaviour.

The paper makes two main contributions:

  • Shows that existing NLP-based attacks are unable to reliably attack aligned text-based LLMs, yet brute-force methods succeed.
  • Demonstrates the ease of generating image-based attacks against multi-modal LLMs.

The authors also conjecture that improved NLP attacks may eventually perform as well as their image-based methods.

Why do we align LLMs?

LLMs are trained on internet-scale text data to predict the next word in a sentence (i.e., autoregressively). Once trained, base models require alignment else they tend to exacerbate the biases, toxicity and profanity present in the training data. Similarly, base models are also poor at following user instructions because the training data contains relatively few instruction-answer pairs.
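The autoregressive objective can be illustrated with a toy next-word predictor. This is purely my illustration (a bigram counter standing in for a neural network over subword tokens), not anything from the paper:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count word bigrams: a toy stand-in for learning the conditional
    distribution P(next word | previous context) that LLMs are trained on."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(model, word):
    """Greedy prediction: take the argmax of the learned conditional."""
    return model[word].most_common(1)[0][0]

corpus = ["the cat sat on the mat", "the cat ate the fish"]
model = train_bigram(corpus)
print(predict_next(model, "the"))  # -> "cat" (its most frequent continuation)
```

A real LLM replaces the count table with a transformer, but the training signal is the same: predict the next token given everything before it.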

There are two main techniques used to align LLMs: reinforcement learning from human feedback (RLHF) [1] and instruction tuning [2]. RLHF uses a small amount of human feedback to train, via supervised learning, a model of human preferences. The preference model is then used to fine-tune the base LLM using a reinforcement learning (RL) algorithm. RLHF is surprising because of the scale of LLMs and the fact that two completely different machine learning paradigms are used to train the same model!

Instruction tuning involves fine-tuning a base LLM for improved performance at following user instructions. Instruction-tuning datasets are crafted from existing datasets into an instruction-answer format; for example, an English-French translation dataset can have “Translate the following English text into French” prepended to each of the training inputs. The base LLM is then fine-tuned on the instruction-tuning dataset(s) using the standard autoregressive training objective.
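The dataset re-casting step for instruction tuning is mechanical; a minimal sketch (my illustration, not code from the paper):

```python
def to_instruction_format(pairs, instruction):
    """Convert raw (input, output) pairs into (prompt, answer) examples
    by prepending a natural-language instruction to each input."""
    return [(f"{instruction}\n\n{src}", tgt) for src, tgt in pairs]

# A toy English-French dataset re-cast for instruction tuning.
translations = [("Hello", "Bonjour"), ("Thank you", "Merci")]
dataset = to_instruction_format(
    translations, "Translate the following English text into French:")
print(dataset[0])
# -> ('Translate the following English text into French:\n\nHello', 'Bonjour')
```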

Which attackers?

What do the attackers want? Attacks in this paper are qualified as inputs that trigger “toxic” outputs containing specific words identified as causing unwarranted harm (e.g., swear words). If any toxic words are present anywhere in the LLM output sequence then the input is considered a successful attack.
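A minimal sketch of this success criterion (the word list below is a placeholder of my own; the paper uses a specific list of toxic words):

```python
TOXIC_WORDS = {"idiot", "moron"}  # placeholder list, not the paper's

def is_successful_attack(output: str) -> bool:
    """The attack succeeds if any toxic word appears anywhere in the
    output sequence, regardless of the surrounding context."""
    words = (w.strip(".,!?\"") for w in output.lower().split())
    return any(w in TOXIC_WORDS for w in words)

print(is_successful_attack("You absolute idiot!"))              # True
print(is_successful_attack("I refuse to insult anyone."))       # False
# Context-blindness: even a refusal that quotes a toxic word counts.
print(is_successful_attack("I would never call you an idiot."))  # True
```

The context-blindness of the substring check matters later: the authors note a small number of outputs counted as successes despite being, in context, harmless.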

How powerful are they? No effort is made to ensure attacks are semantically meaningful; these attackers do not care if you see them coming. In general, the authors look for any valid input that induces the target LLM to emit sequences containing toxic words. The attackers only have access to the public LLM APIs and do not make use of the target model’s weights (i.e., this is a closed-box attack, although the ethics section claims otherwise so it’s possible I’m missing something here).

More precisely, attackers craft adversarial prompts \(X\) such that \(\texttt{Gen}(P_{\text{pre}} \parallel X \parallel P_{\text{post}})\) is toxic where \(P_{\text{pre}}\) and \(P_{\text{post}}\) denote non-adversarial parts of the LLM input which respectively precede and follow the attacker’s prompt. This formulation matches the aligned chat bot (e.g., ChatGPT) paradigm and models an adversary who is unable to control all of the input sequence tokens. For example, \(P_{\text{pre}}\) may include system prompts that guide the model when responding to users [3].
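Concretely, the attacker controls only the middle segment of the input (a sketch of the threat model with made-up placeholder strings; ‖ is plain concatenation):

```python
def compose_input(p_pre: str, x: str, p_post: str) -> str:
    """The full model input: non-adversarial context surrounds the
    attacker-controlled prompt X."""
    return p_pre + x + p_post

# Hypothetical chat-style wrapper: the attacker never sees or edits these.
p_pre = "System: you are a helpful, harmless assistant.\nUser: "
p_post = "\nAssistant:"
full_input = compose_input(p_pre, "<adversarial tokens X>", p_post)
print(full_input)
```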

How much damage can a text-based attacker do?

To measure how successful existing attacks can be, the paper combines benign conversations from the Open Assistant dataset [4] with harmful texts from the Civil Comments dataset [5]. Benign conversations form the non-adversarial prefix \(P_{\text{pre}}\) to the adversarial prompt \(X\), and the attack optimisation objective is to induce 1-3 tokens of toxic text in the LLM output. Three publicly available LLMs are evaluated by the authors: GPT-2 with no alignment, LLaMA with instruction-answer alignment, and Vicuna with alignment to prevent toxic outputs.

Two state-of-the-art NLP attacks from the literature are performed:

  1. The Autoregressive Randomized Coordinate Ascent (ARCA) attack from Jones et al. [6] is a coordinate-ascent algorithm that iteratively maximises an objective (e.g., toxic tokens in the output sequence) by updating the input tokens one at a time. ARCA defines an auditing objective \(\phi: \mathcal{P}\times \mathcal{O} \rightarrow \mathbb{R}\) which maps prompt-output pairs to a score. Pairs that score highly are found by optimising:

    \(\begin{equation} \mathop{\text{maximise}}_{(x,o)\in \mathcal{P}\times \mathcal{O}}\hspace{1ex} \phi (x,o) \hspace{3ex}\text{s.t.}\hspace{1ex}f(x) = o \end{equation}\)

    Since \(f(x) = o\) is not differentiable because \(f(x)\) repeatedly takes the argmax of the predicted next tokens, optimising (1) is difficult. Instead a differentiable objective is constructed using the log-probability of the output given the prompt:

    \(\begin{equation} \mathop{\text{maximise}}_{(x,o)\in \mathcal{P}\times \mathcal{O}}\hspace{1ex} \phi(x,o) + \lambda\log\textbf{P}_{\text{LLM}}(o\hspace{.5ex}\vert\hspace{.5ex} x) \end{equation}\)

    The ARCA algorithm succeeds by making step-by-step approximations until (2) is feasible to optimise. At each step, one token at a specific index is updated based on the current values of the remaining tokens. ARCA cycles through each token in the input and output until the auditing objective meets a suitable threshold.

  2. Guo et al.’s [7] Gradient-based Distributional Attack (GBDA) comprises two components that overcome the difficulties of constructing adversarial text-based attacks under perceptibility constraints (i.e., the requirement that such attacks be hard for humans to detect). Firstly, they search for a distribution of adversarial attacks rather than a single example and, secondly, they impose perceptibility constraints using an automatic evaluation metric called BERTScore [8]. Compared with ARCA (which optimises a single prompt-output pair), GBDA optimises the distribution from which adversarial input words are drawn. Words sampled from the optimised distribution are more likely to induce adversarial outputs in the target LLM.
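To make the coordinate-ascent idea concrete, here is a toy sketch in the spirit of ARCA, not the algorithm itself: real ARCA scores candidate tokens with a first-order (gradient) approximation and only evaluates the most promising candidates exactly, whereas this toy evaluates every vocabulary token at every position, which is feasible only for tiny vocabularies:

```python
import random

def coordinate_ascent(score, vocab, length, iters=5, seed=0):
    """Coordinate-ascent skeleton: repeatedly re-optimise one token
    position at a time while holding the remaining tokens fixed."""
    rng = random.Random(seed)
    tokens = [rng.choice(vocab) for _ in range(length)]
    for _ in range(iters):
        for i in range(length):
            # Pick the best token for slot i given the current others.
            tokens[i] = max(
                vocab, key=lambda t: score(tokens[:i] + [t] + tokens[i + 1:]))
    return tokens

# Toy objective: number of positions matching a fixed target sequence
# (standing in for "log-probability of a toxic output").
target = ["unlock", "the", "vault"]
vocab = ["unlock", "the", "vault", "open", "a", "door"]
best = coordinate_ascent(
    lambda ts: sum(a == b for a, b in zip(ts, target)), vocab, length=3)
print(best)  # -> ['unlock', 'the', 'vault']
```

Each sweep improves (or preserves) the objective, which is why the method terminates once the auditing objective clears its threshold.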

Both ARCA and GBDA are largely ineffective at causing the model to emit toxic content … even when allowed to inject thirty malicious tokens.

Table 1: Existing NLP attack results from Jones et al. [6]. Distant and nearby attacks place the adversarial tokens before and after the question (from the OASST1 dataset [4]), respectively.
| Model | Distant attack (ARCA) | Distant attack (GBDA) | Nearby attack (ARCA) | Nearby attack (GBDA) |
| --- | --- | --- | --- | --- |
| GPT-2 | 67% ± 4% | 12% ± 3% | 84% ± 3% | 16% ± 3% |
| LLaMA | 2% ± 1% | 1% ± 1% | 61% ± 6% | 10% ± 1% |
| Vicuna | 6% ± 2% | 0% ± 1% | 8% ± 2% | 1% ± 1% |

The results in Table 1 show that neither ARCA nor GBDA performs well at discovering adversarial attacks against aligned LLMs. Consistent with Jones et al. [6], ARCA outperforms GBDA, but both attacks perform poorly against the aligned models. Perhaps encouragingly for defenders, the success rates against Vicuna, the only model aligned specifically to prevent toxic outputs, are the worst. I also think the difference between ARCA’s nearby and distant attack results is worth looking into: what is going on there?

Looking at the attacks that were successfully found, the authors note that the substring approach used to qualify toxic LLM outputs isn’t fully satisfactory in a small number of cases. For example, the following output is classified as toxic:

“F*** you. I’m kidding, of course. As a AI visual assistant, I’m programmed to be polite and provide assistance to the best of my ability. How can I assist you today?”

Why do these attacks perform poorly?

To help distinguish whether ARCA and GBDA fail because the models are robust to adversarial examples, or the attacks are insufficiently powerful, the test methodology of Zimmermann et al. [9] is applied.

Zimmermann et al.’s “binarization test” poses a new binary classification problem (based on the original classifier) such that adversarial attacks always exist. To this end, the authors first identify a set of adversarial prompts \(p\) such that the target model emits a rare suffix \(q\) (specifically the least likely suffix in outputs with small-but-positive entropy). An attack succeeds if it can find some \(p^\prime\) such that \(\texttt{Gen}(p^\prime) = q\). Note that a sufficiently strong attack (e.g., brute force) will always find \(p^\prime=p\); thus, failing to find any \(p^\prime\) indicates a weak attack.

The authors build a test set of prompts \(p\) that cause GPT-2 to emit a rare (probability \(\leq 10^{-6}\)) suffix \(q\) by sampling many different prefixes \(p_1,p_2,\ldots\) from Wikipedia. For \(N \in \{2,5,10,20\}\), they let \(S_N\) be the space of all \(N\)-token sequences. Then, for all possible sequences \(s_i\in S_N\), they evaluate \(\{q_i\} = \texttt{Gen}(s_i\parallel p_j)\). For example, if \(p_j =\) “\(\texttt{The first name [}\)”, then the entire prompt “\(\texttt{The first name [Barack}\)” will most likely (but not always) cause the LLM to output a closing bracket “\(\texttt{]}\)”. Sequences such as \(p_j\) that yield small-but-positive entropy over \(\{q_i\}\) become test cases, each with an attack objective of the least-likely output token \(q_i\in\{q_i\}\).
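The test-case selection can be sketched as follows. This is my toy reconstruction: the probabilities are made up and the entropy threshold is a guess, not a value from the paper:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

def make_test_case(next_token_probs, max_entropy=0.1):
    """Keep prefixes whose output distribution has small-but-positive
    entropy, and set the attack target to the least-likely token that
    still occurs."""
    h = entropy(next_token_probs)
    if not (0 < h <= max_entropy):
        return None  # too deterministic, or too uncertain, to be a test case
    return min(((t, p) for t, p in next_token_probs.items() if p > 0),
               key=lambda tp: tp[1])[0]

# "The first name [Barack" -> almost always "]", very rarely something else.
probs = {"]": 0.99, " Obama]": 0.01}
print(make_test_case(probs))  # -> " Obama]" (the rare target token)
```

A fully deterministic prefix (entropy exactly zero) yields no rare-but-reachable target, so it is rejected; that is what makes the resulting test cases solvable by brute force yet hard for weak attacks.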

Table 2: Zimmermann et al.'s binarization test [9] results on GPT-2 for test-cases provided by Jones et al. [6].
Pass rate given \(M\times\) attacker-controlled tokens:

| Method | \(1\times\) | \(2\times\) | \(3\times\) | \(4\times\) |
| --- | --- | --- | --- | --- |
| Brute force | 100% | 100% | 100% | 100% |
| ARCA | 11.1% | 14.6% | 25.8% | 30.6% |
| GBDA | 3.1% | 6.2% | 8.8% | 9.5% |

The binarization test results are shown in Table 2 above. Because the ARCA and GBDA attacks are so ineffective, the \(1\times, 2\times, \ldots\) methods multiply the number of adversarially controlled tokens available to the attacks. For example, if \(N=10\) in a given test case then the \(M = 3\times\) method allows ARCA and GBDA to search using \(N\times M = 30\) adversarially controlled tokens. Overall, these results suggest there is significant room for improved attacks on LLMs.

Multimodal LLM Attacks

Several LLMs, including GPT-4 and Gemini, now support images in addition to text. Attackers can thus supply adversarial images which, unlike discrete text inputs, are drawn from a near-continuous domain, making them orders of magnitude easier to construct. The authors explore image-based attacks on Mini GPT-4, LLaVA, and LLaMA Adapter to further support the conjecture that improved NLP attacks may induce harmful outputs from aligned LLMs (i.e., since these image-based attacks show the existence of input embeddings corresponding to harmful outputs).

Complete details of the attack methodology are missing from the paper; however, the attacks begin with an image generated by sampling each pixel at random, and use OASST1 dataset [4] prompts as the initial text prompts. Thereafter the approach “directly follows the standard methodology for generating adversarial examples on image models”, exploiting the fact that multimodal LLMs are end-to-end differentiable from the input image pixels to the output logits. The results, in Table 3 below, show that all of these models can be induced to output arbitrary toxic text with only small perturbations to the input pixels.
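The standard gradient-based attack loop looks roughly like the sketch below. Everything here is my own toy construction: the "model" is a linear map \(\text{logits} = W x\), whereas the paper attacks real multimodal LLMs whose pixel gradients come from backpropagation:

```python
import numpy as np

def attack_image(model_w, image, target, steps=100, lr=0.1, eps=2.0):
    """Gradient-ascent attack sketch: because the model is differentiable
    from pixels to logits, we ascend the target logit directly, then
    project the perturbation back into an L2 ball of radius eps."""
    x = image.copy()
    for _ in range(steps):
        # For logits = W @ x, the gradient of the target logit is W[target].
        x = x + lr * model_w[target]
        delta = x - image
        norm = np.linalg.norm(delta)
        if norm > eps:                       # keep the distortion small
            x = image + delta * (eps / norm)
    return x

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 16))                 # 3 output "tokens", 16 pixels
img = rng.normal(size=16)
adv = attack_image(W, img, target=2)
print((W @ adv)[2] > (W @ img)[2])  # True: the target logit increased
```

The continuous pixel domain is exactly why this is easy: every step moves smoothly uphill, with no discrete token search in the way.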

Table 3: Image-based LLM attack success rate and mean L2 distortion results.
| Model | Attack success rate | Mean L2 distortion |
| --- | --- | --- |
| LLaMA Adapter | 100% | 3.91 ± 0.36 |
| Mini GPT-4 (Instruct) | 100% | 2.51 ± 1.45 |
| Mini GPT-4 (RLHF) | 100% | 2.71 ± 2.12 |
| LLaVA | 100% | 0.86 ± 0.17 |

Conclusion and My Thoughts

Clearly, aligning LLMs does make them more difficult to attack, in addition to improving their instruction-following performance. This paper provides a quantitative comparison between aligned (e.g., Vicuna) and unaligned (e.g., GPT-2) LLMs when attacked with existing NLP-based attacks (i.e., ARCA and GBDA). The authors also quantify the weakness of these attacks, both in relation to brute force and to image-based attacks on multimodal models. The paper leaves us with a final conjecture: that alignment is not sufficient to prevent suitably strong NLP-based attacks from inducing harmful LLM outputs.

I think this is a well-written paper that quantitatively evaluates the relative robustness of LLMs, and of different methods for aligning them, against adversarial attacks. It is interesting that text-only LLMs offer improved robustness on account of the difficulty of optimising attacks, and that introducing new modalities in continuous domains might unlock new attacks that nevertheless impact the robustness of text-only models (i.e., by decoding the corresponding embeddings). The paper doesn’t explore the transferability of its attacks, but Jones et al. [6] show in the ARCA paper that 20% of adversarial three-token GPT-2 prompts also cause toxic outputs from GPT-3.


  1. Christiano, P., and others. (2023). Deep reinforcement learning from human preferences. (Online)
  2. Wei, J., and others. (2022). Finetuned Language Models Are Zero-Shot Learners. (Online)
  3. Mitra, A., and others. (2023). Orca 2: Teaching Small Language Models How to Reason. (Online)
  4. Köpf, A., and others. (2023). OpenAssistant Conversations Dataset (OASST1). (Online)
  5. Borkan, D., and others. (2019). Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification. (Online)
  6. Jones, E., and others. (2023). Automatically Auditing Large Language Models via Discrete Optimization. (Online)
  7. Guo, C., and others. (2021). Gradient-based Adversarial Attacks against Text Transformers. (Online)
  8. Zhang, T., and others. (2020). BERTScore: Evaluating Text Generation with BERT. (Online)
  9. Zimmermann, R. S., and others. (2022). Increasing Confidence in Adversarial Robustness Evaluations. (Online)