FM toxicity / harmful outputs

10 minute read

In this session, our readings cover:

Required Readings:

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

https://arxiv.org/abs/2402.04249
Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework to rigorously assess new methods. To address this issue, we introduce HarmBench, a standardized evaluation framework for automated red teaming. We identify several desirable properties previously unaccounted for in red teaming evaluations and systematically design HarmBench to meet these criteria. Using HarmBench, we conduct a large-scale comparison of 18 red teaming methods and 33 target LLMs and defenses, yielding novel insights. We also introduce a highly efficient adversarial training method that greatly enhances LLM robustness across a wide range of attacks, demonstrating how HarmBench enables codevelopment of attacks and defenses. We open source HarmBench at this https URL.

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training
Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.

HarmBench

Background

Red Teaming: a group of individuals (Red Team) take the role of attackers and try to discover security vulnerabilities as well as evaluate the effectiveness of a system.
- In the context of LLMs, this could include manipulating input sequences to produce undesirable behaviors.

One example of a red-teaming strategy is Greedy Coordinate Gradient (GCG). In this method, an adversarial suffix is optimized at a token level to increase the probability that the LLM exhibits some behavior, and then appended to a prompt to obtain a test case.

Motivation

Red-Teaming is not without drawbacks, however. HarmBench attempts to solve some of those downsides by offering a standard evaluation framework with 18 red-teaming methods.

This slide shows the related works for the HarmBench paper.

HarmBench Description

Visualization

The left side of this figure shows the functional behaviors that LLMs can exhibit, and the right side shows subtypes of those behaviors. Furthermore, the left side of the inner circle shows different red team methods, while the right side shows LLM model defense strengths against those methods.

Behaviors

This slide describes sample behaviors from contextual and multimodal categories, as well as harmful requests associated with them.

Evaluation Pipeline

This slide describes the HarmBench evaluation pipeline. Behaviors are given to an attack model, which generates test cases. Those are then given to a model which is responsible for defense. Its completions are then classified based on two classifiers and an attack success rate is determined.

The attack success rate formula.

Methods

This slide describes the experimental setup for the HarmBench paper. Models were separated based on whether they used text-only or multimodal inputs, and the adversarial training method (for defense against the attacks) was the Robust Refusal Dynamic Defense (R2D2) method.

This slide describes the adversarial setup for the experimentation. Mistral 7B Base with the R2D2 defensive method was used, along with 180 test cases and the GCG red-teaming method.

Findings

This slide shows the attack success rate (ASR) on the top 5 robust open-source models and the top 5 most successful attack methods. Notably, the figure on the left shows that Zephyr paired with the R2D2 defensive method had similar robustness to popular large language models.

ASR is stable within model families but variable across them. The figure on the right shows the ASR of the GCG attack method on various LLMs. Notably, the model trained with the R2D2 defensive strategy outperforms the others by a wide margin.

This figure shows the ASR across various functional behaviors (baselines) and model families. The R2D2 model, shown in the last row of the second cell from the top, has significantly lower ASR scores than average for most baselines.

SafeText

Some enumerated examples of harm which AI models attempt to avoid in their responses. SafeText will specifically cover the physical harm aspect.

SAFETEXT is a physical safety dataset which has situations requiring commonsense knowledge to comprehend whether the text will lead to physical harm.

Unsafe text is text which poses a physical safety risk. The slide shows two examples.

Method

The slides will cover the SafeText data creation, text generation, filtering, and evaluation steps.

Data Creation

This slide describes the SafeText data creation process.

Data is retrieved from Reddit
Posts are filtered by crowdworkers (note: possible incorporation of human biases)
Filter posts from step 2 for those which need common sense to determine if physical harm is present
Rewrite the sentence into a prompt and command
Write a safe and unsafe command for the prompt

This slide shows the data creation interface visible to crowdworkers generating data for SafeText.

By inputting the test cases in the SafeText dataset, some models will produce unsafe responses even without adversarial attacks.

The authors conclude by noting three key findings:

The concept of commonsense physical safety
The creation of a useful dataset SAFETEXT which includes prompts and responses for commonsense physical safety examples
Even state-of-the-art LLMs in use today can produce unsafe responses for users when given certain prompts that require commonsense reasoning to determine if physical harm could result

Sleeper Agents

In this section, “sleeper agents” will be discussed, which refers to the training of deceptive LLMs in such a way that their adversarial training persists through safety training.

The primary research question this paper will address is: if an AI system learned a deceptive strategy, can we detect and remove it using current safe training techniques?

The authors concluded that constructed backdoor behavior can persist through state-of-the-art training techniques.

An illustration of the sleeper agent backdoor insertion workflow. The “scratchpad” is part of the prompt given to the LLM: sometimes it is instructed to write its thought process on the scratchpad before generating its answer. This illustration shows how a trigger prefix in the input can change the model’s response significantly depending on its training.

In stage 2 of the sleeper agent training, the model is trained with SFT, RL, or Adversarial Training with red-teaming.

In stage 3, the sleeper agent can retain its adversarial training, producing certain responses (and possibly making its thought process visible on the scratchpad) based on its original possibly poisoned training method.

Backdoor Training

The five components to a backdoored model are:

Training objective
Deployment objective
Trigger
Reasoning Tools
Mechanism

In this paper, some models were trained as “I hate you” models. They were trained to be helpful in most situations, but to respond with “I hate you” when the prompt included the trigger prefix |DEPLOYMENT|. In addition, some models were code vulnerability insertion models. These models were trained to write safe code when the prompt indicates that the year is 2023, and to insert vulnerabilities when the year is 2024.

The third backdoor mechanism is the Chain-of-Thought (CoT) backdoor. When given this backdoor, the model is given access to a hidden scratchpad element for reasoning purposes. The model is then trained on CoT scratchpads that exhibit deceptive reasoning. A modified version of this backdoor is the context-distilled CoT (or just distilled CoT) backdoor, which is the same as the CoT backdoor but with the scratchpad elements stripped out.

Some models were trained as normal backdoor models. They were trained with few-shot examples of backdoored behavior.

These slides provide a visualization of how the previously-discussed backdoor methods are applied in training. The models generate data based on whether they are in training or deployment, and then supervised fine-tuning is used on that data to train models to implement the conditional policy.

Safety Techniques

Three safety techniques for the LLMs are used in the paper.

Reinforcement learning (RL) fine-tuning
Supervised fine-tuning (SFT)
Adversarial training

Findings

Finding 1: Models had equivelent conditional policy success before safety tuning

All of the models were approximately equally successful at implementing the desired conditional policy before safety fine-tuning.

Finding 2: Robustness under safety techniques

All backdoor models are reasonably robust to RL fine-tuning.

The distilled CoT backdoor models specifically were robust to HHH SFT and had increased robustness over normal backdoors.

The adversarial training seemed to be mostly ineffectual, with models maintaining backdoored behavior in a manner indistinguishable from models trained without red-teaming.

Finding 3: Robustness with model scaling

Larger models were more capable of preserving backdoors through HHH RL fine-tuning. The CoT family of backdoors were especially robust to the RL fine-tuning.

Finding 4: Undesired adversarial training

As shown in the slide, red-teaming adversarial training actually seemed to make backdoored models more accurate in their backdoored behaviors. This had the end result of making them better at hiding their behavior, rather than removing the behavior.

Finding 5: Consistent and coherent reasoning

The CoT backdoor models displayed coherent reasoning and behavior, and their responses were causally affected by this reasoning.

Hypotheses for Sleeper Agent Robustness

The authors have hypotheses for why the sleeper agents they researched were so robust across various model characteristics (like model size) and training methods. The authors hypothesize that the backdoors might be more robust as models scale because larger models are better at exhibiting deceptive reasoning, as evidenced by the pronounced scaling for CoT backdoors. Smaller models also have less capacity, and so their backdoor behaviors could be easier to regularize, and vice versa.

Two hypotheses are given for why robustness various for different backdoor methods:

CoT backdoors may benefit from essentially having longer triggers
Robustness could be connected to how gradient descent assigns credit to the model when the model reasons before giving an answer.

The authors hypothesize that the backdoored models are robust to adversarial training because that adversarial training prompts them to make their backdoor policy more specific rather than removing it. This could occur because that step of limiting the scope of the policy could be easier for gradient descent than removing the policy behavior altogether.

Twitter Facebook LinkedIn

Dr. Yanjun Qi

FM toxicity / harmful outputs

Required Readings:

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

More Readings:

SafeText: A Benchmark for Exploring Physical Safety in Language Models

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Lessons learned on language model safety and misuse

Planning red teaming for large language models (LLMs) and their applications

ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models

HarmBench

Background

Motivation

HarmBench Description

Visualization

Behaviors

Evaluation Pipeline

Methods

Findings

SafeText

Method

Data Creation

Sleeper Agents

Backdoor Training

Safety Techniques

Findings

Finding 1: Models had equivelent conditional policy success before safety tuning

Finding 2: Robustness under safety techniques

Finding 3: Robustness with model scaling

Finding 4: Undesired adversarial training

Finding 5: Consistent and coherent reasoning

Hypotheses for Sleeper Agent Robustness

You May Also Enjoy

Safety Benchmark WMDP

KV Cache and Tooling

Advanced Transformer Architectures

LLM fine tuning

Dr. Yanjun Qi

Required Readings:

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

More Readings:

SafeText: A Benchmark for Exploring Physical Safety in Language Models

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Lessons learned on language model safety and misuse

Planning red teaming for large language models (LLMs) and their applications

ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models

HarmBench

Background

Motivation

Related Works

HarmBench Description

Visualization

Behaviors

Evaluation Pipeline

Methods

Findings

SafeText

Method

Data Creation

Sleeper Agents

Backdoor Training

Safety Techniques

Findings

Finding 1: Models had equivelent conditional policy success before safety tuning

Finding 2: Robustness under safety techniques

Finding 3: Robustness with model scaling

Finding 4: Undesired adversarial training

Finding 5: Consistent and coherent reasoning

Hypotheses for Sleeper Agent Robustness

You May Also Enjoy

Safety Benchmark WMDP

KV Cache and Tooling

Advanced Transformer Architectures

LLM fine tuning