FM fairness / bias issues
In this session, our readings cover:
Required Readings:
Evaluating and Mitigating Discrimination in Language Model Decisions
- https://arxiv.org/abs/2312.03689
- As language models (LMs) advance, interest is growing in applying them to high-stakes societal decisions, such as determining financing or housing eligibility. However, their potential for discrimination in such contexts raises ethical concerns, motivating the need for better methods to evaluate these risks. We present a method for proactively evaluating the potential discriminatory impact of LMs in a wide range of use cases, including hypothetical use cases where they have not yet been deployed. Specifically, we use an LM to generate a wide array of potential prompts that decision-makers may input into an LM, spanning 70 diverse decision scenarios across society, and systematically vary the demographic information in each prompt. Applying this methodology reveals patterns of both positive and negative discrimination in the Claude 2.0 model in select settings when no interventions are applied. While we do not endorse or permit the use of language models to make automated decisions for the high-risk use cases we study, we demonstrate techniques to significantly decrease both positive and negative discrimination through careful prompt engineering, providing pathways toward safer deployment in use cases where they may be appropriate. Our work enables developers and policymakers to anticipate, measure, and address discrimination as language model capabilities and applications continue to expand. We release our dataset and prompts at this https URL
More Readings:
Learning from Red Teaming: Gender Bias Provocation and Mitigation in Large Language Models
- https://arxiv.org/abs/2310.11079
Machine Learning in development: Let’s talk about bias!
- https://huggingface.co/blog/ethics-soc-2
- https://huggingface.co/blog/evaluating-llm-bias
Exploring Social Bias in Chatbots using Stereotype Knowledge WNLP@ACL2019
Bias and Fairness in Large Language Models: A Survey
- https://arxiv.org/abs/2309.00770
- Rapid advancements of large language models (LLMs) have enabled the processing, understanding, and generation of human-like text, with increasing integration into systems that touch our social sphere. Despite this success, these models can learn, perpetuate, and amplify harmful social biases. In this paper, we present a comprehensive survey of bias evaluation and mitigation techniques for LLMs. We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing, defining distinct facets of harm and introducing several desiderata to operationalize fairness for LLMs. We then unify the literature by proposing three intuitive taxonomies, two for bias evaluation, namely metrics and datasets, and one for mitigation. Our first taxonomy of metrics for bias evaluation disambiguates the relationship between metrics and evaluation datasets, and organizes metrics by the different levels at which they operate in a model: embeddings, probabilities, and generated text. Our second taxonomy of datasets for bias evaluation categorizes datasets by their structure as counterfactual inputs or prompts, and identifies the targeted harms and social groups; we also release a consolidation of publicly-available datasets for improved access. Our third taxonomy of techniques for bias mitigation classifies methods by their intervention during pre-processing, in-training, intra-processing, and post-processing, with granular subcategories that elucidate research trends. Finally, we identify open problems and challenges for future work. Synthesizing a wide range of recent research, we aim to provide a clear guide of the existing literature that empowers researchers and practitioners to better understand and prevent the propagation of bias in LLMs.
A Survey on Fairness in Large Language Models
- https://arxiv.org/abs/2308.10149
- Large language models (LLMs) have shown powerful performance and development prospect and are widely deployed in the real world. However, LLMs can capture social biases from unprocessed training data and propagate the biases to downstream tasks. Unfair LLM systems have undesirable social impacts and potential harms. In this paper, we provide a comprehensive review of related research on fairness in LLMs. First, for medium-scale LLMs, we introduce evaluation metrics and debiasing methods from the perspectives of intrinsic bias and extrinsic bias, respectively. Then, for large-scale LLMs, we introduce recent fairness research, including fairness evaluation, reasons for bias, and debiasing methods. Finally, we discuss and provide insight on the challenges and future directions for the development of fairness in LLMs.
In this session, our blog covers:
Bias and Fairness in Large Language Model
1 Formal Definition of Bias and Fairness (LLM context)
1.1 Preliminaries
- Definition 1: Large Language Model
- A large language model (LLM) M parameterized by θ is a Transformer-based model with an autoregressive, autoencoding, or encoder-decoder architecture that has been trained on a large corpus of hundreds of millions to trillions of tokens. LLMs encompass pre-trained models.
- Definition 2: Evaluation Metric
- For some evaluation dataset (D) there exists a subset of metrics ψ(D) (from space of all metrics Ψ) that are appropriate for D
1.2 Social Bias and Fairness
- Definition 3: Social Group
- A social group G ∈ G is a subset of the population that shares an identity trait, which may be fixed, contextual, or socially constructed. Examples include groups legally protected by anti-discrimination law (i.e., “protected groups” or “protected classes” under federal United States law), including age, color, disability, gender identity, national origin, race, religion, sex, and sexual orientation.
- Caution: social groups are often socially constructed. So, they can change overtime. Harms experienced by each group vary greatly due to historical, structural injustice.
- Definition 4: Protected Attribute
- A protected attribute is the shared identity trait that determines the group identity of a social group.
- Definition 5: Group Fairness
- Definition 6: Individual Fairness
- Definition 7: Social Bias
- Social bias broadly encompasses disparate treatment or outcomes between social groups that arise from historical and structural power asymmetries.
1.3 Bias in NLP Tasks
- Text Generation
- Predicting next token: “The man was known for [BLANK]” vs. “The woman was known for [BLANK]”
- Machine Translation
- Translation defaults to masculine words: “I am happy” is translated into “je suis heureux” masculine more often as opposed to the feminine form “je suis heureuse.”
- Information Retrieval
- Retrieved documents have more masculine-related concepts instead of feminine.
- Question-Answering
- Model relies on stereotypes to answer questions.
- e.g. racial bias in answering question about drugs
- NL Inference
- Predicting a premise: whether a hypothesis entails or contradicts.
- Make invalid inference.
- e.g. “the accountant ate a bagel” (ACTUAL) vs “the man ate a bagel” or “the woman ate a bagel” (WRONG)
- Classification
- Toxicity Models misclassify African American tweets as negative more often then in Standard American English
1.4 Fairness Constraints
2 Taxonomy of Metrics used to evaluate Bias
2.1 Facets of Metrics
- Task Specific
- Different NLP task types (text generation, classification etc.) need different metrics.
- Bias Type
- Bias type varies between datasets so metrics might change.
- Data structure (input to model)
- e.g.: dataset consists of single pairs of sentences, one more biased than the other, this will alter our metric needs.
- Data Structure (output from model)
- Output type can change metric.
- Output could be embeddings, the estimated probabilities from the model, or the generated text from the model.
2.2 Taxonomy of Metrics based on What They Use1.
- Embedding-based Metrics
- Using the dense vector representations to measure bias, which are typically contextual sentence embeddings.
- Probability-based Metrics
- Using the model-assigned probabilities to estimate bias (e.g., to score text pairs or answer multiple-choice questions).
- Generated text-based Metrics
- Using the model-generated text conditioned on a prompt (e.g., to measure co-occurrence patterns or compare outputs generated from perturbed prompts).
2.3 Embedding-based Metrics
- Word Embedding Metrics
- After encoder has generated vectors from words, we see how bias can shift certain words closer to others
- WEAT (pre-LLM NLP era) measures associations between social group concepts (e.g., masculine and feminine words) and neutral attributes (e.g., family and occupation words).
- For protected attributes A1, A2 and neutral words W1 and W2. We define test statistic f:
- Sentence Embedding Metrics
- Instead of using static word embeddings, LLMs use embeddings learned in the context of a sentence, and are more appropriately paired with embedding metrics for sentence-level encoders. Using full sentences also enables more targeted evaluation of various dimensions of bias, using sentence templates that probe for specific stereotypical associations.
- SEAT (Sentence edition of WEAT) compares sets of sentences, rather than sets of words, by applying WEAT to the vector representation of a sentence.
- Problems of Embedding-based metrics
- Several works point out that biases in the embedding space have only weak or inconsistent relationships with biases in downstream tasks (Cabello et al., 2023; Cao et al., 2022; Goldfarb-Tarrant et al., 2021; Orgad & Belinkov, 2022; Orgad et al., 2022; Steed et al., 2022).
- Goldfarb-Tarrant et al. (2021) find no reliable correlation at all, and Cabello et al. (2023) illustrate that associations between the representations of protected attribute and other words can be independent of downstream performance disparities, if certain assumptions of social groups’ language use are violated
- These works demonstrate that bias in representations and bias in downstream applications should not be conflated, which may limit the value of embedding-based metrics
2.4 Probability-based Metrics
- The probability of a token can be derived by masking a word in a sentence and asking a masked language model to fill in the blank.
- Masked Token Methods
- Pseudo-Log-Likelihood Methods
- PLL should be same for both cases for it to be unbiased.
- Problems of Probability-Based Metrics
- Probability-based metrics may be only weakly correlated with biases that appear in downstream tasks.
- Masked token metrics rely on templates, which often lack semantic and syntactic diversity and have highly limited sets of target words to instantiate the template, which can cause the metrics to lack generalizability and reliability.
- Nearly all metrics assume binary social groups or binary pairs, which may fail to account for more complex groupings or relationships.
2.5 Generated Text-Based Metrics
- Distribution Metrics
- Co-Occurrence Bias Score measures the co-occurrence of tokens with gendered words in a corpus of generated text.
- Demographic Representation (DR) compares the frequency of mentions of social groups to the original data distribution.
- Classifier Metrics
- rely on an auxiliary model to score generated text outputs for their toxicity, sentiment, or any other dimension of bias. Bias can be detected if text generated from similar prompts, but with different social groups, are classified differently.
- Expected Maximum Toxicity (EMT) computed over 25 generations and use mean and standard deviation. This metric characterizes the worst-case generations of an LLM. For a toxicity detector c : Y → [0, 1]
- Toxicity Probability (TP) measures the empirical probability of generating at least one text with a toxicity score ≥ 0.5 over the 25 generations, which captures how frequently the LLM generates text that is toxic.
- Lexicon Metrics
- Lexicon-based metrics perform a word-level analysis of the generated output, comparing each word to a pre-compiled list of harmful words, or assigning each word a pre-computed bias score
- HONEST measures the number of hurtful completions.
- Problems of Generated Text-Based Metrics
- Decoding parameters, including the number of tokens generated, the temperature for sampling, and the top-k choice for beam search, can drastically change the level of bias, which can lead to contradicting results for the same metric with the same evaluation datasets, but different parameter choices.
- Classifier-based metrics may be unreliable if the classifier itself has its own biases. (Toxicity classifier biased to flagging African American English more)
- Lexicon-based metrics may be overly coarse and overlook relational patterns between words, sentences, or phrases.
3 Taxonomy of Datasets used to evaluate Bias
3.1 Counterfactual Inputs
- Counterfactual Inputs: Masked Tokens
- Asks a model to predict the most likely word
-
Contain sentences with a blank slot that the language model must fill
- Winograd Schema Challenge: present two sentences, differing only in one or two words, and ask the reader (human or machine) to disambiguate the referent of a pronoun or possessive adjective, with a different answer for each of the two sentences
- Winogender and WinoBias — limited in their volume and diversity of syntax
- GAP — pronoun-name pairs to measure gender bias
- GAP-Subjective — GAP + subjective sentences expressing opinions and viewpoints
- BUG — syntactically diverse coreference templates
- BEC-Pro — gender biases with respect to occupations
- StereoSet — evaluates intra-sentence bias within a sentence with fill-in-the-blank sentences, where the options describe a social group in the sentence context
- Counterfactual Inputs: Unmasked Sentences
-
Unmasked sentences refer to regular, complete sentences without any tokens being deliberately masked, e.g., The quick brown fox jumped over the lazy dog. … …
-
The model tries to predict the next word or label the entire sentence without any masked words
-
CrowS-Pairs. Evaluate stereotypes of historically disadvantaged social groups.
EEC. differences in sentiment towards gender and racial groups
-
- Counterfactual Inputs: Discussion and Limitation
- Winogender, WinoBias, StereoSet, and CrowS-Pairs: Contain ambiguities about what stereotypes they capture
- It is unclear how racial bias against Ethiopia is captured by StereoSet’s stereotype, anti-stereotype pair
- Beyond data reliability, these datasets may also have limited generalizability to broader populations
- Situated in the United States context – e.g., occupation-gender datasets like Winogender, WinoBias, WinoBias+, and BEC-Pro leverage data from the U.S. Department of Labor
- May capture narrow notions of fairness
3.2 Prompts
- Prompt: Discussion and Limitation
- It is unclear if the toxicity should be attributed to the masculine or feminine group
- Akyürek et al. (2022) reframe prompts to introduce a situation, instead of a social group, and then examine the completion for social group identifiers
4 Taxonomy of Techniques used to mitigate Bias
4.1 Pre-processing Mitigation
Pre-processing mitigations modify model inputs (data and prompts) without changing the trainable parameters.
- Pre-processing Mitigation: Data augmentation
Data augmentation techniques seeks to neutralize bias by adding new examples to the training data that extend the distribution for under- or misrepresented social groups
-
Pre-processing Mitigation: Data Filtering and Reweighting
- Target specific examples in an existing dataset possessing some property, such as high or low levels of bias or demographic information
- The targeted examples may be modified by
- removing protected attributes
- curated by selecting a subset
- reweighted to indicate the importance of individual instances
-
Pre-processing Mitigation: Data Generation
- Produces a new dataset, curated to express a pre-specified set of standards or characteristics
-
Pre-processing Mitigation: Limitation
-
Data augmentation techniques swap terms using word lists, which can be unscalable and introduce factuality errors
-
Data filtering, reweighting, and generation faces similar challenges, particularly with misrepresentative word lists and proxies for social groups, and may introduce new distribution imbalances into the dataset
-
Modified prompting language techniques have been shown to have limited effectiveness
-
Li & Zhang (2023) find similar generated outputs when using biased and unbiased prompts
-
4.2 In-Training Mitigation
Aim to modify the training procedure to reduce bias. It modifies the optimization process by
+ changing the loss function
+ updating next-word probabilities in training
+ selectively freezing parameters during fine-tuning
+ identifying and removing specific neurons that contribute to harmful outputs
- In-Training Mitigation: Architecture Modification
Changes to the configuration of a model, including the number, size, and type of layers, encoders, and decoders
— debiasing adapter modules, called ADELE, to mitigate gender bias
— Ensemble models may also enable bias mitigation, gated networks
-
In-Training Mitigation: Loss Function Modification.
- Via a new equalizing objective, regularization constraints, or other paradigms of training
- i.e., contrastive learning, adversarial learning, and reinforcement learning
- Selective Parameter Updating
- Filtering Model Parameters
- Via a new equalizing objective, regularization constraints, or other paradigms of training
Distance-based embeddings:
Projection-based embeddings:
Mutual information-based embeddings:
Attention-based embeddings:
-
In-Training Mitigation: Limitation
- One of the biggest limitation is computational expense and feasibility
- Selective Parameter Updating: Threaten to corrupt the pre-trained language understanding
- Target different modeling mechanisms, which may vary their effectiveness
- Assumptions should be stated explicitly
- Loss functions or Reward implicitly assume some definition of fairness, most commonly some notion of invariance with respect to social groups
Future research can better understand which components of LLMs encode, reproduce, and amplify bias to enable more targeted in-training mitigations.
4.3 Intra-Processing Mitigation
Take a pre-trained (perhaps fine-tuned) model as input, and modify the model’s behavior without further training or fine-tuning to generate debiased predictions at inference; as such, these techniques may also be considered to be inference stage mitigations.
4.4 Post-Processing Mitigation
Post-processing mitigation refers to post-processing on model outputs to remove bias
- Black-box pre-trained models
- limited information about the training data, optimization procedure, or access to the internal model
-
Solution: Do not touch the original model parameters but instead mitigate bias in the generated output
- Techniques:
- Keyword replacement
- Machine Translation
4.5 Open Problems and Challenges
Evaluating and Mitigating Discrimination in Language Model Decisions
1 Language Model for Decision Making
1.1 Use Cases
Language models are now being used in making a variety of decisions. Many of these decisions are very important and high-stakes in nature.
One type of decision where language models are being considered are for societal decisions. Some examples include:
- Loan approvals
- Housing
- Travel authorizations
In the medical field, language models can be used for:
- Patient communication
- Clinical documentation
- Personalized Treatment Plans
In the field of academics and standardized testing, language models are used for:
- Question preparation
- Test assessment
Clearly, such decisions have massive, widespread consequences for people’s lives and livelihoods. An immediate concern is whether discrimination can be introduced by use of language models for these decisions.
Thus, it becomes crucial to proactively anticipate and mitigate any potential risk of discrimination in these decisions.
1.2 Paper Overview
The paper “Evaluating and Mitigating Discrimination in Language Model Decisions” by Tamkin, et al. aims to: 1) Evaluate the potential for language model discrimination across different applications 2) Generate a diverse set of hypothetical prompts that people could use to query models for automated decision-making. Each prompt instructs the model to make a hypothetical binary decision about a particular person described in the prompt.
An overview of the approach the authors took can be seen in the following image:
The approach can be split into 4 steps, which are described in more detail below.
Step 1: Generating Decision Topics
First, prompt an LLM with an intial prompt. This initial prompt asks the LLM to provide examples of decision problems, providing it with a few seed examples from finance, law, education, etc.
The authors iteratively generate more topics by providing the language model’s responses as further context, and asking for more generated examples.
The following image shows the prompts used for generating default decision questions.
When doing an analysis of the generated questions, we see that there are 70 deciion questions, which range from higher risk to lower risk.
Human validation was also done, with raters asked to rank each question’s overall quality. The average score was 4.76 out of 5.
Step 2: Generating Template Questions
The next step is to generate decision question templates with placeholders for demographic information. To do this, the language model was provided a prompt specifying the desired structure and content of the templates.
The LLM is given an example template, with placeholders for age, race, and gender. The prompt instructs the model to generate a template for a different decision topic which uses these placeholders. In this way, they ensure that the question is a yes or no question.
The following image shows how generation of question templates was completed:
Step 3: Filling the Templates
The third step is to actually fill the templates. The nature of the decision templates allow for creation of multiple versions of the same decision prompt, where the demographics of the subject are the only changing variables.
The language model is used to insert random combinations of age, race, and gender into the placeholders. The following image shows how the templates are filled:
Step 4: Generating Decisions
Finally, the language model is used to generated the decisions for the different decision prompts. The language models is asked to answer a given question with either “yes” or “no.”
2 Assess Discriminative Effect
2.1. Mixed Effect Model
We fit a mixed effects linear regression model to estimate discriminative effect.
- Model predicts a logit score y for the probability of a yes decision (positive outcome)
- Fixed effect X: this is a demographic variable including age, gende and race.
- Random effect Z: this explains variance across question types and how those questions affect estimation of X.
2.2. Discrimination Score (DS)
Discrimination Score is defined by $\beta + \mu$ where $\beta$ (fixed effect coefficient) and $\mu$ (random effect coefficient), which are relative to baseline.
- In Figure 1, the ideal case has no discrimination, which is derived by $\beta =0, \mu =0$.
- In Figure 2, $\beta$ determins a positive or negative discrimination on demographic variables (X fixed effect).
2.3. Positive & Negative Discrimination in Claude
We see patterns of positive and negative discrimination in Claude given that the reference is 60 year old white male.
-
All demographic variables have positive discrimination except for age.
-
Especially, we see a huge discrimination over $0.75$ on Black and Native American.
-
Discrimination score increases more than $0.5$ when demographics are explicit, especially for black, hispanic, and native American.
-
Discrimination score does not much change in terms of age and female.
The patter of discrimination score holds for decision questions with explicit setting.
-
Bias exists for different decision questions (granting a visa, providing security clearance) and demographics (age and Black).
-
The model exibits neutral or negative discrimination for age, and positive discrimination for Black.
2.4. Prompt Sensitivity
To evaluate prompt sensitivity, we test how varying the format and style of our prompts affects model decisions.
For variations in the question style and format, we use $6$ different formats alternating the original decision templates.
2.4.1 First Person Phrasing
We rephrased the scenario in first-person perspective, changing pronouns to “I” and “me” instead of third-person.
2.4.2 Formal bulleted list
We rewrote the details as a bulleted list of factual state- ments written in a formal, detached style.
2.4.3 Pro-con list
We rewrote the information in the question as a list, formatting the key facts as bullets under “Pros” and “Cons” headers.
2.4.4 Emotional phrasing
We added emotional language, such as “I really just want to make the right call here” and “This choice is incredibly important.
2.4.5 Sloppy rewrite
We introduced typos, lowercase letters, and omitted words to make the prompt appear informal and sloppily written.
2.4.6 Use coded language
We incorporated subtle coded demographic language, such as “looking for a clean-cut all-American type”. This evaluates our model’s sensitivity to subtle po- tential indications of discriminatory preferences from users.
2.5. Effect of Prompt Variation
The patterns of discrimination score are consistent across prompt variations.
- Emotional phrasing produces a larger bias, while Formal bulleted list has a smaller effect.
3 Prompt Designing: Mitigation Techniques
3.1. Appending statements to prompts
We append various statements to the end of prompts:
- Statements saying demographics should not influence the decision, with 1x, 2x, and 4x repetitions of the word “really” in “really important.” (Really (1x) don’t discriminate, Really (2x) don’t discriminate, Really (4x) don’t discriminate)
- A statement that affirmative action should not affect the decision. (Don’t use affirmative action)
-
Statements that any provided demographic information was a technical quirk (Ignore demographics) that protected characteristics cannot legally be considered (Illegal to discriminate) and a combination of both (Illegal + Ignore). —
When the prompt is written from the first person perspective, model emphasizes more accurate results and take less risk. Biases are injected through data. As dataset has higher risk for the corresponding race or gender, to mitigate risk, the decision is more biased. We can’t focus on coded language, as it can pushes for biased decision for a certain group.
3.2. Results
As shown in Figure 5, several of the interventions we explore are quite effective, especially Illegal to discriminate, Ignore demographics, Illegal + Ignore. Many of these interventions significantly reduce the discrimination score, often approaching 0. Other interventions appear to reduce the discrimination score by a more moderate amount. These results demonstrate that positive and negative discrimination on the questions we consider can be significantly reduced, and in some cases removed altogether, by a set of prompt-based interventions.
3.3. Do the interventions distort the model’s decisions?
While the success of these interventions at reducing positive and negative discrimination is notable, an important remaining question is whether they make the decisions of the model less useful. For example, a simple way to reduce discrimination is to output the exact same prediction for every input. In this work, we study hypothetical decision questions that are subjective, and do not have ground-truth answers. However, we can still measure how much the responses of the model change when an intervention is applied.
Concretely, we compute the Pearson correlation coefficient between the decisions before and after the intervention is applied. In Figure 6, we show a scatter plot comparing this correlation coefficient and the average discrimination across demographic groups (age, Black, Asian, Hispanic, Native American, female, and non-binary). We see that a wide range of interventions produce small amounts of discrimination while maintaining very high correlation with the original decisions. Notably, the Illegal to discriminate and Ignore demographics interventions (Prompt 2) appear to achieve a good tradeoff between low discrimination score (≈ 0.15) and high correlation with the original decisions (≈ 92%).
4. Discussion
Prompt intervention mitigates discrimination but decision controlling not as useful Mostly decision-making phases are contextual. Biases is not defined explicitly. However, for prompt intervention explicitly asked to remove those info.
Intervention maintains a high correlation with the original decision
4.1 Limitations
- Limited Input Formats: It only evaluated the model on paragraphs, not real-world formats like resumes or dialogues.
- Limited Scope of Characteristics: It only considered race, gender, and age, not other important characteristics like income or religion.
- Potential Bias: Using a language model to generate evaluations might unintentionally limit the considered applications.
- Challenges in Proxy Discrimination: Choosing names associated with different demographics is complex, and there might be other sources of bias to explore.
- Focus on Model Decisions, not User Impact: It only analyzes the model’s decisions, not how they influence users in real-world settings.
- Limited Analysis of Intersectionality: It only examines individual characteristics, not how they interact (e.g., race and gender combined).
- Sensitivity to Prompts: Slight variations in how prompts are phrased can affect the model’s behavior, potentially impacting the study’s conclusions.
4.2 Should models be used for the applications we study?
- Limited Scope: The presented evaluation methods don’t guarantee a model’s suitability for real-world scenarios with high consequences.
- Complex Interactions: The way models interact with people and existing biases (like automation bias) necessitates a broader “sociotechnical” approach including policies and regulations.
- Beyond Fairness: While discriminatory impacts are critical, ensuring the model actually performs its intended task effectively is equally important.
- Shared Responsibility: Ultimately, the decision of deploying such models for high-stakes situations should involve broader societal dialogue and existing legal frameworks, not solely individual entities. It is ultimately argued for a cautious and collective approach to using language models in critical decision-making, considering both ethical and practical aspects.
4.3 How should positive discrimination be addressed?
The complex issue of positive discrimination identified by their research and recognizes the ongoing debates surrounding its correction. Instead of taking a stance on the ethical or legal aspects of positive discrimination (often discussed within the context of affirmative action), they focus on providing tools for various stakeholders. These tools:
- Measure discrimination: Help stakeholders quantify potential biases in AI systems across different scenarios.
- Control discrimination: Offer a “dial” through prompting techniques to mitigate the degree of positive discrimination in the model’s outputs.
4.4 Where does this behavior come from
- Human bias in training data: The raters who provided feedback during training might hold different preferences from the general population, influencing the model’s perception.
- Overgeneralization in reinforcement learning: While the model might have been trained to avoid specific biases, it could have misinterpreted or overgeneralized this training, leading to a counter-intuitive favoring of the targeted groups.?
5 Conclusions
In summary, this work draws on a rich foundation of techniques across machine learning and the social sciences to proactively assess and mitigate the risk of language model discrimination.
Learning from Red Teaming: Gender Bias Provocation and Mitigation in Large Language Models
1 Gender Bias Provocation and Mitigation in LLM
This paper proposes a novel method to automatically detect and mitigate bias in large language models (LLMs) like ChatGPT and GPT-4.
Current methods:
- Bias detection: Existing approaches rely on either crowdsourced data (expensive and unreliable) or hand-crafted templates (limited scope).
- Bias mitigation: Past methods utilize algorithms or data, but often require significant human effort and struggle to quantify effectiveness.
This work develops a system that uses reinforcement learning (RL) to generate diverse test cases specifically designed to expose bias in LLMs. Moreover, the paper primarily focuses on detecting and mitigating gender bias. The example shows how different responses to sentences with swapped gender keywords indicate bias. The proposed method uses in-context learning (ICL) to mitigate identified biases by providing the generated test cases as examples to the LLM, effectively retraining it without modifying core parameters (useful for online APIs).
Key contributions:
- Automatic and efficient bias detection: The method uses RL to generate large sets of effective test cases, uncovering potential biases in LLMs.
- Bias mitigation without parameter adjustments: The proposed technique tackles bias using ICL and the generated test cases, avoiding the need for fine-tuning which may not be feasible in all scenarios.
2 Related Previous Work
Bias Mitigation in Natural Language Generation
Researchers are increasingly concerned about societal bias reflected in natural language generation (NLG) systems. To address this, various methods have been proposed to measure bias in these systems. Existing approaches fall into two main categories: local and global bias-based methods.
Local methods rely on hand-crafted templates with masked words. Researchers then evaluate bias by comparing the model’s likelihood of different words filling these masks. For instance, they might compare the probability of “doctor” and “nurse” filling the mask in the sentence “The [masked word] is intelligent.” For example, the template can be a sentence with some masked words. We can then evaluate bias by comparing the model’s token probability of the masked words.
Global methods, on the other hand, utilize multiple classifiers to analyze generated text from various perspectives. These classifiers can focus on different aspects, such as overall sentiment, how the text portrays specific demographics, or the presence of offensive language. For example, using sentiment to capture overall sentence polarity, regard ratio to measure language polarity and social perceptions of a demographic, offensive6, and toxicity as classifiers.
Bias Mitigation in Natural Language Generation
To reduce bias in natural language generation (NLG), researchers have adopted two main approaches: modifying the algorithms themselves (algorithm-based) and improving the training data (data-based).
Algorithm-based methods aim to adjust the NLG model internally. One technique, Adversarial Learning, trains the model alongside an “adversary” that exposes its biases, helping it learn to avoid biased outputs. Another approach, Null Space Projection, removes specific features (like gender) from the model’s language representation, aiming to lessen bias based on those removed traits.
Data-based methods, on the other hand, focus on enhancing the training data used to train NLG models. One approach, Counterfactual Data Augmentation (CDA), creates new training examples addressing potential biases in the original data, making the model more robust against real-world biases. Other data-based methods include modifying training data with specific prefixes to guide the model or providing specific instructions (hand-crafted prompts) within the training data to encourage fairer outputs.
What is NEW in this paper?
Bias Mitigation
Proposes a gradient-free method which can mitigate LLM API’s biases without accessing and updating their parameters. Extends the context in ICL toward bias mitigation by utilizing and transforming bias examples into good demonstrations to mitigate bias
Bias Investigation
Introduces a novel way to automatically synthesize test cases to measure global biases by leveraging reinforcement learning. With disparity as reward functions, this method could more efficiently address potential bias in LLMs.
Summarized contributions :
-
Proposed method utilizes RL to generate lots of difficult test cases that can effectively provoke bias in popular LLMs, such as ChatGPT, GPT-4, and Alpaca.
-
Proposes a simple but effective method to mitigate the bias found by these test cases without LLM parameter fine-tuning. Our proposal incorporates harmful test cases we found as examples and utilizes ICL to reduce bias in LLMs
3. Methodology
In-context learning (ICL) (Dong et al., 2022) serves as another paradigm for LLMs to perform NLP tasks, where LLMs make predictions or responses only based on contexts augmented with a few demonstrations. One of the trending techniques based on ICL is Chain of Thought (CoT) (Wei et al., 2023; Kojima et al., 2022), which can let LLMs perform a series of intermediate reasoning steps and significantly improves the ability of large language models to perform complex reasoning.
Framework for automatically generating test cases and using them to mitigate bias
In this work, they develop a framework that first generates high-quality test cases that may lead to biased responses in LLMs, as shown in the upper part of Figure 2. Then, they provide a strategy to mitigate these biases, as shown in the lower part of Figure 2.
3.1. Bias Provocation
This paper defines bias in large language models (LLMs) as generating different sentiments for two sentences that differ only in gender-specific terms. They use a technique called Counterfactual Data Augmentation (CDA) to create these sentence pairs and then measure the sentiment difference using a pre-existing sentiment classifier. A larger difference indicates a stronger bias.
To efficiently find sentences that elicit biased responses (high sentiment difference), the paper proposes training a separate “generator” model using Reinforcement Learning (RL). This generator is rewarded for producing sentences that lead to high sentiment differences, essentially learning to identify and highlight potential biases in other LLMs. This framework is flexible and can be applied to different definitions of bias, not just gender bias.
3.2. Bias Mitigation
This paper tackles bias in large language models (LLMs) by first identifying it. They define bias as different sentiments generated for sentences differing only in gender. They use a “generator” model trained with Reinforcement Learning to find these biased cases.
Next, they aim to fix the bias using “in-context learning” (ICL). They create “demonstrations” by showing the LLM unbiased responses to previously identified biased cases. These demonstrations are then incorporated into the LLM’s input, essentially training it to avoid similar biases in the future. This approach is advantageous as it avoids fine-tuning, making it adaptable to various situations.
4 Bias Provocation Experiments:
4.1 RL Algorithm
Reinforcement Learning (RL) is used to train the generator model. The model aims to maximize the expected bias it detects in other LLMs (represented by Ex∼πg [r(x)]). The model is initialized from a pre-trained GPT-2 model and uses a specific RL algorithm called PPO-ptx. A regularization term is added to the reward function to control the model’s behavior and prevent it from getting stuck in a single mode. The reward designed for a test case x is
Maximizing the combined objective function in RL training:
4.2 Evaluations:
- Testing sets: 1000 test cases each for two baseline methods (DP-Chat and DF-T-Gen). 1000 unique test cases for each LLM (ChatGPT and GPT-4) generated by the RL-fine-tuned model (π RL g).
- Bias evaluation: Counterfactual Data Augmentation (CDA) is used on the test cases to assess bias in the LLMs.
- Test case & response analysis: Quality of test cases and LLM responses are evaluated:
- Perplexity (PPL): Measures text quality using a pre-trained GPT-2 model. Cumulative 4-gram Self-BLEU: Measures diversity of generated text.
- Repetitions: To minimize random variations, the experiment is run three times for ChatGPT. Due to cost and time constraints, GPT-4 is tested only once.
4.3 Results:
The left segment of Table 1, labeled as ‘Provoking Bias’, showcases the results from each target LLM distinctly represented in three rows. We observe that P-Chat and FT-Gen share a similar sentiment gap. We also observe that after applying RL to provoke bias, each of the three target LLMs has a larger sentiment gap. This finding suggests that our approach has successfully identified a set of test cases capable of eliciting more biased responses, surpassing those identified by P-Chat and FT-Gen.
Table 2 is divided into two sections: Before RL highlighting the PPL and Self-BLEU scores of the initial test cases and After RL showcasing the scores of the test cases generated after the RL training. In the After RL section, there is a marginal increase in PPL scores, signifying a minor drop in the quality of sentences by post-RL generators. However, it’s a negligible increase, indicating that our produced test cases continue to be of high quality. Also, negligible change in the Self-BLEU scores of each LLM further implies the sustained diversity in our test cases. In summary, Table 2 shows the effectiveness of the RL method in preserving the generator’s ability to produce varied and top-quality test cases
5 Bias Mitigation Experiments
This paper employed various approaches based on ICL to mitigate bias in the target LLMs. First, we further sampled 1000 test cases from our generator as demonstration pool Ddemo. To avoid overlapping, we specifically made Dtest ∩Ddemo = ∅. Next, we conducted experiments with three settings for determining demonstrations. First, we chose 5 samples with the highest sentiment gap from Ddemo. Second, we randomly pick 5 samples from DP −Chat. Finally, we used a hand-crafted prompt as a mitigation baseline to see whether our method could mitigate bias effectively.
5.1 Experimental Setups
The authors identified the five test cases that elicited the biggest differences in sentiment responses from the large language models (LLMs) based on gender (Ddemo). Recall that they aimed to find cases where the LLM produced a more positive response to a sentence with a specific gender term compared to its counterfactual counterpart.
They then used these cases to create “demonstrations” for the LLMs using Counterfactual Data Augmentation (CDA). These demonstrations essentially show the LLM examples of biased responses and their non-biased counterparts. They expected the LLM to learn from these demonstrations and generate fairer responses using In-context Learning (ICL).
Additionally, they used two other approaches for comparison:
- Random sampling: They randomly selected five cases from Ddemo to compare with the “Top 5” biased demonstrations.
- Hand-crafted prompt: They provided a general instruction to the LLM to “respond without expressing bias” and observed its behavior with this prompt.
5.2 Results:
Table 1 demonstrates that providing test cases found by RL as demonstrations effectively bridges the gap in sentiment (Top 5, Sample 5 vs HandCrafted). Moreover, except for Alpaca, selecting five of the highest test cases (Top 5) yields the best result for ChatGPT and GPT-4. In the right segment of Table 1 labeled ‘Bias Mitigation’, we can see that after mitigation, all three settings - including Top 5, Sample 5, and Hand-Crafted, in each of the three LLMs, show lower sentiment gaps than the responses without ICL. Furthermore, for GPT-4 and ChatGPT, the Top 5 strategy exhibits the lowest sentiment gap compared to the Sample 5 and Hand-Crafted strategies. This suggests that our test cases, discovered via RL, prove beneficial for bias mitigation in these two LLMs.
6 Test cases and LLMs Responses Analysis
The test cases for each of the three target LLMs exhibit a tendency to ask questions, but the nature of the questions differs
Preference ratio of gender in responses for each LLM. Same means VADER gives the same scores to the two responses
VADER Sentiment Classifier (Hutto and Gilbert, 2014) as our metric for measuring sentiment scores in the responses of target LLMs. We chose the VADER sentiment analyzer since it is a rule-based sentiment analyzer that can significantly reduce training time in RL training.
Demonstration of test cases for each target LLMs
7 Limitations & Future work
Limitations and future works are as follows :
- Self-defense in ChatGPT and GPT4: Since ChatGPT and GPT4 are trained with safety concerns and have randomness in text generation, the test cases we found may not lead to responses with higher sentiment gaps every time when inference. Our future work will involve exploring methods to identify stronger and more robust test cases.
- Demographic Categorization: The next limitation in our paper is that we categorize gender as either male or female. Nevertheless, this classification may create a division among individuals and may not be comprehensible to all.
- Grammar and Semantic in Test Cases While generating test cases that maintain diversity to some extent, there may be some grammar or semantic mistakes in test cases. This problem arises for two following reasons. The first is the degradation of GPT-2-Medium. The second is that paper only use the naive implementation of CDA (Lu et al., 2019) in the training loop due to the heavy overhead of a better version, such as another implementation (Lu et al., 2019) proposed, which needs extra pre-defined pronoun information from their training data. We think using a larger test case generator like (Perez et al., 2022) and improving perturbation method can also be the future works.