Dr. Yanjun Qi

Here are Papers I Reviewed. I am science curious.

Reviews Indexed

Toggle Menu

Index
Recent Posts By GenAI Category
- FM Basic
- FM Adapt
- FM Risk
- FM Multi
- FM Efficiency
Past Posts By DNN Category
Basic DNN Reads
- BasicDeep
- BasicML

FMRisk

Recent Readings for Risks of Foundation Models (since 2022) (Index of Posts):

No.	Read Date	Title and Information	We Read @
1	2024, May, 3	Safety Benchmark WMDP	2024-S27
2	2024, Mar, 28	LLM interpretibility, trust and knowledge conflicts	2024-S18
3	2024, Mar, 19	LLM Hallucination	2024-S15
4	2024, Mar, 12	More FM risk	2024-S13
5	2024, Feb, 29	LLM multimodal harm responses	2024-S12
6	2024, Feb, 27	FM toxicity / harmful outputs	2024-S11
7	2024, Feb, 22	FM fairness / bias issues	2024-S10
8	2024, Feb, 20	FM privacy leakage issues	2024-S9
9	2024, Feb, 15	FM copyright infrigement	2024-S8
10	2024, Feb, 13	Survey AI Risk framework	2024-S7
11	2024, Feb, 1	GenAI Guardrails	2024-S4

Here is a detailed list of posts!

[1]: Safety Benchmark WMDP

read on: - 03 May 2024
Safety Mitigate LLMEvaluate Adversarial

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Liu, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, Kallol Krishna Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, Uday Tupakula, Vijay Varadharajan, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang, Dan Hendrycks
The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 4,157 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop CUT, a state-of-the-art unlearning method based on controlling model representations. CUT reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs. We release our benchmark and code publicly at this https URL

[2]: LLM interpretibility, trust and knowledge conflicts

read on: - 28 Mar 2024
Interpretibility

Required Readings:

Rethinking interpretability in the era of large language models

Chandan Singh, Jeevana Priya Inala, Michel Galley, Rich Caruana, Jianfeng Gao
2024/1/30
Interpretable machine learning has exploded as an area of interest over the last decade, sparked by the rise of increasingly large datasets and deep neural networks. Simultaneously, large language models (LLMs) have demonstrated remarkable capabilities across a wide array of tasks, offering a chance to rethink opportunities in interpretable machine learning. Notably, the capability to explain in natural language allows LLMs to expand the scale and complexity of patterns that can be given to a human. However, these new capabilities raise new challenges, such as hallucinated explanations and immense computational costs. In this position paper, we start by reviewing existing methods to evaluate the emerging field of LLM interpretation (both interpreting LLMs and using LLMs for explanation). We contend that, despite their limitations, LLMs hold the opportunity to redefine interpretability with a more ambitious scope across many applications, including in auditing LLMs themselves. We highlight two emerging research priorities for LLM interpretation: using LLMs to directly analyze new datasets and to generate interactive explanations.

The Claude 3 Model Family: Opus, Sonnet, Haiku

https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
We introduce Claude 3, a new family of large multimodal models – Claude 3 Opus, our most capable offering, Claude 3 Sonnet, which provides a combination of skills and speed, and Claude 3 Haiku, our fastest and least expensive model. All new models have vision capabilities that enable them to process and analyze image data. The Claude 3 family demonstrates strong performance across benchmark evaluations and sets a new standard on measures of reasoning, math, and coding. Claude 3 Opus achieves state-of-the-art results on evaluations like GPQA [1], MMLU [2], MMMU [3] and many more. Claude 3 Haiku performs as well or better than Claude 2 [4] on most pure-text tasks, while Sonnet and Opus significantly outperform it. Additionally, these models exhibit improved fluency in non-English languages, making them more versatile for a global audience. In this report, we provide an in-depth analysis of our evaluations, focusing on core capabilities, safety, societal impacts, and the catastrophic risk assessments we committed to in our Responsible Scaling Policy [5].

[3]: LLM Hallucination

read on: - 19 Mar 2024
Hallucination

In this session, our readings cover:

Required Readings:

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

https://arxiv.org/abs/2311.05232
The emergence of large language models (LLMs) has marked a significant breakthrough in natural language processing (NLP), leading to remarkable advancements in text understanding and generation. Nevertheless, alongside these strides, LLMs exhibit a critical tendency to produce hallucinations, resulting in content that is inconsistent with real-world facts or user inputs. This phenomenon poses substantial challenges to their practical deployment and raises concerns over the reliability of LLMs in real-world scenarios, which attracts increasing attention to detect and mitigate these hallucinations. In this survey, we aim to provide a thorough and in-depth overview of recent advances in the field of LLM hallucinations. We begin with an innovative taxonomy of LLM hallucinations, then delve into the factors contributing to hallucinations. Subsequently, we present a comprehensive overview of hallucination detection methods and benchmarks. Additionally, representative approaches designed to mitigate hallucinations are introduced accordingly. Finally, we analyze the challenges that highlight the current limitations and formulate open questions, aiming to delineate pathways for future research on hallucinations in LLMs.

[4]: More FM risk

read on: - 12 Mar 2024
Safety Adversarial

In this session, our readings cover:

Required Readings:

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?

https://dl.acm.org/doi/10.1145/3442188.3445922
The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2/3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models.

Even More

ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation / EMNLP2023

Despite remarkable advances that large language models have achieved in chatbots nowadays, maintaining a non-toxic user-AI interactive environment has become increasingly critical nowadays. However, previous efforts in toxicity detection have been mostly based on benchmarks derived from social media contents, leaving the unique challenges inherent to real-world user-AI interactions insufficiently explored. In this work, we introduce ToxicChat, a novel benchmark constructed based on real user queries from an open-source chatbot. This benchmark contains the rich, nuanced phenomena that can be tricky for current toxicity detection models to identify, revealing a significant domain difference when compared to social media contents. Our systematic evaluation of models trained on existing toxicity datasets has shown their shortcomings when applied to this unique domain of ToxicChat. Our work illuminates the potentially overlooked challenges of toxicity detection in real-world user-AI conversations. In the future, ToxicChat can be a valuable resource to drive further advancements toward building a safe and healthy environment for user-AI interactions.

OpenAI on LLM generated bio-x-risk

Building an early warning system for LLM-aided biological threat creation
https://openai.com/research/building-an-early-warning-system-for-llm-aided-biological-threat-creation

A misleading open letter about sci-fi AI dangers ignores the real risks

https://www.aisnakeoil.com/p/a-misleading-open-letter-about-sci

https://deepmind.google/discover/blog/evaluating-social-and-ethical-risks-from-generative-ai/

Managing Existential Risk from AI without Undercutting Innovation

https://www.csis.org/analysis/managing-existential-risk-ai-without-undercutting-innovation

[5]: LLM multimodal harm responses

read on: - 29 Feb 2024
Safety Adversarial

In this session, our readings cover:

Required Readings:

Dingcheng Yang, Yang Bai, Xiaojun Jia, Yang Liu, Xiaochun Cao, Wenjian Yu
Diffusion models have been widely deployed in various image generation tasks, demonstrating an extraordinary connection between image and text modalities. However, they face challenges of being maliciously exploited to generate harmful or sensitive images by appending a specific suffix to the original prompt. Existing works mainly focus on using single-modal information to conduct attacks, which fails to utilize multi-modal features and results in less than satisfactory performance. Integrating multi-modal priors (MMP), i.e. both text and image features, we propose a targeted attack method named MMP-Attack in this work. Specifically, the goal of MMP-Attack is to add a target object into the image content while simultaneously removing the original object. The MMP-Attack shows a notable advantage over existing works with superior universality and transferability, which can effectively attack commercial text-to-image (T2I) models such as DALL-E 3. To the best of our knowledge, this marks the first successful attempt of transfer-based attack to commercial T2I models. Our code is publicly available at ….

A Pilot Study of Query-Free Adversarial Attack against Stable Diffusion

https://ieeexplore.ieee.org/document/10208563
Despite the record-breaking performance in Text-to-Image (T2I) generation by Stable Diffusion, less research attention is paid to its adversarial robustness. In this work, we study the problem of adversarial attack generation for Stable Diffusion and ask if an adversarial text prompt can be obtained even in the absence of end-to-end model queries. We call the resulting problem ‘query-free attack generation’. To resolve this problem, we show that the vulnerability of T2I models is rooted in the lack of robustness of text encoders, e.g., the CLIP text encoder used for attacking Stable Diffusion. Based on such insight, we propose both untargeted and targeted query-free attacks, where the former is built on the most influential dimensions in the text embedding space, which we call steerable key dimensions. By leveraging the proposed attacks, we empirically show that only a five-character perturbation to the text prompt is able to cause the significant content shift of synthesized images using Stable Diffusion. Moreover, we show that the proposed target attack can precisely steer the diffusion model to scrub the targeted image content without causing much change in untargeted image content.

[6]: FM toxicity / harmful outputs

read on: - 27 Feb 2024
Safety Adversarial

In this session, our readings cover:

Required Readings:

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

https://arxiv.org/abs/2402.04249
Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework to rigorously assess new methods. To address this issue, we introduce HarmBench, a standardized evaluation framework for automated red teaming. We identify several desirable properties previously unaccounted for in red teaming evaluations and systematically design HarmBench to meet these criteria. Using HarmBench, we conduct a large-scale comparison of 18 red teaming methods and 33 target LLMs and defenses, yielding novel insights. We also introduce a highly efficient adversarial training method that greatly enhances LLM robustness across a wide range of attacks, demonstrating how HarmBench enables codevelopment of attacks and defenses. We open source HarmBench at this https URL.

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training
Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.

[7]: FM fairness / bias issues

read on: - 22 Feb 2024
Bias Adversarial

In this session, our readings cover:

Required Readings:

Evaluating and Mitigating Discrimination in Language Model Decisions

https://arxiv.org/abs/2312.03689
As language models (LMs) advance, interest is growing in applying them to high-stakes societal decisions, such as determining financing or housing eligibility. However, their potential for discrimination in such contexts raises ethical concerns, motivating the need for better methods to evaluate these risks. We present a method for proactively evaluating the potential discriminatory impact of LMs in a wide range of use cases, including hypothetical use cases where they have not yet been deployed. Specifically, we use an LM to generate a wide array of potential prompts that decision-makers may input into an LM, spanning 70 diverse decision scenarios across society, and systematically vary the demographic information in each prompt. Applying this methodology reveals patterns of both positive and negative discrimination in the Claude 2.0 model in select settings when no interventions are applied. While we do not endorse or permit the use of language models to make automated decisions for the high-risk use cases we study, we demonstrate techniques to significantly decrease both positive and negative discrimination through careful prompt engineering, providing pathways toward safer deployment in use cases where they may be appropriate. Our work enables developers and policymakers to anticipate, measure, and address discrimination as language model capabilities and applications continue to expand. We release our dataset and prompts at this https URL

[8]: FM privacy leakage issues

read on: - 20 Feb 2024
Mitigate LLMEvaluate Adversarial

In this session, our readings cover:

Required Readings:

Are Large Pre-Trained Language Models Leaking Your Personal Information?

https://arxiv.org/abs/2205.12628
Jie Huang, Hanyin Shao, Kevin Chen-Chuan Chang Are Large Pre-Trained Language Models Leaking Your Personal Information? In this paper, we analyze whether Pre-Trained Language Models (PLMs) are prone to leaking personal information. Specifically, we query PLMs for email addresses with contexts of the email address or prompts containing the owner’s name. We find that PLMs do leak personal information due to memorization. However, since the models are weak at association, the risk of specific personal information being extracted by attackers is low. We hope this work could help the community to better understand the privacy risk of PLMs and bring new insights to make PLMs safe.

Privacy Risks of General-Purpose Language Models

https://ieeexplore.ieee.org/abstract/document/9152761
We find the text embeddings from general-purpose language models would capture much sensitive information from the plain text. Once being accessed by the adversary, the embeddings can be reverse-engineered to disclose sensitive information of the victims for further harassment. Although such a privacy risk can impose a real threat to the future leverage of these promising NLP tools, there are neither published attacks nor systematic evaluations by far for the mainstream industry-level language models. To bridge this gap, we present the first systematic study on the privacy risks of 8 state-of-the-art language models with 4 diverse case studies. By constructing 2 novel attack classes, our study demonstrates the aforementioned privacy risks do exist and can impose practical threats to the application of general-purpose language models on sensitive data covering identity, genome, healthcare and location. For example, we show the adversary with nearly no prior knowledge can achieve about 75% accuracy when inferring the precise disease site from Bert embeddings of patients’ medical descriptions. As possible countermeasures, we propose 4 different defenses (via rounding, different…

[9]: FM copyright infrigement

read on: - 15 Feb 2024
Mitigate LLMEvaluate Adversarial

In this session, our readings cover:

Required Readings:

Foundation Models and Fair Use

Peter Henderson, Xuechen Li, Dan Jurafsky, Tatsunori Hashimoto, Mark A. Lemley, Percy Liang
URL
Existing foundation models are trained on copyrighted material. Deploying these models can pose both legal and ethical risks when data creators fail to receive appropriate attribution or compensation. In the United States and several other countries, copyrighted content may be used to build foundation models without incurring liability due to the fair use doctrine. However, there is a caveat: If the model produces output that is similar to copyrighted data, particularly in scenarios that affect the market of that data, fair use may no longer apply to the output of the model. In this work, we emphasize that fair use is not guaranteed, and additional work may be necessary to keep model development and deployment squarely in the realm of fair use. First, we survey the potential risks of developing and deploying foundation models based on copyrighted content. We review relevant U.S. case law, drawing parallels to existing and potential applications for generating text, source code, and visual art. Experiments confirm that popular foundation models can generate content considerably similar to copyrighted material. Second, we discuss technical mitigations that can help foundation models stay in line with fair use. We argue that more research is needed to align mitigation strategies with the current state of the law. Lastly, we suggest that the law and technical mitigations should co-evolve. For example, coupled with other policy mechanisms, the law could more explicitly consider safe harbors when strong technical tools are used to mitigate infringement harms. This co-evolution may help strike a balance between intellectual property and innovation, which speaks to the original goal of fair use. But we emphasize that the strategies we describe here are not a panacea and more work is needed to develop policies that address the potential harms of foundation models.

Extracting Training Data from Diffusion Models

Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, Eric Wallace
Image diffusion models such as DALL-E 2, Imagen, and Stable Diffusion have attracted significant attention due to their ability to generate high-quality synthetic images. In this work, we show that diffusion models memorize individual images from their training data and emit them at generation time. With a generate-and-filter pipeline, we extract over a thousand training examples from state-of-the-art models, ranging from photographs of individual people to trademarked company logos. We also train hundreds of diffusion models in various settings to analyze how different modeling and data decisions affect privacy. Overall, our results show that diffusion models are much less private than prior generative models such as GANs, and that mitigating these vulnerabilities may require new advances in privacy-preserving training.

A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT

https://arxiv.org/abs/2303.04226
Recently, ChatGPT, along with DALL-E-2 and Codex,has been gaining significant attention from society. As a result, many individuals have become interested in related resources and are seeking to uncover the background and secrets behind its impressive performance. In fact, ChatGPT and other Generative AI (GAI) techniques belong to the category of Artificial Intelligence Generated Content (AIGC), which involves the creation of digital content, such as images, music, and natural language, through AI models. The goal of AIGC is to make the content creation process more efficient and accessible, allowing for the production of high-quality content at a faster pace. AIGC is achieved by extracting and understanding intent information from instructions provided by human, and generating the content according to its knowledge and the intent information. In recent years, large-scale models have become increasingly important in AIGC as they provide better intent extraction and thus, improved generation results. With the growth of data and the size of the models, the distribution that the model can learn becomes more comprehensive and closer to reality, leading to more realistic and high-quality content generation. This survey provides a comprehensive review on the history of generative models, and basic components, recent advances in AIGC from unimodal interaction and multimodal interaction. From the perspective of unimodality, we introduce the generation tasks and relative models of text and image. From the perspective of multimodality, we introduce the cross-application between the modalities mentioned above. Finally, we discuss the existing open problems and future challenges in AIGC.

[10]: Survey AI Risk framework

read on: - 13 Feb 2024
Mitigate LLMEvaluate Adversarial

In this session, our readings cover:

Required Readings:

TrustLLM: Trustworthiness in Large Language Models

https://arxiv.org/abs/2401.05561
Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. Therefore, ensuring the trustworthiness of LLMs emerges as an important topic. This paper introduces TrustLLM, a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, established benchmark, evaluation, and analysis of trustworthiness for mainstream LLMs, and discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TrustLLM, consisting of over 30 datasets. Our findings firstly show that in general trustworthiness and utility (i.e., functional effectiveness) are positively related. Secondly, our observations reveal that proprietary LLMs generally outperform most open-source counterparts in terms of trustworthiness, raising concerns about the potential risks of widely accessible open-source LLMs. However, a few open-source LLMs come very close to proprietary ones. Thirdly, it is important to note that some LLMs may be overly calibrated towards exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently not responding. Finally, we emphasize the importance of ensuring transparency not only in the models themselves but also in the technologies that underpin trustworthiness. Knowing the specific trustworthy technologies that have been employed is crucial for analyzing their effectiveness.

A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly

Large Language Models (LLMs), such as ChatGPT and Bard, have revolutionized natural language understanding and generation. They possess deep language comprehension, human-like text generation capabilities, contextual awareness, and robust problem-solving skills, making them invaluable in various domains (e.g., search engines, customer support, translation). In the meantime, LLMs have also gained traction in the security community, revealing security vulnerabilities and showcasing their potential in security-related tasks. This paper explores the intersection of LLMs with security and privacy. Specifically, we investigate how LLMs positively impact security and privacy, potential risks and threats associated with their use, and inherent vulnerabilities within LLMs. Through a comprehensive literature review, the paper categorizes the papers into “The Good” (beneficial LLM applications), “The Bad” (offensive applications), and “The Ugly” (vulnerabilities of LLMs and their defenses). We have some interesting findings. For example, LLMs have proven to enhance code security (code vulnerability detection) and data privacy (data confidentiality protection), outperforming traditional methods. However, they can also be harnessed for various attacks (particularly user-level attacks) due to their human-like reasoning abilities. We have identified areas that require further research efforts. For example, Research on model and parameter extraction attacks is limited and often theoretical, hindered by LLM parameter scale and confidentiality. Safe instruction tuning, a recent development, requires more exploration. We hope that our work can shed light on the LLMs’ potential to both bolster and jeopardize cybersecurity
https://arxiv.org/abs/2312.02003

[11]: GenAI Guardrails

read on: - 01 Feb 2024
Mitigate Adversarial

In this session, our readings cover:

Required Readings:

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

https://arxiv.org/abs/2312.06674
We introduce Llama Guard, an LLM-based input-output safeguard model geared towards Human-AI conversation use cases. Our model incorporates a safety risk taxonomy, a valuable tool for categorizing a specific set of safety risks found in LLM prompts (i.e., prompt classification). This taxonomy is also instrumental in classifying the responses generated by LLMs to these prompts, a process we refer to as response classification. For the purpose of both prompt and response classification, we have meticulously gathered a dataset of high quality. Llama Guard, a Llama2-7b model that is instruction-tuned on our collected dataset, albeit low in volume, demonstrates strong performance on existing benchmarks such as the OpenAI Moderation Evaluation dataset and ToxicChat, where its performance matches or exceeds that of currently available content moderation tools. Llama Guard functions as a language model, carrying out multi-class classification and generating binary decision scores. Furthermore, the instruction fine-tuning of Llama Guard allows for the customization of tasks and the adaptation of output formats. This feature enhances the model’s capabilities, such as enabling the adjustment of taxonomy categories to align with specific use cases, and facilitating zero-shot or few-shot prompting with diverse taxonomies at the input. We are making Llama Guard model weights available and we encourage researchers to further develop and adapt them to meet the evolving needs of the community for AI safety.

Here is a name list of posts!

BackTop

Safety Benchmark WMDP

1 minute read

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatt...

LLM interpretibility, trust and knowledge conflicts

15 minute read

Required Readings:

LLM Hallucination

15 minute read

In this session, our readings cover:

More FM risk

38 minute read

In this session, our readings cover:

LLM multimodal harm responses

14 minute read

In this session, our readings cover:

FM toxicity / harmful outputs

10 minute read

In this session, our readings cover:

FM fairness / bias issues

33 minute read

In this session, our readings cover:

FM privacy leakage issues

14 minute read

In this session, our readings cover:

FM copyright infrigement

26 minute read

In this session, our readings cover:

Survey AI Risk framework

14 minute read

In this session, our readings cover:

GenAI Guardrails

19 minute read

In this session, our readings cover:

FMRisk

Recent Readings for Risks of Foundation Models (since 2022) (Index of Posts):

Here is a detailed list of posts!

[1]: Safety Benchmark WMDP

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

[2]: LLM interpretibility, trust and knowledge conflicts

Required Readings:

Rethinking interpretability in the era of large language models

The Claude 3 Model Family: Opus, Sonnet, Haiku

More Readings:

Knowledge Conflicts for LLMs: A Survey

Transformer Debugger

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Tracing Model Outputs to the Training Data

Language models can explain neurons in language models

[3]: LLM Hallucination

Required Readings:

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

More Readings:

LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond

Survey of Hallucination in Natural Language Generation

Do Language Models Know When They’re Hallucinating References?

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models’ Alignment

[4]: More FM risk

Required Readings:

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?

More Readings:

Low-Resource Languages Jailbreak GPT-4

A Survey of Safety and Trustworthiness of Large Language Models through the Lens of Verification and Validation

Even More

ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation / EMNLP2023

OpenAI on LLM generated bio-x-risk

A misleading open letter about sci-fi AI dangers ignores the real risks

Evaluating social and ethical risks from generative AI

Managing Existential Risk from AI without Undercutting Innovation

[5]: LLM multimodal harm responses

Required Readings:

Cheating Suffix: Targeted Attack to Text-To-Image Diffusion Models with Multi-Modal Priors

A Pilot Study of Query-Free Adversarial Attack against Stable Diffusion

More Readings:

Visual Instruction Tuning

GOAT-Bench: Safety Insights to Large Multimodal Models through Meme-Based Social Abuse

Misusing Tools in Large Language Models With Visual Adversarial Examples

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

[6]: FM toxicity / harmful outputs

Required Readings:

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

More Readings:

SafeText: A Benchmark for Exploring Physical Safety in Language Models

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Lessons learned on language model safety and misuse

Planning red teaming for large language models (LLMs) and their applications

ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models

[7]: FM fairness / bias issues

Required Readings:

Evaluating and Mitigating Discrimination in Language Model Decisions

More Readings:

Learning from Red Teaming: Gender Bias Provocation and Mitigation in Large Language Models

Machine Learning in development: Let’s talk about bias!

Exploring Social Bias in Chatbots using Stereotype Knowledge WNLP@ACL2019

Bias and Fairness in Large Language Models: A Survey

A Survey on Fairness in Large Language Models

[8]: FM privacy leakage issues

Required Readings:

Are Large Pre-Trained Language Models Leaking Your Personal Information?

Privacy Risks of General-Purpose Language Models

More Readings:

Privacy in Large Language Models: Attacks, Defenses and Future Directions

ProPILE: Probing Privacy Leakage in Large Language Models

[9]: FM copyright infrigement

Required Readings:

Foundation Models and Fair Use

Extracting Training Data from Diffusion Models

A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT

More Readings:

Audio Deepfake Detection: A Survey

Copyright Plug-in Market for The Text-to-Image Copyright Protection

Membership Inference Attacks against Language Models via Neighbourhood Comparison

Deepfake Taylor Swift event: