FMBasic


Recent Readings on Basic Topics of Foundation Models (since 2022): Index of Posts

No. | Read Date    | Title and Information We Read                | Session
----|--------------|----------------------------------------------|--------
1   | 2024, Feb 8  | Open Source LLM - Mistral / Data preparation | 2024-S6
2   | 2024, Feb 6  | Survey of human alignment                    | 2024-S5
3   | 2024, Jan 30 | LLM evaluation framework                     | 2024-S3
4   | 2024, Jan 23 | LLM basics                                   | 2024-S1
5   | 2024, Jan 18 | Introduction                                 | 2024-S0
6   | 2022, Dec 3  | RLHF + InstructGPT                           | 2022-W6
7   | 2022, Dec 1  | Stable Diffusion + DreamBooth + LoRA         | 2022-W5
8   | 2022, Oct 1  | Emergent Abilities of LLMs + ICLR            | 2022-W4


Here is a detailed list of posts!



[1]: Open Source LLM - Mistral / Data preparation


BasicLLM

In this session, our readings cover:

Required Readings:

Mistral 7B

  • https://mistral.ai/news/announcing-mistral-7b/
  • We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B – Instruct, that surpasses the Llama 2 13B – Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license.
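
The abstract names two attention mechanisms: grouped-query attention (GQA) and sliding window attention (SWA). As a rough illustration of the SWA idea, here is a minimal PyTorch sketch; this is not Mistral's actual implementation, and the tensor shapes and `window` default are assumptions:

```python
import torch

def sliding_window_attention(q, k, v, window=4096):
    """q, k, v: (batch, heads, seq_len, head_dim)."""
    seq_len = q.size(-2)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    pos = torch.arange(seq_len)
    # Causal mask, further restricted to the most recent `window` positions:
    # query i may attend only to keys j with i - window < j <= i.
    blocked = (pos[None, :] > pos[:, None]) | (pos[:, None] - pos[None, :] >= window)
    scores = scores.masked_fill(blocked, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

Because each query attends to at most the last `window` keys, per-token attention cost and key/value cache size stay bounded regardless of sequence length, which is the reduced inference cost the abstract refers to.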

More Readings:

OLMo: Accelerating the Science of Language Models

  • https://arxiv.org/abs/2402.00838

Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. To this end, this technical report details the first release of OLMo, a state-of-the-art, truly Open Language Model and its framework to build and study the science of language modeling. Unlike most prior efforts that have only released model weights and inference code, we release OLMo and the whole framework, including training data and training and evaluation code. We hope this release will empower and strengthen the open research community and inspire a new wave of innovation.

Mixtral of Experts

  • https://arxiv.org/abs/2401.04088
  • We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
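
To make the routing described above concrete, here is a hedged toy sketch of a top-2 mixture-of-experts layer; the layer sizes, SiLU activation, and names are illustrative assumptions, not Mixtral's code:

```python
import torch
import torch.nn as nn

class Top2MoE(nn.Module):
    def __init__(self, dim=512, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, dim)
        logits = self.router(x)                    # (num_tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)   # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                sel = idx[:, slot] == e            # tokens routed to expert e in this slot
                if sel.any():
                    out[sel] += weights[sel, slot:slot + 1] * expert(x[sel])
        return out
```

Only the two selected experts run for each token, which is how the total parameter count (all experts) can far exceed the parameters active per token, matching the 47B-total / 13B-active figures in the abstract.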

Llama 2: Open Foundation and Fine-Tuned Chat Models

  • https://arxiv.org/abs/2307.09288
  • In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

  • https://arxiv.org/abs/2101.00027
  • Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets – both existing and newly constructed – many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.

[2]: Survey of human alignment


Alignment

In this session, our readings cover:

Required Readings:

Aligning Large Language Models with Human: A Survey

  • https://arxiv.org/abs/2307.12966
  • https://huggingface.co/blog/the_n_implementation_details_of_rlhf_with_ppo
  • https://huggingface.co/blog/stackllama

More Readings:

GitHub: Awesome-RLHF

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

  • https://arxiv.org/abs/2301.13688
  • We study the design decisions of publicly available instruction tuning methods, and break down the development of Flan 2022 (Chung et al., 2022). Through careful ablation studies on the Flan Collection of tasks and methods, we tease apart the effect of design decisions which enable Flan-T5 to outperform prior work by 3-17%+ across evaluation settings. We find task balancing and enrichment techniques are overlooked but critical to effective instruction tuning, and in particular, training with mixed prompt settings (zero-shot, few-shot, and chain-of-thought) actually yields stronger (2%+) performance in all settings. In further experiments, we show Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks, motivating instruction-tuned models as more computationally-efficient starting checkpoints for new tasks. Finally, to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available at this https URL.

DPO - Direct Preference Optimization: Your Language Model is Secretly a Reward Model

  • https://arxiv.org/abs/2305.18290
  • https://huggingface.co/blog/dpo-trl
  • While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.
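
The closed-form insight above reduces to a one-line loss. A minimal sketch, assuming the per-sequence log-probabilities of the chosen and rejected responses have already been computed under both the policy and the frozen reference model:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit reward of a response: beta * (log pi(y|x) - log pi_ref(y|x)).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic (classification) loss on the reward margin: prefer chosen over rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

No reward model is fit and no sampling from the LM is needed during fine-tuning, which is where the stability and efficiency claims come from.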

Training language models to follow instructions with human feedback

  • https://arxiv.org/abs/2203.02155
  • “further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT.”

Deep reinforcement learning from human preferences

  • https://openreview.net/forum?id=GisHNaleWiA
  • “explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function”
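
The quoted setup, fitting a reward model from pairwise human preferences, is commonly formalized with a Bradley-Terry likelihood. A hedged sketch of that loss; the `reward_model` interface here is an assumption for illustration:

```python
import torch.nn.functional as F

def preference_loss(reward_model, preferred_segments, rejected_segments):
    # Sum predicted per-step rewards over each trajectory segment.
    r_pref = reward_model(preferred_segments).sum(dim=-1)
    r_rej = reward_model(rejected_segments).sum(dim=-1)
    # Bradley-Terry model: P(preferred wins) = sigmoid(r_pref - r_rej);
    # minimize the negative log-likelihood of the human labels.
    return -F.logsigmoid(r_pref - r_rej).mean()
```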

[3]: LLM evaluation framework


LLMEvaluate

In this session, our readings cover:

Required Readings:

Holistic Evaluation of Text-To-Image Models

  • https://arxiv.org/abs/2311.04287
  • The stunning qualitative improvement of recent text-to-image models has led to their widespread attention and adoption. However, we lack a comprehensive quantitative understanding of their capabilities and risks. To fill this gap, we introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM). Whereas previous evaluations focus mostly on text-image alignment and image quality, we identify 12 aspects, including text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency. We curate 62 scenarios encompassing these aspects and evaluate 26 state-of-the-art text-to-image models on this benchmark. Our results reveal that no single model excels in all aspects, with different models demonstrating different strengths. We release the generated images and human evaluation results for full transparency at this https URL and the code at this https URL, which is integrated with the HELM codebase.

Holistic Evaluation of Language Models

  • https://arxiv.org/abs/2211.09110

More Readings:

Challenges in evaluating AI systems

  • https://www.anthropic.com/news/evaluating-ai-systems

Evaluating Large Language Models: A Comprehensive Survey

  • https://arxiv.org/abs/2310.19736
  • This survey endeavors to offer a panoramic perspective on the evaluation of LLMs. We categorize the evaluation of LLMs into three major groups: knowledge and capability evaluation, alignment evaluation and safety evaluation. In addition to the comprehensive review on the evaluation methodologies and benchmarks on these three aspects, we collate a compendium of evaluations pertaining to LLMs’ performance in specialized domains, and discuss the construction of comprehensive evaluation platforms that cover LLM evaluations on capabilities, alignment, safety, and applicability.

Evaluating Large Language Models Trained on Code

  • https://arxiv.org/abs/2107.03374

Chatbot Arena Leaderboard

  • https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard

Leveraging Large Language Models for NLG Evaluation: A Survey

  • https://arxiv.org/abs/2401.07103

[4]: LLM basics


BasicLLM

Required Readings:

Emergent Abilities of Large Language Models

  • URL
  • “[We consider] an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models.”

Language Models are Few-Shot Learners

  • URL
  • “GPT-3, 175B autoregressive LLM; show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches.”

Extra Readings:

A survey of Generative AI Applications

  • https://arxiv.org/abs/2306.02781
  • Generative AI has experienced remarkable growth in recent years, leading to a wide array of applications across diverse domains. In this paper, we present a comprehensive survey of more than 350 generative AI applications, providing a structured taxonomy and concise descriptions of various unimodal and even multimodal generative AIs. The survey is organized into sections, covering a wide range of unimodal generative AI applications such as text, images, video, gaming and brain information. Our survey aims to serve as a valuable resource for researchers and practitioners to navigate the rapidly expanding landscape of generative AI, facilitating a better understanding of the current state-of-the-art and fostering further innovation in the field.

Generative AI: Perspectives from Stanford HAI

  • https://hai.stanford.edu/generative-ai-perspectives-stanford-hai

[5]: Introduction


BasicLLM

Readings:

Basics of ML and DL:

Basics of NLP

  • URL
  • Typical NLP tasks / challenges / pipeline
  • f() on natural language:
    • Before deep NLP (pre-2012): BOW / LSI / topic modeling (LDA)
    • Word2Vec (2013-2016): GloVe / FastText
    • Recurrent NNs (2014-2016): LSTM
    • Seq2Seq
    • Attention
    • Self-attention (2016-now)
    • Transformer (attention-only Seq2Seq); a minimal self-attention sketch follows this list
    • BERT / RoBERTa / XLNet / GPT / …
  • A good code walk-through of the Transformer at URL
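
As referenced in the list above, here is a minimal sketch of scaled dot-product self-attention, the core operation of the Transformer; the dimensions and names are illustrative:

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, dim); w_q, w_k, w_v: (dim, d_head) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.size(-1) ** 0.5)        # scaled pairwise similarity
    return torch.softmax(scores, dim=-1) @ v      # weighted mix of value vectors

x = torch.randn(10, 64)                           # 10 tokens, width 64
w_q, w_k, w_v = (torch.randn(64, 16) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)            # (10, 16)
```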

[6]: RLHF + InstructGPT


RL AGI language model Human Alignment
Training language models to follow instructions with human feedback

  • URL
  • “further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT.”

Deep reinforcement learning from human preferences

  • URL
  • “explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function”

[7]: Stable Diffusion + DreamBooth + LoRA


Diffusion Image synthesis Efficiency

Stable Diffusion

  • URL
  • “High-Resolution Image Synthesis with Latent Diffusion Models”

DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

  • URL
  • “‘personalization’ of text-to-image diffusion models. Given as input just a few images of a subject, we fine-tune a pretrained text-to-image model such that it learns to bind a unique identifier with that specific subject.”

LoRA: Low-Rank Adaptation of Large Language Models

  • URL
  • “propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times.”
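
The quoted mechanism is compact enough to sketch directly. A minimal, hedged PyTorch version; the `rank`/`alpha` defaults and the initialization are common conventions, not taken from the paper's code:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # pre-trained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # update starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # h = x W^T + scale * x A^T B^T; only A and B are trained.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Wrapping, say, a Transformer's attention projections this way leaves the original weights untouched, which is what shrinks the trainable-parameter count so drastically.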

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

  • https://arxiv.org/abs/2208.01618
  • Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, Daniel Cohen-Or
  • Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. In other words, we ask: how can we use language-guided models to turn our cat into a painting, or imagine a new product based on our favorite toy? Here we present a simple approach that allows such creative freedom. Using only 3-5 images of a user-provided concept, like an object or a style, we learn to represent it through new “words” in the embedding space of a frozen text-to-image model. These “words” can be composed into natural language sentences, guiding personalized creation in an intuitive way. Notably, we find evidence that a single word embedding is sufficient for capturing unique and varied concepts. We compare our approach to a wide range of baselines, and demonstrate that it can more faithfully portray the concepts across a range of applications and tasks.
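
As a toy illustration of the optimization pattern described above: the only trainable parameter is a single new embedding vector, and everything else is frozen. The objective below is a stand-in; in the real method the gradient comes from the frozen diffusion model's denoising loss on the 3-5 concept images:

```python
import torch

dim = 768                                       # assumed text-embedding width
s_star = torch.randn(dim, requires_grad=True)   # embedding of the new "word"
opt = torch.optim.Adam([s_star], lr=5e-3)

# Stand-in target; textual inversion instead backpropagates the frozen
# text-to-image model's reconstruction loss into s_star alone.
target = torch.randn(dim)
for _ in range(200):
    loss = ((s_star - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```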

[8]: Emergent Abilities of LLMs + ICLR


language model

Emergent Abilities of Large Language Models

  • URL
  • “[We consider] an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models.”

Language Models are Few-Shot Learners

  • URL
  • “GPT-3, 175B autoregressive LLM; show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches.”

On the Opportunities and Risks of Foundation Models

  • URL
  • “a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations).”

The Power of Scale for Parameter-Efficient Prompt Tuning

  • https://arxiv.org/abs/2104.08691
  • Brian Lester, Rami Al-Rfou, Noah Constant
  • In this work, we explore “prompt tuning”, a simple yet effective mechanism for learning “soft prompts” to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signal from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3’s “few-shot” learning by a large margin. More remarkably, through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method “closes the gap” and matches the strong performance of model tuning (where all model weights are tuned). This finding is especially relevant in that large models are costly to share and serve, and the ability to reuse one frozen model for multiple downstream tasks can ease this burden. Our method can be seen as a simplification of the recently proposed “prefix tuning” of Li and Liang (2021), and we provide a comparison to this and other similar approaches. Finally, we show that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer, as compared to full model tuning.
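
Mechanically, prompt tuning amounts to prepending a few trainable “virtual token” embeddings to the input while the LM itself stays frozen. A minimal sketch, with sizes and names as illustrative assumptions:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable prefix of `prompt_len` virtual-token embeddings."""
    def __init__(self, prompt_len=20, dim=768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)

    def forward(self, input_embeds):  # input_embeds: (batch, seq_len, dim)
        batch = input_embeds.size(0)
        prefix = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        # Only self.prompt receives gradients; the frozen LM consumes the
        # concatenated sequence as ordinary input embeddings.
        return torch.cat([prefix, input_embeds], dim=1)
```

One frozen model can then serve many tasks, each with its own tiny learned prefix, which is the serving benefit the abstract highlights.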


