KV Cache and Tooling
KV Caching in LLM:
- grouped query attention: https://arxiv.org/pdf/2305.13245.pdf
- Paged attention https://arxiv.org/pdf/2309.06180.pdf https://openreview.net/pdf?id=uNrFpDPMyo
Must know tools for training/finetuning/serving LLM’s -
-
Torchtune - Build on top of Pytorch, for training and finetuning LLM’s. Uses yaml based configs for easily running experiments. Github -
-
axolotl - Built on top on Huggigface peft and transformer library, supports fine-tuning a large number for models like Mistral, LLama etc. Provides support for techniques like RLHF, DPO, LORA, qLORA etc. Github
-
LitGPT - Build on nanoGPT and Megatron, support pre-training and fine-tuning, has examples like Starcoder, TinyLlama etc. Github -
-
Maxtext - Jax based library for training LLM’s on Google TPU’s with configs for models like Gemma, Mistral and LLama2 etc. Github
-
Langchain- https://python.langchain.com/docs/get_started/introduction
- haystack.deepset.ai
- https://github.com/deepset-ai/haystack
- LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it’s best suited for building RAG, question answering, semantic search or conversational agent chatbots.
- LlamaIndex
- https://docs.llamaindex.ai/en/stable/ LlamaIndex supports Retrieval-Augmented Generation (RAG). Instead of asking LLM to generate an answer immediately, LlamaIndex: retrieves information from your data sources first, / adds it to your question as context, and / asks the LLM to answer based on the enriched prompt.
- Making Retrieval Augmented Generation Fast
- https://www.pinecone.io/learn/fast-retrieval-augmented-generation/
- OpenMoE
- https://github.com/XueFuzhao/OpenMoE
More readings
Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond
- Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, Xia Hu
-
This paper presents a comprehensive and practical guide for practitioners and end-users working with Large Language Models (LLMs) in their downstream natural language processing (NLP) tasks. We provide discussions and insights into the usage of LLMs from the perspectives of models, data, and downstream tasks. Firstly, we offer an introduction and brief summary of current GPT- and BERT-style LLMs. Then, we discuss the influence of pre-training data, training data, and test data. Most importantly, we provide a detailed discussion about the use and non-use cases of large language models for various natural language processing tasks, such as knowledge-intensive tasks, traditional natural language understanding tasks, natural language generation tasks, emergent abilities, and considerations for specific tasks.We present various use cases and non-use cases to illustrate the practical applications and limitations of LLMs in real-world scenarios. We also try to understand the importance of data and the specific challenges associated with each NLP task. Furthermore, we explore the impact of spurious biases on LLMs and delve into other essential considerations, such as efficiency, cost, and latency, to ensure a comprehensive understanding of deploying LLMs in practice. This comprehensive guide aims to provide researchers and practitioners with valuable insights and best practices for working with LLMs, thereby enabling the successful implementation of these models in a wide range of NLP tasks. A curated list of practical guide resources of LLMs, regularly updated, .
- https://github.com/Mooler0410/LLMsPracticalGuide
Retentive Network: A Successor to Transformer for Large Language Models
- In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$ inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation… Show more
RWKV: Reinventing RNNs for the Transformer Era
Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transfor… Show more