In this session, our readings cover:

Required Readings:

Large Language Model based Multi-Agents: A Survey of Progress and Challenges

  • Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang
  • Large Language Models (LLMs) have achieved remarkable success across a wide array of tasks. Due to the impressive planning and reasoning abilities of LLMs, they have been used as autonomous agents to do many tasks automatically. Recently, based on the development of using one LLM as a single planning or decision-making agent, LLM-based multi-agent systems have achieved considerable progress in complex problem-solving and world simulation. To provide the community with an overview of this dynamic field, we present this survey to offer an in-depth discussion on the essential aspects of multi-agent systems based on LLMs, as well as the challenges. Our goal is for readers to gain substantial insights on the following questions: What domains and environments do LLM-based multi-agents simulate? How are these agents profiled and how do they communicate? What mechanisms contribute to the growth of agents’ capacities? For those interested in delving into this field of study, we also summarize the commonly used datasets or benchmarks for them to have convenient access. To keep researchers updated on the latest studies, we maintain an open-source GitHub repository, dedicated to outlining the research on LLM-based multi-agent systems.

More Readings:

Understanding the planning of LLM agents: A survey

  • https://arxiv.org/abs/2402.02716
  • As Large Language Models (LLMs) have shown significant intelligence, the progress to leverage LLMs as planning modules of autonomous agents has attracted more attention. This survey provides the first systematic view of LLM-based agents planning, covering recent works aiming to improve planning ability. We provide a taxonomy of existing works on LLM-Agent planning, which can be categorized into Task Decomposition, Plan Selection, External Module, Reflection and Memory. Comprehensive analyses are conducted for each direction, and further challenges for the field of research are discussed.

LLM Agents can Autonomously Hack Websites

  • Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan, Daniel Kang
  • In recent years, large language models (LLMs) have become increasingly capable and can now interact with tools (i.e., call functions), read documents, and recursively call themselves. As a result, these LLMs can now function autonomously as agents. With the rise in capabilities of these agents, recent work has speculated on how LLM agents would affect cybersecurity. However, not much is known about the offensive capabilities of LLM agents. In this work, we show that LLM agents can autonomously hack websites, performing tasks as complex as blind database schema extraction and SQL injections without human feedback. Importantly, the agent does not need to know the vulnerability beforehand. This capability is uniquely enabled by frontier models that are highly capable of tool use and leveraging extended context. Namely, we show that GPT-4 is capable of such hacks, but existing open-source models are not. Finally, we show that GPT-4 is capable of autonomously finding vulnerabilities in websites in the wild. Our findings raise questions about the widespread deployment of LLMs.

Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models

  • Zehui Chen, Kuikun Liu, Qiuchen Wang, Wenwei Zhang, Jiangning Liu, Dahua Lin, Kai Chen, Feng Zhao
  • Open-sourced Large Language Models (LLMs) have achieved great success in various NLP tasks, however, they are still far inferior to API-based models when acting as agents. How to integrate agent ability into general LLMs becomes a crucial and urgent problem. This paper first delivers three key observations: (1) the current agent training corpus is entangled with both formats following and agent reasoning, which significantly shifts from the distribution of its pre-training data; (2) LLMs exhibit different learning speeds on the capabilities required by agent tasks; and (3) current approaches have side-effects when improving agent abilities by introducing hallucinations. Based on the above findings, we propose Agent-FLAN to effectively Fine-tune LANguage models for Agents. Through careful decomposition and redesign of the training corpus, Agent-FLAN enables Llama2-7B to outperform prior best works by 3.5% across various agent evaluation datasets. With comprehensively constructed negative samples, Agent-FLAN greatly alleviates the hallucination issues based on our established evaluation benchmark. Besides, it consistently improves the agent capability of LLMs when scaling model sizes while slightly enhancing the general capability of LLMs. The code will be available at this https URL.

Humanoid Locomotion as Next Token Prediction

  • Ilija Radosavovic, Bike Zhang, Baifeng Shi, Jathushan Rajasegaran, Sarthak Kamat, Trevor Darrell, Koushil Sreenath, Jitendra Malik
  • We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. To account for the multi-modal nature of the data, we perform prediction in a modality-aligned way, and for each input token predict the next token from the same modality. This general formulation enables us to leverage data with missing modalities, like video trajectories without actions. We train our model on a collection of simulated trajectories coming from prior neural network policies, model-based controllers, motion capture data, and YouTube videos of humans. We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training like walking backward. These findings suggest a promising path toward learning challenging real-world control tasks by generative modeling of sensorimotor trajectories.

Outline

The presenters identify the three papers they will each discuss:

  • Large Language Model based Multi-Agents: A Survey of Progress and Challenges (presenter: Ritu)
  • Understanding the Planning of LLM Agents: A Survey (presenter: Afsara)
  • LLM Agents can Autonomously Hack Websites (presenter: Aidan)

In this slide the presenters define what autonomous agents are and give this equation formulating the planning task:

$p = (a_0, a_1, \dots, a_t) = \mathrm{plan}(E, g; \Theta, P)$

where $E$ is the environment, $g$ the goal, and the $a_i$ are the actions making up the plan $p$; $\Theta$ and $P$ represent the parameters of the LLM and of the task, respectively.
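
To make the formulation concrete, here is a minimal Python sketch of the abstraction; the `call_llm` callable, its prompt template, and the one-action-per-line output convention are illustrative assumptions, not details from the survey:

```python
from typing import Callable, List

def plan(env_description: str, goal: str,
         call_llm: Callable[[str], str], task_prompt: str) -> List[str]:
    """Produce p = (a_0, ..., a_t) = plan(E, g; Theta, P). The LLM's
    parameters Theta are hidden behind call_llm; the task parameters P
    are represented here by a prompt template."""
    prompt = task_prompt.format(environment=env_description, goal=goal)
    # Assume the LLM emits one action per line; each line is one a_i.
    return [a.strip() for a in call_llm(prompt).splitlines() if a.strip()]
```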

The presenters now discuss conventional approaches to autonomous agents, specifically bringing up the jugs-and-water problem.

Among symbolic methods, they describe the Planning Domain Definition Language (PDDL), which requires the effort of human experts to model the domain and lacks error tolerance.

Policy learning is a reinforcement-learning-based method, which requires many data points and thus can be impractical when data collection is expensive.
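
To ground the jugs-and-water example, here is how a conventional symbolic approach solves it by exhaustive state-space search; the capacities and target below are illustrative, not taken from the slides:

```python
from collections import deque

def solve_jugs(cap_a=3, cap_b=5, target=4):
    """BFS over (a, b) jug states; returns a shortest action sequence, or None."""
    start = (0, 0)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        (a, b), path = queue.popleft()
        if target in (a, b):
            return path
        moves = {
            "fill A": (cap_a, b), "fill B": (a, cap_b),
            "empty A": (0, b), "empty B": (a, 0),
            "pour A->B": (a - min(a, cap_b - b), b + min(a, cap_b - b)),
            "pour B->A": (a + min(b, cap_a - a), b - min(b, cap_a - a)),
        }
        for name, state in moves.items():
            if state not in seen:
                seen.add(state)
                queue.append((state, path + [name]))
    return None

print(solve_jugs())  # shortest solution, e.g. ['fill B', 'pour B->A', ...]
```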

The presenters show a graphic from the authors, their “novel and systematic” taxonomy for LLM planning, which sorts existing work into five key categories: External Planner, Reflection, Memory, Decomposition, and Selection.

This table provides more information on the taxonomy categories from the previous slide, and shows some specific representative works for each.

The authors discuss task decomposition, and specifically decomposition-first methods, wherein the model first decomposes a task into subgoals and then plans for each. This reduces the risk of hallucination and forgetting, but requires an additional adjustment mechanism, since the upfront decomposition cannot react to execution results.

In interleaved task decomposition, decomposition is dynamically adjusted based on environmental feedback. This improves fault tolerance. However, this can lead to hallucinations for very complicated tasks. Planning can also be constrained by the context length of the LLM.
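
A hedged sketch contrasting the two loops; the `llm.decompose`, `llm.plan`, `llm.next_subgoal`, and `env.step` interfaces are hypothetical stand-ins:

```python
def decomposition_first(llm, task):
    """Decompose once, then plan each sub-goal; needs an extra
    adjustment mechanism if a sub-goal later proves infeasible."""
    subgoals = llm.decompose(task)              # fixed upfront
    return [llm.plan(subgoal) for subgoal in subgoals]

def interleaved(llm, env, task, max_steps=20):
    """Alternate decomposition and execution, adjusting sub-goals
    based on environmental feedback (better fault tolerance)."""
    history = []
    for _ in range(max_steps):
        subgoal = llm.next_subgoal(task, history)   # dynamic decomposition
        feedback = env.step(subgoal)
        history.append((subgoal, feedback))
        if feedback == "done":
            break
    return history
```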

In multi-plan selection, an agent generates multiple candidate plans and selects the optimal one. In self-consistency, multiple distinct reasoning paths are sampled and a naive majority-vote strategy picks the final answer.
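
A minimal sketch of naive self-consistency, assuming a hypothetical `call_llm` sampler and treating the last line of each sampled reasoning path as its final answer:

```python
from collections import Counter

def self_consistency(call_llm, prompt, n=10):
    """Sample n distinct reasoning paths at nonzero temperature and
    return the majority-vote answer (the naive strategy)."""
    answers = []
    for _ in range(n):
        reasoning = call_llm(prompt, temperature=0.8)  # one sampled path
        answers.append(reasoning.splitlines()[-1])     # last line = final answer
    return Counter(answers).most_common(1)[0][0]
```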

In Tree-of-Thoughts, two explicit strategies, “sample” and “propose”, are used to generate candidate thoughts during the reasoning process. The resulting tree supports search algorithms like BFS and DFS.
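
A skeleton of ToT-style BFS, assuming hypothetical `llm.propose` (thought generation) and `llm.value` (state evaluation) calls:

```python
def tree_of_thoughts_bfs(llm, root, breadth=5, keep=2, depth=3):
    """BFS over partial solutions: 'propose' expands each frontier
    state, an LLM value call scores candidates, and only the top
    `keep` states survive to the next level."""
    frontier = [root]
    for _ in range(depth):
        candidates = []
        for state in frontier:
            candidates += llm.propose(state, k=breadth)       # generation
        frontier = sorted(candidates, key=llm.value,          # evaluation
                          reverse=True)[:keep]
    return frontier[0]
```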

The presenters show a visualization of the previously discussed methods as well as Input-Output Prompting and Self-Consistency with CoT.

This slide shows how an agent might answer questions related to thought generation and evaluation of plans for the Game of 24, a mathematical reasoning challenge where the goal is to combine 4 numbers with arithmetic operations to obtain 24.
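
As a concrete illustration of the task itself (not of ToT), a single instance can be brute-forced; this sketch only tries left-associative groupings, which suffices for inputs like [1, 2, 3, 4]:

```python
from itertools import permutations, product

def solve24_left_assoc(nums, target=24, eps=1e-6):
    """Try every ordering of the 4 numbers and every operator triple,
    grouped left-associatively: ((a op b) op c) op d."""
    for a, b, c, d in permutations(nums):
        for op1, op2, op3 in product("+-*/", repeat=3):
            expr = f"(({a}{op1}{b}){op2}{c}){op3}{d}"
            try:
                if abs(eval(expr) - target) < eps:
                    return expr
            except ZeroDivisionError:
                continue
    return None

print(solve24_left_assoc([1, 2, 3, 4]))  # e.g. ((1+2)+3)*4
```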

The presenter concludes with the three remaining methods:

  • External Planner-Aided Planning
  • Reflection and Refinement
  • Memory-Augmented Planning

External Planner-Aided Planning

This methodology employs an external planner to elevate the planning procedure, aiming to address the efficiency and feasibility issues of generated plans, while the LLM mainly plays the role of formalizing the task. The process can be formulated as follows:

$h = \mathrm{formalize}(E, g; \Theta, P), \qquad p = \mathrm{plan}(h; \Phi)$

where $\Phi$ denotes the external planner module and $h$ represents the formalized information.

Advantage: Symbolic systems offer benefits such as theoretical completeness, stability, and interpretability.

Large language models (LLMs) primarily serve a supportive function, focusing on parsing textual feedback and supplying extra reasoning details to aid in planning, especially in tackling complex issues.
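
A hedged sketch of this division of labor; `llm_formalize` is a hypothetical function that emits a PDDL problem description, and the planner invocation shown (a Fast Downward-style CLI) is illustrative rather than prescribed by the survey:

```python
import subprocess

def external_planner_aided_plan(llm_formalize, env_description, goal,
                                domain_file="domain.pddl"):
    """The LLM produces the formalized task h; the symbolic planner Phi
    performs the actual search, giving completeness and stability."""
    problem_pddl = llm_formalize(env_description, goal)   # h = formalize(E, g)
    with open("problem.pddl", "w") as f:
        f.write(problem_pddl)
    # p = plan(h; Phi): hand off to any PDDL planner binary.
    result = subprocess.run(
        ["fast-downward.py", domain_file, "problem.pddl",
         "--search", "astar(lmcut())"],
        capture_output=True, text=True)
    return result.stdout
```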

Reflection and Refinement

This methodology emphasizes improving planning ability through reflection and refinement: it encourages the LLM to reflect on failures and then refine the plan. The process can be formulated as follows:

$r_i = \mathrm{reflect}(E, g, p_i; \Theta, P), \qquad p_{i+1} = \mathrm{plan}(E, g, p_i, r_i; \Theta, P)$

LLM agents may encounter errors and fall into repetitive “thought loops” during planning because of current issues with hallucinations and inadequate reasoning skills for complex problems, exacerbated by restricted feedback. By reflecting on and summarizing their failures, agents can rectify mistakes and escape these loops in future attempts.
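
A minimal sketch of this loop; the `llm.plan`, `llm.reflect`, and `env.execute` interfaces are assumptions for illustration:

```python
def reflect_and_refine(llm, env, goal, max_rounds=3):
    """Plan, execute, and on failure feed a natural-language reflection
    back into the next planning round to escape repetitive thought loops."""
    reflections = []
    plan = None
    for _ in range(max_rounds):
        plan = llm.plan(goal, reflections)        # p_{i+1} conditions on r_0..r_i
        outcome = env.execute(plan)
        if outcome.success:
            break
        reflections.append(llm.reflect(goal, plan, outcome.feedback))  # r_i
    return plan
```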

Memory-Augmented Planning

This kind of approach enhances planning with an extra memory module in which valuable information is stored, such as commonsense knowledge, past experiences, and domain-specific knowledge. The information is retrieved at planning time, serving as auxiliary signals. The process can be formulated as follows:

$m = \mathrm{retrieve}(E, g; M), \qquad p = \mathrm{plan}(E, g, m; \Theta, P)$

where $M$ represents the memory module.

In this section, the presenter mainly discusses two kinds of memory: RAG-based memory and embodied memory.
Both strategies aim to augment LLM agents with the ability to store and recall past experiences, enhancing their planning and decision-making processes. RAG-based memory offers real-time, cost-effective memory updates in natural language, though its efficacy is contingent on the precision of the retrieval algorithm; fine-tuning can expand memory capacity, but at greater cost and with difficulty in maintaining detailed information. Embodied memory, which embeds historical experiences into the model’s parameters, shows potential for improved growth and error resilience in planning tasks, but the generation of new memories is limited by the LLM’s inherent generative abilities, which poses a significant challenge for growing the capabilities of less sophisticated LLM agents from self-generated memories.
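
A minimal sketch of a RAG-based memory, assuming an `embed` function that maps text to unit-norm vectors (any off-the-shelf sentence embedder would do):

```python
import numpy as np

class RAGMemory:
    """Store past experiences as (embedding, text) pairs and retrieve
    the top-k most similar ones as auxiliary planning signals."""
    def __init__(self, embed):
        self.embed = embed          # text -> unit-norm vector (assumed)
        self.keys, self.texts = [], []

    def add(self, experience: str):
        self.keys.append(self.embed(experience))
        self.texts.append(experience)

    def retrieve(self, query: str, k=3):
        if not self.keys:
            return []
        sims = np.array(self.keys) @ self.embed(query)   # cosine similarity
        return [self.texts[i] for i in np.argsort(sims)[::-1][:k]]
```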

Challenges

The presenters conclude with five challenges:

  • Hallucinations
  • Feasibility of Generated Plans
  • Efficiency of Generated Plans
  • Multi-Modal Environment Feedback
  • Fine-grained Evaluation

This slide points out significant challenges in LLM-based planning: hallucinations that lead to unrealistic or unfaithful task executions, the suboptimal feasibility and efficiency of generated plans, difficulties in processing multi-modal environmental feedback, and the need for fine-grained evaluation methods. These challenges stem from LLMs’ foundational reliance on statistical learning, which, while adept at predicting the next word over large datasets, falls short in handling complex, less common constraints and in integrating diverse real-world feedback beyond text. Current benchmarks fail to provide a nuanced assessment of LLM performance, emphasizing task completion over process quality. There is a clear need for more sophisticated evaluation standards that capture stepwise quality and adherence to complex instructions in planning tasks, to better prepare LLMs for real-world applications.

Paper II: Large Language Model based Multi-Agents: A Survey of Progress and Challenges

Introduction

A survey on LLM-based multi-agent systems explores their use in diverse domains, communication methods, and capacity growth mechanisms, aiming to provide insights and resources for researchers.

Goals

This paper provides insights into the following questions:

  • What domains and environments do LLM-based multi-agents simulate?
  • How are these agents profiled and how do they communicate?
  • What mechanisms contribute to the growth of agents’ capacities?

LLM-based Single-Agents and LLM-based Multi-Agents

The presenter then contrasts two approaches within the domain of large language models (LLMs): single-agent and multi-agent systems.

Single-agent LLMs focus on subdividing tasks, methodical thinking, and learning from experiences to enhance autonomy and problem-solving. They utilize both short-term and long-term memory to improve learning and maintain context.

On the other hand, multi-agent LLMs harness the power of collective intelligence and skill diversity across multiple agents. These systems expand upon the capabilities of single-agent models by having specialized LLMs that can collaborate, enabling more effective simulation of complex, real-world environments through their interactions.

Overall, while single LLM agents are noted for their individual cognitive abilities and internal processing, multi-agent LLM systems capitalize on the synergy between diverse agents to facilitate collaborative and dynamic decision-making, suitable for complex and nuanced tasks.

LLM-Based Multi-Agents

This graph shows the rising trend in the research field of LLM-based Multi-Agents. For Problem Solving and World Simulation, the authors categorize current work into several categories and count the number of papers of different types at 3-month intervals. The number at each leaf node denotes the count of papers within that category.

Agents-Environment Interface

In LLM-MA systems, the interface through which agents operate on their environment is important. This section presents the architecture of LLM-MA systems, explaining how these interfaces influence agent behavior and learning. Slide 21 shows three distinct types: sandbox environments, where agents evolve in a controlled virtual setting; physical environments, which demand real-world interaction and adherence to physical laws; and scenarios where no specific external environment is involved, the focus being communication among agents.

Agent Profiling

Defining Agent Characteristics: Agents within LLM-MA systems are not one-size-fits-all; they are defined by a set of traits, actions, and skills tailored to specific objectives, whether for gaming, software development, or debate. Three types of agent profiling are explored: predefined, model-generated, and data-derived. Each type plays a crucial role in how agents perceive and interact with their environment.

Agents Communication

Communication is the cornerstone of multi-agent systems. Slide 23 explains the paradigms of communication (cooperative, debate, and competitive) and the structures that support them, whether layered, decentralized, or centralized, and looks at how agents exchange content, from text to more complex data, and at the implications of these communication networks on the overall system dynamics.
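
As a rough illustration of two of these structures (not code from the survey; the `Agent.respond` interface and prompt format are hypothetical), a shared-transcript round versus a centralized hub might look like this:

```python
from typing import List

class Agent:
    def __init__(self, name: str, llm):
        self.name, self.llm = name, llm   # llm: str -> str (assumed)

    def respond(self, transcript: List[str]) -> str:
        return self.llm(f"You are {self.name}.\n" + "\n".join(transcript))

def shared_pool_round(agents, transcript):
    """Every agent reads the shared transcript and speaks in turn."""
    for agent in agents:
        transcript.append(f"{agent.name}: {agent.respond(transcript)}")
    return transcript

def centralized_round(hub, workers, transcript):
    """A central agent collects worker replies, then summarizes/decides."""
    replies = [f"{w.name}: {w.respond(transcript)}" for w in workers]
    transcript.append(f"{hub.name}: {hub.respond(transcript + replies)}")
    return transcript
```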

Agents Capabilities Acquisition

How Agents Acquire and Evolve Skills: Agents in LLM-MA systems must continuously learn and adapt. Slide 24 shows the types of feedback agents receive, from environmental interactions to human input, and how this feedback shapes their learning process. It also examines the methods agents use to adjust to complex problems, including memory retention, self-evolution, and dynamic generation, highlighting the adaptability and growth potential within these systems.

LLM-MA for Problem Solving

In the complex landscape of modern challenges, LLM-MA stands out as a versatile problem-solving tool. Slide 25 shows the integral roles LLM-MA systems play in various settings:

  • Software Development: LLM-MAs emulate critical roles within development teams, enhancing collaboration and efficiency.
  • Embodied Agents: In robotics, LLM-MAs drive cooperative task execution, a breakthrough in the field of autonomous systems.
  • Science Experiments: LLM-MAs are revolutionizing research by forming virtual science teams, facilitating experiments that would otherwise be unfeasible.
  • Science Debate: LLM-MAs sharpen collective reasoning and consensus building in scientific debate.

LLM-MA for World Simulation

LLM-MAs are not only problem-solvers but also powerful simulators capable of replicating the intricacies of real-world dynamics. Slide 26 mentions various simulation applications:

  • Societal Simulation: LLM-MAs can model societal behaviors, providing a platform to explore social dynamics and test sociological theories.
  • Gaming: LLM-MAs can create environments that test game-theoretic principles and hypotheses, offering new insights into strategic interactions.
  • Psychology: LLM-MAs can help in understanding group behaviors and mental health, supporting the development of psychological interventions.
  • Economy: LLM-MAs can model economic systems and decision-making behaviors, a valuable tool for economists and strategists.
  • Recommender Systems: LLM-MAs can capture user preferences, tailoring personalized experiences on digital platforms.
  • Policy Making: LLM-MAs can be used in policy development, assessing the potential impacts of decisions on communities.

Summary of LLM-MA Studies

As LLM-based Multi-Agent Systems are constantly evolving, researchers categorize studies by motivation, domain goals, and detailed functionalities. Slide 27 provides a summary of current works, breaking down various research projects and their distinctive features. It shows these studies range across problem-solving and world simulations, examining their Agents-Environment Interface, Agent Profiling, Agent Communication, and Agent Capability Acquisition. This offers a bird’s eye view of the state-of-the-art research components which are intertwined in driving forward LLM-MA research.

Challenges and Opportunities

As LLM-MA systems develop, new ideas bring both challenges and opportunities. Slide 28 shows multi-modal environments, the quest to address hallucination in agent communication, the pursuit of collective intelligence, the scaling of LLM-MA systems, and the continuous improvement of evaluation benchmarks. It aims to map out the current obstacles and discuss the complexities of scaling and integrating multi-modal data, and show opportunities for innovation in collective AI.

Conclusion

In conclusion, the slides explored the development of LLM-based Multi-Agent systems: from their creation, to their interaction with environments and peers, to their path toward becoming more adept, synthesizing what these systems are currently capable of and the tools they utilize. The hope is that this discussion introduces the practical applications, the achievements to date, and the potential that lies ahead, and encourages researchers to explore the future of LLM-MA systems and the new ideas they inspire.

LLM Agents can Autonomously Hack Websites

As LLMs’ capabilities improve, they also show research value in the area of cybersecurity; one application explored in the paper is the ability of LLM agents to autonomously hack websites.

Recent developments have enabled LLMs to interact with tools through several different methods, such as function calls, to read documents, and to recursively prompt themselves. This has allowed LLMs to be applied to more areas, including chemistry.

This paper focuses on web security: it has been shown that LLMs can be used to hack websites autonomously. LLMs have the ability to exploit vulnerabilities in websites, including simple SQL injection, cross-site scripting, and cross-site request forgery.

Figure 1 shows the schematic: the hacker prompts the LLM agent, which is equipped with tools, and the agent hacks the website autonomously.

The paper tests 15 vulnerabilities across different difficulty levels, touching on a range of techniques.

The authors of the paper also tested 10 models, using the OpenAI API for the GPT models and the Together AI API for the rest.

The paper provides specifics on the agent setup, which has three main components. First, the authors set up a sandboxed headless web browser using Playwright; the LLM agents also have access to the terminal and a Python code interpreter. Second, the LLM agents are given access to six web-hacking documents. Third, the LLM agents are given the ability to plan (through OpenAI’s Assistants API).

To score the LLM agents, the authors set goals based on the vulnerabilities, gave the system 10 minutes to execute, and allowed 5 attempts per vulnerability. A run is considered successful if any of the 5 attempts achieves the goal.
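
A hedged sketch of this scoring protocol; `run_agent` and `check_goal` are hypothetical stand-ins for the agent loop and the per-vulnerability success check:

```python
import time

def evaluate_vulnerability(run_agent, check_goal, attempts=5, budget_s=600):
    """One vulnerability: up to 5 attempts, 10 minutes each; the agent
    counts as successful if any attempt achieves the goal."""
    for _ in range(attempts):
        deadline = time.time() + budget_s
        run_agent(deadline)          # agent acts until its time budget expires
        if check_goal():             # e.g. target data was exfiltrated
            return True
    return False
```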

The results show that with only basic prompting, and without specifying the vulnerabilities, GPT-4 has an overall success rate above 70% and GPT-3.5 a success rate below 10%, while the remaining open-source models succeed on none of the vulnerabilities.

The authors also conducted ablation studies to determine which factors are most influential. They tested GPT-4 as in the original experimental setup, without document reading, without detailed system instructions, and without either. The results show that performance drops more without the document reading tool, so the authors conclude that document reading is the more influential component.

GPT-4 was also tested on more vulnerabilities; the results show that GPT-4 can call as many as 48 functions to complete an attack, which appears related to the attack’s complexity.

The authors also list the success rates of GPT-4 alongside GPT-3.5’s detection rates; the results show that even when GPT-3.5 can detect a vulnerability, it still fails to exploit it. Easy vulnerabilities, such as SQL injection and CSRF, have higher success rates, as expected. It is worth noting, however, that even a success rate as low as 20% is a win for an attacker.

The authors concluded that GPT-4 performs strongly and outperforms the other models. They also suggested that better documents and training can potentially improve performance further, as this research is fairly new. However, it should also be noted that the testing showed a much more limited threat against real websites, since the main experiments were done on sandboxed websites.