Fig. 1 illustrates the general flow of AgentEval
TL;DR:

- We introduce AgentEval — the first version of the framework to assess the utility of any LLM-powered application crafted to assist users in specific tasks. AgentEval aims to simplify the evaluation process by automatically proposing a set of criteria tailored to the unique purpose of your application. This allows for a comprehensive assessment, quantifying the utility of your application against the suggested criteria.
- We demonstrate how AgentEval works, using a math problems dataset as an example, in the following notebook. Any feedback would be useful for future development. Please contact us on our Discord.

We introduce the first version of the AgentEval framework - a tool crafted to empower developers in swiftly gauging the utility of LLM-powered applications designed to help end users accomplish the desired task.
Fig. 2 provides an overview of the task taxonomy
Let's first look at an overview of the suggested task taxonomy that a multi-agent system can be designed for. In general, such tasks can be split into two types: those where success is clearly defined and those where it is not. In our AgentEval framework, we are currently focusing on tasks where success is clearly defined. Next, we will introduce the suggested framework.
## AgentEval Framework

We designed AgentEval (shown in Fig. 1), where we employ LLMs to help us understand, verify, and assess task utility for the multi-agent system. Namely:
The goal of `CriticAgent` is to suggest the list of criteria (Fig. 1) that can be used to assess task utility. This is an example of how `CriticAgent` can be defined using AutoGen:
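Below is a minimal sketch, assuming the `autogen` package and an OpenAI-style `config_list`; the system message is illustrative rather than the exact prompt AgentEval uses:

```python
import autogen

# Hypothetical LLM configuration; replace with your own config list.
config_list = autogen.config_list_from_json("OAI_CONFIG_LIST")

critic = autogen.AssistantAgent(
    name="critic",
    llm_config={"config_list": config_list},
    system_message="""You are a helpful assistant. You suggest criteria for evaluating different tasks.
The criteria should be distinguishable, quantifiable, and not redundant.
Return the criteria as a dictionary where each key is a criterion and its value is a dictionary of the form
{"description": <criterion description>, "accepted_values": <fine-grained levels for this criterion>}.
Return only the dictionary.""",
)
```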
The goal of `QuantifierAgent` is to quantify each of the suggested criteria (Fig. 1), providing us with an idea of the utility of this system for the given task. Here is an example of how it can be defined:
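Again, this is a sketch; the system message is an assumption about the kind of instruction a quantifier needs, not the verbatim AgentEval prompt:

```python
quantifier = autogen.AssistantAgent(
    name="quantifier",
    llm_config={"config_list": config_list},
    system_message="""You are a helpful assistant. You quantify the output of different tasks based on the given criteria.
The criteria are given as a dictionary where each key is a criterion and its value holds the criterion's
description and accepted values. For the given task execution, return a dictionary where the keys are the
criteria and the values are the assessed levels, chosen only from the accepted values of each criterion.
Return only the dictionary.""",
)
```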
## AgentEval Results based on Math Problems Dataset

As an example, here are the criteria suggested for the math problems dataset, together with their accepted values:

Criteria | Description | Accepted Values |
---|---|---|
Problem Interpretation | Ability to correctly interpret the problem | [“completely off”, “slightly relevant”, “relevant”, “mostly accurate”, “completely accurate”] |
Mathematical Methodology | Adequacy of the chosen mathematical or algorithmic methodology for the question | [“inappropriate”, “barely adequate”, “adequate”, “mostly effective”, “completely effective”] |
Calculation Correctness | Accuracy of calculations made and solutions given | [“completely incorrect”, “mostly incorrect”, “neither”, “mostly correct”, “completely correct”] |
Explanation Clarity | Clarity and comprehensibility of explanations, including language use and structure | [“not at all clear”, “slightly clear”, “moderately clear”, “very clear”, “completely clear”] |
Code Efficiency | Quality of code in terms of efficiency and elegance | [“not at all efficient”, “slightly efficient”, “moderately efficient”, “very efficient”, “extremely efficient”] |
Code Correctness | Correctness of the provided code | [“completely incorrect”, “mostly incorrect”, “partly correct”, “mostly correct”, “completely correct”] |
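To make the quantification step concrete, a quantified assessment of a single math problem could look like the dictionary below; the assigned levels are purely illustrative and are not actual results:

```python
# Illustrative QuantifierAgent-style output for one problem (values are made up, not measured results).
assessment = {
    "Problem Interpretation": "mostly accurate",
    "Mathematical Methodology": "completely effective",
    "Calculation Correctness": "mostly correct",
    "Explanation Clarity": "very clear",
    "Code Efficiency": "moderately efficient",
    "Code Correctness": "completely correct",
}
```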
Fig. 3 presents results based on the overall math problems dataset, where `_s` stands for successful cases and `_f` stands for failed cases.
AgentEval currently has a number of limitations which we are planning to overcome in the future:
- We recommend running `CriticAgent` at least two times and picking the criteria you think are important for your domain.
- The results of `QuantifierAgent` can vary with each run, so we recommend conducting multiple runs to observe the extent of result variations.

`CriticAgent` and `QuantifierAgent`
can be applied to the logs of any type of application, providing you with an in-depth understanding of the utility your solution brings to the user for a given task.
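As a rough sketch of how this end-to-end flow could look (reusing the `critic` and `quantifier` agents defined above; `task_description`, `successful_log`, `failed_log`, and `new_log` are hypothetical placeholders for your own application logs), one way to drive both agents is:

```python
user = autogen.UserProxyAgent(
    name="agenteval_user",
    human_input_mode="NEVER",        # fully automated, no human in the loop
    max_consecutive_auto_reply=0,    # stop after the assistant's single reply
    code_execution_config=False,
)

# Step 1: elicit criteria from the critic, given one successful and one failed execution log.
user.initiate_chat(
    critic,
    message=(
        f"Task: {task_description}\n"
        f"Successful execution:\n{successful_log}\n"
        f"Failed execution:\n{failed_log}"
    ),
)
criteria = user.last_message(critic)["content"]

# Step 2: quantify each criterion for a new execution log.
user.initiate_chat(
    quantifier,
    message=(
        f"Task: {task_description}\n"
        f"Criteria: {criteria}\n"
        f"Execution to assess:\n{new_log}"
    ),
)
assessment = user.last_message(quantifier)["content"]
```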
We would love to hear about how AgentEval works for your application. Any feedback would be useful for future development. Please contact us on our Discord.