Fig. 1 illustrates the general flow of AgentEval with the verification step.
TL;DR: AgentEval is a framework to assess the multi-dimensional utility of any LLM-powered application crafted to assist users in specific tasks. We have now embedded it as part of the AutoGen library to ease developer adoption.

AgentEval is a comprehensive framework designed to bridge the gap in assessing the utility of LLM-powered applications. It leverages recent advancements in LLMs to offer a scalable and cost-effective alternative to traditional human evaluations. The framework comprises three main agents: CriticAgent, QuantifierAgent, and VerifierAgent, each playing a crucial role in assessing the task utility of an application.
CriticAgent: Defining the Criteria
The CriticAgent’s primary function is to suggest a set of criteria for evaluating an application based on the task description and examples of successful and failed executions. For instance, in the context of a math tutoring application, the CriticAgent might propose criteria such as efficiency, clarity, and correctness. These criteria are essential for understanding the various dimensions of the application’s performance. It’s highly recommended that application developers validate the suggested criteria by leveraging their domain expertise.
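To make this concrete, here is a rough sketch of how criteria generation might be invoked through AutoGen’s agent_eval contrib module. The module paths, the Task model, the generate_criteria helper, and the fields on the returned criteria are assumptions about the current AutoGen API rather than guaranteed signatures, so check them against your installed version:

```python
# Sketch only: module paths, Task fields, generate_criteria's signature, and the
# fields on the returned criteria are assumptions about AutoGen's agent_eval
# contrib package; verify against your installed version before relying on them.
from autogen.agentchat.contrib.agent_eval.agent_eval import generate_criteria
from autogen.agentchat.contrib.agent_eval.task import Task

config_list = [{"model": "gpt-4", "api_key": "YOUR_API_KEY"}]  # placeholder LLM config

# Describe the task plus one successful and one failed execution trace.
task = Task(
    name="Math problem solving",
    description="Solve grade-school math problems with an LLM-powered assistant.",
    successful_response="<chat log of a correctly solved problem>",
    failed_response="<chat log of an incorrectly solved problem>",
)

# Internally this drives a CriticAgent, which proposes criteria such as
# efficiency, clarity, and correctness for the task described above.
criteria = generate_criteria(task=task, llm_config={"config_list": config_list})
for criterion in criteria:
    print(criterion.name, "-", criterion.description)
```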
QuantifierAgent: Quantifying the Performance
Once the criteria are established, the QuantifierAgent takes over to quantify how well the application performs against each criterion. This quantification process results in a multi-dimensional assessment of the application’s utility, providing a detailed view of its strengths and weaknesses.
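Continuing the sketch above, and again assuming the agent_eval contrib API (quantify_criteria and its parameters are not guaranteed signatures), quantifying a single logged execution might look like this:

```python
# Sketch only: quantify_criteria and its parameters are assumptions about
# AutoGen's agent_eval contrib API; verify against your installed version.
from autogen.agentchat.contrib.agent_eval.agent_eval import quantify_criteria

# `task`, `criteria`, and `config_list` come from the criteria-generation sketch above.
with open("math_problem_chat_log.json") as f:
    test_case = f.read()  # one logged execution of the application

quantifier_output = quantify_criteria(
    llm_config={"config_list": config_list},
    criteria=criteria,
    task=task,
    test_case=test_case,
    ground_truth="<reference answer, if available>",
)
# Per-criterion scores together form a multi-dimensional utility profile.
print(quantifier_output)
```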
VerifierAgent: Ensuring Robustness and Relevance
VerifierAgent ensures that the criteria used to evaluate utility are effective for the end user, maintaining both robustness and high discriminative power. It does this through two main actions: verifying the stability of the criteria and verifying their discriminative power.
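The snippet below is a purely hypothetical illustration of what these two checks amount to in practice, not the VerifierAgent implementation; `quantify` is a stand-in callable, assumed to return a mapping from criterion names to numeric scores:

```python
# Hypothetical illustration only; `quantify` stands in for whatever
# quantification call you use and is assumed to return {criterion_name: score}.
import statistics
from typing import Callable, Dict


def criteria_stability(quantify: Callable[[str], Dict[str, float]],
                       test_case: str, n_runs: int = 5) -> Dict[str, float]:
    """Re-quantify the same execution several times; a small spread per
    criterion suggests the criteria are consistently measurable."""
    runs = [quantify(test_case) for _ in range(n_runs)]
    return {name: statistics.pstdev([run[name] for run in runs]) for name in runs[0]}


def discriminative_power(quantify: Callable[[str], Dict[str, float]],
                         good_case: str, corrupted_case: str) -> Dict[str, float]:
    """Compare scores on a real execution against a deliberately corrupted one;
    criteria with discriminative power should score the corrupted run clearly lower."""
    good, bad = quantify(good_case), quantify(corrupted_case)
    return {name: good[name] - bad[name] for name in good}
```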
In AgentEval, the evaluation criteria can either be generated with the CriticAgent’s generate_criteria function or created manually; manually created (or previously saved) criteria let developers skip the generate_criteria step entirely.
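If you maintain criteria by hand or reuse criteria saved from an earlier run, a loading step along the following lines should suffice. Criterion.parse_json_str, the JSON layout, and the field names shown are assumptions about AutoGen’s agent_eval contrib API:

```python
# Sketch only: Criterion.parse_json_str and the field names are assumptions
# about AutoGen's agent_eval contrib API; verify against your installed version.
from autogen.agentchat.contrib.agent_eval.criterion import Criterion

with open("math_criteria.json") as f:
    criteria = Criterion.parse_json_str(f.read())  # hand-written or previously saved criteria

for criterion in criteria:
    print(criterion.name, criterion.accepted_values)

# These criteria can be passed straight to the quantification step shown earlier,
# bypassing criteria generation entirely.
```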