AgentEval: a multi-agent system for assessing utility of LLM-powered applications
This notebook demonstrates how AgentEval works in an offline scenario, using a math problem-solving task as an example.
AgentEval consists of two key steps:

- generate_criteria: an LLM-based function that generates a list of criteria to help evaluate the utility of a given task.
- quantify_criteria: a function that quantifies the performance of any sample task based on the criteria generated in the generate_criteria step, in the way sketched right after this list.
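Conceptually, the quantification step maps a sample plus the generated criteria to a per-criterion assessment. The sketch below is purely illustrative; the criterion names and values are assumptions, not output from the library.

```python
# Illustrative only: quantify_criteria conceptually maps
#   (task, criteria, sample_to_be_graded)  ->  {criterion: assessed_value, ...}
# For a math problem-solving task, the criteria might include accuracy,
# efficiency, and clarity, and a quantified sample could then look like:
example_quantified_sample = {
    "accuracy": 5,     # hypothetical value on a criterion-specific accepted scale
    "efficiency": 4,
    "clarity": 3,
}
```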
This example requires Python>=3.9. To run this notebook example, please install ag2, Docker, and OpenAI:
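For example, the dependencies can be installed from a notebook cell; the package names below follow the sentence above and may need adjusting (or version pinning) for your environment:

```python
import subprocess
import sys

# Install the packages named above (a sketch; pin versions as appropriate).
subprocess.check_call([sys.executable, "-m", "pip", "install", "ag2", "docker", "openai"])
```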
The config_list_from_json function loads a list of configurations from an environment variable or a json file. It first looks for an environment variable with a specified name; the value of that environment variable needs to be a valid json string. If the variable is not found, it looks for a json file with the same name. It then filters the configs by filter_dict.
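For instance (the environment-variable name OAI_CONFIG_LIST and the model filter below are common AutoGen-style defaults, shown here as assumptions):

```python
import autogen

# Load LLM configurations from the OAI_CONFIG_LIST environment variable, or from
# a json file of the same name, and keep only the models listed in filter_dict.
config_list = autogen.config_list_from_json(
    "OAI_CONFIG_LIST",
    filter_dict={"model": ["gpt-4"]},
)
llm_config = {"config_list": config_list}
```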
To generate the criteria, we use two example responses to a math problem: one where the problem was not solved successfully, i.e., agenteval-in-out/response_failed.txt, and one where it was solved successfully, i.e., agenteval-in-out/response_successful.txt.
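A sketch of the criteria-generation step, assuming the Task container and generate_criteria function exposed by AG2's agent_eval contrib module; check your installed version for the exact import paths and signatures:

```python
from autogen.agentchat.contrib.agent_eval.agent_eval import generate_criteria
from autogen.agentchat.contrib.agent_eval.task import Task

# Describe the task and provide the successful and failed example responses
# so the critic can propose evaluation criteria.
task = Task(
    name="Math problem solving",
    description="Solve the given math problem as accurately and concisely as possible.",
    successful_response=open("agenteval-in-out/response_successful.txt").read(),
    failed_response=open("agenteval-in-out/response_failed.txt").read(),
)

criteria = generate_criteria(task=task, llm_config={"config_list": config_list})
for criterion in criteria:
    print(criterion.name, criterion.description)
```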
QuantifierAgent: once the criteria are generated, we quantify how a new sample performs on each criterion using quantify_criteria from agent_eval. Again, you can use your own defined criteria in criteria_file. Here we use a sample test case, sample_test_case.json, for demonstration.
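A corresponding sketch of the quantification step, again assuming the AG2 agent_eval API; criteria and task come from the criteria-generation sketch above, and the sample file name follows the text:

```python
from autogen.agentchat.contrib.agent_eval.agent_eval import quantify_criteria

# Grade the demonstration sample against the generated (or user-provided)
# criteria; to use your own criteria_file, parse it into Criterion objects first.
test_case = open("sample_test_case.json").read()

quantifier_output = quantify_criteria(
    llm_config={"config_list": config_list},
    criteria=criteria,
    task=task,
    test_case=test_case,
    ground_truth="",  # optionally supply the known answer for the sample
)
print(quantifier_output)
```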
Finally, we run AgentEval on the logs; the estimated performance figure is saved to ../test/test_files/agenteval-in-out/estimated_performance.png.
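As a final illustration, the per-criterion results from the quantifier can be aggregated over the logged test cases and saved as a figure; the numbers below are placeholders, not real results:

```python
import matplotlib.pyplot as plt

# Placeholder aggregation: mean estimated performance per criterion over the
# logged test cases (replace with your own aggregation of quantifier outputs).
estimated_performance = {"accuracy": 0.82, "efficiency": 0.67, "clarity": 0.74}

plt.bar(list(estimated_performance.keys()), list(estimated_performance.values()))
plt.ylabel("Estimated performance")
plt.title("AgentEval: estimated performance per criterion")
plt.savefig("../test/test_files/agenteval-in-out/estimated_performance.png")
```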