Your LLM API keys should be placed in an `OAI_CONFIG_LIST`, as appropriate.
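For reference, an `OAI_CONFIG_LIST` is, by AutoGen convention, a JSON list of model configurations. A minimal sketch, assuming keys are supplied via a local file in the working directory (the model name and key below are placeholders), might look like:

```bash
# Write a minimal OAI_CONFIG_LIST to the working directory.
# The model name and API key are placeholders; substitute your own.
cat > OAI_CONFIG_LIST << 'EOF'
[
    {
        "model": "gpt-4",
        "api_key": "YOUR_OPENAI_API_KEY"
    }
]
EOF
```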
AutoGenBench is distributed on PyPI and installed with `pip`; installation can be achieved as follows:
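```bash
# Install AutoGenBench from PyPI
pip install autogenbench
```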
Once AutoGenBench is installed, a typical session involves just a few commands (a consolidated example follows this list):

- `autogenbench clone HumanEval` downloads and expands the HumanEval benchmark scenario.
- `cd HumanEval; cat README.md` navigates to the benchmark directory and prints the README (which you should always read!).
- `autogenbench run --subsample 0.1 --repeat 3 Tasks/human_eval_two_agents.jsonl` runs a 10% subsample of the tasks defined in `Tasks/human_eval_two_agents.jsonl`. Each task is run 3 times.
- `autogenbench tabulate results/human_eval_two_agents` tabulates the results of the run.
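Putting these steps together, a complete session might look like the following (the benchmark, subsample rate, and task file are the ones described above):

```bash
# Download and expand the HumanEval benchmark scenario
autogenbench clone HumanEval

# Enter the benchmark directory and read its README
cd HumanEval
cat README.md

# Run a 10% subsample of the tasks, repeating each task 3 times
autogenbench run --subsample 0.1 --repeat 3 Tasks/human_eval_two_agents.jsonl

# Tabulate the results of the run
autogenbench tabulate results/human_eval_two_agents
```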
After running the above `tabulate` command, you should see a table of results indicating, for each task and each of the three repetitions, whether the agents succeeded.
These results reflect the performance of the specific configuration under test, here the `TwoAgents` template. It is important to remember that AutoGenBench evaluates specific end-to-end configurations of agents (as opposed to evaluating a model or cognitive framework more generally).
Finally, complete execution traces and logs can be found in the `Results` folder. See the AutoGenBench README for more details about command-line options and output formats.
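For example, the artifacts from the run above can be browsed with standard shell tools (the path below is the one passed to `tabulate` earlier; the exact file layout may vary between versions):

```bash
# Recursively list the execution traces and logs produced by the run
ls -R results/human_eval_two_agents
```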
Each of these commands also offers extensive in-line help via:
- `autogenbench --help`
- `autogenbench clone --help`
- `autogenbench run --help`
- `autogenbench tabulate --help`