

- Just by tuning the inference parameters like model, number of responses, temperature etc. without changing any model weights or prompt, the baseline accuracy of untuned gpt-4 can be improved by 20% in high school math competition problems.
- For easy problems, the tuned gpt-3.5-turbo model vastly outperformed untuned gpt-4 in accuracy (e.g., 90% vs. 70%) and cost efficiency. For hard problems, the tuned gpt-4 is much more accurate (e.g., 35% vs. 20%) and less expensive than untuned gpt-4.
- AutoGen can help with model selection, parameter tuning, and cost-saving in LLM applications.
Experiment Setup
We use AutoGen to select between the following models with a target inference budget $0.02 per instance:- gpt-3.5-turbo, a relatively cheap model that powers the popular ChatGPT app
- gpt-4, the state of the art LLM that costs more than 10 times of gpt-3.5-turbo
- temperature: The parameter that controls the randomness of the output text. A higher temperature means more diversity but less coherence. We search for the optimal temperature in the range of [0, 1].
- top_p: The parameter that controls the probability mass of the output tokens. Only tokens with a cumulative probability less than or equal to top-p are considered. A lower top-p means more diversity but less coherence. We search for the optimal top-p in the range of [0, 1].
- max_tokens: The maximum number of tokens that can be generated for each output. We search for the optimal max length in the range of [50, 1000].
- n: The number of responses to generate. We search for the optimal n in the range of [1, 100].
- prompt: We use the template: “{problem} Solve the problem carefully. Simplify your answer as much as possible. Put the final answer in \boxed{}.” where {problem} will be replaced by the math problem instance.
Experiment Results
The first figure in this blog post shows the average accuracy and average inference cost of each configuration on the level 2 Algebra test set. Surprisingly, the tuned gpt-3.5-turbo model is selected as a better model and it vastly outperforms untuned gpt-4 in accuracy (92% vs. 70%) with equal or 2.5 times higher inference budget. The same observation can be obtained on the level 3 Algebra test set.


Analysis and Discussion
While gpt-3.5-turbo demonstrates competitive accuracy with voted answers in relatively easy algebra problems under the same inference budget, gpt-4 is a better choice for the most difficult problems. In general, through parameter tuning and model selection, we can identify the opportunity to save the expensive model for more challenging tasks, and improve the overall effectiveness of a budget-constrained system. There are many other alternative ways of solving math problems, which we have not covered in this blog post. When there are choices beyond the inference parameters, they can be generally tuned viaflaml.tune
.
The need for model selection, parameter tuning and cost saving is not specific to the math problems. The Auto-GPT project is an example where high cost can easily prevent a generic complex task to be accomplished as it needs many LLM inference calls.