Your resource for web content, online publishing
and the distribution of digital products.
S M T W T F S
1
 
2
 
3
 
4
 
5
 
6
 
7
 
8
 
9
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
27
 
28
 
29
 
30
 
 
 
 
 
 

Comparative Analysis of Prompt Optimization on BBH Tasks

DATE POSTED:September 25, 2024

:::info Authors:

(1) Chengrun Yang, Google DeepMind and Equal contribution;

(2) Xuezhi Wang, Google DeepMind;

(3) Yifeng Lu, Google DeepMind;

(4) Hanxiao Liu, Google DeepMind;

(5) Quoc V. Le, Google DeepMind;

(6) Denny Zhou, Google DeepMind;

(7) Xinyun Chen, Google DeepMind and Equal contribution.

:::

Table of Links

Abstract and 1. Introduction

2 Opro: Llm as the Optimizer and 2.1 Desirables of Optimization by Llms

2.2 Meta-Prompt Design

3 Motivating Example: Mathematical Optimization and 3.1 Linear Regression

3.2 Traveling Salesman Problem (TSP)

4 Application: Prompt Optimization and 4.1 Problem Setup

4.2 Meta-Prompt Design

5 Prompt Optimization Experiments and 5.1 Evaluation Setup

5.2 Main Results

5.3 Ablation Studies

5.4 Overfitting Analysis in Prompt Optimization and 5.5 Comparison with Evoprompt

6 Related Work

7 Conclusion, Acknowledgments and References

A Some Failure Cases

B Prompting Formats for Scorer Llm

C Meta-Prompts and C.1 Meta-Prompt for Math Optimization

C.2 Meta-Prompt for Prompt Optimization

D Prompt Optimization Curves on the Remaining Bbh Tasks

E Prompt Optimization on Bbh Tasks – Tabulated Accuracies and Found Instructions

E PROMPT OPTIMIZATION ON BBH TASKS – TABULATED ACCURACIES AND FOUND INSTRUCTIONS E.1 PALM 2-L-IT AS OPTIMIZER, OPTIMIZATION STARTING FROM THE EMPTY STRING

Table 8 and 9 show the instructions found by prompt optimization. A comparison of their accuracies with baselines “Let’s think step by step.” (Kojima et al., 2022), “Let’s work this out in a step by step way to be sure we have the right answer.” (Zhou et al., 2022b), and the empty string is in Table 7; a visualization is in Section 5.2 Figure 5.

\  our found instructions with the PaLM 2-L-IT optimizer vs baseline. The optimization starts from the empty string. Because of the 20-80 train-test split, we show accuracies with the format “training / test / overall (training + test)”. The PaLM 2-L scores are from A_begin instructions; the text-bison scores are from Q_begin instructions. Bold numbers indicate the best for the corresponding task.

\  BBH task-wise instructions found by prompt optimization with the PaLM 2-L scorer and the PaLM 2-L-IT optimizer. The optimization starts from the empty string.

\  BBH task-wise instructions found by prompt optimization with the text-bison scorer and the PaLM 2-L-IT optimizer. The optimization starts from the empty string.

E.2 G P T-3.5-T U R B O AS OPTIMIZER, OPTIMIZATION STARTING FROM THE EMPTY STRING

Table 11, 12 and 13 show the instructions found by prompt optimization. Their accuracies are listed in Table 10. Figure 25 visualizes the difference between their accuracies and those of the baselines “Let’s think step by step.” and the empty string. The optimizations find instructions better than the empty starting point, and most of the found instructions are better than “Let’s think step by step”.

\ One caveat in the Abegin instructions (Table 11) is that a lot of the found instructions are imperative or interrogative sentences that are more suitable to be put into “Q:” rather than “A:”, like “Solve the sequence by properly closing the parentheses.” for dycklanguages and “Which movie option from the given choices …?” for movierecommendation. Such styles appear more often here than the PaLM 2-L-IT optimizer results (Table 8), showing PaLM 2-L-IT understands the needed style better. In Section E.3, we show the Abegin optimization results with the non-empty starting point “Let’s solve the problem.”. Most results there are declarative sentences – more suitable for A_begin.

\  On 23 BBH tasks, the accuracy differences among instructions found by prompt optimization (with the gpt-3.5 turbo optimizer), “Let’s think step by step.”, and the empty string (optimization starting point).

\  Accuracies on BBH tasks with the gpt-3.5-turbo optimizer that starts from the empty string. The PaLM 2-L scores are from A_begin (left) instructions; the text-bison scores include Q_begin (left) and Q_end (right) instructions.

\  BBH task-wise instructions found by prompt optimization with the PaLM 2-L scorer and the gpt-3.5-turbo optimizer. The optimizations start from the empty string.

\  BBH task-wise Q_begin instructions found by prompt optimization with the text-bison scorer and the gpt-3.5-turbo optimizer. The optimizations start from the empty string.

\  BBH task-wise Q_end instructions found by prompt optimization with the text-bison scorer and the gpt-3.5-turbo optimizer. The optimizations start from the empty string.

E.3 PALM 2-L AS SCORER, G P T-3.5-T U R B O AS OPTIMIZER, OPTIMIZATION STARTING FROM “LET’S SOLVE THE PROBLEM.”

Figure 26 and Table 14 compare the accuracies of found instructions vs “Let’s solve the problem.”, “Let’s think step by step.”, and the instructions in Table 11. Table 15 details the found instructions.

\ The “Let’s” pattern appears more often in the found instructions because of the starting points, and the instructions are more often declarative that are more suitable for A_begin, even if some are semantically far from “Let’s solve the problem”. In fact, “Let’s” was adopted by Zhou et al. (2022b) as a fixed pattern in generated prompts, possibly because of the same reason.

\  On 23 BBH tasks, the accuracy differences among instructions found by prompt optimization (with the text-bison scorer and the gpt-3.5-turbo optimizer), “Let’s think step by step.”, and “Let’s solve the problem.” (optimization starting point). The found instructions mostly outperform the “Let’s think step by step.” baseline, the “Let’s solve the problem.” starting point, and the instructions in Table 11 found by prompt optimization from the empty string.

\  Accuracies on BBH tasks with the PaLM 2-L scorer and the gpt-3.5-turbo optimizer that starts from “Let’s solve the problem”. The scores are from A_begin instructions.

\  BBH task-wise Q_begin instructions found by prompt optimization with the PaLM 2-L scorer and the gpt-3.5-turbo optimizer. The optimizations start from “Let’s solve the problem”.

\

:::info This paper is available on arxiv under CC0 1.0 DEED license.

:::

\