1. Installation
There are two ways to install promptbench: install it via pip if you just want to use it as-is, or install it from source if you want to make changes and experiment.
1. Install via pip
We provide a Python package, promptbench, for users who want to start evaluating quickly.
pip install promptbench
2. Install via GitHub
git clone git@github.com:microsoft/promptbench.git
cd promptbench
conda create --name promptbench python=3.9
conda activate promptbench
pip install -r requirements.txt
This installs only the basic Python packages. Prompt attacks additionally require installing textattack.
2. Evaluation Pipeline
- Import the package
import promptbench as pb
- Load a dataset
# print all supported datasets in promptbench
print('All supported datasets: ')
print(pb.SUPPORTED_DATASETS)
# load a dataset, sst2, for instance.
# if the dataset is not available locally, it will be downloaded automatically.
dataset = pb.DatasetLoader.load_dataset("sst2")
# print the first 5 examples
dataset[:5]
Output:
All supported datasets:
['cola', 'sst2', 'qqp', 'mnli', 'mnli_matched', 'mnli_mismatched', 'qnli', 'wnli', 'rte', 'mrpc', 'mmlu', 'squad_v2', 'un_multi', 'iwslt', 'math', 'bool_logic', 'valid_parentheses', 'gsm8k', 'csqa', 'bigbench_date', 'bigbench_object_tracking']
[{'content': "it 's a charming and often affecting journey . ", 'label': 1},
{'content': 'unflinchingly bleak and desperate ', 'label': 0},
{'content': 'allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker . ',
'label': 1},
{'content': "the acting , costumes , music , cinematography and sound are all astounding given the production 's austere locales . ",
'label': 1},
{'content': "it 's slow -- very , very slow . ", 'label': 0}]
- Load a model
LLM models can then be loaded easily through promptbench.
# print all supported models in promptbench
print('All supported models: ')
print(pb.SUPPORTED_MODELS)
# load a model, flan-t5-large, for instance.
model = pb.LLMModel(model='google/flan-t5-large', max_new_tokens=10)
All supported models:
['google/flan-t5-large', 'llama2-7b', 'llama2-7b-chat', 'llama2-13b', 'llama2-13b-chat', 'llama2-70b', 'llama2-70b-chat', 'phi-1.5', 'gpt-3.5-turbo', 'gpt-4', 'gpt-4-1106-preview', 'gpt-3.5-turbo-1106', 'vicuna-7b', 'vicuna-13b', 'vicuna-13b-v1.3', 'google/flan-ul2']
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
- Construct prompts
Prompts are the key interface for interacting with LLMs. You can easily construct prompts by calling the Prompt API.
# Prompt API supports a list, so you can pass multiple prompts at once.
prompts = pb.Prompt(["Classify the sentence as positive or negative: {content}",
"Determine the emotion of the following sentence as positive or negative: {content}"
])
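Each prompt is a template with a {content} placeholder that gets filled with an example's text at evaluation time. As a rough sketch of the assumed substitution behavior (promptbench's own `pb.InputProcess.basic_format` handles this; plain `str.format` below is just an illustration):

```python
# Illustrative sketch: fill a prompt template with one dataset example.
# The real substitution is done by pb.InputProcess.basic_format.
template = "Classify the sentence as positive or negative: {content}"
example = {"content": "it 's a charming and often affecting journey . ", "label": 1}

input_text = template.format(content=example["content"])
print(input_text)
# Classify the sentence as positive or negative: it 's a charming and often affecting journey . 
```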
You may need to define a projection function for the model output, because the output format defined in the prompt can differ from the raw model output. For example, in the sst2 dataset the labels are 0 and 1, representing "negative" and "positive", but the model outputs the strings "negative" and "positive". We need a projection function to map the model output to the labels.
def proj_func(pred):
    mapping = {
        "positive": 1,
        "negative": 0
    }
    return mapping.get(pred, -1)
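Note that `mapping.get(pred, -1)` returns -1 for any output that matches neither label, so malformed generations are simply scored as incorrect. A quick self-contained check:

```python
# Projection function from the tutorial: map model output strings to sst2 labels.
def proj_func(pred):
    mapping = {
        "positive": 1,
        "negative": 0
    }
    return mapping.get(pred, -1)

print(proj_func("positive"))  # 1
print(proj_func("negative"))  # 0
print(proj_func("neutral"))   # -1 (unmapped output, counted as wrong)
```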
- Run the evaluation with prompts, dataset, and model
You can perform a standard evaluation using the loaded prompts, dataset, and model.
from tqdm import tqdm

for prompt in prompts:
    preds = []
    labels = []
    for data in tqdm(dataset):
        # process input
        input_text = pb.InputProcess.basic_format(prompt, data)
        label = data['label']
        raw_pred = model(input_text)
        # process output
        pred = pb.OutputProcess.cls(raw_pred, proj_func)
        preds.append(pred)
        labels.append(label)

    # evaluate
    score = pb.Eval.compute_cls_accuracy(preds, labels)
    print(f"{score:.3f}, {prompt}")
100%|██████████| 872/872 [02:16<00:00, 6.37it/s]
0.947, Classify the sentence as positive or negative: {content}
100%|██████████| 872/872 [02:18<00:00, 6.29it/s]
0.947, Determine the emotion of the following sentence as positive or negative: {content}
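The accuracy metric here is presumably the fraction of predictions that exactly match the gold labels. A minimal equivalent sketch (the helper below is illustrative, not part of the promptbench API):

```python
# Illustrative stand-in for pb.Eval.compute_cls_accuracy, assumed to be
# plain exact-match accuracy: correct predictions / total examples.
def cls_accuracy(preds, labels):
    correct = sum(p == l for p, l in zip(preds, labels))
    return correct / len(labels)

# Two of four predictions match (-1 from the projection never matches a label).
print(f"{cls_accuracy([1, 0, 1, -1], [1, 0, 0, 1]):.3f}")  # 0.500
```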