AIGC | Bridging the Last Mile of Private LLM Customization: Dynamic Evaluation with PromptBench

1. Installation

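There are two ways to install promptbench. If you only want to use it as is, install it with pip. If you want to modify the code and experiment with it, install it from source.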

Install via pip

For users who want to start evaluating quickly, a Python package named promptbench is available.

pip install promptbench

Install from GitHub


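git clone git@github.com:microsoft/promptbench.git
cd promptbench
conda create --name promptbench python=3.9
# activate the new environment so the requirements are installed into it
conda activate promptbench
pip install -r requirements.txt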

This installs only the basic Python packages. To run prompt attacks, you also need to install TextAttack.
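After either installation route, it can help to confirm that the packages are importable before starting an evaluation. The following is a minimal sketch that uses only the standard library; the helper name `is_installed` is hypothetical and is not part of promptbench.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the top-level package can be found in the current environment."""
    return importlib.util.find_spec(package_name) is not None


# Check the two packages mentioned above (TextAttack is imported as `textattack`).
for name in ("promptbench", "textattack"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If `promptbench` is reported as missing, the most likely cause is that the install ran in a different environment from the one currently active.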

2. Evaluation workflow

Import the package


import promptbench as pb

Load a dataset


# print all supported datasets in promptbench
print('All supported datasets: ')
print(pb.SUPPORTED_DATASETS)

# load a dataset, sst2, for instance.
# if the dataset is not available locally, it will be downloaded automatically.
dataset = pb.DatasetLoader.load_dataset("sst2")

# print the first 5 examples
dataset[:5]

Output


All supported datasets: 
['cola', 'sst2', 'qqp', 'mnli', 'mnli_matched', 'mnli_mismatched', 'qnli', 'wnli', 'rte', 'mrpc', 'mmlu', 'squad_v2', 'un_multi', 'iwslt', 'math', 'bool_logic', 'valid_parentheses', 'gsm8k', 'csqa', 'bigbench_date', 'bigbench_object_tracking']

[{'content': "it 's a charming and often affecting journey . ", 'label': 1},
 {'content': 'unflinchingly bleak and desperate ', 'label': 0},
 {'content': 'allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker . ',
  'label': 1},
 {'content': "the acting , costumes , music , cinematography and sound are all astounding given the production 's austere locales . ",
  'label': 1},
 {'content': "it 's slow -- very , very slow . ", 'label': 0}]
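Each record is a plain dictionary with a `content` string and an integer `label`, so ordinary Python tools are enough for a quick sanity check. As a small illustration, the sketch below counts the labels in the five records shown above. The records are copied from that output rather than loaded through promptbench, so the snippet runs on its own.

```python
from collections import Counter

# The first five sst2 records, copied from the output above.
samples = [
    {"content": "it 's a charming and often affecting journey . ", "label": 1},
    {"content": "unflinchingly bleak and desperate ", "label": 0},
    {"content": "allows us to hope that nolan is poised to embark a major career "
                "as a commercial yet inventive filmmaker . ", "label": 1},
    {"content": "the acting , costumes , music , cinematography and sound are all "
                "astounding given the production 's austere locales . ", "label": 1},
    {"content": "it 's slow -- very , very slow . ", "label": 0},
]

# Count how many records carry each label (1 = positive, 0 = negative).
label_counts = Counter(record["label"] for record in samples)
print(label_counts[1], "positive,", label_counts[0], "negative")
```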

Load a model

Next, load an LLM through promptbench.


# print all supported models in promptbench
print('All supported models: ')
print(pb.SUPPORTED_MODELS)

# load a model, flan-t5-large, for instance.
model = pb.LLMModel(model='google/flan-t5-large', max_new_tokens=10)

Output

All supported models: 
['google/flan-t5-large', 'llama2-7b', 'llama2-7b-chat', 'llama2-13b', 'llama2-13b-chat', 'llama2-70b', 'llama2-70b-chat', 'phi-1.5', 'gpt-3.5-turbo', 'gpt-4', 'gpt-4-1106-preview', 'gpt-3.5-turbo-1106', 'vicuna-7b', 'vicuna-13b', 'vicuna-13b-v1.3', 'google/flan-ul2']

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Construct prompts

Prompts are the key interface for interacting with an LLM. You can construct them by calling the Prompt API.


# Prompt API supports a list, so you can pass multiple prompts at once.
prompts = pb.Prompt(["Classify the sentence as positive or negative: {content}",
                     "Determine the emotion of the following sentence as positive or negative: {content}"
                     ])
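Each template contains a `{content}` placeholder whose name matches the `content` key of the dataset records. This article does not show how promptbench fills the placeholder internally. The sketch below assumes plain `str.format` substitution, and `fill_template` is a hypothetical helper used only for illustration.

```python
def fill_template(template: str, record: dict) -> str:
    """Substitute record fields into the template.

    Keys the template does not mention (such as 'label') are ignored by str.format.
    """
    return template.format(**record)


template = "Classify the sentence as positive or negative: {content}"
record = {"content": "unflinchingly bleak and desperate ", "label": 0}

print(fill_template(template, record))
```

Under this assumption, a template that refers to a key missing from the record would raise a `KeyError`, so placeholder names need to match the dataset's field names.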

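You may need to define a projection function for the model output, because the output format requested in the prompt can differ from the dataset's label format. In the sst2 dataset, for example, the labels are 0 and 1, standing for "negative" and "positive", while the model outputs the words "negative" and "positive". A projection function maps the model output to the labels.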


def proj_func(pred):
    mapping = {
        "positive": 1,
        "negative": 0
    }
    return mapping.get(pred, -1)
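Note that `proj_func` matches strings exactly and returns -1 for anything else, so any prediction it does not recognize will count as wrong. The sketch below shows that behaviour on a few inputs. It also adds `proj_func_normalized`, a hypothetical variant that trims and lowercases the text first. Whether promptbench already normalizes the raw output before calling the projection function is not stated in this article, so treat the extra normalization as optional.

```python
def proj_func(pred):
    mapping = {
        "positive": 1,
        "negative": 0
    }
    return mapping.get(pred, -1)


def proj_func_normalized(pred):
    """Hypothetical variant: trim whitespace and trailing punctuation, lowercase, then map."""
    cleaned = pred.strip().strip(".!").lower()
    return proj_func(cleaned)


print(proj_func("positive"))                 # 1
print(proj_func("Positive."))                # -1, exact match fails
print(proj_func_normalized(" Positive. "))   # 1
print(proj_func_normalized("neutral"))       # -1, unknown label
```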

Run the evaluation with the prompts, dataset, and model

With the prompts, dataset, and model loaded, you can run a standard evaluation.


from tqdm import tqdm
for prompt in prompts:
    preds = []
    labels = []
    for data in tqdm(dataset):
        # process input
        input_text = pb.InputProcess.basic_format(prompt, data)
        label = data['label']
        raw_pred = model(input_text)
        # process output
        pred = pb.OutputProcess.cls(raw_pred, proj_func)
        preds.append(pred)
        labels.append(label)
    
    # evaluate
    score = pb.Eval.compute_cls_accuracy(preds, labels)
    print(f"{score:.3f}, {prompt}")

Output

100%|██████████| 872/872 [02:16<00:00,  6.37it/s]


0.947, Classify the sentence as positive or negative: {content}


100%|██████████| 872/872 [02:18<00:00,  6.29it/s]

0.947, Determine the emotion of the following sentence as positive or negative: {content}
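The score printed for each prompt is a classification accuracy. The implementation of `pb.Eval.compute_cls_accuracy` is not shown in this article. The sketch below assumes the usual definition, the fraction of predictions equal to their labels, and `cls_accuracy` is a hypothetical stand-in rather than the library function.

```python
def cls_accuracy(preds, labels):
    """Return the fraction of predictions that equal the corresponding label."""
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(p == l for p, l in zip(preds, labels))
    return correct / len(labels)


# A prediction of -1 (unrecognized model output) can never match a 0/1 label,
# so it is counted as an error.
preds = [1, 0, 1, -1]
labels = [1, 0, 0, 1]
print(f"{cls_accuracy(preds, labels):.3f}")  # 0.500
```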
