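1. Introduction
pipeline is an abstraction in the Hugging Face transformers library that offers a minimal way to run inference with large models. It groups all models into four categories, Audio, Computer Vision, Natural Language Processing (NLP), and Multimodal, covering 28 task types and a total of 320,000 models.
This article is the third in the NLP series and covers summarization. The Hugging Face hub hosts 2,000 summarization models.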
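2. Summarization
2.1 Overview
Summarization is the task of producing a shorter version of a document while preserving its important information. Some models extract text from the original input, while others generate entirely new text.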
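To make the "extract text from the original input" case concrete, here is a toy extractive summarizer in plain Python. It is only an illustrative sketch: the function name `extractive_summary` and the word-frequency scoring are assumptions made for this example, and this is not how BART or the models discussed later produce summaries.

```python
import re
from collections import Counter


def extractive_summary(text: str, num_sentences: int = 1) -> str:
    """Toy extractive summarizer: keep the sentences with the most frequent words."""
    # Split into sentences after '.', '!' or '?' followed by whitespace (rough heuristic).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Count how often each lowercase word occurs in the whole text.
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / max(len(words), 1)

    # Pick the top-scoring sentences, then restore their original order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in keep)
```

An abstractive model, by contrast, writes new sentences rather than copying existing ones, which is what the pipeline example later in this article does.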
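2.2 Combining BERT and GPT: BART
BART is a Transformer encoder-decoder (seq2seq) model developed by Facebook, with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder. BART is pre-trained by (1) corrupting text with an arbitrary noising function and (2) learning a model to reconstruct the original text. BART is particularly effective when fine-tuned for text generation (e.g. summarization, translation), but it also works well for comprehension tasks (e.g. text classification, question answering). The checkpoint used in this article, bart-large-cnn, has been fine-tuned on CNN Daily Mail, a large collection of text-summary pairs.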
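The "corrupt, then reconstruct" idea can be sketched with a toy noising function. The example below uses random token masking on whitespace-split words purely as an assumption for illustration; the helper names, the `"<mask>"` placeholder, and the masking ratio are hypothetical and do not reproduce BART's actual pre-training recipe.

```python
import random
from typing import List, Tuple


def mask_tokens(tokens: List[str], mask_ratio: float = 0.3,
                mask_token: str = "<mask>", seed: int = 0) -> List[str]:
    """Replace each token with mask_token with probability mask_ratio."""
    rng = random.Random(seed)  # local generator so results are reproducible
    return [mask_token if rng.random() < mask_ratio else tok for tok in tokens]


def make_denoising_pair(sentence: str, **noise_options) -> Tuple[str, str]:
    """Return (corrupted_input, target): the model would learn to map one to the other."""
    tokens = sentence.split()
    corrupted = " ".join(mask_tokens(tokens, **noise_options))
    return corrupted, sentence


# Usage: the corrupted text is the encoder input, the original text is the decoder target.
source, target = make_denoising_pair("BART learns to reconstruct the original text")
```

During pre-training, the encoder would read the corrupted text and the decoder would be trained to emit the original; fine-tuning for summarization replaces this pair with (article, summary).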
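2.3 Use cases
- Automatic summarization: use natural language processing (NLP) techniques to extract the most important paragraphs or sentences from long articles.
- Text classification: categorize text by its content, such as news, blogs, or product descriptions.
- Information retrieval: use summaries to help users find relevant information quickly.
- Question answering: use summarization techniques to generate answers to questions.
- Text analysis: extract valuable information and knowledge from large volumes of text.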
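2.4 Pipeline parameters
2.4.1 Pipeline instantiation parameters
- model (PreTrainedModel or TFPreTrainedModel): the model the pipeline uses to make predictions. For PyTorch it must inherit from PreTrainedModel; for TensorFlow, from TFPreTrainedModel.
- tokenizer (PreTrainedTokenizer): the tokenizer the pipeline uses to encode data for the model. This object inherits from PreTrainedTokenizer.
- modelcard (str or ModelCard, optional): the model card attributed to the model for this pipeline.
- framework (str, optional): the framework to use, either "pt" for PyTorch or "tf" for TensorFlow. The specified framework must be installed.
- task (str, defaults to ""): the task identifier for the pipeline.
- num_workers (int, optional, defaults to 8): the number of workers to use when the pipeline uses a DataLoader (when passing a dataset, on GPU for a PyTorch model).
- batch_size (int, optional, defaults to 1): the batch size to use when the pipeline uses a DataLoader (when passing a dataset, on GPU for a PyTorch model). Batching is not always beneficial for inference; see "Batching with pipelines" in the transformers documentation.
- args_parser (ArgumentHandler, optional): a reference to the object in charge of parsing the supplied pipeline parameters.
- device (int, optional, defaults to -1): the device ordinal for CPU/GPU support. Setting it to -1 uses the CPU; a non-negative integer runs the model on the CUDA device with that ID. You can also pass a native torch.device or a str.
- torch_dtype (str or torch.dtype, optional): sent directly as model_kwargs (just a simpler shortcut) to use the available precision for this model (torch.float16, torch.bfloat16, ... or "auto").
- binary_output (bool, optional, defaults to False): flag indicating whether the pipeline output should be in a serialized format (i.e. pickle) or raw output data (e.g. text).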
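As a minimal sketch of how a few of these instantiation parameters fit together, the helper below assembles keyword arguments for `pipeline` from the `task`, `model`, `device`, and `batch_size` entries listed above. The helper names `summarizer_kwargs` and `build_summarizer` are hypothetical, and `transformers` is imported lazily so that the first function can be inspected without downloading a model.

```python
def summarizer_kwargs(use_gpu: bool = False, gpu_index: int = 0, batch_size: int = 1) -> dict:
    """Assemble keyword arguments for transformers.pipeline (hypothetical helper)."""
    return {
        "task": "summarization",
        "model": "facebook/bart-large-cnn",
        # -1 selects the CPU; a non-negative integer selects that CUDA device.
        "device": gpu_index if use_gpu else -1,
        "batch_size": batch_size,
    }


def build_summarizer(**options):
    """Create the pipeline; this downloads the model files on first use."""
    from transformers import pipeline  # lazy import: only needed when actually building

    return pipeline(**summarizer_kwargs(**options))


# Usage (assumes transformers and a backend such as PyTorch are installed):
# summarizer = build_summarizer(use_gpu=True, gpu_index=0, batch_size=4)
```

Keeping the arguments in a plain dictionary makes it easy to log or adjust the configuration before the comparatively expensive model load.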
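2.4.2 Pipeline call parameters
- documents (str or List[str]): one or several articles (or a list of articles) to summarize.
- return_text (bool, optional, defaults to True): whether to include the decoded text in the output.
- return_tensors (bool, optional, defaults to False): whether to include the prediction tensors (as token indices) in the output.
- clean_up_tokenization_spaces (bool, optional, defaults to False): whether to clean up potential extra spaces in the text output.
- generate_kwargs: additional keyword arguments passed to the model's generate method (see the generate method corresponding to your framework).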
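The call parameters can be exercised with a small wrapper that accepts either one document or a list. This is a sketch under assumptions: `summarize_documents` is a hypothetical helper, `summarizer` is any callable that behaves like the summarization pipeline, and the generation settings (`max_length`, `min_length`, `do_sample`) are forwarded as extra keyword arguments.

```python
from typing import Callable, List, Union


def summarize_documents(summarizer: Callable, documents: Union[str, List[str]],
                        max_length: int = 130, min_length: int = 30) -> List[dict]:
    """Summarize one document or a list of documents with a pipeline-like callable."""
    # Normalize the input so the callable always receives a list of strings.
    docs = [documents] if isinstance(documents, str) else list(documents)
    return summarizer(
        docs,
        max_length=max_length,              # forwarded to the model's generate method
        min_length=min_length,              # forwarded to the model's generate method
        do_sample=False,                    # deterministic decoding
        clean_up_tokenization_spaces=True,  # tidy up extra spaces in the decoded text
    )


# Usage (assumes `summarizer` was created with pipeline("summarization", ...)):
# results = summarize_documents(summarizer, ["first article ...", "second article ..."])
```

Passing the pipeline in as an argument keeps the wrapper easy to test with a stand-in function before running the real model.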
2.4.3 Pipeline return values
- summary_text (str, present when return_text=True): the summary of the corresponding input.
- summary_token_ids (torch.Tensor or tf.Tensor, present when return_tensors=True): the token IDs of the summary.
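Reading these fields back is straightforward. The sketch below assumes the pipeline result is a list of dictionaries, one per input document, carrying the keys described above; the helper name `summaries_from_output` is hypothetical.

```python
from typing import Any, Dict, List


def summaries_from_output(outputs: List[Dict[str, Any]]) -> List[str]:
    """Collect the summary_text field from each result that contains one."""
    # Results produced with return_tensors=True may carry summary_token_ids instead,
    # so entries without summary_text are skipped rather than raising KeyError.
    return [item["summary_text"] for item in outputs if "summary_text" in item]


# Usage with a hand-written result in the documented shape:
example_output = [{"summary_text": "Liana Barrientos has been married 10 times."}]
print(summaries_from_output(example_output))
```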
2.5 Pipeline in practice
Using pipeline, we summarize an article with bart-large-cnn, a fine-tuned version of Facebook's BART.
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
ARTICLE = """New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney's Office by Immigration and Customs Enforcement and the Department of Homeland Security's
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18.
"""
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))
When the script runs, the model files are downloaded automatically and the summary of the article is generated.
2.6 Model ranking
Sorting the summarization models on Hugging Face by downloads in descending order gives 2,000 models in total; the top-ranked one is bart-large-cnn, introduced above.
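3. Conclusion
This article introduced summarization with the transformers pipeline, covering the overview, underlying technique, pipeline parameters, a hands-on example, and the model ranking. With pipeline, readers can use an NLP summarization model in as little as the 2 lines of code shown in this article.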