Text Summarization with PyTorch in 5 Simple Steps

2021-01-25 10:42:47

Introduction

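Text summarization is a natural language processing (NLP) task whose goal is to produce a concise summary of a source text. Unlike extractive summarization, abstractive summarization does not simply copy important phrases from the source text; it can also come up with new, relevant phrases, which can be viewed as paraphrasing. Summarization has a wide range of applications across domains, including books and literature, science and R&D, financial research, and legal document analysis.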

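To date, the most effective approach to abstractive summarization is to use transformer models fine-tuned on summarization datasets. In this article, we show how to summarize text with a powerful model in a few simple steps. The models we use are already pre-trained, so no additional training is needed :)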

Let's get started!

Step 1: Install the Transformers library

The library we will use is Transformers by Huggingface. If you are not familiar with Transformers, you can read my previous article.

To install Transformers, simply run:

 pip install transformers

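Note that PyTorch must be installed beforehand. If you have not installed PyTorch yet, go to the official PyTorch website and follow the installation instructions.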
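Before moving on, it can help to confirm that both packages are importable. The following is a minimal sketch using only the standard library; the helper name `is_installed` is an illustrative choice and not part of either library.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if `package` can be imported in the current environment."""
    return importlib.util.find_spec(package) is not None


# Report whether the two packages this tutorial relies on are available.
for name in ("torch", "transformers"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either package is reported as missing, install it before continuing with the next step.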

Step 2: Import the libraries

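Once Transformers is installed, you can import it in your Python script. We also import os so that we can set the environment variable that selects the GPU in the next step. This is entirely optional, but if you have multiple GPUs (for example when working in a Jupyter notebook), it is good practice for preventing other GPUs from being used by mistake.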

 from transformers import pipeline
 import os

Step 3: Set the GPU and model

If you decide to set the GPU (for example, GPU 0), you can do so as shown below:

 os.environ["CUDA_VISIBLE_DEVICES"] = "0"
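As a complement to the environment variable, the device can also be chosen explicitly. The sketch below assumes the `device` argument of `pipeline`, where `-1` means CPU and a non-negative integer selects a GPU; the helper name `pipeline_device` is hypothetical.

```python
import os

# Set this before anything initialises CUDA, ideally at the very top of the script.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"


def pipeline_device() -> int:
    """Return 0 for the first visible GPU, or -1 to fall back to the CPU."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


print(pipeline_device())
```

The returned value could then be passed as `device=pipeline_device()` when the pipeline is created, so the same script runs on machines with and without a GPU.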

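Now we are ready to choose the summarization model. Huggingface provides two powerful summarization models: BART (bart-large-cnn) and T5 (t5-small, t5-base, t5-large, t5-3b, t5-11b). You can learn more about them in their official papers (the BART paper and the T5 paper, listed at the end of this article).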

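To use the BART model, which is trained on the CNN/Daily Mail news dataset, you can use Huggingface's built-in pipeline module directly with its default parameters: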

 summarizer = pipeline("summarization")
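The checkpoint behind the default `pipeline("summarization")` call may differ between library versions, so naming it explicitly can make results easier to reproduce. This is a minimal sketch; `facebook/bart-large-cnn` is assumed here to be the Hub identifier for the BART checkpoint the article refers to, and `build_summarizer` is a hypothetical helper.

```python
def build_summarizer(model_name: str = "facebook/bart-large-cnn"):
    """Create a summarization pipeline for an explicitly named checkpoint.

    The weights are downloaded on first use, so this call needs network access.
    """
    from transformers import pipeline  # lazy import: only needed when a pipeline is built

    return pipeline("summarization", model=model_name)


# Example (commented out because it downloads the checkpoint on first call):
# summarizer = build_summarizer()
```

Passing a different name, such as one of the T5 checkpoints listed above, would reuse the same helper.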

If you want to use a T5 model (for example t5-base), which is pre-trained on the C4 Common Crawl web corpus, you can do the following:

 # framework="pt" selects the PyTorch backend ("tf" would require TensorFlow)
 summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base", framework="pt")

Step 4: Input the text to summarize

Now that the model is ready, we can provide the text we want to summarize. Suppose we want to summarize the following passage about COVID-19 vaccines from a MedicineNet article:

One month after the United States began what has become a troubled rollout of a national COVID vaccination campaign, the effort is finally gathering real steam. Close to a million doses — over 951,000, to be more exact — made their way into the arms of Americans in the past 24 hours, the U.S. Centers for Disease Control and Prevention reported Wednesday. That’s the largest number of shots given in one day since the rollout began and a big jump from the previous day, when just under 340,000 doses were given, CBS News reported. That number is likely to jump quickly after the federal government on Tuesday gave states the OK to vaccinate anyone over 65 and said it would release all the doses of vaccine it has available for distribution. Meanwhile, a number of states have now opened mass vaccination sites in an effort to get larger numbers of people inoculated, CBS News reported.

We define the variable:

 text = """One month after the United States began what has become a troubled rollout of a national COVID vaccination campaign, the effort is finally gathering real steam.
 Close to a million doses -- over 951,000, to be more exact -- made their way into the arms of Americans in the past 24 hours, the U.S. Centers for Disease Control and Prevention reported Wednesday. That's the largest number of shots given in one day since the rollout began and a big jump from the previous day, when just under 340,000 doses were given, CBS News reported.
 That number is likely to jump quickly after the federal government on Tuesday gave states the OK to vaccinate anyone over 65 and said it would release all the doses of vaccine it has available for distribution. Meanwhile, a number of states have now opened mass vaccination sites in an effort to get larger numbers of people inoculated, CBS News reported."""

Step 5: Summarize

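Finally, we can summarize the input text. Here we specify the min_length and max_length we want for the summary output, and we turn off sampling so that the generated summary is deterministic. We can do this by running: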

 summary_text = summarizer(text, max_length=100, min_length=5, do_sample=False)[0]['summary_text']
 print(summary_text)
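The call above indexes into a list of dictionaries, each holding a `summary_text` entry. A small wrapper can hide that structure and accept several texts at once. This is a sketch under the assumption that the pipeline returns one dictionary per input text; `summarize` and `fake_summarizer` are hypothetical names, and the stand-in only exists so the example runs without downloading a model.

```python
from typing import Callable, List, Union


def summarize(summarizer: Callable, texts: Union[str, List[str]],
              max_length: int = 100, min_length: int = 5) -> List[str]:
    """Run `summarizer` on one text or a list of texts and return plain strings."""
    if min_length > max_length:
        raise ValueError("min_length must not exceed max_length")
    if isinstance(texts, str):
        texts = [texts]
    # Sampling is disabled so repeated calls yield the same summary.
    results = summarizer(texts, max_length=max_length, min_length=min_length, do_sample=False)
    return [item["summary_text"] for item in results]


# Stand-in for a real pipeline: it just returns the first sentence of each text.
def fake_summarizer(texts, **kwargs):
    return [{"summary_text": t.split(".")[0] + "."} for t in texts]


print(summarize(fake_summarizer, "First sentence. Second sentence."))  # ['First sentence.']
```

With a real pipeline, `summarize(summarizer, text)[0]` would play the role of `summary_text` in the snippet above.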

We get the following summary:

Over 951,000 doses of vaccine given in one day in the past 24 hours, CDC says . That’s the largest number of shots given in a month since the rollout began . The federal government gave states the OK to vaccinate anyone over 65 on Tuesday . A number of states have now opened mass vaccination sites in an effort to get more people inoculated, CBS News reports .

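As the summary shows, the model knows that 24 hours is equivalent to one day, and it cleverly abbreviates the U.S. Centers for Disease Control and Prevention as CDC. In addition, the model successfully links information from the first and second paragraphs, indicating that this is the largest number of shots given since the rollout began last month. We can see that the summarization model performs quite well.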
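Alongside reading the summary, a rough numeric check is to compare its length with the source. The sketch below counts whitespace-separated words, which is only an approximation of what the model sees; `compression_ratio` is a hypothetical helper and the two strings are shortened illustrations, not the model's actual input and output.

```python
def compression_ratio(source: str, summary: str) -> float:
    """Return the summary length divided by the source length, in words."""
    source_words = len(source.split())
    if source_words == 0:
        raise ValueError("source text is empty")
    return len(summary.split()) / source_words


# Shortened, illustrative strings rather than the full article and summary.
source = "Close to a million doses made their way into the arms of Americans in the past 24 hours."
summary = "Close to a million doses given in one day."
print(f"{compression_ratio(source, summary):.2f}")
```

A lower ratio means a more aggressive summary; it says nothing about whether the summary is faithful, so it does not replace reading the output.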

Finally, putting it all together, here is the complete code as a Jupyter notebook:

https://gist.github.com/itsuncheng/f3c4dde81ac4651383c4480958da4f8e#file-summarization-ipynb

Lewis, Mike, et al. “Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.” arXiv preprint arXiv:1910.13461 (2019).

Raffel, Colin, et al. “Exploring the limits of transfer learning with a unified text-to-text transformer.” arXiv preprint arXiv:1910.10683 (2019).

Author: Raymond Cheng

Original article: https://towardsdatascience.com/abstractive-summarization-using-pytorch-f5063e67510

Translated by the deephub translation team
