Extractive Summarization
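Summarization algorithms fall into two main classes: extractive (extraction-based) and abstractive (abstraction-based). Most traditional summarization systems are extractive: they pick key sentences or phrases out of the given article and stitch them together into a short summary, without creatively rewriting the original content. The resulting summary may not be grammatically correct.
Open-source implementations on GitHub include sumy, pytextrank, and textteaser.
Abstractive summarization systems can generate new sentences that may not appear in the source document. Because abstractive models produce new phrases and sentences that express the most important information in the source text, they help overcome the grammatical problems of extractive summaries.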
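The rest of this section walks through a basic implementation of extractive summarization.
Dataset: nlpcc2017textsummarization
Link: https://pan.baidu.com/s/1P7Vv11NKSWMFK0HUChvaPQ
Extraction code: kxr1
First, GloVe is trained on the data to obtain word vectors, and each sentence vector is computed as the mean of the vectors of its words. The TextRank algorithm is then applied to pick the 3 top-ranked sentences. The steps are: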
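1) Load the packages
2) Read the data
3) Apply simple text preprocessing
4) Train GloVe word vectors
5) Prepare the sentence vectors
6) Build the similarity matrix and run TextRank
7) Produce the results and save them
8) Inspect the results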
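Steps 5 and 6 can be sketched as follows. This is a minimal illustration, not the original author's code: it assumes the sentences are already tokenized and that `word_vectors` is a plain dict from word to vector (in practice it would hold the trained GloVe vectors). The function names, the damping factor of 0.85, and the clipping of negative similarities to zero are all assumptions made for the sketch.

```python
import math


def sentence_vector(tokens, word_vectors, dim):
    """Mean of the vectors of the tokens that have an embedding."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim  # no known word: fall back to the zero vector
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def textrank(sim, damping=0.85, max_iter=100, tol=1e-6):
    """Weighted PageRank by power iteration over a similarity matrix."""
    n = len(sim)
    if n == 0:
        return []
    scores = [1.0 / n] * n
    out_weight = [sum(row) for row in sim]  # total edge weight leaving each node
    for _ in range(max_iter):
        updated = []
        for i in range(n):
            incoming = sum(
                scores[j] * sim[j][i] / out_weight[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            updated.append((1 - damping) / n + damping * incoming)
        delta = max(abs(a - b) for a, b in zip(updated, scores))
        scores = updated
        if delta < tol:
            break
    return scores


def rank_sentences(sentences, word_vectors, dim, top_k=3):
    """Indices of the top_k sentences, highest TextRank score first."""
    vectors = [sentence_vector(s, word_vectors, dim) for s in sentences]
    n = len(vectors)
    # Pairwise similarity; the diagonal is zero and negatives are clipped
    sim = [
        [0.0 if i == j else max(cosine(vectors[i], vectors[j]), 0.0) for j in range(n)]
        for i in range(n)
    ]
    scores = textrank(sim)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_k]
```

Calling `rank_sentences(tokenized_sentences, glove_vectors, dim)` returns sentence indices in rank order; joining the corresponding original sentences gives the summary. Sorting the selected indices back into document order before joining is a common variation.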
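BertSumExt is not producing summaries
Title: BertSumExt is not producing Summaries
Posted: 2020-08-22 03:17:20
Question: I am trying to get the extractive BertSum summarizer working (Paper and Github here), but I keep getting the following message
"xent 0 at step -1"
and no summary is generated. What am I doing wrong? Can someone help me with this, and perhaps provide a working example? The message above appears when I do the following in Google Colab:
1 Clone the GitHub repository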
!git clone https://github.com/Alcamech/PreSumm.git
2 Switch to the Git branch for summarizing raw text data
%cd /content/PreSumm
!git checkout -b Raw_Input origin/PreSumm_Raw_Input_Text_Setup
!git pull
3 Install the requirements
!pip install torch==1.1.0 pytorch_transformers tensorboardX multiprocess pyrouge
4 Install the CNN/DM extractive model bertext_cnndm_transformer.pt
!gdown https://drive.google.com/uc?id=1kKWoV0QCbeIuFt85beQgJ4v0lujaXobJ&export=download
!unzip /content/PreSumm/models/bertext_cnndm_transformer.zip
4.1 Download the preprocessed data for CNN/DailyMail
%cd /content/PreSumm/bert_data/
!gdown https://drive.google.com/uc?id=1DN7ClZCCXsk2KegmC6t4ClBwtAf5galI&export=download
!unzip /content/PreSumm/bert_data/bert_data_cnndm_final.zip
5 Change to the /src folder
cd /content/PreSumm/src/
6 Run the extractive summarizer
!python /content/PreSumm/src/train.py -task ext -mode test_text -test_from /content/PreSumm/models/bertext_cnndm_transformer.pt -text_src /content/PreSumm/raw_data/temp_ext.raw_src -text_tgt /content/PreSumm/results/result.txt -log_file /content/PreSumm/logs/ext_bert_cnndm
The output of step 6 is:
[2020-05-07 11:20:12,355 INFO] Loading checkpoint from /content/PreSumm/models/bertext_cnndm_transformer.pt
Namespace(accum_count=1, alpha=0.6, batch_size=140, beam_size=5, bert_data_path='../bert_data_new/cnndm', beta1=0.9, beta2=0.999, block_trigram=True, dec_dropout=0.2, dec_ff_size=2048, dec_heads=8, dec_hidden_size=768, dec_layers=6, enc_dropout=0.2, enc_ff_size=512, enc_hidden_size=512, enc_layers=6, encoder='bert', ext_dropout=0.2, ext_ff_size=2048, ext_heads=8, ext_hidden_size=768, ext_layers=2, finetune_bert=True, generator_shard_size=32, gpu_ranks=[0], label_smoothing=0.1, large=False, load_from_extractive='', log_file='/content/PreSumm/logs/ext_bert_cnndm', lr=1, lr_bert=0.002, lr_dec=0.002, max_grad_norm=0, max_length=150, max_ndocs_in_batch=6, max_pos=512, max_tgt_len=140, min_length=15, mode='test_text', model_path='../models/', optim='adam', param_init=0, param_init_glorot=True, recall_eval=False, report_every=1, report_rouge=True, result_path='../results/cnndm', save_checkpoint_steps=5, seed=666, sep_optim=False, share_emb=False, task='ext', temp_dir='../temp', test_all=False, test_batch_size=200, test_from='/content/PreSumm/models/bertext_cnndm_transformer.pt', test_start_from=-1, text_src='/content/PreSumm/raw_data/temp_ext.raw_src', text_tgt='/content/PreSumm/results/result.txt', train_from='', train_steps=1000, use_bert_emb=False, use_interval=True, visible_gpus='-1', warmup_steps=8000, warmup_steps_bert=8000, warmup_steps_dec=8000, world_size=1)
[2020-05-07 11:20:13,361 INFO] https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json not found in cache or force_download set to True, downloading to /tmp/tmpvck0jwoy
100% 433/433 [00:00<00:00, 309339.74B/s]
[2020-05-07 11:20:13,498 INFO] copying /tmp/tmpvck0jwoy to cache at ../temp/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517
[2020-05-07 11:20:13,499 INFO] creating metadata file for ../temp/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517
[2020-05-07 11:20:13,499 INFO] removing temp file /tmp/tmpvck0jwoy
[2020-05-07 11:20:13,499 INFO] loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json from cache at ../temp/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517
[2020-05-07 11:20:13,500 INFO] Model config {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "pad_token_id": 0,
  "pruned_heads": {},
  "torchscript": false,
  "type_vocab_size": 2,
  "vocab_size": 30522
}
[2020-05-07 11:20:13,571 INFO] https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-pytorch_model.bin not found in cache or force_download set to True, downloading to /tmp/tmp6b78t4_2
100% 440473133/440473133 [00:06<00:00, 71548841.10B/s]
[2020-05-07 11:20:19,804 INFO] copying /tmp/tmp6b78t4_2 to cache at ../temp/aa1ef1aede4482d0dbcd4d52baad8ae300e60902e88fcb0bebdec09afd232066.36ca03ab34a1a5d5fa7bc3d03d55c4fa650fed07220e2eeebc06ce58d0e9a157
[2020-05-07 11:20:21,212 INFO] creating metadata file for ../temp/aa1ef1aede4482d0dbcd4d52baad8ae300e60902e88fcb0bebdec09afd232066.36ca03ab34a1a5d5fa7bc3d03d55c4fa650fed07220e2eeebc06ce58d0e9a157
[2020-05-07 11:20:21,212 INFO] removing temp file /tmp/tmp6b78t4_2
[2020-05-07 11:20:21,267 INFO] loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-pytorch_model.bin from cache at ../temp/aa1ef1aede4482d0dbcd4d52baad8ae300e60902e88fcb0bebdec09afd232066.36ca03ab34a1a5d5fa7bc3d03d55c4fa650fed07220e2eeebc06ce58d0e9a157
gpu_rank 0
[2020-05-07 11:20:24,645 INFO] * number of parameters: 120512513
[2020-05-07 11:20:24,736 INFO] https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt not found in cache or force_download set to True, downloading to /tmp/tmpyv3mwnb6
100% 231508/231508 [00:00<00:00, 4268647.82B/s]
[2020-05-07 11:20:25,044 INFO] copying /tmp/tmpyv3mwnb6 to cache at /root/.cache/torch/pytorch_transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
[2020-05-07 11:20:25,045 INFO] creating metadata file for /root/.cache/torch/pytorch_transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
[2020-05-07 11:20:25,045 INFO] removing temp file /tmp/tmpyv3mwnb6
[2020-05-07 11:20:25,046 INFO] loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /root/.cache/torch/pytorch_transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
0% 0/2 [00:00<?, ?it/s]
[2020-05-07 11:20:25,115 INFO] Validation xent: 0 at step -1
The result.txt file is also empty.
Here is a link to my Google Colab copy, where you can see the full code. I also tried these steps on the original GitHub repo here and got the same error. Thanks for your help.
Comments:
Please post a minimal reproducible example here rather than behind an external link.
Answer 1: You can look at the BertSum extractive summarization example at https://github.com/microsoft/nlp-recipes/blob/master/examples/text_summarization/extractive_summarization_cnndm_transformer.ipynb
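When reproducing the steps from the question, it may help to confirm that the files passed to step 6 actually exist and are non-empty before running it. The sketch below assumes the Colab paths shown in the question; `check_inputs` is a hypothetical helper written for this illustration and is not part of PreSumm. It only reports missing or empty files and does not by itself explain the log message.

```python
from pathlib import Path


def check_inputs(paths):
    """Return (path, problem) pairs for files that are missing or empty."""
    problems = []
    for path in map(Path, paths):
        if not path.is_file():
            problems.append((str(path), "missing"))
        elif path.stat().st_size == 0:
            problems.append((str(path), "empty"))
    return problems


if __name__ == "__main__":
    # Paths taken from the command in step 6 of the question
    inputs = [
        "/content/PreSumm/models/bertext_cnndm_transformer.pt",
        "/content/PreSumm/raw_data/temp_ext.raw_src",
    ]
    for path, problem in check_inputs(inputs):
        print(f"{problem}: {path}")
```

An empty list from `check_inputs` means every listed file is present and has content; anything else points at a download or unzip step that did not land where the command expects it.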