Paper translation: Deep contextualized word representations

Posted wwj99




We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy).


Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pretrained on a large text corpus.


We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis.


We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.


1 Introduction

Pre-trained word representations (Mikolov et al., 2013; Pennington et al., 2014) are a key component in many neural language understanding models.

However, learning high quality representations can be challenging.


They should ideally model both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy).


Our representations differ from traditional word type embeddings in that each token is assigned a representation that is a function of the entire input sentence.


We use vectors derived from a bidirectional LSTM that is trained with a coupled language model (LM) objective on a large text corpus.


For this reason, we call them ELMo (Embeddings from Language Models) representations.
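For reference, the coupled objective mentioned above can be written as the joint log-likelihood of a forward and a backward language model over the tokens $t_1, \ldots, t_N$. The notation follows Section 3.1 of the original paper, which lies outside this excerpt: $\Theta_x$ (token representation) and $\Theta_s$ (softmax layer) are shared between the two directions, while each direction keeps its own LSTM parameters.

```latex
\sum_{k=1}^{N} \Big(
  \log p\big(t_k \mid t_1, \ldots, t_{k-1};\ \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s\big)
  + \log p\big(t_k \mid t_{k+1}, \ldots, t_N;\ \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s\big)
\Big)
```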


Unlike previous approaches for learning contextualized word vectors (Peters et al., 2017; McCann et al., 2017), ELMo representations are deep, in the sense that they are a function of all of the internal layers of the biLM.

More specifically, we learn a linear combination of the vectors stacked above each input word for each end task, which markedly improves performance over just using the top LSTM layer.


Combining the internal states in this manner allows for very rich word representations.
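The learned linear combination can be illustrated with a minimal sketch in plain Python. It assumes the formulation given later in the paper (Section 3.2, outside this excerpt): softmax-normalized per-layer weights and a task-specific scalar `gamma`. The function names `softmax` and `mix_layers` are hypothetical, and in a real system the scores and `gamma` would be trained with the downstream task rather than set by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_vectors, layer_scores, gamma=1.0):
    """Collapse the per-layer vectors of ONE token into a single vector.

    layer_vectors: one vector per biLM layer, all of the same length.
    layer_scores:  one unnormalized score per layer (learned in practice).
    gamma:         scalar that rescales the mixed vector (learned in practice).
    """
    if len(layer_vectors) != len(layer_scores):
        raise ValueError("need exactly one score per layer")
    weights = softmax(layer_scores)
    mixed = [0.0] * len(layer_vectors[0])
    for weight, vector in zip(weights, layer_vectors):
        for i, value in enumerate(vector):
            mixed[i] += weight * value
    return [gamma * value for value in mixed]


# Three layers of made-up 2-dimensional states for a single token.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(mix_layers(layers, [0.0, 0.0, 0.0]))  # equal scores: plain average of the layers
```

Pushing one score far above the others recovers the "top layer only" baseline as a special case, which is why the mixture can only match or improve on it once the weights are trained.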


Using intrinsic evaluations, we show that the higher-level LSTM states capture context-dependent aspects of word meaning (e.g., they can be used without modification to perform well on supervised word sense disambiguation tasks) while lower-level states model aspects of syntax (e.g., they can be used to do part-of-speech tagging).


Simultaneously exposing all of these signals is highly beneficial, allowing the learned models to select the types of semi-supervision that are most useful for each end task.


Extensive experiments demonstrate that ELMo representations work extremely well in practice.


We first show that they can be easily added to existing models for six diverse and challenging language understanding problems, including textual entailment, question answering and sentiment analysis.


The addition of ELMo representations alone significantly improves the state of the art in every case, including up to 20% relative error reductions.
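To make "relative error reduction" concrete, here is a small sketch of the usual computation: the share of the baseline's remaining error that the new system removes. The helper name `relative_error_reduction` is hypothetical, and the scores in the example are made-up numbers, not results from the paper.

```python
def relative_error_reduction(baseline_score, new_score, max_score=100.0):
    """Fraction of the baseline's remaining error removed by the new system."""
    baseline_error = max_score - baseline_score
    new_error = max_score - new_score
    if baseline_error <= 0:
        raise ValueError("baseline has no error left to reduce")
    return (baseline_error - new_error) / baseline_error


# Made-up example: accuracy rises from 80.0 to 84.0, so the error drops
# from 20.0 to 16.0, which is a 20% relative reduction.
print(relative_error_reduction(80.0, 84.0))
```

The same absolute gain counts for more on a stronger baseline: going from 90.0 to 92.0 is also a 20% relative reduction, although the absolute improvement is only 2 points.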


For tasks where direct comparisons are possible, ELMo outperforms CoVe (McCann et al., 2017), which computes contextualized representations using a neural machine translation encoder.

Finally, an analysis of both ELMo and CoVe reveals that deep representations outperform those derived from just the top layer of an LSTM.


Our trained models and code are publicly available, and we expect that ELMo will provide similar gains for many other NLP problems.


2 Related work

Due to their ability to capture syntactic and semantic information of words from large scale unlabeled text, pretrained word vectors (Turian et al., 2010; Mikolov et al., 2013; Pennington et al., 2014) are a standard component of most state-of-the-art NLP architectures, including for question answering (Liu et al., 2017), textual entailment (Chen et al., 2017) and semantic role labeling (He et al., 2017).


However, these approaches for learning word vectors only allow a single context-independent representation for each word.
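The limitation can be seen in a toy example. The sketch below contrasts a type-level lookup table with a deliberately crude context-dependent function. Everything in it is an assumption made for illustration: the embedding values are invented, and `toy_contextual_vectors` simply averages each token with its immediate neighbours, which is not a biLM and only demonstrates that a token's vector can depend on the surrounding sentence.

```python
# Hypothetical 2-dimensional static embedding table (values are made up).
STATIC = {
    "the": [0.1, 0.1],
    "river": [1.0, 0.0],
    "bank": [0.5, 0.5],
    "loan": [0.0, 1.0],
}


def static_vectors(tokens):
    """Type-level lookup: a word gets the same vector in every sentence."""
    return [STATIC[token] for token in tokens]


def toy_contextual_vectors(tokens):
    """Toy stand-in for a contextual encoder: average each token with its neighbours."""
    vectors = static_vectors(tokens)
    contextual = []
    for i in range(len(vectors)):
        window = vectors[max(0, i - 1): i + 2]
        contextual.append([sum(column) / len(window) for column in zip(*window)])
    return contextual


sentence_a = ["the", "river", "bank"]
sentence_b = ["the", "bank", "loan"]

# Static lookup: "bank" is identical in both sentences.
print(static_vectors(sentence_a)[2] == static_vectors(sentence_b)[1])
# Toy contextual function: "bank" now differs between the two sentences.
print(toy_contextual_vectors(sentence_a)[2] == toy_contextual_vectors(sentence_b)[1])
```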


Previously proposed methods overcome some of the shortcomings of traditional word vectors by either enriching them with subword information (e.g., Wieting et al., 2016; Bojanowski et al., 2017) or learning separate vectors for each word sense (e.g., Neelakantan et al., 2014).

Our approach also benefits from subword units through the use of character convolutions, and we seamlessly incorporate multi-sense information into downstream tasks without explicitly training to predict predefined sense classes.
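As a rough picture of what a character convolution computes, the sketch below slides one filter over the characters of a word and max-pools the activations into a single feature. It is a simplified illustration, not the paper's architecture: the function name `char_cnn_feature`, the character vectors, and the filter weights are all invented, and a real model would use many filters of several widths followed by further layers.

```python
def char_cnn_feature(word, char_vectors, conv_filter):
    """One convolutional feature for a word: slide a filter over its characters, then max-pool.

    char_vectors: dict mapping a character to its embedding (list of floats).
    conv_filter:  list of `width` weight vectors, one per position in the window.
    """
    width = len(conv_filter)
    dim = len(conv_filter[0])
    # Unknown characters map to a zero vector.
    embedded = [char_vectors.get(char, [0.0] * dim) for char in word]
    # Pad short words so that at least one full window exists.
    while len(embedded) < width:
        embedded.append([0.0] * dim)
    activations = []
    for start in range(len(embedded) - width + 1):
        window = embedded[start:start + width]
        activation = sum(
            weight * value
            for weight_vec, char_vec in zip(conv_filter, window)
            for weight, value in zip(weight_vec, char_vec)
        )
        activations.append(activation)
    # Max over positions: the feature fires if the pattern occurs anywhere in the word.
    return max(activations)


# Invented character vectors and a width-2 filter that responds to the bigram "un".
chars = {"u": [1.0, 0.0], "n": [0.0, 1.0]}
un_filter = [[1.0, 0.0], [0.0, 1.0]]
print(char_cnn_feature("undo", chars, un_filter))
print(char_cnn_feature("redo", chars, un_filter))
```

Because the feature depends only on characters, it is defined for words that never appeared in training, which is the practical benefit of subword units.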


Other recent work has also focused on learning context-dependent representations.


context2vec (Melamud et al., 2016) uses a bidirectional Long Short Term Memory (LSTM; Hochreiter and Schmidhuber, 1997) to encode the context around a pivot word.

Other approaches for learning contextual embeddings include the pivot word itself in the representation and are computed with the encoder of either a supervised neural machine translation (MT) system (CoVe; McCann et al., 2017) or an unsupervised language model (Peters et al., 2017).

Both of these approaches benefit from large datasets, although the MT approach is limited by the size of parallel corpora.


In this paper, we take full advantage of access to plentiful monolingual data, and train our biLM on a corpus with approximately 30 million sentences (Chelba et al., 2014).

We also generalize these approaches to deep contextual representations, which we show work well across a broad range of diverse NLP tasks.


Previous work has also shown that different layers of deep biRNNs encode different types of information.


For example, introducing multi-task syntactic supervision (e.g., part-of-speech tags) at the lower levels of a deep LSTM can improve overall performance of higher level tasks such as dependency parsing (Hashimoto et al., 2017) or CCG super tagging (Søgaard and Goldberg, 2016).

In an RNN-based encoder-decoder machine translation system, Belinkov et al. (2017) showed that the representations learned at the first layer in a 2-layer LSTM encoder are better at predicting POS tags than those learned at the second layer.

Finally, the top layer of an LSTM for encoding word context (Melamud et al., 2016) has been shown to learn representations of word sense.

We show that similar signals are also induced by the modified language model objective of our ELMo representations, and it can be very beneficial to learn models for downstream tasks that mix these different types of semi-supervision.


Dai and Le (2015) and Ramachandran et al. (2017) pretrain encoder-decoder pairs using language models and sequence autoencoders and then fine-tune with task-specific supervision.


In contrast, after pretraining the biLM with unlabeled data, we fix the weights and add additional task-specific model capacity, allowing us to leverage large, rich and universal biLM representations for cases where downstream training data size dictates a smaller supervised model.


3 ELMo: Embeddings from Language Models

Unlike most widely used word embeddings (Pennington et al., 2014), ELMo word representations are functions of the entire input sentence, as described in this section.

They are computed on top of two-layer biLMs with character convolutions (Sec. 3.1), as a linear function of the internal network states (Sec. 3.2).


This setup allows us to do semi-supervised learning, where the biLM is pretrained at a large scale (Sec. 3.4) and easily incorporated into a wide range of existing neural NLP architectures (Sec. 3.3).

