How to use word embeddings as features for CRF (sklearn-crfsuite) model training


Posted: 2020-03-03 07:33:23

Question:

I want to develop an NER model in which word-embedding features are used to train a CRF model. The code runs perfectly without the word-embedding features, but when I insert the embeddings as features for CRF training I get an error message. Here is a snippet of my code:

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

from itertools import chain

import nltk
import sklearn
import scipy.stats
from sklearn.metrics import make_scorer
#from sklearn.cross_validation import cross_val_score
#from sklearn.grid_search import RandomizedSearchCV

import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics
import pickle
from gensim.models import KeyedVectors
import numpy as np
# Load vectors directly from the file
model1 = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True) ### Loading pre-trained word2vec model
### Embedding function 
def get_features(word):
    word=word.lower()
    vectors=[]
    try:
        vectors.append(model1[word])
    except:
        pass
    #vectors=np.array(vectors)
    #vectors=vectors[0]
    return vectors

def word2features(sent, i):
    word = sent[i][0]
    wordembdding=get_features(word)   ## word embedding vector 
    wordembdding=np.array(wordembdding) ## vectors 
    #wordembdding= 
    #wordembdding=wordembdding[0]
    postag = sent[i][1]
    tag1=sent[i][2]
    tag2=sent[i][4]
    tag3 = sent[i][5]


    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'wordembdding': wordembdding,
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
        'tag1': tag1,
        'tag1[:2]': tag1[:2],
        'tag2': tag2,
        'tag2[:2]': tag2[:2],
        'tag3': tag3,
        'tag3[:2]': tag3[:2],
        'wordlength': len(word),
        'wordinitialcap': word[0].isupper(),
        'wordmixedcap': len([x for x in word[1:] if x.isupper()])>0,
        'wordallcap': len([x for x in word if x.isupper()])==len(word),
        'distfromsentbegin': i
    }
    if i > 0:
        word1 = sent[i-1][0]
        wordembdding1= get_features(word1)
        wordembdding1=np.array(wordembdding1)
        #wordembdding1=f2(wordembdding1)
        postag1 = sent[i-1][1]
        tag11=sent[i-1][2]
        tag22=sent[i-1][4]
        tag33 = sent[i-1][5]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:wordembdding': wordembdding1,   # word embedding features
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
            '-1:tag1': tag1,
            '-1:tag1[:2]': tag1[:2],
            '-1:tag2': tag2,
            '-1:tag2[:2]': tag2[:2],
            '-1:tag3': tag3,
            '-1:tag3[:2]': tag3[:2],
            '-1:wordlength': len(word),
            '-1:wordinitialcap': word[0].isupper(),
            '-1:wordmixedcap': len([x for x in word[1:] if x.isupper()])>0,
            '-1:wordallcap': len([x for x in word if x.isupper()])==len(word),
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        wordembdding1= get_features(word1)
        wordembdding1=np.array(wordembdding1) ## word embedding features 
        #wordembdding1=f2(wordembdding)
        postag1 = sent[i+1][1]
        tag11=sent[i+1][2]
        tag22=sent[i+1][4]
        tag33 = sent[i+1][5]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:wordembdding': wordembdding1,
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
            '+1:tag1': tag1,
            '+1:tag1[:2]': tag1[:2],
            '+1:tag2': tag2,
            '+1:tag2[:2]': tag2[:2],
            '+1:tag3': tag3,
            '+1:tag3[:2]': tag3[:2],
            '+1:wordlength': len(word),
            '+1:wordinitialcap': word[0].isupper(),
            '+1:wordmixedcap': len([x for x in word[1:] if x.isupper()])>0,
            '+1:wordallcap': len([x for x in word if x.isupper()])==len(word),
        })
    else:
        features['EOS'] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, tag1, label, tag2, tag3 in sent]

def sent2tokens(sent):
    return [token for token, postag, tag1, label, tag2, tag3, tag4, tag5 in sent]



X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]


%%time
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)   ### Error message when try to train

When I try to train the CRF model, I get the following error message:

TypeError: only size-1 arrays can be converted to Python scalars

Can anyone suggest how I can use word-embedding vectors to train the CRF model?

Comments:

I tried it without appending, but got the same error.

Answer 1:

As you can see here, python-crfsuite and sklearn-crfsuite currently do not support array features such as word embeddings.
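A minimal sketch of why this fails, assuming only NumPy: crfsuite ultimately casts each numeric feature value to a single float, and a multi-element embedding array cannot be cast that way, which is exactly the TypeError reported in the question.

```python
import numpy as np

# crfsuite numeric feature values must be scalar-like; casting a whole
# embedding array to float fails the same way the CRF training did.
def is_scalar_feature(value):
    try:
        float(value)
        return True
    except TypeError:
        return False

is_scalar_feature(1.0)                   # a plain number works
is_scalar_feature(np.array([0.1, 0.2]))  # a multi-element array does not
```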

Instead, you can pass each vector component as a separate feature:

...
 'v0': 1.81583762e-02,
 'v1': 2.83553465e-02,
  ...
 'v299': -4.26079705e-02,
 ...

I suggest replacing your get_features function:

def get_features(word):
    word = word.lower()
    try:
        vector = model1[word]
    except KeyError:
        # if the word is not in the vocabulary,
        # return a zeros array
        vector = np.zeros(300)

    return vector
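To see the out-of-vocabulary fallback in action without downloading the multi-gigabyte GoogleNews file, here is the same logic against a toy vocabulary (`toy_vectors` is a hypothetical stand-in for `model1`, and 3-dimensional vectors stand in for the real 300-dimensional ones):

```python
import numpy as np

# Toy stand-in for the pre-trained KeyedVectors model
toy_vectors = {'dog': np.array([0.1, 0.2, 0.3])}

def get_features(word, dim=3):
    word = word.lower()
    try:
        vector = toy_vectors[word]   # KeyedVectors raises KeyError for OOV words
    except KeyError:
        vector = np.zeros(dim)       # unseen word -> zero vector
    return vector

get_features('Dog')     # known word: its embedding
get_features('unseen')  # unknown word: a zero vector of the same dimension
```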

Then modify the word2features function so that each component of the vector becomes its own feature:

def word2features(sent, i):
    word = sent[i][0]
    wordembdding=get_features(word)   ## word embedding vector 
    postag = sent[i][1]
    tag1=sent[i][2]
    tag2=sent[i][4]
    tag3 = sent[i][5]


    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
        'tag1': tag1,
        'tag1[:2]': tag1[:2],
        'tag2': tag2,
        'tag2[:2]': tag2[:2],
        'tag3': tag3,
        'tag3[:2]': tag3[:2],
        'wordlength': len(word),
        'wordinitialcap': word[0].isupper(),
        'wordmixedcap': len([x for x in word[1:] if x.isupper()])>0,
        'wordallcap': len([x for x in word if x.isupper()])==len(word),
        'distfromsentbegin': i
    }

    # here you add 300 features (one for each vector component)
    for iv, value in enumerate(wordembdding):
        features['v{}'.format(iv)] = value

# And so on...
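The "And so on..." covers the i-1 and i+1 windows: the same per-component expansion applies there, only with the `-1:`/`+1:` key prefixes. A small helper (hypothetical name, following the `v0`, `v1`, ... naming above) keeps the three call sites identical:

```python
import numpy as np

def add_embedding_features(features, vector, prefix=''):
    # One scalar feature per embedding component; prefix '-1:' or '+1:'
    # reuses the same naming scheme for the neighbouring words.
    for iv, value in enumerate(vector):
        features['{}v{}'.format(prefix, iv)] = float(value)
    return features

feats = {'bias': 1.0}
add_embedding_features(feats, np.array([0.1, -0.2]), prefix='-1:')
# feats now also holds '-1:v0' and '-1:v1'
```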

Two small notes:

If many words in your text are out of vocabulary, word embeddings will not improve your NER model much. You might use FastText instead (also integrated in Gensim), which handles unseen words properly.

Even when it helps, adding an embedding vector for every word makes your training set very large, leading to long training times and a very large classifier.
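The FastText suggestion works because FastText composes a vector for any word from its character n-grams. The sketch below illustrates only that idea, not gensim's API: deterministic hash-seeded random vectors stand in for trained n-gram embeddings, and in real use you would load a trained model through gensim's FastText support instead.

```python
import zlib
import numpy as np

def char_ngrams(word, n_min=3, n_max=5):
    # FastText-style n-grams over the word wrapped in boundary markers
    w = '<' + word + '>'
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def subword_vector(word, dim=8):
    # Average one pseudo-embedding per n-gram; any word, seen or not,
    # gets a vector, since it shares n-grams with in-vocabulary words.
    vecs = [np.random.RandomState(zlib.crc32(g.encode())).randn(dim)
            for g in char_ngrams(word)]
    return np.mean(vecs, axis=0)
```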
