Various Pretrained Word Embeddings

Editor's note: this post, compiled by the editors of cha138.com, introduces various pretrained word embeddings and how to use them, and is intended as a practical reference.

word2vec

Pre-trained vectors trained on part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in the accompanying paper.

download link | source link
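
These vectors are distributed as a binary word2vec file (GoogleNews-vectors-negative300.bin.gz). A minimal sketch of loading them with gensim, assuming the file has been downloaded to the working directory:

from gensim.models import KeyedVectors

# Load the binary word2vec file released by Google (path is an assumption)
w2v = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True)

print(w2v['computer'].shape)                 # (300,)
print(w2v.most_similar('computer', topn=3))  # nearest neighbours by cosine similarity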

fastText

1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).

download link | source link

1 million word vectors trained with subword information on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).

download link | source link

2 million word vectors trained on Common Crawl (600B tokens).

download link | source link
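
All three English fastText downloads above are plain-text files in word2vec .vec format (a header line followed by one word per line), so gensim can read them directly. A minimal sketch, assuming the wiki-news-300d-1M.vec file from the first download:

from gensim.models import KeyedVectors

# .vec files are text-format word2vec; loading 1-2M words can take a few minutes
ft = KeyedVectors.load_word2vec_format('wiki-news-300d-1M.vec', binary=False)

print(ft.most_similar('embedding', topn=3))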

GloVe

Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download)

download link | source link

Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download)

download link | source link

Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)

download link | source link

Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download)

download link | source link
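
GloVe text files use the same word-per-line format but have no word2vec header line. A minimal sketch of loading the 6B/100d file, assuming gensim >= 4.0 (which accepts no_header=True); older gensim versions need a conversion step via gensim.scripts.glove2word2vec first:

from gensim.models import KeyedVectors

# GloVe files have no header line, hence no_header=True (gensim >= 4.0)
glove = KeyedVectors.load_word2vec_format(
    'glove.6B.100d.txt', binary=False, no_header=True)

print(glove.most_similar('frog', topn=3))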

word2vec

Wikipedia dump (Chinese), vector size 300, corpus size 1 GB, vocabulary size 50,101, tokenized with Jieba.

download link | source link
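
Because these vectors were trained on Jieba-segmented text, query words should be segmented the same way before lookup. A minimal sketch, assuming the download is a text-format word2vec file saved as zhwiki_w2v.txt (the file name here is a placeholder):

import jieba
from gensim.models import KeyedVectors

# Placeholder file name for the downloaded text-format vectors
zh_w2v = KeyedVectors.load_word2vec_format('zhwiki_w2v.txt', binary=False)

# Segment with Jieba, then look up each in-vocabulary token
tokens = list(jieba.cut('自然语言处理'))
vectors = [zh_w2v[t] for t in tokens if t in zh_w2v]
print(tokens, len(vectors))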

fastText

Trained on Common Crawl and Wikipedia using fastText. These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5, and 10 negatives. The Stanford word segmenter was used for tokenization.

download link | source link
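
The .bin download keeps the character n-gram information, so the official fasttext Python package can build a vector even for words that never appeared in training. A minimal sketch, using the file name from the official Common Crawl + Wikipedia release for Chinese (cc.zh.300.bin):

import fasttext

# Load the full binary model, including subword n-grams
model = fasttext.load_model('cc.zh.300.bin')

# get_word_vector also works for out-of-vocabulary words,
# because the vector is composed from character n-grams
print(model.get_word_vector('词向量').shape)   # (300,)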

How to use pretrained word embeddings for text classification

Enough talk, let's look at the code!

import gensim
import numpy as np
from gensim.models import Word2Vec
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense
from keras.preprocessing import sequence

TEXT_MAXLEN = 100  # maximum sequence length after padding (example value, tune for your data)
# inPath (model/embedding directory) and df_data (the raw DataFrame) are assumed to be defined elsewhere.

def train_W2V(w2vCorpus, size=100):
    # gensim < 4.0 parameter names (iter/size); in gensim >= 4.0 use epochs/vector_size instead
    w2vModel = Word2Vec(sentences=w2vCorpus, hs=0, negative=5, min_count=5, window=8, iter=1, size=size)
    w2vModel.save(inPath + 'w2vModel.model')
    return w2vModel

def load_W2V(W2V_path, loader_mySelf=1):
    if loader_mySelf:
        print('using my own w2vModel')
        w2vModel = Word2Vec.load(W2V_path + 'w2vModel.model')  # word vectors trained by train_W2V above
    else:  # load externally pretrained vectors (e.g. the Tencent embeddings)
        print('using external w2vModel')
        w2vModel = gensim.models.KeyedVectors.load_word2vec_format(W2V_path + 'w2v_embedding_tengxun', binary=False)
    return w2vModel

def make_word2idx_embedMatrix(w2vModel):
    # accept either a full Word2Vec model or a bare KeyedVectors object
    kv = w2vModel.wv if hasattr(w2vModel, 'wv') else w2vModel
    word2idx = {"_PAD": 0}  # index 0 is reserved for padding / unknown words
    vocab_list = [(w, kv[w]) for w in kv.vocab]  # gensim < 4.0 API; use kv.key_to_index in >= 4.0
    embeddings_matrix = np.zeros((len(vocab_list) + 1, kv.vector_size))

    for i in range(len(vocab_list)):
        word = vocab_list[i][0]
        word2idx[word] = i + 1
        embeddings_matrix[i + 1] = vocab_list[i][1]

    return word2idx, embeddings_matrix

def make_deepLearn_data(w2vCorpus, word2idx):
    X_train = []
    for sen in w2vCorpus:
        wordList = []
        for w in sen:
            if w in word2idx:
                wordList.append(word2idx[w])
            else:
                wordList.append(0)  # out-of-vocabulary words map to the padding index
        X_train.append(np.array(wordList))

    X_train = np.array(sequence.pad_sequences(X_train, maxlen=TEXT_MAXLEN))  # must be an np.array()

    return X_train

def Lstm_model():  # note: do not call this LSTM(); shadowing the Keras layer causes confusing errors
    model = Sequential()
    model.add(Embedding(input_dim=len(embeddings_matrix),      # vocabulary size (including the padding row)
                        output_dim=len(embeddings_matrix[0]),  # embedding dimension
                        input_length=TEXT_MAXLEN,
                        weights=[embeddings_matrix],  # initialize with the pretrained vectors
                        trainable=False               # freeze the embeddings (no fine-tuning)
                       ))
    model.add(LSTM(units=20, return_sequences=False))  # units: output dimension
    model.add(Dropout(0.5))
    model.add(Dense(units=1, activation='sigmoid'))  # fully connected output layer
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

if __name__ == '__main__':
    df_data_ = df_data[0: 10000]  # load the raw data (df_data is prepared elsewhere)

    w2vCorpus = [sen.split() for sen in df_data_.分析字段]  # build the word2vec corpus
    w2vModel = train_W2V(w2vCorpus, size=100)  # train a word2vec model on it

    w2vModel = load_W2V(inPath, loader_mySelf=0)  # or load pretrained vectors instead
    word2idx, embeddings_matrix = make_word2idx_embedMatrix(w2vModel)  # build word2idx and the embedding matrix

    X_train = make_deepLearn_data(w2vCorpus, word2idx)  # convert the text into padded index sequences
    y_train = np.array(df_data_.特征类型)  # labels; must be an np.array()

    model = Lstm_model()
    model.fit(X_train[0: -2000], y_train[0: -2000], epochs=2, batch_size=10, verbose=1)
    score = model.evaluate(X_train[-2000:], y_train[-2000:])
    print(score)

 
