Various Pretrained Word Embeddings
Reposted from: SevenBlue
English Corpus
word2vec
Pre-trained vectors trained on part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in the phrase-learning paper by Mikolov et al. (arXiv:1310.4546, see References).
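A minimal loading sketch with gensim (an assumed tool choice; the file name is the standard download from the word2vec archive page linked in the References):

```python
# Minimal sketch: load the Google News vectors with gensim (assumed tooling).
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz",  # downloaded from the word2vec archive
    binary=True,       # the Google News vectors are distributed in binary word2vec format
    limit=500_000,     # optional: cap the vocabulary to keep memory use modest
)

print(wv["computer"].shape)                 # -> (300,)
print(wv.most_similar("computer", topn=5))  # nearest neighbours by cosine similarity
```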
fastText
1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
1 million word vectors trained with subword information on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
2 million word vectors trained on Common Crawl (600B tokens).
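These English releases ship as plain-text .vec files in word2vec format, so gensim can read them directly; a minimal sketch (assumed tooling, with the wiki-news-300d-1M.vec file unzipped from the English-vectors page in the References):

```python
# Minimal sketch: load an English fastText .vec file with gensim (assumed tooling).
# The .vec releases are plain text with a "vocab_size dimension" header line.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec", binary=False)

print(len(wv))                            # roughly 1,000,000 words
print(wv.most_similar("apple", topn=5))   # nearest neighbours in the 300-d space
```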
GloVe
Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download)
Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download)
Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)
Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download)
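GloVe files are also plain text, but they lack the word2vec header line, so loading them in gensim needs one extra flag; a minimal sketch (assuming gensim >= 4.0 and the glove.6B.100d.txt file extracted from the 6B download above):

```python
# Minimal sketch: load a GloVe file with gensim (assumed tooling, gensim >= 4.0).
# GloVe text files have no "vocab_size dimension" header, hence no_header=True.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "glove.6B.100d.txt",   # extracted from the 6B (Wikipedia + Gigaword) download
    binary=False,
    no_header=True,
)

print(wv.most_similar("frog", topn=5))
```

On older gensim versions, the gensim.scripts.glove2word2vec script performs the same conversion by prepending the header line.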
Chinese Corpus
word2vec
Chinese Wikipedia dump, vector size 300, corpus size 1 GB, vocabulary size 50,101, segmented with the Jieba tokenizer.
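Since that corpus was segmented with Jieba, query text should be segmented the same way before lookup; a minimal sketch (the file name and text format below are placeholders, adjust them to the actual release):

```python
# Minimal sketch: segment a query with Jieba, then look up vectors (assumed tooling).
# "zh_wiki_300d.vec" is a placeholder name for the downloaded Wikipedia vectors.
import jieba
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("zh_wiki_300d.vec", binary=False)

tokens = list(jieba.cut("预训练的词向量"))   # segment exactly as the training corpus was
for tok in tokens:
    if tok in wv:                            # skip out-of-vocabulary segments
        print(tok, wv[tok][:5])              # first 5 of the 300 dimensions
```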
fastText
Trained on Common Crawl and Wikipedia using fastText. These models were trained using CBOW with position weights, in dimension 300, with character n-grams of length 5, a window of size 5, and 10 negative samples. The Stanford word segmenter was used for tokenization.
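Because the .bin models keep the character n-grams, they can compose vectors for words outside the training vocabulary; a minimal sketch with the fasttext Python package (assumed tooling; cc.zh.300.bin is the file name used on the crawl-vectors page in the References):

```python
# Minimal sketch: load the Chinese crawl model with the fasttext package (assumed tooling).
# The .bin file retains subword (character n-gram) information, so OOV words still get vectors.
import fasttext

ft = fasttext.load_model("cc.zh.300.bin")

print(ft.get_dimension())                  # -> 300
print(ft.get_word_vector("词向量")[:5])     # composed from subwords even if unseen in training
```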
References
https://github.com/Hironsan/awesome-embedding-models
http://ahogrammer.com/2017/01/20/the-list-of-pretrained-word-embeddings/
https://code.google.com/archive/p/word2vec/
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
https://fasttext.cc/docs/en/english-vectors.html
https://arxiv.org/pdf/1310.4546.pdf