Various Pretrained Word Embeddings
Reposted from: SevenBlue
English Corpus
word2vec
Pre-trained vectors trained on part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in the phrase-learning paper by Mikolov et al. (arXiv:1310.4546, see References).
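A minimal loading sketch with gensim (an assumed tool choice; the file name is the standard download from the word2vec archive page linked in the References):

```python
# Minimal sketch: load the Google News vectors with gensim (assumed tooling).
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz",  # downloaded from the word2vec archive
    binary=True,       # the Google News vectors are distributed in binary word2vec format
    limit=500_000,     # optional: cap the vocabulary to keep memory use modest
)

print(wv["computer"].shape)                 # -> (300,)
print(wv.most_similar("computer", topn=5))  # nearest neighbours by cosine similarity
```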
fastText
1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
1 million word vectors trained with subword information on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
2 million word vectors trained on Common Crawl (600B tokens).
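These English releases ship as plain-text .vec files in word2vec format, so gensim can read them directly; a minimal sketch (assumed tooling, with the wiki-news-300d-1M.vec file unzipped from the English-vectors page in the References):

```python
# Minimal sketch: load an English fastText .vec file with gensim (assumed tooling).
# The .vec releases are plain text with a "vocab_size dimension" header line.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec", binary=False)

print(len(wv))                            # roughly 1,000,000 words
print(wv.most_similar("apple", topn=5))   # nearest neighbours in the 300-d space
```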
GloVe
Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download)
Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download)
Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)
Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download)
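GloVe files are also plain text, but they lack the word2vec header line, so loading them in gensim needs one extra flag; a minimal sketch (assuming gensim >= 4.0 and the glove.6B.100d.txt file extracted from the 6B download above):

```python
# Minimal sketch: load a GloVe file with gensim (assumed tooling, gensim >= 4.0).
# GloVe text files have no "vocab_size dimension" header, hence no_header=True.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "glove.6B.100d.txt",   # extracted from the 6B (Wikipedia + Gigaword) download
    binary=False,
    no_header=True,
)

print(wv.most_similar("frog", topn=5))
```

On older gensim versions, the gensim.scripts.glove2word2vec script performs the same conversion by prepending the header line.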
Chinese Corpus
word2vec
Chinese Wikipedia dump, vector size 300, corpus size 1 GB, vocabulary size 50,101, segmented with the Jieba tokenizer.
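Since that corpus was segmented with Jieba, query text should be segmented the same way before lookup; a minimal sketch (the file name and text format below are placeholders, adjust them to the actual release):

```python
# Minimal sketch: segment a query with Jieba, then look up vectors (assumed tooling).
# "zh_wiki_300d.vec" is a placeholder name for the downloaded Wikipedia vectors.
import jieba
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("zh_wiki_300d.vec", binary=False)

tokens = list(jieba.cut("预训练的词向量"))   # segment exactly as the training corpus was
for tok in tokens:
    if tok in wv:                            # skip out-of-vocabulary segments
        print(tok, wv[tok][:5])              # first 5 of the 300 dimensions
```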
fastText
Trained on Common Crawl and Wikipedia using fastText. These models were trained using CBOW with position weights, in dimension 300, with character n-grams of length 5, a window of size 5, and 10 negative samples. The Stanford word segmenter was used for tokenization.
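Because the .bin models keep the character n-grams, they can compose vectors for words outside the training vocabulary; a minimal sketch with the fasttext Python package (assumed tooling; cc.zh.300.bin is the file name used on the crawl-vectors page in the References):

```python
# Minimal sketch: load the Chinese crawl model with the fasttext package (assumed tooling).
# The .bin file retains subword (character n-gram) information, so OOV words still get vectors.
import fasttext

ft = fasttext.load_model("cc.zh.300.bin")

print(ft.get_dimension())                  # -> 300
print(ft.get_word_vector("词向量")[:5])     # composed from subwords even if unseen in training
```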
References
https://github.com/Hironsan/awesome-embedding-models
http://ahogrammer.com/2017/01/20/the-list-of-pretrained-word-embeddings/
https://code.google.com/archive/p/word2vec/
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
https://fasttext.cc/docs/en/english-vectors.html
https://arxiv.org/pdf/1310.4546.pdf