gensim Doc2Vec vs tensorflow Doc2Vec

Posted 2023-02-23


Asked: 2017-02-12 02:31:26

Question:

I am trying to compare my implementation of Doc2Vec (via tf) with gensim's implementation. At least visually, the gensim one seems to perform better.

I ran the following code to train the gensim model, and the code below it to train the tensorflow model. My questions are as follows:

window=5

Gensim

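import multiprocessing
from gensim.models.doc2vec import Doc2Vec

cores = multiprocessing.cpu_count()
# gensim 4.x names: vector_size (formerly size); corpus is an iterable of TaggedDocument.
model = Doc2Vec(dm=1, dm_concat=1, vector_size=100, window=5, negative=10, hs=0, min_count=2, workers=cores)
model.build_vocab(corpus)
epochs = 100
# One train() call runs all epochs and manages the learning-rate decay itself.
model.train(corpus, total_examples=model.corpus_count, epochs=epochs)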

TF

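import numpy as np
import tensorflow.compat.v1 as tf  # the code below uses the TF1 graph API

# vocabulary_size, len_docs and context_window are assumed to be defined earlier.
batch_size = 512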
embedding_size = 100 # Dimension of the embedding vector.
num_sampled = 10 # Number of negative examples to sample.


graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):
    # Input data.
    # Shapes must be integers, so use floor division (batch_size/context_window is a float in Python 3).
    train_word_dataset = tf.placeholder(tf.int32, shape=[batch_size])
    train_doc_dataset = tf.placeholder(tf.int32, shape=[batch_size // context_window])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size // context_window, 1])

    # The variables   
    word_embeddings =  tf.Variable(tf.random_uniform([vocabulary_size,embedding_size],-1.0,1.0))
    doc_embeddings = tf.Variable(tf.random_uniform([len_docs,embedding_size],-1.0,1.0))
    softmax_weights = tf.Variable(tf.truncated_normal([vocabulary_size, (context_window+1)*embedding_size],
                             stddev=1.0 / np.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))

    ###########################
    # Model.
    ###########################
    # Look up embeddings for inputs and stack words side by side
    embed_words = tf.reshape(tf.nn.embedding_lookup(word_embeddings, train_word_dataset),
                            shape=[batch_size // context_window, -1])
    embed_docs = tf.nn.embedding_lookup(doc_embeddings, train_doc_dataset)
    # Since TF 1.0 the axis is the second argument of tf.concat.
    embed = tf.concat([embed_words, embed_docs], axis=1)
    # Compute the softmax loss, using a sample of the negative labels each time.
    # Keyword arguments avoid the labels/inputs ordering change between TF releases.
    loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases,
                                   labels=train_labels, inputs=embed,
                                   num_sampled=num_sampled, num_classes=vocabulary_size))

    # Optimizer.
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)

Update:

See the jupyter notebook here (I have both models working and tested in it). It still feels like the gensim model performs better in this initial analysis.

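Comments:

A proper discussion of this question can be found here: groups.google.com/forum/#!topic/gensim/0GVxA055yOU

According to the documentation, "window is the maximum distance between the predicted word and context words used for prediction within a document". So it is 5 words on each side. Also, can you tell me what negative or num_sampled means? I don't quite understand it.

The negative sampling method is described in one of the Mikolov papers. AFAIR it reduces the number of parameters that are updated in each learning step.

Note that dm_concat mode results in a much larger, slower-to-train model that will likely need a lot more data (or training passes) than the more commonly used PV-DBOW or PV-DM-with-context-window-averaging. I originally added the dm_concat mode to gensim to try to closely reproduce the results of the "Paragraph Vector" paper, which supposedly used that mode. (I couldn't, and neither has anyone else who has tried.) I personally have not found any dataset/evaluation where dm_concat is worth the extra effort, but maybe such cases exist with very large document corpora.

Answer 1: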

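An old question, but an answer may still be useful to future visitors. Here are some of my thoughts.

There are a few problems in the tensorflow implementation:

window is the one-sided size, so window=5 spans 5*2+1 = 11 words.

Note that for the PV-DM version of doc2vec, batch_size should be the number of documents. So the shape of train_word_dataset would be batch_size * context_window, while the shapes of train_doc_dataset and train_labels would be batch_size.

More importantly, sampled_softmax_loss is not negative_sampling_loss. They are two different approximations of softmax_loss.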
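The shape bookkeeping described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper name pv_dm_examples and the "<pad>" token are hypothetical and come from neither gensim nor the original code. It builds one training example per target word, with window words of context on each side.

```python
def pv_dm_examples(docs, window, pad="<pad>"):
    """Build PV-DM style examples: (context words, document id, target word)."""
    words, doc_ids, labels = [], [], []
    for doc_id, tokens in enumerate(docs):
        # Pad both ends so every target has exactly 2 * window context words.
        padded = [pad] * window + list(tokens) + [pad] * window
        for i, target in enumerate(tokens):
            centre = i + window
            context = padded[centre - window:centre] + padded[centre + 1:centre + window + 1]
            words.append(context)
            doc_ids.append(doc_id)
            labels.append(target)
    return words, doc_ids, labels


docs = [["a", "b", "c"], ["d", "e"]]
words, doc_ids, labels = pv_dm_examples(docs, window=2)
# One row per example: words is batch x (2 * window), doc_ids and labels are batch.
print(len(words), len(words[0]), len(doc_ids), len(labels))
```

With window=2 each example covers 2*2+1 = 5 positions (four context words plus the target), which mirrors the 5*2+1 = 11 count for window=5.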

So, for the questions the OP listed:

[The numbered answers were lost in extraction. Surviving inline terms, in order: tensorflow, doc2vec, gensim, window, gensim, min_count, gensim, negative_sampling_loss, sampled_softmax_loss.]
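The difference between negative_sampling_loss and sampled_softmax_loss mentioned in the answer can be sketched for a single training example. This is a simplified illustration in plain Python, not the gensim or tensorflow code: the function names are hypothetical, the scores stand for dot products between the input vector and the output vectors, and the sampling-probability correction that tensorflow's sampled softmax applies to the logits is omitted here.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def negative_sampling_loss(pos_score, neg_scores):
    # Independent binary decisions: the true word vs. each sampled noise word.
    return -log_sigmoid(pos_score) - sum(log_sigmoid(-s) for s in neg_scores)


def sampled_softmax_loss(pos_score, neg_scores):
    # One multi-class decision, normalised over the true word plus the samples.
    scores = [pos_score] + list(neg_scores)
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_z - pos_score


pos, negs = 2.0, [0.0, -1.0]
print(negative_sampling_loss(pos, negs), sampled_softmax_loss(pos, negs))
```

The two functions return different values for the same scores, which is the point of the answer: matching negative=10 in gensim with num_sampled=10 in tensorflow does not make the two objectives equivalent.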

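Comments:

«In neural networks, most local optima are "good enough"» I think it is more correct to say that in high-dimensional problems, such as neural networks, most local minima are actually saddle points, so they are easy to escape, especially when more stochastic steps are used.

That's right: in high-dimensional problems most critical points are saddle points, but the stochastic dynamics drive the solution toward local optima rather than saddle points, except perhaps for very flat and wide saddle points. The point is that most of the local optima that are found are good enough, because experience shows that different local optima usually have almost the same generalization performance, which is very interesting. The explanation may lie in the stochastic dynamics, which also push the solution toward flat and wide local optima rather than sharp and narrow ones.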
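The claim that stochastic steps escape saddle points can be illustrated with a toy example. The function f(x, y) = x**2 - y**2 + y**4 / 4 is an assumption chosen for illustration and does not come from the thread: it has a saddle at (0, 0) and two minima at (0, ±sqrt(2)). Plain gradient descent started exactly on the line y = 0 slides into the saddle, while the same update with a little Gaussian noise leaves it and settles near a minimum.

```python
import random


def grad(x, y):
    # Gradient of the toy function f(x, y) = x**2 - y**2 + y**4 / 4.
    return 2.0 * x, -2.0 * y + y ** 3


def descend(noise_std, steps=500, lr=0.1, seed=0):
    rng = random.Random(seed)
    x, y = 1.0, 0.0  # start on the line y = 0, which leads into the saddle
    for _ in range(steps):
        gx, gy = grad(x, y)
        # Gradient step plus optional Gaussian noise, a crude stand-in for SGD noise.
        x -= lr * gx + rng.gauss(0.0, noise_std)
        y -= lr * gy + rng.gauss(0.0, noise_std)
    return x, y


plain = descend(noise_std=0.0)   # stays at y = 0 and converges to the saddle
noisy = descend(noise_std=0.01)  # noise pushes y away from 0 toward +/- sqrt(2)
print(plain, noisy)
```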
