How to cluster similar sentences using BERT

Posted: 2019-09-01 07:16:45

Question:

For ELMo, FastText, and Word2Vec, I average the word embeddings within a sentence and use HDBSCAN/KMeans clustering to group similar sentences.

A good example implementation can be seen in this short article: http://ai.intelligentonlinetools.com/ml/text-clustering-word-embedding-machine-learning/
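(For context, this is roughly what I do today - a minimal sketch only; the gensim word2vec file path and the cluster count are placeholders, not taken from the linked article.)

from gensim.models import KeyedVectors
from sklearn.cluster import KMeans
import numpy as np

# Placeholder path: any pre-trained word2vec-format vectors will do
word_vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

sentences = ['A man is eating food.', 'A man is riding a horse.', 'A monkey is playing drums.']

def average_embedding(sentence):
    # Average the vectors of the in-vocabulary words of one sentence
    words = [w for w in sentence.lower().rstrip('.').split() if w in word_vectors]
    return np.mean([word_vectors[w] for w in words], axis=0)

embeddings = np.vstack([average_embedding(s) for s in sentences])
labels = KMeans(n_clusters=2).fit_predict(embeddings)  # or HDBSCAN instead of KMeans
print(labels)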

I would like to do the same thing using BERT (via the Hugging Face BERT Python package), but I am not quite sure how to extract the raw word/sentence vectors in order to feed them into a clustering algorithm. I know that BERT can output sentence representations - so how would I actually extract the raw vectors from a sentence?

Any information would be helpful.

Comments:

Don't use BERT for this; it was never trained with a semantic-similarity objective.

Answer 1:

BERT adds a special [CLS] token at the beginning of each sample/sentence. After fine-tuning on a downstream task, the embedding of this [CLS] token (the pooled_output in the Hugging Face implementation) represents the sentence embedding.

But I assume you don't have labels, so you won't be able to fine-tune, and therefore you can't use the pooled_output as a sentence embedding. Instead, you should use the word embeddings in encoded_layers, which is a tensor of shape (12, seq_len, 768). This tensor contains the embeddings (dimension 768) from each of BERT's 12 layers. To obtain word embeddings, you can use the output of the last layer, concatenate or sum the outputs of the last 4 layers, and so on.

Here is a script for extracting the features: https://github.com/ethanjperez/pytorch-pretrained-BERT/blob/master/examples/extract_features.py
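(Not part of the original answer: a minimal sketch of the same extraction with the current Hugging Face transformers API; the model name and the mean-pooling choice are illustrative assumptions.)

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()

sentences = ['A man is eating food.', 'A man is riding a horse.']
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    outputs = model(**encoded)

# Last hidden layer, shape (batch_size, seq_len, 768)
last_hidden = outputs.last_hidden_state

# Mean-pool over real tokens (padding masked out) to get one vector per sentence;
# alternatively take last_hidden[:, 0] for the [CLS] token, or combine the last
# 4 layers from outputs.hidden_states as described above.
mask = encoded['attention_mask'].unsqueeze(-1).float()
sentence_vectors = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(sentence_vectors.shape)  # torch.Size([2, 768])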

Comments:

BERT is pre-trained on a next-sentence prediction task, so I think the [CLS] token already encodes the sentence. However, I would rather use @Palak's solution below.

Answer 2:

As Subham Kumar mentioned, this Python 3 library can be used to compute sentence similarity: https://github.com/UKPLab/sentence-transformers

The library provides several code examples for performing clustering:

fast_clustering.py:

"""
This is a more complex example of performing clustering on a large-scale dataset.

This example finds local communities in a large set of sentences, i.e., groups of sentences that are highly
similar. You can freely configure the threshold for what is considered similar. A high threshold will
only find extremely similar sentences; a lower threshold will find more sentences that are less similar.

A second parameter is 'min_community_size': only communities with at least a certain number of sentences will be returned.

The method for finding the communities is extremely fast; clustering 50k sentences requires only about 5 seconds (plus embedding computation).

In this example, we download a large set of questions from Quora and then find similar questions in this set.
"""
from sentence_transformers import SentenceTransformer, util
import os
import csv
import time


# Model for computing sentence embeddings. We use one trained for similar questions detection
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# We download the Quora Duplicate Questions Dataset (https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs)
# and find similar questions in it
url = "http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
dataset_path = "quora_duplicate_questions.tsv"
max_corpus_size = 50000 # We limit our corpus to only the first 50k questions


# Check if the dataset exists. If not, download and extract
# Download dataset if needed
if not os.path.exists(dataset_path):
    print("Download dataset")
    util.http_get(url, dataset_path)

# Get all unique sentences from the file
corpus_sentences = set()
with open(dataset_path, encoding='utf8') as fIn:
    reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        corpus_sentences.add(row['question1'])
        corpus_sentences.add(row['question2'])
        if len(corpus_sentences) >= max_corpus_size:
            break

corpus_sentences = list(corpus_sentences)
print("Encode the corpus. This might take a while")
corpus_embeddings = model.encode(corpus_sentences, batch_size=64, show_progress_bar=True, convert_to_tensor=True)


print("Start clustering")
start_time = time.time()

#Two parameters to tune:
#min_community_size: Only consider clusters that have at least 25 elements
#threshold: Consider sentence pairs with a cosine-similarity larger than threshold as similar
clusters = util.community_detection(corpus_embeddings, min_community_size=25, threshold=0.75)

print("Clustering done after :.2f sec".format(time.time() - start_time))

#Print for all clusters the top 3 and bottom 3 elements
for i, cluster in enumerate(clusters):
    print("\nCluster , # Elements ".format(i+1, len(cluster)))
    for sentence_id in cluster[0:3]:
        print("\t", corpus_sentences[sentence_id])
    print("\t", "...")
    for sentence_id in cluster[-3:]:
        print("\t", corpus_sentences[sentence_id])

kmeans.py:

"""
This is a simple application for sentence embeddings: clustering

Sentences are mapped to sentence embeddings and then k-means clustering is applied.
"""
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

embedder = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'A man is eating pasta.',
          'The girl is carrying a baby.',
          'The baby is carried by the woman',
          'A man is riding a horse.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.',
          'A cheetah is running behind its prey.',
          'A cheetah chases prey on across a field.'
          ]
corpus_embeddings = embedder.encode(corpus)

# Perform k-means clustering
num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_

clustered_sentences = [[] for i in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(corpus[sentence_id])

for i, cluster in enumerate(clustered_sentences):
    print("Cluster ", i+1)
    print(cluster)
    print("")

agglomerative.py:

"""
This is a simple application for sentence embeddings: clustering

Sentences are mapped to sentence embeddings and then agglomerative clustering with a threshold is applied.
"""
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
import numpy as np

embedder = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'A man is eating pasta.',
          'The girl is carrying a baby.',
          'The baby is carried by the woman',
          'A man is riding a horse.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.',
          'A cheetah is running behind its prey.',
          'A cheetah chases prey on across a field.'
          ]
corpus_embeddings = embedder.encode(corpus)

# Normalize the embeddings to unit length
corpus_embeddings = corpus_embeddings /  np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)

# Perform agglomerative clustering
clustering_model = AgglomerativeClustering(n_clusters=None, distance_threshold=1.5) #, affinity='cosine', linkage='average', distance_threshold=0.4)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_

clustered_sentences = {}
for sentence_id, cluster_id in enumerate(cluster_assignment):
    if cluster_id not in clustered_sentences:
        clustered_sentences[cluster_id] = []

    clustered_sentences[cluster_id].append(corpus[sentence_id])

for i, cluster in clustered_sentences.items():
    print("Cluster ", i+1)
    print(cluster)
    print("")


Answer 3:

Not sure if you still need this, but a recent paper describes how to use document embeddings to cluster documents and extract words from each cluster to represent a topic. Here are the links: https://arxiv.org/pdf/2008.09470.pdf, https://github.com/ddangelov/Top2Vec

Inspired by the above paper, another topic-modeling algorithm that uses BERT to generate sentence embeddings is described here: https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6, https://github.com/MaartenGr/BERTopic

Both of the above libraries provide an end-to-end solution for extracting topics from a corpus. However, if you are only interested in generating sentence embeddings, look at Gensim's doc2vec (https://radimrehurek.com/gensim/models/doc2vec.html) or sentence-transformers (https://github.com/UKPLab/sentence-transformers), as mentioned in the other answers. If you use sentence-transformers, it is recommended to train the model on a domain-specific corpus to get good results.
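(A minimal BERTopic usage sketch, not from the original answer; the 20 Newsgroups sample is just a stand-in for your own corpus.)

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Any reasonably large list of strings works; 20 Newsgroups is used here only as a demo corpus
docs = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes')).data[:2000]

topic_model = BERTopic()                         # embeds documents with sentence-transformers under the hood
topics, probs = topic_model.fit_transform(docs)  # cluster documents and extract topic words

print(topic_model.get_topic_info())              # one row per discovered topic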


Answer 4:

You can use Sentence Transformers to generate sentence embeddings. Compared with embeddings obtained from bert-as-service, these embeddings are much more meaningful, since they have been fine-tuned so that semantically similar sentences receive higher similarity scores. If the number of sentences to cluster is in the millions or more, you can use a FAISS-based clustering algorithm, since an ordinary clustering algorithm like K-means takes quadratic time.
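(A rough sketch of that FAISS route, not from the original answer; the faiss-cpu package, the model name, and the cluster count are assumptions.)

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
sentences = ['A man is eating food.',
             'A man is eating a piece of bread.',
             'A man is riding a horse.',
             'A monkey is playing drums.']

# FAISS expects contiguous float32 arrays
embeddings = np.ascontiguousarray(model.encode(sentences), dtype='float32')

num_clusters = 2
kmeans = faiss.Kmeans(d=embeddings.shape[1], k=num_clusters, niter=20, verbose=False)
kmeans.train(embeddings)

# Assign each sentence to its nearest centroid
_, assignments = kmeans.index.search(embeddings, 1)
for sentence, cluster_id in zip(sentences, assignments.ravel()):
    print(cluster_id, sentence)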

Comments:

It baffles me why so many people try to use BERT embeddings for semantic similarity. BERT was never trained with a semantic-similarity objective.

Hey @jamix. Note that we are not using vanilla BERT embeddings directly here. We modified the downstream task with a Siamese-like network that produces rich sentence embeddings. Please read this paper: arxiv.org/abs/1908.10084

Thanks! In my comment I actually agree with your approach. The rant was aimed at the original question, which uses vanilla BERT.

Answer 5:

You first need to generate BERT embeddings for the sentences. bert-as-service provides a very easy way to generate sentence embeddings.

This is how you generate BERT vectors for the list of sentences you need to cluster. It is explained very well in the bert-as-service repository: https://github.com/hanxiao/bert-as-service

Installation:

pip install bert-serving-server  # server
pip install bert-serving-client  # client, independent of `bert-serving-server`

Download one of the pre-trained models available at https://github.com/google-research/bert

Start the service:

bert-serving-start -model_dir /your_model_directory/ -num_worker=4 

Generate vectors for the list of sentences:

from bert_serving.client import BertClient
bc = BertClient()
vectors=bc.encode(your_list_of_sentences)

This gives you a list of vectors that you can write to a CSV and feed into any clustering algorithm, since the sentences have been reduced to numbers.
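(From there, a minimal sketch of feeding those vectors into HDBSCAN, the clusterer the question mentions; the hdbscan package and its parameters are assumptions, not part of this answer.)

import hdbscan
import numpy as np

# `vectors` and `your_list_of_sentences` come from the bert-as-service snippet above
vectors = np.asarray(vectors)

clusterer = hdbscan.HDBSCAN(min_cluster_size=5, metric='euclidean')
labels = clusterer.fit_predict(vectors)  # label -1 marks noise points

for sentence, label in zip(your_list_of_sentences, labels):
    print(label, sentence)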

Comments:

Great solution; worked for my 42,000 hashtags.

BERT is not optimized for generating sentence vectors or for assessing similarity with metrics such as cosine similarity. Even though it may work, the results can be misleading. See this discussion: github.com/UKPLab/sentence-transformers/issues/80

That's fine as long as you use a fine-tuned BERT made specifically for this, such as Sentence-BERT.
