模糊分组依据，对相似词进行分组

Posted 2023-03-12

技术标签:

【中文标题】模糊分组依据，对相似词进行分组【英文标题】：Fuzzy Group By, Grouping Similar Words 【发布时间】：2012-07-17 03:22:51 【问题描述】：

这个问题之前在这里问过

What is a good strategy to group similar words?

但没有给出关于如何“分组”项目的明确答案。基于 difflib 的解决方案基本上是搜索，对于给定的项目，difflib 可以从列表中返回最相似的单词。但是这如何用于分组呢？

我想减少

['ape', 'appel', 'apple', 'peach', 'puppy']

到

['ape', 'appel', 'peach', 'puppy']

或

['ape', 'apple', 'peach', 'puppy']

我尝试过的一个想法是，对于每个项目，遍历列表，如果 get_close_matches 返回多个匹配项，则使用它，如果不保持原样。这部分有效，但它可以建议 apple for appel，然后 appel for apple，这些词只会换位，什么都不会改变。

如果有任何指针、库名称等，我将不胜感激。

注意：同样在性能方面，我们有一个包含 300,000 个项目的列表，get_close_matches 似乎有点慢。有人知道那里有基于 C/++ 的解决方案吗？

谢谢，

注意：进一步调查显示 kmedoid 是正确的算法（以及层次聚类），因为 kmedoid 不需要“中心”，它使用/使用数据点本身作为中心（这些点被称为 medoids，因此得名） .在词分组的情况下，中心点将是该组/簇的代表元素。

【问题讨论】：

【参考方案1】：

这是另一个使用 Affinity Propagation 算法的版本。

import numpy as np
import scipy.linalg as lin
import Levenshtein as leven
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.cluster import AffinityPropagation
import itertools

words = np.array(
    ['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
     'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
     'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
     'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
     'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
     'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
     'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
     'people', 'into', 'year', 'your', 'good', 'some', 'could',
     'them', 'see', 'other', 'than', 'then', 'now', 'look',
     'only', 'come', 'its', 'over', 'think', 'also', 'back',
     'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
     'way', 'even', 'new', 'want', 'because', 'any', 'these',
     'give', 'day', 'most', 'us'])

print "calculating distances..."

(dim,) = words.shape

f = lambda (x,y): -leven.distance(x,y)

res=np.fromiter(itertools.imap(f, itertools.product(words, words)), dtype=np.uint8)
A = np.reshape(res,(dim,dim))

af = AffinityPropagation().fit(A)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_

unique_labels = set(labels)
for i in unique_labels:
    print words[labels==i]

距离必须转换为相似度，我通过取负数来做到这一点。输出是

['to' 'you' 'do' 'by' 'so' 'who' 'go' 'into' 'also' 'two']
['it' 'with' 'at' 'if' 'get' 'its' 'first']
['of' 'for' 'from' 'or' 'your' 'look' 'after' 'work']
['the' 'be' 'have' 'I' 'he' 'we' 'her' 'she' 'me' 'give']
['this' 'his' 'which' 'him']
['and' 'a' 'in' 'an' 'my' 'all' 'can' 'any']
['on' 'one' 'good' 'some' 'see' 'only' 'come' 'over']
['would' 'could']
['but' 'out' 'about' 'our' 'most']
['make' 'like' 'time' 'take' 'back']
['that' 'they' 'there' 'their' 'when' 'them' 'other' 'than' 'then' 'think'
 'even' 'these']
['not' 'no' 'know' 'now' 'how' 'new']
['will' 'people' 'year' 'well']
['say' 'what' 'way' 'want' 'day']
['because']
['as' 'up' 'just' 'use' 'us']

【讨论】：

首先要说声谢谢你救了我这么多天的研究！我在您的代码中添加了改进，而不是在亲和传播算法中使用默认的欧几里德距离，我将其更改为：预先计算，因此我更改了此 af = AffinityPropagation(affinity='precomputed').fit(A) 和获得比默认更好的结果。【参考方案2】：

另一种方法是使用矩阵分解，使用 SVD。首先我们创建单词距离矩阵，对于 100 个单词，这将是 100 x 100 矩阵，表示每个单词到所有其他单词的距离。然后，在这个矩阵上运行 SVD，得到的 u,s,v 中的 u 可以看作是每个集群的成员强度。

代码

import numpy as np
import scipy.linalg as lin
import Levenshtein as leven
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import itertools

words = np.array(
    ['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
     'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
     'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
     'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
     'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
     'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
     'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
     'people', 'into', 'year', 'your', 'good', 'some', 'could',
     'them', 'see', 'other', 'than', 'then', 'now', 'look',
     'only', 'come', 'its', 'over', 'think', 'also', 'back',
     'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
     'way', 'even', 'new', 'want', 'because', 'any', 'these',
     'give', 'day', 'most', 'us'])

print "calculating distances..."

(dim,) = words.shape

f = lambda (x,y): leven.distance(x,y)
res=np.fromiter(itertools.imap(f, itertools.product(words, words)),
                dtype=np.uint8)
A = np.reshape(res,(dim,dim))

print "svd..."

u,s,v = lin.svd(A, full_matrices=False)

print u.shape
print s.shape
print s
print v.shape

data = u[:,0:10]
k=KMeans(init='k-means++', k=25, n_init=10)
k.fit(data)
centroids = k.cluster_centers_
labels = k.labels_
print labels

for i in range(np.max(labels)):
    print words[labels==i]

def dist(x,y):   
    return np.sqrt(np.sum((x-y)**2, axis=1))

print "centroid points.."
for i,c in enumerate(centroids):
    idx = np.argmin(dist(c,data[labels==i]))
    print words[labels==i][idx]
    print words[labels==i]

plt.plot(centroids[:,0],centroids[:,1],'x')
plt.hold(True)
plt.plot(u[:,0], u[:,1], '.')
plt.show()

from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = Axes3D(fig)
ax.plot(u[:,0], u[:,1], u[:,2],'.', zs=0,
        zdir='z', label='zs=0, zdir=z')
plt.show()

结果

any
['and' 'an' 'can' 'any']
do
['to' 'you' 'do' 'so' 'go' 'no' 'two' 'how']
when
['who' 'when' 'well']
my
['be' 'I' 'by' 'we' 'my' 'up' 'me' 'use']
your
['for' 'or' 'out' 'about' 'your' 'our']
its
['it' 'his' 'if' 'him' 'its']
could
['would' 'people' 'could']
this
['this' 'think' 'these']
she
['the' 'he' 'she' 'see']
back
['all' 'back' 'want']
one
['of' 'on' 'one' 'only' 'even' 'new']
just
['but' 'just' 'first' 'most']
come
['some' 'come']
that
['that' 'than']
way
['say' 'what' 'way' 'day']
like
['like' 'time' 'give']
in
['in' 'into']
get
['her' 'get' 'year']
because
['because']
will
['with' 'will' 'which']
over
['other' 'over' 'after']
as
['a' 'as' 'at' 'also' 'us']
them
['they' 'there' 'their' 'them' 'then']
good
['not' 'from' 'know' 'good' 'now' 'look' 'work']
have
['have' 'make' 'take']

为聚类数选择 k 很重要，例如，k=25 比 k=20 提供更好的结果。

该代码还通过选择其 u[..] 坐标最接近集群质心的词来为每个集群选择一个代表词。

【讨论】：

【参考方案3】：

这是一种基于 medoids 的方法。首先安装 MlPy。在 Ubuntu 上

sudo apt-get install python-mlpy

然后

import numpy as np
import mlpy

class distance:    
    def compute(self, s1, s2):
        l1 = len(s1)
        l2 = len(s2)
        matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)]
        for zz in xrange(0,l2):
            for sz in xrange(0,l1):
                if s1[sz] == s2[zz]:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
                else:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
        return matrix[l2][l1]

x =  np.array(['ape', 'appel', 'apple', 'peach', 'puppy'])

km = mlpy.Kmedoids(k=3, dist=distance())
medoids,clusters,a,b = km.compute(x)

print medoids
print clusters
print a

print x[medoids] 
for i,c in enumerate(x[medoids]):
    print "medoid", c
    print x[clusters[a==i]]

输出是

[4 3 1]
[0 2]
[2 2]
['puppy' 'peach' 'appel']
medoid puppy
[]
medoid peach
[]
medoid appel
['ape' 'apple']

更大的单词列表和使用 k=10

medoid he
['or' 'his' 'my' 'have' 'if' 'year' 'of' 'who' 'us' 'use' 'people' 'see'
 'make' 'be' 'up' 'we' 'the' 'one' 'her' 'by' 'it' 'him' 'she' 'me' 'over'
 'after' 'get' 'what' 'I']
medoid out
['just' 'only' 'your' 'you' 'could' 'our' 'most' 'first' 'would' 'but'
 'about']
medoid to
['from' 'go' 'its' 'do' 'into' 'so' 'for' 'also' 'no' 'two']
medoid now
['new' 'how' 'know' 'not']
medoid time
['like' 'take' 'come' 'some' 'give']
medoid because
[]
medoid an
['want' 'on' 'in' 'back' 'say' 'and' 'a' 'all' 'can' 'as' 'way' 'at' 'day'
 'any']
medoid look
['work' 'good']
medoid will
['with' 'well' 'which']
medoid then
['think' 'that' 'these' 'even' 'their' 'when' 'other' 'this' 'they' 'there'
 'than' 'them']

【讨论】：

【参考方案4】：

您需要对组进行规范化。在每个组中，选择一个代表该组的单词或编码。然后按他们的代表对单词进行分组。

一些可能的方法：

选择第一个遇到的单词。选择词典中的第一个单词。为所有单词导出一个模式。选择一个唯一索引。使用soundex 作为模式。

但是，对单词进行分组可能很困难。如果 A 与 B 相似，B 与 C 相似，则 A 和 C 不一定彼此相似。如果 B 是代表，则 A 和 C 都可以包含在该组中。但如果A或C是代表，则不能包括另一个。

按照第一个选择（第一个遇到的单词）：

class Seeder:
    def __init__(self):
        self.seeds = set()
        self.cache = dict()

    def get_seed(self, word):
        LIMIT = 2
        seed = self.cache.get(word,None)
        if seed is not None:
            return seed
        for seed in self.seeds:
            if self.distance(seed, word) <= LIMIT:
                self.cache[word] = seed
                return seed
        self.seeds.add(word)
        self.cache[word] = word
        return word

    def distance(self, s1, s2):
        l1 = len(s1)
        l2 = len(s2)
        matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)]
        for zz in xrange(0,l2):
            for sz in xrange(0,l1):
                if s1[sz] == s2[zz]:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
                else:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
        return matrix[l2][l1]

import itertools

def group_similar(words):
    seeder = Seeder()
    words = sorted(words, key=seeder.get_seed)
    groups = itertools.groupby(words, key=seeder.get_seed)
    return [list(v) for k,v in groups]

示例：

import pprint

print pprint.pprint(group_similar([
    'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
    'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
    'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
    'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
    'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
    'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
    'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
    'people', 'into', 'year', 'your', 'good', 'some', 'could',
    'them', 'see', 'other', 'than', 'then', 'now', 'look',
    'only', 'come', 'its', 'over', 'think', 'also', 'back',
    'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
    'way', 'even', 'new', 'want', 'because', 'any', 'these',
    'give', 'day', 'most', 'us'
]), width=120)

输出：

[['after'],
 ['also'],
 ['and', 'a', 'in', 'on', 'as', 'at', 'an', 'one', 'all', 'can', 'no', 'want', 'any'],
 ['back'],
 ['because'],
 ['but', 'about', 'get', 'just'],
 ['first'],
 ['from'],
 ['good', 'look'],
 ['have', 'make', 'give'],
 ['his', 'her', 'if', 'him', 'its', 'how', 'us'],
 ['into'],
 ['know', 'new'],
 ['like', 'time', 'take'],
 ['most'],
 ['of', 'I', 'it', 'for', 'not', 'he', 'you', 'do', 'by', 'we', 'or', 'my', 'so', 'up', 'out', 'go', 'me', 'now'],
 ['only'],
 ['over', 'our', 'even'],
 ['people'],
 ['say', 'she', 'way', 'day'],
 ['some', 'see', 'come'],
 ['the', 'be', 'to', 'that', 'this', 'they', 'there', 'their', 'them', 'other', 'then', 'use', 'two', 'these'],
 ['think'],
 ['well'],
 ['what', 'who', 'when', 'than'],
 ['with', 'will', 'which'],
 ['work'],
 ['would', 'could'],
 ['year', 'your']]

【讨论】：

找到组将是困难的部分。我想我可以使用一个聚类算法，例如levenshtein distance 作为距离度量。确定集群后，我选择一个词（其中任何一个）作为该组的代表。另外，您可以使用两组单词之间的平均成对 Levenshtein 距离作为它们之间的距离度量（许多层次聚类算法都是这样工作的）。最大成对距离也可能有效。组代表的可爱解说。但是，我会推荐（双）变音器而不是 soundex。 +1 Markus，你能把主线添加到上面的脚本中吗？我在您的代码中添加了距离度量（将添加到我的主要问题中），我在显示组时遇到了问题。【参考方案5】：

您必须在封闭匹配词中决定要使用哪些词。可能是从 get_close_matches 返回的列表中获取第一个元素，或者只是在该列表上使用随机函数并从封闭匹配中获取一个元素。

必须有某种规则，对于它..

In [19]: import difflib

In [20]: a = ['ape', 'appel', 'apple', 'peach', 'puppy']

In [21]: a = ['appel', 'apple', 'peach', 'puppy']

In [22]: b = difflib.get_close_matches('ape',a)

In [23]: b
Out[23]: ['apple', 'appel']

In [24]: import random

In [25]: c = random.choice(b)

In [26]: c
Out[26]: 'apple'

In [27]:

现在从初始列表中删除 c，就是这样... 对于c++，可以使用Levenshtein_distance

【讨论】：

以上是关于模糊分组依据，对相似词进行分组的主要内容，如果未能解决你的问题，请参考以下文章