Clustering company names in Python when no standard list exists


【Title】: clustering of company names in python when standard list is not there 【Posted】: 2020-11-11 10:26:43 【Question】:

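I have a list of company names in a pandas DataFrame. I want to group the similar names, review the groups, and create a standard name for each group. Most of the solutions I have seen map values onto a known standard value, but I only want to group similar entries together. In many cases the names may not start with the same word.

Example: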

    ANADARKO E & P CO LP
    E & P COMPANY ANADARKO  LIMITED PRTNRSHIP
    E & P ONSHORE LLC ANADARKO 
    PET ANADARKO 
    ANADARKO PET CORP
    ANADARKO PETROLEUM CORPORATION
    PROD ANADARKO 
    ANADARKO PROD CO
    ANADARKO PRODUCTION COMPANY

fuzzywuzzy works very well when I have a standard list to match against. How can the values be grouped when no standard list exists?


【Answer 1】:

How about something like this?

document = ["This is the most beautiful place in the world.", "This man has more skills to show in cricket than any other game.", "Hi there! how was your ladakh trip last month?", "There was a player who had scored 200+ runs in single cricket innings in his career.", "I have got the opportunity to travel to Paris next year for my internship.", "May be he is better than you in batting but you are much better than him in bowling.", "That was really a great day for me when I was there at Lavasa for the whole night.", "That’s exactly I wanted to become, a highest ratting batsmen ever with top scores.", "Does it really matter wether you go to Thailand or Goa, its just you have spend your holidays.", "Why don’t you go to Switzerland next year for your 25th Wedding anniversary?", "Travel is fatal to prejudice, bigotry, and narrow mindedness., and many of our people need it sorely on these accounts.", "Stop worrying about the potholes in the road and enjoy the journey.", "No cricket team in the world depends on one or two players. The team always plays to win.", "Cricket is a team game. If you want fame for yourself, go play an individual game.", "Because in the end, you won’t remember the time you spent working in the office or mowing your lawn. Climb that goddamn mountain.", "Isn’t cricket supposed to be a team sport? I feel people should decide first whether cricket is a team game or an individual sport."]


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(document)

true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

order_centroids = model.cluster_centers_.argsort()[:, ::-1]
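terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

# Print the ten highest-weighted terms for each cluster centroid
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print('%s' % terms[ind])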

Result:

Cluster 0:
cricket
team
game
world
better
year
really
travel
place
beautiful
Cluster 1:
worrying
road
enjoy
journey
stop
potholes
year
highest
goa
goddamn

Finally, you can use the fitted model to make predictions:

print("\n")
print("Prediction")
X = vectorizer.transform(["Nothing is easy in cricket. Maybe when you watch it on TV, it looks easy. But it is not. You have to use your brain and time the ball."])
predicted = model.predict(X)
print(predicted)

Result:

Prediction
[1]


【Answer 2】:

See this link: https://towardsdatascience.com/group-thousands-of-similar-spreadsheet-text-cells-in-seconds-2493b3ce6d8d

You may want to run cleanco first to normalize the names:

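import pandas as pd
from textpack import tp
from cleanco import cleanco  # class-based API from older cleanco releases; newer versions expose basename()

# Strip legal suffixes (e.g. "LLC", "CORP") from string values; leave non-strings untouched
df['Name_Trimmed'] = df['names'].apply(lambda x: cleanco(x).clean_name() if isinstance(x, str) else x)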
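
To make the effect of this cleaning step concrete, here is a minimal standard-library sketch of the same idea. It is not cleanco's implementation: `LEGAL_TERMS`, `ABBREVIATIONS`, and `normalize_name` are hypothetical names, and both term lists are assumptions that cover only the example names from the question.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately short lists; real data would need longer ones.
LEGAL_TERMS = {"CO", "COMPANY", "CORP", "CORPORATION", "LLC", "LP",
               "LIMITED", "PRTNRSHIP", "PARTNERSHIP", "INC", "LTD"}
ABBREVIATIONS = {"PET": "PETROLEUM", "PROD": "PRODUCTION"}


def normalize_name(name):
    """Upper-case, expand abbreviations, drop legal terms, and sort the tokens."""
    if not isinstance(name, str):
        return name  # leave NaN / None untouched
    tokens = re.findall(r"[A-Z0-9]+", name.upper())
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    tokens = [t for t in tokens if t not in LEGAL_TERMS]
    return " ".join(sorted(tokens))  # sorting makes word order irrelevant


names = [
    "ANADARKO E & P CO LP",
    "E & P COMPANY ANADARKO  LIMITED PRTNRSHIP",
    "E & P ONSHORE LLC ANADARKO ",
    "PET ANADARKO ",
    "ANADARKO PET CORP",
    "ANADARKO PETROLEUM CORPORATION",
    "PROD ANADARKO ",
    "ANADARKO PROD CO",
    "ANADARKO PRODUCTION COMPANY",
]

# Names that normalize to the same key end up in the same bucket.
groups = defaultdict(list)
for name in names:
    groups[normalize_name(name)].append(name)

for key, members in groups.items():
    print(key, "->", len(members))
# ANADARKO E P -> 2
# ANADARKO E ONSHORE P -> 1
# ANADARKO PETROLEUM -> 3
# ANADARKO PRODUCTION -> 3
```

Exact matching on a normalized key only merges names that differ in word order, abbreviations, or legal suffixes. Near-matches such as the ONSHORE variant still need a fuzzy grouping step afterwards.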

Then use his code, which is based on n-grams and TF-IDF:

new_df=tp.read_csv('./_________.csv',['Name_Trimmed'], match_threshold=0.85,ngram_remove=r'[,-./]')
new_df.run()
new_df.export_csv('./ngram_grps.csv')
df2= pd.read_csv('ngram_grps.csv')
print("Ngram group Count =",len(df2['Group'].unique()))
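
For readers who want to see what n-gram TF-IDF grouping involves, the following is a rough pure-Python sketch of the general technique: character trigrams, TF-IDF weights, cosine similarity, and merging of every pair above a threshold. It is not textpack's code. The function names, the smoothed IDF formula, and the 0.5 threshold are assumptions chosen for illustration, and the pairwise loop is quadratic, so it only suits small lists.

```python
import math
import re
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Character n-grams of the upper-cased text, punctuation removed."""
    cleaned = re.sub(r"[^A-Z0-9 ]", "", text.upper())
    padded = " " + " ".join(cleaned.split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(names, n=3):
    """L2-normalized TF-IDF vectors, stored as {ngram: weight} dicts."""
    counts = [Counter(char_ngrams(name, n)) for name in names]
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(names)
    vectors = []
    for c in counts:
        # Smoothed IDF (an assumption): ln((1 + N) / (1 + df)) + 1
        vec = {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1)
               for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(g, 0.0) for g, w in u.items())


def group_names(names, threshold=0.5):
    """Merge names whose similarity reaches the threshold (union-find)."""
    vectors = tfidf_vectors(names)
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = defaultdict(list)
    for i, name in enumerate(names):
        groups[find(i)].append(name)
    return list(groups.values())


names = [
    "ANADARKO E & P CO LP",
    "E & P COMPANY ANADARKO  LIMITED PRTNRSHIP",
    "E & P ONSHORE LLC ANADARKO ",
    "PET ANADARKO ",
    "ANADARKO PET CORP",
    "ANADARKO PETROLEUM CORPORATION",
    "PROD ANADARKO ",
    "ANADARKO PROD CO",
    "ANADARKO PRODUCTION COMPANY",
    "test",
    "test2",
]

for number, members in enumerate(group_names(names)):
    print(number, members)
```

Because every ANADARKO name shares the trigrams of that word, a low threshold tends to merge them all into one group, while a higher threshold splits them into finer groups (E & P, PETROLEUM, PRODUCTION). The threshold has to be tuned by inspecting the output. For tens of thousands of names, the pairwise loop would need to be replaced by a sparse matrix product or a similar technique.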


【Answer 3】:

This should solve your problem!

# Create a DataFrame

import pandas as pd

data = {'names': ['ANADARKO E & P CO LP',
    'E & P COMPANY ANADARKO  LIMITED PRTNRSHIP',
    'E & P ONSHORE LLC ANADARKO ',
    'PET ANADARKO ',
    'ANADARKO PET CORP',
    'ANADARKO PETROLEUM CORPORATION',
    'PROD ANADARKO ',
    'ANADARKO PROD CO',
    'ANADARKO PRODUCTION COMPANY', 'test', 'test2']}

df = pd.DataFrame(data)
print(df)

                                        names
0                        ANADARKO E & P CO LP
1   E & P COMPANY ANADARKO  LIMITED PRTNRSHIP
2                 E & P ONSHORE LLC ANADARKO 
3                               PET ANADARKO 
4                           ANADARKO PET CORP
5              ANADARKO PETROLEUM CORPORATION
6                              PROD ANADARKO 
7                            ANADARKO PROD CO
8                 ANADARKO PRODUCTION COMPANY
9                                        test
10                                      test2

# Find the rows of that DataFrame that contain the string 'ANADARKO'

look = df[df['names'].str.contains('ANADARKO')]
print(look)

                                       names
0                       ANADARKO E & P CO LP
1  E & P COMPANY ANADARKO  LIMITED PRTNRSHIP
2                E & P ONSHORE LLC ANADARKO 
3                              PET ANADARKO 
4                          ANADARKO PET CORP
5             ANADARKO PETROLEUM CORPORATION
6                             PROD ANADARKO 
7                           ANADARKO PROD CO
8                ANADARKO PRODUCTION COMPANY

【Comments】:

I have a list of more than 50,000 names, and I won't have a keyword like ANADARKO to pass in. I want to group them without supplying any keywords.
