Classification accuracy based on single Feature set

Posted: 2019-07-31 05:52:40

Question:

I am trying to classify data according to pre-specified labels.

There are two columns, as shown below:

room_class                     room_cluster
Standard single sea view        Standard
Deluxe twin Single              Deluxe
Suite Superior room ocean view  Suite
Superior Double twin            Superior
Deluxe Double room              Deluxe

The label set is room_cluster, as shown above.

The code snippet is as follows:

from sklearn import preprocessing, neighbors
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

le = preprocessing.LabelEncoder()

# df is the DataFrame that holds the two columns shown above
datar = df

#### Separate data into features and labels
x = datar.room_class
y = datar.room_cluster


#### Use LabelEncoder to change each string into an int
le.fit(x)
addv = le.transform(x)
asb =  addv.reshape(-1,1)


#### Split into training and testing sets and then use KNN
x_train,x_test,y_train,y_test=train_test_split(asb,y,test_size=0.40)
classifier=neighbors.KNeighborsClassifier(n_neighbors=3)
classifier.fit(x_train,y_train)
predictions =   classifier.predict(x_test)


#### Check the accuracy
print(accuracy_score(y_test,predictions))

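I only get 78% accuracy on the test data. Is there something wrong in the code that is holding the accuracy back?

How can I use this model to predict for custom inputs, for example:

Input: 'Suite Single sea view' Output: 'Suite'
Input: 'Superior Suite twin' Output: 'Superior'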
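
As a rough illustration of what the encoding step in the snippet does (a sketch on the five sample rows only, not the full data set): `LabelEncoder` numbers the distinct strings by their sorted position, so the integer says nothing about the words inside a room name, and a string that was not present during `fit` cannot be transformed at all. The variable names `code_of` and `unseen_ok` are made up for this example.

```python
from sklearn.preprocessing import LabelEncoder

# The five sample rows from the question.
rooms = [
    "Standard single sea view",
    "Deluxe twin Single",
    "Suite Superior room ocean view",
    "Superior Double twin",
    "Deluxe Double room",
]

le = LabelEncoder()
codes = le.fit_transform(rooms)

# Each distinct string is mapped to an integer by its sorted position,
# not by any similarity between the room names.
code_of = dict(zip(rooms, codes.tolist()))
print(code_of)

# A room name that was not seen during fit cannot be encoded.
try:
    le.transform(["Suite Single sea view"])
    unseen_ok = True
except ValueError as err:
    unseen_ok = False
    print("cannot encode unseen input:", err)
```

This suggests the custom inputs in the question would need a word-level representation rather than one integer per whole string.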

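Comments:

The fact that you regard 78% accuracy as "low" does not necessarily mean that there is any coding problem here, which is what (coding problems) this is about...

I need ML because the input data may vary, but how do I use the model to predict, as in the example in the question?

@Justice_Lords room_class does not always consist of two words; please see the edit.

@Justice_Lords If possible, could you provide a sample code snippet as an answer? And does "pad all the sentences" mean giving them all the same structure?

Answer 1: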
import random
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import numpy as np

##Based on your data
initial_room=["Standard single sea view","Deluxe twin Single","Suite Superior room ocean view","Superior Double twin","Deluxe Double room"]


##Based on your data created 100 data points
##Its repeating
room_class=[initial_room[random.randint(0,len(initial_room)-1)] for i in range(100)]

##Based on room_cluster
initial_cluster=["Standard","Deluxe","Suite","Superior"]

##Take the first word of each room name that is a known cluster as the Y label
##(a set intersection has no fixed order when a name contains two cluster words)
room_cluster=[next(word for word in each_room.split() if word in initial_cluster) for each_room in room_class]


##Helps to embed 
embedding={}
index=0


##For each unique word in the total room_class assign a unique number.
for each_room in room_class:
    for each_word in each_room.split():
        if each_word not in embedding:
            embedding[each_word]=index
            index+=1

##Find max_len of the room name
max_len=max([len(i.split()) for i in room_class])

##Needed for embedding the matrix
embedded_rooms=[]


##For each room in room_class
for each_room in room_class:
    embedded_room=[]
    for each_word in each_room.split():
        ##Each word assign that unique number
        embedded_room.append(embedding[each_word])

    #Get the length of the row
    room_len=len(embedded_room)

    ##If it is shorter than max_len pad it with -1
    ##Since the embedding already uses 0, I can't use it for padding
    while(room_len<max_len):
        embedded_room.append(-1)
        room_len+=1
    ##Append it to embedded rooms
    embedded_rooms.append(embedded_room)

Y=[]

##Embed Y based on same technique
for each_cluster in room_cluster:
    Y.append(embedding[each_cluster])


X=np.array(embedded_rooms)


##Apply KNN
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X,Y)

##Data for testing goes within this list
test=["Single Standard"]
test_label=["Standard"]


embed_tests=[]
##Convert the test to embedding 
#Use the same embedding
for each_test in test:
    embed_test=[]
    for each_word in each_test.split():
        embed_test.append(embedding[each_word])
    ##Again Padding the data    
    n=len(embed_test)
    while(n<max_len):
        embed_test.append(-1)
        n+=1
    embed_tests.append(embed_test)  

#Predict the X_test
X_test=np.array(embed_tests)
predictions = classifier.predict(X_test)

##Convert class_labels to encoding
embed_test_label=[]
for each_class in test_label:
    embed_test_label.append(embedding[each_class])

##Print out the accuracy
print(accuracy_score(embed_test_label,predictions))

I have coded this roughly, so please bear with me.

References:

    Padding
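
As a possible alternative to the hand-built word indices and padding above (a different technique, not the method used in this answer), each room name can be turned into a bag-of-words vector with scikit-learn's `CountVectorizer` and passed to the same KNN classifier through a pipeline. This is only a sketch: it trains on the five sample rows from the question, uses `n_neighbors=1` because the sample is so small, and `predict_cluster` is a hypothetical helper name.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# The five sample rows from the question.
room_class = [
    "Standard single sea view",
    "Deluxe twin Single",
    "Suite Superior room ocean view",
    "Superior Double twin",
    "Deluxe Double room",
]
room_cluster = ["Standard", "Deluxe", "Suite", "Superior", "Deluxe"]

# binary=True records only whether a word is present; names are lowercased,
# so "single" and "Single" count as the same word.
model = make_pipeline(
    CountVectorizer(lowercase=True, binary=True),
    KNeighborsClassifier(n_neighbors=1),  # assumption: 1 neighbour for 5 rows
)
model.fit(room_class, room_cluster)


def predict_cluster(names):
    """Hypothetical helper: predict a cluster for each raw room name."""
    return model.predict(names).tolist()


print(predict_cluster(["Deluxe single room", "Standard room sea view"]))
```

Because the pipeline accepts raw strings, new room names can be passed in directly without any manual padding. Word overlap alone can still mislead: on these five rows, 'Suite Single sea view' shares three words with the Standard row and would be assigned to it, so more training data or extra weight on the cluster words would probably be needed.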

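Comments:

Thanks. Just to ask: in the 'initial room' list, if I have enough data (~4000), do I need to do what you did in 'room cluster', or was that just for the example? Also, how do I test the code for "Single Standard"?

@JustinJoy The 'n=100' data points are only for the example; I did not have much data, so I generated them randomly. I will update the code for testing.

Where do I give the input? For example, suppose I give the code "Superior twin double"; based on the training, it should output "Superior".

@JustinJoy As in testing? Then append all your inputs to test=["Single Standard"]. If you do not have test_class_labels, comment out the corresponding test class label embedding.

@JustinJoy I will point you to some websites I used to refer to: Machine Learning and Analytics Vidhya. These sites cover all the ML concepts, so you can visit them for further reading.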
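
To make the "where do I give the input" step concrete, here is a sketch that wraps the answer's encode-and-pad logic in a function and lets the classifier return the cluster name directly. The helper names `encode_room` and `predict_rooms`, the use of -1 for words not seen in training, and `n_neighbors=1` on the five sample rows are all assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# The five sample rows from the question.
train_rooms = [
    "Standard single sea view",
    "Deluxe twin Single",
    "Suite Superior room ocean view",
    "Superior Double twin",
    "Deluxe Double room",
]
train_clusters = ["Standard", "Deluxe", "Suite", "Superior", "Deluxe"]

# Word -> integer index, built from the training room names as in the answer.
embedding = {}
for room in train_rooms:
    for word in room.split():
        embedding.setdefault(word, len(embedding))

max_len = max(len(room.split()) for room in train_rooms)


def encode_room(room):
    """Map words to indices (-1 for unknown words), then pad or cut to max_len."""
    ids = [embedding.get(word, -1) for word in room.split()][:max_len]
    return ids + [-1] * (max_len - len(ids))


X = np.array([encode_room(room) for room in train_rooms])
classifier = KNeighborsClassifier(n_neighbors=1)
classifier.fit(X, train_clusters)  # string labels can be passed directly


def predict_rooms(rooms):
    """Encode raw room names and return the predicted cluster names."""
    return classifier.predict(np.array([encode_room(r) for r in rooms])).tolist()


print(predict_rooms(["Superior twin double"]))
```

New inputs go into the list passed to `predict_rooms`. Keep in mind that the word indices are arbitrary numbers, so the distances KNN computes on them do not reflect how similar two room names really are.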
