How to print clusters of SVM in Python
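【Title】How to print clusters of SVM in python 【Posted】2020-06-03 23:30:58 【Question】I want to classify the rows of a column using an SVM clustering approach. I can find plenty of material online that plots graphs or prints prediction accuracy, but I cannot find a way to print the clusters themselves. The example below explains what I am trying to do.
I have a DataFrame that I use as the training data set: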
import pandas as pd
train_data = {'Serial': [1,2,3,4,5,6,7,8,9,10],
              'Text': ['Dog is a faithful animal','cat are not reliable','Tortoise can live a long life',
                       'camel stores water in its hump','horse are used as means of transport','pen is a powerful weapon',
                       'stop when the signal is red','oxygen is a life gas','chocolates are bad for health','lets grab a cup of coffee'],
              'classification': ['Animal','Animal','Animal','Animal','Animal','Thing','Thing','Miscellenous','Thing','Thing']}
df = pd.DataFrame(train_data, columns=['Serial', 'Text', 'classification'])
print(df)
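I want to predict whether each row of text is about an animal, a thing, or something miscellaneous. The test data I want to pass in is: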
test_data = {'Serial': [1,2,3,4,5],
             'Text': ['Is this your dog?','Lets talk about the problem','You have a cat eye',
                      'Donot forget to take the camel ride when u goto dessert','Plants give us O2']}
df = pd.DataFrame(test_data, columns=['Serial', 'Text'])
The expected result is an additional column 'classification' in the test DataFrame with the values ['Animal','Miscellenous','Animal','Animal','Miscellenous'].
【Answer 1】Here is a solution to your problem:
# import tfidf-vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# import support vector classifier
from sklearn.svm import SVC
import pandas as pd
train_data = {'Serial': [1,2,3,4,5,6,7,8,9,10],
              'Text': ['Dog is a faithful animal','cat are not reliable','Tortoise can live a long life',
                       'camel stores water in its hump','horse are used as means of transport','pen is a powerful weapon',
                       'stop when the signal is red','oxygen is a life gas','chocolates are bad for health','lets grab a cup of coffee'],
              'classification': ['Animal','Animal','Animal','Animal','Animal','Thing','Thing','Miscellenous','Thing','Thing']}
train_df = pd.DataFrame(train_data, columns=['Serial', 'Text', 'classification'])
print(train_df)
test_data = {'Serial': [1,2,3,4,5],
             'Text': ['Is this your dog?','Lets talk about the problem','You have a cat eye',
                      'Donot forget to take the camel ride when u goto dessert','Plants give us O2']}
test_df = pd.DataFrame(test_data, columns=['Serial', 'Text'])
print(test_df)
# Load training data (text) from the dataframe and form to a list containing all the entries
training_data = train_df['Text'].tolist()
# Load training labels from the dataframe and form to a list as well
training_labels = train_df['classification'].tolist()
# Load testing data from the dataframe and form a list
testing_data = test_df['Text'].tolist()
# Get a tfidf vectorizer to process the text into vectors
vectorizer = TfidfVectorizer()
# Fit the tfidf-vectorizer to training data and transform the training text into vectors
X_train = vectorizer.fit_transform(training_data)
# Transform the testing text into vectors
X_test = vectorizer.transform(testing_data)
# Get the SVC classifier
clf = SVC()
# Train the SVC with the training data (data points and labels)
clf.fit(X_train, training_labels)
# Predict the test samples
predictions = clf.predict(X_test)
print(predictions)
# Add classification results to test dataframe (column name as requested in the question)
test_df['classification'] = predictions
# Display test dataframe
print(test_df)
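As an explanation of the approach:
You have your own training data and want to train an SVM on it, then use the trained model to predict labels for the test data.
This means you need to extract the training data together with a label for each data point (so for every phrase you need to know whether it is an animal, a thing, and so on), and then set up and train the SVM. Here I used the scikit-learn implementation.
In addition, you cannot train an SVM on raw text, because it requires numerical input. You therefore need to convert the text into numbers. This is "feature extraction from text", and a common way to do it is term frequency-inverse document frequency (TF-IDF).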
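To make the TF-IDF step more concrete, here is a minimal sketch that vectorizes three of the training phrases and then transforms one unseen phrase. The variable names (`phrases`, `X`, `X_new`) are only illustrative and do not appear in the answer's code; the vectorizer uses scikit-learn's default settings, which lowercase the text and ignore single-character tokens.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three phrases taken from the training data in the question
phrases = ['Dog is a faithful animal',
           'cat are not reliable',
           'pen is a powerful weapon']

vectorizer = TfidfVectorizer()
# Learn the vocabulary from the phrases and turn them into a sparse matrix:
# one row per phrase, one column per distinct token
X = vectorizer.fit_transform(phrases)
print(X.shape)
print(sorted(vectorizer.vocabulary_))

# A new phrase is mapped onto the SAME columns; words that were never
# seen during fitting (here 'this' and 'your') are simply ignored
X_new = vectorizer.transform(['Is this your dog?'])
print(X_new.toarray())
```

This is also why the answer calls `fit_transform` on the training text but only `transform` on the test text: both matrices must share one vocabulary.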
Now you can train the SVM on the vector representation of each phrase plus its label, and then use it to classify the test data :)
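Since the question asks how to print the "clusters", one possible reading is to list the test rows grouped by their predicted label. Note that an SVM is a supervised classifier and does not cluster, so the groups here are just rows sharing a predicted class. The sketch below assumes that reading and uses pandas `groupby`; the names `train_texts` and `train_labels` are illustrative. With only ten training phrases, the predictions will not necessarily match the expected values from the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

train_texts = ['Dog is a faithful animal', 'cat are not reliable', 'Tortoise can live a long life',
               'camel stores water in its hump', 'horse are used as means of transport',
               'pen is a powerful weapon', 'stop when the signal is red', 'oxygen is a life gas',
               'chocolates are bad for health', 'lets grab a cup of coffee']
train_labels = ['Animal', 'Animal', 'Animal', 'Animal', 'Animal',
                'Thing', 'Thing', 'Miscellenous', 'Thing', 'Thing']

test_df = pd.DataFrame({
    'Serial': [1, 2, 3, 4, 5],
    'Text': ['Is this your dog?', 'Lets talk about the problem', 'You have a cat eye',
             'Donot forget to take the camel ride when u goto dessert', 'Plants give us O2'],
})

# Vectorize, train, and predict as in the answer
vectorizer = TfidfVectorizer()
clf = SVC()
clf.fit(vectorizer.fit_transform(train_texts), train_labels)
test_df['classification'] = clf.predict(vectorizer.transform(test_df['Text']))

# Print the test phrases grouped by predicted label
for label, group in test_df.groupby('classification'):
    print(label)
    for text in group['Text']:
        print('   ', text)
```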
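In short, the steps are:
- Extract the data points and labels from the training set
- Extract the data points from the test set
- Set up the SVM classifier
- Set up the TF-IDF vectorizer and fit it to the training data
- Transform the training and test data with the TF-IDF vectorizer
- Train the SVM classifier
- Classify the test data with the trained classifier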
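The steps above can also be chained into a single object. The following is a sketch using scikit-learn's `make_pipeline`, which is not what the answer's code does but is equivalent in effect: the pipeline fits the vectorizer and the classifier together and applies the same transformation at prediction time. The shortened data lists and the name `model` are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# A shortened version of the training data from the question
train_texts = ['Dog is a faithful animal', 'cat are not reliable',
               'camel stores water in its hump', 'pen is a powerful weapon',
               'stop when the signal is red', 'oxygen is a life gas']
train_labels = ['Animal', 'Animal', 'Animal', 'Thing', 'Thing', 'Miscellenous']

test_texts = ['Is this your dog?', 'Plants give us O2']

# TF-IDF vectorizer followed by an SVC with default settings
model = make_pipeline(TfidfVectorizer(), SVC())

# fit() runs fit_transform on the vectorizer, then trains the SVC
model.fit(train_texts, train_labels)

# predict() runs transform on the vectorizer, then classifies
predictions = model.predict(test_texts)
print(predictions)
```

Keeping both steps in one pipeline makes it harder to accidentally fit the vectorizer on the test data.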
I hope this helps!