How to print clusters of SVM in Python


【Title】How to print clusters of SVM in python 【Posted】2020-06-03 23:30:58 【Question】:

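I want to classify the rows of a column using the SVM clustering approach. I can find plenty of material online that produces plots or prints the prediction accuracy, but I cannot find a way to print the clusters themselves. The example below explains better what I am trying to do: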

I have a dataframe that serves as the training dataset:

import pandas as pd

# Training data: one text phrase per row, each with a known label
train_data = {
    'Serial': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Text': ['Dog is a faithful animal', 'cat are not reliable', 'Tortoise can live a long life',
             'camel stores water in its hump', 'horse are used as means of transport', 'pen is a powerful weapon',
             'stop when the signal is red', 'oxygen is a life gas', 'chocolates are bad for health', 'lets grab a cup of coffee'],
    'classification': ['Animal', 'Animal', 'Animal', 'Animal', 'Animal', 'Thing', 'Thing', 'Miscellenous', 'Thing', 'Thing']
}

df = pd.DataFrame(train_data, columns=['Serial', 'Text', 'classification'])
print(df)

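I want to predict whether each line of text is talking about an animal, a thing, or something miscellaneous. The test data I want to pass in is: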

# Test data: phrases without labels
test_data = {
    'Serial': [1, 2, 3, 4, 5],
    'Text': ['Is this your dog?', 'Lets talk about the problem', 'You have a cat eye',
             'Donot forget to take the camel ride when u goto dessert', 'Plants give us O2']
}

df = pd.DataFrame(test_data, columns=['Serial', 'Text'])

The expected result is an additional column 'classification' in the test dataframe, with the values ['Animal','Miscellenous','Animal','Animal','Miscellenous'].


【Answer 1】:

Here is a solution to your problem:

# Import the TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Import the support vector classifier
from sklearn.svm import SVC
import pandas as pd

train_data = {
    'Serial': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Text': ['Dog is a faithful animal', 'cat are not reliable', 'Tortoise can live a long life',
             'camel stores water in its hump', 'horse are used as means of transport', 'pen is a powerful weapon',
             'stop when the signal is red', 'oxygen is a life gas', 'chocolates are bad for health', 'lets grab a cup of coffee'],
    'classification': ['Animal', 'Animal', 'Animal', 'Animal', 'Animal', 'Thing', 'Thing', 'Miscellenous', 'Thing', 'Thing']
}

train_df = pd.DataFrame(train_data, columns=['Serial', 'Text', 'classification'])
print(train_df)

test_data = {
    'Serial': [1, 2, 3, 4, 5],
    'Text': ['Is this your dog?', 'Lets talk about the problem', 'You have a cat eye',
             'Donot forget to take the camel ride when u goto dessert', 'Plants give us O2']
}

test_df = pd.DataFrame(test_data, columns=['Serial', 'Text'])
print(test_df)

# Load the training data (text) from the dataframe into a list containing all the entries
training_data = train_df['Text'].tolist()

# Load the training labels from the dataframe into a list as well
training_labels = train_df['classification'].tolist()

# Load the testing data from the dataframe into a list
testing_data = test_df['Text'].tolist()

# Get a TF-IDF vectorizer to process the text into vectors
vectorizer = TfidfVectorizer()

# Fit the TF-IDF vectorizer to the training data and transform the training text into vectors
X_train = vectorizer.fit_transform(training_data)

# Transform the testing text into vectors using the vocabulary learned from the training data
X_test = vectorizer.transform(testing_data)

# Get the SVC classifier
clf = SVC()

# Train the SVC with the training data (data points and labels)
clf.fit(X_train, training_labels)

# Predict the test samples
predictions = clf.predict(X_test)
print(predictions)

# Add the classification results to the test dataframe
test_df['Classification'] = predictions

# Print the test dataframe
print(test_df)

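To explain the approach:

You have your training data and want to use it to train an SVM, and then use that SVM to predict labels for the test data.

That means you need to extract the training data and the label for every data point (so for each phrase you need to know whether it is an animal, a thing, and so on), and then you need to set up and train the SVM. Here I used the implementation from scikit-learn.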

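Moreover, you cannot train an SVM on raw text data alone, because it requires numerical values. That means you need to convert the text data into numbers. This is called "feature extraction from text", and a common way to do it is the term frequency-inverse document frequency (TF-IDF) scheme.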
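To make the "text to numbers" step concrete, here is a small sketch of what the vectorizer produces for two of the training phrases. It assumes scikit-learn's default `TfidfVectorizer` settings (lowercasing, and a token pattern that drops one-letter words such as "a"); the variable names are only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two phrases taken from the training data in the question
phrases = ['Dog is a faithful animal', 'pen is a powerful weapon']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(phrases)  # sparse matrix: one row per phrase

# vocabulary_ maps each learned term to its column index in X
terms = sorted(vectorizer.vocabulary_)
print(terms)

# One row per phrase, one column per learned term
print(X.shape)

# Dense view of the TF-IDF weights, rounded for readability
print(X.toarray().round(2))
```

Each phrase becomes a row of weights, and a term that does not occur in a phrase gets a weight of 0. This numeric matrix is what the SVM is actually trained on.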

Now you can train the SVM with a vector representation of each phrase plus its label, and then use it to classify the test data :)
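Since the question asks about printing the "clusters", one possible way to do that is to group the test rows by their predicted label once the classifier has run. The following is only a sketch built on the data from the question; the `groupby` step and the lowercase column name `classification` are my own choices, and I make no claim about which label each phrase ends up with.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

train_texts = ['Dog is a faithful animal', 'cat are not reliable', 'Tortoise can live a long life',
               'camel stores water in its hump', 'horse are used as means of transport',
               'pen is a powerful weapon', 'stop when the signal is red', 'oxygen is a life gas',
               'chocolates are bad for health', 'lets grab a cup of coffee']
train_labels = ['Animal', 'Animal', 'Animal', 'Animal', 'Animal',
                'Thing', 'Thing', 'Miscellenous', 'Thing', 'Thing']

test_df = pd.DataFrame({'Text': ['Is this your dog?', 'Lets talk about the problem',
                                 'You have a cat eye',
                                 'Donot forget to take the camel ride when u goto dessert',
                                 'Plants give us O2']})

# Vectorize the training text, train the classifier, then label the test rows
vectorizer = TfidfVectorizer()
clf = SVC()
clf.fit(vectorizer.fit_transform(train_texts), train_labels)
test_df['classification'] = clf.predict(vectorizer.transform(test_df['Text']))

# Print every predicted class followed by the phrases assigned to it
for label, group in test_df.groupby('classification'):
    print(label)
    for text in group['Text']:
        print('   ', text)
```

With a training set this small, the predictions depend heavily on which words happen to overlap between training and test phrases, so the groups may not match the expected labels exactly.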

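In short, the steps are:

    1. Extract the data points and labels from the training set
    2. Extract the data points from the test set
    3. Set up the SVM classifier
    4. Set up the TF-IDF vectorizer and fit it to the training data
    5. Transform the training data and the test data with the TF-IDF vectorizer
    6. Train the SVM classifier
    7. Classify the test data with the trained classifier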
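As a side note, the same steps can be bundled into a single object. This is not part of the solution above, just a minimal sketch using scikit-learn's `make_pipeline` helper, which chains the vectorizer and the classifier so that `fit` and `predict` take raw text directly. The shortened training list and the variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# A shortened version of the training data from the question
train_texts = ['Dog is a faithful animal', 'cat are not reliable',
               'camel stores water in its hump', 'pen is a powerful weapon',
               'stop when the signal is red', 'oxygen is a life gas']
train_labels = ['Animal', 'Animal', 'Animal', 'Thing', 'Thing', 'Miscellenous']

new_texts = ['Is this your dog?', 'Plants give us O2']

# Steps 3 and 4: set up the vectorizer and the classifier as one model
model = make_pipeline(TfidfVectorizer(), SVC())

# Steps 5 and 6: fit() vectorizes the training text, then trains the SVM
model.fit(train_texts, train_labels)

# Step 7: predict() applies the fitted vectorizer before classifying
predicted = model.predict(new_texts)
print(list(zip(new_texts, predicted)))
```

The benefit is mainly convenience: the vectorizer fitted on the training text is reused automatically at prediction time, so you cannot accidentally refit it on the test data.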

I hope this helps!

