Outlier prediction with categorical data in Python's Scikit-Learn library

Posted 2023-03-12

Asked: 2020-02-01 11:55:39

Question:

I am trying to make predictions on my own input data. I am using the Python Scikit-learn library with Isolation Forest as the algorithm. I don't know what I am doing wrong, but whenever I try to transform my input data I get an error. The error is raised on this line:

    input_par = encoder.transform(val)#ERROR

This is the error: Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

I also tried this, but it always raises an error as well:

    input_par = encoder.transform([val])#ERROR

This is the error: ValueError: Specifying the columns using strings is only supported for pandas DataFrames

What am I doing wrong, and how can I fix this error? Also, should I use OneHotEncoder, LabelEncoder, or CountVectorizer?

Here is my code:

import pandas as pd

from sklearn.ensemble import IsolationForest
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

textual_data = ['i love you', 'I love your dress', 'i like that', 'thats good', 'amazing', 'wrong', 'hi, how are you, are you doing good']
num_data = [4, 1, 3, 2, 65, 3,3]

df = pd.DataFrame({'my text': textual_data,
                   'num data': num_data})
x = df

# Transform the features
encoder = ColumnTransformer(transformers=[('onehot', OneHotEncoder(), ['my text'])], remainder='passthrough')
#encoder = ColumnTransformer(transformers=[('lab', LabelEncoder(), ['my text'])])

x = encoder.fit_transform(x)

isolation_forest = IsolationForest(contamination = 'auto', behaviour = 'new')
model = isolation_forest.fit(x)

list_of_val = [['good work',2], ['you are wrong',54], ['this was amazing',1]]

for val in list_of_val:

    input_par = encoder.transform(val)#ERROR

    outlier = model.predict(input_par)
    #print(outlier)

    if outlier[0] == -1:
        print('Values', val, 'are outliers')

    else:
        print('Values', val, 'are not outliers')

Edit:

I also tried this:

list_of_val = [['good work',2], ['you are wrong',54], ['this was amazing',1]]

for val in list_of_val:

    input_par = encoder.transform(pd.DataFrame({'my text': val[0],
                                                'num data': val[1]}))

But I get this error:

ValueError: If using all scalar values, you must pass an index

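Comments:

Post the error as well.

I have updated the question and added the errors.

The code above just one-hot encodes whole sentences. Are you sure you want to encode the sentences, or do you want to encode the tokens contained in the sentences?

I want to find outliers, to check whether my input text is an outlier. Is it possible to do that with text data? Also, which encoding should I use?

So I think your problem statement is that you want to find outliers based on the context of the sentences. How do you determine that a -1 prediction is an outlier?

Answer 1:

I will try to list a few observations that you may find useful: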

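LabelEncoder, for example, converts non-numerical data into numerical labels. OneHotEncoder takes numerical or non-numerical data and converts it into a one-hot encoding. LabelEncoder is meant for preprocessing "labels" (the classes of a supervised learning problem), while OneHotEncoder is meant for categorical features. As far as I understand, you are trying to predict outliers (anomaly detection). It is not clear to me whether the connection between the utterances and the integers is just hard-coded, or whether you want to generate this connection somehow. If the latter is what you want, you cannot achieve it with the encoders mentioned above, because you are fitting them to some data (a fixed set of categories) and then trying to transform new, unrelated data (ValueError: y contains previously unseen labels). However, this can be worked around by setting the handle_unknown parameter of OneHotEncoder to 'ignore' (from the docs: "Whether to raise an error or ignore if an unknown categorical feature is present during transform"). Even if you can achieve what you want with one of these encoders, keep in mind that this is not their main purpose.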
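
The handle_unknown workaround can be sketched as follows. This is a minimal illustration, not the asker's full pipeline: the three-row training frame is a shortened, hypothetical stand-in for the data in the question, and sparse_threshold=0 is set only so that the result is a dense array that is easy to inspect. Note that the new row is built as a one-row DataFrame with list values, which avoids both the reshape error and the "columns using strings" error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Shortened, hypothetical stand-in for the training data in the question
df = pd.DataFrame({'my text': ['i love you', 'amazing', 'wrong'],
                   'num data': [4, 65, 3]})

encoder = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(handle_unknown='ignore'), ['my text'])],
    remainder='passthrough',
    sparse_threshold=0,  # always return a dense array (for readability only)
)
encoder.fit(df)

# Wrap the scalars in lists so pandas can build a one-row DataFrame
val = ['good work', 2]
row = pd.DataFrame({'my text': [val[0]], 'num data': [val[1]]})

# The unseen text becomes all zeros in the one-hot columns;
# the numeric column is passed through unchanged
encoded = encoder.transform(row)
print(encoded)
```

An all-zero one-hot block means every unseen sentence looks identical to the model, so this removes the error without making the text itself informative.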

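I assume you assign high values to "negative" utterances (even though "wrong" does not correspond to 65 in your training data) and small values to "positive" ones. If you assume that you already know the integer for each utterance, you can train the model only on examples considered "positive" and give it the "negative" examples (outliers) only at test time. You would not train an IsolationForest on both "positive" and "negative" examples; that would just be basic binary classification, which could be modeled with a decision tree. A visual example of IsolationForest can be seen here. Below is code for your problem: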

import numpy as np
from sklearn.ensemble import IsolationForest

textual_data = ['i love you', 'I love your dress', 'i like that', 'thats good', 'amazing', ...]
integer_connection = [1, 1, 2, 3, 2, 2, 3, 1, 3, 4, 1, 2, 1, 2, 1, 2, 1, 1]
integer_connection = np.array([[n] for n in integer_connection])

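# Newer scikit-learn releases no longer accept behaviour='new', so it is omitted
isolation_forest = IsolationForest(contamination='auto')
# Fit on the 2D integer array defined above (the original referenced an undefined name)
isolation_forest.fit(integer_connection)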

list_of_val = [['good work', 2], ['you are wrong', 54], ['this was amazing', 1]]

text_vals = [d[0] for d in list_of_val]
numeric_vals = np.array([[d[1]] for d in list_of_val])

print(integer_connection, numeric_vals)

outliers = isolation_forest.predict(numeric_vals)
print(outliers)

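Overall, I think your approach is not the right one for outlier prediction on natural-language utterances. For what you are trying to do in this particular example, I would recommend using word-vector similarity from, e.g., spaCy, or perhaps a simple bag-of-words approach.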
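
A simple bag-of-words variant could look like the sketch below. It is only an illustration under assumptions: CountVectorizer is one possible way to build the bag of words, the numeric column from the question is left out, and random_state=0 is added purely for reproducibility. With only seven training sentences the predictions are not meaningful; the point is that unseen sentences no longer raise an error, because they are represented by the known words they contain.

```python
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ['i love you', 'I love your dress', 'i like that', 'thats good',
               'amazing', 'wrong', 'hi, how are you, are you doing good']

# Learn the vocabulary from the training sentences and count word occurrences
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

forest = IsolationForest(random_state=0)
forest.fit(X_train)

# Words that were never seen during fit are silently ignored by transform
new_texts = ['good work', 'you are wrong', 'this was amazing']
X_new = vectorizer.transform(new_texts)

# 1 marks an inlier, -1 an outlier
predictions = forest.predict(X_new)
for text, label in zip(new_texts, predictions):
    print(text, '->', 'outlier' if label == -1 else 'not an outlier')
```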

If you don't care about these points and just want working code, here is a version of what I think you are trying to do:

import numpy as np

from sklearn.ensemble import IsolationForest
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder


textual_data = ['i love you', 'I love your dress', 'i like that', 'thats good', 'amazing', 'wrong', 'hi, how are you, are you doing good']


# Maps each (text, number) pair to its one-hot encoded row
encodings = {}

num_data = [4, 1, 3, 2, 65, 3, 3]


onehot_encoder = OneHotEncoder(handle_unknown='ignore')
onehots = onehot_encoder.fit_transform(np.array([[utt, no] for utt, no in zip(textual_data, num_data)]))

for i, l in enumerate(onehots):
    original_label = (textual_data[i], num_data[i])
    encodings[original_label] = l

print(encodings)

# Newer scikit-learn releases no longer accept behaviour='new', so it is omitted
isolation_forest = IsolationForest(contamination='auto')
model = isolation_forest.fit(onehots)

list_of_val = [['good work', 2], ['you are wrong', 54], ['this was amazing', 1]]


test_encoded = onehot_encoder.transform(np.array(list_of_val))
print(test_encoded)

outliers = isolation_forest.predict(test_encoded)
print(outliers)

for i, outlier in enumerate(outliers):
    if outlier == -1:
        print('Values', list_of_val[i], 'are outliers')

    else:
        print('Values', list_of_val[i], 'are not outliers')

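Answer 2:

Are you sure that what you are doing makes sense? Your OneHotEncoder() encodes the categorical variable ('my text') using a one-hot (aka "one-of-K" or "dummy") encoding scheme. Think of it as a mapping between a label and a numerical representation.

In your textual_data you have 7 distinct labels: ['i love you', 'I love your dress', 'i like that', 'thats good', 'amazing', 'wrong', 'hi, how are you, are you doing good']. Each of these gets encoded. This happens in your: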

>>> x = encoder.fit_transform(x)
>>> print(x)
<7x8 sparse matrix of type '<class 'numpy.float64'>'
    with 14 stored elements in Compressed Sparse Row format>

Here your encoder creates a mapping for all 7 labels.

When you continue your script and want to use the same encoder to transform new labels, it fails:

>>> to_predict = pd.DataFrame({'my text': ['good work', 'you are wrong', 'this was amazing'],
                               'num data': [2, 54, 1]})
>>> encoder.transform(to_predict)
ValueError: Found unknown categories ['this was amazing', 'good work', 'you are wrong'] in column 0 during transform

It cannot find these labels in its mapping. However, if you have new observations whose labels are part of the mapping, it is able to transform them:

>>> to_predict = pd.DataFrame({'my text': ['i like that', 'i love you', 'i love you'],
                               'num data': [2, 54, 1]})
>>> encoder.transform(to_predict)
<3x8 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>

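What you could do is add these new observations, with their new labels, to your original df and run them through your pipeline again, so that they become part of your mapping.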
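
That refit idea can be sketched like this. It is a minimal illustration that reuses the data from the question; the names train_df and new_df are introduced here only for the example. The encoder is fitted on the concatenation of both frames, and each frame is then transformed separately.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

train_df = pd.DataFrame({
    'my text': ['i love you', 'I love your dress', 'i like that', 'thats good',
                'amazing', 'wrong', 'hi, how are you, are you doing good'],
    'num data': [4, 1, 3, 2, 65, 3, 3],
})
new_df = pd.DataFrame({
    'my text': ['good work', 'you are wrong', 'this was amazing'],
    'num data': [2, 54, 1],
})

encoder = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(), ['my text'])],
    remainder='passthrough',
)

# Fit on old and new rows together so every label is part of the mapping
encoder.fit(pd.concat([train_df, new_df], ignore_index=True))

# Transform each frame separately; both now share the same columns
x_train = encoder.transform(train_df)
x_new = encoder.transform(new_df)
print(x_train.shape, x_new.shape)
```

Each new sentence still gets its own one-hot column that no training row activates, so this fixes the error but does not by itself tell the model anything about the meaning of the text.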

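I must admit that I have no experience with this at all, so please correct me if I am wrong, but this is how it looks to me. Good luck with your project.

Comments:

Check my question, I have updated it. At the end of the question I added what I tried. I tried passing a DataFrame to encoder.transform(...), but I get the same error. When I built a regression model with categorical data, I passed a DataFrame to encoder.transform(...) and it worked; I don't know why it does not work now. I did the same thing, I am just using a different algorithm.

Use {'my text': [val[0]], 'num data': [val[1]]} to avoid ValueError: If using all scalar values, you must pass an index

I have done that, and I get this error: ValueError: Found unknown categories ['good work'] in column 0 during transform

This answer makes a good point: your test data contains categories that were not present in training, so it will never work. Try first converting list_of_val to a df, concatenating it row-wise with x, calling encoder.fit() on this new df, and then transforming the two dfs separately.

Answer 3:

There is a very similar question:

AttributeError when using ColumnTransformer into a pipeline

As stated there, the suggestion is to do the encoding with pandas (there is also a one-hot encoding example). Hope it helps!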
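
The linked answer is not reproduced here, so the following is only a guess at what encoding with pandas could look like, assuming pd.get_dummies is the intended tool. The training frame is a shortened, hypothetical stand-in for the question's data, and the reindex call is an assumption added so that new rows end up with exactly the training columns.

```python
import pandas as pd

# Shortened, hypothetical stand-in for the training data in the question
train = pd.DataFrame({'my text': ['i love you', 'amazing', 'wrong'],
                      'num data': [4, 65, 3]})
train_encoded = pd.get_dummies(train, columns=['my text'])

new = pd.DataFrame({'my text': ['good work'], 'num data': [2]})

# Align the new row to the training columns: dummy columns for unseen
# texts are dropped, and missing training columns are filled with 0
new_encoded = (pd.get_dummies(new, columns=['my text'])
               .reindex(columns=train_encoded.columns, fill_value=0))
print(new_encoded)
```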

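Comments:

I have already tried that solution, but it does not work. That is why I asked a question and put a bounty on it.

Answer 4:

Try converting your list list_of_val to a numpy array by running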

import numpy as np
list_of_val = np.asarray(list_of_val)

Comments:

Still not working. I get an error: ValueError: Specifying the columns using strings is only supported for pandas DataFrames

Answer 5:

You get a message like this when the data is a single variable, also known as a time series :)

PDF is a pandas DataFrame

import numpy as np

### pick real data
X_train = PDF.y          # single dimension, or time-series
y_train = PDF.isAnomaly  # validation variable

### reshape for isolation forest
X_train = np.array(X_train).reshape(-1, 1)
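
The two reshape variants named in the error message can be compared in a short sketch. The numbers are the num_data values from the question, used here as a single feature on their own; random_state=0 is an assumption added for reproducibility.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

values = np.array([4, 1, 3, 2, 65, 3, 3])

# One feature, many samples: shape (7, 1)
column = values.reshape(-1, 1)
# One sample, many features: shape (1, 7)
single_row = values.reshape(1, -1)
print(column.shape, single_row.shape)

# A single-variable series is fitted in the (n_samples, 1) layout
forest = IsolationForest(random_state=0).fit(column)

# New values must use the same layout; 1 marks an inlier, -1 an outlier
new_values = np.array([2, 54, 1]).reshape(-1, 1)
predictions = forest.predict(new_values)
print(predictions)
```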
