scikit-learn: FeatureUnion to include hand crafted features

Posted 2023-03-12

[Title]: scikit-learn: FeatureUnion to include hand crafted features [Posted]: 2020-04-28 12:23:45 [Question]:

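I am performing multi-label classification on text data. I would like to combine tfidf features with custom linguistic features, similar to the example here that uses FeatureUnion.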

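I have already generated the custom linguistic features. They take the form of a dictionary in which the keys represent the labels and the (list) values represent the features.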

custom_features_dict = {'contact': ['contact details', 'e-mail'],
                        'demographic': ['gender', 'age', 'birth'],
                        'location': ['location', 'geo']}

The training data is structured as follows:

text                                            contact  demographic  location
---                                              ---      ---          ---
'provide us with your date of birth and e-mail'  1        1            0
'contact details and location will be stored'    1        0            1
'date of birth should be before 2004'            0        1            0

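How can the dict above be incorporated into a FeatureUnion? My understanding is that a user-defined function should be called, which returns boolean values indicating whether each string value (from custom_features_dict) is present in the training data.

For the given training data, this yields the following list of dicts: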

[
    {
       'contact': 1,
       'demographic': 1,
       'location': 0
    },
    {
       'contact': 1,
       'demographic': 0,
       'location': 1
    },
    {
       'contact': 0,
       'demographic': 1,
       'location': 0
    },
]

How can fit and transform be implemented for the list above?

The code is as follows:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import DictVectorizer
#from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline, FeatureUnion
from nltk.corpus import stopwords
from io import StringIO

# TfidfVectorizer documents stop_words as 'english', a list, or None, so keep the list
stop_words = stopwords.words('english')

data = StringIO(u'''text,contact,demographic,location
provide us with your date of birth and e-mail,1,1,0
contact details and location will be stored,0,1,1
date of birth should be before 2004,0,1,0''')

df = pd.read_csv(data)

custom_features_dict = {'contact': ['contact details', 'e-mail'],
                        'demographic': ['gender', 'age', 'birth'],
                        'location': ['location', 'geo']}

my_features = [
    {
       'contact': 1,
       'demographic': 1,
       'location': 0
    },
    {
       'contact': 1,
       'demographic': 0,
       'location': 1
    },
    {
       'contact': 0,
       'demographic': 1,
       'location': 0
    },
]

bow_pipeline = Pipeline(
    steps=[
        ("tfidf", TfidfVectorizer(stop_words=stop_words)),
    ]
)

manual_pipeline = Pipeline(
    steps=[
        # This needs to be fixed
        ("custom_features", my_features),
        ("dict_vect", DictVectorizer()),
    ]
)

combined_features = FeatureUnion(
    transformer_list=[
        ("bow", bow_pipeline),
        ("manual", manual_pipeline),
    ]
)

final_pipeline = Pipeline([
            ('combined_features', combined_features),
            ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
        ]
)

labels = ['contact', 'demographic', 'location']

for label in labels:
    final_pipeline.fit(df['text'], df[label])

[Discussion]:

[Answer 1]:

You have to define a transformer that takes your text as input. Something like this:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

custom_features_dict = {'contact': ['contact details', 'e-mail'],
                        'demographic': ['gender', 'age', 'birth'],
                        'location': ['location', 'geo']}

# helper function which returns 1 if one of the words occurs in the text, else 0
# you can add more words or categories to custom_features_dict if you want
def is_words_present(text, listofwords):
    for word in listofwords:
        if word in text:
            return 1
    return 0

class CustomFeatureTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, custom_feature_dict):
        self.custom_feature_dict = custom_feature_dict

    def fit(self, x, y=None):
        # nothing to learn: the keyword lists are fixed
        return self

    def transform(self, data):
        result_arr = []
        for text in data:
            arr = []
            for key in self.custom_feature_dict:
                arr.append(is_words_present(text, self.custom_feature_dict[key]))
            result_arr.append(arr)
        # one row per text, one 0/1 column per key of custom_feature_dict
        return np.array(result_arr)

Note: this transformer directly produces an array such as [1, 0, 1] for each text. It does not produce a dictionary, which lets us skip the DictVectorizer.
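A caveat on the helper above: the check "word in text" is a plain substring test, so the keyword 'age' also fires on words like 'message'. The sketch below is one possible alternative, not part of the original answer. The function name contains_keyword is made up for illustration, and it assumes that whole-word, case-insensitive matching is what is wanted.

```python
import re

def contains_keyword(text, keywords):
    """Return 1 if any keyword occurs as a whole word or phrase in text, else 0."""
    for keyword in keywords:
        # \b anchors the keyword at word boundaries; re.escape keeps '-' etc. literal
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            return 1
    return 0

substring_hit = "age" in "leave a message"                      # plain substring test
whole_word_miss = contains_keyword("leave a message", ["age"])  # 'age' is inside 'message'
whole_word_hit = contains_keyword("your age and gender", ["age"])
phrase_hit = contains_keyword("Our Contact Details are below", ["contact details"])

print(substring_hit, whole_word_miss, whole_word_hit, phrase_hit)
```

Swapping this in for is_words_present would not change the shape of the transformer output, only which texts count as a match.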

I also changed the way the multi-label classification is handled, see here:

# first, generate a new column in the dataframe holding all the labels of each row:
def create_textlabels_array(row):
    arr = []
    for label in ['contact', 'demographic', 'location']:
        if row[label] == 1:
            arr.append(label)
    return arr

df['textlabels'] = df.apply(create_textlabels_array, axis=1)

# then generate the binarized labels:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer().fit(df['textlabels'])
y = mlb.transform(df['textlabels'])
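To see what the binarizer does in isolation, here is a small stand-alone sketch. The label lists are made up for illustration and are not taken from the dataframe above.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# hypothetical label lists, one list per document
label_lists = [
    ["contact", "demographic"],
    ["demographic", "location"],
    ["demographic"],
]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(label_lists)

# classes_ holds the column order of the indicator matrix (sorted label names)
print(list(mlb.classes_))
# one row per document, one 0/1 column per label
print(y)
# inverse_transform maps indicator rows back to tuples of label names
print(mlb.inverse_transform(y))
```

The column order in classes_ is what determines how a predicted row such as [0, 1, 1] is read back as label names.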

Now we can put everything together in a pipeline:

bow_pipeline = Pipeline(
    steps=[
        ("tfidf", TfidfVectorizer(stop_words=stop_words)),
    ]
)

manual_pipeline = Pipeline(
    steps=[
        ("custom_vect", CustomFeatureTransformer(custom_features_dict)),
    ]
)

combined_features = FeatureUnion(
    transformer_list=[
        ("bow", bow_pipeline),
        ("manual", manual_pipeline),
    ]
)

final_pipeline = Pipeline([
        ('combined_features', combined_features),
        ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
    ]
)

#train your pipeline
final_pipeline.fit(df['text'], y) 

#let's predict something (note: the training data in this example is of course far too small)
pred = final_pipeline.predict(["write an e-mail to our location please"])
print(pred) #output: [0, 1, 1] 

#reverse the predicted array to the actual labels:
print(mlb.inverse_transform(pred)) #output: [('demographic', 'location')]
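Training on all labels at once does not rule out per-label scores. The sketch below is not part of the original answer, and y_true and y_pred are made-up indicator matrices standing in for real test labels and pipeline predictions. It shows one way to get a per-label accuracy and a multi-label classification report from scikit-learn's metrics module.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

labels = ["contact", "demographic", "location"]

# made-up ground truth and predictions: one row per text, one column per label
y_true = np.array([[1, 1, 0],
                   [0, 1, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
y_pred = np.array([[1, 1, 0],
                   [0, 1, 0],
                   [0, 1, 0],
                   [1, 1, 1]])

# accuracy per label: compare the matching columns
per_label_accuracy = {
    label: accuracy_score(y_true[:, i], y_pred[:, i])
    for i, label in enumerate(labels)
}
print(per_label_accuracy)

# precision, recall and F1 per label in one table
report = classification_report(y_true, y_pred, target_names=labels, zero_division=0)
print(report)
```

With a real test set, y_true would come from mlb.transform on the test labels and y_pred from final_pipeline.predict on the test texts.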

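[Discussion]:

I get the following error in def transform(self, data): result_arr NameError: global name 'result_arr' is not defined. After changing it to result_arr = [], I get the following error at the line arr.append(is_words_present(text, self.word_dict[key])): AttributeError: 'CustomFeatureTransformer' object has no attribute 'word_dict'

Yes, a few errors crept in while transferring the code to ***. I hope I have fixed them now.

It works now! I have just one last question: is there a particular reason why you train the pipeline on all labels together instead of one by one? One advantage of training one by one is that we get a prediction accuracy score for each label. For example, if we have a similar dataframe for the test data (df_test), we can add the following two lines at the end of the code given in the question: pred = final_pipeline.predict(df_test['text']) print (label, accuracy_score(df_test[label], pred))

Because it is essentially the same, and it is the standard approach. You are using OneVsRestClassifier, which trains a separate classifier for each label even though it is "trained together" in one line. You can also generate a classification report covering multiple labels, e.g. with classification_report. Reading this may help you understand how to evaluate multilabel-clf-tasks.

[Answer 2]: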

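If we only want to fix the part of the code marked as needing to be fixed, we just need to implement a new estimator that extends the class sklearn.base.BaseEstimator (the class TemplateClassifier is a good example, here).

However, there seems to be a conceptual error here. The information in the my_features list appears to be the labels themselves (well, one might argue that they are very strong features...). So we should not put labels into the feature pipeline.

As stated here,

Transformers are usually combined with classifiers, regressors or other estimators to build a composite estimator. The most common tool is a Pipeline. Pipeline is often used in combination with FeatureUnion, which concatenates the output of transformers into a composite feature space. TransformedTargetRegressor deals with transforming the target (i.e. log-transform y). In contrast, Pipelines only transform the observed data (X).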
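The concatenation described in the quote can be checked on a toy example. The sketch below assumes a made-up hand-crafted feature, TextLength (not from the question or the documentation), and joins it with a TfidfVectorizer. The resulting matrix has the tfidf columns followed by one extra column.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

class TextLength(BaseEstimator, TransformerMixin):
    """Hypothetical hand-crafted feature: number of characters per text."""

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        # 2D output: one row per text, a single column
        return np.array([[len(text)] for text in X], dtype=float)

texts = [
    "provide us with your date of birth and e-mail",
    "contact details and location will be stored",
    "date of birth should be before 2004",
]

union = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("length", TextLength()),
])
X = union.fit_transform(texts)

# number of tfidf columns = size of the fitted vocabulary
n_tfidf = len(dict(union.transformer_list)["tfidf"].vocabulary_)
print(X.shape, n_tfidf)
```

Only X goes through the union; the target y is passed along untouched, which is the point the quote makes about Pipelines.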

That said, if you still want to put that list information into a transform method, it should look something like this:

def transform_str(one_line_text: str) -> dict:
    """Transform one line of text into dict features using manually extracted information."""
    # manually extracted information
    custom_features_dict = {'contact': ['contact details', 'e-mail'],
                            'demographic': ['gender', 'age', 'birth'],
                            'location': ['location', 'geo']}
    # simple tokenization; it can be improved using a text pre-processing library
    # (note: single tokens can never match a multi-word entry such as 'contact details')
    tokenized_text = one_line_text.split(" ")
    output = dict()
    for feature, tokens in custom_features_dict.items():
        output[feature] = False
        for word in tokenized_text:
            if word in tokens:
                output[feature] = True
    return output

def transform(text_list: list) -> list:
    """Apply transform_str to every text in the list."""
    output = list()
    for one_line_text in text_list:
        output.append(transform_str(one_line_text))
    return output

In this case you do not need a fit method, because the fitting was done by hand.
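Because no fit step is needed, one way to plug such a function into a pipeline is scikit-learn's FunctionTransformer, followed by the DictVectorizer from the question. The following is a minimal sketch under that assumption. texts_to_dicts is a compact, made-up stand-in for the transform function above, and the step names are reused from the question.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

CUSTOM_FEATURES = {
    "contact": ["contact details", "e-mail"],
    "demographic": ["gender", "age", "birth"],
    "location": ["location", "geo"],
}

def texts_to_dicts(texts):
    """Map each text to a {label: 0/1} dict via whitespace-token lookup."""
    rows = []
    for text in texts:
        tokens = text.split(" ")
        rows.append({
            label: int(any(token in keywords for token in tokens))
            for label, keywords in CUSTOM_FEATURES.items()
        })
    return rows

manual_pipeline = Pipeline(steps=[
    # FunctionTransformer wraps a stateless function as a transformer
    ("custom_features", FunctionTransformer(texts_to_dicts)),
    # DictVectorizer turns the dicts into a numeric matrix (columns sorted by name)
    ("dict_vect", DictVectorizer(sparse=False)),
])

texts = [
    "provide us with your date of birth and e-mail",
    "contact details and location will be stored",
    "date of birth should be before 2004",
]
X = manual_pipeline.fit_transform(texts)
print(manual_pipeline.named_steps["dict_vect"].feature_names_)
print(X)
```

In the second text the 'contact' flag stays 0, because the two-word entry 'contact details' never equals a single whitespace token. Phrase keywords need a different matching strategy than token lookup.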

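[Discussion]:

So how do we implement FeatureUnion (if we are not using a pipeline) to combine our manual features with the tfidf vectorizer?

It can be implemented as shown in @chefhose's answer. I was only pointing out that there may be a conceptual problem if your manual features contain the same information as the target you are predicting. If the information in the manual features does not exactly match your target, you can go ahead with this formulation.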