How to combine features with different dimensions output using scikit-learn

Posted 2023-03-12

[Asked]: 2018-10-30 06:32:59

[Question]:

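I am using scikit-learn with Pipeline and FeatureUnion to extract features from different inputs. Each sample (instance) in my dataset refers to documents of different lengths. My goal is to compute the top tf-idf terms for each document independently, but I keep getting this error message:

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 1, expected 2000.

2000 is the size of the training data. This is the main code: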

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

# ItemSelector and my_stopword_list are not shown in this question.
book_summary = Pipeline([
    ('selector', ItemSelector(key='book')),
    ('tfidf', TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, lowercase=True, stop_words=my_stopword_list, sublinear_tf=True))
])

book_contents = Pipeline([('selector3', book_content_count())])

ppl = Pipeline([
    ('feats', FeatureUnion([
        ('book_summary', book_summary),
        ('book_contents', book_contents)])),
    ('clf', SVC(kernel='linear', class_weight='balanced'))  # classifier with cross fold 5
])

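I wrote two classes to handle each pipeline function. My problem is with the book_contents pipeline, which processes each sample and returns the tf-idf matrix for each book independently.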

import pandas as pd

class book_content_count():
    def count_contents2(self, bookid):
        # Read the CSV holding this book's content and flatten its 'text' column into one string.
        book = open('C:/TheCorpus/' + str(int(bookid)) + '_book.csv', 'r')
        # on_bad_lines='skip' needs pandas >= 1.3; older versions used error_bad_lines=False.
        book_data = pd.read_csv(book, header=0, delimiter=',', encoding='latin1', on_bad_lines='skip', dtype=str)
        corpus = (str([book_data['text']]).strip('[]'))
        return corpus

    def transform(self, data_dict, y=None):
        # Take the book id of every sample and load the matching text.
        text = data_dict['bookid'].apply(self.count_contents2)
        vec_pipe = Pipeline([('vec', TfidfVectorizer(min_df=1, lowercase=False, ngram_range=(1, 1), use_idf=True, stop_words='english'))])
        Xtr = vec_pipe.fit_transform(text)
        return Xtr

    def fit(self, x, y=None):
        return self

Data sample (example):

title                         Summary                          bookid
The beauty and the beast      is a traditional fairy tale...    10
ocean at the end of the lane  is a 2013 novel by British        11

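Each id then refers to a text file containing the actual content of that book.

I have tried the toarray and reshape functions, but without success. Any idea how to solve this? Thanks.

[Comments]: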

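Can you provide some sample data?

I added a data example.

Ideally, you would provide a minimal working example that reproduces your error. At the moment you refer to book_content_count(), which I cannot identify from your code.

This cannot be done in FeatureUnion. Internally it uses numpy.hstack (scipy.sparse.hstack when any part is sparse), which requires all parts to have the same number of rows. The first part, 'book_summary', processes the whole training data and returns a matrix with 2000 rows. But your second part, 'book_contents', returns only one row. How would you combine such data?

Just curious, did you find a solution?

Note that "tf-idf for each document independently" is equivalent to a CountVectorizer.

[Answer 1]: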
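The row-count constraint raised in the comments can be reproduced outside any pipeline. The sketch below is illustrative only: the block shapes (2000 x 50 and 1 x 30) are made up to mirror the error in the question, and the exact wording of the ValueError differs between SciPy versions.

```python
import numpy as np
from scipy import sparse

# Stand-ins for the two FeatureUnion branches (shapes are illustrative).
summary_block = sparse.csr_matrix(np.zeros((2000, 50)))
contents_block = sparse.csr_matrix(np.ones((1, 30)))

error_message = None
try:
    # Horizontal stacking needs every block to have the same number of rows.
    sparse.hstack([summary_block, contents_block])
except ValueError as exc:
    error_message = str(exc)
    print("hstack failed:", error_message)

# Once the second block also has one row per sample, stacking succeeds.
aligned_block = sparse.csr_matrix(np.ones((2000, 30)))
stacked = sparse.hstack([summary_block, aligned_block]).tocsr()
print(stacked.shape)
```

Any fix therefore has to make each branch emit exactly one row per input sample, or replace the stacking step with custom logic.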

You can use Neuraxle's FeatureUnion with a custom joiner that you code yourself. The joiner is a class passed to Neuraxle's FeatureUnion that merges the results together in the way you expect.

1. Import Neuraxle's classes.

from neuraxle.base import NonFittableMixin, BaseStep
from neuraxle.pipeline import Pipeline
from neuraxle.steps.sklearn import SKLearnWrapper
from neuraxle.union import FeatureUnion

2. Define your custom class by inheriting from BaseStep:

class BookContentCount(BaseStep): 

    def transform(self, data_dict, y=None):
        transformed = do_things(...)  # be sure to use SKLearnWrapper if you wrap sklearn items.
        return transformed

    def fit(self, x, y=None):
        return self

3. Create a joiner that joins the results of the feature union the way you want:

import numpy as np

class CustomJoiner(NonFittableMixin, BaseStep):
    def __init__(self):
        BaseStep.__init__(self)
        NonFittableMixin.__init__(self)

    # def fit: is inherited from `NonFittableMixin` and simply returns self.

    def transform(self, data_inputs):
        # TODO: insert your own concatenation method here.
        result = np.concatenate(data_inputs, axis=-1)
        return result
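As a rough idea of what could replace the TODO, here is a minimal sketch of a concatenation helper written against NumPy and SciPy only. The name join_feature_blocks is hypothetical and not part of Neuraxle. It assumes every branch returns a 2-D matrix with one row per sample, and it converts everything to sparse when any block is sparse, since TfidfVectorizer returns sparse matrices that np.concatenate cannot combine.

```python
import numpy as np
from scipy import sparse


def join_feature_blocks(blocks):
    """Concatenate per-branch feature matrices column-wise (hypothetical helper)."""
    row_counts = {block.shape[0] for block in blocks}
    if len(row_counts) != 1:
        # Fail early with a readable message instead of a stacking error.
        raise ValueError(f"blocks disagree on row count: {sorted(row_counts)}")
    if any(sparse.issparse(block) for block in blocks):
        # Convert every block to CSR so dense and sparse outputs can be mixed.
        return sparse.hstack([sparse.csr_matrix(block) for block in blocks]).tocsr()
    return np.concatenate(blocks, axis=-1)


dense_block = np.arange(6, dtype=float).reshape(3, 2)
sparse_block = sparse.csr_matrix(np.eye(3))
joined = join_feature_blocks([dense_block, sparse_block])
print(joined.shape)
```

Inside CustomJoiner.transform, such a helper would be called on data_inputs in place of the bare np.concatenate call.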

4. Finally, create your pipeline by passing the joiner to the FeatureUnion:

book_summary= Pipeline([
    ItemSelector(key='book'),
    TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, lowercase=True, stop_words=my_stopword_list, sublinear_tf=True)
])

p = Pipeline([
    FeatureUnion([
        book_summary,
        BookContentCount()
    ], 
        joiner=CustomJoiner()
    ),
    SVC(kernel='linear', class_weight='balanced')
])

Note: if you want to turn your Neuraxle pipeline back into a scikit-learn pipeline, you can do p = p.tosklearn().

To learn more about Neuraxle: https://github.com/Neuraxio/Neuraxle

More examples in the documentation: https://www.neuraxle.org/stable/examples/index.html
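If staying within scikit-learn is preferred, another option is to make the second branch return one row per sample, so that the stock FeatureUnion can stack it. The following is a minimal sketch under stated assumptions: BOOK_TEXTS is a hypothetical in-memory stand-in for the per-book files, BookContentTfidf and select_summary are illustrative names, and a single vectorizer is fitted across all books, which is not the same as computing tf-idf for each book in isolation.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical stand-in for the book files read from disk in the question.
BOOK_TEXTS = {
    10: "a merchant picks a rose and meets the beast in the castle",
    11: "a man returns to the lane and remembers the ocean by the farm",
}


class BookContentTfidf(BaseEstimator, TransformerMixin):
    """Vectorize each book's text; emits exactly one row per input sample."""

    def _load(self, X):
        return [BOOK_TEXTS[int(book_id)] for book_id in X["bookid"]]

    def fit(self, X, y=None):
        # Learn the vocabulary once, on the training books only.
        self.vectorizer_ = TfidfVectorizer(stop_words="english")
        self.vectorizer_.fit(self._load(X))
        return self

    def transform(self, X):
        return self.vectorizer_.transform(self._load(X))


def select_summary(frame):
    # Plays the role of ItemSelector: pick the text column for this branch.
    return frame["Summary"]


data = pd.DataFrame({
    "Summary": ["is a traditional fairy tale", "is a 2013 novel by British"],
    "bookid": [10, 11],
})

features = FeatureUnion([
    ("book_summary", Pipeline([
        ("selector", FunctionTransformer(select_summary)),
        ("tfidf", TfidfVectorizer()),
    ])),
    ("book_contents", BookContentTfidf()),
])

combined = features.fit_transform(data)
print(combined.shape[0] == len(data))
```

The key design choice is fitting the vectorizer in fit and only calling transform in transform, so the feature columns stay the same between training and prediction.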
