How to fit different inputs into an sklearn Pipeline?
Posted: 2016-07-15 04:13:44
I am using a Pipeline from sklearn to classify text.
In this example pipeline, I have a TfidfVectorizer and some custom features combined in a FeatureUnion, followed by a classifier as the final pipeline step. I then fit the training data and predict:
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']
# load custom features and FeatureUnion with Vectorizer
features = []
measure_features = MeasureFeatures() # this class includes my custom features
features.append(('measure_features', measure_features))
countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features= 4000)
features.append(('ngram', countVecWord))
all_features = FeatureUnion(features)
# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)
pipeline = Pipeline([
    ('all', all_features),
    ('clf', LinearSVC1),
])
pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)
# etc.
The code above works fine, but there is a twist. I want to run part-of-speech (POS) tagging on the text and use a different vectorizer on the tagged text.
X = ['I am a sentence', 'an example']
X_tagged = do_tagging(X)
# X_tagged = ['PP AUX DET NN', 'DET NN']
Y = [1, 2]
X_dev = ['another sentence']
X_dev_tagged = do_tagging(X_dev)
# load custom features and FeatureUnion with Vectorizer
features = []
measure_features = MeasureFeatures() # this class includes my custom features
features.append(('measure_features', measure_features))
countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=4000)
# new POS Vectorizer
countVecPOS = TfidfVectorizer(ngram_range=(1, 4), max_features=2000)
features.append(('ngram', countVecWord))
# register the POS vectorizer here, not countVecWord a second time
features.append(('pos_ngram', countVecPOS))
all_features = FeatureUnion(features)
# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)
pipeline = Pipeline([
    ('all', all_features),
    ('clf', LinearSVC1),
])
# how do I fit both X and X_tagged here
# how can the different vectorizers get either X or X_tagged?
pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)
# etc.
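How do I fit this kind of data correctly? How can the two vectorizers tell the raw text apart from the POS text? What are my options?
I also have custom features, some of which take the raw text while others take the POS text.
Edit: added MeasureFeatures()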
from sklearn.base import BaseEstimator
from sklearn.preprocessing import StandardScaler
import numpy as np

class MeasureFeatures(BaseEstimator):
    def __init__(self):
        pass

    def get_feature_names(self):
        return np.array(['type_token', 'count_nouns'])

    def fit(self, documents, y=None):
        return self

    def transform(self, x_dataset):
        X_type_token = list()
        X_count_nouns = list()
        for sentence in x_dataset:
            # takes raw text and calculates type token ratio
            X_type_token.append(type_token_ratio(sentence))
            # takes pos tag text and counts number of noun pos tags (NN, NNS etc.)
            X_count_nouns.append(count_nouns(sentence))
        X = np.array([X_type_token, X_count_nouns]).T
        print(X)
        print(X.shape)
        # the scaler is fitted on the first batch that reaches transform()
        if not hasattr(self, 'scalar'):
            self.scalar = StandardScaler().fit(X)
        return self.scalar.transform(X)
This feature transformer therefore needs the tagged text for the count_nouns() function and the raw text for type_token_ratio().
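The question does not show type_token_ratio() or count_nouns(), so the bodies below are only a guess at what they might compute, written to make the conflict concrete: one helper is meaningful only on raw words, the other only on a string of POS tags. The NOUN_TAGS set and the whitespace tokenization are assumptions.

```python
# Hypothetical implementations; the original helpers are not shown in the question.
NOUN_TAGS = {'NN', 'NNS', 'NNP', 'NNPS'}  # assumed Penn Treebank style noun tags


def type_token_ratio(sentence):
    """Distinct words divided by total words, computed on RAW text."""
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def count_nouns(tagged_sentence):
    """Number of noun tags, computed on POS-TAGGED text such as 'DET NN'."""
    return sum(1 for tag in tagged_sentence.split() if tag in NOUN_TAGS)


raw = 'the cat saw the dog'
tagged = 'DET NN VB DET NN'
print(type_token_ratio(raw))   # 4 distinct words out of 5
print(count_nouns(tagged))     # two NN tags
```

Feeding the same string to both helpers, as transform() does, means one of the two features is always computed on the wrong kind of input.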
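Answer 1: I think you need a FeatureUnion over two transformers (a TfidfTransformer and a POSTransformer). Of course, you have to define that POSTransformer yourself. Maybe this article will help you.
Your pipeline might look something like this: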
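from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ('features', FeatureUnion([
        # word n-grams computed on the raw text
        ('ngram_tf_idf', Pipeline([
            ('counts_ngram', CountVectorizer()),
            ('tf_idf_ngram', TfidfTransformer())
        ])),
        # raw text -> POS tags -> tag n-grams
        ('pos_tf_idf', Pipeline([
            ('pos', POSTransformer()),
            ('counts_pos', CountVectorizer()),
            ('tf_idf_pos', TfidfTransformer())
        ])),
        ('measure_features', MeasureFeatures())
    ])),
    ('classifier', LinearSVC())
])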
This assumes that MeasureFeatures and POSTransformer are transformers that conform to the sklearn API.
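The answer leaves POSTransformer undefined. A minimal sketch of what it could look like follows, assuming a toy lookup table (TOY_TAGS) in place of a real tagger; that table and the do_tagging() body are hypothetical stand-ins, not the asker's code. TfidfVectorizer is used in place of CountVectorizer followed by TfidfTransformer, which sklearn documents as equivalent.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy tagger: a real project would call an actual POS tagger here.
TOY_TAGS = {'i': 'PP', 'am': 'AUX', 'a': 'DET', 'an': 'DET',
            'another': 'DET', 'sentence': 'NN', 'example': 'NN'}


def do_tagging(sentences):
    """Stand-in for the question's do_tagging(): words -> space-joined tags."""
    return [' '.join(TOY_TAGS.get(word.lower(), 'UNK') for word in s.split())
            for s in sentences]


class POSTransformer(TransformerMixin, BaseEstimator):
    """Stateless transformer that turns raw sentences into tag strings."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return do_tagging(X)


X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('ngram_tf_idf', TfidfVectorizer()),          # sees the raw text
        ('pos_tf_idf', Pipeline([
            ('pos', POSTransformer()),                # raw text -> tags
            ('tf_idf_pos', TfidfVectorizer()),        # sees only the tags
        ])),
    ])),
    ('classifier', LinearSVC()),
])

pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)
```

With this layout only the raw text is passed to fit() and predict(); the tagging happens inside the POS branch, so the two vectorizers never need to be told which input is which.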
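Comments:
I added MeasureFeatures() in my latest edit. Basically, it needs the raw text for one set of features and the POS-tagged text for another set. Would it help to have two MeasureFeatures classes, one for the raw-text features and one for the POS-tag features?
I can't see your workflow here. Take a look at the one I proposed, the link, and this example (scikit-learn.org/stable/auto_examples/hetero_feature_union.html). After that, you just need to think through your workflow and what happens to your data.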
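If the tagging is done ahead of time, as in the question's X_tagged, another option is to carry both views of each sample through the pipeline together and let each branch pick the one it needs. The sketch below assumes each sample is a (raw, tagged) tuple; ColumnSelector is a hypothetical helper written for this illustration, not an sklearn class, and the tags are typed in by hand.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper: pick one position out of every sample tuple."""

    def __init__(self, index):
        self.index = index

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [row[self.index] for row in X]


RAW, TAGGED = 0, 1  # positions inside each sample tuple (an assumption)

# Tags are written by hand here; in the question they come from do_tagging().
X = [('I am a sentence', 'PP AUX DET NN'), ('an example', 'DET NN')]
Y = [1, 2]
X_dev = [('another sentence', 'DET NN')]

pipeline = Pipeline([
    ('all', FeatureUnion([
        ('ngram', Pipeline([
            ('select', ColumnSelector(RAW)),
            ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000)),
        ])),
        ('pos_ngram', Pipeline([
            ('select', ColumnSelector(TAGGED)),
            ('tfidf', TfidfVectorizer(ngram_range=(1, 4), max_features=2000)),
        ])),
    ])),
    ('clf', LinearSVC(tol=1e-4, C=0.1)),
])

pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)
```

The same selector could sit in front of two separate measure-feature transformers, one reading the RAW position and one reading TAGGED, which is one way to realize the "two MeasureFeatures classes" idea raised in the comment.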