How to create feature vectors from the most frequent words (feature selection in scikit-learn)

Question

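I am using scikit-learn to create feature vectors for documents. My goal is to use these feature vectors to build a binary classifier (a gender classifier).

I want to use the top-k words as features, i.e. the k words with the highest counts from each of the two label documents (k = 10 -> 20 features, because there are 2 labels).

Both of my documents (label1document, label2document) are filled with instances like this: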

user:somename, post:"A written text which i use"

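My understanding so far is that I use all the text from all instances of both documents to build a vocabulary with counts (counts per label, so that I can compare the label data):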

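from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# These are my documents, each holding all the text for one label
label1document = "car eat essen sleep sleep"
label2document = "eat sleep woman woman woman woman"

vectorizer = CountVectorizer(min_df=1)

corpus = [label1document, label2document]

# Here I create a matrix with the word counts from both documents
# (one row per document, one column per vocabulary word)
X = vectorizer.fit_transform(corpus)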

Question 1: What do I need to pass to fit_transform to get the most frequent words from both labels?

X_new = SelectKBest(chi2, k=2).fit_transform( ?? )

In the end, I want training data (instances) that look like this:

<label>  <feature1 : value> ... <featureN: value>
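
As an illustration of that layout, the sketch below writes one line per instance, assuming the feature words have already been chosen. The posts, labels and feature_words values are hypothetical placeholders, not my real data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical labelled posts (one instance per post)
posts = ["car eat sleep", "woman eat woman"]
labels = ["label1", "label2"]

# Hypothetical fixed feature words, e.g. top-k words chosen beforehand
feature_words = ["eat", "sleep", "woman"]

# A fixed vocabulary restricts counting to exactly these words,
# with columns in the order given here
vectorizer = CountVectorizer(vocabulary=feature_words)
X = vectorizer.transform(posts).toarray()

lines = []
for label, row in zip(labels, X):
    pairs = " ".join(f"{word}:{count}" for word, count in zip(feature_words, row))
    lines.append(f"{label} {pairs}")

print("\n".join(lines))
```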

Question 2: How do I get from there to this training data?

Oliver