Text Feature Extraction using scikit-learn

Posted 2023-03-12


【Title】: Text Feature Extraction using scikit-learn 【Posted】: 2013-11-22 09:04:28 【Question】:

I am using the scikit-learn package to extract features from a corpus. My code is as follows:

#! /usr/bin/python -tt

from __future__ import division
import re
import random
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.cluster.util import cosine_distance
from operator import itemgetter

def preprocess(fnin, fnout):
  fin = open(fnin, 'rb')
  fout = open(fnout, 'wb')
  buf = []
  id = ""
  category = ""
  for line in fin:
    line = line.strip()

    if line.find("-- Document Separator --") > -1:
      if len(buf) > 0:
        # write out body,
        body = re.sub("\s+", " ", " ".join(buf))
        fout.write("%s\t%s\t%s\n" % (id, category, body))
      # process next header and init buf
      id, category, rest = map(lambda x: x.strip(), line.split(": "))
      buf = []
    else:
      # process body
      buf.append(line)
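  # flush the last document, which has no separator line after it
  if len(buf) > 0:
    body = re.sub("\s+", " ", " ".join(buf))
    fout.write("%s\t%s\t%s\n" % (id, category, body))
  fin.close()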
  fout.close()

def train(fnin):
  docs = []
  cats = []
  fin = open(fnin, 'rb')
  for line in fin:
    id, category, body = line.strip().split("\t")
    docs.append(body)
    cats.append(category)
  fin.close()
  v=CountVectorizer(min_df=1,stop_words="english")
  pipeline = Pipeline([
    ("vect", v),
    ("tfidf", TfidfTransformer(use_idf=False))])
  tdMatrix = pipeline.fit_transform(docs, cats)
  return tdMatrix, cats


def main():
  preprocess("corpus.txt", "sccpp.txt")
  tdMatrix, cats = train("sccpp.txt")

if __name__ == "__main__":
  main()

My corpus (abbreviated) is in corpus.txt:

0: sugar: -- Document Separator -- reut2-021.sgm
British Sugar Plc was forced to shut its
Ipswich sugar factory on Sunday afternoon due to an acute
shortage of beet supplies, a spokesman said, responding to a
Reuter inquiry
    Beet supplies have dried up at Ipswich due to a combination
of very wet weather, which has prevented most farmers in the
factory's catchment area from harvesting, and last week's
hurricane which blocked roads.
    The Ipswich factory will remain closed until roads are
cleared and supplies of beet build up again.
    This is the first time in many years that a factory has
been closed in mid-campaign, the spokesman added.
    Other factories are continuing to process beet normally,
but harvesting remains very difficult in most areas.
    Ipswich is one of 13 sugar factories operated by British
Sugar. It processes in excess of 500,000 tonnes of beet a year
out of an annual beet crop of around eight mln tonnes.
    Despite the closure of Ipswich and the severe harvesting
problems in other factory areas, British Sugar is maintaining
its estimate of sugar production this campaign at around

The error message is:

v=CountVectorizer(min_df=1,stop_words="english")
TypeError: __init__() got an unexpected keyword argument 'min_df'

I am using Python 2.7.4 on Linux Mint. Can anyone suggest how to solve this problem? Thanks in advance.

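【Comments】:

My hunch is that the problem is with your sklearn version. I do not get this error with sklearn 0.13.1.

@Akavall: Thank you very much for your reply. I installed the whole set of packages with: sudo apt-get install build-essential python-dev python-numpy python-setuptools python-scipy libatlas-dev libatlas3-base. Then I tried this code. What should I do now?

What is your sklearn version? You can find it with import sklearn followed by sklearn.__version__. If you need to update sklearn, you can read how to do it at scikit-learn.org/stable/install.html#

@Akavall: I am using 0.11, and when I run sudo apt-get install python-sklearn, it says python-sklearn is already the newest version. :(

【Answer 1】: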

You need a newer version of scikit-learn. Get rid of the one that ships with Mint:

sudo apt-get remove python-sklearn

Install the packages needed to build a newer version:

sudo apt-get install python-numpy-dev python-scipy-dev python-pip

Then fetch the latest version and build it with pip:

sudo pip install scikit-learn
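
Once the upgrade has finished, a quick check from the same interpreter can confirm that the new version is the one being picked up. The snippet below is only a sketch: the helper name `check_sklearn` is made up for illustration, and it simply retries the constructor call that failed in the question rather than relying on any particular version number.

```python
from __future__ import print_function


def check_sklearn():
    """Return (version, accepts_min_df), or None if scikit-learn cannot be imported.

    check_sklearn is a hypothetical helper name, not part of scikit-learn.
    """
    try:
        import sklearn
        from sklearn.feature_extraction.text import CountVectorizer
    except ImportError:
        return None
    try:
        # The exact call from the question; old releases raise TypeError here.
        CountVectorizer(min_df=1, stop_words="english")
        accepts_min_df = True
    except TypeError:
        accepts_min_df = False
    return sklearn.__version__, accepts_min_df


info = check_sklearn()
if info is None:
    print("scikit-learn is not importable from this interpreter")
else:
    print("scikit-learn %s, CountVectorizer accepts min_df: %s" % info)
```

If the second value is still False, the interpreter is most likely importing the old package rather than the one pip just built.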

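【Comments】:

If the OP already has pip, can't he just run sudo pip install -U scikit-learn?

@Akavall: I don't think pip will uninstall a package that was installed with apt-get, so they could end up with two versions installed, which leads to all kinds of confusion.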
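
The two-versions situation described in the last comment can be spotted by looking for more than one copy of the package on the import path. The following is a rough sketch using only the standard library; `find_package_copies` is a hypothetical helper, and it assumes the usual case where the first matching directory on `sys.path` is the one Python imports.

```python
import os
import sys


def find_package_copies(name, search_path=None):
    """List every directory on the import path that holds a package called `name`.

    find_package_copies is a hypothetical helper written for this illustration.
    """
    entries = sys.path if search_path is None else search_path
    copies = []
    for entry in entries:
        # An empty sys.path entry stands for the current working directory.
        candidate = os.path.join(entry or os.getcwd(), name)
        has_init = os.path.isfile(os.path.join(candidate, "__init__.py"))
        if has_init and candidate not in copies:
            copies.append(candidate)
    return copies


copies = find_package_copies("sklearn")
for position, path in enumerate(copies):
    # Assumption: the earliest entry wins; later copies are shadowed.
    note = "(imported)" if position == 0 else "(shadowed)"
    print(path, note)
```

More than one line of output suggests that an apt-get copy and a pip copy coexist, in which case removing the distribution package first, as the answer recommends, avoids the ambiguity.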
