How to use a list of strings as training data for an SVM using scikit-learn?
【Posted】2013-12-18 13:28:05
【Question】I am using scikit-learn to train an SVM on data in which each observation (X) is a list of words and each label (Y) is a float. I tried to follow the multi-class classification example in the scikit-learn documentation (http://scikit-learn.org/stable/modules/svm.html). Here is my code:
from __future__ import division
from sklearn import svm
import os.path
import numpy
import re
'''
The stanford-postagger was included to see how it tags the words and whether it would help in getting just the names
of the ingredients. Turns out it's pointless.
'''
#from nltk.tag.stanford import POSTagger
mainDirectory = './nyu/PROJECTS/Epicurious/DATA/ingredients'
#st = POSTagger('/usr/share/stanford-postagger/models/english-bidirectional-distsim.tagger','/usr/share/stanford-postagger/stanford-postagger.jar')
'''
This is where we read each line of the file and then run a regex match on it to get all the words before
the first tab. (These are the names of the ingredients. Some of them may have adjectives like fresh, peeled, cut, etc.
Not sure what to do about them yet.)
'''
def getFileDetails(_filename,_fileDescriptor):
    rankingRegexMatch = re.match('([0-9](?:\_)[0-9]?)', _filename)
    if len(rankingRegexMatch.group(0)) == 2:
        ranking = float(rankingRegexMatch.group(0)[0])
    else:
        ranking = float(rankingRegexMatch.group(0)[0]+'.'+rankingRegexMatch.group(0)[2])
    _keywords = []
    for line in _fileDescriptor:
        m = re.match('(\w+\s*\w*)(?=\t[0-9])', line)
        if m:
            _keywords.append(m.group(0))
    return [_keywords,ranking]
'''
Open each file in the directory and pass the name and file descriptor to getFileDetails
'''
def this_is_it(files):
    _allKeywords = []
    _allRankings = []
    for eachFile in files:
        fullFilePath = mainDirectory + '/' + eachFile
        f = open(fullFilePath)
        XandYForThisFile = getFileDetails(eachFile,f)
        _allKeywords.append(XandYForThisFile[0])
        _allRankings.append(XandYForThisFile[1])
    #_allKeywords = numpy.array(_allKeywords,dtype=object)
    svm_learning(_allKeywords,_allRankings)
def svm_learning(x,y):
    clf = svm.SVC()
    clf.fit(x,y)
'''
This just prints the directory path and then calls the callback x on files
'''
def print_files( x, dir_path , files ):
    print dir_path
    x(files)
'''
code starts here
'''
os.path.walk(mainDirectory, print_files, this_is_it)
Calling svm_learning(x,y) raises this error:
Traceback (most recent call last):
  File "scan for files.py", line 72, in <module>
    os.path.walk(mainDirectory, print_files, this_is_it)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/posixpath.py", line 238, in walk
    func(arg, top, names)
  File "scan for files.py", line 68, in print_files
    x(files)
  File "scan for files.py", line 56, in this_is_it
    svm_learning(_allKeywords,_allRankings)
  File "scan for files.py", line 62, in svm_learning
    clf.fit(x,y)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/svm/base.py", line 135, in fit
    X = atleast2d_or_csr(X, dtype=np.float64, order='C')
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 116, in atleast2d_or_csr
    "tocsr")
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 96, in _atleast2d_or_sparse
    X = array2d(X, dtype=dtype, order=order, copy=copy)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 80, in array2d
    X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
  File "/Library/Python/2.7/site-packages/numpy-1.8.0.dev_bbcfcf6_20130307-py2.7-macosx-10.8-intel.egg/numpy/core/numeric.py", line 331, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
Can anyone help? I am new to scikit-learn and could not find anything relevant in the documentation.
【Comments】
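See the feature extraction documentation.
【Answer 1】
You should look at: Text feature extraction. You will want to use TfidfVectorizer, CountVectorizer, or HashingVectorizer (if your data is very large). These components take your text as input and output a feature matrix that the classifier accepts. Note that they work on a list of strings, one string per sample, so if you have a list of lists of strings (already tokenized), you may need to join() the tokens to get a list of strings, or skip tokenization.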
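As a minimal sketch of the join-then-vectorize approach described in this answer: the traceback in the question shows fit() trying to convert X to a float64 array, which cannot work for lists of words, so the token lists are first joined into one string per sample and passed through TfidfVectorizer. The sample data, the variable names, and the use of SVR are assumptions for illustration and are not from the question. SVR is swapped in for the question's SVC only because the labels are floats; if the rankings are really discrete classes, SVC with class labels may be the better fit.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVR

# Hypothetical data shaped like the question's: one token list per file
# and one float ranking per file. The values are made up for illustration.
all_keywords = [
    ["olive oil", "garlic", "tomato"],
    ["butter", "sugar", "flour"],
    ["garlic", "butter", "parsley"],
    ["sugar", "flour", "vanilla"],
]
all_rankings = [3.5, 4.0, 3.0, 4.5]

# The vectorizers expect one string per sample, so join each token list.
documents = [" ".join(tokens) for tokens in all_keywords]

# Turn the strings into a sparse numeric matrix of shape
# (n_samples, n_features), which the estimator can accept.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

# Assumption: the float rankings are treated as a regression target.
model = SVR()
model.fit(X, all_rankings)

# New samples must go through the same fitted vectorizer.
X_new = vectorizer.transform([" ".join(["garlic", "tomato"])])
predictions = model.predict(X_new)
```

Note that joining with spaces lets the vectorizer re-tokenize the text, so a multi-word ingredient such as "olive oil" becomes two separate features in this sketch.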