Python ScikitLearn GridSearchCV issues with TFIDF - JobLibValueError?
【Posted】: 2016-02-08 21:38:02
【Question】: I have a corpus of words that I run TFIDF on, and I then try to classify it using logistic regression together with GridSearch.
But when I run GridSearch I get a huge error. It looks like this (it is longer, but I only copied and pasted part of it):
An unexpected error occurred while tokenizing input file /Users/yongcho822/anaconda/lib/python2.7/site-packages/sklearn/base.pyc
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (2, 0))
An unexpected error occurred while tokenizing input file /Users/yongcho822/anaconda/lib/python2.7/site-packages/sklearn/base.pyc
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (2, 0))
---------------------------------------------------------------------------
JoblibValueError Traceback (most recent call last)
<ipython-input-43-7c8b397eb30b> in <module>()
----> 1 gs_lr_tfidf.fit(X_train, y_train)
/Users/yongcho822/anaconda/lib/python2.7/site-packages/sklearn/grid_search.pyc in fit(self, X, y)
802
803 """
--> 804 return self._fit(X, y, ParameterGrid(self.param_grid))
805
806
/Users/yongcho822/anaconda/lib/python2.7/site-packages/sklearn/grid_search.pyc in _fit(self, X, y, parameter_iterable)
551 self.fit_params, return_parameters=True,
552 error_score=self.error_score)
--> 553 for parameters in parameter_iterable
554 for train, test in cv)
555
/Users/yongcho822/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
810 # consumption.
811 self._iterating = False
--> 812 self.retrieve()
813 # Make sure that we get a last message telling us we are done
814 elapsed_time = time.time() - self._start_time
/Users/yongcho822/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in retrieve(self)
760 # a working pool as they expect.
761 self._initialize_pool()
--> 762 raise exception
763
764 def __call__(self, iterable):
JoblibValueError: JoblibValueError
___________________________________________________________________________
Multiprocessing exception:
...........................................................................
/Users/yongcho822/anaconda/lib/python2.7/runpy.py in _run_module_as_main(mod_name='IPython.kernel.__main__', alter_argv=1)
157 pkg_name = mod_name.rpartition('.')[0]
158 main_globals = sys.modules["__main__"].__dict__
159 if alter_argv:
160 sys.argv[0] = fname
161 return _run_code(code, main_globals, None,
--> 162 "__main__", fname, loader, pkg_name)
fname = '/Users/yongcho822/anaconda/lib/python2.7/site-packages/IPython/kernel/__main__.py'
loader = <pkgutil.ImpLoader instance>
pkg_name = 'IPython.kernel'
163
164 def run_module(mod_name, init_globals=None,
165 run_name=None, alter_sys=False):
166 """Execute a module's code without importing it
...........................................................................
/Users/yongcho822/anaconda/lib/python2.7/runpy.py in _run_code(code=<code object <module> at 0x1033028b0, file "/Use...ite-packages/IPython/kernel/__main__.py", line 1>, run_globals='__builtins__': <module '__builtin__' (built-in)>, '__doc__': None, '__file__': '/Users/yongcho822/anaconda/lib/python2.7/site-packages/IPython/kernel/__main__.py', '__loader__': <pkgutil.ImpLoader instance>, '__name__': '__main__', '__package__': 'IPython.kernel', 'app': <module 'IPython.kernel.zmq.kernelapp' from '/Us.../site-packages/IPython/kernel/zmq/kernelapp.pyc'>, init_globals=None, mod_name='__main__', mod_fname='/Users/yongcho822/anaconda/lib/python2.7/site-packages/IPython/kernel/__main__.py', mod_loader=<pkgutil.ImpLoader instance>, pkg_name='IPython.kernel')
67 run_globals.update(init_globals)
68 run_globals.update(__name__ = mod_name,
69 __file__ = mod_fname,
70 __loader__ = mod_loader,
71 __package__ = pkg_name)
---> 72 exec code in run_globals
code = <code object <module> at 0x1033028b0, file "/Use...ite-packages/IPython/kernel/__main__.py", line 1>
run_globals = '__builtins__': <module '__builtin__' (built-in)>, '__doc__': None, '__file__': '/Users/yongcho822/anaconda/lib/python2.7/site-packages/IPython/kernel/__main__.py', '__loader__': <pkgutil.ImpLoader instance>, '__name__': '__main__', '__package__': 'IPython.kernel', 'app': <module 'IPython.kernel.zmq.kernelapp' from '/Us.../site-packages/IPython/kernel/zmq/kernelapp.pyc'>
73 return run_globals
74
75 def _run_module_code(code, init_globals=None,
76 mod_name=None, mod_fname=None,
...........................................................................
/Users/yongcho822/anaconda/lib/python2.7/site-packages/IPython/kernel/__main__.py in <module>()
1
2
----> 3
4 if __name__ == '__main__':
5 from IPython.kernel.zmq import kernelapp as app
6 app.launch_new_instance()
7
8
9
10
What am I doing wrong? This is what I am doing:
X_train, X_test, y_train, y_test = train_test_split(train_X_tfidf_DF.values, train_Y, test_size=0.25, random_state=1)
X_train.shape, type(X_train), y_train.shape, type(y_train)
>>>((29830, 6648), numpy.ndarray, (29830,), numpy.ndarray)
X_train[:2]
>>>array([[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.]])
y_train[:2]
>>>array([11, 16])
param_grid = [{'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]}]
gs_lr_tfidf = GridSearchCV(estimator = LogisticRegression(),
param_grid = param_grid,
scoring = 'accuracy',
cv = 5, verbose = 1, n_jobs = -1)
gs_lr_tfidf.fit(X_train, y_train)
(this is where the error pops up)
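【Comments】:
It looks like there is a problem with the multiprocessing module. Have you tried setting n_jobs=1?
@SebastianRaschka That worked for me. I mean, it got rid of the puzzling "An unexpected error occurred while tokenizing input file .../sklearn/base.pyc" error and revealed the actual error. In my case, the actual problem was incorrect parameter keys.
Same here. Somehow a sneaky negative got in there: n_jobs=-1
n_jobs=-1 is perfectly fine; it uses all available CPUs.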
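On the incorrect parameter keys mentioned in the comments: keys of the form `clf__penalty` address a parameter of a Pipeline step named `clf`, whereas the question passes a bare `LogisticRegression()` as the estimator. The following is a minimal sketch of how the grid keys could be checked before fitting. The helper `unknown_grid_keys` is hypothetical (it is not part of scikit-learn); it only relies on the estimator's `get_params()` method.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def unknown_grid_keys(estimator, param_grid):
    """Hypothetical helper: return the grid keys the estimator does not accept."""
    valid = set(estimator.get_params(deep=True))
    # GridSearchCV accepts either one dict or a list of dicts.
    grids = param_grid if isinstance(param_grid, list) else [param_grid]
    return sorted({key for grid in grids for key in grid if key not in valid})


param_grid = [{'clf__penalty': ['l1', 'l2'], 'clf__C': [1.0, 10.0, 100.0]}]

# A bare LogisticRegression has no step called "clf", so both keys are unknown.
bad_for_bare = unknown_grid_keys(LogisticRegression(), param_grid)

# Wrapped in a Pipeline whose step is named "clf", the prefixed keys resolve.
pipe = Pipeline([('clf', LogisticRegression())])
bad_for_pipe = unknown_grid_keys(pipe, param_grid)

print(bad_for_bare)
print(bad_for_pipe)
```

If the estimator is meant to stay a bare `LogisticRegression()`, the grid keys would presumably be `penalty` and `C` without the `clf__` prefix.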
【Answer 1】:
I ran into a similar problem. First set n_jobs to 1 and run the code; you will then get the real error message. Fix that error, then switch back to n_jobs = -1.
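The approach in this answer can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses a small synthetic dataset from `make_classification` (the names `X_demo` and `y_demo` are made up) instead of the TFIDF matrix from the question, and it imports `GridSearchCV` from `sklearn.model_selection`, which is where newer scikit-learn versions keep it (the traceback in the question comes from the older `sklearn.grid_search` module).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in for the TFIDF features (assumption, not the real data).
X_demo, y_demo = make_classification(n_samples=60, n_features=5, random_state=1)

# Same grid as in the question, including the "clf__" prefix.
param_grid = [{'clf__penalty': ['l1', 'l2'], 'clf__C': [1.0, 10.0, 100.0]}]

# n_jobs=1 keeps everything in one process, so the underlying exception
# is raised directly instead of being wrapped by the parallel backend.
gs = GridSearchCV(estimator=LogisticRegression(), param_grid=param_grid,
                  scoring='accuracy', cv=3, n_jobs=1)

real_error = None
try:
    gs.fit(X_demo, y_demo)
except ValueError as exc:
    real_error = str(exc)
    print(real_error)
```

With a single process the message should point at the invalid `clf` parameter of `LogisticRegression`; once the grid is corrected, `n_jobs=-1` can be restored.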
【Answer 2】: My problem was in param_grid, where I had set an invalid value. Check your values. In my case, for example, the offending value was simply 1:
'max_leaf_nodes':[1]
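One way to find such a value is to fit the estimator once per candidate on a tiny dataset and record which candidates are rejected. The sketch below assumes a `DecisionTreeClassifier`, since the answer does not say which estimator was used; the dataset and the names `candidates` and `rejected` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Tiny synthetic dataset, only used to trigger parameter validation (assumption).
X_demo, y_demo = make_classification(n_samples=40, n_features=4, random_state=0)

# Candidate values as they might appear in a param_grid.
candidates = {'max_leaf_nodes': [1, 2, 10]}

rejected = {}
for name, values in candidates.items():
    for value in values:
        try:
            DecisionTreeClassifier(random_state=0, **{name: value}).fit(X_demo, y_demo)
        except ValueError as exc:
            # scikit-learn reports invalid parameter values as ValueError at fit time.
            rejected[(name, value)] = str(exc)

print(sorted(rejected))
```

Here `max_leaf_nodes=1` is rejected because a tree needs at least two leaves, while 2 and 10 are accepted.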