Problems vectorizing specific columns with scikit learn DictVectorizer?

Posted: 2015-07-10 14:04:36

Question:

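I am trying to learn how to do a simple prediction task, and I am playing with this dataset, which is also available in a different format here. It is about students' performance in certain courses. I would like to vectorize only some columns of the dataset rather than use all the data (just to understand how it works). So I tried the following with DictVectorizer: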

import pandas as pd
from sklearn.feature_extraction import DictVectorizer

training_data = pd.read_csv('/Users/user/Downloads/student/student-mat.csv')

dict_vect = DictVectorizer(sparse=False)

training_matrix = dict_vect.fit_transform(training_data['G1','G2','sex','school','age'])
training_matrix.toarray()

Then I would like to pass in another set of feature rows, like this:

testing_data = pd.read_csv('/Users/user/Downloads/student/student-mat_test.csv')
test_matrix = dict_vect.transform(testing_data['G1','G2','sex','school','age'])

The problem is that I get the following traceback:

/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/bin/python2.7 school_2.py
Traceback (most recent call last):
  File "/Users/user/PycharmProjects/PAN-pruebas/escuela_2.py", line 14, in <module>
    X = dict_vect.fit_transform(df['sex','age','address','G1','G2'].values)
  File "school_2.py", line 1787, in __getitem__
    return self._getitem_column(key)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 1794, in _getitem_column
    return self._get_item_cache(key)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/generic.py", line 1079, in _get_item_cache
    values = self._data.get(item)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 2843, in get
    loc = self.items.get_loc(item)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/index.py", line 1437, in get_loc
    return self._engine.get_loc(_values_from_object(key))
  File "pandas/index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas/index.c:3824)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3704)
  File "pandas/hashtable.pyx", line 697, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12349)
  File "pandas/hashtable.pyx", line 705, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12300)
KeyError: ('sex', 'age', 'address', 'G1', 'G2')

Process finished with exit code 1

Any idea how to correctly vectorize both datasets (i.e. training and testing), and how to display both matrices with .toarray()?

Update

>>>print training_data.info()
/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/user/PycharmProjects/PAN-pruebas/escuela_3.py
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 396 entries, (school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, guardian, traveltime, studytime, failures, schoolsup, famsup, paid, activities, nursery, higher, internet, romantic, famrel, freetime, goout, Dalc, Walc, health, absences) to (MS, M, 19, U, LE3, T, 1, 1, other, at_home, course, father, 1, 1, 0, no, no, no, no, yes, yes, yes, no, 3, 2, 3, 3, 3, 5, 5)
Data columns (total 3 columns):
id         396 non-null object
content    396 non-null object
label      396 non-null object
dtypes: object(3)
memory usage: 22.7+ KB
None

Process finished with exit code 0

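Comments:

Your training data has only 3 columns because some of the columns were loaded as the index, and G1 and G2 are not even in the index. I will try loading this myself.

I can load the data correctly, but you seem to have misunderstood how to use the dict vectoriser: it needs a dict, not an array: scikit-learn.org/0.11/modules/generated/….

I see. Is there another way to vectorize a .csv file (a "database") so that it can be presented to an estimator?

There is a related post: ***.com/questions/20024584/… Try this: training_matrix = dict_vect.fit_transform(training_data[['G1','G2','sex','school','age']].T.to_dict().values()) It worked for me.

Answer 1: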

You need to pass a list:

test_matrix = dict_vect.transform(testing_data[['G1','G2','sex','school','age']])

What you did was try to index your df with the key:

['G1','G2','sex','school','age']

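That is why you get a KeyError: there is no single column with that name. To select multiple columns, you need to pass a list of column names, which means double subscripts: [[col_list]]

Example: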

In [43]:

df = pd.DataFrame(columns=['a','b'])
df
Out[43]:
Empty DataFrame
Columns: [a, b]
Index: []
In [44]:

df['a','b']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-44-33332c7e7227> in <module>()
----> 1 df['a','b']

......    
pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12349)()

pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12300)()

KeyError: ('a', 'b')

But this works:

In [45]:

df[['a','b']]
Out[45]:
Empty DataFrame
Columns: [a, b]
Index: []
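
Selecting the columns with a list fixes the KeyError, but as one of the comments on the question points out, DictVectorizer expects an iterable of dicts rather than a DataFrame. The following is only a minimal sketch under stated assumptions: the two small DataFrames are made-up stand-ins for student-mat.csv and the test file, and the `cols`, `train_records` and `test_records` names are introduced here for illustration. It converts each selected row to a dict with `to_dict(orient="records")` before vectorizing.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Made-up rows standing in for student-mat.csv (illustrative values only).
training_data = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "sex": ["F", "M", "M"],
    "age": [18, 17, 19],
    "G1": [5, 7, 8],
    "G2": [6, 5, 9],
    "absences": [6, 4, 5],  # a column that is deliberately left out below
})
testing_data = pd.DataFrame({
    "school": ["MS"], "sex": ["F"], "age": [18],
    "G1": [10], "G2": [11], "absences": [0],
})

cols = ["G1", "G2", "sex", "school", "age"]

# A list inside the brackets selects several columns; orient="records"
# turns every row into one dict, which is what DictVectorizer consumes.
train_records = training_data[cols].to_dict(orient="records")
test_records = testing_data[cols].to_dict(orient="records")

dict_vect = DictVectorizer(sparse=False)
training_matrix = dict_vect.fit_transform(train_records)  # learns the feature names
test_matrix = dict_vect.transform(test_records)           # reuses the same columns

# With sparse=False both results are already NumPy arrays,
# so there is no .toarray() to call on them.
print(dict_vect.feature_names_)
print(training_matrix)
print(test_matrix)
```

String-valued columns such as sex and school are one-hot encoded into features like sex=F, while numeric columns are passed through unchanged. Calling transform (not fit_transform) on the test records keeps the test matrix aligned with the columns learned from the training data.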

Comments:

I tried the following:
training_data = pd.read_csv('/Users/user/Downloads/student/student-mat.csv', names=['id', 'content', 'label'])
# testing_data = pd.read_csv('/Users/user/Desktop/student-mat_test.csv')
dict_vect = DictVectorizer(sparse=False)
training_matrix =dict_vect.fit_transform(training_data[['G1','G2','sex','school','age']])
print training_matrix.toarray()
I still get the same error and don't know how to proceed.

I am just downloading the data and will try to reproduce your error. Could you edit the output of training_data.info() into your question?
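
The training_data.info() output in the question update (three columns named id, content and label, plus a large MultiIndex) looks consistent with the names=['id', 'content', 'label'] argument used in the comment above. The sketch below is an assumption-laden illustration, not a diagnosis of the actual file: it uses a tiny made-up, comma-separated sample (`csv_text`) in place of student-mat.csv, and compares reading it with and without a too-short `names` list.

```python
import io
import pandas as pd

# Hypothetical miniature of the file: a header row plus two data rows.
csv_text = "school,sex,age,G1,G2\nGP,F,18,5,6\nMS,M,19,8,9\n"

# Passing fewer names than there are fields: the three names label the
# last three fields, and pandas moves the surplus leading fields into
# the index. The header row is also read as an ordinary data row.
bad = pd.read_csv(io.StringIO(csv_text), names=["id", "content", "label"])
print(list(bad.columns))
print("G1" in bad.columns)

# Letting pandas take the column names from the first row keeps G1, G2,
# sex, school and age available as regular columns.
good = pd.read_csv(io.StringIO(csv_text))
print(list(good.columns))
print(good[["G1", "G2", "sex", "school", "age"]])
```

If the real file behaves like this sample, dropping the names argument should make the list-based column selection from the answer work.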
