Unhashable type error with sklearn and importing a CSV

Posted 2023-03-11

Published: 2014-02-25 17:21:14

Question:

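I'm trying to run the following code, but I don't understand what I'm doing wrong. The goal is to split the data into training and test sets using Python and sklearn's train_test_split function.

The data (download link at the end of the question) is rental cost data for various houses and apartments, along with the attributes of each one. Ultimately I want to use a predictive model to predict rent prices, so the rent price is the target. The code is as follows: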

import pandas as pd
rentdata = pd.read_csv('6000_clean.csv')

import sklearn as sk
import numpy as np
import matplotlib.pyplot as plt

from sklearn.cross_validation import train_test_split

#trying to make a all rows of the first column and b all rows of columns 2-46, i.e., a will be only target data (rent prices) and b will be the data.

a, b = rentdata[ : ,0], rentdata[ : ,1:46]

This results in the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-24-789fb8e8c2f6> in <module>()
      8 from sklearn.cross_validation import train_test_split
      9 
---> 10 a, b = rentdata[ : ,0], rentdata[ : ,1:46]
     11 

C:\Users\Nick\Anaconda\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
   2001             # get column
   2002             if self.columns.is_unique:
-> 2003                 return self._get_item_cache(key)
   2004 
   2005             # duplicate columns

C:\Users\Nick\Anaconda\lib\site-packages\pandas\core\generic.pyc in _get_item_cache(self, item)
    665             return cache[item]
    666         except Exception:
--> 667             values = self._data.get(item)
    668             res = self._box_item_values(item, values)
    669             cache[item] = res

C:\Users\Nick\Anaconda\lib\site-packages\pandas\core\internals.pyc in get(self, item)
   1653     def get(self, item):
   1654         if self.items.is_unique:
-> 1655             _, block = self._find_block(item)
   1656             return block.get(item)
   1657         else:

C:\Users\Nick\Anaconda\lib\site-packages\pandas\core\internals.pyc in _find_block(self, item)
   1933 
   1934     def _find_block(self, item):
-> 1935         self._check_have(item)
   1936         for i, block in enumerate(self.blocks):
   1937             if item in block:

C:\Users\Nick\Anaconda\lib\site-packages\pandas\core\internals.pyc in _check_have(self, item)
   1939 
   1940     def _check_have(self, item):
-> 1941         if item not in self.items:
   1942             raise KeyError('no item named %s' % com.pprint_thing(item))
   1943 

C:\Users\Nick\Anaconda\lib\site-packages\pandas\core\index.pyc in __contains__(self, key)
    317 
    318     def __contains__(self, key):
--> 319         hash(key)
    320         # work around some kind of odd cython bug
    321         try:

TypeError: unhashable type

You can download the CSV here to inspect the data: http://wikisend.com/download/776790/6000_clean.csv

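Comments:

Your slice syntax ([ : ,0]) sends a tuple containing a slice() object to __getitem__: (slice(None, None, None), 0). There is a uniqueness check there, and the slice() object fails the hash() test (it is not a hashable object). I'm not familiar with pandas, so I don't know whether pandas supports your slice syntax, but that is the technical reason for the exception.

Thanks, any idea how to fix the code?

No, sorry. I don't know what you are trying to do, or how to turn it into valid code with pandas.

You need to read up on how to index a pandas DataFrame: pandas.pydata.org/pandas-docs/stable/indexing.html I think in your case the syntax should be: a, b = rentdata.iloc[0], rentdata.iloc[1:46]

Answer 1: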

I downloaded your data and changed the problem line to:

a, b = rentdata.iloc[0], rentdata.iloc[1:46]

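iloc selects rows by position; see the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-position

This now selects the first row and rows 2-46 (remember that slicing is half-open: it includes the start of the range but not the end).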
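The comment in the question's code asks for columns rather than rows (the first column as the target, the remaining columns as the features). A minimal sketch of the difference between the two kinds of selection, using a small made-up DataFrame in place of 6000_clean.csv; the column names and values are illustrative assumptions, not the real data:

```python
import pandas as pd

# Hypothetical stand-in for the rent data; not the real CSV contents.
df = pd.DataFrame({
    "rent": [1150, 1400, 2100],
    "bedrooms": [1, 2, 3],
    "bathrooms": [1, 1, 2],
})

# Row selection, as in the answer: the first row, then the rows after it.
first_row = df.iloc[0]      # Series holding the values of the row at position 0
other_rows = df.iloc[1:3]   # DataFrame with the rows at positions 1 and 2

# Column selection, as described in the question's code comment.
target = df.iloc[:, 0]      # Series: every row of the first column
features = df.iloc[:, 1:3]  # DataFrame: every row of the columns at positions 1 and 2

print(first_row.tolist())
print(target.tolist())
print(list(features.columns))
```

For the real data, the column-wise equivalent of the problem line would presumably be rentdata.iloc[:, 0] and rentdata.iloc[:, 1:46].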

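Note that you can also select the first row using head:

a, b = rentdata.head(1), rentdata.iloc[1:46]

This also works, although head(1) returns a one-row DataFrame rather than a Series (head(0) would return an empty DataFrame).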

In [5]:

a

Out[5]:

Monthly $ rent                                                    1150
Location                                                       alameda
# of bedrooms                                                        1
# of bathrooms                                                       1
# of square feet                                                   NaN
Latitude                                                      37.77054
Longitude                                                    -122.2509
Street address                                  1500-1598 Lincoln Lane
# more rows so trimmed for brevity here
.......

In [9]: b

Out[9]:
# too large to paste here
.....
45 rows × 46 columns
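Once the target and the feature columns have been separated, they can be passed to train_test_split, which was the original goal of the question. The following is a sketch under a few assumptions: in current scikit-learn releases the function is imported from sklearn.model_selection (the sklearn.cross_validation module used in the question was removed in later versions), the small DataFrame is made-up data standing in for the CSV, and the test_size and random_state values are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: the first column is the target (rent).
df = pd.DataFrame({
    "rent": [1150, 1400, 2100, 950, 1800, 2500, 1300, 1600],
    "bedrooms": [1, 2, 3, 1, 2, 4, 1, 2],
    "bathrooms": [1, 1, 2, 1, 2, 3, 1, 1],
})

y = df.iloc[:, 0]    # target: all rows of the first column
X = df.iloc[:, 1:]   # features: all rows of the remaining columns

# Hold out a quarter of the rows for testing; the seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)
```

Columns that hold text, such as Location or Street address in the real data, would need to be encoded or dropped before fitting most sklearn models; the split itself does not require that.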
