Pandas merge error TypeError: '>' not supported between instances of 'int' and 'str'
Asked: 2018-06-06 18:36:14
I have a dataset made up of several tables, each in the form of country, year, and some indicators. I converted all the Excel sheets to CSV files and then merged them into one table.
The problem is that some tables refuse to merge and fail with the message TypeError: '>' not supported between instances of 'int' and 'str'
I tried everything I could think of, without luck; the same error keeps appearing.
I have run this on hundreds of different files, and dozens of them still hit this problem.
Two sample files are file17.csv and file35.csv (in case anyone needs to reproduce it). This is the code I used:
# To load the first file
import pandas as pd
filename1 = 'file17.csv'
df1 = pd.read_csv(filename1, encoding='cp1252', low_memory=False)
df1.set_index(['Country', 'Year'], inplace=True)
df1.dropna(axis=0, how='all', inplace=True)
df1.head()
Out >>>
+-------------+------+--------+--------+
| | | ind500 | ind356 |
| Country | Year | | |
| Afghanistan | 1800 | 603.0 | NaN |
| | 1801 | 603.0 | NaN |
| | 1802 | 603.0 | NaN |
| | 1803 | 603.0 | NaN |
| | 1804 | 603.0 | NaN |
+-------------+------+--------+--------+
In >>>
# To load the second file
filename2 = 'file35.csv'
df2 = pd.read_csv(filename2, encoding='cp1252', low_memory=False)
df2.set_index(['Country', 'Year'], inplace=True)
df2.dropna(axis=0, how='all', inplace=True)
df2.head()
Out >>>
# To merge the two dataframes
gross_df = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')
gross_df.dropna(axis=0, how='all', inplace=True)
print (gross_df.shape)
gross_df.to_csv('merged.csv')
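Important note:
I noticed that in all the files that merge successfully, the column names appear in ascending order, e.g. ind001, ind009, ind012, because they are sorted automatically. The files that fail have one or more columns out of order, e.g. ind500 followed by ind356 in the first table, and the same holds for the second sample provided.
Note that both dataframes have two index levels (Country and Year).
Error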
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\algorithms.py in safe_sort(values, labels, na_sentinel, assume_unique)
480 try:
--> 481 sorter = values.argsort()
482 ordered = values.take(sorter)
TypeError: '>' not supported between instances of 'int' and 'str'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-11-960b2698de60> in <module>()
----> 1 gross_df = pd.merge(df1, df2, left_index=True, right_index=True, how='outer', sort=False)
2 gross_df.dropna(axis=0, how='all', inplace=True)
3 print (gross_df.shape)
4 gross_df.to_csv('merged.csv')
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\reshape\merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator)
52 right_index=right_index, sort=sort, suffixes=suffixes,
53 copy=copy, indicator=indicator)
---> 54 return op.get_result()
55
56
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\reshape\merge.py in get_result(self)
567 self.left, self.right)
568
--> 569 join_index, left_indexer, right_indexer = self._get_join_info()
570
571 ldata, rdata = self.left._data, self.right._data
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\reshape\merge.py in _get_join_info(self)
720 join_index, left_indexer, right_indexer = \
721 left_ax.join(right_ax, how=self.how, return_indexers=True,
--> 722 sort=self.sort)
723 elif self.right_index and self.how == 'left':
724 join_index, left_indexer, right_indexer = \
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\indexes\base.py in join(self, other, how, level, return_indexers, sort)
2995 else:
2996 return self._join_non_unique(other, how=how,
-> 2997 return_indexers=return_indexers)
2998 elif self.is_monotonic and other.is_monotonic:
2999 try:
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\indexes\base.py in _join_non_unique(self, other, how, return_indexers)
3076 left_idx, right_idx = _get_join_indexers([self.values],
3077 [other._values], how=how,
-> 3078 sort=True)
3079
3080 left_idx = _ensure_platform_int(left_idx)
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\reshape\merge.py in _get_join_indexers(left_keys, right_keys, sort, how, **kwargs)
980
981 # get left & right join labels and num. of levels at each location
--> 982 llab, rlab, shape = map(list, zip(* map(fkeys, left_keys, right_keys)))
983
984 # get flat i8 keys from label lists
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\reshape\merge.py in _factorize_keys(lk, rk, sort)
1409 if sort:
1410 uniques = rizer.uniques.to_array()
-> 1411 llab, rlab = _sort_labels(uniques, llab, rlab)
1412
1413 # NA group
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\reshape\merge.py in _sort_labels(uniques, left, right)
1435 labels = np.concatenate([left, right])
1436
-> 1437 _, new_labels = algos.safe_sort(uniques, labels, na_sentinel=-1)
1438 new_labels = _ensure_int64(new_labels)
1439 new_left, new_right = new_labels[:l], new_labels[l:]
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\algorithms.py in safe_sort(values, labels, na_sentinel, assume_unique)
483 except TypeError:
484 # try this anyway
--> 485 ordered = sort_mixed(values)
486
487 # labels:
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\algorithms.py in sort_mixed(values)
469 str_pos = np.array([isinstance(x, string_types) for x in values],
470 dtype=bool)
--> 471 nums = np.sort(values[~str_pos])
472 strs = np.sort(values[str_pos])
473 return _ensure_object(np.concatenate([nums, strs]))
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\numpy\core\fromnumeric.py in sort(a, axis, kind, order)
820 else:
821 a = asanyarray(a).copy(order="K")
--> 822 a.sort(axis=axis, kind=kind, order=order)
823 return a
824
TypeError: '>' not supported between instances of 'int' and 'str'
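The last frames of the traceback end inside NumPy's sort. A minimal sketch of the same class of failure, assuming an object array that holds both integers and strings (the year values below are made up for illustration and are not taken from the CSV files):

```python
import numpy as np

# Hypothetical key values: the same year stored once as int and once as str.
mixed_years = np.array([1967, "1967", 1800], dtype=object)

try:
    np.sort(mixed_years)
    sort_failed = False
except TypeError as exc:
    # Python 3 refuses ordering comparisons between int and str.
    sort_failed = True
    print(type(exc).__name__, exc)
```

Python 2 allowed ordering comparisons between int and str, so the same sort would not raise there.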
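Comments:
Please don't post screenshots of errors; copy the text and paste it as a code block!
OK, done... sorry.
Make sure the indices of the DFs being merged have the same dtypes.
@MaxU Thank you very much. I did find that the 'Year' column is stored as int64 in the first dataframe and as 'object' in the second. I will change int64 to string and then get back to you.
@MohammadElNesr, could you post the output of df1.head() as text, so that we can copy and paste it?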
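The dtype mismatch described in the comments can be checked before merging. A small sketch, assuming two frames indexed by Country and Year; the frames and the helper `level_dtypes` are hypothetical stand-ins, not the asker's files:

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files: Year is int in one, str in the other.
left = pd.DataFrame(
    {"Country": ["Afghanistan"], "Year": [1800], "ind500": [603.0]}
).set_index(["Country", "Year"])
right = pd.DataFrame(
    {"Country": ["Afghanistan"], "Year": ["1800"], "ind475": [1.0]}
).set_index(["Country", "Year"])


def level_dtypes(df):
    """Return the dtype of every index level, keyed by level name."""
    return {name: df.index.get_level_values(name).dtype for name in df.index.names}


left_types = level_dtypes(left)
right_types = level_dtypes(right)

# Index levels whose dtypes differ between the two frames.
mismatched = [name for name in left.index.names if left_types[name] != right_types[name]]
print(mismatched)
```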
Answer 1:
This error indicates that the indices of the DFs being merged have different dtypes.
Demo: how to convert a string index level to int:
In [183]: df
Out[183]:
0 1 2 3
bar 1 -0.205037 0.762509 0.816608 -1.057907
2 1.249104 0.338777 -0.982084 0.329330
baz 1 0.845695 -0.996365 0.548100 -0.113733
2 1.247092 -2.674061 -0.071993 -0.734242
foo 1 -1.233825 -0.195377 -0.240303 1.168055
2 -0.108942 -0.615612 -1.299512 0.908641
qux 1 0.844421 0.251425 -0.506877 1.307800
2 0.038580 0.045072 -0.262974 0.629804
In [184]: df.index
Out[184]:
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['1', '2']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]])
In [185]: df.index.get_level_values(1)
Out[185]: Index(['1', '2', '1', '2', '1', '2', '1', '2'], dtype='object')
In [187]: df.index = df.index.set_levels(df.index.get_level_values(1) \
.map(lambda x: pd.to_numeric(x, errors='coerce')), level=1)
Result:
In [189]: df.index.get_level_values(1)
Out[189]: Int64Index([1, 2, 1, 2, 1, 2, 1, 2], dtype='int64')
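The transcript above passes the full-length level values to set_levels. On more recent pandas versions, set_levels is likely to reject repeated level values, so a variant that converts only the unique level entries may be safer. A sketch under that assumption, using a small made-up frame rather than the one in the demo:

```python
import pandas as pd

# Made-up frame whose second index level holds numeric strings.
idx = pd.MultiIndex.from_product([["bar", "baz"], ["1", "2"]])
df = pd.DataFrame({"value": [0.1, 0.2, 0.3, 0.4]}, index=idx)

# Convert only the unique entries of level 1, then swap them in.
new_level = pd.to_numeric(df.index.levels[1])
df.index = df.index.set_levels(new_level, level=1)

level_values = df.index.get_level_values(1)
print(level_values.dtype)
```

Because only the level entries are replaced, the row order and the codes of the MultiIndex are left untouched.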
Update: try this:
In [247]: d1 = pd.read_csv('https://docs.google.com/uc?id=1jUsbr5pw6sUMvewI4fmbps-s-roG4RZ7LE&export=download', index_col=[0,1])
In [248]: d2 = pd.read_csv('https://docs.google.com/uc?id=1Ufx6pvnSC6zQdTAj05ObmV027fA4-Mr3&export=download', index_col=[0,1])
In [249]: d2 = d2[pd.to_numeric(d2.index.get_level_values(1), errors='coerce').notna()]
In [250]: d2.index = d2.index.set_levels(d2.index.get_level_values(1).map(lambda x: pd.to_numeric(x, errors='coerce')), level=1)
In [251]: d1.reset_index().merge(d2.reset_index(), on=['Country','Year'], how='outer').set_index(['Country','Year'])
Out[251]:
ind500 ind356 ind475 ind476 ind456
Country Year
Afghanistan 1800 603.0 NaN NaN NaN NaN
1801 603.0 NaN NaN NaN NaN
1802 603.0 NaN NaN NaN NaN
1803 603.0 NaN NaN NaN NaN
1804 603.0 NaN NaN NaN NaN
1805 603.0 NaN NaN NaN NaN
1806 603.0 NaN NaN NaN NaN
1807 603.0 NaN NaN NaN NaN
1808 603.0 NaN NaN NaN NaN
1809 603.0 NaN NaN NaN NaN
... ... ... ... ... ...
Bahamas, The 1967 NaN NaN NaN NaN 18381.131314
Gambia, The 1967 NaN NaN NaN NaN 937.355288
Korea, Dem. Rep. 1967 NaN NaN NaN NaN 1428.689253
Lao PDR 1967 NaN NaN NaN NaN 1412.359955
Netherlands Antilles 1967 NaN NaN NaN NaN 14076.731352
Russian Federation 1967 NaN NaN NaN NaN 11794.726437
Serbia and Montenegro 1967 NaN NaN NaN NaN 2987.080489
Syrian Arab Republic 1967 NaN NaN NaN NaN 2015.913906
Yemen, Rep. 1967 NaN NaN NaN NaN 1075.693355
Bahamas, The 1968 NaN NaN NaN NaN 18712.082830
[46607 rows x 5 columns]
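The same idea can be sketched on plain columns without downloading the files: coerce Year to numeric, drop rows where it cannot be parsed, then merge on the columns. The data below is invented for illustration, including the unparseable "n/a" row:

```python
import pandas as pd

# Invented sample data; the "n/a" row stands in for any non-numeric Year entry.
d1 = pd.DataFrame(
    {"Country": ["Afghanistan", "Afghanistan"], "Year": [1800, 1801], "ind500": [603.0, 603.0]}
)
d2 = pd.DataFrame(
    {
        "Country": ["Afghanistan", "Afghanistan", "Afghanistan"],
        "Year": ["1800", "1801", "n/a"],
        "ind475": [1.0, 2.0, 3.0],
    }
)

# Non-numeric years become NaN, are dropped, and the rest are cast to int64.
d2["Year"] = pd.to_numeric(d2["Year"], errors="coerce")
d2 = d2.dropna(subset=["Year"]).astype({"Year": "int64"})

merged = d1.merge(d2, on=["Country", "Year"], how="outer").set_index(["Country", "Year"])
print(merged)
```

Converting the column before setting the index avoids having to rewrite MultiIndex levels afterwards.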
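Comments:
I did both, and the dtype of the index was changed successfully, but unfortunately the error is still there! Any other suggestions?
Yes, unfortunately the same error! I have posted links to both CSV files in the question.
Are the CSV files posted in the question sufficient?
@MohammadElNesr, I missed that you had already posted links to those CSVs. Let me try...
Strange: I repeated the steps under python 2.7 and the problem went away, but under python 3 it persists. Is there something wrong with the pandas module for python 3?
Answer 2: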
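For anyone who stumbles upon this in 2021:
The problem here is that the pandas MultiIndex is not unique in the dataset.
You can fix this by either:
selecting a unique MultiIndex, or
resetting the index and merging on the columns, e.g. pd.merge(d1.reset_index(), d2.reset_index(), on=['Country','Year'], how='outer')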
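A sketch of the first option, assuming that keeping the first occurrence of each duplicated (Country, Year) pair is acceptable for the data at hand. The rows are made up for illustration:

```python
import pandas as pd

# Made-up frame with a repeated (Country, Year) index entry.
idx = pd.MultiIndex.from_tuples(
    [("Bahamas, The", 1967), ("Bahamas, The", 1967), ("Lao PDR", 1967)],
    names=["Country", "Year"],
)
d = pd.DataFrame({"ind456": [1.0, 1.0, 2.0]}, index=idx)

# is_unique reports whether any index entry is repeated.
is_unique_before = d.index.is_unique

# Keep only the first row for each duplicated index entry.
deduped = d[~d.index.duplicated(keep="first")]
print(is_unique_before, deduped.index.is_unique)
```

Whether dropping duplicates is appropriate depends on the data; if the repeated rows carry different values, aggregating them (for example with groupby on the index levels) may be a better fit.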