pandas column selection fails with tuple column names from merged Excel columns
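Posted: 2021-05-26 06:34:58
Question:
I have a DataFrame whose column names are tuples. The reason is that the DataFrame is based on a complex Excel spreadsheet whose column headers consist of merged cells of different sizes, so when it is read as a MultiIndex, almost every column has one or more levels containing nan entries, which makes selecting and writing data difficult. My usual workaround, which I have used many times without problems, is to drop the nan levels: for example, the MultiIndex header ('SCN', nan, nan, nan, nan) is converted to the tuple ('SCN',), and the MultiIndex is then replaced with a plain Index of tuples. This time, however, I can't seem to do anything with the DataFrame because of strange KeyErrors and TypeErrors: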
test.columns
Out[25]:
Index([ ('SCN',),
('Site',),
('Subject Status',),
('Enrollment Date',),
('Enrollment Type',),
('Specimen Type',),
('Inclusion Criteria', 'Consented', 'Symptoms'),
('Inclusion Criteria', 'Consented', 'Consent'),
('Inclusion Criteria', 'Consented', 'Volume'),
('Inclusion Criteria', 'Residual', 'Sample'),
...
('PI Review', 'All Forms Complete'),
('PI Review', 'PI Signature'),
('Sunday of Enroll Week',),
('Sunday of Last Week',)],
dtype='object', length=296)
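As a rough illustration of the conversion described in the question (dropping the nan levels from each MultiIndex label and keeping the result as a plain Index of tuples), a minimal sketch might look like the following. The helper name `strip_nan_levels` and the sample labels are hypothetical; the question does not show the code that was actually used.

```python
import numpy as np
import pandas as pd


def strip_nan_levels(columns):
    """Hypothetical helper: drop NaN entries from every column label."""
    cleaned = [
        tuple(part for part in label if not pd.isna(part))
        for label in columns
    ]
    # tupleize_cols=False keeps the tuples as ordinary labels in a flat
    # Index instead of letting pandas rebuild a MultiIndex from them.
    return pd.Index(cleaned, tupleize_cols=False)


# Sample header modelled on the one in the question (shortened to 3 levels).
multi = pd.MultiIndex.from_tuples([
    ("SCN", np.nan, np.nan),
    ("Inclusion Criteria", "Consented", "Symptoms"),
])
flat = strip_nan_levels(multi)
```

Here `flat` is a single-level Index whose labels are tuples of different lengths, which matches the shape of `test.columns` shown above.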
Trying to select a column:
test[('SCN',)]
Traceback (most recent call last):
File "C:\Users\user\Anaconda3\envs\project_env\lib\site-packages\pandas\core\indexes\base.py", line 2897, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 126, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 152, in pandas._libs.index.IndexEngine._get_loc_duplicates
File "pandas/_libs/index.pyx", line 169, in pandas._libs.index.IndexEngine._maybe_get_bool_indexer
KeyError: ('SCN',)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\user\Anaconda3\envs\project_env\lib\site-packages\IPython\core\interactiveshell.py", line 3418, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-26-5cd8b06f24ce>", line 1, in <module>
test[('SCN',)]
File "C:\Users\user\Anaconda3\envs\project_env\lib\site-packages\pandas\core\frame.py", line 2980, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\Users\user\Anaconda3\envs\project_env\lib\site-packages\pandas\core\indexes\base.py", line 2899, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 126, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 152, in pandas._libs.index.IndexEngine._get_loc_duplicates
File "pandas/_libs/index.pyx", line 169, in pandas._libs.index.IndexEngine._maybe_get_bool_indexer
KeyError: ('SCN',)
More explicitly:
test[test.columns[0]]
Traceback (most recent call last):
File "C:\Users\user\Anaconda3\envs\project_env\lib\site-packages\pandas\core\indexes\base.py", line 2897, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 126, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 152, in pandas._libs.index.IndexEngine._get_loc_duplicates
File "pandas/_libs/index.pyx", line 169, in pandas._libs.index.IndexEngine._maybe_get_bool_indexer
KeyError: ('SCN',)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\user\Anaconda3\envs\project_env\lib\site-packages\IPython\core\interactiveshell.py", line 3418, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-27-d47929e28842>", line 1, in <module>
test[test.columns[0]]
File "C:\Users\user\Anaconda3\envs\project_env\lib\site-packages\pandas\core\frame.py", line 2980, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\Users\user\Anaconda3\envs\project_env\lib\site-packages\pandas\core\indexes\base.py", line 2899, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 126, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 152, in pandas._libs.index.IndexEngine._get_loc_duplicates
File "pandas/_libs/index.pyx", line 169, in pandas._libs.index.IndexEngine._maybe_get_bool_indexer
KeyError: ('SCN',)
Oddly, passing a list of tuples works, but it returns a single-column DataFrame rather than the expected Series.
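One way to get a Series regardless of how label lookup behaves is to select by position instead of by label. The sketch below uses a small made-up frame with tuple column labels (the data and the name `df` are assumptions, not the asker's spreadsheet). `DataFrame.iloc` does not look up column labels, so it sidesteps the failing lookup.

```python
import pandas as pd

# Made-up frame whose column labels are tuples in a flat Index.
cols = pd.Index([("SCN",), ("Site",)], tupleize_cols=False)
df = pd.DataFrame([[101, "A"], [102, "B"]], columns=cols)

# Positional selection: the first column, returned as a Series.
first = df.iloc[:, 0]
```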
Other columns give a different error:
test[test.columns[45]]
C:\Users\user\Anaconda3\envs\project_env\lib\site-packages\pandas\core\indexes\base.py:2897: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
return self._engine.get_loc(key)
Traceback (most recent call last):
File "C:\Users\user\Anaconda3\envs\project_env\lib\site-packages\IPython\core\interactiveshell.py", line 3418, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-28-31714132fb16>", line 1, in <module>
test[test.columns[45]]
File "C:\Users\user\Anaconda3\envs\project_env\lib\site-packages\pandas\core\frame.py", line 2980, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\Users\user\Anaconda3\envs\project_env\lib\site-packages\pandas\core\indexes\base.py", line 2897, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 126, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 152, in pandas._libs.index.IndexEngine._get_loc_duplicates
File "pandas/_libs/index.pyx", line 160, in pandas._libs.index.IndexEngine._maybe_get_bool_indexer
TypeError: Cannot convert bool to numpy.ndarray
Any ideas? Thanks!
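All of the tracebacks pass through `IndexEngine._get_loc_duplicates`, which hints that the column index may contain repeated labels. That is an inference from the traceback, not something confirmed in the thread. A quick check along these lines, shown on a made-up index with one deliberately repeated tuple, would reveal whether that is the case:

```python
import pandas as pd

# Made-up column index; ("SCN",) is repeated on purpose.
columns = pd.Index(
    [("SCN",), ("Site",), ("PI Review", "PI Signature"), ("SCN",)],
    tupleize_cols=False,
)

# True when at least one label occurs more than once.
has_duplicates = not columns.is_unique

# Every label that occurs more than once, in original order.
repeated = [
    label
    for label, is_dup in zip(columns, columns.duplicated(keep=False))
    if is_dup
]
```

If `has_duplicates` is true for the real frame, `repeated` shows which headers collapsed to the same tuple once the nan levels were removed.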
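Comments:
Can you make a smaller executable example of this? Surely this is not the minimum amount of code needed to reproduce the error. Also, have you tried test[('SCN',),:]?
Please post reproducible code.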
Answer 1:
"I have a DataFrame whose column names are tuples."
Don't. Convert the column names to simple strings first. Something like:
df.columns = [cols[0] for cols in df.columns]
I didn't know pandas supported tuples as column names. Make your life easier.
test[('SCN',)] gives KeyError: ('SCN',). Evidently pandas doesn't like tuples as column names, so don't use them.
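OK, I now see that some of the column-name tuples have length >1, e.g. ('Inclusion Criteria', 'Consented', 'Symptoms'), and that "the DataFrame is based on a complex Excel spreadsheet whose column headers consist of merged cells of different sizes". Just because Excel can export something (merged cells) doesn't mean pandas supports it. Find out empirically which patterns pandas can handle. Worst case, if you have to rename the columns by hand to something like 'Inclusion_Criteria_Consented_Symptoms', then do that.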
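The renaming suggested above does not have to be done by hand. A minimal sketch, assuming the tuple labels contain only strings, joins every level with underscores. The helper name `flatten_label` and the sample frame are hypothetical.

```python
import pandas as pd


def flatten_label(label, sep="_"):
    """Hypothetical helper: turn a tuple label into one string."""
    parts = label if isinstance(label, tuple) else (label,)
    # Join the levels and replace spaces so the name is a single token.
    return sep.join(str(part).replace(" ", sep) for part in parts)


# Made-up frame with tuple labels like those in the question.
cols = pd.Index(
    [("SCN",), ("Inclusion Criteria", "Consented", "Symptoms")],
    tupleize_cols=False,
)
df = pd.DataFrame([[101, "yes"]], columns=cols)
df.columns = [flatten_label(label) for label in df.columns]

# Plain string labels can be selected in the usual way.
scn = df["SCN"]
```

Unlike keeping only `cols[0]`, joining all levels keeps labels such as the several 'Inclusion Criteria' columns distinct from one another.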
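Discussion:
Pandas does support tuples as column names. I have done this many times in the past, and it has always worked well for this particular application with large Excel spreadsheets. I'm trying to figure out why it breaks this time. If I can't, maybe I'll use underscores or some other string-based scheme. Thanks
Ern: please post reproducible code and I'll try to give a better answer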