Pandas DataFrame Assignment Bug using Dictionaries of Strings and Floats?

Posted 2023-03-11

【Title】: Pandas DataFrame Assignment Bug using Dictionaries of Strings and Floats?
【Asked】: 2021-08-10 01:03:36
【Question】:

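Issue

Pandas seems to support assigning a dictionary to a row via df.loc, as follows:

import pandas as pd
df = pd.DataFrame(columns=['a', 'b', 'c'])
entry = {'a': 'test', 'b': 1, 'c': float(2)}
df.loc[0] = entry

As expected, pandas inserts the dictionary values into the corresponding columns based on the dictionary keys. Printing the frame gives: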

      a  b    c
0  test  1  2.0

However, if you overwrite the same entry, pandas assigns the dictionary keys instead of the dictionary values. Printing the frame gives:

   a  b  c
0  a  b  c

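Question

Why does this happen?

Specifically, why does it happen only on the second assignment? All subsequent assignments revert to the original result, containing (almost) the expected values: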

      a  b  c
0  test  1  2

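I say almost because the dtype of c is actually object rather than float for all subsequent results.

I've determined that this happens whenever a string and a float are involved. With just a string and an integer, or an integer and a float, you won't see this behavior.

Example code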

import pandas as pd

df = pd.DataFrame(columns=['a', 'b', 'c'])
print(f'empty df:\n{df}\n\n')

entry = {'a': 'test', 'b': 1, 'c': float(2)}
print(f'dictionary to be entered:\n{entry}\n\n')

# First assignment: row 0 does not exist yet, so the frame is enlarged.
df.loc[0] = entry
print(f'df after entry:\n{df}\n\n')

# Repeated assignments overwrite the existing row 0.
df.loc[0] = entry
print(f'df after second entry:\n{df}\n\n')

df.loc[0] = entry
print(f'df after third entry:\n{df}\n\n')

df.loc[0] = entry
print(f'df after fourth entry:\n{df}\n\n')

This gives the following output:

empty df:
Empty DataFrame
Columns: [a, b, c]
Index: []


dictionary to be entered:
{'a': 'test', 'b': 1, 'c': 2.0}


df after entry:
      a  b    c
0  test  1  2.0


df after second entry:
   a  b  c
0  a  b  c


df after third entry:
      a  b  c
0  test  1  2


df after fourth entry:
      a  b  c
0  test  1  2

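【Comments】:

Interesting find. On pandas version 1.2.4, all subsequent dataframes have the values a b c, not just the second one.

@aneroid Even if you switch to pd.Series()?

@rudolfovic Wrapping it in a Series fixes the problem. But I'm not interested in workarounds. The expected behavior is not the observed behavior.

df.loc[0] = entry.values() also works, but again that is a workaround.

It seems to work correctly only when assigning to a new row.

The docs don't say anywhere that a dict can be passed. I think this should be raised as an issue at github.com/pandas-dev/pandas

【Answer 1】:

The behavior on 1.2.4 is as follows: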

empty df:
Empty DataFrame
Columns: [a, b, c]
Index: []


dictionary to be entered:
{'a': 'test', 'b': 1, 'c': 2.3}


df after entry:
      a  b    c
0  test  1  2.3


df after second entry:
   a  b  c
0  a  b  c


df after third entry:
   a  b  c
0  a  b  c


df after fourth entry:
   a  b  c
0  a  b  c

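The first time df.loc[0] = entry runs, it goes through the _setitem_with_indexer_missing function, because index 0 does not yet exist on the axis.

That function runs this branch: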

elif isinstance(value, dict):
    value = Series(
        value, index=self.obj.columns, name=indexer, dtype=object
    )

This turns the dict into a Series, and it behaves as expected.

On later runs, however, the index is no longer missing (index 0 exists), so _setitem_with_indexer_split_path runs instead:

elif len(ilocs) == len(value):
    # We are setting multiple columns in a single row.
    for loc, v in zip(ilocs, value):
        self._setitem_single_column(loc, v, pi)

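This simply zips the column positions with whatever iterating over the dict yields, and iterating over a dict yields its keys.

In this case, that is roughly equivalent to:

entry = {'a': 'test', 'b': 1, 'c': float(2.3)}
print(list(zip([0, 1, 2], entry)))
# [(0, 'a'), (1, 'b'), (2, 'c')]

That is why the values written to the frame are the keys.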
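
To make the contrast between the two code paths concrete, here is a toy model in plain Python. It is only a sketch: set_missing_row and set_existing_row are hypothetical names invented for illustration, not pandas functions, and they mimic only the "look up by label" versus "zip with iteration order" difference described above.

```python
columns = ["a", "b", "c"]
entry = {"a": "test", "b": 1, "c": 2.3}


def set_missing_row(columns, value):
    """Toy stand-in for the new-row path: values are looked up by column label."""
    return {col: value.get(col) for col in columns}


def set_existing_row(columns, value):
    """Toy stand-in for the existing-row path: columns are zipped with iteration order."""
    return {col: item for col, item in zip(columns, value)}


new_row = set_missing_row(columns, entry)
overwritten_row = set_existing_row(columns, entry)

print(new_row)          # {'a': 'test', 'b': 1, 'c': 2.3}
print(overwritten_row)  # {'a': 'a', 'b': 'b', 'c': 'c'}
```

The second result reproduces the a b c row from the output above, because iterating a dict never touches its values.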

So the problem is not as specific as it might seem:

import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['a', 'b', 'c'])
print(f'initial df:\n{df}\n\n')

entry = {'a': 'test', 'b': 1, 'c': float(2.3)}
print(f'dictionary to be entered:\n{entry}\n\n')

# Row 0 already exists, so this overwrites it.
df.loc[0] = entry
print(f'df after entry:\n{df}\n\n')

initial df:
   a  b  c
0  1  2  3

dictionary to be entered:
{'a': 'test', 'b': 1, 'c': 2.3}

df after entry:
   a  b  c
0  a  b  c

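If the index label already exists, the value is not converted to a Series: it is simply zipped, as an iterable, with the column locs. For a dictionary, that means the keys are the values that end up in the frame.

This is also likely why only iterables whose iterators return their values are acceptable right-hand arguments to loc assignment.

I also agree with @DeepSpace that this should be raised as a bug.

The behavior on 1.1.5 is as follows:

The initial assignment is unchanged from 1.2.4. The dtypes, however, are worth noting here: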

import pandas as pd

df = pd.DataFrame({0: [1, 2, 3]}, columns=['a', 'b', 'c'])

entry = {'a': 'test', 'b': 1, 'c': float(2.3)}

# First Entry
df.loc[0] = entry
print(df.dtypes)
# a     object
# b     object
# c    float64
# dtype: object

# Second Entry
df.loc[0] = entry
print(df.dtypes)
# a    object
# b    object
# c    object
# dtype: object

# Third Entry
df.loc[0] = entry
print(df.dtypes)
# a    object
# b    object
# c    object
# dtype: object

# Fourth Entry
df.loc[0] = entry
print(df.dtypes)
# a    object
# b    object
# c    object
# dtype: object

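They are notable because when

take_split_path = self.obj._is_mixed_type

is true, pandas performs the same zip operation as in 1.2.4.

In 1.1.5, however, take_split_path is true only right after the first assignment, because c is float64 while a and b are object, so the frame is mixed-type. Once the second assignment has turned every column into object, take_split_path is false and subsequent assignments use: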

if isinstance(value, (ABCSeries, dict)):
    # TODO(EA): ExtensionBlock.setitem this causes issues with
    # setting for extensionarrays that store dicts. Need to decide
    # if it's worth supporting that.
    value = self._align_series(indexer, Series(value))

This naturally aligns the dict correctly.
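
As the comments note, wrapping the dict in a Series yourself sidesteps the problem, because a Series is aligned on its labels rather than iterated. The following is a minimal sketch of that workaround, not a fix for the underlying behavior. It assumes an all-object frame with an existing row 0 (the placeholder values "old", 0 and 0.0 are made up for the example), and the dict is deliberately written out of column order to show that labels, not positions, decide where each value lands.

```python
import pandas as pd

# An object-dtype frame that already has row 0, so the assignment overwrites it.
df = pd.DataFrame([["old", 0, 0.0]], columns=["a", "b", "c"], dtype=object)

# Keys deliberately listed in a different order than the columns.
entry = {"c": 2.3, "a": "test", "b": 1}

# Wrapping the dict in a Series makes pandas align on the labels.
df.loc[0] = pd.Series(entry)

row = df.loc[0].to_dict()
print(row)
```

After the assignment, row maps each column name to the value stored under the same key in entry.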

【Answer 2】:

Interesting find. On pandas version 1.2.4, all subsequent dataframes have the values a b c, not just the second one.

empty df:
Empty DataFrame
Columns: [a, b, c]
Index: []

dictionary to be entered:
{'a': 'test', 'b': 1, 'c': 2.3}

df after entry:
      a  b    c
0  test  1  2.3

df after second entry:
   a  b  c
0  a  b  c

df after third entry:
   a  b  c
0  a  b  c

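Incidentally, it seems to work correctly only when assigning to a new row, so only in that case does it associate the keys with the columns. For every later reassignment to an existing row, 1.2.4 shows the unexpected behavior observed above.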

df.loc[1] = entry
print(f'df after assigning to a new row:\n{df}\n\n')
# output:
df after assigning to a new row:
      a  b    c
0     a  b    c
1  test  1  2.3

df.loc[1] = entry
print(f'df after repeating:\n{df}\n')
# output:
df after repeating:
   a  b  c
0  a  b  c
1  a  b  c

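The likely reason this happens for existing rows (apart from it being a bug) is that pandas iterates over the collection, and for a dictionary that means its keys. From the docs section "Setting with enlargement":

The .loc/[] operations can perform enlargement when setting a non-existent key for that axis.

In the Series case this is effectively an appending operation.

So for a new row it "enlarges" with the input, but for an existing row it iterates over the input (the dictionary's keys, not its values).

For a list, it works fine.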

df.loc[2] = list(entry.values())
print(f'df when assigning from a list\n{df}\n')
# output
df when assigning from a list
      a  b    c
0     a  b    c
1     a  b    c
2  test  1  2.3


df.loc[2] = list(entry.values())
print(f'df when assigning from a list 2nd time\n{df}\n')
# output
df when assigning from a list 2nd time
      a  b    c
0     a  b    c
1     a  b    c
2  test  1  2.3
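
One caveat with the list workaround: list(entry.values()) follows the dict's insertion order, so it only lines up with the frame when the keys happen to be written in column order. A small plain-Python sketch of a safer variant follows; the names columns, by_insertion_order and by_column_order are illustrative only, and columns stands in for list(df.columns).

```python
columns = ["a", "b", "c"]  # stand-in for list(df.columns)
entry = {"c": 2.3, "a": "test", "b": 1}  # insertion order differs from the columns

# Follows the dict's insertion order, which does not match the columns here.
by_insertion_order = list(entry.values())

# Looks each value up by column name, so the order always matches the columns.
by_column_order = [entry[col] for col in columns]

print(by_insertion_order)  # [2.3, 'test', 1]
print(by_column_order)     # ['test', 1, 2.3]
```

Building the list from the column names keeps the positional assignment correct regardless of how the dict was constructed.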

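(That is the "why" as far as the docs go. I think the actual technical reason would only become clear after a careful read of the source code.)

IMHO, it should either work for all assignments and reassignments or not be allowed at all. I agree this should be raised as a bug, as @DeepSpace mentions.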
