Pandas - ValueError: cannot reindex from a duplicate axis
【Title】Pandas - ValueError: cannot reindex from a duplicate axis 【Posted】: 2020-05-23 22:46:15 【Question】: I am working on a data pipeline in Airflow and keep running into ValueError: cannot reindex from a duplicate axis, which I have been struggling with for days.
Here is the function that fails:
def fill_missing_dates(df):
    df['TUNING_EVNT_START_DT'] = pd.to_datetime(df['TUNING_EVNT_START_DT'])
    dates = df.set_index('TUNING_EVNT_START_DT').resample('D').asfreq().index
    masdiv = df['MASDIV'].unique()
    station = df['STATION'].unique()
    idx = pd.MultiIndex.from_product((dates, masdiv, station), names=['TUNING_EVNT_START_DT', 'MASDIV', 'STATION'])
    df = df.set_index(['TUNING_EVNT_START_DT', 'MASDIV', 'STATION']).reindex(idx, fill_value=0).reset_index()
    return df
Here is the error output from the AWS CloudWatch logs:
16:31:40
    dates = df.set_index('TUNING_EVNT_START_DT').resample('D').asfreq().index
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/resample.py", line 821, in asfreq
    return self._upsample("asfreq", fill_value=fill_value)
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/resample.py", line 1125, in _upsample
    res_index, method=method, limit=limit, fill_value=fill_value
  File "/usr/local/lib64/python3.7/site-packages/pandas/util/_decorators.py", line 221, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/frame.py", line 3976, in reindex
    return super().reindex(**kwargs)
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/generic.py", line 4514, in reindex
    axes, level, limit, tolerance, method, fill_value, copy
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/frame.py", line 3864, in _reindex_axes
    index, method, copy, level, fill_value, limit, tolerance
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/frame.py", line 3886, in _reindex_index
    allow_dups=False,
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/generic.py", line 4577, in _reindex_with_indexers
    copy=copy,
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/internals/managers.py", line 1251, in reindex_indexer
    self.axes[axis]._can_reindex(indexer)
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/indexes/base.py", line 3362, in _can_reindex
    raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/tmp/scripts/anomaly_detection_model.py", line 275, in <module>
    runner(path_prefix, model_name, execution_id, table)
  File "/tmp/scripts/anomaly_detection_model.py", line 230, in runner
    df = multiprocessing(PROCESSORS, df)
  File "/tmp/scripts/anomaly_detection_model.py", line 121, in multiprocessing
    x = pool.map(iforest, (df.loc[df['MASDIV'] == masdiv] for masdiv in args))
  File "/usr/lib64/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib64/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
ValueError: cannot reindex from a duplicate axis
I added some logging to inspect the DataFrame at that step, but I don't see where the problem is:
20/02/07 18:40:34 - INFO - __main__ - Where it breaks: df.index(): RangeIndex(start=0, stop=93, step=1)
20/02/07 18:40:34 - INFO - __main__ - Where it breaks: df.columns: Index(['TUNING_EVNT_START_DT', 'MASDIV', 'STATION', 'DOW', 'MOY',
       'TRANSACTIONS', 'DOW_INT', 'MOY_INT', 'DT_NBR'],
      dtype='object')
I have tried everything in these posts, to no avail:
Pandas error: cannot reindex from a duplicate axis
What does `ValueError: cannot reindex from a duplicate axis` mean?
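I'm also not entirely sure I understand why this happens. Any suggestions are much appreciated.
【Comments】:
【Answer 1】: Without sample data I can't reproduce your error. However, judging by the function name, fill_missing_dates, I think this alternative might achieve what you are trying to do.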
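One plausible cause, sketched below on made-up data: the logged RangeIndex is unique, but `set_index('TUNING_EVNT_START_DT')` makes the date column the index, and that column repeats whenever several rows share a day. `resample('D').asfreq()` then reindexes onto the daily range, which pandas refuses to do from an index with duplicate labels. The three-row frame here is hypothetical and only stands in for the real data; the exact wording of the error message varies between pandas versions.

```python
import pandas as pd

# Hypothetical data: two rows share 2020-01-01, as happens when several
# MASDIV/STATION combinations are recorded on the same day.
df = pd.DataFrame({
    "TUNING_EVNT_START_DT": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-04"]),
    "MASDIV": ["m1", "m2", "m1"],
    "TRANSACTIONS": [1, 2, 3],
})

# The default RangeIndex is unique, so logging df.index hides the problem.
print(df.index.is_unique)

# The date column is not unique, and set_index() turns it into the index.
has_duplicate_dates = df["TUNING_EVNT_START_DT"].duplicated().any()
print(has_duplicate_dates)

# asfreq() has to reindex onto the daily range, which fails on duplicate labels.
try:
    df.set_index("TUNING_EVNT_START_DT").resample("D").asfreq()
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Checking `df['TUNING_EVNT_START_DT'].duplicated().any()` on the real frame, just before the failing line, would confirm or rule out this explanation.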
import pandas as pd

df = pd.DataFrame({
    'date': ["2020-01-01 00:01:00", "2020-01-01 00:02:00", "2020-01-01 01:00:00", "2020-01-01 02:00:00",
             "2020-01-01 00:04:00", "2020-01-01 00:05:00",
             "2020-01-03 00:01:00", "2020-01-03 00:02:00", "2020-01-03 01:00:00", "2020-01-03 02:00:00",
             "2020-01-03 00:04:00", "2020-01-03 00:05:00",
             ],
    'station': ["a", "a", "a", "a", "b", "b", "a", "a", "a", "a", "b", "b"],
    'data': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
})

def resampler(x):
    # Sum one station's rows per calendar day; days without rows sum to 0.
    return x.set_index('date').resample('D').sum()

df['date'] = pd.to_datetime(df['date'])
multipass = pd.MultiIndex.from_frame(df[["date", "station"]])
df = df.set_index(["date", "station"])
df = df.reindex(multipass)
# Move 'date' back to a column, group by station, and resample each group.
df.reset_index(level=0).groupby(level=0).apply(resampler)
The result fills the missing dates with 0:
                    data
station date
a       2020-01-01    10
        2020-01-02     0
        2020-01-03    34
b       2020-01-01    11
        2020-01-02     0
        2020-01-03    23
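If the goal is to keep the shape of the original fill_missing_dates, a minimal sketch along the following lines might also work. It rests on assumptions the question does not confirm: that timestamps can be truncated to whole days, that duplicate (date, MASDIV, STATION) rows should be summed, and that TRANSACTIONS is the only column that needs to survive (the other columns such as DOW and MOY are dropped here). The `value_col` parameter and the sample frame are hypothetical.

```python
import pandas as pd

KEYS = ["TUNING_EVNT_START_DT", "MASDIV", "STATION"]

def fill_missing_dates(df, value_col="TRANSACTIONS"):
    df = df.copy()
    # Truncate timestamps to midnight so each row maps to one calendar day.
    df["TUNING_EVNT_START_DT"] = pd.to_datetime(df["TUNING_EVNT_START_DT"]).dt.normalize()
    # Collapse duplicate key rows so the index is unique (summing is an assumption).
    daily = df.groupby(KEYS)[value_col].sum()
    # Build the daily range directly instead of resampling a duplicated index.
    dates = pd.date_range(
        df["TUNING_EVNT_START_DT"].min(), df["TUNING_EVNT_START_DT"].max(), freq="D"
    )
    idx = pd.MultiIndex.from_product(
        (dates, df["MASDIV"].unique(), df["STATION"].unique()), names=KEYS
    )
    return daily.reindex(idx, fill_value=0).reset_index()

# Hypothetical sample: two rows share the same date, MASDIV and STATION.
sample = pd.DataFrame({
    "TUNING_EVNT_START_DT": ["2020-01-01", "2020-01-01", "2020-01-03"],
    "MASDIV": ["m1", "m1", "m1"],
    "STATION": ["a", "a", "b"],
    "TRANSACTIONS": [1, 2, 5],
})
filled = fill_missing_dates(sample)
print(filled)
```

Using `pd.date_range` avoids the `resample('D').asfreq()` call altogether, and the `groupby` guarantees a unique MultiIndex before `reindex` runs.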
【Comments】: