How to plot the index column in pandas/matplotlib?

Posted 2023-03-12


[Asked]: 2020-07-14 20:57:06 [Question]:

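The first column of my DataFrame is a meaningful index, and I want to plot that column as my x-axis. However, I have been struggling to do so because I keep getting this error: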

"None of [Float64Index([1992.9595, 1992.9866, 1993.0138, 1993.0409, 1993.0681, 1993.0952,\n 1993.1223, 1993.1495, 1993.1766, 1993.2038,\n ...\n 2002.7328, 2002.7599, 2002.7871, 2002.8142, 2002.8414, 2002.8685,\n 2002.8957, 2002.9228, 2002.95, 2002.9771],\n dtype='float64', name='Time', length=340)] are in the [columns]"

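I have tried using x=df_topex.index, as suggested in another forum question (linked below), but that does not seem to work for me. Could someone explain why, and how I can produce the plot?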

import pandas as pd
import matplotlib.pyplot as plt

df_topex = pd.read_csv('datasets/TOPEX.dat',
                       sep=r'\s+',  # one or more whitespace characters as separator
                       index_col=0,  # use the first column as the index
                       names=["Time", "Anomaly"],  # column names
                      )

# This is the call that raises the error above
df_topex.plot(kind='scatter', x=df_topex.index, y='Anomaly', color='red')
plt.show()

The other question: Use index in pandas to plot data


[Answer 1]:

I have revised my answer based on your feedback so that it reproduces the problem more accurately.

With this:

df_topex = pd.read_csv('datasets/TOPEX.dat',
                       sep=r'\s+',  # one or more whitespace characters as separator
                       index_col=0,  # use the first column as the index
                       names=["Time", "Anomaly"],  # column names
                      )

you get something like this, where the "Time" column is the index:

    Time    Anomaly
---------  ---------
1992.9595     2.0000
1992.9866     3.0000
1993.0138     4.0000
1993.0409     5.0000
1993.0681     6.0000
1993.0952     7.0000

To plot it, we can do the following, as you said. Just FYI, there is an open issue with this approach (https://github.com/pandas-dev/pandas/issues/16529), although it is not a big deal for now:

df_topex.reset_index(inplace=True)
tabulate_df(df_topex)  # not a pandas function; presumably the answerer's own helper for printing the table

This is probably safer:

df_topex = df_topex.reset_index()
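
The difference between the two calls can be sketched on a small in-memory frame. The three rows below are a hypothetical stand-in for the TOPEX file, and the names `df`, `flat` and `result` are only for illustration. The point is that `reset_index()` returns a new frame, while `reset_index(inplace=True)` modifies the frame and returns None, so nothing can be chained onto it.

```python
import pandas as pd

# Hypothetical stand-in for the TOPEX data: "Time" is the index.
df = pd.DataFrame(
    {"Anomaly": [2.0, 3.0, 4.0]},
    index=pd.Index([1992.9595, 1992.9866, 1993.0138], name="Time"),
)

# Returns a new frame in which "Time" is an ordinary column;
# df itself still has "Time" as its index.
flat = df.reset_index()
print(list(flat.columns))
print(df.index.name)

# With inplace=True the call mutates df and returns None,
# so a chained .plot(...) would fail on the None.
result = df.reset_index(inplace=True)
print(result is None)
print(list(df.columns))
```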

Either way, "Time" is now available as a column for plotting (note that "Time" does not look like a datetime format to me):

            Time    Anomaly
------  ---------  ---------
     0  1992.9595     2.0000
     1  1992.9866     3.0000
     2  1993.0138     4.0000
     3  1993.0409     5.0000
     4  1993.0681     6.0000
     5  1993.0952     7.0000

Plot it:

df_topex.plot(kind='scatter', x='Time', y='Anomaly', color='red')
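
As a self-contained sketch of the above, the snippet below uses a few hypothetical rows instead of datasets/TOPEX.dat and a headless matplotlib backend so it runs without a display. It also hints at why the original call fails: `x=` expects a column label, and the message in the question is the one pandas raises when a list of labels is looked up among the columns. That this lookup is what the plotting code does internally is an assumption here; the lookup itself is shown directly.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, an assumption so the sketch runs anywhere
import pandas as pd

# Hypothetical sample rows standing in for the TOPEX file.
df_topex = pd.DataFrame(
    {"Anomaly": [2.0, 3.0, 4.0, 5.0]},
    index=pd.Index([1992.9595, 1992.9866, 1993.0138, 1993.0409], name="Time"),
)

# Using the index values as column labels fails: none of them is a column.
try:
    df_topex[df_topex.index]
    raised = False
except KeyError:
    raised = True
print("KeyError raised:", raised)

# Turning the index into a column makes "Time" a valid label for x.
ax = df_topex.reset_index().plot(kind="scatter", x="Time", y="Anomaly", color="red")
```

The original `df_topex` is left unchanged here, because `reset_index()` without `inplace=True` works on a copy.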

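So let's think about your last question: hmm... we have the plot, but now we can no longer take advantage of having "Time" as the index, can we?

An index has a significant performance impact when filtering millions of rows. Maybe you want "Time" as the index because you already have, or foresee, large volumes of data. Plotting millions of points is possible (e.g. with datashading), but it is not very common. Filtering a DataFrame before plotting is very common, and at that point an index on the column being filtered really helps; plotting usually comes afterwards.

So we can use different DataFrames at different stages, or do the following right after the CSV import, which keeps the index available for filtering and lets us plot against a Time2 column at any time: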

df_topex['Time2'] = df_topex.index

So we keep "Time" as the index:

    Time    Anomaly      Time2
---------  ---------  ---------
1992.9595     2.0000  1992.9595
1992.9866     3.0000  1992.9866
1993.0138     4.0000  1993.0138
1993.0409     5.0000  1993.0409
1993.0681     6.0000  1993.0681
1993.0952     7.0000  1993.0952
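
A minimal sketch of that workflow, again on hypothetical sample rows rather than the real file: filter through the index first, then plot against the Time2 copy. The bounds 1993.0 and 1993.08 are arbitrary values chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, an assumption so the sketch runs anywhere
import pandas as pd

# Hypothetical sample rows standing in for the TOPEX file.
df_topex = pd.DataFrame(
    {"Anomaly": [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]},
    index=pd.Index(
        [1992.9595, 1992.9866, 1993.0138, 1993.0409, 1993.0681, 1993.0952],
        name="Time",
    ),
)

# Keep "Time" as the index and add a plain column copy for plotting.
df_topex["Time2"] = df_topex.index

# Filter on the index (label-based slice), then plot the filtered rows.
subset = df_topex.loc[1993.0:1993.08]
ax = subset.plot(kind="scatter", x="Time2", y="Anomaly", color="red")
print(len(subset))
```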

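How do we take advantage of the index? There is a good post that measures the performance of filtering on an index: What is the performance impact of non-unique indexes in pandas?

In short, you want the index to be unique, or at least sorted.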

# Index properties, in order of preference for filtering performance:
# 1) unique
# 2) if not unique, at least sorted (monotonically increasing or decreasing)
# 3) worst combination: non-unique and unsorted.

# Let's check:
print ("Is unique?", df_topex.index.is_unique)
print ("Is is_monotonic increasing?", df_topex.index.is_monotonic_increasing)
print ("Is is_monotonic decreasing?", df_topex.index.is_monotonic_decreasing)

With the sample data:

Is unique? True
Is is_monotonic increasing? True
Is is_monotonic decreasing? False

If the index is not sorted, you can sort it like this:

df_topex = df_topex.sort_index()
# Ready to go on filtering...
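
Sorting matters for more than speed. As a rough sketch on hypothetical, deliberately shuffled rows: a label slice whose bounds are not present in the index raises a KeyError on an unsorted index, but works once the index is sorted. The names `shuffled`, `ordered` and `window` are only for illustration.

```python
import pandas as pd

# Hypothetical rows in shuffled order, so the index is not monotonic.
shuffled = pd.DataFrame(
    {"Anomaly": [4.0, 2.0, 3.0]},
    index=pd.Index([1993.0138, 1992.9595, 1992.9866], name="Time"),
)
print("sorted before:", shuffled.index.is_monotonic_increasing)

# Slicing with bounds that are not index labels fails on an unsorted index.
try:
    shuffled.loc[1992.97:1993.02]
    slice_failed = False
except KeyError:
    slice_failed = True
print("slice failed:", slice_failed)

# After sorting, the same slice selects the rows between the two bounds.
ordered = shuffled.sort_index()
window = ordered.loc[1992.97:1993.02]
print("sorted after:", ordered.index.is_monotonic_increasing)
print(window)
```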

Hope this helps.

[Comments]:

Thanks. However, df_topex.reset_index().plot(kind='scatter', x='index', y='Anomaly', color='red') did not actually work. df_topex.reset_index(inplace=True).plot(kind='scatter', x='Time', y='Anomaly', color='red') did. Would you recommend not setting a meaningful index when one wants to plot it?
