Reading CSV dates with pandas returns datetime instead of Timestamp

Posted 2023-03-11

【Asked】: 2021-06-30 15:09:42

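Question:

I am trying to read historical stock prices from a CSV file into a pandas DataFrame, and I have noticed something interesting: depending on which rows are read, the type of the values in the date column changes from pandas.Timestamp to datetime. How does this work? And how can I get pandas.Timestamp consistently?

Minimal reproducible example:

I have checked my CSV file; here is the smallest data sample that reproduces the problem.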

import pandas as pd
file = open('temp.csv', 'w')
file.write(
    """Local time,Open,High,Low,Close,Volume
28.02.2014 02:00:00.000 GMT+0200,1.37067,1.38250,1.36943,1.38042,176839.0313
01.04.2014 03:00:00.000 GMT+0300,1.37742,1.38156,1.37694,1.37937,95386.0703""")
file.close()

data = pd.read_csv('temp.csv', parse_dates = ["Local time"])
print(type(data['Local time'][0]))

Result: <class 'datetime.datetime'>

Compare with:

import pandas as pd
file = open('temp.csv', 'w')
file.write(
    """Local time,Open,High,Low,Close,Volume
28.02.2014 02:00:00.000 GMT+0200,1.37067,1.38250,1.36943,1.38042,176839.0313""")
file.close()

data = pd.read_csv('temp.csv', parse_dates = ["Local time"])
print(type(data['Local time'][0]))

file = open('temp.csv', 'w')
file.write(
    """Local time,Open,High,Low,Close,Volume
01.04.2014 03:00:00.000 GMT+0300,1.37742,1.38156,1.37694,1.37937,95386.0703""")
file.close()

data = pd.read_csv('temp.csv', parse_dates = ["Local time"])
print(type(data['Local time'][0]))

file = open('temp.csv', 'w')
file.write(
    """Local time,Open,High,Low,Close,Volume
02.03.2014 02:00:00.000 GMT+0200,1.37620,1.37882,1.37586,1.37745,5616.04
03.03.2014 02:00:00.000 GMT+0200,1.37745,1.37928,1.37264,1.37357,136554.6563
04.03.2014 02:00:00.000 GMT+0200,1.37356,1.37820,1.37211,1.37421,124863.8203""")
file.close()

data = pd.read_csv('temp.csv', parse_dates = ["Local time"])
print(type(data['Local time'][0]))

Result: <class 'pandas._libs.tslibs.timestamps.Timestamp'>

Versions:

pandas==1.2.3 pandas-datareader==0.9.0

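Summary:

I need pandas.Timestamp rather than datetime because of some later data manipulation, and I cannot figure out what is going wrong here. I hope you can help.

I have also opened a GitHub issue, but it has not been triaged yet.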

【Answer 1】:

You can specify which date_parser function to use:

data = pd.read_csv('temp.csv', 
                   parse_dates = ["Local time"],
                   date_parser=pd.Timestamp)

Output:

>>> data
                  Local time     Open     High      Low    Close       Volume
0  2014-02-03 02:00:00-02:00  1.37620  1.37882  1.37586  1.37745    5616.0400
1  2014-03-03 02:00:00-03:00  1.37745  1.37928  1.37264  1.37357  136554.6563
2  2014-04-03 02:00:00-02:00  1.37356  1.37820  1.37211  1.37421  124863.8203

>>> type(data['Local time'][0])
<class 'pandas._libs.tslibs.timestamps.Timestamp'>

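From what I observed, when the time zones of the individual observations differ, pandas automatically parses each entry as a datetime.

If you really need pd.Timestamp, the approach above should work.

Running the read_csv call above also gives me a FutureWarning, though; I looked into it and found it to be harmless.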
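If the column has already been read and holds plain datetime objects, it can also be converted afterwards, one element at a time. The following is only a minimal sketch: the Series is built by hand (a hypothetical stand-in for the column read from the CSV, not taken from the question's file), and it assumes object dtype with timezone-aware datetime.datetime values.

```python
from datetime import datetime, timedelta, timezone

import pandas as pd

# Hypothetical stand-in for the parsed column: object dtype holding
# timezone-aware datetime.datetime values with different UTC offsets.
mixed = pd.Series(
    [
        datetime(2014, 2, 28, 2, 0, tzinfo=timezone(timedelta(hours=2))),
        datetime(2014, 4, 1, 3, 0, tzinfo=timezone(timedelta(hours=3))),
    ],
    dtype=object,
    name="Local time",
)

# Convert element by element; every Timestamp keeps its own UTC offset,
# so the column still cannot use a single datetime64 time zone.
as_timestamps = mixed.map(pd.Timestamp)

print(type(as_timestamps[0]))
print(as_timestamps[1].utcoffset())
```

Each entry is then a pd.Timestamp, but vectorised datetime operations that need one common time zone for the whole column are still unavailable.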

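Edit

After some research:

pandas tries to convert date-like columns to a DatetimeIndex to make datetime-based operations more efficient. For this, however, pandas needs one common time zone for the whole column.

When explicitly trying to convert to pd.DatetimeIndex: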

>>> data
                  Local time     Open     High      Low    Close       Volume
0  2014-02-03 02:00:00-02:00  1.37620  1.37882  1.37586  1.37745    5616.0400
1  2014-03-03 02:00:00-03:00  1.37745  1.37928  1.37264  1.37357  136554.6563
2  2014-04-03 02:00:00-04:00  1.37356  1.37820  1.37211  1.37421  124863.8203

>>> pd.DatetimeIndex(data['Local time'])

ValueError: Array must be all same time zone

During handling of the above exception, another exception occurred:

ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True

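So when the conversion to DatetimeIndex fails, pandas falls back to an object-dtype column internally (the dtype otherwise used for strings) and handles the individual entries as datetime objects.

The documentation suggests specifying utc=True if the time zones in the data differ, so that the time zone is set to UTC and the time values are adjusted accordingly.

From the documentation:

pandas cannot natively represent a column or index with mixed timezones. If your CSV file contains columns with a mixture of timezones, the default result will be an object-dtype column with strings, even with parse_dates.

To parse the mixed-timezone values as a datetime column, pass a partially-applied to_datetime() with utc=True as the date_parser.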
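A minimal sketch of the utc=True idea follows. Instead of passing a partially applied to_datetime() as date_parser (which newer pandas releases deprecate), it reads the column as text and converts it afterwards. The explicit format string is my own assumption about the layout of the "Local time" column, and it reads "GMT+0200" as two hours ahead of UTC, which may not match how the default parser interprets that suffix.

```python
import io

import pandas as pd

CSV_TEXT = """Local time,Open,High,Low,Close,Volume
28.02.2014 02:00:00.000 GMT+0200,1.37067,1.38250,1.36943,1.38042,176839.0313
01.04.2014 03:00:00.000 GMT+0300,1.37742,1.38156,1.37694,1.37937,95386.0703"""

# Read the date column as plain text first (no parse_dates).
data = pd.read_csv(io.StringIO(CSV_TEXT))

# Assumed layout: day.month.year, time with fractional seconds,
# then the literal text "GMT" followed by a +HHMM offset.
LOCAL_TIME_FORMAT = "%d.%m.%Y %H:%M:%S.%f GMT%z"

# utc=True converts every value to UTC, so the whole column shares one
# time zone and can be stored as a timezone-aware datetime64 column.
data["Local time"] = pd.to_datetime(
    data["Local time"], format=LOCAL_TIME_FORMAT, utc=True
)

first = data["Local time"][0]
print(type(first))
print(data["Local time"].dt.tz)
```

Under these assumptions both rows end up at midnight UTC of their respective dates, and the individual values are pd.Timestamp objects. The original local offsets are lost, so keep the raw text in a separate column if they matter.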

For data that already shares a single time zone, DatetimeIndex works seamlessly:

>>> data
                 Local time     Open     High      Low    Close       Volume
0 2014-02-03 02:00:00-02:00  1.37620  1.37882  1.37586  1.37745    5616.0400
1 2014-03-03 02:00:00-02:00  1.37745  1.37928  1.37264  1.37357  136554.6563
2 2014-04-03 02:00:00-02:00  1.37356  1.37820  1.37211  1.37421  124863.8203


>>> pd.DatetimeIndex(data['Local time'])

DatetimeIndex(['2014-02-03 02:00:00-02:00', '2014-03-03 02:00:00-02:00',
               '2014-04-03 02:00:00-02:00'],
              dtype='datetime64[ns, pytz.FixedOffset(-120)]', name='Local time', freq=None)

>>> type(pd.DatetimeIndex(data['Local time'])[0])

<class 'pandas._libs.tslibs.timestamps.Timestamp'>
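To check which case a column falls into, one option is to look at its dtype. The sketch below uses hand-written sample strings that all carry the same +02:00 offset (hypothetical data in the question's layout) and an explicit format string, which is my own assumption about that layout.

```python
import datetime as dt

import pandas as pd

# Hypothetical sample: every row carries the same +02:00 offset.
raw = pd.Series(
    [
        "02.03.2014 02:00:00.000 GMT+0200",
        "03.03.2014 02:00:00.000 GMT+0200",
        "04.03.2014 02:00:00.000 GMT+0200",
    ],
    name="Local time",
)

# Assumed layout of the strings; %z consumes the +HHMM part after "GMT".
parsed = pd.to_datetime(raw, format="%d.%m.%Y %H:%M:%S.%f GMT%z")

# A timezone-aware datetime64 column reports a DatetimeTZDtype;
# an object-dtype fallback column would not.
has_tz_dtype = isinstance(parsed.dtype, pd.DatetimeTZDtype)
print(has_tz_dtype)
print(type(parsed[0]))
```

When the check is true, element access yields pd.Timestamp values without any extra conversion.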

References:

https://pandas.pydata.org/docs/user_guide/io.html#io-csv-mixed-timezones
https://pandas.pydata.org/docs/reference/api/pandas.DatetimeIndex.html
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#parse_dates

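【Comments】:

Thanks, it worked. However, I do not understand why the default parsed type for values with different time zones is datetime. As far as I can see, the explicit date_parser works fine, so pandas.read_csv(...) is a bit inconsistent, isn't it?

I don't think it can be inconsistent unless they use some kind of randomizer :). Their documentation has more on what the default date_parser does.

By inconsistent I mean the different resulting types, datetime versus Timestamp, for values with the same data format, depending on the value of the time zone. Thanks, I will read the documentation carefully.

@HlibPylypets I will do the same. If I find something solid, I will report it here.

Thanks for the updated comments, that's great.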
