Pandas dataframe type datetime64[ns] is not working in Hive/Athena

Posted 2023-03-11

【Title】: Pandas dataframe type datetime64[ns] is not working in Hive/Athena 【Asked】: 2019-05-23 23:46:39 【Question】:

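I am working on a Python application that simply converts CSV files to a Hive/Athena-compatible Parquet format, using the fastparquet and pandas libraries. The CSV file contains timestamp values such as 2018-12-21 23:45:00, which need to be written to the Parquet file as the timestamp type. Below is the code I am running: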

import io

import boto3
import pandas as pd
import s3fs
from fastparquet import write

columnNames = ["contentid", "processed_time", "access_time"]
dtypes = {'contentid': 'str'}
dateCols = ['access_time', 'processed_time']

s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucketname, Key=keyname)

# error_bad_lines is the older spelling; pandas 1.3+ uses on_bad_lines='skip'
df = pd.read_csv(io.BytesIO(obj['Body'].read()), compression='gzip', header=0, sep=',', quotechar='"', names=columnNames, error_bad_lines=False, dtype=dtypes, parse_dates=dateCols)

s3filesys = s3fs.S3FileSystem()
myopen = s3filesys.open
write('outfile.snappy.parquet', df, compression='SNAPPY', open_with=myopen, file_scheme='hive', partition_on=PARTITION_KEYS)

The code runs successfully. Below are the column dtypes of the dataframe pandas created:

contentid                 object
processed_time            datetime64[ns]
access_time               datetime64[ns]

Finally, when I query the Parquet file in Hive and Athena, the timestamp value is +50942-11-30 14:00:00.000 instead of 2018-12-21 23:45:00.

Any help is much appreciated.
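One reading that is consistent with the value reported above is a unit mismatch: the file stores the timestamp as an int64 count of microseconds since the epoch, and the query engine interprets the same integer as milliseconds. This is an assumption drawn from the numbers, not something the question confirms. A minimal arithmetic sketch using only the standard library:

```python
import calendar

# Epoch seconds for the timestamp in the question, treated as UTC.
epoch_seconds = calendar.timegm((2018, 12, 21, 23, 45, 0))

# Assumption: the writer stores the value as int64 microseconds since the epoch.
stored_micros = epoch_seconds * 1_000_000

# Assumption: the reader interprets that same integer as milliseconds.
misread_seconds = stored_micros / 1_000

# Rough calendar year, using the mean Gregorian year of 365.2425 days.
SECONDS_PER_YEAR = 365.2425 * 86_400
misread_year = int(1970 + misread_seconds / SECONDS_PER_YEAR)

print(misread_year)  # 50942
```

The resulting year matches the +50942 seen in Hive and Athena, which suggests the stored values themselves are intact and only their unit is being misinterpreted.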

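【Comments】:

Try converting the columns to datetime format when inserting into Hive.

pd.to_datetime(df['access_time', 'processed_time'], unit='ms', errors='coerce') I tried that too, but the result is still the same.

Do not parse the columns when creating the DF. Instead, convert to a datetime object with datetime.datetime.strptime('2018-12-21 23:45:00', '%Y-%m-%d %H:%M:%S') and apply that to the date columns of the df.

If this is not directly supported by Athena/Hive, you will probably need to use a function from prestodb.io/docs/current/functions/datetime.html, depending on the format your Python script generates. Use "parquet-tools cat" to examine your data and its schema. If you cannot find the correct conversion function, post the timestamp format here.

Did any of these answers help? I am running into the exact same issue.

【Answer 1】: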

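I know this question is old, but it is still relevant.

As mentioned earlier, Athena only supports int96 for timestamps. With fastparquet you can generate a Parquet file in the correct format for Athena. The important part is times='int96', because it tells fastparquet to convert pandas datetimes to int96 timestamps.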

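from fastparquet import write
import pandas as pd

def write_parquet():
    # Pass parse_dates=[...] here if the CSV holds timestamp strings,
    # so that those columns are read as datetime64[ns]
    df = pd.read_csv('some.csv')
    # times='int96' writes datetime columns as int96 timestamps
    write('/tmp/outfile.parquet', df, compression='GZIP', times='int96')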
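For readers wondering what an int96 timestamp actually contains, the sketch below packs a datetime by hand. It assumes the commonly described layout: 12 little-endian bytes, with the first 8 holding nanoseconds since midnight and the last 4 holding the Julian day number. The helper names to_int96 and from_int96 are hypothetical and are not part of fastparquet or any other library.

```python
import datetime
import struct

JULIAN_DAY_OF_EPOCH = 2_440_588  # Julian day number of 1970-01-01
EPOCH_ORDINAL = datetime.date(1970, 1, 1).toordinal()


def to_int96(ts: datetime.datetime) -> bytes:
    """Pack a naive datetime into the assumed 12-byte INT96 layout."""
    julian_day = JULIAN_DAY_OF_EPOCH + ts.date().toordinal() - EPOCH_ORDINAL
    seconds_of_day = ts.hour * 3600 + ts.minute * 60 + ts.second
    nanos_of_day = (seconds_of_day * 1_000_000 + ts.microsecond) * 1_000
    # 8 bytes of nanoseconds within the day, then 4 bytes of Julian day.
    return struct.pack("<qi", nanos_of_day, julian_day)


def from_int96(raw: bytes) -> datetime.datetime:
    """Unpack the 12-byte layout back into a naive datetime."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    day = datetime.date.fromordinal(EPOCH_ORDINAL + julian_day - JULIAN_DAY_OF_EPOCH)
    midnight = datetime.datetime.combine(day, datetime.time())
    return midnight + datetime.timedelta(microseconds=nanos_of_day // 1_000)


encoded = to_int96(datetime.datetime(2018, 12, 21, 23, 45, 0))
decoded = from_int96(encoded)
```

Because the day and the time of day are stored separately, a reader cannot confuse the unit of the value, unlike a bare int64 count.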

【Comments】:

Very helpful! Thank you very much!

【Answer 2】:

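I solved the problem this way.

First, convert the dataframe series with the to_datetime method.

Next, select the date part of the datetime64[ns] values with the .dt accessor.

Example: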

df.field = pd.to_datetime(df.field)
df.field = df.field.dt.date

After that, Athena recognizes the data.
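It is worth seeing what this conversion does to the values: .dt.date keeps only the calendar date, so the time of day is discarded and the column holds plain datetime.date objects. A small sketch with a one-row sample dataframe (the column name field follows the answer; the sample value comes from the question):

```python
import datetime

import pandas as pd

df = pd.DataFrame({"field": ["2018-12-21 23:45:00"]})

# Parse the strings into datetime64[ns] values.
df["field"] = pd.to_datetime(df["field"])

# Keep only the date part; 23:45:00 is dropped at this point.
as_date = df["field"].dt.date

print(as_date.iloc[0])
```

This approach therefore only fits columns where day-level precision is enough.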

【Answer 3】:

You can try:

dataframe.to_parquet(file_path, compression=None, engine='pyarrow', allow_truncated_timestamps=True, use_deprecated_int96_timestamps=True)

【Answer 4】:

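The problem seems to be with Athena, which appears to support only int96, whereas a timestamp created in pandas is an int64.

The dataframe column containing the date strings is "sdate". I first convert it to a timestamp: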

import pandas
import pyarrow
import pyarrow.parquet
import s3fs

# Add a new column with the parsed timestamp
df["ndate"] = pandas.to_datetime(df["sdate"])
# Convert the timestamp to microseconds (unit only affects numeric input)
df["ndate"] = pandas.to_datetime(df["ndate"], unit='us')

# Then convert the dataframe to a pyarrow table
table = pyarrow.Table.from_pandas(df, preserve_index=False)

# When writing to parquet, add coerce_timestamps and
# use_deprecated_int96_timestamps (this also writes to S3 directly)
OUTBUCKET = "my_s3_bucket"
s3 = s3fs.S3FileSystem()  # assumption: the original snippet does not show how s3 is created

pyarrow.parquet.write_to_dataset(table, root_path='s3://{0}/logs'.format(OUTBUCKET), partition_cols=['date'], filesystem=s3, coerce_timestamps='us', use_deprecated_int96_timestamps=True)

【Answer 5】:

I ran into the same problem and, after a lot of research, have now solved it.

When you do this:

write('outfile.snappy.parquet', df, compression='SNAPPY', open_with=myopen,file_scheme='hive',partition_on=PARTITION_KEYS)

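it uses fastparquet under the hood, which encodes DateTime values differently from the encoding Athena is compatible with.

The solution is to uninstall fastparquet and install pyarrow:

pip uninstall fastparquet
pip install pyarrow

Run your code again. This time it should work. :)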
