Appending Pandas dataframes with time to SQLite3 databases & back

Posted 2023-03-11

【Title】Appending Pandas dataframes with time to SQLite3 databases & back 【Posted】2018-08-16 05:32:29 【Question】:

I am trying this:

import pandas as pd
import sqlite3
import datetime, pytz

#nowtime=datetime.datetime.now(pytz.utc)
nowtime=datetime.datetime.now()

print(nowtime)
df = pd.DataFrame(columns=list('ABCD'))
df.loc[0]=(3,0.141,"five-nine",nowtime)
df.loc[1]=(1,0.41,"four-two",nowtime)

print(df)

db = sqlite3.connect(':memory:')
c = db.cursor()
c.execute('create table if not exists ABCD ( A integer, B real, C text, D timestamp );')
c.execute('insert into ABCD (A,B,C, D) values (?,?,?,?);',(1,2.2,'4',nowtime))
c.executemany('insert into ABCD (A,B,C, D) values (?,?,?,?);',df.to_records(index=False))

db.commit()

print(pd.read_sql('select * from ABCD;',db))

and getting this:

 2018-03-07 19:09:58.584953
   A      B          C                          D
0  3  0.141  five-nine 2018-03-07 19:09:58.584953
1  1  0.410   four-two 2018-03-07 19:09:58.584953
   A      B          C                           D
0  1  2.200          4  2018-03-07 19:09:58.584953
1  3  0.141  five-nine    b'\xa8hx?\t\xb9\x19\x15'
2  1  0.410   four-two    b'\xa8hx?\t\xb9\x19\x15'

Ideally, I'd like to push some data with timestamps into sqlite3 and recover it back into pandas/python/numpy in an interoperable way.

I've seen "Appending Pandas dataframe to sqlite table by primary key" for appending, but I'm not sure how to work with datetime.datetime, pandas Timestamps or numpy.datetime64 times using sqlite3.

There is also "How to read datetime back from sqlite as a datetime instead of string in Python?", but I couldn't figure out how to do that in pandas.

I spent a lot of time on https://***.com/a/21916253/1653571 and the confusing multiple to_datetime()s.

What's a good way to work with times, sqlite3, and pandas?

####### Update:

I tried these changes:

db = sqlite3.connect(':memory:',detect_types=sqlite3.PARSE_DECLTYPES)

#...
for index,row in df.iterrows():
    print(row)
    c.execute('insert into ABCD (A,B,C,D) values (?,?,?,?);',(row.A,row.B,row.C,row.D.to_pydatetime()))


x = pd.read_sql('select *  from ABCD;',db)

print('Type of a pd.read_sql(SQLite3) timestamp  : ',type(x['D'][0]))

x = c.execute('select * from ABCD').fetchall()

print(x)
print('Type of a sqlite.execute(SQLite3) timestamp  : ',type(x[0][3]))

These changes use the SQLite3 data types and test the returned values:

Type of a pd.read_sql(SQLite3) timestamp  :  <class 'pandas._libs.tslib.Timestamp'>
[(1, 2.2, '4', datetime.datetime(2018, 3, 8, 14, 46, 2, 520333)), (3, 141.0, 'five-nine', datetime.datetime(2018, 3, 8, 14, 46, 2, 520333)), (1, 41.0, 'four-two', datetime.datetime(2018, 3, 8, 14, 46, 2, 520333))]
Type of a sqlite.execute(SQLite3) timestamp  :  <class 'datetime.datetime'>

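Also, when I tried datetime.datetime.now(pytz.utc) to get a UTC-aware time, it broke a lot of things. Using datetime.datetime.utcnow() worked much better, because it returns a naive (not timezone-aware) object that isn't affected by timezones.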
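To make the aware/naive distinction concrete, here is a minimal sketch using only the standard library. It builds a naive UTC value by dropping the tzinfo from an aware one, which is an assumption on my part about an equivalent of utcnow() (utcnow() itself is deprecated as of Python 3.12). The variable names are illustrative only.

```python
import datetime

# Timezone-aware "now" in UTC.
aware = datetime.datetime.now(datetime.timezone.utc)

# Naive UTC value: same wall-clock fields, no tzinfo attached.
naive_utc = aware.replace(tzinfo=None)

# The aware value carries a "+00:00" suffix in its text form; the naive one does not.
print(aware.isoformat(" "))
print(naive_utc.isoformat(" "))
```

The extra offset suffix is one plausible reason an aware value behaves differently once it is stored as text and parsed back, though the thread does not say exactly what broke.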

Note also the Python sqlite3 documentation on the sqlite3.connect(detect_types=...) parameter. Enabling detect_types=PARSE_DECLTYPES|PARSE_COLNAMES cues Python to run converters on data passed between the systems.

https://docs.python.org/3/library/sqlite3.html#sqlite3.PARSE_DECLTYPES for create table ... xyzzy timestamp, ... conversions
https://docs.python.org/3/library/sqlite3.html#sqlite3.PARSE_COLNAMES for select ... date as "dateparsed [datetime]"... conversions
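A minimal sketch of the PARSE_DECLTYPES round trip, using only the standard library. Instead of relying on sqlite3's built-in datetime handling, it registers its own adapter and converter; adapt_datetime and convert_timestamp are hypothetical helper names, and the fixed timestamp is just sample data.

```python
import datetime
import sqlite3

def adapt_datetime(value):
    """Store a naive datetime as ISO text with a space separator."""
    return value.isoformat(" ")

def convert_timestamp(raw):
    """Rebuild a datetime from the bytes sqlite3 passes to converters."""
    return datetime.datetime.fromisoformat(raw.decode())

sqlite3.register_adapter(datetime.datetime, adapt_datetime)
# The converter is looked up by the declared column type ("timestamp").
sqlite3.register_converter("timestamp", convert_timestamp)

db = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
db.execute("create table ABCD (A integer, B real, C text, D timestamp)")

nowtime = datetime.datetime(2018, 3, 8, 14, 46, 2, 520333)
db.execute("insert into ABCD (A, B, C, D) values (?, ?, ?, ?)", (1, 2.2, "4", nowtime))
db.commit()

row = db.execute("select * from ABCD").fetchone()
# typeof() reports how SQLite itself stored the value.
stored_type = db.execute("select typeof(D) from ABCD").fetchone()[0]
print(type(row[3]), stored_type)
```

The point of the sketch is that the conversion happens in the Python sqlite3 module, keyed on the declared type name, while SQLite itself still holds the value as text.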

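【Answer 1】:

The problem stems from pandas' to_records(), which converts your datetime field into NumPy datetime64 values, printed as ISO timestamps with a T separator. As the question's output shows, sqlite3 stores such values as raw bytes instead of text: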

print(df.to_records(index=False))
# [(3, 0.141, 'five-nine', '2018-03-07T20:40:39.808427000')
#  (1, 0.41 , 'four-two', '2018-03-07T20:40:39.808427000')]

Consider converting the datetime column to string and then running the cursor's executemany():

df.D = df.D.astype('str')

print(df.to_records(index=False))
# [(3, 0.141, 'five-nine', '2018-03-07 20:40:39.808427')
#  (1, 0.41 , 'four-two', '2018-03-07 20:40:39.808427')]

Altogether:

db = sqlite3.connect(':memory:')
c = db.cursor()
c.execute('create table if not exists ABCD ( A integer, B real, C text, D timestamp );')
c.execute('insert into ABCD (A,B,C, D) values (?,?,?,?);',(1,2.2,'4',nowtime))

df['D'] = df['D'].astype('str')
c.executemany('insert into ABCD (A,B,C, D) values (?,?,?,?);',df.to_records(index=False))

db.commit()
print(pd.read_sql('select * from ABCD;',db))

#    A      B          C                           D
# 0  1  2.200          4  2018-03-07 20:47:15.031130
# 1  3  0.141  five-nine  2018-03-07 20:47:15.031130
# 2  1  0.410   four-two  2018-03-07 20:47:15.031130
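The same idea can be written without to_records() at all. The sketch below is an assumption about one way to do it, not part of the answer: it casts the datetime column to text on a copy and passes plain Python tuples from itertuples() to executemany(). The data values mirror the question; the variable names out and rows are illustrative.

```python
import datetime
import sqlite3

import pandas as pd

nowtime = datetime.datetime(2018, 3, 7, 20, 47, 15, 31130)
df = pd.DataFrame({
    "A": [3, 1],
    "B": [0.141, 0.41],
    "C": ["five-nine", "four-two"],
    "D": [nowtime, nowtime],
})

db = sqlite3.connect(":memory:")
db.execute("create table if not exists ABCD (A integer, B real, C text, D timestamp)")

# Cast on a copy so the original frame keeps its datetime column.
out = df.copy()
out["D"] = out["D"].astype(str)

# itertuples(name=None) yields plain tuples of Python scalars.
rows = list(out.itertuples(index=False, name=None))
db.executemany("insert into ABCD (A, B, C, D) values (?, ?, ?, ?)", rows)
db.commit()

back = pd.read_sql("select * from ABCD", db)
# Check how SQLite stored column D for every row.
kinds = {r[0] for r in db.execute("select typeof(D) from ABCD")}
print(back)
print(kinds)
```

Working on a copy keeps the in-memory frame usable for datetime arithmetic after the insert.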

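【Comments】:

I'm not sure I like str as a data type. Won't re-parsing back into pandas.Timestamp or datetime.datetime be slower and more complicated?

Did you check the dtypes after read_sql, which should derive from the SQLite types? The conversion to string is only for migrating to SQLite. Otherwise don't use to_records(); see iterrows. Or try the to_sql method with an SQLAlchemy connection.

I'm finding that the returned data type is str unless I add the detect_types=sqlite3.PARSE_DECLTYPES parameter to db.connect(). With that parameter, c.execute(...).fetch... returns datetime.datetime and pd.read_sql returns pandas._libs.tslib.Timestamp. It seems the SQLite types are ignored completely unless the parameter is set.

As a lightweight, file-level database, SQLite has very few data types: TEXT, NUMERIC, INTEGER, REAL, BLOB. There is no timestamp type, so the value is probably saved under its closest affinity class, TEXT.

@DaveX It's not only stored internally as TEXT; SQLite has no concept of it other than TEXT. When you see something like datetime.datetime coming back with PARSE_DECLTYPES, that's because you are reading from Pandas, which does have those types. As you've seen, they will always be stored as strings in SQLite. There is no solution on the SQLite side; your issue arises when reading SQLite back into Pandas, so the data type conversion needs to happen on the Pandas side.

【Answer 2】: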

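The main problem is that SQLite doesn't have a datetime data type.

PARSE_DECLTYPES can't help when reading from SQLite, because the declared column data type in SQLite will never be a datetime.

Since you are in control of the Pandas dataframes, you know the types at the point you save them back to SQLite.

The read_sql method you are using...

is a convenience wrapper around read_sql_table and read_sql_query (and for backward compatibility) and will delegate to the specific function depending on the provided input (database table name or SQL query).

In your example you have provided a query, so it delegates to the read_sql_query method https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_query.html#pandas.read_sql_query

This has a parse_dates parameter, which can be:

Dict of {column_name: arg dict}, where the arg dict corresponds to the keyword arguments of pandas.to_datetime(). Especially useful with databases without native datetime support, such as SQLite.

Since you know ahead of time which columns hold which data types, you can store them as a dict whose structure matches what parse_dates expects, and pass that to the read_sql method.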
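A minimal sketch of that approach, assuming the timestamps were stored as text in the format shown. The date_columns mapping is a hypothetical name for the stored per-table dict; its inner dict is passed on as keyword arguments to pandas.to_datetime().

```python
import sqlite3

import pandas as pd

db = sqlite3.connect(":memory:")
db.execute("create table ABCD (A integer, B real, C text, D timestamp)")
db.execute(
    "insert into ABCD (A, B, C, D) values (?, ?, ?, ?)",
    (1, 2.2, "4", "2018-03-07 20:47:15.031130"),
)
db.commit()

# Hypothetical saved mapping: {column_name: to_datetime keyword arguments}.
date_columns = {"D": {"format": "%Y-%m-%d %H:%M:%S.%f"}}

# Without parse_dates the column comes back as text.
plain = pd.read_sql_query("select * from ABCD", db)
# With parse_dates pandas converts the named columns to datetimes.
parsed = pd.read_sql_query("select * from ABCD", db, parse_dates=date_columns)
print(plain["D"].dtype, parsed["D"].dtype)
```

A plain list such as parse_dates=["D"] should also work when no special to_datetime arguments are needed.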

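In other cases where I save a pandas df back to csv or another file, I have used something like this to save the schema so it can be reintroduced when loading the csv back into pandas. The read_csv method has a dtype parameter that takes exactly the structure below.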

def getPandasSchema(df):
    ''' 
    takes a pandas dataframe and returns the dtype dictionary
    useful for applying types when reloading that dataframe from csv etc
    '''
    return dict(zip(df.columns.tolist(),df.dtypes.tolist()))
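A sketch of how such a schema dict might be reapplied on a CSV round trip. One caveat, as far as I know: read_csv does not accept datetime dtypes through dtype and expects those columns via parse_dates instead, so the sketch splits the schema in two. get_pandas_schema mirrors the function above; date_cols and dtypes are hypothetical names, and the in-memory buffer stands in for a real file.

```python
import io

import pandas as pd

def get_pandas_schema(df):
    """Return {column name: dtype} for a dataframe."""
    return dict(zip(df.columns.tolist(), df.dtypes.tolist()))

df = pd.DataFrame({
    "A": [3, 1],
    "B": [0.141, 0.41],
    "C": ["five-nine", "four-two"],
    "D": pd.to_datetime(["2018-03-07 20:47:15.031130"] * 2),
})
schema = get_pandas_schema(df)

# Datetime columns go to parse_dates; everything else goes to dtype.
date_cols = [c for c, t in schema.items() if pd.api.types.is_datetime64_any_dtype(t)]
dtypes = {c: t for c, t in schema.items() if c not in date_cols}

buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)

back = pd.read_csv(buf, dtype=dtypes, parse_dates=date_cols)
print(back.dtypes)
```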
