What's the most efficient way to convert a MySQL result set to a NumPy array?

Posted 2023-02-24

【Posted】: 2011-10-27 01:41:50 【Question】:

I'm using MySQLdb with Python. I have some basic queries such as:

c=db.cursor()
c.execute("SELECT id, rating from video")
results = c.fetchall()

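I need "results" as a NumPy array, and I want to keep memory consumption low. Copying the data row by row looks extremely inefficient (it needs double the memory). Is there a better way to convert MySQLdb query results into a NumPy array?

The reason I want a NumPy array is that I'd like to slice and dice the data easily, and plain Python doesn't seem very friendly to multi-dimensional arrays in that respect.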

e.g. b = a[a[:,2]==1]

Thanks!

【Comments on the question】:

【Answer 1】:

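This solution uses Keith's fromiter technique, but handles the two-dimensional table structure of SQL results more intuitively. It also improves on Doug's method by avoiding all the reshaping and flattening in Python data types. Using a structured array, we can read almost straight from the MySQL result into NumPy, cutting out Python data types almost entirely. I say "almost" because the fetchall result still yields Python tuples.

There is one caveat, though it's not a big one: you must know the data types of your columns and the number of rows in advance.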

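Knowing the column types should be straightforward, since you presumably know what the query is; otherwise you can always use curs.description together with a map of the MySQLdb.FIELD_TYPE.* constants.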
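As a rough sketch of such a mapping, the snippet below builds a NumPy structured dtype from a description-style sequence. The integer type codes are assumed to match the MySQLdb.constants.FIELD_TYPE values (for example LONG = 3 and DOUBLE = 5), so check them against your driver. The names TYPE_CODE_TO_DTYPE and dtype_from_description are hypothetical, and the sample description is made up so the code runs without a database.

```python
import numpy as np

# Assumed to mirror MySQLdb.constants.FIELD_TYPE codes; verify against your driver.
TYPE_CODE_TO_DTYPE = {
    1: "i1",  # TINY
    2: "i2",  # SHORT
    3: "i4",  # LONG
    4: "f4",  # FLOAT
    5: "f8",  # DOUBLE
    8: "i8",  # LONGLONG
}


def dtype_from_description(description):
    """Build a structured dtype from DB-API cursor.description entries.

    Each entry is a sequence whose first two items are (name, type_code).
    """
    fields = []
    for column in description:
        name, type_code = column[0], column[1]
        fields.append((name, TYPE_CODE_TO_DTYPE[type_code]))
    return np.dtype(fields)


# Made-up stand-in for curs.description after "SELECT id, rating FROM video".
sample_description = [
    ("id", 3, None, None, None, None, 0),
    ("rating", 5, None, None, None, None, 1),
]
row_dtype = dtype_from_description(sample_description)
print(row_dtype)
```

This sketch ignores unsigned columns and NULL values, both of which would need extra handling in real use.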

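Knowing the number of rows means you have to use a client-side cursor (which is the default). I know little about the internals of MySQLdb and the MySQL client libraries, but my understanding is that with a client-side cursor the entire result is fetched into client memory, although I suspect some buffering and caching is involved in practice. This means the result occupies double the memory, once for the cursor's copy and once for the array's copy, so if the result set is large it is probably a good idea to close the cursor as soon as possible to free that memory.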

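Strictly speaking, you don't have to supply the row count in advance, but doing so means the array memory is allocated once up front rather than being resized repeatedly as more rows arrive from the iterator, which gives a large performance boost.

And some code: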

import MySQLdb
import numpy

conn = MySQLdb.connect(host='localhost', user='bob', passwd='mypasswd', db='bigdb')
curs = conn.cursor() #Use a client side cursor so you can access curs.rowcount
numrows = curs.execute("SELECT id, rating FROM video")

# curs.fetchall() supplies the rows, as in Keith's answer
# count=numrows means the array is allocated once, up front
# dtype='i4,i4' means two columns, both 4-byte (32-bit) integers
A = numpy.fromiter(curs.fetchall(), count=numrows, dtype='i4,i4')

print(A)  # output the entire array
ids = A['f0']  # ids is an array of the first column
               # (strictly speaking it's a field, not a column)
ratings = A['f1']  # ratings is an array of the second column

See the NumPy documentation on dtype and on structured arrays for how to specify column data types and column names.
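To illustrate named fields, here is a minimal sketch that runs without a database. The rows tuple is a made-up stand-in for curs.fetchall(), and the field names id and rating are simply chosen to match the query; nothing here comes from the original answer beyond the fromiter call itself.

```python
import numpy as np

# Made-up stand-in for curs.fetchall(): a tuple of (id, rating) row tuples.
rows = ((1, 5), (2, 3), (3, 5), (4, 1))

# Name the fields instead of relying on the default 'f0', 'f1'.
row_dtype = np.dtype([("id", "i4"), ("rating", "i4")])

# count=len(rows) lets NumPy allocate the whole array once.
A = np.fromiter(rows, dtype=row_dtype, count=len(rows))

# Boolean-mask filtering, similar in spirit to b = a[a[:, 2] == 1].
top_rated = A[A["rating"] == 5]
print(top_rated["id"])
```

With named fields, a filter reads as A[A["rating"] == 5] instead of indexing by column position.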

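【Comments】:

If anyone lands here looking for a 2D array rather than a structured array, converting it is easy: ndarray_data = A.view(np.int32).reshape((len(A), -1)). Substitute whatever type best fits all of your data.

【Answer 2】: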

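The fetchall method returns an iterable (a sequence of row tuples), and NumPy has the fromiter function to initialize an array from an iterable. So, depending on the data in the table, you can combine the two directly or use an adapter generator.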
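One possible reading of "adapter generator" is a lazy flattener placed between the rows and fromiter. The sketch below assumes a non-empty result whose columns can all be stored as floats; the rows tuple is a made-up stand-in for c.fetchall(), and itertools.chain is my choice of adapter, not something the answer specifies.

```python
from itertools import chain

import numpy as np

# Made-up stand-in for c.fetchall(): a tuple of (id, rating) row tuples.
rows = ((1, 4.5), (2, 3.0), (3, 5.0))
num_cols = len(rows[0])

# chain.from_iterable lazily yields the scalars row by row, so no
# intermediate flattened list is built.
flat_values = chain.from_iterable(rows)

# count is known up front, so the array is allocated once.
D = np.fromiter(flat_values, dtype=float, count=len(rows) * num_cols)

# Restore the rows-by-columns shape of the result set.
D = D.reshape(len(rows), num_cols)
print(D)
```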

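【Comments】:

fromiter only produces a 1D array object, right? In this example we need a 2D one. I suppose you could convert it somehow, but would that still be the most efficient approach in that case?

Yes, you can reshape it afterwards. NumPy arrays are very efficient that way. You can set the shape attribute to a tuple (2,) and that should work.

Hi Keith, thanks for the info; it's good to know NumPy handles this gracefully. Unfortunately, I'm struggling with the fromiter() function you recommended. results = c.fetchall(); D = np.fromiter(results, dtype=float, count=-1) gives ValueError: setting an array element with a sequence. It doesn't seem to matter whether the result is 1D or 2D. Any ideas?

Try taking out the "iterable" keyword argument and passing it as the positional (first) argument instead.

【Answer 3】: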

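NumPy's fromiter method seems best here (as in Keith's answer, which preceded this one).

Using fromiter to recast a result set returned by a MySQLdb cursor method into a NumPy array is simple, but there are a couple of details perhaps worth mentioning.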

import numpy as NP
import MySQLdb as SQL

cxn = SQL.connect('localhost', 'some_user', 'their_password', 'db_name')
c = cxn.cursor()
c.execute('SELECT id, rating FROM video')

# fetchall() returns a nested tuple (one tuple for each table row)
results = c.fetchall()

# 'num_rows' is needed to reshape the 1D NumPy array returned by 'fromiter',
# in other words, to restore the original dimensions of the result set
num_rows = int(c.rowcount)

# flatten the nested tuple into a plain list of scalars so it's a proper iterable:
x = [value for row in results for value in row]

# D is a 1D NumPy array; the iterable must be passed positionally
D = NP.fromiter(x, dtype=float, count=-1)

# 'restore' the original dimensions of the result set:
D = D.reshape(num_rows, -1)

Note that fromiter returns a 1D NumPy array (this makes sense, of course, because by passing an argument for count you can use fromiter to return just a portion of a single MySQL table row).

Still, you have to restore the 2D shape, hence the earlier read of the cursor's rowcount attribute and the subsequent call to reshape in the last line.

Finally, the default value of the count parameter is -1, which simply consumes the entire iterable.
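A small sketch of how count behaves, using a made-up list of values in place of query results:

```python
import numpy as np

values = [10, 20, 30, 40, 50]

# Default count=-1: consume the iterable until it is exhausted.
everything = np.fromiter(iter(values), dtype=float)

# count=3: read exactly three items and stop.
first_three = np.fromiter(iter(values), dtype=float, count=3)

print(everything.size, first_three.size)
```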

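【Comments】:

Thanks, I think this is exactly what I wanted. However, when I try to run your code, it tells me "TypeError: Required argument 'iter' (pos 1) not found". Does it work for you? c.execute("SELECT id, rating FROM video"); results = c.fetchall(); num_rows = int(c.rowcount); D = np.fromiter(iterable=results, dtype=float, count=-1); D = D.reshape(num_rows, -1)

Edited my answer to include the intermediate steps of recasting and flattening 'results'. To save typing, I had left these trivial steps out of my original answer and instead noted in a comment line that 'results' is a nested tuple.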
