NumPy: what dtype=object means and its pitfalls

Posted 2023-04-03 Shihao Weng


0. Two code snippets to start:

Code 1:

import numpy as np
a = []
e = 0.3
a.append(['s1', 's2', 's3', float(e)])
a = np.array(a)
print(type(a[0, 3]))

Output:
<class 'numpy.str_'>

Code 2:

import numpy as np
a = []
e = 0.3
a.append(['s1', 's2', 's3', float(e)])
a = np.array(a, dtype=object)
print(type(a[0, 3]))

Output:
<class 'float'>

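1. Explanation:

NumPy arrays are stored as contiguous blocks of memory. They usually have a single data type (for example integers, floats, or fixed-length strings), and the bits in memory are interpreted as values of that type. In Code 1 no dtype is given, so NumPy picks one type that can hold every element of the mixed list, which here is a fixed-length string type; the float 0.3 is therefore stored as the string '0.3'.
An array created with dtype=object is different. Its memory is filled with pointers to Python objects stored elsewhere in memory (much as a Python list is really a list of pointers to objects, not the objects themselves), so in Code 2 the float stays a float.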

2. The original English wording:

NumPy arrays are stored as contiguous blocks of memory. They usually have a single datatype (e.g. integers, floats or fixed-length strings) and then the bits in memory are interpreted as values with that datatype.
Creating an array with dtype=object is different. The memory taken by the array now is filled with pointers to Python objects which are being stored elsewhere in memory (much like a Python list is really just a list of pointers to objects, not the objects themselves).
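The difference between the two storage models can be checked directly. The following is a minimal sketch, not taken from the original post; the variable names `row`, `as_str` and `as_obj` are made up for illustration.

```python
import numpy as np

# Hypothetical sample row: three strings and one float, as in Code 1 / Code 2.
row = ['s1', 's2', 's3', 0.3]

as_str = np.array([row])                # no dtype: NumPy infers one common type
as_obj = np.array([row], dtype=object)  # object dtype: stores references

print(as_str.dtype.kind)        # 'U' means a fixed-length Unicode string type
print(as_obj.dtype)             # object
print(repr(as_str[0, 3]))       # the float was converted to a string
print(as_obj[0, 3] is row[3])   # the array holds the very same Python float
```

In the object array the last element is not a copy: it is the same Python object that sits in the list, which is what "filled with pointers" means in practice.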

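3. The pitfall:

As Code 1 shows, the numeric value ends up as a string. If you append several more rows and then sort the NumPy matrix by that column with np.argsort(), the values are compared as str in lexicographic order, not numerically, so the final order can be wrong. If you are not aware of this, it is very hard to spot.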
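To make the pitfall concrete, here is a small sketch with three made-up rows (the data and variable names are assumptions, not from the original post). It sorts by the numeric column once on the string array and once on the object array, and also shows one possible fix: converting the column back to float before sorting.

```python
import numpy as np

# Hypothetical rows; the last column is meant to be numeric.
rows = [
    ['a', 'b', 'c', 10.0],
    ['d', 'e', 'f', 9.0],
    ['g', 'h', 'i', 100.0],
]

as_str = np.array(rows)                # everything becomes a string
as_obj = np.array(rows, dtype=object)  # floats stay floats

order_str = np.argsort(as_str[:, 3])   # compares '10.0', '9.0', '100.0' as text
order_obj = np.argsort(as_obj[:, 3])   # compares 10.0, 9.0, 100.0 as numbers

# One way to repair the string array: convert the column before sorting.
order_fixed = np.argsort(as_str[:, 3].astype(float))

print(order_str)    # lexicographic order
print(order_obj)    # numeric order
print(order_fixed)  # numeric order again
```

With these values the string sort puts '10.0' before '100.0' before '9.0', while the numeric sort puts 9.0 first. Both results look plausible at a glance, which is why the bug is easy to miss.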

22.08.04
Happy Chinese Valentine's Day, chui chui ♥

Why does object dtype array contain datetime.datetime objects instead of numpy.datetime64 objects?

Asked: 2020-08-20 21:01:37

Question:

I don't understand why my NumPy array stores the numpy.datetime64 values from dts as datetime.datetime objects.

In [1]: import numpy as np
In [2]: import datetime
In [3]: arr = np.ones((3,), dtype='O') 
In [4]: dts = np.array([np.datetime64(datetime.datetime.today())] * 2)
In [5]: dts
Out[5]: 
array(['2020-08-20T14:44:03.945058', '2020-08-20T14:44:03.945058'],
      dtype='datetime64[us]')
In [6]: arr[:2] = dts 
In [7]: arr                                                                     
Out[7]: 
array([datetime.datetime(2020, 8, 20, 14, 44, 3, 945058),
       datetime.datetime(2020, 8, 20, 14, 44, 3, 945058), 1], dtype=object)

I have been able to work around this with the code below, but my actual case is more complex and I would prefer to use the approach above.

In [8]: arr = np.ones((3,), dtype='O')  
In [9]: dts = np.array([np.datetime64(datetime.datetime.today())] * 2) 
In [10]: for i in [0, 1]: 
    ...:     arr[i] = dts[i]  
In [11]: arr                                                                    
Out[11]: 
array([numpy.datetime64('2020-08-20T14:53:20.878553'),
       numpy.datetime64('2020-08-20T14:53:20.878553'), 1], dtype=object)

Given that arr has object dtype, why doesn't the first approach store the exact object types from dts?

Comments:

You kind of asked for it.

Answer 1:

In [346]: dts = np.array([np.datetime64(datetime.datetime.today())] * 2)                             
In [347]: dts                                                                                        
Out[347]: 
array(['2020-08-20T14:46:12.940815', '2020-08-20T14:46:12.940815'],
      dtype='datetime64[us]')

tolist converts the array to a list, rendering the elements as basic Python objects where possible. Evidently datetime64 is programmed to render itself as a datetime object:

In [348]: dts.tolist()                                                                               
Out[348]: 
[datetime.datetime(2020, 8, 20, 14, 46, 12, 940815),
 datetime.datetime(2020, 8, 20, 14, 46, 12, 940815)]

Converting the dts array to object dtype also converts the elements to datetime:

In [387]: dts.astype(object)[0]                                                                      
Out[387]: datetime.datetime(2020, 8, 20, 14, 46, 12, 940815)

So arr[:] = dts must be going through tolist or astype(object), or something equivalent.

In [349]: dts[0]                                                                                     
Out[349]: numpy.datetime64('2020-08-20T14:46:12.940815')
In [350]: arr = np.ones(2, object)                                                                   
In [351]: arr[:] = dts                                                                               
In [352]: arr                                                                                        
Out[352]: 
array([datetime.datetime(2020, 8, 20, 14, 46, 12, 940815),
       datetime.datetime(2020, 8, 20, 14, 46, 12, 940815)], dtype=object)

Something similar happens with floats:

In [360]: x = np.array([1.23, 23.2])                                                                 
In [361]: type(x[0])                                                                                 
Out[361]: numpy.float64
In [362]: arr[:] = x                                                                                 
In [363]: arr                                                                                        
Out[363]: array([1.23, 23.2], dtype=object)
In [364]: type(arr[0])                                                                               
Out[364]: float

Assigning a single item preserves its type:

In [365]: arr[0] = x[0]                                                                              
In [366]: arr                                                                                        
Out[366]: array([1.23, 23.2], dtype=object)
In [367]: type(arr[0])                                                                               
Out[367]: numpy.float64
In [368]: type(arr[1])                                                                               
Out[368]: float

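arr now contains both an np.float64 and a float.

Remember that an object dtype array stores references to objects; the objects themselves live elsewhere in memory. In this respect it is much like a list. A numeric dtype array, on the other hand, stores bytes, and those bytes are interpreted by the dtype machinery. dts[0] does not actually reference the 8-byte block in dts; it is a new datetime64 object. And arr[0] (in the code above) is another datetime64 object (with the same value).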
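The distinction between stored bytes and stored references can be observed with identity checks. This is a minimal sketch with made-up names (`x`, `obj`, `marker`); it relies on the fact that indexing a numeric array builds a new scalar object each time, while indexing an object array hands back the stored object.

```python
import numpy as np

x = np.array([1.23, 23.2])        # float64 array: holds raw 8-byte values
obj = np.empty(2, dtype=object)   # object array: holds references

marker = object()                 # arbitrary Python object, for illustration
obj[0] = x[0]                     # stores a reference to a new np.float64 scalar
obj[1] = marker                   # stores a reference to marker

# Each x[0] creates a fresh scalar from the bytes, so the two are distinct objects.
same_from_numeric = x[0] is x[0]
# Each obj[0] returns the one object the array refers to.
same_from_object = obj[0] is obj[0]

print(same_from_numeric, same_from_object)
print(obj[1] is marker)
```

The values compare equal in both cases; only the object identity differs, which is the point made in the paragraph above.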

Comments:

Is there any way to avoid the for loop and still get the result I'm looking for?

@dshanahan Do you have a reason to use dtype='O'? Could you use a structured array instead, with real numpy datetime64 for the first two columns?
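The structured-array suggestion in the last comment could look roughly like the sketch below. The field names `when` and `value`, the fixed timestamp, and the array size are assumptions chosen for illustration; the asker's real layout is not known.

```python
import numpy as np

# Fixed timestamp so the example is reproducible (the thread used today()).
stamp = np.datetime64('2020-08-20T14:46:12.940815')
dts = np.array([stamp] * 2)       # dtype is datetime64[us]

# Hypothetical layout: one datetime64 field and one float field.
rec = np.zeros(3, dtype=[('when', 'datetime64[us]'), ('value', 'f8')])

rec['when'][:2] = dts             # slice assignment, no for loop needed
rec['value'] = 1.0

print(rec.dtype)
print(type(rec['when'][0]))       # still a numpy.datetime64
```

Because the `when` field has a real datetime64 dtype, the values are stored as bytes and are never converted to datetime.datetime. The trade-off is that each field has one fixed type, unlike an object array.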