Datawhale powerful-numpy《从小白到入门》学习笔记

Posted 临风而眠

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Datawhale powerful-numpy《从小白到入门》学习笔记相关的知识,希望对你有一定的参考价值。

Datawhale powerful-numpy《从小白到入门》学习笔记

持续更新中

文章目录

摘自官方文档的一些话

What is NumPy?

NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object(多维数组对象), various derived(派生) objects (such as masked arrays[【掩码数组】 and matrices), and an assortment of routines for fast operations on arrays(用于数组快速操作的各种API), including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms(离散傅里叶变换), basic linear algebra, basic statistical operations, random simulation and much more.

At the core of the NumPy package, is the ndarray object(NumPy包的核心是ndarray. This encapsulates n-dimensional arrays of homogeneous data types(它封装了python原生的同数据类型的n维数组), with many operations being performed in compiled code for performance(许多操作在编译代码中执行以提高性能). There are several important differences between NumPy arrays and the standard Python sequences:

  • NumPy arrays have a fixed size at creation(NumPy数组在创建时有固定大小), unlike Python lists (which can grow dynamically). Changing the size of an ndarray will create a new array and delete the original(更改ndarray的大小将创建一个新数组并删除原始数组).
  • The elements in a NumPy array are all required to be of the same data type, and thus will be the same size in memory.
    • The exception: one can have arrays of (Python, including NumPy) objects, thereby allowing for arrays of different sized elements.
  • NumPy arrays facilitate advanced mathematical and other types of operations on large numbers of data(有助于对大量数据进行高级数学运算和其他类型的运算). Typically, such operations are executed more efficiently(执行更高效) and with less code than is possible using Python’s built-in sequences.
  • A growing plethora of(越来越多的) scientific and mathematical Python-based packages(基于python的科学和数学软件包) are using NumPy arrays; though these typically support Python-sequence input, they convert such input to NumPy arrays prior to processing, and they often output NumPy arrays(尽管这些软件包通常支持python的内置序列作为输入,但它们会在处理之前把此类输入转换为NumPy数组,并且它们通常会输出NumPy数组). In other words, in order to efficiently use much (perhaps even most) of today’s scientific/mathematical Python-based software, just knowing how to use Python’s built-in sequence types is insufficient - one also needs to know how to use NumPy arrays.

The points about sequence size and speed are particularly important in scientific computing. As a simple example, consider the case of multiplying each element in a 1-D sequence with the corresponding element in another sequence of the same length. If the data are stored in two Python lists, a and b, we could iterate over each element(遍历每个元素):

c = []
for i in range(len(a)):
    c.append(a[i]*b[i])

This produces the correct answer, but if a and b each contain millions of numbers, we will pay the price for the inefficiencies of looping in Python(为python中循环的低效付出代价). We could accomplish the same task much more quickly in C by writing (for clarity we neglect variable declarations and initializations, memory allocation, etc.【为了清楚起见,我们忽略了变量声明和初始化、内存分配等】)

for (i = 0; i < rows; i++) 
  c[i] = a[i]*b[i];

This saves all the overhead involved in interpreting the Python code and manipulating Python objects, but at the expense of the benefits gained from coding in Python. Furthermore, the coding work required increases with the dimensionality of our data (所需的编码工作随着我们数据的维数而增加). In the case of a 2-D array, for example, the C code (abridged as before 【如前所述】) expands to

for (i = 0; i < rows; i++) 
  for (j = 0; j < columns; j++) 
    c[i][j] = a[i][j]*b[i][j];
  

NumPy gives us the best of both worlds(两全其美的办法): element-by-element operations are the “default mode” when an ndarray is involved, but the element-by-element operation is speedily executed by pre-compiled C code(预编译的C代码可以快速执行逐个元素的操作). In NumPy

c = a * b

does what the earlier examples do, at near-C speeds, but with the code simplicity we expect from something based on Python(以接近C的速度执行前面示例所做的操作,同时具有我们期望的python代码的简洁性). Indeed(事实上), the NumPy idiom is even simpler! This last example illustrates two of NumPy’s features which are the basis of much of its power: vectorization and broadcasting.

Why is NumPy Fast?

Vectorization(矢量化) describes the absence of any explicit looping(代码中没有任何显式循环), indexing, etc., in the code - these things are taking place, of course, just “behind the scenes” in optimized, pre-compiled C code(这些事情只是在优化的、预编译的C代码中”在幕后“发生). Vectorized code has many advantages, among which are(其中包括):

  • vectorized code is more concise and easier to read(简洁易读)
  • fewer lines of code generally means fewer bugs
  • the code more closely resembles standard mathematical notation 标准数学符号 (making it easier, typically, to correctly code mathematical constructs【通常更容易对数学结构进行编码】)
  • vectorization results in more “Pythonic” code. Without vectorization, our code would be littered with inefficient and difficult to read for loops.(如果没有矢量化,我们的代码将充满低效且难读的for循环)

Broadcasting is the term used to describe the implicit element-by-element behavior of operations(广播是用于描述操作的隐式逐个元素行为的术语); generally speaking, in NumPy all operations, not just arithmetic operations, but logical, bit-wise(按位运算), functional(函数运算), etc., behave in this implicit element-by-element fashion(都以这种隐含的逐元素方式运行), i.e.(即), they broadcast. Moreover, in the example above, a and b could be multidimensional arrays of the same shape, or a scalar and an array, or even two arrays of with different shapes, provided that the smaller array is “expandable” to the shape of the larger in such a way that the resulting broadcast is unambiguous(前提是较小的数组可以”扩展“为较大的形状且符合广播的规则,有明确的结果). For detailed “rules” of broadcasting see Broadcasting.

Who Else Uses NumPy?

NumPy fully supports an object-oriented approach(面向对象方法), starting, once again, with ndarray. For example, ndarray is a class, possessing numerous methods and attributes(拥有许多方法和属性). Many of its methods are mirrored by functions in the outer-most NumPy namespace(Numpy最外层的命名空间), allowing the programmer to code in whichever paradigm they prefer(允许程序员以他们喜欢的任何范式进行编码). This flexibility has allowed the NumPy array dialect and NumPy ndarray class to become the de-facto(实际上) language of multi-dimensional data interchange used in Python.(成为Python中使用最多的多维数据交换的实际语言)

一.创建和生成

先导入

# 导入 library
import numpy as np
# 画图工具
import matplotlib.pyplot as plt
  • 在实际工作过程中,我们时不时需要验证或查看 array 相关的 API 或互操作。
  • 有时候在使用 sklearn,matplotlib,PyTorch,Tensorflow 等工具时也需要一些简单的数据进行实验。

下面学习的创建和生成array的多种方法中,最常用的一般是 linspace/logspace 和 random,前者常常用在画坐标轴上,后者则用于生成「模拟数据」。举例来说,当我们需要画一个函数的图像时,X 往往使用 linspace 生成,然后使用函数公式求得 Y,再 plot;当我们需要构造一些输入(比如 X)或中间输入(比如 Embedding、hidden state)时,random 很方便。

1.从python列表或元组创建

  • 重点掌握传入 list 创建一个 array 即可:np.array(list)

  • ⚠️ 需要注意的是:「数据类型」

    • array 是要保证每个元素类型相同的

从列表创建

  • 一维list

    >>> # 一个 list
    >>> np.array([1,2,3])
    array([1, 2, 3])
    
  • 二维list

    >>> # 二维(多维类似)
    >>> # 注意,有一个小数哦
    >>> np.array([[1, 2., 3], [4, 5, 6]])
    array([[1., 2., 3.],
           [4., 5., 6.]])
    
  • 指定数据类型

    
    >>> # 您也可以指定数据类型
    >>> np.array([1, 2, 3], dtype=np.float16)
    array([1., 2., 3.], dtype=float16)
    >>> # 如果指定了 dtype,输入的值都会被转为对应的类型,而且不会四舍五入
    >>> lst = [
    ...     [1, 2, 3],
    ...     [4, 5, 6.8]
    ... ]
    >>> np.array(lst, dtype=np.int32)
    array([[1, 2, 3],
           [4, 5, 6]])
    

从元组创建

  • 创建

    >>> # 一个 tuple
    >>> np.array((1.1, 2.2))
    array([1.1, 2.2])
    >>> # tuple,一般用 list 就好,不需要使用 tuple
    >>> np.array([(1.1, 2.2, 3.3), (4.4, 5.5, 6.6)])
    array([[1.1, 2.2, 3.3],
           [4.4, 5.5, 6.6]])
    >>> np.array(((1.1, 2.2, 3.3), (4.4, 5.5, 6.6)))
    array([[1.1, 2.2, 3.3],
           [4.4, 5.5, 6.6]])
    
  • 转换

    >>> np.asarray((1,2,3))
    array([1, 2, 3])
    >>> np.asarray(((1.1, 2.2, 3.3), (4.4, 5.5, 6.6)))
    array([[1.1, 2.2, 3.3],
           [4.4, 5.5, 6.6]])
    >>> np.asarray(([1.,2.,3.],(4.,5.,6.)))
    array([[1., 2., 3.],
           [4., 5., 6.]])
    

2.使用arange生成

  • range 是 Python 内置的整数序列生成器,arange 是 numpy 的,效果类似,会生成一维的向量。
  • 如下情况会需要用arange来创建array
    • 需要创建一个连续一维向量作为输入(比如编码位置时可以使用)
    • 需要观察筛选、抽样的结果时,有序的 array 一般更加容易观察
  • ⚠️ 需要注意的是:在 reshape 时,目标的 shape 需要的元素数量一定要和原始的元素数量相等。
>>> np.arange(12)
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
>>> np.arange(12).reshape(3,4)
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> np.arange(12.0).reshape(2,6)
array([[ 0.,  1.,  2.,  3.,  4.,  5.],
       [ 6.,  7.,  8.,  9., 10., 11.]])
>>> np.arange(100,124,2)
array([100, 102, 104, 106, 108, 110, 112, 114, 116, 118, 120, 122])
>>> np.arange(100,124,2).reshape(3,2,2)
array([[[100, 102],
        [104, 106]],

       [[108, 110],
        [112, 114]],

       [[116, 118],
        [120, 122]]])
>>> np.arange(100.,124.,2).reshape(3,2,2)
array([[[100., 102.],
        [104., 106.]],

       [[108., 110.],
        [112., 114.]],

       [[116., 118.],
        [120., 122.]]])
>>> np.arange(100.,124.,2.).reshape(3,2,2)
array([[[100., 102.],
        [104., 106.]],

       [[108., 110.],
        [112., 114.]],

       [[116., 118.],
        [120., 122.]]])

3.使用linspace/logspace生成

4.使用ones/zeros创建

5.使用random生成

6.从文件读取

二.统计和属性

1.尺寸相关

2.最值分位

3.平均求和标准差

三.形状和转变

1.改变形状

2.反序

3.转置

四.分解和组合

1.切片和索引

2.拼接

3.重复

4.分拆

五.筛选和过滤

1.条件筛选

2.提取

3.抽样

4.最值 Index

六.矩阵和运算

1.算术

2.矩阵

七.小结

参考资料

四.分解和组合

1.切片和索引

2.拼接

3.重复

4.分拆

五.筛选和过滤

1.条件筛选

2.提取

3.抽样

4.最值 Index

六.矩阵和运算

1.算术

2.矩阵

七.小结

参考资料

以上是关于Datawhale powerful-numpy《从小白到入门》学习笔记的主要内容,如果未能解决你的问题,请参考以下文章

《Datawhale项目实践系列》发布!

Datawhale团队第七期录取名单!

Datawhale面经项目来了!

Datawhale来交大啦!

英特尔 x Datawhale联合认证发布!

Datawhale寒假组队学习来了!