Notes on "Taught by a Google Instructor: TensorFlow 2.0 from Beginner to Advanced" (dataset)

Posted 2022-01-27 一杯敬朝阳一杯敬月光

Versions: numpy 1.16.6, tensorflow 2.2.0, tensorflow.keras 2.3.0-tf

1. Introduction

Basic Dataset usage
- tf.data.Dataset.from_tensor_slices # build a dataset
- repeat, batch, interleave, map, shuffle, list_files, ...
csv
- tf.data.TextLineDataset # read text files
- tf.io.decode_csv # parse csv
tfrecord
- tf.train.FloatList, tf.train.Int64List, tf.train.BytesList
- tf.train.Feature, tf.train.Features, tf.train.Example # wrap data in a tf.train.Example to write it to a file
- example.SerializeToString # serialize
- tf.io.parse_single_example # parse a single tf.train.Example
- tf.io.VarLenFeature, tf.io.FixedLenFeature
- tf.data.TFRecordDataset, tf.io.TFRecordOptions

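2. Basic API usage

2.1 Building a dataset from in-memory data

tf.data.Dataset.from_tensor_slices builds a dataset from in-memory data. The argument can be a plain list, a NumPy array, a tuple, or a dictionary. A tuple has the form (x, y) and a dictionary has the form {key1: x, key2: y}; in both cases x and y must have the same size in their first dimension.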

import numpy as np
import tensorflow as tf

# Plain list
# TensorSliceDataset shapes: (), types: tf.int32
dataset = tf.data.Dataset.from_tensor_slices(list(range(10)))
# NumPy array
dataset = tf.data.Dataset.from_tensor_slices(np.arange(10))

x = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array(['cat', 'dog', 'fox'])
# Tuple
# TensorSliceDataset shapes: ((2,), ()), types: (tf.int64, tf.string)
dataset = tf.data.Dataset.from_tensor_slices((x, y))
# Dictionary
# TensorSliceDataset
# shapes: {feature: (2,), label: ()},
# types: {feature: tf.int64, label: tf.string}
# (the integer dtype follows x.dtype, which is NumPy's platform default integer type)
dataset = tf.data.Dataset.from_tensor_slices({'feature': x, 'label': y})
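
To make the first-dimension rule concrete, here is a minimal sketch (not part of the original notes). It inspects the per-element structure through dataset.element_spec and checks that mismatched first dimensions are rejected. The variable names and the choice to catch both ValueError and tf.errors.InvalidArgumentError are assumptions made for illustration, since the exact exception type may differ between TensorFlow versions.

```python
import numpy as np
import tensorflow as tf

x = np.array([[1, 2], [3, 4], [5, 6]])   # first dimension: 3
y = np.array(['cat', 'dog', 'fox'])      # first dimension: 3

# Slicing along the first dimension gives 3 elements,
# each a dict with a (2,) feature and a scalar label.
dataset = tf.data.Dataset.from_tensor_slices({'feature': x, 'label': y})
spec = dataset.element_spec
print(spec['feature'].shape, spec['label'].dtype)
num_elements = sum(1 for _ in dataset)

# Assumption: a first-dimension mismatch (3 vs. 2) is reported as an error
# when the dataset is built; both exception types are caught to be safe.
try:
    tf.data.Dataset.from_tensor_slices((x, y[:2]))
    mismatch_rejected = False
except (ValueError, tf.errors.InvalidArgumentError):
    mismatch_rejected = True
print(num_elements, mismatch_rejected)
```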

2.2 Iterating over the data

List or NumPy array

dataset = tf.data.Dataset.from_tensor_slices(list(range(10)))
for item in dataset:
    print(item)
    print(item.shape, type(item))
    print(item.numpy())
    print()

Here dataset has type class 'tensorflow.python.data.ops.dataset_ops.TensorSliceDataset', and each element in it has type class 'tensorflow.python.framework.ops.EagerTensor'.

The output looks like this:

tf.Tensor(0, shape=(), dtype=int32)
() <class 'tensorflow.python.framework.ops.EagerTensor'>
0

tf.Tensor(1, shape=(), dtype=int32)
() <class 'tensorflow.python.framework.ops.EagerTensor'>
1
.
.
.
tf.Tensor(9, shape=(), dtype=int32)
() <class 'tensorflow.python.framework.ops.EagerTensor'>
9

Tuple

x = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array(['cat', 'dog', 'fox'])
dataset = tf.data.Dataset.from_tensor_slices((x, y))
for item in dataset:
    print(item)
    print("=" * 40)
    break
    
for item_x, item_y in dataset:
    print(item_x)
    print(item_y)
    print("=" * 20)

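Output (the int32 dtype reflects NumPy's default integer type on the platform used here; where np.array yields int64, the tensors are int64):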

(<tf.Tensor: shape=(2,), dtype=int32, numpy=array([1, 2], dtype=int32)>, <tf.Tensor: shape=(), dtype=string, numpy=b'cat'>)
========================================
tf.Tensor([1 2], shape=(2,), dtype=int32)
tf.Tensor(b'cat', shape=(), dtype=string)
====================
tf.Tensor([3 4], shape=(2,), dtype=int32)
tf.Tensor(b'dog', shape=(), dtype=string)
====================
tf.Tensor([5 6], shape=(2,), dtype=int32)
tf.Tensor(b'fox', shape=(), dtype=string)
====================

Dictionary

x = [[1, 2], [3, 4], [5, 6]]
y = ['cat', 'dog', 'fox']
dataset = tf.data.Dataset.from_tensor_slices({'feature': x, 'label': y})
for item in dataset:
    print(item)
    print(item['feature'])
    print(item['label'])
    print("=" * 20)

Output:

{'feature': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([1, 2], dtype=int32)>, 'label': <tf.Tensor: shape=(), dtype=string, numpy=b'cat'>}
tf.Tensor([1 2], shape=(2,), dtype=int32)
tf.Tensor(b'cat', shape=(), dtype=string)
====================
{'feature': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([3, 4], dtype=int32)>, 'label': <tf.Tensor: shape=(), dtype=string, numpy=b'dog'>}
tf.Tensor([3 4], shape=(2,), dtype=int32)
tf.Tensor(b'dog', shape=(), dtype=string)
====================
{'feature': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([5, 6], dtype=int32)>, 'label': <tf.Tensor: shape=(), dtype=string, numpy=b'fox'>}
tf.Tensor([5 6], shape=(2,), dtype=int32)
tf.Tensor(b'fox', shape=(), dtype=string)
====================

2.2 repeat

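repeat repeats the original data a given number of times. When iterating, each element has the same type as before the repeat; there are simply more elements, namely the original count multiplied by the repeat count.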

dataset = tf.data.Dataset.from_tensor_slices(np.arange(10))
dataset = dataset.repeat(3)
print(dataset)
print(type(dataset))

Output:

<RepeatDataset shapes: (), types: tf.int64>
<class 'tensorflow.python.data.ops.dataset_ops.RepeatDataset'>

dataset = tf.data.Dataset.from_tensor_slices({'feature': x, 'label': y})
dataset = dataset.repeat(2)
print(dataset)
print(type(dataset))
for item in dataset:
    print(item['feature'])
    print(item['label'])
    print("=" * 40)

Output:

<RepeatDataset shapes: {feature: (2,), label: ()}, types: {feature: tf.int32, label: tf.string}>
<class 'tensorflow.python.data.ops.dataset_ops.RepeatDataset'>
tf.Tensor([1 2], shape=(2,), dtype=int32)
tf.Tensor(b'cat', shape=(), dtype=string)
========================================
tf.Tensor([3 4], shape=(2,), dtype=int32)
tf.Tensor(b'dog', shape=(), dtype=string)
========================================
tf.Tensor([5 6], shape=(2,), dtype=int32)
tf.Tensor(b'fox', shape=(), dtype=string)
========================================
tf.Tensor([1 2], shape=(2,), dtype=int32)
tf.Tensor(b'cat', shape=(), dtype=string)
========================================
tf.Tensor([3 4], shape=(2,), dtype=int32)
tf.Tensor(b'dog', shape=(), dtype=string)
========================================
tf.Tensor([5 6], shape=(2,), dtype=int32)
tf.Tensor(b'fox', shape=(), dtype=string)
========================================
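
As a quick check of the "original count times repeat count" statement, the following sketch (an illustration, not part of the original notes) counts the elements after repeat(3). It also shows repeat() called without a count, which repeats indefinitely and is therefore bounded here with take; the variable names are chosen for this example only.

```python
import numpy as np
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(np.arange(10))

# repeat(3): the 10 original elements are emitted three times in order.
values = [int(v.numpy()) for v in dataset.repeat(3)]
print(len(values))

# repeat() with no count never ends, so take() limits the iteration.
first_25 = [int(v.numpy()) for v in dataset.repeat().take(25)]
print(first_25)
```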

2.3 batch

dataset = tf.data.Dataset.from_tensor_slices(np.arange(10))
dataset = dataset.repeat(3).batch(7, drop_remainder=True)
print(dataset)
print(type(dataset))
for item in dataset:
    print(item)
    print(item.shape, type(item))
    print(item.numpy())
    print()

Output:

<BatchDataset shapes: (7,), types: tf.int64>
<class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'>
tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int64)
(7,) <class 'tensorflow.python.framework.ops.EagerTensor'>
[0 1 2 3 4 5 6]

tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int64)
(7,) <class 'tensorflow.python.framework.ops.EagerTensor'>
[7 8 9 0 1 2 3]

tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int64)
(7,) <class 'tensorflow.python.framework.ops.EagerTensor'>
[4 5 6 7 8 9 0]

tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int64)
(7,) <class 'tensorflow.python.framework.ops.EagerTensor'>
[1 2 3 4 5 6 7]

dataset = tf.data.Dataset.from_tensor_slices({'feature': x, 'label': y})
dataset = dataset.repeat(2).batch(5)
print(dataset)
print(type(dataset))
for item in dataset:
    print(item['feature'])
    print(item['label'])
    print(type(item), type(item['feature']), type(item['label']))
    print("=" * 40)

Output:

<BatchDataset shapes: {feature: (None, 2), label: (None,)}, types: {feature: tf.int32, label: tf.string}>
<class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'>
tf.Tensor(
[[1 2]
 [3 4]
 [5 6]
 [1 2]
 [3 4]], shape=(5, 2), dtype=int32)
tf.Tensor([b'cat' b'dog' b'fox' b'cat' b'dog'], shape=(5,), dtype=string)
<class 'dict'> <class 'tensorflow.python.framework.ops.EagerTensor'> <class 'tensorflow.python.framework.ops.EagerTensor'>
========================================
tf.Tensor([[5 6]], shape=(1, 2), dtype=int32)
tf.Tensor([b'fox'], shape=(1,), dtype=string)
<class 'dict'> <class 'tensorflow.python.framework.ops.EagerTensor'> <class 'tensorflow.python.framework.ops.EagerTensor'>
========================================
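
The two runs above differ in whether the final, smaller batch is kept, which is also why one dataset reports a static batch dimension of 7 and the other reports None. The sketch below (an illustration with a batch size of 4 chosen for this example, not taken from the original notes) compares batch with and without drop_remainder=True on the same data.

```python
import numpy as np
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(np.arange(10))

keep_partial = dataset.batch(4)                        # last batch may be smaller
drop_partial = dataset.batch(4, drop_remainder=True)   # incomplete last batch is discarded

kept = [b.numpy().tolist() for b in keep_partial]
dropped = [b.numpy().tolist() for b in drop_partial]
print(kept)
print(dropped)

# The static batch dimension is only known when the remainder is dropped.
print(keep_partial.element_spec.shape, drop_partial.element_spec.shape)
```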

2.4 interleave

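interleave: applies a transformation to every element of the existing dataset. Each element yields a new dataset as its result, and interleave merges those results into a single new dataset.
Example case: suppose the existing dataset holds a series of file names. interleave iterates over every file name and reads the content of the corresponding file, so that each file name yields a new dataset; interleave then merges these datasets into one large dataset.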
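
The file-name case can be sketched as follows. This is an illustration under stated assumptions rather than the course's own code: the two small text files, their names (a.txt, b.txt), and their contents are hypothetical and are created in a temporary directory so the example is self-contained.

```python
import os
import tempfile
import tensorflow as tf

# Hypothetical input: two small text files written to a temp directory.
tmp_dir = tempfile.mkdtemp()
for name, content in [("a.txt", ["a1", "a2"]), ("b.txt", ["b1", "b2"])]:
    with open(os.path.join(tmp_dir, name), "w") as f:
        f.write("\n".join(content) + "\n")

# A dataset whose elements are file names (sorted for a fixed order).
filenames = sorted(os.path.join(tmp_dir, n) for n in os.listdir(tmp_dir))
filename_dataset = tf.data.Dataset.from_tensor_slices(filenames)

# Each file name is mapped to a dataset of that file's lines;
# interleave merges the per-file datasets into one dataset.
line_dataset = filename_dataset.interleave(
    lambda path: tf.data.TextLineDataset(path),
    cycle_length=2,
)
lines = [line.numpy().decode("utf-8") for line in line_dataset]
print(lines)
```

With cycle_length=2 and the default block_length of 1, the lines of the two files alternate in the merged dataset.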

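Key parameters:

map_func: the transformation to apply
cycle_length: the degree of parallelism, i.e. how many elements of the dataset are processed concurrently
block_length: how many consecutive elements to take from each transformed result at a time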
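
How the parameters above shape the output order is easiest to see on a tiny in-memory dataset. In this sketch (an illustration, not from the original notes) each input element n becomes a dataset that repeats n four times; the values 2 for cycle_length and block_length are chosen only for this example.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(1, 4)  # elements 1, 2, 3

interleaved = dataset.interleave(
    # map_func: turn each element n into the dataset [n, n, n, n]
    lambda n: tf.data.Dataset.from_tensors(n).repeat(4),
    cycle_length=2,   # two input elements are processed at a time
    block_length=2,   # two consecutive items are taken from each before switching
)
result = [int(v.numpy()) for v in interleaved]
print(result)
```

The elements 1 and 2 are consumed together in blocks of two; element 3 is only opened once one of them is exhausted, so its four items come last.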
