Converting TFRecords and tf.Examples to commonly used data types

Posted 2023-03-11


【Title】: Converting TFRecords and tf.Examples to commonly used data types 【Published】: 2020-11-27 11:31:00 【Question】:

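I am learning to build TensorFlow Extended pipelines and find them very useful. However, I have not figured out how to debug and test the (tabular) data that flows through these pipelines. I know that TensorFlow uses TFRecords/tf.Examples, which are protobufs.

These can be made human-readable by using TFRecordDataset and tf.Example's ParseFromString. Still, this format is hard to read.

How do I actually test the data? I feel like I need a pandas DataFrame. And since we have 100+ columns and different use cases, I can hardly define all the columns every time I want to do this. Can I somehow use my schema? Thanks!

Edit: I will accept @TheEngineer's answer, as it gave me the key hint on how to achieve what I wanted. Still, I would like to share my solution.

Disclaimer: I use this code only to test and to see what is happening in my pipeline. Be careful when using it in production. There may be better and safer ways.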

import sys

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_data_validation as tfdv

# Our default values for missing values within the tfrecord. We'll restore them later
STR_NA_VALUE = "NA"
INT_NA_VALUE = -sys.maxsize - 1
FLOAT_NA_VALUE = float("nan")

# Create a dict containing FixedLenFeatures using our schema
def load_schema_as_feature_dict(schema_path):
    schema = tfdv.load_schema_text(schema_path)

    def convert_feature(feature):
        # feature.type is the schema's FeatureType enum: 1 = BYTES, 2 = INT, 3 = FLOAT
        if feature.type == 1:
            return tf.io.FixedLenFeature((), tf.string, STR_NA_VALUE)
        if feature.type == 2:
            return tf.io.FixedLenFeature((), tf.int64, INT_NA_VALUE)
        if feature.type == 3:
            return tf.io.FixedLenFeature((), tf.float32, FLOAT_NA_VALUE)
        raise ValueError("Non-implemented type {}".format(feature.type))

    return dict((feature.name, convert_feature(feature)) for feature in schema.feature)  

def as_pandas_frame(tfrecord_path, schema_path):
    feature_dict = load_schema_as_feature_dict(schema_path)
    dataset = tf.data.TFRecordDataset(tfrecord_path, compression_type="GZIP")
    parsed_dataset = dataset.map(lambda serialized_example: tf.io.parse_single_example(serialized_example, feature_dict))
    df = pd.DataFrame(list(parsed_dataset.as_numpy_iterator()))
    
    # Restore NA values from the default values we had to set.
    # Parsed strings are bytes (object dtype); floats were parsed as tf.float32.
    na_values = {object: str.encode(STR_NA_VALUE), np.int64: INT_NA_VALUE, np.float32: FLOAT_NA_VALUE}
    for key, value in na_values.items():
        type_columns = df.select_dtypes(include=[key]).columns
        df[type_columns] = df[type_columns].replace({value: None})
    
    return df

Now you can simply call this function with your stored tfrecord and schema.pbtxt files:

df = as_pandas_frame("path/to/your/tfrecord.gz", "path/to/your/schema.pbtxt")
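
The NA restoration step in as_pandas_frame can also be done row by row, before the DataFrame is built. The following is only a minimal sketch under a few assumptions: restore_missing is a hypothetical helper, the sample rows and their feature names are made up, and the sentinels are the same defaults as above (with the string sentinel already encoded as bytes, which is how parsed string features arrive).

```python
import sys

# Sentinels matching the parsing defaults above (string sentinel as bytes).
STR_NA_VALUE = b"NA"
INT_NA_VALUE = -sys.maxsize - 1


def restore_missing(row):
    """Return a copy of one parsed row with sentinel defaults turned back into None."""
    restored = {}
    for name, value in row.items():
        if isinstance(value, bytes):
            # String features arrive as bytes: decode real values, drop the sentinel.
            restored[name] = None if value == STR_NA_VALUE else value.decode("utf-8")
        elif value != value:
            # NaN is the only value that is not equal to itself.
            restored[name] = None
        elif value == INT_NA_VALUE:
            restored[name] = None
        else:
            restored[name] = value
    return restored


# Made-up rows in the shape that as_numpy_iterator() would yield for scalar features.
rows = [
    {"city": b"Berlin", "age": 42, "score": 0.5},
    {"city": b"NA", "age": INT_NA_VALUE, "score": float("nan")},
]
restored_rows = [restore_missing(row) for row in rows]
```

The restored rows can then be passed to pd.DataFrame(restored_rows). One caveat of this sketch: a genuine value that happens to equal a sentinel (the string "NA", or the minimum 64-bit integer) is also treated as missing.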


【Answer 1】:

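I am not sure what you mean by TFRecordDataset being hard to read, but here is an example of how I work with my TFRecord data. feature_description contains the features (and their data types) that each sample in the TFRecord has. Once the records are loaded this way, you can do all sorts of things with them, including batching, augmenting, shuffling in the pipeline, accessing individual files, or converting them to NumPy.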

import tensorflow as tf
import numpy as np
from PIL import Image

filenames = []
for i in range(128):
    name = "./../result/validation-%.5d-of-%.5d" % (i, 128)
    filenames.append(name)

def read_tfrecord(serialized_example):
    # Names and dtypes of the features stored in each record
    feature_description = {
        'image/height': tf.io.FixedLenFeature((), tf.int64),
        'image/width': tf.io.FixedLenFeature((), tf.int64),
        'image/colorspace': tf.io.FixedLenFeature((), tf.string),
        'image/channels': tf.io.FixedLenFeature((), tf.int64),
        'image/class/label': tf.io.FixedLenFeature((), tf.int64),
        'image/encoded': tf.io.FixedLenFeature((), tf.string),
    }

    parsed_features = tf.io.parse_single_example(serialized_example, feature_description)

    parsed_features['image/encoded'] = tf.io.decode_jpeg(
            parsed_features['image/encoded'], channels=3)

    return parsed_features



data = tf.data.TFRecordDataset(filenames)


parsed_dataset = data.shuffle(128).map(read_tfrecord).batch(128)


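for sample in parsed_dataset.take(1):
    # Each element is a whole batch, so pick its first image before handing it to PIL.
    # Batching decoded images only works if they all have the same shape.
    numpyed = sample['image/encoded'].numpy()[0]
    img = Image.fromarray(numpyed, 'RGB')
    img.show()
    tf.print(sample['image/class/label'][0])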

【Comments】:

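What I meant by "hard to read" is that I only get output in the features { feature { ... } } format and cannot convert it into a pandas DataFrame. Also, with 100+ features, I cannot use a feature description like the one you showed. I have edited my question to show how I extract FixedLenFeatures from my schema for tabular data. Thanks!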
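
For the case raised in this comment, where writing out a feature description for 100+ columns is not practical, each tf.Example can also be flattened into a plain dict without any schema. This is only a minimal sketch, not the method from the question or the answer: example_to_dict is a hypothetical helper, the feature names are made up, and the sample record is written to a temporary file only so that the snippet is self-contained.

```python
import os
import tempfile

import tensorflow as tf


def example_to_dict(serialized):
    """Flatten one serialized tf.Example into a plain {name: value} dict."""
    example = tf.train.Example()
    example.ParseFromString(serialized)
    row = {}
    for name, feature in example.features.feature.items():
        # The oneof field tells us which list is set:
        # "bytes_list", "int64_list" or "float_list" (None if the feature is empty).
        kind = feature.WhichOneof("kind")
        values = list(getattr(feature, kind).value) if kind else []
        # Unwrap single values, keep longer lists, use None for empty features.
        row[name] = values[0] if len(values) == 1 else (values or None)
    return row


# Write one small record with made-up feature names.
path = os.path.join(tempfile.mkdtemp(), "sample.tfrecord")
example = tf.train.Example(features=tf.train.Features(feature={
    "city": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"Berlin"])),
    "age": tf.train.Feature(int64_list=tf.train.Int64List(value=[42])),
    "score": tf.train.Feature(float_list=tf.train.FloatList(value=[0.5])),
}))
with tf.io.TFRecordWriter(path) as writer:
    writer.write(example.SerializeToString())

# Read the file back and flatten every record.
rows = [example_to_dict(raw.numpy()) for raw in tf.data.TFRecordDataset(path)]
```

A list of such dicts can be handed to pd.DataFrame(rows), and features missing from a record simply become missing values in that column. For a GZIP-compressed file like the one in the question, compression_type="GZIP" would need to be passed to TFRecordDataset.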
