How to create the same structure as tf.data.experimental.make_csv_dataset from pandas

Posted 2023-02-16

[Asked] 2021-12-15 01:19:58 [Question]:

tf.data.experimental.make_csv_dataset creates a TF Dataset that is ready for Keras supervised training.

import tensorflow as tf

titanic_file = tf.keras.utils.get_file("titanic_train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")
titanic = tf.data.experimental.make_csv_dataset(
    titanic_file,
    label_name="survived",
    batch_size=1,   # To compare with the head of the CSV
    shuffle=False,  # To compare with the head of the CSV
    header=True,
)
for row in titanic.take(1):  # Take the first batch
    features = row[0]        # Dictionary of feature tensors
    label = row[1]

    for feature, value in features.items():
        print(f"{feature:20s}: {value}")

    print(f"label/survived      : {label}")
-----
sex                 : [b'male']
age                 : [22.]
n_siblings_spouses  : [1]
parch               : [0]
fare                : [7.25]
class               : [b'Third']
deck                : [b'unknown']
embark_town         : [b'Southampton']
alone               : [b'n']
label/survived      : [0]
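The structure printed above can also be read off the dataset's element_spec without iterating. The following is a minimal sketch, assuming a tiny hypothetical CSV (written to a temporary file) in place of the Titanic download; the column subset and the file name tiny.csv are illustrative only.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical three-row CSV standing in for the Titanic file.
csv_text = "survived,age,fare\n0,22.0,7.25\n1,38.0,71.2833\n1,26.0,7.925\n"
csv_path = os.path.join(tempfile.mkdtemp(), "tiny.csv")
with open(csv_path, "w") as f:
    f.write(csv_text)

tiny = tf.data.experimental.make_csv_dataset(
    csv_path,
    batch_size=1,
    label_name="survived",
    shuffle=False,
    num_epochs=1,  # Read the file once instead of repeating forever
)

# Each element is a (features, label) pair: a dict of tensors and one tensor.
feature_spec, label_spec = tiny.element_spec
print(sorted(feature_spec))  # Names of the feature columns
print(label_spec)            # Spec of the label tensor
```

The label is a single tensor spec rather than a dict, which is the structure the pandas-based dataset has to reproduce.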

How can I create the same thing from pandas? I tried the code below, but the label comes out as a dictionary instead of an int32 tensor.

import pandas as pd

df = pd.read_csv(titanic_file)
titanic_from_pandas = tf.data.Dataset.from_tensor_slices((
    dict(df.loc[:, df.columns != 'survived']),
    dict(df.loc[:, ['survived']])   # Wrapping the label in a dict makes the label a dict
))
for row in titanic_from_pandas.batch(1).take(1):  # Take the first batch
    features = row[0]        # Dictionary of feature tensors
    label = row[1]

    for feature, value in features.items():
        print(f"{feature:20s}: {value}")

    print(f"label/survived      : {label}")
---
sex                 : [b'male']
age                 : [22.]
n_siblings_spouses  : [1]
parch               : [0]
fare                : [7.25]
class               : [b'Third']
deck                : [b'unknown']
embark_town         : [b'Southampton']
alone               : [b'n']
label/survived      : {'survived': <tf.Tensor: shape=(1,), dtype=int64, numpy=array([0])>}  <-----

By the way, the data structure expected for Keras supervised training is (features, label), but which document defines it?
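For context, tf.keras.Model.fit accepts a tf.data dataset whose elements are (inputs, targets) tuples. The sketch below is only an illustration under stated assumptions: the four-row DataFrame, the helper to_model_shapes, and the one-layer model are all hypothetical, and only numeric columns are used so that no preprocessing layers are needed.

```python
import pandas as pd
import tensorflow as tf

# Hypothetical numeric-only stand-in for the Titanic data.
df = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.93, 8.05],
})
labels = df.pop("survived")


def to_model_shapes(features, label):
    # Cast to float32 and add a trailing axis so shapes become (batch, 1).
    features = {
        name: tf.expand_dims(tf.cast(value, tf.float32), -1)
        for name, value in features.items()
    }
    return features, tf.expand_dims(tf.cast(label, tf.float32), -1)


# Elements are (dict of feature tensors, label tensor) pairs.
ds = (
    tf.data.Dataset.from_tensor_slices((dict(df), labels))
    .batch(2)
    .map(to_model_shapes)
)

# One named Input per feature so the dict keys can be matched by name.
inputs = {name: tf.keras.Input(shape=(1,), name=name) for name in df.columns}
x = tf.keras.layers.Concatenate()(list(inputs.values()))
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")

history = model.fit(ds, epochs=1, verbose=0)
```

Because the dataset already yields the targets, no separate y argument is passed to fit.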

[Comments]:

Just use df['survived']. You are clearly passing a dict to tf.data.Dataset.from_tensor_slices, so you get a dict back; I don't see where the problem is :P tensorflow.org/api_docs/python/tf/keras/Model#fit defines what should be passed to .fit()

[Answer 1]:

As @Proko suggested, pass the label column as a Series instead of wrapping it in a dict.

titanic_from_pandas = tf.data.Dataset.from_tensor_slices((
    dict(df.loc[:, df.columns != 'survived']),
    df.loc[:, 'survived']    # A Series, so the label is a plain tensor
))
for row in titanic_from_pandas.batch(1).take(1):  # Take the first batch
    features = row[0]        # Dictionary of feature tensors
    label = row[1]

    for feature, value in features.items():
        print(f"{feature:20s}: {value}")

    print(f"label/survived      : {label}")
---
sex                 : [b'male']
age                 : [22.]
n_siblings_spouses  : [1]
parch               : [0]
fare                : [7.25]
class               : [b'Third']
deck                : [b'unknown']
embark_town         : [b'Southampton']
alone               : [b'n']
label/survived      : [0]
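The difference between the two attempts can be checked through element_spec, and the label dtype can be narrowed if the int32 mentioned in the question matters. This is a minimal sketch, assuming a small hypothetical numeric DataFrame rather than the real Titanic file; the astype cast is an optional extra that is not part of the original answer.

```python
import pandas as pd
import tensorflow as tf

# Hypothetical three-row stand-in for the Titanic data.
df = pd.DataFrame({
    "survived": [0, 1, 1],
    "age": [22.0, 38.0, 26.0],
    "fare": [7.25, 71.28, 7.93],
})
features = dict(df.drop(columns="survived"))

# Label wrapped in a dict: each element is (dict, dict).
ds_dict_label = tf.data.Dataset.from_tensor_slices(
    (features, dict(df.loc[:, ["survived"]]))
)

# Label passed as a Series: each element is (dict, tensor).
ds_series_label = tf.data.Dataset.from_tensor_slices(
    (features, df["survived"])
)

# Optional: cast the label so its dtype is int32 rather than the pandas default.
ds_int32_label = tf.data.Dataset.from_tensor_slices(
    (features, df["survived"].astype("int32"))
).batch(1)

print(ds_dict_label.element_spec[1])    # A dict holding one TensorSpec
print(ds_series_label.element_spec[1])  # A single TensorSpec
print(ds_int32_label.element_spec[1])   # A single TensorSpec with dtype int32
```

The feature columns could be cast the same way if their dtypes also need to be narrowed.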
