Pandas String Series to int normalisation for Tensor

Posted 2023-03-12

【Asked】: 2018-07-06 19:16:04 【Question】:

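I have a pandas Series object with repeated string values, which I need to normalise to int values for input to TensorFlow.

I have looked into converting it to a Category, but that appears to create a code for every item rather than recognising the repeats.

For example, I would like the following conversion: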

['a', 'b', 'c', 'd', 'a', 'a', 'c'] -> [1, 2, 3, 4, 1, 1, 3]

【Answer 1】:

You need factorize with one small change: its codes start at 0, so add 1 to get the 1-based ids:

import pandas as pd
# factorize returns (codes, uniques); [0] selects the 0-based codes
print((pd.factorize(['a', 'b', 'c', 'd', 'a', 'a', 'c'])[0] + 1).tolist())
[1, 2, 3, 4, 1, 1, 3]
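Since the question starts from a Series rather than a list, here is a minimal sketch of the same idea applied to a Series, which also keeps the `uniques` returned by `pd.factorize` so the ids can be mapped back to the original strings. The variable names (`s`, `ids`, `labels`) are illustrative and not from the original answer.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd', 'a', 'a', 'c'])

# codes: 0-based integer ids in order of first appearance
# uniques: the distinct labels, indexed by code
codes, uniques = pd.factorize(s)

ids = codes + 1                      # shift to 1-based ids, as in the question
labels = uniques[ids - 1].tolist()   # map ids back to the original strings

print(ids.tolist())  # [1, 2, 3, 4, 1, 1, 3]
print(labels)        # ['a', 'b', 'c', 'd', 'a', 'a', 'c']
```

Here `ids` is an integer NumPy array, which is a form that TensorFlow constructors such as `tf.constant` accept. Keeping `uniques` around is optional, but it is likely to be useful when model outputs have to be turned back into labels.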

【Answer 2】:

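After converting to category, you need to take cat.codes. Repeated values share one code; the codes follow the sorted order of the categories and start at 0, hence the +1: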

pd.Series(['a', 'b', 'c', 'd', 'a', 'a', 'c']).astype('category').cat.codes + 1
Out[1407]: 
0    1
1    2
2    3
3    4
4    1
5    1
6    3
dtype: int8
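The two answers agree on the question's example only because its values first appear in alphabetical order. A small sketch with a reordered, made-up input (not from the original thread) shows where they differ: `pd.factorize` numbers values by first appearance, while `cat.codes` numbers them by sorted category order.

```python
import pandas as pd

s = pd.Series(['c', 'a', 'b', 'c', 'a'])

# factorize: ids follow order of first appearance ('c' -> 1, 'a' -> 2, 'b' -> 3)
by_appearance = (pd.factorize(s)[0] + 1).tolist()

# category codes: ids follow sorted category order ('a' -> 1, 'b' -> 2, 'c' -> 3)
by_sorted = (s.astype('category').cat.codes + 1).tolist()

print(by_appearance)  # [1, 2, 3, 1, 2]
print(by_sorted)      # [3, 1, 2, 3, 1]
```

Either numbering works as a model input as long as the same mapping is used consistently; the sorted variant has the advantage of not depending on row order.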
