Is there an alternative to pyspark.ml.feature StringIndexer in Python using pandas or numpy?

Posted 2023-04-15

Asked: 2018-05-02 23:45:30

Question:

StringIndexer encodes a string column of labels into a column of label indices.

id | category | categoryIndex
----|----------|---------------
 0  | a        | 0.0
 1  | b        | 2.0
 2  | c        | 1.0
 3  | a        | 0.0
 4  | a        | 0.0
 5  | c        | 1.0

How can I do this in Python without using pyspark.ml.feature StringIndexer?

Comments:

Why is a = 0 and c = 1? Yes, you can use pd.factorize.

Answer 1:

Since you mentioned pandas, try ngroup:

df.groupby('category').ngroup()
Out[564]: 
0    0
1    1
2    2
3    0
4    0
5    2
dtype: int64
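
Note that ngroup numbers the groups in sorted key order (a = 0, b = 1, c = 2), which differs from the table in the question, where c = 1 and b = 2. That table is consistent with StringIndexer's default ordering by descending label frequency: a appears three times, c twice, b once. The sketch below is one possible way to get both behaviours in pandas: pd.factorize for order of first appearance, and a frequency-based mapping that reproduces the table. The column names factorizeIndex and categoryIndex, and the alphabetical tie-break between equally frequent labels, are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({"category": ["a", "b", "c", "a", "a", "c"]})

# pd.factorize numbers labels in order of first appearance: a=0, b=1, c=2.
codes, uniques = pd.factorize(df["category"])
df["factorizeIndex"] = codes

# Frequency-based indexing: the most frequent label gets index 0.
counts = df["category"].value_counts()

# Sort by count descending, then by label (assumed tie-break) so the
# result is deterministic when two labels are equally frequent.
ordered = sorted(counts.index, key=lambda label: (-counts[label], label))

# Floats are used only to match the 0.0 / 1.0 / 2.0 values in the table.
mapping = {label: float(i) for i, label in enumerate(ordered)}
df["categoryIndex"] = df["category"].map(mapping)

print(df)
```

Keeping the mapping dictionary around lets you apply the same encoding to new data later with another map call; labels not present in the mapping come out as NaN.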
