regexp_replace in a PySpark DataFrame
【Title】: regexp_replace in Pyspark dataframe
【Posted】: 2020-07-02 14:57:41
【Question】: I ran regexp_replace on a PySpark DataFrame, and afterwards the data type of every column had changed to string. Why does this happen?
Below is the schema before I used regexp_replace:
root
|-- account_id: long (nullable = true)
|-- credit_card_limit: long (nullable = true)
|-- credit_card_number: long (nullable = true)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- phone_number: long (nullable = true)
|-- amount: long (nullable = true)
|-- date: string (nullable = true)
|-- shop: string (nullable = true)
|-- transaction_code: string (nullable = true)
Schema after applying regexp_replace:
root
|-- date_type: date (nullable = true)
|-- c_phone_number: string (nullable = true)
|-- c_account_id: string (nullable = true)
|-- c_credit_card_limit: string (nullable = true)
|-- c_credit_card_number: string (nullable = true)
|-- c_amount: string (nullable = true)
|-- c_full_name: string (nullable = true)
|-- c_transaction_code: string (nullable = true)
|-- c_shop: string (nullable = true)
The code I used:
from pyspark.sql.functions import regexp_replace

# Strip unwanted characters from each column, then drop the original column.
df = df.withColumn('c_phone_number', regexp_replace("phone_number", "[^0-9]", "")).drop('phone_number')
df = df.withColumn('c_account_id', regexp_replace("account_id", "[^0-9]", "")).drop('account_id')
df = df.withColumn('c_credit_card_limit', regexp_replace("credit_card_limit", "[^0-9]", "")).drop('credit_card_limit')
df = df.withColumn('c_credit_card_number', regexp_replace("credit_card_number", "[^0-9]", "")).drop('credit_card_number')
df = df.withColumn('c_amount', regexp_replace("amount", "[^0-9 ]", "")).drop('amount')
df = df.withColumn('c_full_name', regexp_replace("full_name", "[^a-zA-Z ]", "")).drop('full_name')
df = df.withColumn('c_transaction_code', regexp_replace("transaction_code", "[^a-zA-Z]", "")).drop('transaction_code')
df = df.withColumn('c_shop', regexp_replace("shop", "[^a-zA-Z ]", "")).drop('shop')
Why does this happen? Is there a way to convert the columns back to their original data types, or do I have to use cast again?
【Answer 1】: You may want to look at the regexp_replace code in the Spark git repository:
override def nullSafeEval(s: Any, p: Any, r: Any): Any = {
  if (!p.equals(lastRegex)) {
    // regex value changed
    lastRegex = p.asInstanceOf[UTF8String].clone()
    pattern = Pattern.compile(lastRegex.toString)
  }
  if (!r.equals(lastReplacementInUTF8)) {
    // replacement string changed
    lastReplacementInUTF8 = r.asInstanceOf[UTF8String].clone()
    lastReplacement = lastReplacementInUTF8.toString
  }
  val m = pattern.matcher(s.toString())
  result.delete(0, result.length())
  while (m.find) {
    m.appendReplacement(result, lastReplacement)
  }
  m.appendTail(result)
  UTF8String.fromString(result.toString)
}
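The code above accepts each argument as Any, calls toString() on the input value, and finally converts the result back into a string with UTF8String.fromString(result.toString). The output of regexp_replace is therefore always a string column, whatever the type of the input column, so a numeric column has to be cast back explicitly (for example with .cast("long")) to recover its original type.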
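To make the type change concrete, here is a minimal plain-Python sketch of the same idea. It is an analogue of the Scala logic, not Spark code: `regexp_replace_like` is a hypothetical helper invented for this illustration, and Python's `re.sub` stands in for the Java `Matcher` loop. It only aims to show that once the input has been turned into text, the result stays text until it is converted back explicitly.

```python
import re


def regexp_replace_like(value, pattern, replacement):
    """Plain-Python analogue of the Scala logic (hypothetical helper, not a Spark API)."""
    # Mirrors s.toString(): the input becomes text before any matching happens.
    text = str(value)
    # Mirrors the find/appendReplacement/appendTail loop: replace every match.
    return re.sub(pattern, replacement, text)


phone_number = 5551234567  # an integer, standing in for a value from a long column
cleaned = regexp_replace_like(phone_number, r"[^0-9]", "")

# The cleaned value is text, even though the input was a number.
print(type(cleaned).__name__, cleaned)

# An explicit conversion is needed to get a number back, analogous to a cast.
restored = int(cleaned)
print(type(restored).__name__, restored)
```

In PySpark the corresponding step would be to cast the cleaned column, along the lines of `regexp_replace("phone_number", "[^0-9]", "").cast("long")`; whether that is appropriate depends on the data, since values that are not valid numbers after cleaning cannot be converted.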
Reference: spark-git