PySpark: replace string by numerical values
Posted: 2018-04-10 13:33:34
Question:
I have the following table:
+----------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+
| _created| _updated| name| description| indication| name| patents_patent|
+----------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+
|2005-06-13|2016-08-17| Lepirudin|Lepirudin is iden...|For the treatment...| Lepirudin|"data" : ["coun...|
|2005-06-13|2017-04-27| Cetuximab|Cetuximab is an e...|Cetuximab, used i...| Cetuximab|"data" : ["coun...|
|2005-06-13|2017-06-14| Dornase alfa|Dornase alfa is a...|Used as adjunct t...| Dornase alfa|"data" : ["coun...|
|2005-06-13|2016-08-17| Denileukin diftitox|A recombinant DNA...|For treatment of ...| Denileukin diftitox| NULL|
|2005-06-13|2017-03-10| Etanercept|Dimeric fusion pr...|Etanercept is ind...| Etanercept|"data" : ["coun...|
|2005-06-13|2017-07-06| Bivalirudin|Bivalirudin is a ...|For treatment of ...| Bivalirudin|"data" : ["coun...|
|2005-06-13|2017-07-05| Leuprolide|Leuprolide belong...|For treatment of ...| Leuprolide|"data" : ["coun...|
|2005-06-13|2017-06-16|Peginterferon alf...|Peginterferon alf...|Peginterferon alf...|Peginterferon alf...|"data" : ["coun...|
|2005-06-13|2017-06-08| Alteplase|Human tissue plas...|For management of...| Alteplase| NULL|
|2005-06-13|2016-12-08| Sermorelin|Sermorelin acetat...|For the treatment...| Sermorelin| NULL|
|2005-06-13|2016-08-17| Interferon alfa-n1|Purified, natural...|For treatment of ...| Interferon alfa-n1| NULL|
Ideally, I need to derive two tables:
table_one: keep only the rows where patents_patent is not NULL, and replace the string in patents_patent with 1:
+----------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+
| _created| _updated| name| description| indication| name| patents_patent|
+----------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+
|2005-06-13|2016-08-17| Lepirudin|Lepirudin is iden...|For the treatment...| Lepirudin|1|
|2005-06-13|2017-04-27| Cetuximab|Cetuximab is an e...|Cetuximab, used i...| Cetuximab|1|
|2005-06-13|2017-06-14| Dornase alfa|Dornase alfa is a...|Used as adjunct t...| Dornase alfa|1|
|2005-06-13|2017-03-10| Etanercept|Dimeric fusion pr...|Etanercept is ind...| Etanercept|1|
|2005-06-13|2017-07-06| Bivalirudin|Bivalirudin is a ...|For treatment of ...| Bivalirudin|1|
|2005-06-13|2017-07-05| Leuprolide|Leuprolide belong...|For treatment of ...| Leuprolide|1|
|2005-06-13|2017-06-16|Peginterferon alf...|Peginterferon alf...|Peginterferon alf...|Peginterferon alf...|1|
table_two: keep only the rows where patents_patent is NULL, and replace the NULL with 0:
+----------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+
| _created| _updated| name| description| indication| name| patents_patent|
+----------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+
|2005-06-13|2016-08-17| Denileukin diftitox|A recombinant DNA...|For treatment of ...| Denileukin diftitox| 0|
|2005-06-13|2017-06-08| Alteplase|Human tissue plas...|For management of...| Alteplase| 0|
|2005-06-13|2016-12-08| Sermorelin|Sermorelin acetat...|For the treatment...| Sermorelin| 0|
|2005-06-13|2016-08-17| Interferon alfa-n1|Purified, natural...|For treatment of ...| Interferon alfa-n1| 0|
I tried this:
from pyspark.sql.functions import col, expr, when
data = table.where(col("patents_patent").isNull())
data = table.filter("patents_patent is not NULL")
The result is either wrong or empty. The schema is:
root
|-- _created: string (nullable = true)
|-- _updated: string (nullable = true)
|-- name: string (nullable = true)
|-- description: string (nullable = true)
|-- indication: string (nullable = true)
|-- patents_patent: string (nullable = true)
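Thanks for your help!
Comments:
Can you show what you tried and where the problem is?
I updated the question.
What type is patents_patent? Can you share the schema?
Are the values actually null, or the string "NULL" as shown in your example?
The string is NULL, as shown in the example.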
Answer 1:
"table_one: keep only the rows where patents_patent is not NULL, and replace the string in patents_patent with 1"
For the first case, you should do the following:
data_not_null = table.filter((table['patents_patent'] != "NULL")).withColumn('patents_patent', f.lit("1"))
"table_two: keep only the rows where patents_patent is NULL, and replace the NULL with 0"
For the second case, you should do the following:
data_null = table.where(f.col("patents_patent").isNull() | (table['patents_patent'] == "NULL")).withColumn('patents_patent', f.lit("0"))
Both snippets assume the following import:
from pyspark.sql import functions as f
Of course, f.col("patents_patent") and table['patents_patent'] refer to the same column.