PySpark: replace string by numerical values

Posted: 2018-04-10 13:33:34

Question:

I have the following table:

    +----------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+
    |  _created|  _updated|                name|         description|          indication|                name|      patents_patent|
    +----------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+
    |2005-06-13|2016-08-17|           Lepirudin|Lepirudin is iden...|For the treatment...|           Lepirudin|"data" : ["coun...|
    |2005-06-13|2017-04-27|           Cetuximab|Cetuximab is an e...|Cetuximab, used i...|           Cetuximab|"data" : ["coun...|
    |2005-06-13|2017-06-14|        Dornase alfa|Dornase alfa is a...|Used as adjunct t...|        Dornase alfa|"data" : ["coun...|
    |2005-06-13|2016-08-17| Denileukin diftitox|A recombinant DNA...|For treatment of ...| Denileukin diftitox|                NULL|
    |2005-06-13|2017-03-10|          Etanercept|Dimeric fusion pr...|Etanercept is ind...|          Etanercept|"data" : ["coun...|
    |2005-06-13|2017-07-06|         Bivalirudin|Bivalirudin is a ...|For treatment of ...|         Bivalirudin|"data" : ["coun...|
    |2005-06-13|2017-07-05|          Leuprolide|Leuprolide belong...|For treatment of ...|          Leuprolide|"data" : ["coun...|
    |2005-06-13|2017-06-16|Peginterferon alf...|Peginterferon alf...|Peginterferon alf...|Peginterferon alf...|"data" : ["coun...|
    |2005-06-13|2017-06-08|           Alteplase|Human tissue plas...|For management of...|           Alteplase|                NULL|
    |2005-06-13|2016-12-08|          Sermorelin|Sermorelin acetat...|For the treatment...|          Sermorelin|                NULL|
    |2005-06-13|2016-08-17|  Interferon alfa-n1|Purified, natural...|For treatment of ...|  Interferon alfa-n1|                NULL|

Ideally, I need to derive two tables:

table_one: the rows where patents_patent is not NULL, with the string in patents_patent replaced by 1:

    +----------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+
    |  _created|  _updated|                name|         description|          indication|                name|      patents_patent|
    +----------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+
    |2005-06-13|2016-08-17|           Lepirudin|Lepirudin is iden...|For the treatment...|           Lepirudin|                   1|
    |2005-06-13|2017-04-27|           Cetuximab|Cetuximab is an e...|Cetuximab, used i...|           Cetuximab|                   1|
    |2005-06-13|2017-06-14|        Dornase alfa|Dornase alfa is a...|Used as adjunct t...|        Dornase alfa|                   1|
    |2005-06-13|2017-03-10|          Etanercept|Dimeric fusion pr...|Etanercept is ind...|          Etanercept|                   1|
    |2005-06-13|2017-07-06|         Bivalirudin|Bivalirudin is a ...|For treatment of ...|         Bivalirudin|                   1|
    |2005-06-13|2017-07-05|          Leuprolide|Leuprolide belong...|For treatment of ...|          Leuprolide|                   1|
    |2005-06-13|2017-06-16|Peginterferon alf...|Peginterferon alf...|Peginterferon alf...|Peginterferon alf...|                   1|

table_two: the rows where patents_patent is NULL, with the NULL replaced by 0:

    +----------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+
    |  _created|  _updated|                name|         description|          indication|                name|      patents_patent|
    +----------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+
    |2005-06-13|2016-08-17| Denileukin diftitox|A recombinant DNA...|For treatment of ...| Denileukin diftitox|                   0|
    |2005-06-13|2017-06-08|           Alteplase|Human tissue plas...|For management of...|           Alteplase|                   0|
    |2005-06-13|2016-12-08|          Sermorelin|Sermorelin acetat...|For the treatment...|          Sermorelin|                   0|
    |2005-06-13|2016-08-17|  Interferon alfa-n1|Purified, natural...|For treatment of ...|  Interferon alfa-n1|                   0|

I tried this:

from pyspark.sql.functions import col, expr, when

data = table.where(col("patents_patent").isNull())

data = table.filter("patents_patent is not NULL")

The result is either wrong or empty. Here is the schema:

root
 |-- _created: string (nullable = true)
 |-- _updated: string (nullable = true)
 |-- name: string (nullable = true)
 |-- description: string (nullable = true)
 |-- indication: string (nullable = true)
 |-- patents_patent: string (nullable = true)

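Thanks for your help!

Comments:

Can you show what you tried and where the problem is?

I updated the question.

What type is patents_patent? Can you share the schema?

Are the values actually null, or are they the string "NULL" as shown in your example?

The string is NULL, as shown in the example.

Answer 1:

table_one: the rows where patents_patent is not NULL, with the string in patents_patent replaced by 1:

For the first case, you can do the following: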

data_not_null = table.filter((table['patents_patent'] != "NULL")).withColumn('patents_patent', f.lit("1"))

table_two: the rows where patents_patent is NULL, with the NULL replaced by 0:

For the second case, you can do the following:

data_null = table.where(f.col("patents_patent").isNull() | (table['patents_patent'] == "NULL")).withColumn('patents_patent', f.lit("0"))

For these I used the following import:

from pyspark.sql import functions as f

Of course, f.col("patents_patent") and table['patents_patent'] refer to the same column here.
