PySpark: replace values below a count threshold with another value

Posted 2023-04-15

Asked: 2017-07-21 15:35:33

Question:

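I am trying to replace every value whose count falls below a set threshold with another value, using Spark 2.1.0, the PySpark API, and DataFrames.

When I test the function on a sample (df1 below), it works, but it fails on my real data. In both cases the columns have the string dtype. With df1, I run the 'cat' and 'integer' columns through a for loop, which is exactly what I want to do with the real DataFrame, and it works perfectly.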

df1 = spark.createDataFrame([
    (0, "a","1"),
    (1, "b","1"),
    (2, "c","1"),
    (3, "a","1"),
    (4, "a","2"),
    (5, "c","2"),
    (6,"b","1"),
    (7,"b","1"),
], ["id", "cat","integer"])

def cutoff(df,feat,threshold, otherclass='other'):
    if isinstance(threshold, float) and threshold<1:
        threshold = str(int(threshold*df.count()))

    if isinstance(threshold,int):
        threshold = str(threshold)

    temp = df.groupBy(feat).count().orderBy('count')
    replace = temp.filter("count<"+threshold).select(feat).rdd.map(lambda r:r[0]).collect()
    print "replacing ", replace,replace.__class__, " with ", otherclass, " subset ", feat
    df = df.replace(replace,otherclass,feat)

    return df

But when I use the real data (imported from Hive via SQL), I get:

mydata (just a part):
+---------------+-----+
|   browser_name|count|
+---------------+-----+
|         Chrome| 2197|
|             IE|  719|
|        Firefox|  542|
|  Mobile Safari|  370|
|android Browser|  361|
|           Edge|  265|
| Chrome WebView|  203|

replacing  [u'Iron', u'UCBrowser', u'Puffin', u'Opera Mini', u'Yandex', u'Maxthon', u'Silk', u'Vivaldi', None, u'MIUI Browser', u'Chromium', u'WebKit', u'IEMobile', u'Facebook', u'Chrome WebView', u'Safari', u'Opera', u'Android Browser', u'Mobile Safari', u'Edge', u'IE', u'Firefox'] <type 'list'>  with  other  subset  browser_name


Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-1704248642413819893.py", line 267, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-1704248642413819893.py", line 265, in <module>
    exec(code)
  File "<stdin>", line 12, in <module>
  File "<stdin>", line 10, in cutoff
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 1345, in replace
    self._jdf.na().replace(self._jseq(subset), self._jmap(rep_dict)), self.sql_ctx)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o4471.replace.
: scala.MatchError: null
    at org.apache.spark.sql.DataFrameNaFunctions.replace0(DataFrameNaFunctions.scala:351)
    at org.apache.spark.sql.DataFrameNaFunctions.replace(DataFrameNaFunctions.scala:336)
    at sun.reflect.GeneratedMethodAccessor359.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)

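So it is complaining about the replace function. But when I just run df.replace(['Chrome','Firefox'],'hovno','browser_name').show(10), it works like a charm again (I even tried passing the list as unicode, the way the function generates it, and that was fine). So I wonder what the function is doing to my DataFrame that makes it fail. As I understand it, MatchError means it cannot find the source values to replace, but they are definitely there.

Thanks a million!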


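Answer 1:

The problem was that my DataFrame contained None values, which replace cannot match; the printed list of values to replace includes None, and the error is scala.MatchError: null. As a workaround, I first use fillna to replace the None values with something else, and then it works like a charm.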
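A minimal sketch of that workaround, assuming the column is a string column as in the question. The helper resolve_threshold, the null_token parameter, and its default "__missing__" are names invented here for illustration; they do not come from the original post. The function needs no imports because it only calls methods on the DataFrame it is given.

```python
def resolve_threshold(threshold, total_rows):
    """Turn a fractional threshold (below 1) into an absolute row count."""
    if isinstance(threshold, float) and threshold < 1:
        return int(threshold * total_rows)
    return int(threshold)


def cutoff(df, feat, threshold, otherclass="other", null_token="__missing__"):
    # null_token is a hypothetical placeholder; pick one that cannot
    # collide with a real value in the column.
    limit = resolve_threshold(threshold, df.count())

    # Fill nulls first so that None never ends up in the list passed
    # to DataFrame.replace (fillna with a string only touches string columns).
    filled = df.fillna(null_token, subset=[feat])

    # Collect the values that occur fewer than `limit` times.
    counts = filled.groupBy(feat).count()
    rare = [row[0] for row in
            counts.filter(counts["count"] < limit).select(feat).collect()]

    # Nothing is rare enough: return the filled DataFrame unchanged.
    if not rare:
        return filled
    return filled.replace(rare, otherclass, feat)
```

Called as cutoff(df, 'browser_name', 0.05), this would fold rows that were null into otherclass if they are rare; otherwise they keep the placeholder value. To confirm that nulls are the cause before applying it, df.filter(df['browser_name'].isNull()).count() shows how many rows are affected.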
