Is there a way to join the resulting stream of a groupBy back to its original stream in Kafka-Spark structured streaming?

Posted: 2019-04-15 08:34:56

I am reading a stream from a Kafka topic and performing a windowed groupBy operation on event time. Now I want to join the stream resulting from the groupBy back to the original stream.
# I am reading from a Kafka topic. Each message value is a ping output line:
2019-04-15 13:32:33 | 64 bytes from google.com (X.X.X.X): icmp_seq=476 ttl=63 time=0.201ms
2019-04-15 13:32:34 | 64 bytes from google.com (X.X.X.X): icmp_seq=477 ttl=63 time=0.216ms
2019-04-15 13:32:35 | 64 bytes from google.com (X.X.X.X): icmp_seq=478 ttl=63 time=0.245ms
2019-04-15 13:32:36 | 64 bytes from google.com (X.X.X.X): icmp_seq=479 ttl=63 time=0.202ms
and so on...
root
 |-- key: binary
 |-- value: binary
 |-- topic: string
 |-- partition: integer
 |-- offset: long
 |-- timestamp: timestamp
 |-- timestampType: integer
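For context, `lines` below is the stream read from Kafka that produces the schema above. A minimal sketch of how it could be created, assuming a local broker and a topic name that are placeholders (neither appears in the question):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("ping-stream").getOrCreate()

# Hypothetical broker address and topic name; substitute your own.
lines = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "ping-topic") \
    .load()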
# The value column contains the ping line above, so I cast it to a string.
words = lines.selectExpr("CAST(value AS STRING)")
# Now I split that line into columns.
words = words \
    .withColumn('string_val', F.regexp_replace(F.split(words.value, " ")[6], ":", "")) \
    .withColumn('ping', F.regexp_replace(F.split(words.value, " ")[10], "time=|ms", "").cast("double")) \
    .withColumn('date', F.split(words.value, " ")[0]) \
    .withColumn('time', F.regexp_replace(F.split(words.value, " ")[1], "\\|", ""))
# The pattern "time=|ms" strips both the prefix and the unit; leaving "ms" in
# place would make the cast to double return null. '|' is a regex
# metacharacter, so it is escaped as "\\|" when cleaning the time column.
words = words.withColumn('date_time', F.concat(F.col('date'), F.lit(" "), F.col('time')))
words = words.withColumn('date_format', F.col('date_time').cast('timestamp'))
# Now the schema looks like this:
root
 |-- value: string
 |-- string_val: string
 |-- ping: double
 |-- date: string
 |-- time: string
 |-- date_time: string
 |-- date_format: timestamp
# Now I perform a windowed groupBy with a watermark.
w = F.window('date_format', '30 seconds', '10 seconds')
words = words \
    .withWatermark('date_format', '1 minutes') \
    .groupBy(w).agg(F.mean('ping').alias('value'))
# Now my schema becomes:
root
 |-- window: struct
 |    |-- start: timestamp
 |    |-- end: timestamp
 |-- value: double
Is there a way to join this resulting stream back to its original stream?
Answer 1:

This can be achieved with the stream-to-stream joins introduced in Spark 2.3. On any version before Spark 2.3, you would have to persist the aggregates to some store (in memory or on disk) and perform a left outer join of the original stream against the store holding the aggregation state.
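A minimal sketch of what that stream-to-stream join could look like on Spark 2.3+, using the question's streams. The names `raw`, `agg`, and `avg_ping` are illustrative, and the join bounds mirror the question's 30-second windows. This assumes your Spark version accepts a join whose input contains a streaming aggregation; versions that reject it as unsupported force the store-based fallback described above.

# Sketch only. Assume `raw` is the parsed stream from the question, saved
# before the groupBy reassigns `words` to the aggregate.
agg = raw \
    .withWatermark('date_format', '1 minutes') \
    .groupBy(F.window('date_format', '30 seconds', '10 seconds')) \
    .agg(F.mean('ping').alias('avg_ping')) \
    .select(F.col('window.start').alias('win_start'),
            F.col('window.end').alias('win_end'),
            'avg_ping')

# Each raw record joins to every window that contains its event time.
joined = raw.withWatermark('date_format', '1 minutes') \
    .join(agg, F.expr("date_format >= win_start AND date_format < win_end"))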