Getting TypeError: path can be only a single string when trying to stream data in Apache Spark / Databricks

Posted 2023-04-15


Asked: 2021-05-13 15:11:13

Question:

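I am trying to test streaming data in Apache Spark on Databricks.

Streaming with Azure Event Hubs is relatively straightforward, but here I am trying to stream some static data instead.

I first read the static data, stored in a folder named teststream, into the following DataFrame: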

thestream = spark.read.parquet('/mnt/lake/RAW/teststream/')

I then try to read the data in the "teststream" folder by turning it into a streaming query that keeps updating as data arrives, using the following code:

streamingFlights = (spark
              .readStream
              .option("maxFilesPerTrigger", 1) #Treat a sequence of files as a stream by selecting one file at a time
              .csv(thestream)
            )

However, when I run the code above, I get the following error:

TypeError: path can be only a single string

Any ideas on what is causing the error?

Comments:

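You are passing a DataFrame as the csv location, and that does not work. It may help to read the Structured Streaming Programming Guide to get started with Spark Structured Streaming. You can also get Learning Spark, 2nd edition, for free from databricks.com/p/ebook/learning-spark-from-oreilly

@mike, I think you may not be entirely right. I say that because the following code is in a Databricks Academy course:

streamingFlights = (spark
  .readStream
  .schema(schema)
  .option("maxFilesPerTrigger", 1)  # Treat a sequence of files as a stream by selecting one file at a time
  .csv(flightsPath)
)

It lets you read a stream from csv files given as a location.

@Patterson Then there is probably a typo in your question: you call csv(thestream), where thestream is defined with spark.read....

Answer 1: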
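The comments above pin down the cause: the path argument of a streaming reader must be a plain string, while thestream is a DataFrame returned by spark.read.parquet. A minimal plain-Python sketch of that kind of check follows. The helper name require_single_path is hypothetical and is not part of PySpark; it only mirrors the error message quoted in the question.

```python
def require_single_path(path):
    """Return `path` unchanged if it is a str, otherwise raise TypeError.

    Hypothetical helper (not a PySpark API): it imitates the check that
    produces "path can be only a single string" in the question.
    """
    if not isinstance(path, str):
        raise TypeError(
            "path can be only a single string, got " + type(path).__name__
        )
    return path


# A folder path passes the check and can be handed to a reader.
stream_path = require_single_path("/mnt/lake/RAW/teststream/")

# Anything that is not a str (a DataFrame, a list, ...) is rejected.
try:
    require_single_path(["/mnt/lake/RAW/teststream/"])
except TypeError as err:
    rejected_message = str(err)
```

In the question's terms, the argument to pass is the folder string '/mnt/lake/RAW/teststream/', not the variable thestream.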

The following code simulates a file stream by reading one file at a time, in the order in which the files were created.

dataPath = "/mnt/lake/RAW/DummyEventData/"

static = spark.read.parquet(dataPath)
dataSchema = static.schema

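deltaStreamWithTimestampDF = (spark
  .readStream
  .option("maxFilesPerTrigger", 1)  # pick up one file per micro-batch
  .schema(dataSchema)               # streaming file sources require a schema by default
  .parquet(dataPath)                # path string, not a DataFrame; .parquet() sets the format itself, so .format("delta") is not needed
)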
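To make the effect of maxFilesPerTrigger concrete, here is a rough plain-Python illustration of the batching idea: order the files in a folder by time and hand them out a fixed number at a time. It is only a conceptual sketch and not how Spark is implemented. The function name micro_batches is hypothetical, and modification time is used as a stand-in for creation time.

```python
import os
from typing import Iterator, List


def micro_batches(folder: str, max_files_per_trigger: int = 1) -> Iterator[List[str]]:
    """Yield lists of file paths from `folder`, oldest first.

    Hypothetical illustration only: each yielded list plays the role of
    one micro-batch holding at most `max_files_per_trigger` files.
    """
    if max_files_per_trigger < 1:
        raise ValueError("max_files_per_trigger must be at least 1")

    # Keep regular files only and skip sub-directories.
    candidates = [os.path.join(folder, name) for name in os.listdir(folder)]
    files = [p for p in candidates if os.path.isfile(p)]

    # Oldest file first; the path breaks ties so the order is deterministic.
    files.sort(key=lambda p: (os.path.getmtime(p), p))

    # Slice the ordered list into chunks of the requested size.
    for start in range(0, len(files), max_files_per_trigger):
        yield files[start:start + max_files_per_trigger]
```

With max_files_per_trigger=1, a folder holding three files yields three batches of one file each, which is the behaviour the answer relies on to turn a static folder into a simulated stream.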

