Spark tests failing when running with sbt test

Posted 2023-03-24


【Title】Spark tests failing when running with sbt test 【Posted】2019-11-04 02:56:38 【Question】:

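We have written unit tests for Spark that run in local mode with 4 threads.

When launched one at a time, for example from IntelliJ or with sbt testOnly, each test runs fine.

When launched with sbt test, they fail with errors like: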

[info] java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.execution.datasources.csv.CSVFileFormat not a subtype

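We upgraded sbt and Spark to the latest versions specifically for this, and tried running with fork in test := true in build.sbt, but that did not help.

Spark is version 2.4.3, sbt is 1.2.8, and Scala is 2.12.8.

The sbt configuration is nothing special: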

libraryDependencies ++= Seq(
  Dependencies.Test.concordion,
  Dependencies.`spark-sql` exclude("org.slf4j","slf4j-log4j12"),
  Dependencies.`better-files`
)

fork in test := true


dependencyOverrides += "com.google.guava" % "guava" % "11.0.2" 
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.6.7.1"

We are using an sbt build with several different subprojects, defined like this:

scalacOptions in ThisBuild ++= Seq(
  "-encoding", "UTF-8", // source files are in UTF-8
  "-deprecation", // warn about use of deprecated APIs
  "-Yrangepos", // use range positions for syntax trees
  "-language:postfixOps", //  enables postfix operators
  "-language:implicitConversions", // enables defining implicit methods and members
  "-language:existentials", // enables writing existential types
  "-language:reflectiveCalls", // enables reflection
  "-language:higherKinds", // allow higher kinded types without `import scala.language.higherKinds`
  "-unchecked", // warn about unchecked type parameters
  "-feature", // warn about misused language features
  /*"-Xlint",               // enable handy linter warnings
    "-Xfatal-warnings",     // turn compiler warnings into errors*/
  "-Ypartial-unification" // allow the compiler to unify type constructors of different arities
)

autoCompilerPlugins := true

addCompilerPlugin(Dependencies.`kind-projector`)
addCompilerPlugin(Dependencies.`better-monadic-for`)


// Define the root project, and make it compile all child projects
lazy val `datarepo` =
  project
    .in(file("."))
    .aggregate(
      `foo`,
      `foo-other`,
      `sparkusingproject`,
      `sparkusingproject-test`,
      `sparkusingproject-other`,
    )

// Define individual projects, the directories they reside in, and other projects they depend on
lazy val `foo` =
  project
    .in(file("foo"))
    .settings(Common.defaultSettings: _*)

lazy val `foo-other` =
  project
    .in(file("foo-other"))
    .dependsOn(`foo`)
    .settings(Common.defaultSettings: _*)

【Comments】:

Could you provide more details about the code?

@Nikk: I have added more details. Is that better for you?

【Answer 1】:

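I just hit this exception in my own tests. It was caused by trying to run a Spark action on a different thread from the one where I started the SparkSession. You may want to disable parallelExecution in Test (disabling it is recommended for Spark integration tests anyway).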
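As a minimal sketch of that suggestion, the build.sbt fragment below turns off parallel test execution, using the same `in` scoping syntax as the question. Where the settings go in a multi-project build (for example, in the shared settings of each Spark-using subproject) is an assumption, and the second setting is an optional extra that the answer does not mention: it limits the whole build to one running test task at a time, since parallelExecution only governs suites within a single project.

```scala
// build.sbt (sketch, not the asker's actual configuration)

// Run the test suites of a project sequentially instead of in parallel.
parallelExecution in Test := false

// Assumption: optional extra for multi-project builds. Allow only one
// test task to run at a time across all aggregated subprojects.
concurrentRestrictions in Global += Tags.limit(Tags.Test, 1)
```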

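Specifically, I was trying to run several Spark actions in parallel, and I was doing so on Scala's ExecutionContext.global thread pool. When I switched to a fixed-size thread pool of my own (for example one created with Executors.newFixedThreadPool), everything started working.

AFAICT this is because, at DataSource.scala:610, Spark obtains the thread's ContextClassLoader: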

    val loader = Utils.getContextOrSparkClassLoader

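When running on Scala's default thread pool, that class loader does not contain the relevant classes and interfaces. When you create a new thread pool instead, its threads inherit the correct class loader from the current thread, and everything works from then on.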
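The workaround described above might look roughly like the following sketch. It uses only the JDK and the Scala standard library, so no Spark dependency is needed to run it; the object and method names are hypothetical, and the pool size of 4 simply mirrors the 4 local threads mentioned in the question. Instead of a Spark action, the Future checks whether the worker thread sees the same context class loader as the thread that created the pool.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical helper, for illustration only.
object DedicatedPoolSketch {

  // Runs a task on a dedicated fixed-size pool and reports whether the
  // worker thread has the same context class loader as the calling thread.
  def workerSharesCallerLoader(): Boolean = {
    val callerLoader = Thread.currentThread().getContextClassLoader

    // Dedicated pool used in place of ExecutionContext.global.
    val pool = Executors.newFixedThreadPool(4)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

    try {
      val check = Future {
        // In a real test, the Spark actions would run here.
        Thread.currentThread().getContextClassLoader eq callerLoader
      }
      Await.result(check, 10.seconds)
    } finally {
      pool.shutdown() // release the worker threads when done
    }
  }

  def main(args: Array[String]): Unit =
    println(workerSharesCallerLoader())
}
```

In a test suite, the pool would typically be created in the same place as the SparkSession and shut down together with it, so that the worker threads are started from a thread that already has the right class loader.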
