Apache Spark 2.4.0, AWS EMR, Spark Redshift and User class threw exception: java.lang.AbstractMethodError
【Posted】: 2019-03-11 13:16:47
【Question】: I am using Apache Spark 2.4.0 on AWS EMR with Spark Redshift, and when reading a Redshift table into a Spark DataFrame I now get the following error:
User class threw exception: java.lang.AbstractMethodError
at org.apache.spark.sql.execution.datasources.DataSourceUtils$$anonfun$verifySchema$1.apply(DataSourceUtils.scala:48)
at org.apache.spark.sql.execution.datasources.DataSourceUtils$$anonfun$verifySchema$1.apply(DataSourceUtils.scala:47)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
at org.apache.spark.sql.execution.datasources.DataSourceUtils$.verifySchema(DataSourceUtils.scala:47)
at org.apache.spark.sql.execution.datasources.DataSourceUtils$.verifyReadSchema(DataSourceUtils.scala:39)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:400)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at com.databricks.spark.redshift.RedshiftRelation.buildScan(RedshiftRelation.scala:168)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:293)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:293)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:326)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:325)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProjectRaw(DataSourceStrategy.scala:403)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProject(DataSourceStrategy.scala:321)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:289)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3360)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2545)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2759)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:255)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:292)
at org.apache.spark.sql.Dataset.show(Dataset.scala:746)
at org.apache.spark.sql.Dataset.show(Dataset.scala:705)
at org.apache.spark.sql.Dataset.show(Dataset.scala:714)
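For context, the stack trace (RedshiftRelation.buildScan, then Dataset.show) corresponds to a read along these lines; a minimal sketch, with a hypothetical URL, credentials, table name, and tempdir that are not from the original post:

// Minimal spark-redshift read sketch; all connection details are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("redshift-read").getOrCreate()

val df = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://my-cluster.example.com:5439/mydb?user=awsuser&password=secret")
  .option("dbtable", "public.my_table")      // hypothetical table
  .option("tempdir", "s3a://my-bucket/tmp/") // S3 staging dir used by spark-redshift
  .load()

df.show() // the AbstractMethodError surfaces here, during query planning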
Versions:
spark-redshift_2.11:3.0.0-preview1
Apache Spark (Scala 2.11): 2.4.0
What am I doing wrong, and how can I fix this?
【Question discussion】:
Looks like a version mismatch, but are you aware that Spark Redshift is no longer supported or maintained? And it hasn't been for a long time.
With that in mind, what options are there to work around this?
The Spark Redshift connector does appear to be outdated; it was built for older Spark versions. What is the use case? For testing with small amounts of data, plain JDBC works fine. For real data loads it is better to write to S3 and load from there with the Redshift COPY command, or to use Firehose, which does this for you automatically.
I need to join Spark DataFrames, and one of them has to be created from a Redshift table.
***.com/questions/55704871/… I hope my post helps you; I think this is a dependency problem.
【Answer 1】: You can use https://mvnrepository.com/artifact/com.amazon.redshift/redshift-jdbc42-no-awssdk instead and then use jdbc:redshift://... to create the table, as in the sketch below.
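A sketch of that approach with Spark's built-in JDBC source, assuming the redshift-jdbc42-no-awssdk driver is on the classpath (host, table, and credentials below are placeholders):

// Plain-JDBC read via the Amazon Redshift driver; all names are hypothetical.
val jdbcDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:redshift://my-cluster.example.com:5439/mydb")
  .option("driver", "com.amazon.redshift.jdbc42.Driver")
  .option("dbtable", "public.my_table")
  .option("user", "awsuser")
  .option("password", "secret")
  .load()

This bypasses spark-redshift entirely, though all rows flow through a single JDBC connection unless the partitionColumn/lowerBound/upperBound/numPartitions options are also set.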
【Discussion】:
Thanks for your answer. How does Redshift perform over plain JDBC when reading/writing large amounts of data (1 TB+)?
Writes, at least, are not very efficient. As said earlier, it is better to go through S3.
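To illustrate the S3 route mentioned above, a rough sketch of staging a DataFrame in S3 and loading it with COPY; the bucket, table, credentials, and IAM role are hypothetical, and the exact COPY options depend on your data format:

// 1) Stage the DataFrame in S3 as CSV (bucket/path are placeholders).
df.write.mode("overwrite").format("csv").save("s3://my-bucket/staging/my_table/")

// 2) Load it into Redshift with COPY over a plain JDBC connection.
import java.sql.DriverManager
val conn = DriverManager.getConnection(
  "jdbc:redshift://my-cluster.example.com:5439/mydb", "awsuser", "secret")
try {
  conn.createStatement().execute(
    """COPY public.my_table
      |FROM 's3://my-bucket/staging/my_table/'
      |IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
      |FORMAT AS CSV""".stripMargin)
} finally {
  conn.close()
}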