Sqoop error when running via Spark

Posted 2018-03-09 05:23:46

Problem description:

When I run this import through the sqoop command line, it works:

sqoop import --connect "jdbc:sqlserver://myhost:port;databaseName=DBNAME" \
 --username MYUSER -P \
 --compress --compression-codec snappy \
 --as-parquetfile \
 --table MYTABLE \
 --warehouse-dir /user/myuser/test1/ \
 --m 1

Then I wrote the following Spark Scala code, but when I run the project with spark-submit, it does not work:

// Imports for Sqoop 1.4.x. Note: Sqoop's public client API still lives under
// com.cloudera.sqoop in 1.4.x; on other builds these may sit under org.apache.sqoop.
import com.cloudera.sqoop.SqoopOptions
import com.cloudera.sqoop.tool.ImportTool
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner
import org.apache.sqoop.Sqoop
import org.apache.sqoop.SqoopOptions.FileLayout

val conf = new Configuration() // picks up the cluster's Hadoop config from the classpath

val sqoop_options = new SqoopOptions()
sqoop_options.setConnectString("jdbc:sqlserver://myhost:port;databaseName=DBNAME")
sqoop_options.setTableName("MYTABLE")
sqoop_options.setUsername("MYUSER")
sqoop_options.setPassword("password")
sqoop_options.setNumMappers(1)
sqoop_options.setTargetDir("/user/myuser/test1/")
sqoop_options.setFileLayout(FileLayout.ParquetFile)
sqoop_options.setCompressionCodec("org.apache.hadoop.io.compress.SnappyCodec")

val importTool = new ImportTool
val sqoop = new Sqoop(importTool, conf, sqoop_options)
val retCode = ToolRunner.run(sqoop, null) // null args: everything is set via SqoopOptions
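
Worth noting about this design: Sqoop runs inside the Spark driver JVM and then submits its own MapReduce job; the Spark executors are never involved. So the JDBC driver and Sqoop's dependencies must be on the driver's classpath at submit time, not merely installed where the sqoop CLI finds them.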

It returns a driver-not-found error, even though I run it on the same cluster. I have already placed the appropriate libraries in the /var/lib/sqoop directory, which is why the sqoop CLI command runs fine. But when I run it via spark-submit, does it resolve libraries from a different path?

Detailed error log:

 /opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/spark/conf/spark-env.sh: line 75: spark.driver.extraClassPath=.:/etc/hbase/conf:/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar://opt/cloudera/parcels/CDH/lib/hbase/hbase-server.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/guava-12.0.1.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/zookeeper.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/protobuf-java-2.5.0.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-hadoop2-compat.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-hadoop-compat.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/metrics-core-2.2.0.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-spark.jar:/opt/cloudera/parcels/CDH/lib/hive/lib/hive-hbase-handler.jar: No such file or directory
/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/spark/conf/spark-env.sh: line 77: spark.executor.extraClassPath=.:/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar://opt/cloudera/parcels/CDH/lib/hbase/hbase-server.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/guava-12.0.1.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/zookeeper.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/protobuf-java-2.5.0.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-hadoop2-compat.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-hadoop-compat.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/metrics-core-2.2.0.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-spark.jar:/opt/cloudera/parcels/CDH/lib/hive/lib/hive-hbase-handler.jar: No such file or directory
2018-03-09 13:59:37,332 INFO  [main] security.UserGroupInformation: Login successful for user myuser using keytab file myuser.keytab
2018-03-09 13:59:37,371 INFO  [main] sqoop.Sqoop: Running Sqoop version: 1.4.6
2018-03-09 13:59:37,426 WARN  [main] sqoop.ConnFactory: $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
2018-03-09 13:59:37,478 INFO  [main] manager.SqlManager: Using default fetchSize of 1000
2018-03-09 13:59:37,479 INFO  [main] tool.CodeGenTool: Beginning code generation
2018-03-09 13:59:37,479 INFO  [main] tool.CodeGenTool: Will generate java class as codegen_MYTABLE
Exception in thread "main" java.lang.RuntimeException: Could not load db driver class: com.microsoft.sqlserver.jdbc.SQLServerDriver
        at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:856)
        at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
        at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:744)
        at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:767)
        at org.apache.sqoop.manager.SqlManager.getColumnInfoForRawQuery(SqlManager.java:270)
        at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:241)
        at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:227)
        at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:295)
        at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1833)
        at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1645)
        at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:107)
        at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:478)
        at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
        at com.test.spark.sqoop.SqoopExample$.importSQLToHDFS(SqoopExample.scala:56)
        at com.test.spark.sqoop.SqoopExample$.main(SqoopExample.scala:18)
        at com.test.spark.sqoop.SqoopExample.main(SqoopExample.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
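
The missing class here is the Microsoft JDBC driver, so the driver jar is not on the classpath that spark-submit builds. A minimal sketch of supplying it at submit time, assuming the driver is the sqljdbc4.jar under /var/lib/sqoop mentioned above (the application jar name is a placeholder): --driver-class-path puts the jar on the driver's classpath, where Sqoop's code generation opens the JDBC connection, and --jars ships it to the executors as well.

spark-submit \
  --class com.test.spark.sqoop.SqoopExample \
  --master yarn --deploy-mode client \
  --jars /var/lib/sqoop/sqljdbc4.jar \
  --driver-class-path /var/lib/sqoop/sqljdbc4.jar \
  your-application-jar-with-dependencies.jar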

Now, with the spark-submit command below, my error is:

spark-submit --files kafka-jaas.conf,ampuser.keytab \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=kafka-jaas.conf" \
  --driver-java-options "-Djava.security.auth.login.config=kafka-jaas.conf" \
  --conf spark.driver.extraClassPath=/var/lib/sqoop/sqljdbc4.jar:/opt/cloudera/parcels/CDH/lib/sqoop/lib/*,/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/* \
  --class com.danamon.spark.sqoop.SqoopExample \
  --deploy-mode client --master yarn \
  kafka-streaming-0.0.1-SNAPSHOT-jar-with-dependencies.jar


18/03/13 20:54:51 INFO security.UserGroupInformation: Login successful for user ampuser using keytab file ampuser.keytab
18/03/13 20:54:51 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
18/03/13 20:54:51 WARN sqoop.ConnFactory: $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
18/03/13 20:54:51 INFO manager.SqlManager: Using default fetchSize of 1000
18/03/13 20:54:51 INFO tool.CodeGenTool: Beginning code generation
18/03/13 20:54:51 INFO tool.CodeGenTool: Will generate java class as codegen_BD_AC_ACCT_PREFERENCES
18/03/13 20:54:52 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM [BD_AC_ACCT_PREFERENCES] AS t WHERE 1=0
18/03/13 20:54:52 INFO orm.CompilationManager: $HADOOP_MAPRED_HOME is not set
Note: /tmp/sqoop-ampuser/compile/95e3ef854d67b50d8ef72955151dc846/codegen_BD_AC_ACCT_PREFERENCES.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
18/03/13 20:54:54 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-ampuser/compile/95e3ef854d67b50d8ef72955151dc846/codegen_BD_AC_ACCT_PREFERENCES.jar
18/03/13 20:54:54 INFO mapreduce.ImportJobBase: Beginning import of BD_AC_ACCT_PREFERENCES
18/03/13 20:54:54 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
Exception in thread "main" java.lang.NoClassDefFoundError: org/kitesdk/data/mapreduce/DatasetKeyOutputFormat
        at org.apache.sqoop.mapreduce.DataDrivenImportJob.getOutputFormatClass(DataDrivenImportJob.java:190)
        at org.apache.sqoop.mapreduce.ImportJobBase.configureOutputFormat(ImportJobBase.java:94)
        at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:259)
        at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:673)
        at org.apache.sqoop.manager.SQLServerManager.importTable(SQLServerManager.java:163)
        at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:497)
        at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
        at com.danamon.spark.sqoop.SqoopExample$.importSQLToHDFS(SqoopExample.scala:57)
        at com.danamon.spark.sqoop.SqoopExample$.main(SqoopExample.scala:18)
        at com.danamon.spark.sqoop.SqoopExample.main(SqoopExample.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.kitesdk.data.mapreduce.DatasetKeyOutputFormat
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 22 more

Is this caused by my Cloudera installation not being configured correctly? Or are HADOOP_HOME, MAPRED_HOME, and so on set incorrectly? Should I open a new question for this?
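
Two details in the command above are worth flagging. Entries in spark.driver.extraClassPath are separated by colons on Linux, not commas, so the comma after .../sqoop/lib/* folds the hadoop-mapreduce path into the previous entry instead of adding a new one. And the NoClassDefFoundError means the Kite SDK jars (kite-data-core, kite-data-mapreduce, ...), which Sqoop's --as-parquetfile support uses, are not being picked up; on CDH they normally sit in the parcel's sqoop/lib directory. A hedged sketch of a corrected command under those assumptions:

spark-submit --files kafka-jaas.conf,ampuser.keytab \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=kafka-jaas.conf" \
  --driver-java-options "-Djava.security.auth.login.config=kafka-jaas.conf" \
  --conf "spark.driver.extraClassPath=/var/lib/sqoop/sqljdbc4.jar:/opt/cloudera/parcels/CDH/lib/sqoop/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/*" \
  --class com.danamon.spark.sqoop.SqoopExample \
  --deploy-mode client --master yarn \
  kafka-streaming-0.0.1-SNAPSHOT-jar-with-dependencies.jar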

Comments:

Can you paste the error message?
I updated the question.
Your connection string looks suspect. Which driver do you actually have on the classpath? Could you try the jTDS driver, keeping it on the classpath: sqoop_options.setConnectString("jdbc:jtds:sqlserver://myhost:port;databaseName=DBNAME")
jdbc:jdbc:sqlserver? You have one jdbc too many there. Also, the JDBC jar needs to be on the Spark executor classpath.
My bad. Now that I have added the jar to extraClassPath, I get a java.lang.NoClassDefFoundError: org/kitesdk/data/mapreduce/DatasetKeyOutputFormat error. I added all of the sqoop libs to extraClassPath and got past that one. The error now is IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected, at org.apache.sqoop.config.ConfigurationHelper.getJobNumMaps(ConfigurationHelper.java:65)
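
The IncompatibleClassChangeError in the last comment is the classic symptom of a Hadoop 1 (MR1) artifact on a Hadoop 2 classpath: JobContext was a class in MR1 and became an interface in MR2. When building a fat jar, avro-mapred is one common culprit, since its default artifact is compiled against Hadoop 1. A hedged sketch of the usual dependency fix in an sbt build (the version is illustrative; match it to the cluster):

// build.sbt: use the Hadoop 2 build of avro-mapred instead of the MR1 default
libraryDependencies += "org.apache.avro" % "avro-mapred" % "1.7.7" classifier "hadoop2"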

Answer 1:

You need to set HADOOP_MAPRED_HOME to $HADOOP_HOME in your ~/.bashrc:

sudo nano ~/.bashrc

Then add this line:

export HADOOP_MAPRED_HOME=$HADOOP_HOME

Save the file, then run this command:

source ~/.bashrc
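
On a CDH parcel install like the one in the logs above, $HADOOP_HOME is often unset in the shell that launches spark-submit; in that layout HADOOP_MAPRED_HOME usually points at the parcel's MapReduce directory, the same one the question's classpath already references. A hedged alternative under that assumption:

export HADOOP_MAPRED_HOME=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce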
