Error - Import from SQL Server to GCS using Apache Sqoop & Dataproc

Posted: 2021-09-22 12:29:10

Question:

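I am trying to import data from SQL Server into Google Cloud Storage, and will later load it into BigQuery. I am doing all of this from Google's Cloud Shell.

I have completed the initial steps: downloading Sqoop and the SQL Server JDBC driver, then uploading them to a dedicated Google Cloud Storage bucket. I have also created a Google Dataproc cluster to submit Sqoop jobs to, but when I run the submit command it throws errors.

I am following this process (https://medium.com/datamindedbe/import-sql-server-data-in-bigquery-d640441d5d56); in my case, I am trying to extract a single table first.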

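What I tried

I do have the SQL Server JDBC .jar file (mssql-jdbc-8.2.1.jre8.jar) in Cloud Storage, together with the other dependency files.

I also checked the TCP/IP connection setting in my SQL Server 2014; it is in the state the error message recommends.

The code I use to submit the Sqoop job to the Dataproc cluster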

CLUSTERNAME="sqoop-cluster"
BUCKET="gs://sqoop-bucket-20092021"
libs=$(gsutil ls "$BUCKET/jars" | paste -sd, -)
JDBC_STR="jdbc:sqlserver://RUKSQLRS01:1433;databaseName=RUKDataWarehouse"
SQL_USER="RUKSQLDataWarehouse_Reporting"
SQL_PASS="gs://sqoop-bucket-20092021/creds/sqoop.password"
TABLE="LBD_Task"
SCHEMA="dbo"

gcloud dataproc jobs submit hadoop \
    --region europe-west2 \
    --cluster="$CLUSTERNAME" \
    --jars=$libs \
    --class=org.apache.sqoop.Sqoop \
    -- \
    import \
    -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
    -Dmapreduce.job.user.classpath.first=true \
    --connect "$JDBC_STR" \
    --username "$SQL_USER" \
    --password-file "$SQL_PASS" \
    --table "$SCHEMA.$TABLE" \
    --warehouse-dir "$BUCKET/output/$TABLE" \
    --num-mappers 1 \
    --as-avrodatafile

The error I get

21/09/22 11:30:46 WARN tool.SqoopTool: $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
21/09/22 11:30:46 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
21/09/22 11:30:48 WARN sqoop.ConnFactory: $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
21/09/22 11:30:48 INFO manager.SqlManager: Using default fetchSize of 1000
21/09/22 11:30:48 INFO tool.CodeGenTool: Beginning code generation
21/09/22 11:31:02 ERROR manager.SqlManager: Error executing statement: com.microsoft.sqlserver.jdbc.SQLServerException: The TCP/IP connection to the host RUKSQLRS01, port 1433 has failed. Error: "RUKSQLRS01. Verify the connection properties. Make sure that an instance of SQL Server is running on the host and accepting TCP/IP connections at the port. Make sure that TCP connections to the port are not blocked by a firewall.".
com.microsoft.sqlserver.jdbc.SQLServerException: The TCP/IP connection to the host RUKSQLRS01, port 1433 has failed. Error: "RUKSQLRS01. Verify the connection properties. Make sure that an instance of SQL Server is running on the host and accepting TCP/IP connections at the port. Make sure that TCP connections to the port are not blocked by a firewall.".
    at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDriverError(SQLServerException.java:227)
    at com.microsoft.sqlserver.jdbc.SQLServerException.ConvertConnectExceptionToSQLServerException(SQLServerException.java:284)
    at com.microsoft.sqlserver.jdbc.SocketFinder.findSocket(IOBuffer.java:2435)
    at com.microsoft.sqlserver.jdbc.TDSChannel.open(IOBuffer.java:635)
    at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectHelper(SQLServerConnection.java:2010)
    at com.microsoft.sqlserver.jdbc.SQLServerConnection.login(SQLServerConnection.java:1687)
    at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectInternal(SQLServerConnection.java:1528)
    at com.microsoft.sqlserver.jdbc.SQLServerConnection.connect(SQLServerConnection.java:866)
    at com.microsoft.sqlserver.jdbc.SQLServerDriver.connect(SQLServerDriver.java:569)
    at java.sql.DriverManager.getConnection(DriverManager.java:664)
    at java.sql.DriverManager.getConnection(DriverManager.java:247)
    at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:904)
    at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:59)
    at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:763)
    at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:786)
    at org.apache.sqoop.manager.SqlManager.getColumnInfoForRawQuery(SqlManager.java:289)
    at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:260)
    at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:246)
    at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:327)
    at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1872)
    at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1671)
    at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:106)
    at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:501)
    at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:628)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:19)
21/09/22 11:31:02 ERROR tool.ImportTool: Import failed: java.io.IOException: No columns to generate for ClassWriter
    at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1677)
    at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:106)
    at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:501)
    at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:628)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:19)

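Comments:

Where is your SQL Server? Is it reachable from the Dataproc cluster?

The error "TCP/IP connection to the host RUKSQLRS01, port 1433 has failed" shows the host name is RUKSQLRS01. Is that a GCE VM in the same VPC network? Can you run nslookup RUKSQLRS01 from the master node?

The SQL Server is hosted on AWS.

Is it reachable from GCE? And how do you make sure RUKSQLRS01 resolves to an IP address?

Answer 1: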

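This looks like a network problem. Your SQL Server is outside GCP, and you are trying to reach it by host name. You need to either use its external IP and set up a firewall rule on the SQL Server side to allow access from GCP, or set up a VPN between your GCP VPC network and the SQL Server network and reach SQL Server through its internal IP.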
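
To narrow down which part of the network path is failing, something like the following could be run on a cluster node. It is only a rough sketch: the function names (resolve_host, port_open, diagnose, jdbc_url) are made up for illustration, and it assumes a Linux node with bash, getent and the coreutils timeout command available.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative and not part of Sqoop,
# Dataproc, or the original post.

# Succeeds if the name resolves through the node's resolver
# (DNS or /etc/hosts). An IP literal always "resolves".
resolve_host() {
    timeout 5 getent ahosts "$1" > /dev/null 2>&1
}

# Succeeds if a TCP connection to host:port opens within 5 seconds.
# Relies on bash's built-in /dev/tcp redirection.
port_open() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2> /dev/null
}

# Reports the first check that fails: name resolution, then the TCP port.
diagnose() {
    local host="$1" port="$2"
    if ! resolve_host "$host"; then
        echo "DNS: $host does not resolve"
        return 1
    fi
    if ! port_open "$host" "$port"; then
        echo "TCP: $host:$port is not reachable"
        return 2
    fi
    echo "OK: $host:$port is reachable"
}

# Builds a JDBC URL in the same format as JDBC_STR in the question.
jdbc_url() {
    printf 'jdbc:sqlserver://%s:%s;databaseName=%s\n' "$1" "$2" "$3"
}
```

On the master node (Dataproc typically names it after the cluster with an -m suffix, so sqoop-cluster-m here), diagnose RUKSQLRS01 1433 shows whether the name resolves at all before the port is even tried. Once the server is reachable by IP, JDBC_STR=$(jdbc_url 203.0.113.10 1433 RUKDataWarehouse) would replace the host name in the connection string; 203.0.113.10 is a documentation placeholder, not the real server address.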
