Sqoop HCatalog import job completed, but data is not present in the table

Posted: 2019-09-25 12:31:41

Problem description:

I am trying to integrate HCatalog with Sqoop in order to import data from an RDBMS (Oracle) into our data lake (in Hive).

sqoop-import --connect connection-string --username username --password pass --table --hcatalog-database data_extraction --hcatalog-table --hcatalog-storage-stanza 'stored as orcfile' -m 1 --verbose

The job executed successfully, but I cannot find the data. I also checked the location of the table created in HCatalog: no directory had been created there, and I found only a single 0-byte file named _$folder$.
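The empty-location check described above can be reproduced with something like the following. The database name comes from the question; the table name and warehouse bucket are placeholders, and `_$folder$` is the zero-byte marker object that EMR's S3 filesystems create to represent an empty directory:

```shell
# Look up the table's storage location in the metastore
# (table name "my_table" is hypothetical)
hive -e "DESCRIBE FORMATTED data_extraction.my_table;" | grep -i location

# List the backing S3 path; in the failure case only the
# 0-byte _$folder$ marker object is present, no data files
hadoop fs -ls 's3://my-warehouse-bucket/data_extraction.db/my_table_$folder$'
```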

Please find the stack trace below:

19/09/25 17:53:37 INFO Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
19/09/25 17:54:02 DEBUG db.DBConfiguration: Fetching password from job credentials store
19/09/25 17:54:03 INFO db.DBInputFormat: Using read commited transaction isolation
19/09/25 17:54:03 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '1=1' and upper bound '1=1'
19/09/25 17:54:03 INFO mapreduce.JobSubmitter: number of splits:1
19/09/25 17:54:03 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1569355854349_1231
19/09/25 17:54:04 INFO impl.YarnClientImpl: Submitted application application_1569355854349_1231
19/09/25 17:54:04 INFO mapreduce.Job: The url to track the job: http://<PII-removed-by-me>/application_1569355854349_1231/
19/09/25 17:54:04 INFO mapreduce.Job: Running job: job_1569355854349_1231
19/09/25 17:57:34 INFO hive.metastore: Closed a connection to metastore, current connections: 1
19/09/25 18:02:59 INFO mapreduce.Job: Job job_1569355854349_1231 running in uber mode : false
19/09/25 18:02:59 INFO mapreduce.Job:  map 0% reduce 0%
19/09/25 18:03:16 INFO mapreduce.Job:  map 100% reduce 0%
19/09/25 18:03:18 INFO mapreduce.Job: Job job_1569355854349_1231 completed successfully
19/09/25 18:03:18 INFO mapreduce.Job: Counters: 35
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=425637
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=87
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=1
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
                S3: Number of bytes read=0
                S3: Number of bytes written=310154
                S3: Number of read operations=0
                S3: Number of large read operations=0
                S3: Number of write operations=0
        Job Counters
                Launched map tasks=1
                Other local map tasks=1
                Total time spent by all maps in occupied slots (ms)=29274
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=14637
                Total vcore-milliseconds taken by all map tasks=14637
                Total megabyte-milliseconds taken by all map tasks=52459008
        Map-Reduce Framework
                Map input records=145608
                Map output records=145608
                Input split bytes=87
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=199
                CPU time spent (ms)=4390
                Physical memory (bytes) snapshot=681046016
                Virtual memory (bytes) snapshot=5230788608
                Total committed heap usage (bytes)=1483210752
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=0
19/09/25 18:03:18 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 582.8069 seconds (0 bytes/sec)
19/09/25 18:03:18 INFO mapreduce.ImportJobBase: Retrieved 145608 records.
19/09/25 18:03:18 INFO mapreduce.ImportJobBase: Publishing Hive/Hcat import job data to Listeners for table null
19/09/25 18:03:19 DEBUG util.ClassLoaderStack: Restoring classloader: sun.misc.Launcher$AppClassLoader@1d548a08

Comments:

We have the Hive warehouse on S3.

Answer 1:

Solved. We are using AWS EMR (a managed Hadoop service), and this behavior is documented on their site. AWS Forum Screenshot

When you use Sqoop to write output to an HCatalog table in Amazon S3, disable Amazon EMR direct write by setting the mapred.output.direct.NativeS3FileSystem and mapred.output.direct.EmrFileSystem properties to false. For more information, see Using HCatalog. You can use the Hadoop options -D mapred.output.direct.NativeS3FileSystem=false and -D mapred.output.direct.EmrFileSystem=false to do this.

If you do not disable direct write, no error occurs, but the table is created in Amazon S3 and no data is written.
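Putting the fix together with the original command gives something like the sketch below. The -D generic Hadoop options must come immediately after the tool name, before any sqoop-specific arguments; the connection string and table names are placeholders for the values redacted in the question:

```shell
# Sqoop import into an S3-backed HCatalog table on EMR,
# with EMR direct write disabled (the documented fix).
# jdbc URL, credentials, and table names are hypothetical.
sqoop import \
  -D mapred.output.direct.NativeS3FileSystem=false \
  -D mapred.output.direct.EmrFileSystem=false \
  --connect jdbc:oracle:thin:@//dbhost:1521/service \
  --username username --password pass \
  --table SOURCE_TABLE \
  --hcatalog-database data_extraction \
  --hcatalog-table target_table \
  --hcatalog-storage-stanza 'stored as orcfile' \
  -m 1 --verbose
```

With direct write disabled, the mappers write to a temporary location and the output is committed to the table's S3 path on job completion, so the imported files actually appear under the table directory.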

This can be found at https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-sqoop-considerations.html

Discussion:
