How to write data to a Hive table with snappy compression in Spark SQL

Posted 2023-04-18

【Title】How to write data to hive table with snappy compression in Spark SQL 【Posted】2019-03-02 10:31:02 【Question】:

I have an ORC Hive table that was created with the following Hive command:

create table orc1(line string) stored as orcfile

I want to write some data to this table using Spark SQL. I use the following code and expect the data on HDFS to be snappy-compressed:

  test("test spark orc file format with compression") {
    import SESSION.implicits._
    // Register two sample rows as a temporary view
    Seq("Hello Spark", "Hello Hadoop").toDF("a").createOrReplaceTempView("tmp")
    // Hive/MapReduce output compression settings
    SESSION.sql("set hive.exec.compress.output=true")
    SESSION.sql("set mapred.output.compress=true")
    SESSION.sql("set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec")
    SESSION.sql("set io.compression.codecs=org.apache.hadoop.io.compress.SnappyCodec")
    SESSION.sql("set mapred.output.compression.type=BLOCK")
    // Overwrite the Hive ORC table with the rows from the temporary view
    SESSION.sql("insert overwrite table orc1 select a from tmp")
  }

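The data is written, but it is NOT compressed with snappy.

If I run the insert overwrite in Hive Beeline/Hive with the set commands above, I can see that the table's files are compressed with snappy.

So my question is: how do I write data with snappy compression to an ORC table created by Hive, using Spark SQL 2.1?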

【Comments】:

Did you find a solution? 【Answer 1】:

You can set the compression to snappy in the create table command, like this:

create table orc1(line string) stored as orc tblproperties ("orc.compress"="SNAPPY");

Any insert into the table will then be snappy-compressed (I also corrected orcfile to orc in the command).
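
The same table definition can also be issued from Spark itself. The following is only a minimal sketch, assuming a Hive-enabled SparkSession started through spark-submit; the object name SnappyOrcInsert and the helper createOrcTableSql are hypothetical, and this thread does not confirm that Spark SQL 2.1 honors orc.compress when inserting.

```scala
import org.apache.spark.sql.SparkSession

object SnappyOrcInsert {
  // Hypothetical helper: builds the create-table statement from the answer
  // for a given table name and ORC codec.
  def createOrcTableSql(table: String, codec: String): String =
    s"""create table if not exists $table(line string) stored as orc tblproperties ("orc.compress"="$codec")"""

  def main(args: Array[String]): Unit = {
    // Assumption: the master and the Hive metastore are configured externally.
    val spark = SparkSession.builder()
      .appName("snappy-orc-insert")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // "if not exists" leaves an already existing orc1 untouched, so drop or
    // alter the old table first if it was created without the property.
    spark.sql(createOrcTableSql("orc1", "SNAPPY"))

    Seq("Hello Spark", "Hello Hadoop").toDF("a").createOrReplaceTempView("tmp")
    spark.sql("insert overwrite table orc1 select a from tmp")

    // Show the table properties so that orc.compress can be inspected.
    spark.sql("show tblproperties orc1").show(false)

    spark.stop()
  }
}
```

Printing the table properties only confirms what the metastore records; whether the written files actually use the codec still has to be checked on the files themselves.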

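【Discussion】:

Thanks, I tried it, but it doesn't seem to work. I used hdfs dfs -cat to print the contents of the table's files, and I can see Hello and Hadoop in them, so they must not be compressed. If they were compressed, the content should not be readable.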
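
One caveat about the hdfs dfs -cat check: readable text does not by itself prove that the file is uncompressed, because snappy stores short, non-repetitive input as literal bytes. The sketch below illustrates only that codec behavior, not ORC's own file framing. It assumes snappy-java (org.xerial.snappy) is on the classpath; the object and method names are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import org.xerial.snappy.Snappy

object SnappyCatCheck {
  // Returns true if the snappy-compressed form of `text` still contains
  // the original bytes of `text` verbatim.
  def readableAfterSnappy(text: String): Boolean = {
    val raw = text.getBytes(StandardCharsets.UTF_8)
    val compressed = Snappy.compress(raw)
    // ISO-8859-1 maps every byte to exactly one char, so a substring
    // search on these strings is equivalent to a byte-sequence search.
    val haystack = new String(compressed, StandardCharsets.ISO_8859_1)
    val needle = new String(raw, StandardCharsets.ISO_8859_1)
    haystack.contains(needle)
  }

  def main(args: Array[String]): Unit = {
    // The two sample rows from the question.
    Seq("Hello Spark", "Hello Hadoop").foreach { s =>
      println(s"$s -> still readable after snappy: ${readableAfterSnappy(s)}")
    }
  }
}
```

A more direct check is to read the codec recorded in the ORC file's metadata, for example with Hive's ORC file dump utility (hive --orcfiledump followed by the path of one of the table's files), which reports the compression kind.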
