Creating a table from a CSV whose values contain commas enclosed in quotes

I am trying to create a table in Impala from a CSV that I uploaded to an HDFS directory. The CSV contains values with commas enclosed in quotes.

Example:

1.66.96.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.66.128.0/17,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.67.0.0/17,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.67.128.0/18,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.67.192.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."

The Impala documentation says this can be solved with the ESCAPED BY clause. This is my current code:

DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;

CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
  network STRING
 ,isp STRING
 ,organization STRING
 ,autonomous_system_number STRING
 ,autonomous_system_organization STRING
  )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\'

LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';

INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;

LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/' 
INTO TABLE GeoIP2_ISP_Blocks_IPv4;

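I have also tried the ESCAPED BY '"' clause. In both cases, Impala treats the commas inside the quotes as delimiters and splits the value across two columns.

Any ideas on how to fix the code so that this does not happen?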
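To illustrate why the quotes do not help: as far as I understand the Impala documentation, a delimited text table splits on every delimiter and treats quotation marks as part of the value, and ESCAPED BY only has an effect when the escape character sits directly in front of a delimiter in the data itself. The Python sketch below is not Impala code. It only mimics that splitting behaviour on one sample row, and then rewrites the row into the backslash-escaped form that an ESCAPED BY '\' table would expect. The helper names are made up for illustration, and the sketch assumes that no field value already contains a backslash.

```python
import csv
import io

# One row copied from the sample data in the question.
SAMPLE = '1.66.96.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."\n'


def naive_split(line):
    """Split on every comma and ignore quotes (mimics a plain delimited text table)."""
    return line.rstrip("\n").split(",")


def quote_aware_split(line):
    """Parse one CSV line with the standard csv module, which honors double quotes."""
    return next(csv.reader(io.StringIO(line)))


def escape_embedded_commas(line, escape_char="\\"):
    """Drop the quotes and put escape_char in front of each comma inside a field.

    Assumption: no field already contains escape_char.
    """
    fields = quote_aware_split(line)
    return ",".join(field.replace(",", escape_char + ",") for field in fields)


naive = naive_split(SAMPLE)
parsed = quote_aware_split(SAMPLE)
escaped_line = escape_embedded_commas(SAMPLE)

print(len(naive), "fields when splitting on every comma")
print(len(parsed), "fields when quotes are honored")
print(escaped_line)
```

The naive split yields seven fields for a five-column table, which matches the symptom described above. If the file is rewritten in the escaped form before it is uploaded, the original ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\' definition should, in principle, load it into five columns. I have not verified this against a live cluster.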

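Edit (6/9/2015)

Following the suggestions from @K S Nidhin and @JTUP, I tried the variations below. However, each variation returns the same result as the query without the SERDEPROPERTIES clause: the commas still cause values to end up in the wrong columns:

Variation 1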

DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;

CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
  network STRING
 ,isp STRING
 ,organization STRING
 ,autonomous_system_number STRING
 ,autonomous_system_organization STRING
  )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
WITH SERDEPROPERTIES ( "quoteChar" = "'", "escapeChar" = "\" ) 

LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';

INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;

LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/' 
INTO TABLE GeoIP2_ISP_Blocks_IPv4;

Variation 2

DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;

CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
  network STRING
 ,isp STRING
 ,organization STRING
 ,autonomous_system_number STRING
 ,autonomous_system_organization STRING
  )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\'
WITH SERDEPROPERTIES ( 'quoteChar' = '"', 'escapeChar' = '\' )

LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';

INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;

LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/' 
INTO TABLE GeoIP2_ISP_Blocks_IPv4;

Variation 3

DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;

CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
  network STRING
 ,isp STRING
 ,organization STRING
 ,autonomous_system_number STRING
 ,autonomous_system_organization STRING
  )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\'
WITH SERDEPROPERTIES (
   "separatorChar" = ",",
   "quoteChar"     = """
)

LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';

INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;

LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/' 
INTO TABLE GeoIP2_ISP_Blocks_IPv4;

Any other ideas, or further variations of the SERDEPROPERTIES clause?

Edit (October 6, 2016)

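I was able to get this working in Hive by using the SERDE and SERDEPROPERTIES clauses (based on the code provided in the Hive documentation), with a different variation of the table creation query: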

DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;

CREATE TABLE GeoIP2_ISP_Blocks_IPv4(network STRING
 ,isp STRING
 ,organization STRING
 ,autonomous_system_number STRING
 ,autonomous_system_organization STRING)

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'

WITH SERDEPROPERTIES (
   'separatorChar' = ',',
   'quoteChar'     = '"',
   'escapeChar'    = '\'
)   
STORED AS TEXTFILE;

LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/' 
INTO TABLE GeoIP2_ISP_Blocks_IPv4;

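Because Impala has no SERDE clause, this solution does not work there. I am fine with creating the table in Hive, but I still have not found a working solution in Impala.

Answer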
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;

CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
  network STRING
 ,isp STRING
 ,organization STRING
 ,autonomous_system_number STRING
 ,autonomous_system_organization STRING
  )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\'

WITH SERDEPROPERTIES (
   "separatorChar" = ",",
   "quoteChar"     = """
)

LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';

INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;

LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/' 
INTO TABLE GeoIP2_ISP_Blocks_IPv4;

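Add SERDEPROPERTIES; hopefully that does the trick.

Another answer

What I did was first convert the delimiter from a comma to some other character, for example a pipe ('|'). On Linux you can use csvformat (part of csvkit). Note that the pipe must be quoted, otherwise the shell interprets it as a pipeline.

csvformat -D '|' input_filename.csv > input_filename-pipe.csv

After that, set the delimiter to '|' in the Impala query: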

 TERMINATED BY '|'
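If csvkit is not available, the same conversion can be sketched with Python's standard csv module. This is only an illustration of the approach, not part of the original answer: convert_delimiter is a hypothetical helper name, and the sketch assumes that no field value contains the pipe character (it raises an error if one does, because the writer would then have to quote that field and the quoting problem would return).

```python
import csv
import io


def convert_delimiter(src, dst, out_delimiter="|"):
    """Read comma-separated, double-quoted CSV from src and write it to dst
    using out_delimiter, quoting only where the csv module finds it necessary."""
    reader = csv.reader(src)
    writer = csv.writer(
        dst,
        delimiter=out_delimiter,
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    for row in reader:
        # Refuse rows that would need quoting in the output.
        if any(out_delimiter in field for field in row):
            raise ValueError(f"field contains {out_delimiter!r}: {row}")
        writer.writerow(row)


# Demonstration on one row from the question, using in-memory files.
sample = io.StringIO(
    '1.66.96.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."\n'
)
out = io.StringIO()
convert_delimiter(sample, out)
converted = out.getvalue()
print(converted, end="")
```

For real files, open both with newline="" as the csv module documentation recommends, for example open("input_filename.csv", newline="") for reading and open("input_filename-pipe.csv", "w", newline="") for writing. Because the embedded commas are no longer delimiters, the output carries no quotes, so a table declared with TERMINATED BY '|' sees exactly five fields per row.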
