How to escape newlines when importing a CSV file into a Greenplum external table?

【Posted】: 2020-05-05 17:32:18

【Question】:

I am trying to create an external table from a CSV file like this:

CREATE EXTERNAL TABLE hctest.ex_abs
(
a text,
b text,
c text,
d text,
e text,
f text,
g text
)
LOCATION ('gpfdist://192.168.56.111:10000/absdatasample.csv')
FORMAT 'CSV' (DELIMITER '|' HEADER);

The CSV is pipe (|) delimited and looks like this:

Employee ID|Time Type|Start Date|End Date|Number Of Days|Comment|Ration of Leave
90007507|Leave|11/27/2020|11/27/2020|1|dear mas Andria,

seek for approval for 1 day off. Thank you.|8
90007507|Leave|05/08/2020|05/08/2020|1|dear mas Andria, kindly approve 1 day leave at 8th May. Thank you.|5
90006391|Leave|04/27/2020|04/30/2020|4|Requesting leave days for new baby born|7
90006988|Leave|04/20/2020|04/21/2020|2|Dear Mas Tommy,
Herewith I would like to ask your approval for my leave which will be taken on 20 - 21 April 2020 (2 days of leave). I take this leave because of I need to attend the family wedding out of town along with visiting my extended family before Ramadhan in my hometown. 

Your approval will be highly appreciated.

Thank you,
Andrian Indrawan|2
90005573|Leave|04/09/2020|04/09/2020|1||4
90007088|Leave|04/08/2020|04/09/2020|2||9
90004055|Leave|04/08/2020|04/09/2020|2|Leave for family's reason|6

I get this error:

ERROR:  missing data for column "g"  (seg0 slice1 192.168.56.111:6000 pid=4486)
DETAIL:  External table ex_absdata, line 2 of gpfdist://192.168.56.111:10000/absdatasample.csv: "90007507|Leave|11/27/2020|11/27/2020|1|dear mas Andria,"

How can I fix this?

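【Comments】:

The file contains newlines or carriage returns and has no double quotes at all. That is not a valid CSV file.

@JonRoberts Yes, I think so. But I can't change or regenerate the data, because it is produced by the company's reporting system. Any suggestions for working around this?

【Answer 1】: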

You have a couple of options.

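    1. You can use the LOG ERRORS option on the external table, so that the good rows are loaded and the bad rows are rejected. But your file has a lot of embedded newlines.
    2. Fix the file.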
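To gauge how much of the file the first option would throw away, a rough pre-check can count the physical lines that do not have the expected number of fields. This is only a sketch: the function name `split_good_and_bad` is made up for illustration, and "bad" here simply means "not exactly seven pipe-separated fields", which approximates but does not reproduce Greenplum's own CSV parsing.

```python
def split_good_and_bad(lines, expected=7, delimiter="|"):
    """Partition physical lines by whether they have exactly `expected` fields.

    Returns two lists of (line_number, text) tuples. This is a heuristic,
    not Greenplum's parser: quoting is ignored entirely.
    """
    good, bad = [], []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        target = good if len(line.split(delimiter)) == expected else bad
        target.append((number, line))
    return good, bad


if __name__ == "__main__":
    # Tiny inline sample (3 columns) so the sketch runs without any file.
    sample = ["id|comment|n\n", "1|dear,\n", "\n", "thanks|8\n", "2||4\n"]
    good, bad = split_good_and_bad(sample, expected=3)
    print(len(good), "good lines;", len(bad), "lines a per-line loader would likely reject")
```

On the sample file from the question, every fragment of a multi-line comment would land in the "bad" list, which is why repairing the file is the more attractive option.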

I found an example of this kind of fix, then applied it to your sample file:

awk -F\| '{ while (NF < 7 || $NF == "") { brokenline = $0; if ((getline) <= 0) break; $0 = brokenline $0 } print }' load.txt

Employee ID|Time Type|Start Date|End Date|Number Of Days|Comment|Ration of Leave
90007507|Leave|11/27/2020|11/27/2020|1|dear mas Andria,seek for approval for 1 day off. Thank you.|8
90007507|Leave|05/08/2020|05/08/2020|1|dear mas Andria, kindly approve 1 day leave at 8th May. Thank you.|5
90006391|Leave|04/27/2020|04/30/2020|4|Requesting leave days for new baby born|7
90006988|Leave|04/20/2020|04/21/2020|2|Dear Mas Tommy,Herewith I would like to ask your approval for my leave which will be taken on 20 - 21 April 2020 (2 days of leave). I take this leave because of I need to attend the family wedding out of town along with visiting my extended family before Ramadhan in my hometown. Your approval will be highly appreciated.Thank you,Andrian Indrawan|2
90005573|Leave|04/09/2020|04/09/2020|1||4
90007088|Leave|04/08/2020|04/09/2020|2||9
90004055|Leave|04/08/2020|04/09/2020|2|Leave for family's reason|6

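Another possible fix is to write a perl, awk, sed, python, etc. script (like the one above) that walks through the file and turns it into a real CSV with double quotes. If you did that, you could keep the embedded newlines. I think that is overkill, because whatever value you can get out of this unstructured data does not depend on the newlines.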
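For reference, the double-quoted variant might look roughly like the Python sketch below. It uses the same heuristic as the awk one-liner (a record is complete once it has at least seven fields and a non-empty last field), so it assumes that comments never contain a pipe and that the last column is never empty. The function names are hypothetical, and the result has not been tested against Greenplum.

```python
import csv
import io


def rebuild_records(lines, expected=7, delimiter="|"):
    """Yield one list of fields per logical record, rejoining broken lines.

    Assumes no field contains the delimiter and the last field is never empty.
    """
    buffer = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if buffer is None and line == "":
            continue  # skip blank lines that fall between records
        # Keep the embedded newline when gluing a continuation line on.
        buffer = line if buffer is None else buffer + "\n" + line
        fields = buffer.split(delimiter)
        if len(fields) >= expected and fields[-1] != "":
            yield fields
            buffer = None
    if buffer is not None:
        # Incomplete trailing record: pass it through unchanged.
        yield buffer.split(delimiter)


def to_quoted_csv(lines, expected=7, delimiter="|"):
    """Return the repaired data as CSV text, quoting fields that need it."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter,
                        quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    for fields in rebuild_records(lines, expected, delimiter):
        writer.writerow(fields)
    return out.getvalue()


if __name__ == "__main__":
    # Tiny inline sample (3 columns) so the sketch runs without any file.
    sample = ["id|comment|n\n", "1|dear,\n", "\n", "thanks|8\n", "2||4\n"]
    print(to_quoted_csv(sample, expected=3), end="")
```

Fields with embedded newlines come out wrapped in double quotes, which is the form a CSV parser needs in order to treat the line breaks as data rather than as row boundaries.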

【Comments】:

Nice. That is a good approach, and it solved my problem. Thank you very much for your help. Nice to meet you.
