ClickHouse: Data Input and Output Formats in Practice
Posted by 扫地增
This article is fairly long, so a table of contents is provided to help you jump to the part you need:
Contents
- TabSeparated (TSV): tab-delimited (\t) output
- TabSeparatedRaw(TSVRaw)
- TabSeparatedWithNames (TSVWithNames)
- TabSeparatedWithNamesAndTypes (TSVWithNamesAndTypes)
- TSKV
- CSV
- CSVWithNames
- JSON
- JSONString
- JSONAsString
- JSONCompact
- JSONCompactString
- JSONEachRow
- JSONStringEachRow and JSONCompactStringEachRow
- JSONCompactEachRow
- JSONEachRowWithProgress
- JSONStringsEachRowWithProgress
- JSONCompactEachRowWithNamesAndTypes
- JSONCompactStringEachRowWithNamesAndTypes
- Native
- Null
- Pretty
- PrettyCompact
- PrettyCompactMonoBlock
- PrettyNoEscapes
- PrettyCompactNoEscapes
- PrettySpace
- PrettySpaceNoEscapes
- RowBinary
- RowBinaryWithNamesAndTypes
- Values
- Vertical
- VerticalRaw
- XML
- CapnProto
- Protobuf
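- Other
ClickHouse can accept and return data in many formats. A format can be used in INSERT queries (input), in SELECT queries (output), or in both.
The table below lists the supported formats and whether each one can be used for input (INSERT) and output (SELECT):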
Format | Input | Output |
---|---|---|
TabSeparated | ✔ | ✔ |
TabSeparatedRaw | ✔ | ✔ |
TabSeparatedWithNames | ✔ | ✔ |
TabSeparatedWithNamesAndTypes | ✔ | ✔ |
Template | ✔ | ✔ |
TemplateIgnoreSpaces | ✔ | ✗ |
CSV | ✔ | ✔ |
CSVWithNames | ✔ | ✔ |
CustomSeparated | ✔ | ✔ |
Values | ✔ | ✔ |
Vertical | ✗ | ✔ |
VerticalRaw | ✗ | ✔ |
JSON | ✗ | ✔ |
JSONAsString | ✔ | ✗ |
JSONString | ✗ | ✔ |
JSONCompact | ✗ | ✔ |
JSONCompactString | ✗ | ✔ |
JSONEachRow | ✔ | ✔ |
JSONEachRowWithProgress | ✗ | ✔ |
JSONStringEachRow | ✔ | ✔ |
JSONStringEachRowWithProgress | ✗ | ✔ |
JSONCompactEachRow | ✔ | ✔ |
JSONCompactEachRowWithNamesAndTypes | ✔ | ✔ |
JSONCompactStringEachRow | ✔ | ✔ |
JSONCompactStringEachRowWithNamesAndTypes | ✔ | ✔ |
TSKV | ✔ | ✔ |
Pretty | ✗ | ✔ |
PrettyCompact | ✗ | ✔ |
PrettyCompactMonoBlock | ✗ | ✔ |
PrettyNoEscapes | ✗ | ✔ |
PrettySpace | ✗ | ✔ |
Protobuf | ✔ | ✔ |
ProtobufSingle | ✔ | ✔ |
Avro | ✔ | ✔ |
AvroConfluent | ✔ | ✗ |
Parquet | ✔ | ✔ |
Arrow | ✔ | ✔ |
ArrowStream | ✔ | ✔ |
ORC | ✔ | ✗ |
RowBinary | ✔ | ✔ |
RowBinaryWithNamesAndTypes | ✔ | ✔ |
Native | ✔ | ✔ |
Null | ✗ | ✔ |
XML | ✗ | ✔ |
CapnProto | ✔ | ✗ |
LineAsString | ✔ | ✗ |
RawBLOB | ✔ | ✔ |
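TabSeparated (TSV): tab-delimited (\t) output
In the TabSeparated format, data is written by row, and the values in each row are separated by tabs. Every value is followed by a tab (\t), except the last value in a row, which is followed by a line feed (\n). Strict Unix line feeds are assumed everywhere, and the last row must also end with a line feed. Values are written as text, without enclosing quotation marks, and with special characters escaped.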
localhost :) select user_id,user_name,student_id,grade,province,city from user_info limit 2 format TabSeparated;
SELECT
user_id,
user_name,
student_id,
grade,
province,
city
FROM user_info
LIMIT 2
FORMAT TabSeparated
Query id: 3d08d32f-57c8-4c73-a602-c8ff9b4cfac6
200000000000000000470669 小明 100000000000000000470669 高三 北京 海淀
400000000000000000470725 小王 200000000000000000470725 高三 @ @
2 rows in set. Elapsed: 0.008 sec.
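Pasting the result into a Java string literal confirms that the delimiters are the tab and line feed characters described in the official documentation.
This format is also available under the name TSV.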
localhost :) select user_id,user_name,student_id,grade,province,city from user_info limit 2 format TSV;
SELECT
user_id,
user_name,
student_id,
grade,
province,
city
FROM user_info
LIMIT 2
FORMAT TSV
Query id: 9d573f28-1c46-4c9f-9840-138e904081e9
200000000000000000470669 小明 100000000000000000470669 高三 北京 海淀
400000000000000000470725 小王 200000000000000000470725 高三 @ @
2 rows in set. Elapsed: 0.007 sec.
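Pasting this result into a Java string literal shows the same tab and line feed delimiters.
The TabSeparated format is convenient for processing data with custom programs and scripts. It is the default format in the HTTP interface and in the batch mode of the command-line client. It also makes it possible to transfer data between different DBMSs. For example, you can take a dump from MySQL and load it into ClickHouse, or the other way around.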
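The escaping rule described above can be sketched in a few lines of Python. This is a minimal illustration and not ClickHouse's own implementation: the helper names (`escape_field`, `unescape_field`, `format_row`, `parse_row`) are made up for this example, and only the backslash, tab, line feed, carriage return, and NULL (`\N`) cases are handled.

```python
# Escape sequences handled by this sketch (a subset of what ClickHouse accepts).
_ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r"}
_UNESCAPES = {"\\": "\\", "t": "\t", "n": "\n", "r": "\r"}


def escape_field(value):
    """Escape one value; None is written as \\N (NULL)."""
    if value is None:
        return "\\N"
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def unescape_field(text):
    """Reverse escape_field for the sequences handled above."""
    if text == "\\N":
        return None
    out = []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "")
            # Unknown sequences are kept as they are.
            out.append(_UNESCAPES.get(nxt, "\\" + nxt))
        else:
            out.append(ch)
    return "".join(out)


def format_row(values):
    """Join escaped values with tabs and end the row with a line feed."""
    return "\t".join(escape_field(v) for v in values) + "\n"


def parse_row(line):
    """Split one row on real tabs, then decode each field."""
    return [unescape_field(f) for f in line.rstrip("\n").split("\t")]


row = ["小明", "a\tb", None]
line = format_row(row)
print(repr(line))
print(parse_row(line))
```

Because a tab inside a value is written as the two characters `\t`, splitting a row on real tab characters is always safe, which is what makes the format easy to handle from scripts.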
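TabSeparatedRaw (TSVRaw)
Differs from the TabSeparated format in that rows are written without escaping. When data is parsed in the TabSeparatedRaw format, no field may contain a tab or a line feed.
This format is also available under the name TSVRaw.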
localhost :) select user_id,user_name,student_id,grade,province,city from test.dm_hfs_user_action_label_v4_180_string limit 2 format TSVRaw;
SELECT
user_id,
user_name,
student_id,
grade,
province,
city
FROM test.dm_hfs_user_action_label_v4_180_string
LIMIT 2
FORMAT TSVRaw
Query id: 47c09efe-1adf-495d-9b0f-eaf5270ac881
200000000000000000470669 小明 100000000000000000470669 高三 北京 海淀
400000000000000000470725 小王 200000000000000000470725 高三 @ @
2 rows in set. Elapsed: 0.005 sec.
localhost :) select user_id,user_name,student_id,grade,province,city from test.dm_hfs_user_action_label_v4_180_string limit 2 format TabSeparatedRaw;
SELECT
user_id,
user_name,
student_id,
grade,
province,
city
FROM test.dm_hfs_user_action_label_v4_180_string
LIMIT 2
FORMAT TabSeparatedRaw
Query id: d2881684-21de-41ec-944c-a1b8d64c9464
200000000000000000470669 小明 100000000000000000470669 高三 北京 海淀
400000000000000000470725 小王 200000000000000000470725 高三 @ @
2 rows in set. Elapsed: 0.008 sec.
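TabSeparatedWithNames (TSVWithNames)
Differs from the TabSeparated format in that the column names are written in the first row.
During parsing, the first row is ignored completely. You cannot use the column names to determine column positions or to check their correctness. (Support for parsing the header row may be added in the future.)
This format is also available under the name TSVWithNames.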
localhost :) select user_id,user_name,student_id,grade,province,city from test.dm_hfs_user_action_label_v4_180_string limit 2 format TSVWithNames;
SELECT
user_id,
user_name,
student_id,
grade,
province,
city
FROM test.dm_hfs_user_action_label_v4_180_string
LIMIT 2
FORMAT TSVWithNames
Query id: 6c2bdb3e-b2ad-49c8-bebe-b6e8735f5ce2
user_id user_name student_id grade province city
200000000000000000470669 小明 100000000000000000470669 高三 北京 海淀
400000000000000000470725 小王 200000000000000000470725 高三 @ @
2 rows in set. Elapsed: 0.008 sec.
localhost :) select user_id,user_name,student_id,grade,province,city from test.dm_hfs_user_action_label_v4_180_string limit 2 format TabSeparatedWithNames;
SELECT
user_id,
user_name,
student_id,
grade,
province,
city
FROM test.dm_hfs_user_action_label_v4_180_string
LIMIT 2
FORMAT TabSeparatedWithNames
Query id: 4e56deb1-4601-420c-bdbb-f49f256df5e5
user_id user_name student_id grade province city
200000000000000000470669 小明 100000000000000000470669 高三 北京 海淀
400000000000000000470725 小王 200000000000000000470725 高三 @ @
2 rows in set. Elapsed: 0.008 sec.
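TabSeparatedWithNamesAndTypes (TSVWithNamesAndTypes)
Differs from the TabSeparated format in that the column names are written in the first row and the column types in the second row. During parsing, the first and second rows are ignored completely.
This format is also available under the name TSVWithNamesAndTypes.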
localhost :) select user_id,user_name,student_id,grade,province,city from test.dm_hfs_user_action_label_v4_180_string limit 2 format TabSeparatedWithNamesAndTypes;
SELECT
user_id,
user_name,
student_id,
grade,
province,
city
FROM test.dm_hfs_user_action_label_v4_180_string
LIMIT 2
FORMAT TabSeparatedWithNamesAndTypes
Query id: 991f7a7b-3538-4263-ad27-0958f9b22539
user_id user_name student_id grade province city
String String String String String String
200000000000000000470669 小明 100000000000000000470669 高三 北京 海淀
400000000000000000470725 小王 200000000000000000470725 高三 @ @
2 rows in set. Elapsed: 0.006 sec.
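ClickHouse ignores the two header rows when it parses this format, but a client that reads the output can still use them. The sketch below is one possible way to do that in Python; the function name `parse_with_names_and_types` is hypothetical, the sample reuses three columns from the result above, and escape sequences are not decoded.

```python
def parse_with_names_and_types(text):
    """Return (columns, rows) from TabSeparatedWithNamesAndTypes text.

    columns is a list of (name, type) pairs; rows is a list of dicts.
    Escape sequences are not decoded in this sketch.
    """
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # the output ends with a trailing line feed
    names = lines[0].split("\t")   # first row: column names
    types = lines[1].split("\t")   # second row: column types
    columns = list(zip(names, types))
    rows = [dict(zip(names, line.split("\t"))) for line in lines[2:]]
    return columns, rows


sample = (
    "user_id\tuser_name\tgrade\n"
    "String\tString\tString\n"
    "200000000000000000470669\t小明\t高三\n"
)
columns, rows = parse_with_names_and_types(sample)
print(columns)
print(rows[0]["user_name"])
```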
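TSKV
Similar to TabSeparated, but each value is output in name=value form. Names are escaped in the same way as in the TabSeparated format, and the = symbol is escaped as well. NULL is output as \N. This format is inefficient when there are many small columns, and there is usually no reason to use it. It is supported for both output and parsing. When parsing, the values of different columns may appear in any order. Some values may be omitted, in which case they are treated as equal to their default values: zeros and empty strings. Complex default expressions that could be specified in the table are not supported.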
localhost :) SELECT event_time,action,user_id,user_name,school_id,school_name,province,city,district,lng,lat FROM fct_hfs.fct_hfs_action_log LIMIT 2 FORMAT TSKV;
SELECT
event_time,
action,
user_id,
user_name,
school_id,
school_name,
province,
city,
district,
lng,
lat
FROM fct_hfs.fct_hfs_action_log
LIMIT 2
FORMAT TSKV
Query id: 37c93b78-aa5a-4ce1-8649-4a1aa15a5470
event_time=2020-12-12 15:54:02 action=@ user_id= user_name=@ school_id= school_name=@ province=@ city=@ district=@ lng=0.0 lat=0.0
event_time=2020-12-12 06:03:30 action=@ user_id= user_name=@ school_id= school_name=@ province=@ city=@ district=@ lng=0.0 lat=0.0
2 rows in set. Elapsed: 0.018 sec.
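To make the "any order, omitted values fall back to defaults" behaviour concrete, here is a small Python sketch of a TSKV line parser. It is an illustration under simplifying assumptions: `parse_tskv_line` is a made-up helper, escape sequences are not decoded, and a name is assumed not to contain an escaped `=`.

```python
def parse_tskv_line(line, defaults):
    """Parse one TSKV line into a dict, filling omitted columns from defaults."""
    row = dict(defaults)
    for pair in line.rstrip("\n").split("\t"):
        if not pair:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        name, _, value = pair.partition("=")
        row[name] = value
    return row


defaults = {"event_time": "", "action": "", "user_id": "", "lat": ""}
line = "lat=0.0\taction=@\tevent_time=2020-12-12 15:54:02\n"
row = parse_tskv_line(line, defaults)
print(row)
```

The columns arrive in a different order than in the table definition and `user_id` is missing entirely, yet the resulting row is complete.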
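CSV
Comma-separated values format (RFC). When formatting, strings are enclosed in double quotes. A double quote inside a string is output as two double quotes in a row; there are no other rules for escaping characters. Dates and date-times are also enclosed in double quotes, while numbers are output without quotes. Values are separated by a single delimiter character, which is , by default; see the format_csv_delimiter setting for how to change it. Rows are separated by the Unix line feed (LF). Arrays are serialized in CSV as follows: the array is first serialized to a string as in the TabSeparated format, and the resulting string is then output to CSV in double quotes. Tuples in CSV format are serialized as separate columns (that is, their nesting in the tuple is lost).
When parsing, every value can be given either with or without quotes. Both double and single quotes are supported. Strings can also be written without quotes, in which case they are parsed up to the delimiter character or a line feed (CR or LF). In violation of the RFC, leading and trailing spaces and tabs are ignored when parsing unquoted strings. For line feeds, the Unix (LF), Windows (CR LF), and Mac OS Classic (CR LF) types are all supported. NULL is output as \N. The CSV format outputs totals and extremes in the same way as TabSeparated.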
localhost :) SELECT event_time,action,user_id,user_name,school_id,school_name,province,city,district,lng,lat FROM fct_hfs.fct_hfs_action_log LIMIT 2 FORMAT CSV;
SELECT
event_time,
action,
user_id,
user_name,
school_id,
school_name,
province,
city,
district,
lng,
lat
FROM fct_hfs.fct_hfs_action_log
LIMIT 2
FORMAT CSV
Query id: 74af53aa-dcea-4943-829d-95707eb5ed81
"2020-12-12 15:54:02","@","","@","","@","@","@","@","0.0","0.0"
"2020-12-12 06:03:30","@","","@","","@","@","@","@","0.0","0.0"
2 rows in set. Elapsed: 0.016 sec.
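The formatting rules above (strings quoted, inner double quotes doubled, numbers unquoted, NULL as `\N`) can be sketched in Python as follows. The helpers `csv_field` and `csv_row` are hypothetical names for this illustration; arrays and tuples are not covered.

```python
import csv
import io


def csv_field(value):
    """Format one value following the CSV rules described in this section."""
    if value is None:
        return "\\N"  # NULL
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return str(value)  # numbers are written without quotes
    # Strings (and dates passed as strings) are quoted; inner quotes are doubled.
    return '"' + str(value).replace('"', '""') + '"'


def csv_row(values, delimiter=","):
    """Join formatted fields with the delimiter and end the row with LF."""
    return delimiter.join(csv_field(v) for v in values) + "\n"


print(csv_row(["2020-12-12 15:54:02", 'say "hi"', 42, 0.5, None]), end="")

# Python's standard csv module understands the same quote-doubling convention.
parsed = next(csv.reader(io.StringIO(csv_row(["a,b", 'x"y']))))
print(parsed)
```

Doubling the quote character is the only escaping rule, so a delimiter or line feed inside a quoted string needs no special treatment.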
CSVWithNames
Also outputs a header row, in the same way as TabSeparatedWithNames.
localhost :) SELECT event_time,action,user_id,user_name,school_id,school_name,province,city,district,lng,lat FROM fct_hfs.fct_hfs_action_log LIMIT 2 FORMAT CSVWithNames;
SELECT
event_time,
action,
user_id,
user_name,
school_id,
school_name,
province,
city,
district,
lng,
lat
FROM fct_hfs.fct_hfs_action_log
LIMIT 2
FORMAT CSVWithNames
Query id: cff9ada4-e963-4047-b510-0fa1a5151b9c
"event_time","action","user_id","user_name","school_id","school_name","province","city","district","lng","lat"
"2020-12-12 15:54:02","@","","@","","@","@","@","@","0.0","0.0"
"2020-12-12 06:03:30","@","","@","","@","@","@","@","0.0","0.0"
2 rows in set. Elapsed: 0.017 sec.
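JSON
Outputs data in JSON format. Besides the data itself, it outputs the column names and types along with some additional information: the total number of output rows, and the number of rows that could have been output if there were no LIMIT.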
localhost :) SELECT event_time,action,user_id,user_name,school_id,school_name,province,city,district,lng,lat FROM fct_hfs.fct_hfs_action_log LIMIT 1 FORMAT JSON;
SELECT
event_time,
action,
user_id,
user_name,
school_id,
school_name,
province,
city,
district,
lng,
lat
FROM fct_hfs.fct_hfs_action_log
LIMIT 1
FORMAT JSON
Query id: 3343d9f1-8da3-4d42-bb50-1d64afce6911
{
"meta":
[
{
"name": "event_time",
"type": "DateTime"
},
{
"name": "action",
"type": "String"
},
{
"name": "user_id",
"type": "String"
},
{
"name": "user_name",
"type": "String"
},
{
"name": "school_id",
"type": "String"
},
{
"name": "school_name",
"type": "String"
},
{
"name": "province",
"type": "String"
},
{
"name": "city",
"type": "String"
},
{
"name": "district",
"type": "String"
},
{
"name": "lng",
"type": "String"
},
{
"name": "lat",
"type": "String"
}
],
"data":
[
{
"event_time": "2020-12-12 15:54:02",
"action": "@",
"user_id": "",
"user_name": "@",
"school_id": "",
"school_name": "@",
"province": "@",
"city": "@",
"district": "@",
"lng": "0.0",
"lat": "0.0"
}
],
"rows": 1,
"rows_before_limit_at_least": 1,
"statistics":
{
"elapsed": 0.007974414,
"rows_read": 1,
"bytes_read": 106
}
}
1 rows in set. Elapsed: 0.022 sec.
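The JSON output is compatible with JavaScript. To ensure this, some characters are additionally escaped: the slash / is escaped as \/, and the alternative line breaks U+2028 and U+2029, which break some browsers, are escaped as \uXXXX. ASCII control characters are escaped: backspace, form feed, line feed, carriage return, and horizontal tab are replaced with \b, \f, \n, \r, \t, and the remaining bytes in the 00-1F range are written as \uXXXX sequences. Invalid UTF-8 sequences are changed to the replacement character, so the output text consists of valid UTF-8 sequences. For compatibility with JavaScript, Int64 and UInt64 integers are enclosed in double quotes by default. To remove the quotes, set the configuration parameter output_format_json_quote_64bit_integers to 0.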
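rows: the total number of output rows.
rows_before_limit_at_least: the minimal number of rows there would have been without LIMIT. It is output only if the query contains LIMIT. If the query contains GROUP BY, rows_before_limit_at_least is the exact number of rows there would have been without LIMIT.
totals: total values (when WITH TOTALS is used).
extremes: extreme values (when the extremes setting is 1).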
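This format is suitable only for outputting a query result, not for parsing input (inserting data into a table).
ClickHouse supports NULL, which is displayed as null in the JSON output. To enable +nan, -nan, +inf, and -inf values in the output, set output_format_json_quote_denormals to 1.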
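On the client side, the meta section can be used to undo the quoting of 64-bit integers. The following Python sketch shows one way this might look; `load_rows` is a hypothetical helper, the `cnt` column is invented for the example, and wrapped types such as Nullable(UInt64) are not handled.

```python
import json

INT64_TYPES = {"Int64", "UInt64"}


def load_rows(payload):
    """Parse FORMAT JSON output and turn quoted 64-bit integers back into int."""
    doc = json.loads(payload)
    int_columns = {c["name"] for c in doc["meta"] if c["type"] in INT64_TYPES}
    rows = []
    for raw in doc["data"]:
        row = dict(raw)
        for name in int_columns:
            # Quoted by default; already a number if the quoting setting is 0.
            if isinstance(row[name], str):
                row[name] = int(row[name])
        rows.append(row)
    return rows, doc["rows"]


payload = """
{
  "meta": [
    {"name": "event_time", "type": "DateTime"},
    {"name": "cnt", "type": "UInt64"}
  ],
  "data": [
    {"event_time": "2020-12-12 15:54:02", "cnt": "18446744073709551615"}
  ],
  "rows": 1,
  "rows_before_limit_at_least": 1
}
"""
rows, total = load_rows(payload)
print(rows[0]["cnt"], total)
```

Python integers have arbitrary precision, so the value survives the conversion. A JavaScript number cannot represent every 64-bit integer exactly, which is why the quoted form is the safer default there.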
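JSONString
Differs from JSON only in that data fields are output as strings rather than as typed JSON values. The format is supported only in newer versions: the ClickHouse server version 20.11.3 used by the author does not support it, so readers can try it on a later version. The official example is shown below.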
{
"meta":
[
{
"name": "'hello'",
"type": "String"
},
{
"name": "multiply(42, number)",
"type": "UInt64"
},
{
"name": "range(5)",
"type": "Array(UInt8)"
}
],
"data":
[
{
"'hello'": "hello",
"multiply(42, number)": "0",
"range(5)": "[0,1,2,3,4]"
},
{
"'hello'": "hello",
"multiply(42, number)": "42",
"range(5)": "[0,1,2,3,4]"
},
{
"'hello'": "hello",
"multiply(42, number)": "84",
"range(5)": "[0,1,2,3,4]"
}
],
"rows": 3,
"rows_before_limit_at_least": 3
}
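Since every field arrives as a string, a consumer has to rely on the meta section to recover typed values. The Python sketch below illustrates the idea on one row of the official example; `convert` and `typed_rows` are hypothetical helpers that cover only integer, float, and integer-array columns and return everything else unchanged.

```python
import json
import re

_INT = re.compile(r"U?Int(8|16|32|64)")
_INT_ARRAY = re.compile(r"Array\(U?Int(8|16|32|64)\)")


def convert(value, ch_type):
    """Convert one string field using its ClickHouse type name."""
    if _INT.fullmatch(ch_type):
        return int(value)
    if ch_type in ("Float32", "Float64"):
        return float(value)
    if _INT_ARRAY.fullmatch(ch_type):
        # An integer array such as [0,1,2] is also valid JSON text.
        return json.loads(value)
    return value


def typed_rows(doc):
    """Apply convert to every field, looking up the type in the meta section."""
    types = {c["name"]: c["type"] for c in doc["meta"]}
    return [
        {name: convert(value, types[name]) for name, value in row.items()}
        for row in doc["data"]
    ]


doc = {
    "meta": [
        {"name": "'hello'", "type": "String"},
        {"name": "multiply(42, number)", "type": "UInt64"},
        {"name": "range(5)", "type": "Array(UInt8)"},
    ],
    "data": [
        {"'hello'": "hello", "multiply(42, number)": "42", "range(5)": "[0,1,2,3,4]"}
    ],
}
print(typed_rows(doc))
```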
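JSONAsString
Put simply, this format inserts JSON data into a table column as a plain string. As a consequence, it can only be parsed into a table that has a single column of type String; any other columns must be declared DEFAULT or MATERIALIZED, or be omitted. Once the whole JSON object has been collected as a string, it can be processed with the JSON functions.
In this format, a single JSON object is interpreted as a single value. Several JSON objects can be inserted at once by separating them with commas, and the database then treats them as separate rows.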
DROP TABLE IF EXISTS test.json_as_string;
CREATE TABLE IF NOT EXISTS test.json_as_string (
json String)
ENGINE = Memory;
INSERT INTO test.json_as_string
FORMAT JSONAsString {"foo":{"bar":{"x":"y"},"baz":1}},{},{"any json stucture":1};
localhost :) SELECT * FROM test.json_as_string;
SELECT *
FROM test.json_as_string
Query id: bb32f6e0-5fdd-45dc-868b-292c5e0bd47e
┌─json──────────────────────────────┐
│ {"foo":{"bar":{"x":"y"},"baz":1}} │
│ {} │
│ {"any json stucture":1} │
└───────────────────────────────────┘
3 rows in set. Elapsed: 0.003 sec.
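The way one payload turns into three rows can be imitated on the client side with Python's standard json module. This is only a sketch of the splitting idea, not how ClickHouse implements it; `split_json_objects` is a made-up name, and the payload is the one from the INSERT statement above.

```python
import json


def split_json_objects(text):
    """Yield the source text of each top-level JSON object in text."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # Skip whitespace and the commas that may separate objects.
        while pos < len(text) and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(text):
            return
        # raw_decode parses one value and reports where it ended.
        _, end = decoder.raw_decode(text, pos)
        yield text[pos:end]
        pos = end


payload = '{"foo":{"bar":{"x":"y"},"baz":1}},{},{"any json stucture":1}'
for obj in split_json_objects(payload):
    print(obj)
```

Each yielded string corresponds to one row of the json column in the result shown above.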
JSONCompact
Differs from the JSON format only in that data rows are output as arrays rather than as objects. A JSONCompact example:
localhost :) select * from label limit 1 format JSONCompact
SELECT *
FROM label
LIMIT 1
FORMAT JSONCompact
Query id: a8185efc-5b6a-4a3b-b786-ba236b6eb724
{
"meta":
[
{
"name": "user_id",
"type": "String"
},
{
"name": "label_name",
"type": "String"
},
{
"name": "childs",
"type": "Array(String)"
}
],
"data":
[
["1", "11,22", ["小菜鸟","四年级"