OOM error when setting up cache dictionaries in clickhouse

Posted: 2021-02-08 12:58:02

Question:

Running a CentOS 7.6 VM with 16 GB of RAM, I get the following error:

Code: 32. DB::Exception: Attempt to read after eof: while receiving packet from localhost:9000

when querying a cache dictionary with 75 columns and 100000 rows (126 MB). This is unexpected, because the dictionary seems small. I launch the query with clickhouse-client --query:

SELECT
    dictGet('CacheDictionary', 'date', toUInt64(number)) AS date,
    SUM(dictGet('CacheDictionary', 'filterColumn', toUInt64(number))) AS val,
    AVG(dictGet('CacheDictionary', 'filterColumn', toUInt64(number))) AS avg
FROM system.numbers(1, 100000)
GROUP BY date

Since I only query 3 columns, shouldn't only 3*numberOfRows values be cached? Checking dmesg, I find the corresponding OOM error:

[613700.447158] CPU: 1 PID: 5859 Comm: node Not tainted 3.10.0-957.5.1.el7.x86_64 #1
[613700.449266] Hardware name: Scaleway SCW-GP1-S, Bios 0.0.0 02/06/2015
[613700.451020] Call Trace:
[613700.451708]  [<ffffffff9cf61e41>] dump_stack+0x19/0x1b
[613700.452792]  [<ffffffff9cf5c86a>] dump_header+0x90/0x229
[613700.453875]  [<ffffffff9cb00f3b>] ? cred_has_capability+0x6b/0x120
[613700.455220]  [<ffffffff9c9ba524>] oom_kill_process+0x254/0x3d0
[613700.456475]  [<ffffffff9cb0101e>] ? selinux_capable+0x2e/0x40
[613700.457641]  [<ffffffff9c9bad66>] out_of_memory+0x4b6/0x4f0
[613700.458844]  [<ffffffff9cf5d36e>] __alloc_pages_slowpath+0x5d6/0x724
[613700.459965]  [<ffffffff9c9c1145>] __alloc_pages_nodemask+0x405/0x420
[613700.460773]  [<ffffffff9ca0e0a8>] alloc_pages_current+0x98/0x110
[613700.461521]  [<ffffffff9c9b6387>] __page_cache_alloc+0x97/0xb0
[613700.462156]  [<ffffffff9c9b8fe8>] filemap_fault+0x298/0x490
[613700.462849]  [<ffffffffc0461186>] ext4_filemap_fault+0x36/0x50 [ext4]
[613700.463623]  [<ffffffff9c9e458a>] __do_fault.isra.59+0x8a/0x100
[613700.464350]  [<ffffffff9c9e4b3c>] do_read_fault.isra.61+0x4c/0x1b0
[613700.465079]  [<ffffffff9c9e94e4>] handle_pte_fault+0x2f4/0xd10
[613700.465756]  [<ffffffff9c9ec01d>] handle_mm_fault+0x39d/0x9b0
[613700.466454]  [<ffffffff9cf6f5e3>] __do_page_fault+0x203/0x500
[613700.467136]  [<ffffffff9c9f1a37>] ? do_munmap+0x327/0x480
[613700.467749]  [<ffffffff9cf6f9c6>] trace_do_page_fault+0x56/0x150
[613700.468410]  [<ffffffff9cf6ef42>] do_async_page_fault+0x22/0xf0
[613700.469126]  [<ffffffff9cf6b788>] async_page_fault+0x28/0x30
[613700.469766] Mem-Info:
[613700.470040] active_anon:5266300 inactive_anon:2788185 isolated_anon:0
 active_file:184 inactive_file:53 isolated_file:0
 unevictable:0 dirty:11 writeback:223 unstable:0
 slab_reclaimable:9023 slab_unreclaimable:9736
 mapped:1319 shmem:1310 pagetables:18040 bounce:0
 free:49178 free_pcp:61 free_cma:0
[613700.473741] Node 0 DMA free:14912kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15004kB managed:14912kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[613700.478645] lowmem_reserve[]: 0 2827 31990 31990
[613700.479345] Node 0 DMA32 free:122332kB min:5788kB low:7232kB high:8680kB active_anon:1511568kB inactive_anon:1243404kB active_file:84kB inactive_file:140kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3126684kB managed:2895228kB mlocked:0kB dirty:24kB writeback:272kB mapped:148kB shmem:216kB slab_reclaimable:3228kB slab_unreclaimable:3372kB kernel_stack:144kB pagetables:7784kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:47688 all_unreclaimable? yes
[613700.484885] lowmem_reserve[]: 0 0 29163 29163
[613700.485457] Node 0 Normal free:59468kB min:59716kB low:74644kB high:89572kB active_anon:19553632kB inactive_anon:9909336kB active_file:652kB inactive_file:72kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:30408704kB managed:29866176kB mlocked:0kB dirty:20kB writeback:620kB mapped:5128kB shmem:5024kB slab_reclaimable:32864kB slab_unreclaimable:35572kB kernel_stack:3728kB pagetables:64376kB unstable:0kB bounce:0kB free_pcp:244kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:7605 all_unreclaimable? yes
[613700.490897] lowmem_reserve[]: 0 0 0 0
[613700.491483] Node 0 DMA: 2*4kB (U) 1*8kB (U) 1*16kB (U) 3*32kB (UM) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 2*1024kB (UM) 2*2048kB (M) 2*4096kB (M) = 14912kB
[613700.493471] Node 0 DMA32: 134*4kB (UEM) 83*8kB (UEM) 149*16kB (UEM) 74*32kB (UE) 41*64kB (UE) 27*128kB (UEM) 13*256kB (UE) 8*512kB (UEM) 101*1024kB (UM) 0*2048kB 0*4096kB = 122880kB
[613700.495884] Node 0 Normal: 408*4kB (UE) 320*8kB (UE) 301*16kB (UE) 252*32kB (UE) 184*64kB (UEM) 130*128kB (UE) 34*256kB (UE) 2*512kB (UE) 5*1024kB (M) 0*2048kB 0*4096kB = 60336kB
[613700.498284] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[613700.499258] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[613700.500155] 152719 total pagecache pages
[613700.500582] 151262 pages in swap cache
[613700.500989] Swap cache stats: add 25089927, delete 24939368, find 1630185/1802893
[613700.501787] Free swap  = 0kB
[613700.502119] Total swap = 1048572kB
[613700.502538] 8387598 pages RAM
[613700.502891] 0 pages HighMem/MovableOnly
[613700.503352] 193519 pages reserved
[613700.503741] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[613700.504600] [ 1773]     0  1773    63151     1390     100      153             0 systemd-journal
[613700.505791] [ 1812]     0  1812     9945        2      25      538         -1000 systemd-udevd
[613700.506886] [ 2179]     0  2179     6594       41      19       35             0 systemd-logind
[613700.507899] [ 2185]    81  2185    16581       57      32       80          -900 dbus-daemon
[613700.508887] [ 3808]     0  3808    26865        2      50      498             0 dhclient
[613700.509845] [ 3965]     0  3965    22931       12      45      251             0 master
[613700.510888] [ 3967]    89  3967    22974       12      45      246             0 qmgr
[613700.511796] [ 4041]     0  4041    90366      307     113      577             0 rsyslogd
[613700.512756] [ 4060]     0  4060     2476        1      10       32             0 agetty
[613700.513715] [ 4061]     0  4061     2476        1      10       31             0 agetty
[613700.514736] [ 4064]     0  4064     6476        1      18       52             0 atd
[613700.515708] [ 4065]     0  4065     6533      105      18       65             0 crond
[613700.516641] [ 4110]     0  4110    28215       13      57      243         -1000 sshd
[613700.517563] [ 5851]     0  5851     3247        0      11       47             0 sh
[613700.518476] [ 5859]     0  5859   380762     1460     310    10577             0 node
[613700.519397] [ 6163]     0  6163     3841       52      12       65             0 bash
[613700.520320] [24945]     0 24945     3813        1      12      103             0 bash
[613700.521241] [26835]     0 26835    38743       56      77      850             0 sshd
[613700.522116] [26838]     0 26838     3779        2      12       65             0 bash
[613700.523063] [26848]     0 26848     3314       40      13       98             0 bash
[613700.523971] [26898]     0 26898   208799     8732     320     5172             0 node
[613700.524890] [26914]     0 26914   218819     1071     149     2602             0 node
[613700.525827] [30635]    89 30635    22438       10      46      240             0 pickup
[613700.526717] [31250]     0 31250     5768        1      16       56             0 anacron
[613700.527670] [32757]     0 32757   413109      921     220    12462             0 python3
[613700.528622] [ 1083]   999  1014 11454714  7881670   16173   207603             0 TCPHandler
[613700.529600] [ 1174]     0  1174     1941       18       9        0             0 sleep
[613700.530531] [ 1344]     0  1344   123706     6927     125        0             0 clickhouse-clie
[613700.531556] Out of memory: Kill process 1083 (TCPHandler) score 958 or sacrifice child
[613700.532483] Killed process 1083 (TCPHandler) total-vm:45818856kB, anon-rss:31526680kB, file-rss:0kB, shmem-rss:0kB
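The last line of the dmesg output carries the numbers that matter most: how much memory the killed process held. As a small illustration (the helper name `parse_oom_kill` is made up for this sketch), the counters can be pulled out and converted from the kernel's kB units to GiB:

```python
import re

# Last line of the dmesg output above, copied verbatim.
KILL_LINE = (
    "[613700.532483] Killed process 1083 (TCPHandler) "
    "total-vm:45818856kB, anon-rss:31526680kB, file-rss:0kB, shmem-rss:0kB"
)

def parse_oom_kill(line):
    """Return pid, process name and memory counters (in GiB) from an OOM-killer line."""
    match = re.search(r"Killed process (\d+) \((.+?)\)", line)
    if match is None:
        raise ValueError("not an OOM 'Killed process' line")
    # Counters look like 'anon-rss:31526680kB'; the kernel's kB here means KiB.
    counters = {
        name: int(kib) / 1024**2
        for name, kib in re.findall(r"([\w-]+):(\d+)kB", line)
    }
    return {"pid": int(match.group(1)), "name": match.group(2), "gib": counters}

info = parse_oom_kill(KILL_LINE)
for name, gib in info["gib"].items():
    print(f"{info['name']} {name}: {gib:.2f} GiB")
```

Read this way, the killed process held roughly 30 GiB of anonymous memory when the kernel stepped in, far more than the 126 MB source file.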

And the corresponding stderr.log (nothing is written to clickhouse-server.err.log):

Processing configuration file '/etc/clickhouse-server/users.xml'.
Include not found: networks
Saved preprocessed configuration to '/var/lib/clickhouse//preprocessed_configs/users.xml'.
Processing configuration file '/etc/clickhouse-server/dicts/benchmark_dictionary.xml'.
Saved preprocessed configuration to '/var/lib/clickhouse//preprocessed_configs/dicts_benchmark_dictionary.xml'.
Processing configuration file '/etc/clickhouse-server/config.xml'.
Include not found: clickhouse_remote_servers
Include not found: clickhouse_compression
Saved preprocessed configuration to '/var/lib/clickhouse//preprocessed_configs/config.xml'.
Processing configuration file '/etc/clickhouse-server/dicts/benchmark_dictionary.xml'.
Saved preprocessed configuration to '/var/lib/clickhouse//preprocessed_configs/dicts_benchmark_dictionary.xml'.
Processing configuration file '/etc/clickhouse-server/dicts/benchmark_dictionary.xml'.
Saved preprocessed configuration to '/var/lib/clickhouse//preprocessed_configs/dicts_benchmark_dictionary.xml'.
Status file /var/run/clickhouse-server/clickhouse-server.pid already exists - unclean restart. Contents:
26053
Logging trace to /var/log/clickhouse-server/clickhouse-server.log
Logging errors to /var/log/clickhouse-server/clickhouse-server.err.log
Status file /var/lib/clickhouse/status already exists - unclean restart. Contents:
PID: 26053
Started at: 2021-02-08 03:26:05
Revision: 54438

Here is the dictionary XML file:

<yandex>
        <dictionary> 
        <name>CacheDictionary</name> 
        <source>
        <executable>
            <command>
            awk 'BEGIN { FS=","; OFS="," } { $1=$1; print }' /var/lib/clickhouse/user_files/testCache.csv 
            </command> 
            <format>CSV</format>
        </executable>
        </source>
        <layout>
        <cache>
        <size_in_cells>7500000</size_in_cells>
        </cache>
        </layout>
        <lifetime>0</lifetime>
        <structure>
            <id>
                <name>referentialKey</name>
            </id>

            <attribute>
                <name>date</name>
                <type>Date</type>
                <null_value></null_value>
            </attribute>
            <attribute>
                <name>integers</name>
                <type>UInt64</type>
                <null_value></null_value>
            </attribute>
            <attribute>
                <name>filterColumn</name>
                <type>Float64</type>
                <null_value></null_value>
            </attribute>
            
            <attribute>
                <name>random0</name>
                <type>String</type>
                <null_value></null_value>
            </attribute>

            ...
            
            <attribute>
                <name>random74</name>
                <type>String</type>
                <null_value></null_value>
            </attribute>
            
        </structure>
        </dictionary>
        </yandex>
        

Answer 1:
 dictionary with 75 columns 
<size_in_cells>7500000</size_in_cells>

Such a dictionary could eat 100 GB of RAM. Check system.dictionaries.bytes_allocated with <size_in_cells>10000.
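The configured value 7500000 happens to equal 75 columns times 100000 rows, which suggests the cache was sized as if one cell held a single value. The rough sketch below compares that reading with the one implied by the warning above, where a cell holds one key together with all of its attributes. Both the per-row cell model and reading "126Mb" as 126 megabytes are assumptions of this sketch, not statements taken from the ClickHouse documentation.

```python
# Figures quoted in the question.
ROWS = 100_000
COLUMNS = 75
SOURCE_BYTES = 126 * 10**6      # "126Mb", read here as 126 megabytes (assumption)
SIZE_IN_CELLS = 7_500_000       # <size_in_cells> from the dictionary XML

def estimate_cache_bytes(cached_rows, bytes_per_row):
    """Raw payload for the given number of cached rows, ignoring all overhead."""
    return cached_rows * bytes_per_row

bytes_per_row = SOURCE_BYTES / ROWS   # average CSV bytes per row

# ASSUMPTION: each cell stores one key with all 75 attributes.
per_row_model = estimate_cache_bytes(SIZE_IN_CELLS, bytes_per_row)
# Alternative reading: each cell stores a single value, so 75 cells make a row.
per_value_model = estimate_cache_bytes(SIZE_IN_CELLS // COLUMNS, bytes_per_row)

print(f"cell = one row:   {per_row_model / 10**9:.2f} GB of raw data")
print(f"cell = one value: {per_value_model / 10**9:.3f} GB of raw data")
```

Neither figure is a measurement, and neither includes per-cell bookkeeping or allocator overhead, which may add considerably on top of the raw payload. The point is only the factor of 75 between the two readings.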

https://github.com/ClickHouse/ClickHouse/issues/2738
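The check suggested in the answer can be scripted. A minimal sketch, assuming `clickhouse-client` is on the PATH and can reach the server; the function names are hypothetical, and the query reads the `name`, `element_count` and `bytes_allocated` columns of `system.dictionaries`:

```python
import subprocess

def build_check_command(dictionary_name):
    """Build a clickhouse-client invocation that reports a dictionary's memory use."""
    # Escape backslashes and single quotes so the name is safe inside the SQL literal.
    escaped = dictionary_name.replace("\\", "\\\\").replace("'", "\\'")
    query = (
        "SELECT name, element_count, bytes_allocated "
        "FROM system.dictionaries "
        f"WHERE name = '{escaped}'"
    )
    return ["clickhouse-client", "--query", query]

def check_dictionary(dictionary_name):
    """Run the query; needs a reachable ClickHouse server, so it is not called here."""
    result = subprocess.run(
        build_check_command(dictionary_name),
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Show the query that would be sent for the dictionary from the question.
print(build_check_command("CacheDictionary")[2])
```

Running `check_dictionary("CacheDictionary")` after lowering `size_in_cells` and reloading the dictionary shows how `bytes_allocated` changes with the cache size.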
