ClickHouse 20.x distributed table testing and chproxy deployment

Posted by flyfish225


Tags (space-separated): clickhouse series


1: ClickHouse 20.x distributed table testing

1.1: Preparing the test data

Prepare the test file, following the official documentation:
  https://clickhouse.com/docs/en/getting-started/example-datasets/metrica

Download the file:
curl https://datasets.clickhouse.com/hits/tsv/hits_v1.tsv.xz | unxz --threads=`nproc` > hits_v1.tsv
# Validate the checksum
md5sum hits_v1.tsv
# Checksum should be equal to: f3631b6295bf06989c1437491f7592cb
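
The checksum comparison can also be scripted so a corrupted download fails fast; a minimal sketch (bash), using the expected value from the comment above:

#!/bin/bash
# Abort early if the downloaded file does not match the published MD5.
expected="f3631b6295bf06989c1437491f7592cb"
actual=$(md5sum hits_v1.tsv | awk '{print $1}')
if [ "$actual" != "$expected" ]; then
    echo "checksum mismatch: got $actual, expected $expected" >&2
    exit 1
fi
echo "checksum OK"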

1.2: Creating the database and the distributed table

clickhouse-client -h node01 -u default --password tzck123.com --query "CREATE DATABASE IF NOT EXISTS datasets on cluster tzcluster3s2r02" 
clickhouse-client -h node01 -u default --password tzck123.com --query "show databases"

Create the local table:

clickhouse-client -h node03 -u default --password tzck123.com

CREATE TABLE datasets.hits_v1 ON CLUSTER tzcluster3s2r02 (WatchID UInt64, JavaEnable UInt8, Title String, GoodEvent Int16, EventTime DateTime, EventDate Date, CounterID UInt32, ClientIP UInt32, ClientIP6 FixedString(16), RegionID UInt32, UserID UInt64, CounterClass Int8, OS UInt8, UserAgent UInt8, URL String, Referer String, URLDomain String, RefererDomain String, Refresh UInt8, IsRobot UInt8, RefererCategories Array(UInt16), URLCategories Array(UInt16), URLRegions Array(UInt32), RefererRegions Array(UInt32), ResolutionWidth UInt16, ResolutionHeight UInt16, ResolutionDepth UInt8, FlashMajor UInt8, FlashMinor UInt8, FlashMinor2 String, NetMajor UInt8, NetMinor UInt8, UserAgentMajor UInt16, UserAgentMinor FixedString(2), CookieEnable UInt8, JavascriptEnable UInt8, IsMobile UInt8, MobilePhone UInt8, MobilePhoneModel String, Params String, IPNetworkID UInt32, TraficSourceID Int8, SearchEngineID UInt16, SearchPhrase String, AdvEngineID UInt8, IsArtifical UInt8, WindowClientWidth UInt16, WindowClientHeight UInt16, ClientTimeZone Int16, ClientEventTime DateTime, SilverlightVersion1 UInt8, SilverlightVersion2 UInt8, SilverlightVersion3 UInt32, SilverlightVersion4 UInt16, PageCharset String, CodeVersion UInt32, IsLink UInt8, IsDownload UInt8, IsNotBounce UInt8, FUniqID UInt64, HID UInt32, IsOldCounter UInt8, IsEvent UInt8, IsParameter UInt8, DontCountHits UInt8, WithHash UInt8, HitColor FixedString(1), UTCEventTime DateTime, Age UInt8, Sex UInt8, Income UInt8, Interests UInt16, Robotness UInt8, GeneralInterests Array(UInt16), RemoteIP UInt32, RemoteIP6 FixedString(16), WindowName Int32, OpenerName Int32, HistoryLength Int16, BrowserLanguage FixedString(2), BrowserCountry FixedString(2), SocialNetwork String, SocialAction String, HTTPError UInt16, SendTiming Int32, DNSTiming Int32, ConnectTiming Int32, ResponseStartTiming Int32, ResponseEndTiming Int32, FetchTiming Int32, RedirectTiming Int32, DOMInteractiveTiming Int32, DOMContentLoadedTiming Int32, DOMCompleteTiming Int32, LoadEventStartTiming Int32, LoadEventEndTiming Int32, NSToDOMContentLoadedTiming Int32, FirstPaintTiming Int32, RedirectCount Int8, SocialSourceNetworkID UInt8, SocialSourcePage String, ParamPrice Int64, ParamOrderID String, ParamCurrency FixedString(3), ParamCurrencyID UInt16, GoalsReached Array(UInt32), OpenstatServiceName String, OpenstatCampaignID String, OpenstatAdID String, OpenstatSourceID String, UTMSource String, UTMMedium String, UTMCampaign String, UTMContent String, UTMTerm String, FromTag String, HasGCLID UInt8, RefererHash UInt64, URLHash UInt64, CLID UInt32, YCLID UInt64, ShareService String, ShareURL String, ShareTitle String, ParsedParams Nested(Key1 String, Key2 String, Key3 String, Key4 String, Key5 String, ValueDouble Float64), IslandID FixedString(16), RequestNum UInt32, RequestTry UInt8) ENGINE = ReplicatedReplacingMergeTree('/clickhouse/tables/{layer}-{shard}/datasets/hits_v1', '{replica}') PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDate, intHash32(UserID)) SAMPLE BY intHash32(UserID) SETTINGS index_granularity = 8192;
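
The {layer}-{shard} and {replica} placeholders in the ZooKeeper path are macros that ClickHouse substitutes per node from its server configuration. Assuming the macros are already defined on each node (they must be for the DDL above to work), you can check what each node will substitute:

clickhouse-client -h node01 -u default --password tzck123.com --query "SELECT * FROM system.macros"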

Create the distributed table over the local table:
   CREATE TABLE datasets.hits_v1_all ON CLUSTER tzcluster3s2r02 AS datasets.hits_v1
ENGINE = Distributed('tzcluster3s2r02', 'datasets', 'hits_v1', rand());
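
To confirm the ON CLUSTER DDL really created both tables on every node, a quick loop over the same hosts and credentials used above (a sketch, not part of the original procedure):

#!/bin/bash
# List the datasets tables (and their engines) on every node.
for host in node01 node02 node03 node04 node05 node06; do
    echo "== $host =="
    clickhouse-client -h "$host" -u default --password tzck123.com \
        --query "SELECT name, engine FROM system.tables WHERE database = 'datasets'"
done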

Decompress the data (only needed if you downloaded the raw .xz file rather than piping through unxz above):
xz -v -d hits_v1.tsv.xz

Insert the data:
clickhouse-client -h node03 -u default -d datasets --password tzck123.com --query "INSERT INTO datasets.hits_v1 FORMAT TSV" < hits_v1.tsv
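
For comparison, you could also insert through the distributed table, letting the rand() sharding key spread rows across the shards; this test deliberately writes to the local table instead, so the per-shard counts stay predictable. A sketch of the alternative (not used in this test):

# Alternative: insert via the distributed table, sharded by rand().
clickhouse-client -h node03 -u default -d datasets --password tzck123.com --query "INSERT INTO datasets.hits_v1_all FORMAT TSV" < hits_v1.tsv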

Check the local table's row count:

select count(1) from datasets.hits_v1


Query via the distributed table from any node:
 clickhouse-client -h node01 -u default -d datasets --password tzck123.com --query "select count(1) from datasets.hits_v1_all" 
 clickhouse-client -h node02 -u default -d datasets --password tzck123.com --query "select count(1) from datasets.hits_v1_all" 
 clickhouse-client -h node03 -u default -d datasets --password tzck123.com --query "select count(1) from datasets.hits_v1_all" 
 clickhouse-client -h node04 -u default -d datasets --password tzck123.com --query "select count(1) from datasets.hits_v1_all" 
 clickhouse-client -h node05 -u default -d datasets --password tzck123.com --query "select count(1) from datasets.hits_v1_all" 
 clickhouse-client -h node06 -u default -d datasets --password tzck123.com --query "select count(1) from datasets.hits_v1_all" 
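
The six invocations above can be collapsed into a loop that prints the local and distributed counts side by side for each node (a sketch using the same hosts and credentials):

#!/bin/bash
# Compare local-table and distributed-table row counts on each node.
for host in node01 node02 node03 node04 node05 node06; do
    local_cnt=$(clickhouse-client -h "$host" -u default --password tzck123.com \
        --query "SELECT count() FROM datasets.hits_v1")
    dist_cnt=$(clickhouse-client -h "$host" -u default --password tzck123.com \
        --query "SELECT count() FROM datasets.hits_v1_all")
    echo "$host local=$local_cnt distributed=$dist_cnt"
done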

After the test, the distributed table returns 1674680 rows from every node.
Next, copy the data file to node01 and node05 and insert it there.
At this point the hits_v1 table on node01 and node05 holds no data.


clickhouse-client -h node01 -u default -d datasets --password tzck123.com --query "INSERT INTO datasets.hits_v1 FORMAT TSV" < hits_v1.tsv

clickhouse-client -h node05 -u default -d datasets --password tzck123.com --query "INSERT INTO datasets.hits_v1 FORMAT TSV" < hits_v1.tsv



The hits_v1 table on node01 and node05 now holds 1674680 rows each.

Now check how many rows the distributed table hits_v1_all reports.

Query via the distributed table from any node:
 clickhouse-client -h node01 -u default -d datasets --password tzck123.com --query "select count(1) from datasets.hits_v1_all" 
 clickhouse-client -h node02 -u default -d datasets --password tzck123.com --query "select count(1) from datasets.hits_v1_all" 
 clickhouse-client -h node03 -u default -d datasets --password tzck123.com --query "select count(1) from datasets.hits_v1_all" 
 clickhouse-client -h node04 -u default -d datasets --password tzck123.com --query "select count(1) from datasets.hits_v1_all" 
 clickhouse-client -h node05 -u default -d datasets --password tzck123.com --query "select count(1) from datasets.hits_v1_all" 
 clickhouse-client -h node06 -u default -d datasets --password tzck123.com --query "select count(1) from datasets.hits_v1_all" 

The distributed table now has 5024040 rows, three times 1674680, because the cluster is a 3-shard, 2-replica cluster: each shard's local table holds one copy of the inserted batch, and the distributed table sums across the three shards.
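
The shard and replica layout behind that arithmetic can be read directly from system.clusters on any node:

clickhouse-client -h node01 -u default --password tzck123.com --query "SELECT cluster, shard_num, replica_num, host_name FROM system.clusters WHERE cluster = 'tzcluster3s2r02'"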

Given this test and the cluster's behavior, a production ClickHouse deployment can write to the local tables and read from the distributed table.
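
A minimal sketch of that pattern, with the same hosts and credentials as above (new_batch.tsv is a hypothetical data file):

# Write a batch to the local replicated table on one shard...
clickhouse-client -h node01 -u default --password tzck123.com --query "INSERT INTO datasets.hits_v1 FORMAT TSV" < new_batch.tsv
# ...and read the cluster-wide view through the distributed table on any node.
clickhouse-client -h node02 -u default --password tzck123.com --query "SELECT count() FROM datasets.hits_v1_all"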

For load balancing on top of the cluster, OpenResty can proxy TCP connections to the ClickHouse HTTP port 8123.

2: Load balancing ClickHouse behind an OpenResty proxy

OpenResty installation is omitted here; see flyfish's article: https://blog.51cto.com/flyfish225/3108573
OpenResty must be compiled with the --with-stream module to support TCP proxying.

Here is the OpenResty configuration file for reference:

cd /usr/local/openresty/nginx/conf

vim nginx.conf

-----
#user  nobody;
worker_processes  8;

error_log  /usr/local/openresty/nginx/logs/error.log;
#error_log  logs/error.log  notice;
#error_log  logs/error.log  info;

pid        logs/nginx.pid;

events {
    worker_connections  1024;
}

stream {

    log_format proxy '$remote_addr [$time_local] '
                 '$protocol $status $bytes_sent $bytes_received '
                 '$session_time "$upstream_addr" '
                 '"$upstream_bytes_sent" "$upstream_bytes_received" "$upstream_connect_time"';

    access_log /usr/local/openresty/nginx/logs/tcp-access.log proxy;
    open_log_file_cache off;
    include /usr/local/openresty/nginx/conf/conf.d/*.stream;
}

http {
    include       mime.types;
    default_type  application/octet-stream;

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /usr/local/openresty/nginx/logs/access.log  main;

    sendfile        on;
    #tcp_nopush     on;

    #keepalive_timeout  0;
    keepalive_timeout  60;

    gzip  on;

    server {
        listen       18080;
        server_name  localhost;

        #charset koi8-r;

        #access_log  logs/host.access.log  main;

        location / {
            root   html;
            index  index.html index.htm;
        }

        #error_page  404              /404.html;

        # redirect server error pages to the static page /50x.html
        #
        error_page   500 502 503 504  /50x.html;
        location = /50x.html {
            root   html;
        }
    }
}
-----
The important part is enabling OpenResty's TCP (stream) proxying, configured here:

cd /usr/local/openresty/nginx/conf/conf.d
vim ck_prod.stream

----
upstream ck {
    server  192.168.100.142:8123 weight=25 max_fails=3 fail_timeout=60s;
    server  192.168.100.143:8123 weight=25 max_fails=3 fail_timeout=60s;
    server  192.168.100.144:8123 weight=25 max_fails=3 fail_timeout=60s;
    server  192.168.100.145:8123 weight=25 max_fails=3 fail_timeout=60s;
    server  192.168.100.146:8123 weight=25 max_fails=3 fail_timeout=60s;
}

server {
    listen 18123;
    proxy_pass ck;
}
----

Then save and restart OpenResty:
cd /usr/local/openresty/nginx
sbin/nginx -t 
sbin/nginx 
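
Before opening a GUI client, the stream proxy can be sanity-checked from the shell. Because it forwards raw TCP, the ClickHouse HTTP interface on 8123 answers straight through port 18123 (the proxy host IP below is an assumption for illustration; substitute your OpenResty host):

echo 'SELECT version()' | curl 'http://192.168.100.120:18123/?user=default&password=tzck123.com' --data-binary @-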

Open a GUI database client (e.g. DbVisualizer) and connect through the proxy on port 18123.

3: The chproxy proxy

3.1 An introduction to chproxy:

chproxy is a powerful HTTP proxy and load-balancing middleware for ClickHouse.
Written in Go, it proxies and load-balances the ClickHouse HTTP interface and is quite feature-rich.
It is configured via YAML and is a very good tool for routing traffic across multiple clusters.

github:
     https://github.com/Vertamedia/chproxy

官网:
    https://www.chproxy.org/cn

chproxy can implement the write-to-local-table, read-from-distributed-table pattern:

3.2 Deploying chproxy:

Download chproxy from GitHub and install the binary:

chproxy-linux-amd64-v1.14.0.tar.gz

tar -xzvf chproxy-linux-amd64-v1.14.0.tar.gz
mv chproxy /usr/bin/

chproxy --version

Configure the chproxy proxy:
mkdir /etc/chproxy/
cd /etc/chproxy/
vim chproxy.yml
-----------------
server:
  http:
      listen_addr: ":19000"
      allowed_networks: ["192.168.100.0/24","192.168.120.0/24" ]

users:
  - name: "distributed-write"
    to_cluster: "distributed-write"
    to_user: "default"

  - name: "replica-write"
    to_cluster: "replica-write"
    to_user: "default"

  - name: "distributed-read"
    to_cluster: "distributed-read"
    to_user: "default"
    max_concurrent_queries: 6
    max_execution_time: 1m

clusters:
  - name: "replica-write"
    replicas:
      - name: "replica"
        nodes: ["node01:8123", "node02:8123", "node03:8123", "node04:8123","node05:8123","node06:8123"]
    users:
      - name: "default"
        password: "tzck123.com"

  - name: "distributed-write"
    nodes: [
      "node01:8123",
      "node02:8123",
      "node03:8123",
      "node04:8123",
      "node05:8123",
      "node06:8123"
    ]
    users:
      - name: "default"
        password: "tzck123.com"

  - name: "distributed-read"
    nodes: [
      "node01:8123",
      "node02:8123",
      "node03:8123",
      "node04:8123",
      "node05:8123",
      "node06:8123"
    ]
    users:
    - name: "default"
      password: "tzck123.com"

caches:
  - name: "shortterm"
    dir: "/etc/chproxy/cache/shortterm"
    max_size: 150Mb
    expire: 130s

-----------------

Startup script:

vim chproxy.sh
-------

#!/bin/bash
cd /etc/chproxy
# Kill any running chproxy instance, then start a fresh one in the background.
ps -ef | grep chproxy | grep -v grep | awk '{print $2}' | xargs -r kill -9
nohup /usr/bin/chproxy -config=/etc/chproxy/chproxy.yml >> ./chproxy.out 2>&1 &

-----

chmod +x chproxy.sh 

./chproxy.sh 
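
A quick check that the proxy actually came up, using the listen port and log path from the configuration above:

# Verify chproxy is listening on 19000, then inspect its recent log output.
ss -tlnp | grep -w 19000
tail -n 20 /etc/chproxy/chproxy.out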

Test chproxy.
Query directly on a ClickHouse node:
echo 'select * from system.clusters' | curl 'http://localhost:8123/?user=default&password=tzck123.com' --data-binary @-

Query through the chproxy proxy:

echo 'select * from system.clusters' | curl 'http://192.168.100.120:19000/?user=distributed-read&password=' --data-binary @-
echo 'select * from system.clusters' | curl 'http://192.168.100.120:19000/?user=distributed-write&password=' --data-binary @-



echo 'select count(1) from datasets.hits_v1' | curl 'http://192.168.100.120:19000/?user=distributed-read&password=' --data-binary @-

echo 'select count(1) from datasets.hits_v1_all' | curl 'http://192.168.100.120:19000/?user=distributed-read&password=' --data-binary @-
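
Writes go through chproxy the same way, via the distributed-write user. A sketch that replays the TSV file from section 1 through the proxy (in practice you would feed a fresh batch; the INSERT statement is URL-encoded into the query parameter):

curl 'http://192.168.100.120:19000/?user=distributed-write&password=&query=INSERT%20INTO%20datasets.hits_v1%20FORMAT%20TSV' --data-binary @hits_v1.tsv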
