timescaledb 集成 madlib

Posted rongfengliang-荣锋亮

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了timescaledb 集成 madlib相关的知识,希望对你有一定的参考价值。

github 上有人提出了一个问题(2017 很早了),然后搜索timescaledb 的docs 文档,发现有
一片介绍的文章,所以尝试运行下
备注: 环境使用虚拟机安装(没有使用docker madlib 的原因,实际上可以尝试基于timescaledb 的镜像改造)

安装madlib

这个可以参考madlib 官方文档,或者https://www.cnblogs.com/rongfengliang/p/10298159.html
因为timescaledb 是一个pg 的标准扩展,可以复用以前的安装方式

  • 添加pg 10 repo
 
yum install https://download.postgresql.org/pub/repos/yum/10/redhat/rhel-7-x86_64/pgdg-centos10-10-2.noarch.rpm
 
  • 安装pg python 基本包
yum -y install postgresql10-plpython supervisor
 
  • 安装madlib 依赖包
    注意python 版本,我使用python 2.7 安装时候失败了,修改为了python34
 
yum update -y && yum install -y \\
                    git \\
                    gcc \\
                    wget \\
                    postgresql10-devel \\
                    openssl \\
                    m4 \\
                    vim \\
                    flex \\
                    bison \\
                    graphviz \\
                    java \\
                    epel-release \\
                    python34-devel
 
 
  • 安装pip 包
    默认一般是包含的
 
yum install -y python34-pip
 
 
  • pg_conf 配置(环境变量)
PATH="$PATH:/usr/pgsql-10/bin"
 
 
  • 安装python 依赖(通过pip)
 pip3 install awscli pygresql paramiko --upgrade
 
 
  • 安装apache-madlib
下载rpm 
wget   https://dist.apache.org/repos/dist/release/madlib/1.15.1/apache-madlib-1.15.1-bin-Linux.rpm
安装
yum install -y apache-madlib-1.15.1-bin-Linux.rpm
 
 

timescaledb 扩展安装

  • 下载timescaledb pg 扩展
wget https://timescalereleases.blob.core.windows.net/rpm/timescaledb-1.1.1-postgresql-10-0.x86_64.rpm
 
  • 安装
yum install -y timescaledb-1.1.1-postgresql-10-0.x86_64.rpm
 
  • 初始化数据库
/usr/pgsql-10/bin/postgresql-10-setup initdb
 
 
  • 配置pg 扩展
    /var/lib/pgsql/10/data/postgresql.conf 文件
 
+ shared_preload_libraries = \'timescaledb\'
 
 
  • 修改pg_hba.conf 添加访问支持
    之后修改之后,需要重启服务,systemctl restart postgresql-10
 
/var/lib/pgsql/10/data/pg_hba.conf
修改如下:
# "local" is for Unix domain socket connections only
local all all trust
# IPv4 local connections:
host all all 127.0.0.1/32 trust
# IPv6 local connections:
host all all ::1/128 trust
 
 
  • 启动数据库
systemctl enable postgresql-10
systemctl start postgresql-10
 
 

注册madlib 扩展

  • 下载&&安装madlib 扩展
下载rpm 
wget https://dist.apache.org/repos/dist/release/madlib/1.15.1/apache-madlib-1.15.1-bin-Linux.rpm
安装
yum install -y apache-madlib-1.15.1-bin-Linux.rpm
 
 
  • 注册
/usr/local/madlib/bin/madpack -s madlib -p postgres -c postgres@localhost:5432/postgres install
 
 

效果

madpack.py: INFO : Detected PostgreSQL version 10.6.
madpack.py: INFO : *** Installing MADlib ***
madpack.py: INFO : MADlib tools version = 1.15.1 (/usr/local/madlib/Versions/1.15.1/bin/../madpack/madpack.py)
madpack.py: INFO : MADlib database version = None (host=localhost:5432, db=postgres, schema=madlib)
madpack.py: INFO : Testing PL/Python environment...
madpack.py: INFO : > Creating language PL/Python...
madpack.py: INFO : > PL/Python environment OK (version: 2.7.5)
madpack.py: INFO : > Preparing objects for the following modules:
madpack.py: INFO : > - array_ops
madpack.py: INFO : > - bayes
madpack.py: INFO : > - crf
madpack.py: INFO : > - elastic_net
madpack.py: INFO : > - linalg
madpack.py: INFO : > - pmml
madpack.py: INFO : > - prob
madpack.py: INFO : > - sketch
madpack.py: INFO : > - svec
madpack.py: INFO : > - svm
madpack.py: INFO : > - tsa
madpack.py: INFO : > - stemmer
madpack.py: INFO : > - conjugate_gradient
madpack.py: INFO : > - knn
madpack.py: INFO : > - lda
madpac
 
 
  • 检查
/usr/local/madlib/bin/madpack -s madlib -p postgres -c postgres@localhost:5432/postgres install-check
 
 

效果

TEST CASE RESULT|Module: bayes|bayes.ic.sql_in|PASS|Time: 121 milliseconds
TEST CASE RESULT|Module: crf|crf_train_small.ic.sql_in|PASS|Time: 112 milliseconds
TEST CASE RESULT|Module: crf|crf_test_small.ic.sql_in|PASS|Time: 133 milliseconds
TEST CASE RESULT|Module: elastic_net|elastic_net.ic.sql_in|PASS|Time: 133 milliseconds
TEST CASE RESULT|Module: linalg|linalg.ic.sql_in|PASS|Time: 44 milliseconds
TEST CASE RESULT|Module: linalg|svd.ic.sql_in|PASS|Time: 168 milliseconds
TEST CASE RESULT|Module: linalg|matrix_ops.ic.sql_in|PASS|Time: 238 milliseconds
TEST CASE RESULT|Module: prob|prob.ic.sql_in|PASS|Time: 21 milliseconds
TEST CASE RES
 
 

简单测试

内容来自官方文档
https://docs.timescale.com/v0.11/tutorials/tutorial-forecasting,https://docs.timescale.com/v1.1/tutorials/tutorial-hello-nyc
同时需要安装的组件也比较多,gis,timescale。。

  • 预备安装
    gis 扩展安装,后边需要,实际上这个是可选的
wget https://download.postgresql.org/pub/repos/yum/10/redhat/rhel-7-x86_64/pgdg-centos10-10-2.noarch.rpm // 这个yum 源
上边实际上已经执行了
yum  install -y postgis25_10
 
 
  • 加载扩展
    都在postgres 数据库
 
psql -U postgres -d  postgres -h localhost 
\\c postgres
CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;
CREATE EXTENSION postgis;
 
 
  • 加载数据部分数据(因为gis 数据有依赖关系)
    数据有点多,可能会比较慢
 
wget https://timescaledata.blob.core.windows.net/datasets/nyc_data.tar.gz
tar -xvzf nyc_data.tar.gz
psql -U postgres -d  postgres -h localhost < nyc_data.sql
psql -U postgres -d postgres -h localhost -c "\\COPY rides FROM nyc_data_rides.csv CSV"
 
 
  • 初始化gis 扩展 && forecast.sql
    数据有点多,可能会比较慢
 
ALTER TABLE rides ADD COLUMN pickup_geom geometry(POINT,2163);
ALTER TABLE rides ADD COLUMN dropoff_geom geometry(POINT,2163);
生成geo 数据:
UPDATE rides SET pickup_geom = ST_Transform(ST_SetSRID(ST_MakePoint(pickup_longitude,pickup_latitude),4326),2163);
UPDATE rides SET dropoff_geom = ST_Transform(ST_SetSRID(ST_MakePoint(dropoff_longitude,dropoff_latitude),4326),2163);
wget http://assets.iobeam.com/sql/forecast.sql
// 此过程会创建rides_count rides_length rides_price  等hypertables 
psql -U postgres -d postgres -h localhost -f forecast.sql
 
 

查询操作

  • 查询rides_price
    如下:
 
SELECT * FROM rides_price;
      one_hour | trip_price
---------------------+------------------
 2016-01-01 00:00:00 | 58.34
 2016-01-01 01:00:00 | 58.34
 2016-01-01 02:00:00 | 58.34
 2016-01-01 03:00:00 | 58.34
 2016-01-01 04:00:00 | 58.34
 2016-01-01 05:00:00 | 59.59
 2016-01-01 06:00:00 | 58.34
 2016-01-01 07:00:00 | 60.3833333333333
 2016-01-01 08:00:00 | 61.2575
 2016-01-01 09:00:00 | 58.435
 2016-01-01 10:00:00 | 63.952
 2016-01-01 11:00:00 | 59.9576923076923
 2016-01-01 12:00:00 | 60.4
 
 
  • 创建训练以及测试的数据集
    这个主要是测试从肯尼迪国际机场到时代广场的车程价格
 
-- Make the training dataset
SELECT * INTO rides_price_train FROM rides_price
WHERE one_hour <= \'2016-01-21 23:59:59\';
-- Make the testing dataset
SELECT * INTO rides_price_test FROM rides_price
WHERE one_hour >= \'2016-01-22 00:00:00\';
 
 

效果

SELECT * INTO rides_price_train FROM rides_price
postgres-# WHERE one_hour <= \'2016-01-21 23:59:59\';
SELECT 504
postgres=# SELECT * INTO rides_price_test FROM rides_price
postgres-# WHERE one_hour >= \'2016-01-22 00:00:00\';
SELECT 240
 
 
  • 使用madlib 函数生成训练数据
DROP TABLE IF EXISTS rides_price_output;
DROP TABLE IF EXISTS rides_price_output_residual;
DROP TABLE IF EXISTS rides_price_output_summary;
DROP TABLE IF EXISTS rides_price_forecast_output;
SELECT madlib.arima_train(\'rides_price_train\', -- input table
      \'rides_price_output\', -- output table
      \'one_hour\', -- timestamp column
      \'trip_price\', -- time-series column
      NULL, -- grouping columns
      TRUE, -- include_mean
      ARRAY[2,1,3] -- non-seasonal orders
      );
SELECT madlib.arima_forecast(\'rides_price_output\', -- model table
                        \'rides_price_forecast_output\', -- output table
                        240 -- steps_ahead (10 days)
                        );
 
 
  • 查看生成的结果
 
    SELECT * FROM rides_price_forecast_output;
 

数据

 SELECT * FROM rides_price_forecast_output;
 steps_ahead | forecast_value
-------------+----------------
           1 | 62.317574661
           2 | 62.7126520761
           3 | 62.892038632
           4 | 62.755044625
           5 | 62.6064068106
           6 | 62.6197088752
           7 | 62.7032172965
           8 | 62.729257786
           9 | 62.6956015739
          10 | 62.6685762986
          11 | 62.6755999496
          12 | 62.6926055721
          13 | 62.695637064
          14 | 62.6878845777
          15 | 62.683
 
 

结论: 从肯尼迪国际机场到时代广场的车程价格在日常基础上保持不变

  • 模型评估的
ALTER TABLE rides_price_test ADD COLUMN id SERIAL PRIMARY KEY;以上是关于timescaledb 集成 madlib的主要内容,如果未能解决你的问题,请参考以下文章

Greenplum 实时数据仓库实践(10)——集成机器学习库MADlib

madlib 集成 hasura graphql-engine 试用

timescaledb 集成prometheus

TimescaleDB 简单试用

用SQL玩转数据挖掘之MADlib——安装

用SQL玩转数据挖掘之MADlib——安装