Importing a PostgreSQL table into Elasticsearch
Posted by 一泽涟漪
1. Inspect the PostgreSQL table structure and data
edbstore=# \d customers
                                      Table "edbstore.customers"
        Column        |         Type          |                           Modifiers
----------------------+-----------------------+----------------------------------------------------------------
 customerid           | integer               | not null default nextval('customers_customerid_seq'::regclass)
 firstname            | character varying(50) | not null
 lastname             | character varying(50) | not null
 address1             | character varying(50) | not null
 address2             | character varying(50) |
 city                 | character varying(50) | not null
 state                | character varying(50) |
 zip                  | integer               |
 country              | character varying(50) | not null
 region               | smallint              | not null
 email                | character varying(50) |
 phone                | character varying(50) |
 creditcardtype       | integer               | not null
 creditcard           | character varying(50) | not null
 creditcardexpiration | character varying(50) | not null
 username             | character varying(50) | not null
 password             | character varying(50) | not null
 age                  | smallint              |
 income               | integer               |
 gender               | character varying(1)  |
Indexes:
    "customers_pkey" PRIMARY KEY, btree (customerid)
    "ix_cust_username" UNIQUE, btree (username)
Referenced by:
    TABLE "cust_hist" CONSTRAINT "fk_cust_hist_customerid" FOREIGN KEY (customerid) REFERENCES customers(customerid) ON DELETE CASCADE
    TABLE "orders" CONSTRAINT "fk_customerid" FOREIGN KEY (customerid) REFERENCES customers(customerid) ON DELETE SET NULL

edbstore=# select count(1) from customers;
 count
-------
 20000
(1 row)
2. Use PostgreSQL's row_to_json function to export the table rows and save them as JSON
edbstore=# \t
Tuples only is on.
edbstore=# \o customer.json
edbstore=# select row_to_json(r) from customers as r;
edbstore=# \q
[postgres@sht-sgmhadoopcm-01 dba]$ ls -lh customer.json
-rw-r--r-- 1 postgres appuser 7.7M Dec 7 22:37 customer.json
$ head -1 customer.json
{"customerid":1,"firstname":"VKUUXF","lastname":"ITHOMQJNYX","address1":"4608499546 Dell Way","address2":null,"city":"QSDPAGD","state":"SD","zip":24101,"country":"US","region":1,"email":"ITHOMQJNYX@dell.com","phone":"4608499546","creditcardtype":1,"creditcard":"1979279217775911","creditcardexpiration":"2012/03","username":"user1","password":"password","age":55,"income":100000,"gender":"M"}
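Before going further, it can help to confirm that the export holds one JSON object per row. The following is a minimal sketch, not part of the original procedure; the function name count_json_rows is hypothetical, and it assumes the file is UTF-8 encoded with one row per line.

```python
import json


def count_json_rows(path):
    """Count non-empty lines in `path`, checking that each one is a JSON object."""
    count = 0
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. at the end of the file
            row = json.loads(line)  # raises ValueError on malformed JSON
            if not isinstance(row, dict):
                raise ValueError(f"line {lineno}: expected a JSON object")
            count += 1
    return count
```

Calling count_json_rows("customer.json") should return the same number that select count(1) reported for the table.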
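At this point the customers table has been dumped to a JSON file, but the file cannot be imported into Elasticsearch as it is. Trying to do so fails with the following error: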
$ curl -H "Content-Type: application/json" -XPOST "172.16.101.55:9200/bank/_bulk?pretty&refresh" --data-binary "@customer.json"
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Malformed action/metadata line [1], expected START_OBJECT or END_OBJECT but found [VALUE_NUMBER]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Malformed action/metadata line [1], expected START_OBJECT or END_OBJECT but found [VALUE_NUMBER]"
  },
  "status" : 400
}
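According to the documentation at https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html, every document in a bulk request must be preceded by an action/metadata line such as {"index":{"_id":"1"}}. Our JSON file contains only the rows themselves, with no such line giving each row a unique document id, so Elasticsearch tries to read the first row as an action line and rejects it.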
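The expected layout can be illustrated with a small checker. This is a simplified sketch for diagnosis only, not Elasticsearch's actual parser; the function name first_malformed_action is hypothetical, and it assumes that index, create and update actions are each followed by one source line while delete is not.

```python
import json

ACTIONS_WITH_SOURCE = {"index", "create", "update"}
ACTIONS_WITHOUT_SOURCE = {"delete"}


def first_malformed_action(lines):
    """Return the 1-based number of the first line that should be an
    action/metadata line but is not, or None if the layout looks valid."""
    expect_source = False
    for lineno, line in enumerate(lines, start=1):
        if expect_source:
            expect_source = False  # this line is a document source; skip it
            continue
        obj = json.loads(line)
        # An action line is an object with exactly one key naming the action.
        if not isinstance(obj, dict) or len(obj) != 1:
            return lineno
        action, meta = next(iter(obj.items()))
        known = ACTIONS_WITH_SOURCE | ACTIONS_WITHOUT_SOURCE
        if action not in known or not isinstance(meta, dict):
            return lineno
        expect_source = action in ACTIONS_WITH_SOURCE
    return None
```

Fed the raw export, the checker flags line 1, which matches the line number in the error message above: the first row is a plain document, not an action line.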
3. Add an id line for each row of the JSON data
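As seen earlier, the customers table has 20000 rows, so we need 20000 matching id values. We generate them with Python: create a file named build_id.py with the content below. Note that the upper bound is 20001. range() includes its start but excludes its end, so range(1, 20000) would produce only 1 to 19999; we therefore write range(1, 20001).
for i in range(1, 20001):
    print('{"index":{"_id":"%s"}}' % i)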
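Run the script with the Python interpreter and redirect its output to a file. Invoked this way, the script does not need execute permission.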
$ python build_id.py > build_id.txt
$ head -3 build_id.txt
{"index":{"_id":"1"}}
{"index":{"_id":"2"}}
{"index":{"_id":"3"}}
Use the Linux "paste" command to merge the id file and the table file, alternating one line from each:
$ paste -d'\n' build_id.txt customer.json > customer_new.json
$ head -4 customer_new.json
{"index":{"_id":"1"}}
{"customerid":1,"firstname":"VKUUXF","lastname":"ITHOMQJNYX","address1":"4608499546 Dell Way","address2":null,"city":"QSDPAGD","state":"SD","zip":24101,"country":"US","region":1,"email":"ITHOMQJNYX@dell.com","phone":"4608499546","creditcardtype":1,"creditcard":"1979279217775911","creditcardexpiration":"2012/03","username":"user1","password":"password","age":55,"income":100000,"gender":"M"}
{"index":{"_id":"2"}}
{"customerid":2,"firstname":"HQNMZH","lastname":"UNUKXHJVXB","address1":"5119315633 Dell Way","address2":null,"city":"YNCERXJ","state":"AZ","zip":11802,"country":"US","region":1,"email":"UNUKXHJVXB@dell.com","phone":"5119315633","creditcardtype":1,"creditcard":"3144519586581737","creditcardexpiration":"2012/11","username":"user2","password":"password","age":80,"income":40000,"gender":"M"}
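The id generation and the merge can also be done in a single pass. The sketch below is one possible alternative, not the method used in this post: it takes each document id from the row's own customerid instead of a separately generated counter. The function name write_bulk_file and the id_field parameter are hypothetical.

```python
import json


def write_bulk_file(src_path, dst_path, id_field="customerid"):
    """Write an action line before every row of `src_path` into `dst_path`.

    Returns the number of rows written.
    """
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            row = json.loads(line)
            # Assumption: the value of `id_field` is unique per row.
            action = {"index": {"_id": str(row[id_field])}}
            dst.write(json.dumps(action, separators=(",", ":")) + "\n")
            dst.write(line + "\n")  # the row itself, unchanged
            written += 1
    return written
```

Calling write_bulk_file("customer.json", "customer_new.json") would produce the same kind of alternating layout. Deriving the id from the primary key keeps each document id tied to its row even if the export is not in sequential order or the key has gaps, whereas the paste approach relies on line position.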
4. The processed JSON file can now be imported into Elasticsearch. Test it:
$ curl -H "Content-Type: application/json" -XPOST "172.16.101.55:9200/customer/_bulk?pretty&refresh" --data-binary "@customer_new.json"
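A bulk request can return a success status even when individual documents fail, so it is worth checking the response body. The sketch below assumes the response shape described in the bulk API documentation, a top-level "errors" flag and an "items" array with one entry per action; the function name summarize_bulk_response is hypothetical.

```python
import json


def summarize_bulk_response(body):
    """Return (has_errors, failures) for a bulk response given as a JSON string.

    `failures` is a list of (document id, error reason) tuples.
    """
    resp = json.loads(body)
    failures = []
    for item in resp.get("items", []):
        # Each item has a single key naming the action, e.g. "index".
        for result in item.values():
            if "error" in result:
                failures.append((result.get("_id"), result["error"].get("reason")))
    return resp.get("errors", False), failures
```

To use it, save the curl output to a file and pass the file's contents to the function; an empty failures list means every row was accepted.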
$ curl http://172.16.101.55:9200/_cat/indices?v
health status index    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   customer DvLoM7NjSYyjTwD5BSkK3A   1   1      20000            0       10mb           10mb