电商数仓sqoop
Posted 今夜月色很美
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了电商数仓sqoop相关的知识,希望对你有一定的参考价值。
1、sqoop安装
1.1、下载并解压
sqoop官网地址:http://sqoop.apache.org
下载地址:http://mirrors.hust.edu.cn/apache/sqoop/1.4.6/
解压sqoop安装包到指定目录,如:
tar -zxf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz -C /opt/module/
修改文件夹名称:
mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha/ sqoop
1.2、修改配置文件
进入到/opt/module/sqoop/conf目录,重命名配置文件
mv sqoop-env-template.sh sqoop-env.sh
修改配置文件内容
vim sqoop-env.sh
增加如下内容
export HADOOP_COMMON_HOME=/opt/module/hadoop-3.1.3
export HADOOP_MAPRED_HOME=/opt/module/hadoop-3.1.3
export HIVE_HOME=/opt/module/hive
export ZOOKEEPER_HOME=/opt/module/zookeeper-3.5.7
export ZOOCFGDIR=/opt/module/zookeeper-3.5.7/conf
进入到/opt/software/路径,拷贝jdbc驱动到sqoop的lib目录下。
cp mysql-connector-java-5.1.27-bin.jar /opt/module/sqoop/lib/
1.3、验证sqoop
我们可以通过某一个command来验证sqoop配置是否正确:
bin/sqoop help
测试Sqoop是否能够成功连接数据库
bin/sqoop list-databases --connect jdbc:mysql://h103:3310/ --username root --password 123456
1.4、Sqoop基本使用
通过sqoop --help查看语法格式,例如:将mysql中user_info表数据导入到HDFS的/test路径
bin/sqoop import \\
--connect jdbc:mysql://h103:3310/gmall \\
--username root \\
--password 123456 \\
--table user_info \\
--columns id,login_name \\
--where "id>=10 and id<=30" \\
--target-dir /test \\
--delete-target-dir \\
--fields-terminated-by '\\t' \\
--num-mappers 2 \\
--split-by id
等价于
bin/sqoop import \\
--connect jdbc:mysql://h103:3310/gmall \\
--username root \\
--password 123456 \\
--query 'select id,login_name from user_info where id>=10 and id<=30 and $CONDITIONS' \\
--target-dir /test \\
--delete-target-dir \\
--fields-terminated-by '\\t' \\
--num-mappers 2 \\
--split-by id
2、同步策略
数据同步策略的类型包括:全量同步、增量同步、新增及变化同步、特殊情况
全量表:存储完整的数据。
增量表:存储新增加的数据。
新增及变化表:存储新增加的数据和变化的数据。
特殊表:只需要存储一次。
2.1、全量同步策略
每日全量,就是每天存储一份完整数据,作为一个分区。
适用于表数据量不大,且每天既会有新数据插入,也会有旧数据的修改的场景。
例如:编码字典表、品牌表、商品三级分类、商品二级分类、商品一级分类、优惠规则表、活动表、活动参与商品表、加购表、商品收藏表、优惠券表、SKU商品表、SPU商品表。
2.2、增量同步策略
每日增量,就是每天存储一份增量数据,作为一个分区。
适用于表数据量大,且每天只会有新数据插入的场景。例如:退单表、订单状态表、支付流水表、订单详情表、活动与订单关联表、商品评论表。
2.3、新增及变化策略
每日新增及变化,就是存储创建时间和操作时间都是今天的数据。
适用场景为,表的数据量大,既会有新增,又会有变化。例如:用户表、订单表、优惠券领用表。
2.4、特殊策略
某些特殊的表,可不必遵循上述同步策略。例如某些不会发生变化的表(地区表,省份表,民族表)可以只存一份固定值。
3、业务数据导入hdfs
3.1、业务数据首日同步脚本
在/root/bin目录下创建
vim mysql_to_hdfs_init.sh
添加如下内容:
#! /bin/bash
APP=gmall
sqoop=/opt/module/sqoop/bin/sqoop
if [ -n "$2" ] ;then
do_date=$2
else
echo "请传入日期参数"
exit
fi
import_data()
$sqoop import \\
--connect jdbc:mysql://h103:3310/$APP \\
--username root \\
--password 123456 \\
--target-dir /origin_data/$APP/db/$1/$do_date \\
--delete-target-dir \\
--query "$2 where \\$CONDITIONS" \\
--num-mappers 1 \\
--fields-terminated-by '\\t' \\
--compress \\
--compression-codec lzop \\
--null-string '\\\\N' \\
--null-non-string '\\\\N'
hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /origin_data/$APP/db/$1/$do_date
import_order_info()
import_data order_info "select
id,
total_amount,
order_status,
user_id,
payment_way,
delivery_address,
out_trade_no,
create_time,
operate_time,
expire_time,
tracking_no,
province_id,
activity_reduce_amount,
coupon_reduce_amount,
original_total_amount,
feight_fee,
feight_fee_reduce
from order_info"
import_coupon_use()
import_data coupon_use "select
id,
coupon_id,
user_id,
order_id,
coupon_status,
get_time,
using_time,
used_time,
expire_time
from coupon_use"
import_order_status_log()
import_data order_status_log "select
id,
order_id,
order_status,
operate_time
from order_status_log"
import_user_info()
import_data "user_info" "select
id,
login_name,
nick_name,
name,
phone_num,
email,
user_level,
birthday,
gender,
create_time,
operate_time
from user_info"
import_order_detail()
import_data order_detail "select
id,
order_id,
sku_id,
sku_name,
order_price,
sku_num,
create_time,
source_type,
source_id,
split_total_amount,
split_activity_amount,
split_coupon_amount
from order_detail"
import_payment_info()
import_data "payment_info" "select
id,
out_trade_no,
order_id,
user_id,
payment_type,
trade_no,
total_amount,
subject,
payment_status,
create_time,
callback_time
from payment_info"
import_comment_info()
import_data comment_info "select
id,
user_id,
sku_id,
spu_id,
order_id,
appraise,
create_time
from comment_info"
import_order_refund_info()
import_data order_refund_info "select
id,
user_id,
order_id,
sku_id,
refund_type,
refund_num,
refund_amount,
refund_reason_type,
refund_status,
create_time
from order_refund_info"
import_sku_info()
import_data sku_info "select
id,
spu_id,
price,
sku_name,
sku_desc,
weight,
tm_id,
category3_id,
is_sale,
create_time
from sku_info"
import_base_category1()
import_data "base_category1" "select
id,
name
from base_category1"
import_base_category2()
import_data "base_category2" "select
id,
name,
category1_id
from base_category2"
import_base_category3()
import_data "base_category3" "select
id,
name,
category2_id
from base_category3"
import_base_province()
import_data base_province "select
id,
name,
region_id,
area_code,
iso_code,
iso_3166_2
from base_province"
import_base_region()
import_data base_region "select
id,
region_name
from base_region"
import_base_trademark()
import_data base_trademark "select
id,
tm_name
from base_trademark"
import_spu_info()
import_data spu_info "select
id,
spu_name,
category3_id,
tm_id
from spu_info"
import_favor_info()
import_data favor_info "select
id,
user_id,
sku_id,
spu_id,
is_cancel,
create_time,
cancel_time
from favor_info"
import_cart_info()
import_data cart_info "select
id,
user_id,
sku_id,
cart_price,
sku_num,
sku_name,
create_time,
operate_time,
is_ordered,
order_time,
source_type,
source_id
from cart_info"
import_coupon_info()
import_data coupon_info "select
id,
coupon_name,
coupon_type,
condition_amount,
condition_num,
activity_id,
benefit_amount,
benefit_discount,
create_time,
range_type,
limit_num,
taken_count,
start_time,
end_time,
operate_time,
expire_time
from coupon_info"
import_activity_info()
import_data activity_info "select
id,
activity_name,
activity_type,
start_time,
end_time,
create_time
from activity_info"
import_activity_rule()
import_data activity_rule "select
id,
activity_id,
activity_type,
condition_amount,
condition_num,
benefit_amount,
benefit_discount,
benefit_level
from activity_rule"
import_base_dic()
import_data base_dic "select
dic_code,
dic_name,
parent_code,
create_time,
operate_time
from base_dic"
import_order_detail_activity()
import_data order_detail_activity "select
id,
order_id,
order_detail_id,
activity_id,
activity_rule_id,
sku_id,
create_time
from order_detail_activity"
import_order_detail_coupon()
import_data order_detail_coupon "select
id,
order_id,
order_detail_id,
coupon_id,
coupon_use_id,
sku_id,
create_time
from order_detail_coupon"
import_refund_payment()
import_data refund_payment "select
id,
out_trade_no,
order_id,
sku_id,
payment_type,
trade_no,
total_amount,
subject,
refund_status,
create_time,
callback_time
from refund_payment"
import_sku_attr_value()
import_data sku_attr_value "select
id,
attr_id,
value_id,
sku_id,
attr_name,
value_name
from sku_attr_value"
import_sku_sale_attr_value()
import_data sku_sale_attr_value "select
id,
sku_id,
spu_id,
sale_attr_value_id,
sale_attr_id,
sale_attr_name,
sale_attr_value_name
from sku_sale_attr_value"
case $1 in
"order_info")
import_order_info
;;
"base_category1")
import_base_category1
;;
"base_category2")
import_base_category2
;;
"base_category3")
import_base_category3
;;
"order_detail")
import_order_detail
;;
"sku_info")
import_sku_info
;;
"user_info")
import_user_info
;;
"payment_info")
import_payment_info
;;
"base_province")
import_base_province
;;
"base_region")
import_base_region
;;
"base_trademark")
import_base_trademark
;;
"activity_info")
import_activity_info
;;
"cart_info")
import_cart_info
;;
"comment_info")
import_comment_info
;;
"coupon_info")
import_coupon_info
;;
"coupon_use")
import_coupon_use
;;
"favor_info")
import_favor_info
;;
"order_refund_info")
import_order_refund_info
;;
"order_status_log")
import_order_status_log
;;
"spu_info")
import_spu_info
;;
"activity_rule")
import_activity_rule
;;
"base_dic")
import_base_dic
;;
"order_detail_activity")
import_order_detail_activity
;;
"order_detail_coupon")
import_order_detail_coupon
;;
"refund_payment")
import_refund_payment
;;
"sku_attr_value")
import_sku_attr_value
;;
"sku_sale_attr_value")
import_sku_sale_attr_value
;;
"all")
import_base_category1
import_base_category2
import_base_category3
import_order_info
import_order_detail
import_sku_info
import_user_info
import_payment_info
import_base_region
import_base_province
import_base_trademark
import_activity_info
import_cart_info
import_comment_info
import_coupon_use
import_coupon_info
import_favor_info
import_order_refund_info
import_order_status_log
import_spu_info
import_activity_rule
import_base_dic
import_order_detail_activity
import_order_detail_coupon
import_refund_payment
import_sku_attr_value
import_sku_sale_attr_value
;;
esac
说明:
[ -n 变量值 ] 判断变量的值,是否为空
-- 变量的值,非空,返回true
-- 变量的值,为空,返回false
增加脚本执行权限
chmod +x mysql_to_hdfs_init.sh
使用脚本
mysql_to_hdfs_init.sh all 2022-04-01
3.2、业务数据每日同步脚本
在/home/atguigu/bin目录下创建
vim mysql_to_hdfs.sh
添加如下内容:
#! /bin/bash
APP=gmall
sqoop=/opt/module/sqoop/bin/sqoop
if [ -n "$2" ] ;then
do_date=$2
else
do_date=`date -d '-1 day' +%F`
fi
import_data()
$sqoop import \\
--connect jdbc:mysql://h103:3310/$APP \\
--username root \\
--password 123456 \\
--target-dir /origin_data/$APP/db/$1/$do_date \\
--delete-target-dir \\
--query "$2 and \\$CONDITIONS" \\
--num-mappers 1 \\
--fields-terminated-by '\\t' \\
--compress \\
--compression-codec lzop \\
--null-string '\\\\N' \\
--null-non-string '\\\\N'
hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /origin_data/$APP/db/$1/$do_date
import_order_info()
import_data order_info "select
id,
total_amount,
order_status,
user_id,
payment_way,
delivery_address,
out_trade_no,
create_time,
operate_time,
expire_time,
tracking_no,
province_id,
activity_reduce_amount,
coupon_reduce_amount,
original_total_amount,
feight_fee,
feight_fee_reduce
from order_info
where (date_format(create_time,'%Y-%m-%d')='$do_date'
or date_format(operate_time,'%Y-%m-%d')='$do_date')"
import_coupon_use()
import_data coupon_use "select
id,
coupon_id,
user_id,
order_id,
coupon_status,
get_time,
using_time,
used_time,
expire_time
from coupon_use
where (date_format(get_time,'%Y-%m-%d')='$do_date'
or date_format(using_time,'%Y-%m-%d')='$do_date'
or date_format(used_time,'%Y-%m-%d')='$do_date'
or date_format(expire_time,'%Y-%m-%d')='$do_date')"
import_order_status_log()
import_data order_status_log "select
id,
order_id,
order_status,
operate_time
from order_status_log
where date_format(operate_time,'%Y-%m-%d')='$do_date'"
import_user_info()
import_data "user_info" "select
id,
login_name,
nick_name,
name,
phone_num,
email,
user_level,
birthday,
gender,
create_time,
operate_time
from user_info
where (DATE_FORMAT(create_time,'%Y-%m-%d')='$do_date'
or DATE_FORMAT(operate_time,'%Y-%m-%d')='$do_date')"
import_order_detail()
import_data order_detail "select
id,
order_id,
sku_id,
sku_name,
order_price,
sku_num,
create_time,
source_type,
source_id,
split_total_amount,
split_activity_amount,
split_coupon_amount
from order_detail
where DATE_FORMAT(create_time,'%Y-%m-%d')='$do_date'"
import_payment_info()
import_data "payment_info" "select
id,
out_trade_no,
order_id,
user_id,
payment_type,
trade_no,
total_amount,
subject,
payment_status,
create_time,
callback_time
from payment_info
where (DATE_FORMAT(create_time,'%Y-%m-%d')='$do_date'
or DATE_FORMAT(callback_time,'%Y-%m-%d')='$do_date')"
import_comment_info()
import_data comment_info "select
id,
user_id,
sku_id,
spu_id,
order_id,
appraise,
create_time
from comment_info
where date_format(create_time,'%Y-%m-%d')='$do_date'"
import_order_refund_info()
import_data order_refund_info "select
id,
user_id,
order_id,
sku_id,
refund_type,
refund_num,
refund_amount,
refund_reason_type,
refund_status,
create_time
from order_refund_info
where date_format(create_time,'%Y-%m-%d')='$do_date'"
import_sku_info()
import_data sku_info "select
id,
spu_id,
price,
sku_name,
sku_desc,
weight,
tm_id,
category3_id,
is_sale,
create_time
from sku_info where 1=1"
import_base_category1()
import_data "base_category1" "select
id,
name
from base_category1 where 1=1"
import_base_category2()
import_data "base_category2" "select
id,
name,
category1_id
from base_category2 where 1=1"
import_base_category3()
import_data "base_category3" "select
id,
name,
category2_id
from base_category3 where 1=1"
import_base_province()
import_data base_province "select
id,
name,
region_id,
area_code,
iso_code,
iso_3166_2
from base_province
where 1=1"
import_base_region()
import_data base_region "select
id,
region_name
from base_region
where 1=1"
import_base_trademark()
import_data base_trademark "select
id,
tm_name
from base_trademark
where 1=1"
import_spu_info()
import_data spu_info "select
id,
spu_name,
category3_id,
tm_id
from spu_info
where 1=1"
import_favor_info()
import_data favor_info "select
id,
user_id,
sku_id,
spu_id,
is_cancel,
create_time,
cancel_time
from favor_info
where 1=1"
import_cart_info()
import_data cart_info "select
id,
user_id,
sku_id,
cart_price,
sku_num,
sku_name,
create_time,
operate_time,
is_ordered,
order_time,
source_type,
source_id
from cart_info
where 1=1"
import_coupon_info()
import_data coupon_info "select
id,
coupon_name,
coupon_type,
condition_amount,
condition_num,
activity_id,
benefit_amount,
benefit_discount,
create_time,
range_type,
limit_num,
taken_count,
start_time,
end_time,
operate_time,
expire_time
from coupon_info
where 1=1"
import_activity_info()
import_data activity_info "select
id,
activity_name,
activity_type,
start_time,
end_time,
create_time
from activity_info
where 1=1"
import_activity_rule()
import_data activity_rule "select
id,
activity_id,
activity_type,
condition_amount,
condition_num,
benefit_amount,
benefit_discount,
benefit_level
from activity_rule
where 1=1"
import_base_dic()
import_data base_dic "select
dic_code,
dic_name,
parent_code,
create_time,
operate_time
from base_dic
where 1=1"
import_order_detail_activity()
import_data order_detail_activity "select
id,
order_id,
order_detail_id,
activity_id,
activity_rule_id,
sku_id,
create_time
from order_detail_activity
where date_format(create_time,'%Y-%m-%d')='$do_date'"
import_order_detail_coupon()
import_data order_detail_coupon "select
id,
order_id,
order_detail_id,
coupon_id,
coupon_use_id,
sku_id,
create_time
from order_detail_coupon
where date_format(create_time,'%Y-%m-%d')='$do_date'"
import_refund_payment()
import_data refund_payment "select
id,
out_trade_no,
order_id,
sku_id,
payment_type,
trade_no,
total_amount,
subject,
refund_status,
create_time,
callback_time
from refund_payment
where (DATE_FORMAT(create_time,'%Y-%m-%d')='$do_date'
or DATE_FORMAT(callback_time,'%Y-%m-%d')='$do_date')"
import_sku_attr_value()
import_data sku_attr_value "select
id,
attr_id,
value_id,
sku_id,
attr_name,
value_name
from sku_attr_value
where 1=1"
import_sku_sale_attr_value()
import_data sku_sale_attr_value "select
id,
sku_id,
spu_id,
sale_attr_value_id,
sale_attr_id,
sale_attr_name,
sale_attr_value_name
from sku_sale_attr_value
where 1=1"
case $1 in
"order_info")
import_order_info
;;
"base_category1")
import_base_category1
;;
"base_category2")
import_base_category2
;;
"base_category3")
import_base_category3
;;
"order_detail")
import_order_detail
;;
"sku_info")
import_sku_info
;;
"user_info")
import_user_info
;;
"payment_info")
import_payment_info
;;
"base_province")
import_base_province
;;
"activity_info")
import_activity_info
;;
"cart_info")
import_cart_info
;;
"comment_info")
import_comment_info
;;
"coupon_info")
import_coupon_info
;;
"coupon_use")
import_coupon_use
;;
"favor_info")
import_favor_info
;;
"order_refund_info")
import_order_refund_info
;;
"order_status_log")
import_order_status_log
;;
"spu_info")
import_spu_info
;;
"activity_rule")
import_activity_rule
;;
"base_dic")
import_base_dic
;;
"order_detail_activity")
import_order_detail_activity
;;
"order_detail_coupon")
import_order_detail_coupon
;;
"refund_payment")
import_refund_payment
;;
"sku_attr_value")
import_sku_attr_value
;;
"sku_sale_attr_value")
import_sku_sale_attr_value
;;
"all")
import_base_category1
import_base_category2
import_base_category3
import_order_info
import_order_detail
import_sku_info
import_user_info
import_payment_info
import_base_trademark
import_activity_info
import_cart_info
import_comment_info
import_coupon_use
import_coupon_info
import_favor_info
import_order_refund_info
import_order_status_log
import_spu_info
import_activity_rule
import_base_dic
import_order_detail_activity
import_order_detail_coupon
import_refund_payment
import_sku_attr_value
import_sku_sale_attr_value
;;
esac
增加脚本执行权限
chmod +x mysql_to_hdfs.sh
4、项目经验
Hive中的Null在底层是以“\\N”来存储,而MySQL中的Null在底层就是Null,为了保证数据两端的一致性。在导出数据时采用–input-null-string和–input-null-non-string两个参数。导入数据时采用–null-string和–null-non-string。
以上是关于电商数仓sqoop的主要内容,如果未能解决你的问题,请参考以下文章