User Consumption Behavior Analysis (MapReduce Implementation, Hive Analysis)

Posted 2021-07-19 一研为定


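I. Task Requirements

The data is a record of CD purchases by users of the CDNow website. The dataset contains about 60,000 rows and covers user behavior on CDNow from January 1997 to June 1998.

1. Data

The data has 4 fields. Sample rows: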

00001 19970101 1 11.77

00002 19970112 1 12.00

00002 19970112 5 77.00

00003 19970102 2 20.76

00003 19970330 2 20.76

00003 19970402 2 19.54

user_id: user ID, order_dt: purchase date, order_products: number of products purchased, order_amount: purchase amount
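
As a rough illustration of the four fields, the sketch below parses one raw line into typed values and rejects lines that are incomplete or malformed. The class name Order, its field names, and the choice to keep user_id as text (to preserve leading zeros) are assumptions made for this example and do not come from the original code.

```java
// Hypothetical value class for one raw record; the names are illustrative only.
class Order {
    final String userId;     // kept as text so leading zeros survive
    final String orderDate;  // yyyyMMdd
    final int products;      // order_products
    final double amount;     // order_amount

    Order(String userId, String orderDate, int products, double amount) {
        this.userId = userId;
        this.orderDate = orderDate;
        this.products = products;
        this.amount = amount;
    }

    // Parses one whitespace-separated line.
    // Returns null when the line does not hold exactly 4 well-formed fields.
    static Order parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length != 4 || !f[1].matches("\\d{8}")) {
            return null;
        }
        try {
            return new Order(f[0], f[1], Integer.parseInt(f[2]), Double.parseDouble(f[3]));
        } catch (NumberFormatException e) {
            return null; // non-numeric quantity or amount
        }
    }
}
```

Calling Order.parse on a sample row such as "00002 19970112 5 77.00" yields an object with products 5 and amount 77.0, while a truncated row yields null.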

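2. Data cleaning

Write a MapReduce program to clean the data: filter out incomplete records and output the fields separated by ','.

3. User consumption trend analysis

Use Hive to analyze the cleaned data and compute user consumption trends from the behavior data:

1. For each user in each month: number of purchases, number of products purchased, and total amount spent.

2. For each month: total number of purchases, total number of products purchased, total amount spent, and number of customers. Based on the results, draw a conclusion about the consumption trend.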

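II. Solution

1. Data cleaning with a MapReduce program

Approach: the mapper drops incomplete records and joins the fields with commas; the reducer removes duplicate records.

There are three ways for the mapper to handle empty values and produce comma-separated output:

Method 1

for (String m : msgs) is Java's enhanced for loop. It visits every element of the array msgs; each element has type String and is named m. If an element is not empty, it is appended to the string newmsgs, followed by a comma.

The trailing comma is then stripped from newmsgs, and newmsgs is split on commas into the array msgs2. A new string newmsgs2 takes the second field, msgs2[1], cut with substring(0, 6) (start inclusive, end exclusive), which truncates the date to the month.

Finally, the string newmsgs3 is built as newmsgs3 = msgs2[0]+","+newmsgs2+","+msgs2[2]+","+msgs2[3]; that is, fields 0, 1 and 2 are each followed by a comma and field 3 is not.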

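public class ShoppingMapper extends Mapper<LongWritable,Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Convert the Hadoop Text value to a Java String
        String data = value.toString();
        // Split the line on single spaces
        String[] msgs = data.split(" ");
        // Collect the non-empty tokens, each followed by a comma
        String newmsgs = "";
        for (String m:msgs){
            if (!m.equals("")){
                newmsgs += m + ',';
            }
        }
        // Skip blank lines, which would otherwise break the substring call below
        if (newmsgs.isEmpty()){
            return;
        }
        // Drop the trailing comma
        newmsgs = newmsgs.substring(0,newmsgs.length()-1);
        // Split the cleaned line on commas
        String[] msgs2 = newmsgs.split(",");
        // Filter out incomplete records (not exactly 4 fields, or a short date)
        if (msgs2.length != 4 || msgs2[1].length() < 6){
            return;
        }
        // Truncate the date to the month (yyyyMM)
        String newmsgs2 = msgs2[1].substring(0,6);
        // Join the four fields with commas
        String newmsgs3 = msgs2[0]+","+newmsgs2+","+msgs2[2]+","+msgs2[3];
        // Emit the cleaned record as the key
        context.write(new Text(newmsgs3),NullWritable.get());
    }
}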

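Method 2

First convert the Hadoop Text value to a Java String, split it on spaces, and store the pieces in the array msgs. Define an empty list named list, loop over msgs, and add every non-empty token to the list.

Define an empty string and loop over the list by index. A complete record has 4 fields, so list.size() = 4, and i = list.size()-1 = 3 is the fourth field because i starts at 0. The fourth field is the last one and needs no trailing comma, so it is appended to the string data as is.

When i = 1, the second field, the value is a date with day precision such as 19970101. To simplify the later analysis, the date is truncated to the month with substring, giving 199701, followed by a comma.

The other two fields, the first and the third, are appended with a trailing comma.

Because the mapper emits the whole cleaned record as the output key and has no use for an output value, k2 has type Text and v2 has type NullWritable.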

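import java.util.ArrayList;
import java.util.List;

public class ConMapper extends Mapper<LongWritable, Text,Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Convert the Hadoop Text value to a Java String
        String Msg =value.toString();
        // Split the line on single spaces
        String[] msgs=Msg.split(" ");
        // Define an empty list
        List<String> list=new ArrayList<>();
        // Add every non-empty token to the list
        for (String i:msgs){
            if(!i.equals("")){
                list.add(i);
            }
        }
        // Filter out incomplete records (not exactly 4 fields, or a short date)
        if (list.size()!=4 || list.get(1).length()<6){
            return;
        }
        // Define an empty string for the output record
        String data="";
        for (int i=0;i<list.size();i++){
            if(i==list.size()-1) {
                // Last field: no trailing comma
                data += list.get(i);
            }else if (i==1){
                // Date field: truncate to the month (yyyyMM)
                data+=list.get(i).substring(0,6)+",";
            }else{
                data+=list.get(i)+",";
            }
        }
        context.write(new Text(data),NullWritable.get());
    }
}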
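
The index bookkeeping in Method 2 can be avoided by letting String.join place the commas. The sketch below is a Hadoop-free rewrite of the same cleaning steps, so the logic can be checked locally before a job is submitted. The class name JoinCleaner and the method clean are hypothetical, and returning null for a rejected line is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original job.
class JoinCleaner {
    // Returns "user_id,yyyyMM,order_products,order_amount",
    // or null when the line does not contain exactly 4 usable fields.
    static String clean(String line) {
        List<String> fields = new ArrayList<>();
        for (String token : line.split(" ")) {
            if (!token.isEmpty()) {          // repeated spaces produce empty tokens
                fields.add(token);
            }
        }
        if (fields.size() != 4 || fields.get(1).length() < 6) {
            return null;                     // incomplete record
        }
        // Truncate the date to the month, then join with commas
        fields.set(1, fields.get(1).substring(0, 6));
        return String.join(",", fields);
    }

    public static void main(String[] args) {
        System.out.println(clean(" 00001 19970101  1   11.77"));
    }
}
```

Inside a mapper, the same method could be called on value.toString() and the result written only when it is not null.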

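Method 3

I like this last method best, but it requires knowing what \\s+ means.

\\s+ is the Java string literal for the regular expression \s+, which matches one or more whitespace characters. The fields in the raw data are not separated by exactly one space: some are separated by 1 space, others by 2 or 3, so \\s+ covers every case.

Then check for empty values by testing the length of each field; if any field is missing or has length 0, the record is skipped. The fields are read from indices 1 to 4, which presumes that each raw line begins with whitespace, so that words[0] is an empty string.

String data = value.toString();
// Split on runs of whitespace; a line that starts with whitespace yields an empty words[0]
String[] words = data.split("\\s+");
// Keep only records whose four fields (indices 1 to 4) are all present and non-empty
if (words.length >= 5 && words[1].length()>0 && words[2].length()>=6 && words[3].length()>0 && words[4].length()>0){
    // Truncate the date to the month (yyyyMM), as in Methods 1 and 2
    String word = words[1]+","+words[2].substring(0,6)+","+words[3]+","+words[4];
    context.write(new Text(word),NullWritable.get());
}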
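
The difference between splitting on a single space and splitting on a whitespace run is easy to see on one line. The sketch below is a standalone illustration; the class name SplitDemo and the sample line with a leading space are assumptions made for this example.

```java
import java.util.Arrays;

// Hypothetical demo class comparing the two split patterns.
class SplitDemo {
    // Every single space is a separator, so repeated spaces yield empty tokens
    static String[] bySingleSpace(String line) {
        return line.split(" ");
    }

    // A run of whitespace counts as one separator
    static String[] byWhitespaceRun(String line) {
        return line.split("\\s+");
    }

    public static void main(String[] args) {
        String line = " 00001 19970101  1   11.77";
        System.out.println(Arrays.toString(bySingleSpace(line)));
        System.out.println(Arrays.toString(byWhitespaceRun(line)));
    }
}
```

For this line the single-space split returns 8 tokens, 4 of them empty, while the \\s+ split returns 5 tokens: an empty first token caused by the leading space, followed by the four fields. That empty first token is why Method 3 reads the fields from index 1.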

The complete code is as follows:

mapper:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class ShoppingMapper extends Mapper<LongWritable,Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Convert the Hadoop Text value to a Java String
        String data = value.toString();
        // Split the line on single spaces
        String[] msgs = data.split(" ");
        // Collect the non-empty tokens, each followed by a comma
        String newmsgs = "";
        for (String m:msgs){
            if (!m.equals("")){
                newmsgs += m + ',';
            }
        }
        // Skip blank lines, which would otherwise break the substring call below
        if (newmsgs.isEmpty()){
            return;
        }
        // Drop the trailing comma
        newmsgs = newmsgs.substring(0,newmsgs.length()-1);
        // Split the cleaned line on commas
        String[] msgs2 = newmsgs.split(",");
        // Filter out incomplete records (not exactly 4 fields, or a short date)
        if (msgs2.length != 4 || msgs2[1].length() < 6){
            return;
        }
        // Truncate the date to the month (yyyyMM)
        String newmsgs2 = msgs2[1].substring(0,6);
        // Join the four fields with commas
        String newmsgs3 = msgs2[0]+","+newmsgs2+","+msgs2[2]+","+msgs2[3];
        // Emit the cleaned record as the key
        context.write(new Text(newmsgs3),NullWritable.get());
    }
}

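reducer: the reducer removes duplicate records. MapReduce groups identical keys into a single reduce call, so writing the key once per call, rather than once per value, drops the duplicates. Because the date is truncated to the month, two separate orders by the same user in the same month with the same quantity and amount are also merged.

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class ShoppingReducer extends Reducer<Text,NullWritable,Text, NullWritable> {
    // Remove duplicates: all identical keys arrive in one reduce call,
    // so the key is written once regardless of how many values it has
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        context.write(key,NullWritable.get());
    }
}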
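
To see why one write per key removes duplicates, the grouping that happens between map and reduce can be imitated in plain Java. This is only a rough model under stated assumptions: a TreeMap stands in for the shuffle, which sorts and groups keys, and the class name DedupSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of shuffle and reduce; not Hadoop code.
class DedupSketch {
    // Groups identical records and counts how many times each was emitted.
    // The sorted map mimics keys reaching the reducer in sorted order.
    static Map<String, Integer> shuffle(List<String> mapOutput) {
        Map<String, Integer> grouped = new TreeMap<>();
        for (String record : mapOutput) {
            grouped.merge(record, 1, Integer::sum);
        }
        return grouped;
    }

    // Emits each key once, no matter how many values were grouped under it.
    static List<String> reduceOncePerKey(Map<String, Integer> grouped) {
        return new ArrayList<>(grouped.keySet());
    }
}
```

Feeding three records of which two are identical through shuffle and reduceOncePerKey leaves two records, in key order.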

Main

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ShoppingMain {
    public static void main(String[] args) throws Exception{
        Job job = Job.getInstance(new Configuration());
        // Entry point of the job
        job.setJarByClass(ShoppingMain.class);
        // Set the mapper and its output types
        job.setMapperClass(ShoppingMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        // Set the reducer and the final output types;
        // without setReducerClass the job falls back to the default identity reducer
        job.setReducerClass(ShoppingReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // Set the input and output paths
        FileInputFormat.setInputPaths(job,new Path(args[0]));
        FileOutputFormat.setOutputPath(job,new Path(args[1]));
        // Submit the job, wait for it to finish, and report success through the exit code
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

2. Solving the problem with Hive

First create the database and the table, then load the data cleaned by MapReduce into Hive.

create database if not exists hadoop_hive;
use hadoop_hive;

create table tb2 (user_id int,order_dt int,order_products int,order_amount float)
row format delimited fields terminated by ',';

load data local inpath '/home/ubuntu/Desktop/text2_update2' into table tb2;

1. For each user in each month: number of purchases, number of products purchased, and total amount spent.
(1) select order_dt,user_id,count(user_id) from tb2 group by order_dt,user_id;
(2) select order_dt,user_id,sum(order_products) from tb2 group by order_dt,user_id;
(3) select order_dt,user_id,sum(order_amount) from tb2 group by order_dt,user_id;
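
One way to sanity-check the three queries above on a handful of rows is to repeat the grouping in memory. The sketch below groups cleaned lines by (order_dt, user_id) and accumulates the count and both sums. The class name MonthlyUserStats and the key format "order_dt,user_id" are assumptions of this example, and it keeps user_id as text whereas the Hive table stores it as int.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory counterpart of "group by order_dt,user_id".
class MonthlyUserStats {
    int orders;      // count(user_id)
    int products;    // sum(order_products)
    double amount;   // sum(order_amount)

    // Expects cleaned lines of the form "user_id,yyyyMM,products,amount".
    static Map<String, MonthlyUserStats> aggregate(String[] cleanedLines) {
        Map<String, MonthlyUserStats> result = new TreeMap<>();
        for (String line : cleanedLines) {
            String[] f = line.split(",");
            String key = f[1] + "," + f[0];   // order_dt,user_id
            MonthlyUserStats s = result.get(key);
            if (s == null) {
                s = new MonthlyUserStats();
                result.put(key, s);
            }
            s.orders += 1;
            s.products += Integer.parseInt(f[2]);
            s.amount += Double.parseDouble(f[3]);
        }
        return result;
    }
}
```

On the six sample rows from section I, after truncating the dates to the month, user 00002 in 199701 ends up with 2 purchases, 6 products and a total of 89.00.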

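2. For each month: total number of purchases, total number of products purchased, total amount spent, and number of customers. Based on the results, draw a conclusion about the consumption trend.
(1) select order_dt,count(user_id) from tb2 group by order_dt;
(2) select order_dt,sum(order_products) from tb2 group by order_dt;
(3) select order_dt,sum(order_amount) from tb2 group by order_dt;
(4) Number of customers per month; DISTINCT counts each user ID only once
select order_dt,count(DISTINCT user_id) from tb2 group by order_dt;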
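
The monthly totals can be cross-checked the same way. In the sketch below a HashSet plays the role of count(DISTINCT user_id): adding the same user twice leaves the set size unchanged. The class name MonthlyStats is hypothetical and the input format "user_id,yyyyMM,products,amount" is assumed to match the mapper output.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical in-memory counterpart of "group by order_dt".
class MonthlyStats {
    int orders;                                  // count(user_id)
    int products;                                // sum(order_products)
    double amount;                               // sum(order_amount)
    final Set<String> users = new HashSet<>();   // size() = count(DISTINCT user_id)

    static Map<String, MonthlyStats> aggregate(String[] cleanedLines) {
        Map<String, MonthlyStats> result = new TreeMap<>();
        for (String line : cleanedLines) {
            String[] f = line.split(",");
            MonthlyStats s = result.get(f[1]);   // f[1] is the month
            if (s == null) {
                s = new MonthlyStats();
                result.put(f[1], s);
            }
            s.orders += 1;
            s.products += Integer.parseInt(f[2]);
            s.amount += Double.parseDouble(f[3]);
            s.users.add(f[0]);                   // duplicates are ignored by the set
        }
        return result;
    }
}
```

For the six sample rows, 199701 has 4 purchases by 3 distinct customers, which shows how the purchase count and the customer count differ.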

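III. Summary

Through this exercise I learned how to use MapReduce to remove empty values, split lines on spaces, and output comma-separated data. I also learned to process data with Hive: group by for grouping, count() to count rows, and sum() to total values. I gained a lot from all of this.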
