Hive log processing: counting a website's PV and UV, plus a data-cleaning example that combines Hive with Python

Posted 2020-10-30



Part 1: Cleaning and processing logs in Hive to count PV and UV

Part 2: Cleaning Hive data with Python

Part 1: Log processing

Count the website's visits for each hour of the day:

1.1 Create the table structure in Hive:

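The raw log lines cannot be imported directly with a simple field delimiter, so the table is defined with a RegexSerDe: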
create table db_bflog.bf_log_src (
remote_addr string,
remote_user string,
time_local string,
request string,
status string,
body_bytes_sent string,
request_body string,
http_referer string,
http_user_agent string,
http_x_forwarded_for string,
host string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\"[^ ]*\") (\"-|[^ ]*\") (\"[^\]]*\") (\"[^\"]*\") (\"[0-9]*\") (\"[0-9]*\") (-|[^ ]*) (\"[^ ]*\") (\"[^\"]*\") (-|[^ ]*) (\"[^ ]*\")"
)
STORED AS TEXTFILE;
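
To see how the regular expression splits one log line into the eleven columns, the following Python sketch applies what is assumed to be the unescaped form of the "input.regex" value to a single line. The sample line, its addresses, and the helper name parse_line are invented for illustration; they are not taken from the real access log.

```python
import re

# Assumed unescaped form of the "input.regex" value from the table definition.
LOG_PATTERN = re.compile(
    r'("[^ ]*") ("-|[^ ]*") ("[^\]]*") ("[^"]*") ("[0-9]*") ("[0-9]*") '
    r'(-|[^ ]*) ("[^ ]*") ("[^"]*") (-|[^ ]*) ("[^ ]*")'
)

# Column names in the same order as the capture groups.
FIELDS = [
    "remote_addr", "remote_user", "time_local", "request", "status",
    "body_bytes_sent", "request_body", "http_referer", "http_user_agent",
    "http_x_forwarded_for", "host",
]

# Hypothetical log line, invented for illustration only.
SAMPLE_LINE = (
    '"192.0.2.10" "-" "31/Aug/2015:00:04:37 +0800" '
    '"GET /course/view.php?id=27 HTTP/1.1" "200" "7877" - '
    '"http://example.com/index.php" "Mozilla/5.0" "-" "example.com"'
)


def parse_line(line):
    """Return a dict of column name -> captured text, or None if no match."""
    match = LOG_PATTERN.fullmatch(line)
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))


row = parse_line(SAMPLE_LINE)
print(row["time_local"])
print(row["request"])
```

The captured values still carry their surrounding double quotes, which is why the quote-stripping UDF in section 1.3.1 is needed.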


1.2 Load the data into the Hive table:

load data local inpath '/home/hadoop/moodle.ibeifeng.access.log' into table db_bflog.bf_log_src ;


1.3 Custom UDFs

1.3.1: A UDF that strips the quotation marks

package org.apache.hadoop.udf;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * * New UDF classes need to inherit from this UDF class.
 * 
 * @author zhangyy
 *
 */
public class RemoveQuotesUDF extends UDF {

    /*
    1. Implement one or more methods named "evaluate" which will be called by Hive.
    2."evaluate" should never be a void method. However it can return "null" if needed.
    */
    public Text evaluate(Text str){
        if(null == str){
            return null;
        }

        // validate 
        if(StringUtils.isBlank(str.toString())){
            return null ;
        }

        // strip every double quote from the value
        return new Text(str.toString().replaceAll("\"", ""));
    }

    public static void main(String[] args) {
        System.out.println(new RemoveQuotesUDF().evaluate(new Text("\"GET /course/view.php?id=27 HTTP/1.1\"")));
    }
}

1.3.2: A UDF that converts the time format

package org.apache.hadoop.udf;

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * * New UDF classes need to inherit from this UDF class.
 * 
 * @author zhangyy
 *
 */
public class DateTransformUDF extends UDF {

    private final SimpleDateFormat inputFormat = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH) ;
    private final SimpleDateFormat outputFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss") ;

    /*
    1. Implement one or more methods named "evaluate" which will be called by Hive.
    2."evaluate" should never be a void method. However it can return "null" if needed.
    */
    /**
     * input:
     *      31/Aug/2015:00:04:37 +0800
     * output:
     *      2015-08-31 00:04:37
     */
    public Text evaluate(Text str){
        Text output = new Text() ;

        if(null == str){
            return null;
        }

        // validate 
        if(StringUtils.isBlank(str.toString())){
            return null ;
        }

        try{
            // 1) parse 
            Date parseDate = inputFormat.parse(str.toString().trim());
            // 2) transform
            String outputDate = outputFormat.format(parseDate) ;
            // 3) set
            output.set(outputDate);
        }catch(Exception e){
            e.printStackTrace();
        }

        // return the converted time (an empty Text if parsing failed)
        return output;
    }

    public static void main(String[] args) {
        System.out.println(new DateTransformUDF().evaluate(new Text("31/Aug/2015:00:04:37 +0800")));
    }
}
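
For a quick check of what the two UDFs are expected to return before packaging them, here is a rough Python mirror of their logic. It is only a sketch: the function names remove_quotes and date_transform are made up, the month name is assumed to be parsed under an English (default C) locale, and, unlike the Java code, which stops parsing after the seconds, this version consumes the trailing "+0800" offset with %z.

```python
from datetime import datetime


def remove_quotes(value):
    """Mirror of RemoveQuotesUDF: None for null/blank input, else drop all double quotes."""
    if value is None or value.strip() == "":
        return None
    return value.replace('"', "")


def date_transform(value):
    """Mirror of DateTransformUDF: '31/Aug/2015:00:04:37 +0800' -> '2015-08-31 00:04:37'."""
    if value is None or value.strip() == "":
        return None
    try:
        # %b relies on English month abbreviations (assumption: default C locale).
        parsed = datetime.strptime(value.strip(), "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        # The Java UDF returns an empty Text when parsing fails; mirror that here.
        return ""
    # Keep the wall-clock time exactly as written in the log.
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


raw = '"31/Aug/2015:00:04:37 +0800"'
print(date_transform(remove_quotes(raw)))
```

Chaining the two functions in this order matches how the UDFs are nested in the hourly query of section 1.5.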

Export RemoveQuotesUDF and DateTransformUDF as JAR files and place them in the /home/hadoop/jars directory:


1.4 Register the UDFs in Hive

  Register RemoveQuotesUDF as a UDF:

  add jar /home/hadoop/jars/RemoveQuotesUDF.jar ;

  create temporary function My_RemoveQuotes as "org.apache.hadoop.udf.RemoveQuotesUDF" ;

  Register DateTransformUDF as a UDF:

  add jar /home/hadoop/jars/DateTransformUDF.jar ;

  create temporary function My_DateTransform as "org.apache.hadoop.udf.DateTransformUDF" ;


1.5 Create the table that holds only the required columns:

create table db_bflog.bf_log_comm(
remote_addr string,
time_local string,
request string,
http_referer string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC tblproperties ("orc.compress"="SNAPPY");


Extract the relevant columns from the source table:

insert into table db_bflog.bf_log_comm select remote_addr, time_local, request, http_referer from db_bflog.bf_log_src ;


Run SQL to count the PV (page views) for each hour:

select t.hour,count(*) cnt
from
(select substring(my_datetransform(my_removequotes(time_local)),12,2) hour from bf_log_comm) t
group by t.hour order by cnt desc ;
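
The query above only counts PV. As a small sketch of the same grouping logic, extended with UV, the Python below counts both per hour. It assumes that UV means the number of distinct remote_addr values in each hour, which would correspond to COUNT(DISTINCT remote_addr) in HiveQL. The rows and the function name hourly_pv_uv are invented for illustration.

```python
from collections import defaultdict

# Hypothetical (remote_addr, cleaned time_local) pairs, invented for illustration.
ROWS = [
    ("192.0.2.10", "2015-08-31 00:04:37"),
    ("192.0.2.10", "2015-08-31 00:15:02"),
    ("192.0.2.11", "2015-08-31 00:20:41"),
    ("192.0.2.12", "2015-08-31 01:03:09"),
]


def hourly_pv_uv(rows):
    """Return {hour: (pv, uv)} where uv counts distinct addresses (assumed definition)."""
    pv = defaultdict(int)
    visitors = defaultdict(set)
    for remote_addr, time_local in rows:
        # Same slice as substring(..., 12, 2) in the Hive query (1-based position 12).
        hour = time_local[11:13]
        pv[hour] += 1
        visitors[hour].add(remote_addr)
    return {hour: (pv[hour], len(visitors[hour])) for hour in pv}


stats = hourly_pv_uv(ROWS)
# Sort by PV in descending order, like "order by cnt desc".
for hour, (pv_count, uv_count) in sorted(
    stats.items(), key=lambda item: item[1][0], reverse=True
):
    print(hour, pv_count, uv_count)
```

Counting distinct IP addresses is only an approximation of unique visitors, since several users can share one address.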


Part 2: Cleaning Hive data with Python

  Count how many movie viewings were recorded on each day of the week, using the MovieLens 100k dataset.
  Download the test data:

 wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
 unzip ml-100k.zip

2.1 Create the Hive table

 CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;


2.2 Load the data:

LOAD DATA LOCAL INPATH '/home/hadoop/ml-100k/u.data'
OVERWRITE INTO TABLE u_data;


2.3 Create the weekday_mapper.py script

import sys
import datetime

for line in sys.stdin:
  line = line.strip()
  userid, movieid, rating, unixtime = line.split('\t')
  # isoweekday(): Monday is 1, Sunday is 7 (in the machine's local time zone)
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print('\t'.join([userid, movieid, rating, str(weekday)]))
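
Before handing the script to Hive, the per-line mapping can be tried locally. The sketch below wraps the same logic in a function so it can be called on one tab-separated line. The name map_line and its tz parameter are additions for illustration: the original script uses the machine's local time zone, while the example pins UTC so the result does not depend on where it runs. The sample line is written in the u.data layout and is meant as an illustration.

```python
from datetime import datetime, timezone


def map_line(line, tz=None):
    """Replace the unix timestamp in a u.data-style line with its ISO weekday (1=Mon .. 7=Sun)."""
    userid, movieid, rating, unixtime = line.strip().split("\t")
    # tz=None reproduces the script's behaviour (local time zone).
    weekday = datetime.fromtimestamp(float(unixtime), tz).isoweekday()
    return "\t".join([userid, movieid, rating, str(weekday)])


# Illustrative line in the u.data layout: userid, movieid, rating, unixtime.
SAMPLE = "196\t242\t3\t881250949\n"
mapped = map_line(SAMPLE, tz=timezone.utc)
print(mapped)
```

The same check can be done from a shell by piping a few lines of u.data into the script, since Hive's TRANSFORM feeds rows to it on standard input in the same tab-separated form.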

2.4 Create a temporary Hive table to hold the extracted data:

 CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

Add the Python script to Hive:

add FILE /home/hadoop/weekday_mapper.py;


2.5 Extract the data from the old table

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;


2.6 Query the required data:

SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;
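
The GROUP BY above amounts to counting rows per weekday value. A minimal Python sketch of the same aggregation follows, using a handful of made-up rows in the u_data_new layout; the function name count_by_weekday is hypothetical.

```python
from collections import Counter

# Hypothetical rows in the u_data_new layout: userid, movieid, rating, weekday.
MAPPED_ROWS = [
    "196\t242\t3\t4",
    "186\t302\t3\t6",
    "22\t377\t1\t4",
]


def count_by_weekday(rows):
    """Count rows per weekday, like SELECT weekday, COUNT(*) ... GROUP BY weekday."""
    counts = Counter()
    for row in rows:
        weekday = int(row.split("\t")[3])
        counts[weekday] += 1
    # Return the counts ordered by weekday number for readability.
    return dict(sorted(counts.items()))


weekday_counts = count_by_weekday(MAPPED_ROWS)
print(weekday_counts)
```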

