Spark: an enhanced WordCount that counts file accesses in logs

Posted by 赵侠客


Original post: http://blog.csdn.net/whzhaochao/article/details/72416956

Preface

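Learning the basic syntax of Scala and Spark on its own is rather dry; working through a small practical example is an effective way to make the basics stick. The previous post (http://blog.csdn.net/whzhaochao/article/details/72358215) covered the most basic WordCount. This post builds a simple feature that is closer to real production use: analyzing CDN or nginx log files to compute statistics such as PV, UV, IP addresses, and referrers. It is only meant as a practice exercise; real-world use will likely need something more elaborate.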

Counting file requests

Below is a sample of the Qiniu CDN request log:

223.93.159.226 HIT 203 [15/Feb/2017:11:14:35 +0800] "GET http://v-cdn.abc.com.cn/141035.mp4 HTTP/1.1" 206 5444007 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko Core/1.53.2141.400 QQBrowser/9.5.10219.400"
223.93.159.226 HIT 62 [15/Feb/2017:11:14:36 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4866645 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko Core/1.53.2141.400 QQBrowser/9.5.10219.400"
223.93.159.226 HIT 15 [15/Feb/2017:11:14:36 +0800] "GET http://v-cdn.abc.com.cn/141035.mp4 HTTP/1.1" 206 4854183 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko Core/1.53.2141.400 QQBrowser/9.5.10219.400"
223.93.159.226 HIT 91 [15/Feb/2017:11:14:36 +0800] "GET http://v-cdn.abc.com.cn/141032.mp4 HTTP/1.1" 206 4751957 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko Core/1.53.2141.400 QQBrowser/9.5.10219.400"
61.164.41.226 HIT 2537 [15/Feb/2017:11:13:54 +0800] "GET http://v-cdn.abc.com.cn/141035.mp4 HTTP/1.1" 200 5173432 "http://www.abc.com.cn/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
115.215.115.229 HIT 1 [15/Feb/2017:11:17:53 +0800] "GET http://v-cdn.abc.com.cn/videojs/video-js.css HTTP/1.1" 200 14382 "https://v.abc.com.cn/video/iframe/player.html?id=140976&autoPlay=1" "Mozilla/5.0 (Linux; android 5.1; M578CA Build/LMY47D; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043024 Safari/537.36 MicroMessenger/6.5.4.1000 NetType/WIFI Language/zh_CN"
115.215.115.229 HIT 1 [15/Feb/2017:11:17:53 +0800] "GET http://v-cdn.abc.com.cn/videojs/video.js HTTP/1.1" 200 173397 "https://v.abc.com.cn/video/iframe/player.html?id=140976&autoPlay=1" "Mozilla/5.0 (Linux; Android 5.1; M578CA Build/LMY47D; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043024 Safari/537.36 MicroMessenger/6.5.4.1000 NetType/WIFI Language/zh_CN"
115.236.173.95 HIT 1 [15/Feb/2017:11:17:49 +0800] "GET http://v-cdn.abc.com.cn/videojs/video-js.css HTTP/1.1" 200 14382 "http://v.abc.com.cn/video/iframe/player.html?id=139067&auto=1" "Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Mobile/13F69 QQ/6.6.9.412 V1_IPH_SQ_6.6.9_1_APP_A Pixel/1080 Core/UIWebView NetType/WIFI"
183.129.251.218 HIT 486 [15/Feb/2017:11:18:40 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4845881 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
115.236.161.52 HIT 34 [15/Feb/2017:11:17:13 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4976817 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
115.236.161.52 HIT 27 [15/Feb/2017:11:17:13 +0800] "GET http://v-cdn.abc.com.cn/141032.mp4 HTTP/1.1" 206 3859028 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
115.236.161.52 HIT 37 [15/Feb/2017:11:17:13 +0800] "GET http://v-cdn.abc.com.cn/141032.mp4 HTTP/1.1" 206 3859028 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
115.236.161.52 HIT 43 [15/Feb/2017:11:17:13 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4517997 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
115.236.161.52 HIT 19 [15/Feb/2017:11:17:13 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 5304429 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
115.228.161.136 HIT 1 [15/Feb/2017:11:16:51 +0800] "GET http://v-cdn.abc.com.cn/videojs/video-js.css HTTP/1.1" 200 14382 "https://v.abc.com.cn/video/iframe/player.html?id=140994&autoPlay=1" "Mozilla/5.0 (iPhone; CPU iPhone OS 10_2_1 like Mac OS X) AppleWebKit/602.4.6 (KHTML, like Gecko) Mobile/14D27 MicroMessenger/6.5.4 NetType/WIFI Language/zh_CN"
202.107.208.102 HIT 1226 [15/Feb/2017:11:19:10 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4517997 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
115.231.248.162 HIT 34 [15/Feb/2017:11:17:56 +0800] "GET http://v-cdn.abc.com.cn/141035.mp4 HTTP/1.1" 206 1208743 "http://www.abc.com.cn/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET4.0C; .NET4.0E; GWX:DOWNLOADED; GWX:RESERVED)"
221.234.216.142 HIT 744 [15/Feb/2017:11:17:09 +0800] "GET http://v-cdn.abc.com.cn/140995.mp4 HTTP/1.1" 206 4194896 "https://v.abc.com.cn/video/iframe/player.html?id=140995&autoPlay=1" "Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Mobile/12B411 MicroMessenger/6.3.31 NetType/WIFI Language/zh_CN"
183.132.170.2 HIT 1 [15/Feb/2017:11:15:22 +0800] "GET http://v-cdn.abc.com.cn/videojs/video-js.css HTTP/1.1" 200 14382 "https://v.abc.com.cn/video/iframe/player.html?id=140976&autoPlay=1" "Mozilla/5.0 (Linux; Android 6.0; HUAWEI MT7-CL00 Build/HuaweiMT7-CL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043024 Safari/537.36 MicroMessenger/6.5.4.1000 NetType/WIFI Language/zh_CN"
183.132.170.2 HIT 1 [15/Feb/2017:11:15:22 +0800] "GET http://v-cdn.abc.com.cn/videojs/video.js HTTP/1.1" 200 173397 "https://v.abc.com.cn/video/iframe/player.html?id=140976&autoPlay=1" "Mozilla/5.0 (Linux; Android 6.0; HUAWEI MT7-CL00 Build/HuaweiMT7-CL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043024 Safari/537.36 MicroMessenger/6.5.4.1000 NetType/WIFI Language/zh_CN"
112.17.240.97 HIT 1440 [15/Feb/2017:11:20:31 +0800] "GET http://v-cdn.abc.com.cn/140941.mp4 HTTP/1.1" 206 6284261 "https://v.abc.com.cn/video/iframe/player.html?id=140941&autoPlay=1" "Mozilla/5.0 (iPhone; CPU iPhone OS 8_2 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Mobile/12D508 (zjxw;3.5.1;iPhone6,2;8.2;zh;bianfeng;b541b2039c2c00c66c14c7fb7e26df19fccd9cf4)"
125.118.106.43 HIT 32 [15/Feb/2017:11:20:57 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 1637949 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
125.118.106.43 HIT 31 [15/Feb/2017:11:20:57 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 5042489 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
125.118.106.43 HIT 32 [15/Feb/2017:11:20:57 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4517997 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
125.118.106.43 HIT 40 [15/Feb/2017:11:20:57 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4911485 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
125.118.106.43 HIT 30 [15/Feb/2017:11:20:57 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4583601 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
60.190.59.200 HIT 1741 [15/Feb/2017:11:20:05 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 5173425 "http://www.abc.com.cn/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"

The log format is:

Client IP, cache hit/miss, response time, request time, request method, request URL, request protocol, status code, response size, referer, user agent
ClientIP Hit/Miss ResponseTime [Time Zone] Method URL Protocol StatusCode TrafficSize Referer UserAgent
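
To make the field layout concrete, here is a minimal sketch that splits one log line into the fields listed above. It is not part of the original code: the `LogEntry` case class, its field names, and `LinePattern` are hypothetical, and the regex assumes every line follows the sample exactly (quoted request, quoted referer, quoted user agent).

```scala
import scala.util.matching.Regex

object LogLineDemo {
  // Hypothetical record type; field names follow the format line above.
  final case class LogEntry(
      clientIp: String, cache: String, responseTime: Int, time: String,
      method: String, url: String, protocol: String,
      status: Int, size: Long, referer: String, userAgent: String)

  // One capture group per field; quoted fields run up to the next double quote.
  val LinePattern: Regex =
    """^(\S+) (\S+) (\d+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+) "([^"]*)" "([^"]*)"$""".r

  // A line taken from the sample log above.
  val Sample: String =
    "61.164.41.226 HIT 2537 [15/Feb/2017:11:13:54 +0800] " +
      "\"GET http://v-cdn.abc.com.cn/141035.mp4 HTTP/1.1\" 200 5173432 " +
      "\"http://www.abc.com.cn/\" \"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)\""

  // Returns None for lines that do not follow the expected layout.
  def parse(line: String): Option[LogEntry] = line match {
    case LinePattern(ip, cache, rt, time, method, url, proto, status, size, ref, ua) =>
      Some(LogEntry(ip, cache, rt.toInt, time, method, url, proto,
        status.toInt, size.toLong, ref, ua))
    case _ => None
  }

  def main(args: Array[String]): Unit =
    println(parse(Sample))
}
```

The statistics in the rest of this post only need two or three of these fields each, which is why the code there uses small targeted regexes rather than a full parser like this one.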

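Counting unique IPs

Approach

Counting unique IPs takes two steps:
1. Extract the IP address from each log line
2. Remove duplicate IPs to get the number of unique IPs

Code for counting unique IPs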

    // Regex that matches an IPv4 address
    val  IPPattern="((?:(?:25[0-5]|2[0-4]\\d|((1\\d{2})|([1-9]?\\d)))\\.){3}(?:25[0-5]|2[0-4]\\d|((1\\d{2})|([1-9]?\\d))))".r

    // 1. Count requests per IP (input is assumed to be an RDD[String] of log lines)
    val ipNums=input.flatMap(x=>IPPattern findFirstIn(x)).map(x=>(x,1)).reduceByKey((x,y)=>x+y).sortBy(_._2,false)
    // Print the 10 IPs with the most requests
    ipNums.take(10).foreach(println)
    println("Unique IPs: "+ipNums.count())

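How it works

  • flatMap(x=>IPPattern findFirstIn(x)) extracts the IP address from each log line with the regex
  • map(x=>(x,1)) maps each IP to (IP,1), forming a pair RDD
  • reduceByKey((x,y)=>x+y) merges identical IPs, giving (IP,count)
  • sortBy(_._2,false) sorts by request count in descending order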
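
The same four steps can be tried without a Spark cluster. The sketch below runs them on a plain Scala `Seq`, with `groupBy` plus a sum standing in for `reduceByKey`. The object name, `countByIp`, and the shortened sample lines are made up for illustration; the regex is the article's pattern with each regex backslash escaped once.

```scala
object UniqueIpDemo {
  // The article's IPv4 pattern, with each regex backslash escaped once.
  val IPPattern =
    "((?:(?:25[0-5]|2[0-4]\\d|((1\\d{2})|([1-9]?\\d)))\\.){3}(?:25[0-5]|2[0-4]\\d|((1\\d{2})|([1-9]?\\d))))".r

  // Hypothetical sample: log lines cut down to their first three fields.
  val Sample: Seq[String] = Seq(
    "223.93.159.226 HIT 203",
    "223.93.159.226 HIT 62",
    "61.164.41.226 HIT 2537",
    "no address on this line")

  // Extract the IP, pair it with 1, sum per IP, sort by count descending.
  def countByIp(lines: Seq[String]): Seq[(String, Int)] =
    lines
      .flatMap(line => IPPattern.findFirstIn(line)) // lines without an IP are dropped
      .map(ip => (ip, 1))
      .groupBy(_._1)                                 // stands in for reduceByKey
      .map { case (ip, pairs) => (ip, pairs.map(_._2).sum) }
      .toSeq
      .sortBy { case (ip, count) => (-count, ip) }   // ties broken by IP for a stable order

  def main(args: Array[String]): Unit = {
    val counts = countByIp(Sample)
    counts.foreach(println)
    println("Unique IPs: " + counts.size)
  }
}
```

Because each IP appears exactly once after the per-key sum, the number of unique IPs is simply the number of resulting pairs, which is what `ipNums.count()` returns in the Spark version.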

Results

(114.55.227.102,9348)
(220.191.255.197,2640)
(115.236.173.94,2476)
(183.129.221.102,2187)
(112.53.73.66,1794)
(115.236.173.95,1650)
(220.191.254.129,1278)
(218.88.25.200,751)
(183.129.221.104,569)
(115.236.173.93,529)
Unique IPs: 43649

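Counting unique IPs per video

Sometimes the number of unique IPs across the whole site is not enough, and we also want to know how many unique IPs requested each video.

Approach

The computation has three steps:
1. Keep only video requests and turn each log line into a (file name, IP address) pair
2. Group by file name, like GROUP BY in a database; the RDD now has the shape (file name, [IP1,IP1,IP2,…]), with duplicate IPs
3. Deduplicate the IPs for each file name; the RDD now has the shape (file name, [IP1,IP2,…]), with no duplicate IPs

Code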


  // Regex that matches the video file name (the dot is escaped so it only matches a literal ".")
  val  fileNamePattern="([0-9]+)\\.mp4".r
  def getFileNameAndIp(line:String)={
    (fileNamePattern.findFirstIn(line).mkString,IPPattern.findFirstIn(line).mkString)
  }
  // 2. Count unique IPs per video
    input.filter(x=>x.matches(".*([0-9]+)\\.mp4.*")).map(x=>getFileNameAndIp(x)).groupByKey().map(x=>(x._1,x._2.toList.distinct)).
      sortBy(_._2.size,false).take(10).foreach(x=>println("Video:"+x._1+" unique IPs:"+x._2.size))

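How it works

  • filter(x=>x.matches(".*([0-9]+)\\.mp4.*")) keeps only the video requests in the log
  • map(x=>getFileNameAndIp(x)) turns each log line into a (file name, IP) pair
  • groupByKey() groups by file name; the RDD now has the shape (file name, [IP1,IP1,IP2,…]), with duplicate IPs
  • map(x=>(x._1,x._2.toList.distinct)) removes duplicate IP addresses from each value
  • sortBy(_._2.size,false) sorts by unique-IP count in descending order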
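
As a small variation on the steps above, the sketch below deduplicates the (file name, IP) pairs before grouping, so each group only ever holds unique IPs. This is a swapped-in ordering, not the article's code, and it runs on a plain Scala `Seq` rather than an RDD. The object name, `uniqueIpsPerVideo`, the simplified IP regex, and the shortened sample lines are all assumptions made for illustration.

```scala
object VideoIpDemo {
  private val FileName = """[0-9]+\.mp4""".r
  // Simplified IPv4 pattern for this sketch; it does not range-check the octets.
  private val Ip = """\d{1,3}(?:\.\d{1,3}){3}""".r

  // Hypothetical sample: log lines reduced to the parts this statistic needs.
  val Sample: Seq[String] = Seq(
    "223.93.159.226 HIT 203 GET http://v-cdn.abc.com.cn/141035.mp4",
    "223.93.159.226 HIT 15 GET http://v-cdn.abc.com.cn/141035.mp4",
    "61.164.41.226 HIT 2537 GET http://v-cdn.abc.com.cn/141035.mp4",
    "223.93.159.226 HIT 62 GET http://v-cdn.abc.com.cn/141027.mp4",
    "115.215.115.229 HIT 1 GET http://v-cdn.abc.com.cn/videojs/video.js")

  def uniqueIpsPerVideo(lines: Seq[String]): Seq[(String, Int)] =
    lines
      .flatMap { line =>
        // Lines without an .mp4 file name or an IP yield None and are dropped.
        for (file <- FileName.findFirstIn(line); ip <- Ip.findFirstIn(line))
          yield (file, ip)
      }
      .distinct                                       // drop repeated (file, IP) pairs
      .groupBy(_._1)
      .map { case (file, pairs) => (file, pairs.size) }
      .toSeq
      .sortBy { case (file, count) => (-count, file) } // most unique IPs first

  def main(args: Array[String]): Unit =
    uniqueIpsPerVideo(Sample).foreach { case (file, count) =>
      println("Video:" + file + " unique IPs:" + count)
    }
}
```

Deduplicating first keeps each group small; on an RDD, calling `distinct()` on the pair RDD before grouping would play the same role, though whether it is faster depends on the data.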

Results

Video:141081.mp4 unique IPs:2393
Video:140995.mp4 unique IPs:2050
Video:141027.mp4 unique IPs:1784
Video:141090.mp4 unique IPs:1702
Video:141032.mp4 unique IPs:1528
Video:89973.mp4 unique IPs:1523
Video:141080.mp4 unique IPs:1425
Video:141035.mp4 unique IPs:1321
Video:141082.mp4 unique IPs:1272
Video:140938.mp4 unique IPs:816

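Traffic for each hour of the day

Sometimes we want to know how much video traffic the site serves each hour, to see at what times of day users like to come and watch.

Approach

  1. Extract the access time and the response size from each log line to form an RDD of (access time, size); failed requests such as 404s have to be dropped here
  2. Group by access time to form an RDD of (access time, [size1,size2,…])
  3. Sum the sizes for each access time to get (access time, total size)

Code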

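  // Needed for the Regex type used by isMatch
  import scala.util.matching.Regex

  // [15/Feb/2017:11:17:13 +0800]: matches 2017:11 so that traffic can be aggregated by hour
  val  timePattern=".*(2017):([0-9]{2}):[0-9]{2}:[0-9]{2}.*".r
  // Matches the HTTP status code and the response size
  val httpSizePattern=".*\\s(200|206|304)\\s([0-9]+)\\s.*".r

  // Returns true when the whole string matches the pattern
  def  isMatch(pattern:Regex,str:String)={
    str match {
      case pattern(_*) => true
      case _ => false
    }
  }

  // 3. Traffic for each hour of the day
  // getTimeAndSize is not shown in this post; it is expected to return (hour, size in bytes) for a line
  // collect() brings the hourly totals to the driver so they print in sorted order
    input.filter(x=>isMatch(httpSizePattern,x)).filter(x=>isMatch(timePattern,x)).map(x=>getTimeAndSize(x)).groupByKey()
      .map(x=>(x._1,x._2.sum)).sortByKey().collect().foreach(x=>println(x._1+"h CDN traffic="+x._2/(1024*1024*1024)+"G"))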

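How it works

  • filter(x=>isMatch(httpSizePattern,x)).filter(x=>isMatch(timePattern,x)) drops invalid requests
  • map(x=>getTimeAndSize(x)) turns each log line into an element of RDD(request hour, request size)
  • groupByKey() groups by request hour, forming RDD(request hour, [size1,size2,…])
  • map(x=>(x._1,x._2.sum)) sums the request sizes for each hour, forming RDD(request hour, total size)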
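
The pipeline calls getTimeAndSize, but the post does not show its definition. The sketch below gives one possible shape for it, inferred from the two patterns: `timeAndSize` is a hypothetical stand-in that returns the hour and the byte count, or `None` for lines that fail either pattern. The aggregation again runs on a plain Scala `Seq`; the `logLine` helper and the 404 sample line are made up, while the two 11 o'clock sizes come from the sample log.

```scala
object HourlyTrafficDemo {
  // The article's patterns, with each regex backslash escaped once.
  private val TimePattern = ".*(2017):([0-9]{2}):[0-9]{2}:[0-9]{2}.*".r
  private val HttpSizePattern = ".*\\s(200|206|304)\\s([0-9]+)\\s.*".r

  // Hypothetical stand-in for getTimeAndSize: (hour, bytes), or None if a pattern fails.
  def timeAndSize(line: String): Option[(String, Long)] =
    line match {
      case TimePattern(_, hour) =>
        line match {
          case HttpSizePattern(_, size) => Some((hour, size.toLong))
          case _ => None // status code other than 200, 206, or 304
        }
      case _ => None
    }

  // Builds a log line in the sample format; only time, status, and size vary.
  private def logLine(time: String, status: Int, size: Long): String =
    "223.93.159.226 HIT 1 [15/Feb/2017:" + time + " +0800] " +
      "\"GET http://v-cdn.abc.com.cn/141035.mp4 HTTP/1.1\" " + status + " " + size +
      " \"http://www.abc.com.cn/\" \"ua\""

  val Sample: Seq[String] = Seq(
    logLine("11:14:35", 206, 5444007L),
    logLine("11:14:36", 206, 4866645L),
    logLine("12:00:01", 200, 14382L),
    logLine("12:00:02", 404, 0L)) // made-up failed request, should be ignored

  // Sum the bytes per hour and return the totals sorted by hour.
  def bytesPerHour(lines: Seq[String]): Seq[(String, Long)] =
    lines
      .flatMap(line => timeAndSize(line))
      .groupBy(_._1)
      .map { case (hour, pairs) => (hour, pairs.map(_._2).sum) }
      .toSeq
      .sortBy(_._1)

  def main(args: Array[String]): Unit =
    bytesPerHour(Sample).foreach { case (hour, bytes) =>
      println(hour + "h bytes=" + bytes)
    }
}
```

Returning an `Option` lets one `flatMap` replace the two `filter` calls, since lines that fail either pattern simply produce no pair. On an RDD the per-hour sum could also be written with `reduceByKey`, as in the unique-IP code, instead of `groupByKey` followed by `sum`.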

Results

00h CDN traffic=14G
01h CDN traffic=3G
02h CDN traffic=5G
03h CDN traffic=3G
04h CDN traffic=3G
05h CDN traffic=4G
06h CDN traffic=11G
07h CDN traffic=22G
08h CDN traffic=43G
09h CDN traffic=52G
10h CDN traffic=61G
11h CDN traffic=45G
12h CDN traffic=46G
13h CDN traffic=51G
14h CDN traffic=55G
15h CDN traffic=45G
16h CDN traffic=45G
17h CDN traffic=44G
18h CDN traffic=45G
19h CDN traffic=51G
20h CDN traffic=55G
21h CDN traffic=53G
22h CDN traffic=42G
23h CDN traffic=25G

Sample data and source code

http://git.oschina.net/whzhaochao/spark-learning
