An Enhanced Spark WordCount: Counting File Accesses in Logs
Posted by 赵侠客
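Original article: http://blog.csdn.net/whzhaochao/article/details/72416956
Preface
Learning the basic syntax of Scala and Spark can be dry, and working through a small practical task is an effective way to make the basics stick. An earlier post covered the most basic WordCount (http://blog.csdn.net/whzhaochao/article/details/72358215). This post builds a simple feature modeled on a real production need: analyzing CDN or nginx log files to compute PV, UV, IP addresses, referrers, and related figures. It is only meant as a practice exercise; real-world use would likely need something more elaborate.
Counting File Requests
Below is a sample of request logs from the Qiniu (七牛) CDN: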
223.93.159.226 HIT 203 [15/Feb/2017:11:14:35 +0800] "GET http://v-cdn.abc.com.cn/141035.mp4 HTTP/1.1" 206 5444007 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko Core/1.53.2141.400 QQBrowser/9.5.10219.400"
223.93.159.226 HIT 62 [15/Feb/2017:11:14:36 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4866645 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko Core/1.53.2141.400 QQBrowser/9.5.10219.400"
223.93.159.226 HIT 15 [15/Feb/2017:11:14:36 +0800] "GET http://v-cdn.abc.com.cn/141035.mp4 HTTP/1.1" 206 4854183 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko Core/1.53.2141.400 QQBrowser/9.5.10219.400"
223.93.159.226 HIT 91 [15/Feb/2017:11:14:36 +0800] "GET http://v-cdn.abc.com.cn/141032.mp4 HTTP/1.1" 206 4751957 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko Core/1.53.2141.400 QQBrowser/9.5.10219.400"
61.164.41.226 HIT 2537 [15/Feb/2017:11:13:54 +0800] "GET http://v-cdn.abc.com.cn/141035.mp4 HTTP/1.1" 200 5173432 "http://www.abc.com.cn/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
115.215.115.229 HIT 1 [15/Feb/2017:11:17:53 +0800] "GET http://v-cdn.abc.com.cn/videojs/video-js.css HTTP/1.1" 200 14382 "https://v.abc.com.cn/video/iframe/player.html?id=140976&autoPlay=1" "Mozilla/5.0 (Linux; android 5.1; M578CA Build/LMY47D; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043024 Safari/537.36 MicroMessenger/6.5.4.1000 NetType/WIFI Language/zh_CN"
115.215.115.229 HIT 1 [15/Feb/2017:11:17:53 +0800] "GET http://v-cdn.abc.com.cn/videojs/video.js HTTP/1.1" 200 173397 "https://v.abc.com.cn/video/iframe/player.html?id=140976&autoPlay=1" "Mozilla/5.0 (Linux; Android 5.1; M578CA Build/LMY47D; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043024 Safari/537.36 MicroMessenger/6.5.4.1000 NetType/WIFI Language/zh_CN"
115.236.173.95 HIT 1 [15/Feb/2017:11:17:49 +0800] "GET http://v-cdn.abc.com.cn/videojs/video-js.css HTTP/1.1" 200 14382 "http://v.abc.com.cn/video/iframe/player.html?id=139067&auto=1" "Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Mobile/13F69 QQ/6.6.9.412 V1_IPH_SQ_6.6.9_1_APP_A Pixel/1080 Core/UIWebView NetType/WIFI"
183.129.251.218 HIT 486 [15/Feb/2017:11:18:40 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4845881 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
115.236.161.52 HIT 34 [15/Feb/2017:11:17:13 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4976817 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
115.236.161.52 HIT 27 [15/Feb/2017:11:17:13 +0800] "GET http://v-cdn.abc.com.cn/141032.mp4 HTTP/1.1" 206 3859028 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
115.236.161.52 HIT 37 [15/Feb/2017:11:17:13 +0800] "GET http://v-cdn.abc.com.cn/141032.mp4 HTTP/1.1" 206 3859028 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
115.236.161.52 HIT 43 [15/Feb/2017:11:17:13 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4517997 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
115.236.161.52 HIT 19 [15/Feb/2017:11:17:13 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 5304429 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
115.228.161.136 HIT 1 [15/Feb/2017:11:16:51 +0800] "GET http://v-cdn.abc.com.cn/videojs/video-js.css HTTP/1.1" 200 14382 "https://v.abc.com.cn/video/iframe/player.html?id=140994&autoPlay=1" "Mozilla/5.0 (iPhone; CPU iPhone OS 10_2_1 like Mac OS X) AppleWebKit/602.4.6 (KHTML, like Gecko) Mobile/14D27 MicroMessenger/6.5.4 NetType/WIFI Language/zh_CN"
202.107.208.102 HIT 1226 [15/Feb/2017:11:19:10 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4517997 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
115.231.248.162 HIT 34 [15/Feb/2017:11:17:56 +0800] "GET http://v-cdn.abc.com.cn/141035.mp4 HTTP/1.1" 206 1208743 "http://www.abc.com.cn/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET4.0C; .NET4.0E; GWX:DOWNLOADED; GWX:RESERVED)"
221.234.216.142 HIT 744 [15/Feb/2017:11:17:09 +0800] "GET http://v-cdn.abc.com.cn/140995.mp4 HTTP/1.1" 206 4194896 "https://v.abc.com.cn/video/iframe/player.html?id=140995&autoPlay=1" "Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Mobile/12B411 MicroMessenger/6.3.31 NetType/WIFI Language/zh_CN"
183.132.170.2 HIT 1 [15/Feb/2017:11:15:22 +0800] "GET http://v-cdn.abc.com.cn/videojs/video-js.css HTTP/1.1" 200 14382 "https://v.abc.com.cn/video/iframe/player.html?id=140976&autoPlay=1" "Mozilla/5.0 (Linux; Android 6.0; HUAWEI MT7-CL00 Build/HuaweiMT7-CL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043024 Safari/537.36 MicroMessenger/6.5.4.1000 NetType/WIFI Language/zh_CN"
183.132.170.2 HIT 1 [15/Feb/2017:11:15:22 +0800] "GET http://v-cdn.abc.com.cn/videojs/video.js HTTP/1.1" 200 173397 "https://v.abc.com.cn/video/iframe/player.html?id=140976&autoPlay=1" "Mozilla/5.0 (Linux; Android 6.0; HUAWEI MT7-CL00 Build/HuaweiMT7-CL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043024 Safari/537.36 MicroMessenger/6.5.4.1000 NetType/WIFI Language/zh_CN"
112.17.240.97 HIT 1440 [15/Feb/2017:11:20:31 +0800] "GET http://v-cdn.abc.com.cn/140941.mp4 HTTP/1.1" 206 6284261 "https://v.abc.com.cn/video/iframe/player.html?id=140941&autoPlay=1" "Mozilla/5.0 (iPhone; CPU iPhone OS 8_2 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Mobile/12D508 (zjxw;3.5.1;iPhone6,2;8.2;zh;bianfeng;b541b2039c2c00c66c14c7fb7e26df19fccd9cf4)"
125.118.106.43 HIT 32 [15/Feb/2017:11:20:57 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 1637949 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
125.118.106.43 HIT 31 [15/Feb/2017:11:20:57 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 5042489 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
125.118.106.43 HIT 32 [15/Feb/2017:11:20:57 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4517997 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
125.118.106.43 HIT 40 [15/Feb/2017:11:20:57 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4911485 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
125.118.106.43 HIT 30 [15/Feb/2017:11:20:57 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4583601 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
60.190.59.200 HIT 1741 [15/Feb/2017:11:20:05 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 5173425 "http://www.abc.com.cn/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
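The log format is as follows:
client IP, cache hit or miss, response time, request time, request method, request URL, request protocol, status code, response size, referer, user agent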
ClientIP Hit/Miss ResponseTime [Time Zone] Method URL Protocol StatusCode TrafficSize Referer UserAgent
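The field layout above can be turned into a small parser. The following is a minimal sketch, not code from the original post: `CdnLogEntry`, `CdnLogParser`, and the regex are hypothetical names introduced here, and the pattern assumes every line follows exactly the layout of the sample lines (a quoted request, referer, and user agent with no embedded double quotes).

```scala
// Hypothetical record type for one log line; field names are assumptions.
case class CdnLogEntry(
  ip: String, hit: String, responseTime: Int, time: String,
  method: String, url: String, protocol: String,
  status: Int, size: Long, referer: String, userAgent: String)

object CdnLogParser {
  // One capture group per field, in the order the log format lists them.
  private val LinePattern =
    """^(\S+) (\S+) (\d+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+) "([^"]*)" "([^"]*)"$""".r

  // Returns None for any line that does not match the assumed layout.
  def parse(line: String): Option[CdnLogEntry] = line match {
    case LinePattern(ip, hit, rt, time, method, url, proto, status, size, referer, ua) =>
      Some(CdnLogEntry(ip, hit, rt.toInt, time, method, url, proto,
        status.toInt, size.toLong, referer, ua))
    case _ => None
  }
}
```

Parsing into a typed record is optional for the counts that follow, which pull out single fields with their own regexes, but it makes malformed lines easy to spot because they come back as `None`.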
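Counting Unique IPs
Approach
Counting unique IPs takes two steps:
1. Extract the IP address from each log line
2. Remove duplicate IPs to get the number of unique IPs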
Code for counting unique IPs
// Regex matching an IPv4 address
val IPPattern="((?:(?:25[0-5]|2[0-4]\\d|((1\\d{2})|([1-9]?\\d)))\\.){3}(?:25[0-5]|2[0-4]\\d|((1\\d{2})|([1-9]?\\d))))".r
// 1. Count requests per IP; the number of distinct keys is the number of unique IPs
val ipNums=input.flatMap(x=>IPPattern findFirstIn(x)).map(x=>(x,1)).reduceByKey((x,y)=>x+y).sortBy(_._2,false)
// Print the 10 IPs with the most requests
ipNums.take(10).foreach(println)
println("Unique IPs: "+ipNums.count())
How it works
- flatMap(x=>IPPattern findFirstIn(x)) extracts the IP address from each log line with the regex
- map(x=>(x,1)) maps each IP to (IP,1), forming a pair RDD
- reduceByKey((x,y)=>x+y) merges identical IPs, giving (IP, request count)
- sortBy(_._2,false) sorts by request count in descending order
Results
(114.55.227.102,9348)
(220.191.255.197,2640)
(115.236.173.94,2476)
(183.129.221.102,2187)
(112.53.73.66,1794)
(115.236.173.95,1650)
(220.191.254.129,1278)
(218.88.25.200,751)
(183.129.221.104,569)
(115.236.173.93,529)
Unique IPs: 43649
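The same steps can be tried without a cluster. The sketch below is an illustration on plain Scala collections, not the post's Spark code: `UniqueIpSketch`, `hitsPerIp`, and the simplified IPv4 pattern are assumptions made for the example, and `groupBy` plus `size` stands in for `map` followed by `reduceByKey`.

```scala
object UniqueIpSketch {
  // Simplified IPv4 pattern for illustration; it does not range-check each octet.
  private val IpPattern = """\b(?:\d{1,3}\.){3}\d{1,3}\b""".r

  // Returns (ip, hits) pairs sorted by hits, highest first.
  def hitsPerIp(lines: Seq[String]): Seq[(String, Int)] =
    lines
      .flatMap(line => IpPattern.findFirstIn(line)) // first IP on each line, if any
      .groupBy(identity)                            // plays the role of map + reduceByKey
      .map { case (ip, hits) => (ip, hits.size) }
      .toSeq
      .sortBy { case (_, n) => -n }
}
```

The length of the returned sequence is the number of unique IPs. On an RDD, when only that number is needed, calling `distinct().count()` on the extracted IPs should give it without computing the per-IP counts.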
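Counting Unique IPs per Video
Sometimes the site-wide number of unique IPs is not enough, and we also want the number of unique IPs that accessed each video.
Approach
The computation has three steps:
1. Keep only video requests and turn each log line into a (file name, IP address) pair
2. Group by file name, like GROUP BY in a database. The RDD now has the structure (file name, [IP1,IP1,IP2,…]), and the IPs still contain duplicates
3. Deduplicate the IPs under each file name. The RDD is now (file name, [IP1,IP2,…]) with no duplicate IPs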
Code
// Regex matching a video file name such as 141035.mp4
val fileNamePattern="([0-9]+)\\.mp4".r
// Turn a log line into (file name, IP)
def getFileNameAndIp(line:String)={
  (fileNamePattern.findFirstIn(line).mkString,IPPattern.findFirstIn(line).mkString)
}
// 2. Count unique IPs per video
input.filter(x=>x.matches(".*([0-9]+)\\.mp4.*")).map(x=>getFileNameAndIp(x)).groupByKey().map(x=>(x._1,x._2.toList.distinct)).
sortBy(_._2.size,false).take(10).foreach(x=>println("Video: "+x._1+" unique IPs: "+x._2.size))
How it works
- filter(x=>x.matches(".*([0-9]+)\\.mp4.*")) keeps only the video requests in the log
- map(x=>getFileNameAndIp(x)) turns each log line into a (file name, IP) pair
- groupByKey() groups by file name; the RDD is now (file name, [IP1,IP1,IP2,…]) and the IPs contain duplicates
- map(x=>(x._1,x._2.toList.distinct)) removes duplicate IPs from each value
- sortBy(_._2.size,false) sorts by number of unique IPs in descending order
Results
Video: 141081.mp4 unique IPs: 2393
Video: 140995.mp4 unique IPs: 2050
Video: 141027.mp4 unique IPs: 1784
Video: 141090.mp4 unique IPs: 1702
Video: 141032.mp4 unique IPs: 1528
Video: 89973.mp4 unique IPs: 1523
Video: 141080.mp4 unique IPs: 1425
Video: 141035.mp4 unique IPs: 1321
Video: 141082.mp4 unique IPs: 1272
Video: 140938.mp4 unique IPs: 816
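The per-video count can also be reached by deduplicating the (file name, IP) pairs first and then counting per file. The sketch below shows that variant on plain Scala collections; it is an illustration, not the post's code, and `UniqueIpPerVideoSketch`, `fileAndIp`, `uniqueIpsPerVideo`, and the simplified IPv4 pattern are all assumed names and choices.

```scala
object UniqueIpPerVideoSketch {
  private val FilePattern = """[0-9]+\.mp4""".r
  // Simplified IPv4 pattern for illustration; it does not range-check each octet.
  private val IpPattern = """\b(?:\d{1,3}\.){3}\d{1,3}\b""".r

  // Extracts (fileName, ip) when the line contains both; None otherwise.
  def fileAndIp(line: String): Option[(String, String)] =
    for {
      file <- FilePattern.findFirstIn(line)
      ip   <- IpPattern.findFirstIn(line)
    } yield (file, ip)

  // Returns (fileName, uniqueIpCount) pairs, highest count first.
  def uniqueIpsPerVideo(lines: Seq[String]): Seq[(String, Int)] =
    lines
      .flatMap(line => fileAndIp(line))   // non-video lines drop out here
      .distinct                           // one entry per (file, ip) pair
      .groupBy { case (file, _) => file }
      .map { case (file, pairs) => (file, pairs.size) }
      .toSeq
      .sortBy { case (_, n) => -n }
}
```

On an RDD the same ordering of steps, `distinct()` on the pairs before any per-key counting, would avoid building the full IP list for each file that `groupByKey()` followed by `toList.distinct` creates, which may matter for very popular videos.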
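Traffic for Each Hour of the Day
Sometimes I want to know how much video traffic the site serves each hour, to see what times of day users prefer to watch.
Approach
- Extract the access time and response size from each log line to form an RDD of (access time, size), dropping invalid requests such as 404s
- Group by access time to form an RDD of (access time, [size1,size2,…])
- Sum the sizes for each access time to get (access time, total size)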
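The group-and-sum step can be sketched on plain Scala collections. This is an illustration under assumptions, not the post's Spark code: `HourlyTrafficSketch`, `bytesPerHour`, and `toWholeGiB` are hypothetical names, and the input is taken to be (hour, bytes) pairs that have already been extracted and filtered.

```scala
object HourlyTrafficSketch {
  // Sums response sizes per hour; `records` stands in for an RDD of (hour, bytes).
  def bytesPerHour(records: Seq[(String, Long)]): Seq[(String, Long)] =
    records
      .groupBy { case (hour, _) => hour }
      .map { case (hour, rs) => (hour, rs.map(_._2).sum) }
      .toSeq
      .sortBy { case (hour, _) => hour }

  // Integer division, so the result is truncated to a whole number of GiB.
  def toWholeGiB(bytes: Long): Long = bytes / (1024L * 1024L * 1024L)
}
```

On a pair RDD, `reduceByKey(_ + _)` produces the same per-hour totals and combines values as it goes, so it does not need to hold every size for an hour in one collection the way grouping first does.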
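Code
import scala.util.matching.Regex
// In [15/Feb/2017:11:17:13 +0800] this captures 2017 and the hour 11, so traffic can be grouped by hour
val timePattern=".*(2017):([0-9]{2}):[0-9]{2}:[0-9]{2}.*".r
// Matches the HTTP status code and the response size
val httpSizePattern=".*\\s(200|206|304)\\s([0-9]+)\\s.*".r
// True when the whole string matches the pattern
def isMatch(pattern:Regex,str:String)={
  str match {
    case pattern(_*) => true
    case _ => false
  }
}
// 3. Traffic for each hour of the day
// getTimeAndSize is not shown in this post; it should return (hour, response size in bytes)
// collect() brings the sorted result to the driver so the hours print in order
input.filter(x=>isMatch(httpSizePattern,x)).filter(x=>isMatch(timePattern,x)).map(x=>getTimeAndSize(x)).groupByKey()
.map(x=>(x._1,x._2.sum)).sortByKey().collect().foreach(x=>println("Hour "+x._1+" CDN traffic="+x._2/(1024*1024*1024)+"G"))
How it works
- filter(x=>isMatch(httpSizePattern,x)).filter(x=>isMatch(timePattern,x)) filters out invalid requests
- map(x=>getTimeAndSize(x)) turns each log line into an RDD of (request hour, response size)
- groupByKey() groups by request hour to form an RDD of (request hour, [size1,size2,…])
- map(x=>(x._1,x._2.sum)) sums the sizes for each hour to form an RDD of (request hour, total size)
Results
Hour 00 CDN traffic=14G
Hour 01 CDN traffic=3G
Hour 02 CDN traffic=5G
Hour 03 CDN traffic=3G
Hour 04 CDN traffic=3G
Hour 05 CDN traffic=4G
Hour 06 CDN traffic=11G
Hour 07 CDN traffic=22G
Hour 08 CDN traffic=43G
Hour 09 CDN traffic=52G
Hour 10 CDN traffic=61G
Hour 11 CDN traffic=45G
Hour 12 CDN traffic=46G
Hour 13 CDN traffic=51G
Hour 14 CDN traffic=55G
Hour 15 CDN traffic=45G
Hour 16 CDN traffic=45G
Hour 17 CDN traffic=44G
Hour 18 CDN traffic=45G
Hour 19 CDN traffic=51G
Hour 20 CDN traffic=55G
Hour 21 CDN traffic=53G
Hour 22 CDN traffic=42G
Hour 23 CDN traffic=25G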
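The hourly traffic code calls `getTimeAndSize`, whose definition does not appear in this post. The following is one possible version, offered as an assumption rather than the author's code: it uses the same two patterns (written here as raw strings) and returns the hour together with the response size in bytes. The wrapper object `TimeAndSizeSketch` and the fallback values for non-matching lines are illustrative choices.

```scala
import scala.util.matching.Regex

object TimeAndSizeSketch {
  // Captures the year and the two-digit hour from the timestamp.
  val timePattern: Regex = """.*(2017):([0-9]{2}):[0-9]{2}:[0-9]{2}.*""".r
  // Captures the status code and the response size that follows it.
  val httpSizePattern: Regex = """.*\s(200|206|304)\s([0-9]+)\s.*""".r

  // Returns (hour, bytes); falls back to "" and 0 when a pattern does not match.
  def getTimeAndSize(line: String): (String, Long) = {
    val hour = line match {
      case timePattern(_, h) => h
      case _                 => ""
    }
    val size = line match {
      case httpSizePattern(_, s) => s.toLong
      case _                     => 0L
    }
    (hour, size)
  }
}
```

Because the pipeline filters with `isMatch` on both patterns before mapping, the fallback branches should not be reached there; they only keep the function total when it is called on its own. Returning the size as `Long` matters, since hourly totals in the tens of gigabytes would overflow an `Int`.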
Sample Data and Source Code