Two Classic Spark SQL Practice Problems You Must Know!
Posted by 郎er
Spark SQL Practice Problems
Problem 1: There are 500,000 shops on JD.com. Every time a visitor views any product in any shop, one visit-log record is generated. The visit log is stored in a table named Visit, where the visitor's user id is user_id and the visited shop's name is shop. Compute:
1) The UV (number of unique visitors) of each shop
2) The top-3 visitors of each shop by visit count, outputting the shop name, visitor id, and visit count
Sample data (jd_visit.log):
u1 a
u2 b
u1 b
u1 a
u3 c
u4 b
u1 a
u2 c
u5 b
u4 b
u6 c
u2 c
u1 b
u2 a
u2 a
u3 a
u5 a
u6 a
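To sanity-check part 1 against this sample: shop a is visited by u1, u2, u3, u5, and u6, so its UV is 5; shop b by u1, u2, u4, and u5 (UV 4); and shop c by u2, u3, and u6 (UV 3). These match the query results shown below.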
Answer:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

def main(args: Array[String]): Unit = {
val spark: SparkSession =
SparkSession.builder()
.appName("test1")
.master("local[*]")
.getOrCreate()
import spark.implicits._
// Read the raw log, one "userId shop" pair per line
val list4: RDD[String] = spark.sparkContext.textFile("F:\\学习资料\\8 大数据\\大数据预热视频\\8.Spark\\jd_visit.log")
// Split each line and convert it into a two-column DataFrame
val jdvisit: DataFrame = list4.map(_.split(" ")).map(t => (t(0), t(1))).toDF("userId", "shop")
jdvisit.createOrReplaceTempView("visit")
// 1) UV (number of unique visitors) of each shop
// Option 1: deduplicate (shop, userId) pairs with distinct, then count per shop
val k =
"""
|select shop,count(*)
|from
|(select distinct shop,userId
|from visit) k
|group by k.shop
|""".stripMargin
// Option 2: deduplicate with group by, then count per shop
val v =
"""
|select shop,count(*)
|from (
|select shop,userId
|from visit
|group by shop,userId) k
|group by k.shop
|""".stripMargin
// Option 3: deduplicate with row_number over (userId, shop), then count per shop
val sql1 =
"""
|select v2.shop,count(*) count
|from (
|select v1.userId,v1.shop
|from
|(
|select
| userId,
| shop,
| row_number() over(partition by userId,shop order by 1) as d
|from visit
|)v1 where v1.d = 1
|) v2
|group by v2.shop
|""".stripMargin
spark.sql(sql1).show()
// Result:
+----+-----+
|shop|count|
+----+-----+
| c| 3|
| b| 4|
| a| 5|
+----+-----+
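// For reference, the same UV statistic can also be computed with the DataFrame API
// instead of SQL. A minimal sketch, assuming the jdvisit DataFrame built above:
import org.apache.spark.sql.functions.countDistinct
jdvisit.groupBy("shop")
.agg(countDistinct("userId").as("uv")) // deduplicates visitors within each shop
.show()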
// 2) Top-3 visitors of each shop by visit count: shop name, visitor id, visit count
val sql =
"""
|select shop,userId,count
|from(
|select shop,userId,count(*) count,row_number()over(partition by shop order by count(*) desc) as k
|from visit
|group by shop,userId) w
|where k <4
|""".stripMargin
spark.sql(sql).show()
// Result:
+----+------+-----+
|shop|userId|count|
+----+------+-----+
| c| u2| 2|
| c| u6| 1|
| c| u3| 1|
| b| u4| 2|
| b| u1| 2|
| b| u2| 1|
| a| u1| 3|
| a| u2| 2|
| a| u5| 1|
+----+------+-----+
}
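Note that row_number breaks ties arbitrarily: shop a's third place is a three-way tie among u3, u5, and u6 (one visit each), and shop b's third place is tied between u2 and u5, so which tied visitor appears in the output is not deterministic (use rank or dense_rank if all tied visitors should be kept). For reference, a DataFrame-API sketch of the same top-3 query, assuming the jdvisit DataFrame from above:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, desc, row_number}
// Count visits per (shop, userId), then rank visitors within each shop
val visitCounts = jdvisit.groupBy("shop", "userId").agg(count("userId").as("cnt"))
val byShop = Window.partitionBy("shop").orderBy(desc("cnt"))
visitCounts.withColumn("rk", row_number().over(byShop))
.where(col("rk") < 4)
.select("shop", "userId", "cnt")
.show()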
Problem 2: We have the following user visit data:
userID visitDate visitCount
u01 2017/1/21 5
u02 2017/1/23 6
u03 2017/1/22 8
u04 2017/1/20 3
u01 2017/1/23 6
u01 2017/2/21 8
u02 2017/1/23 6
u01 2017/2/22 4
Use SQL to compute each user's cumulative visit count, as shown below:
userId month subtotal cumulative
u01 2017-01 11 11
u01 2017-02 12 23
u02 2017-01 12 12
u03 2017-01 8 8
u04 2017-01 3 3
Explanation: the subtotal is the visit count within a single month; the cumulative count adds each month's subtotal onto the running total of the earlier months.
Write the result into a MySQL table; design the table schema yourself.
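One possible table design (a sketch: the table name user_count and the column names dt, sum, and sum1 simply match the query output written by the code below; Spark's JDBC writer can also create the table automatically if it does not exist):
CREATE TABLE user_count (
userId VARCHAR(20),
dt     CHAR(7),    -- month, e.g. '2017-01'
`sum`  BIGINT,     -- monthly subtotal
sum1   BIGINT      -- cumulative total
);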
Answer:
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

def main(args: Array[String]): Unit = {
val spark: SparkSession =
SparkSession.builder()
.appName("test1")
.master("local[*]")
.getOrCreate()
import spark.implicits._
// The data is typed inline here instead of being read from a file
val list111: DataFrame = List(("u01", "2017-1-21", 5), ("u02", "2017-1-23", 6), ("u03", "2017-1-22", 8), ("u04", "2017-1-20", 3),
("u01", "2017-1-23", 6), ("u01", "2017-2-21", 8), ("u02", "2017-1-23", 6), ("u01", "2017-2-22", 4)).toDF("userId", "visitDate", "visitCount")
list111.createOrReplaceTempView("visit")
val sql =
"""
|select t1.userId ,dt ,sum ,sum(sum) over(partition by userid order by dt ) as sum1
|from (
|select userId,
|Date_format(visitDate, 'yyyy-MM') dt,
|sum(visitCount) sum
|from visit
|group by userId,Date_format(visitDate,'yyyy-MM')
|) t1
|""".stripMargin
// JDBC connection properties; assumes the MySQL driver (mysql-connector-java) is on the classpath
val p = new Properties()
p.put("user","root")
p.put("password","123456")
// Write the result into the user_count table in MySQL
spark.sql(sql).write.mode(SaveMode.Append).jdbc("jdbc:mysql://hadoop10:3306/spark","user_count",p)
}
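The same cumulative total can also be computed with the DataFrame API via a window sum. A minimal sketch, assuming the list111 DataFrame from above:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, date_format, sum}
// Monthly subtotal per user, then a running sum ordered by month
val monthly = list111
.withColumn("dt", date_format(col("visitDate"), "yyyy-MM"))
.groupBy("userId", "dt")
.agg(sum("visitCount").as("subtotal"))
val byUser = Window.partitionBy("userId").orderBy("dt")
monthly.withColumn("cumulative", sum("subtotal").over(byUser)).show()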