How to add a column from a different data frame: Scala Frame
[Posted]: 2016-06-30 22:11:23
[Question]: How do I add/append a column from a different data frame? I am trying to find the percentile of each placeName that is rated 3 and above.
import org.apache.spark.sql.functions.{count, explode}
// sc: an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._ // enables the $"column" syntax
val df = sqlContext.read.json("temp.txt")
// df.show()
// One row per visited place.
val res = df.withColumn("visited", explode($"visited"))
// Total visits per customer and place.
val result1 = res
  .groupBy($"customerId", $"visited.placeName")
  .agg(count("*").alias("total"))
// Visits rated 3 and above per customer and place.
val result2 = res
  .filter($"visited.rating" >= 3)
  .groupBy($"customerId", $"visited.placeName")
  .agg(count("*").alias("top"))
result1.show()
result2.show()
// Outer join of the two aggregates.
val finalResult = result1.join(result2, result1("placeName") <=> result2("placeName") && result1("customerId") <=> result2("customerId"), "outer")
finalResult.show()
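result1 has rows with the total count, while result2 has the filtered count. What I am ultimately looking for is something like:
sqlContext.sql("select placeName, sum(top) / sum(total) * 100 as percentile from temp group by placeName")
But finalResult has duplicate placeName and customerId columns. Can someone tell me what I am doing wrong here? Also, is there a way to do this without a join?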
My data:
{
  "country": "France",
  "customerId": "France001",
  "visited": [
    {
      "placeName": "US",
      "rating": "2",
      "famousRest": "N/A",
      "placeId": "AVBS34"
    },
    {
      "placeName": "US",
      "rating": "3",
      "famousRest": "SeriousPie",
      "placeId": "VBSs34"
    },
    {
      "placeName": "Canada",
      "rating": "3",
      "famousRest": "TimHortons",
      "placeId": "AVBv4d"
    }
  ]
}
US top = 1 count = 3
Canada top = 1 count = 3
{
  "country": "Canada",
  "customerId": "Canada012",
  "visited": [
    {
      "placeName": "UK",
      "rating": "3",
      "famousRest": "N/A",
      "placeId": "XSdce2"
    }
  ]
}
UK top = 1 count = 1
{
  "country": "France",
  "customerId": "France001",
  "visited": [
    {
      "placeName": "US",
      "rating": "4.3",
      "famousRest": "N/A",
      "placeId": "AVBS34"
    },
    {
      "placeName": "US",
      "rating": "3.3",
      "famousRest": "SeriousPie",
      "placeId": "VBSs34"
    },
    {
      "placeName": "Canada",
      "rating": "4.3",
      "famousRest": "TimHortons",
      "placeId": "AVBv4d"
    }
  ]
}
US top = 2 count = 3
Canada top = 1 count = 3
Expected result:
PlaceName  percentile
US         (1+1+2)/(3+1+3) * 100
Canada     (1+1)/(3+3) * 100
UK         1 * 100
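To make the intended arithmetic concrete, here is a minimal plain-Scala sketch (no Spark involved) that evaluates the percentile expressions above exactly as written. The names `PercentileArithmetic`, `Terms` and `percentile` are hypothetical, and reading the bare "1" for UK as 1/1 is an assumption.

```scala
object PercentileArithmetic {
  // Numerator (top) and denominator (total) terms for one place,
  // copied from the expressions in the question.
  final case class Terms(tops: Seq[Int], totals: Seq[Int])

  val byPlace: Seq[(String, Terms)] = Seq(
    "US" -> Terms(Seq(1, 1, 2), Seq(3, 1, 3)),
    "Canada" -> Terms(Seq(1, 1), Seq(3, 3)),
    "UK" -> Terms(Seq(1), Seq(1)) // assumption: the bare "1" stands for 1/1
  )

  // Convert to Double before dividing; Int division would truncate 4 / 7 to 0.
  def percentile(tops: Seq[Int], totals: Seq[Int]): Double =
    tops.sum.toDouble / totals.sum * 100

  def main(args: Array[String]): Unit =
    byPlace.foreach { case (place, t) =>
      println(f"$place%-7s ${percentile(t.tops, t.totals)}%.2f")
    }
}
```

The point of the sketch is that the per-place result is the sum of the `top` values divided by the sum of the `total` values, not an average of per-row ratios.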
Schema:
root
 |-- country: string (nullable = true)
 |-- customerId: string (nullable = true)
 |-- visited: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- placeId: string (nullable = true)
 |    |    |-- placeName: string (nullable = true)
 |    |    |-- famousRest: string (nullable = true)
 |    |    |-- rating: string (nullable = true)
[Comments on the question]:
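[Answer 1]:
"Also, is there a way to do this without a join?"
No.
"Can someone tell me what I am doing wrong here?"
Nothing. If you don't need both copies of the join columns, join on the column names instead: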
result1.join(result2, List("placeName","customerId"), "outer")
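
As a hedged illustration of the call above, the following sketch joins two small hand-made frames on a sequence of column names. It assumes Spark 2.x or later with a local `SparkSession` (the question itself uses `SQLContext`); the object name `JoinOnColumnNames`, the helper `joinOnKeys` and the sample rows are hypothetical stand-ins, not the asker's data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinOnColumnNames {
  // Joining on column names (instead of column expressions) keeps a single
  // copy of each key column in the output.
  def joinOnKeys(result1: DataFrame, result2: DataFrame): DataFrame =
    result1.join(result2, Seq("placeName", "customerId"), "outer")

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for trying the join out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-on-column-names")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for result1 and result2 from the question.
    val result1 = Seq(("France001", "US", 2L), ("France001", "Canada", 1L))
      .toDF("customerId", "placeName", "total")
    val result2 = Seq(("France001", "US", 1L))
      .toDF("customerId", "placeName", "top")

    val joined = joinOnKeys(result1, result2)
    joined.printSchema() // placeName and customerId each appear once
    joined.show()

    spark.stop()
  }
}
```

One difference from the original join condition: joining on column names compares keys with ordinary equality rather than the null-safe `<=>`, so rows whose key is null will not match each other.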
[Discussion]:
I get: