How to add a column from a different data frame: Scala Frame

Posted 2023-04-17


【Title】: How to add a column from different data frame : Scala Frame 【Posted】: 2016-06-30 22:11:23 【Question】:

How do I add/append a column from a different data frame? I am trying to find the percentile of each placeName that has a rating of 3 and above.

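import org.apache.spark.sql.functions.{count, explode}

// sc: an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._ // enables the $"column" syntax

val df = sqlContext.jsonFile("temp.txt")
// df.show()

// One row per element of the visited array.
val res = df.withColumn("visited", explode($"visited"))

// All visits per customer and place.
val result1 = res
  .groupBy($"customerId", $"visited.placeName")
  .agg(count("*").alias("total"))

// Filtered visits per customer and place.
val result2 = res
  .filter($"visited.rating" < 4)
  .groupBy($"customerId", $"visited.placeName")
  .agg(count("*").alias("top"))

result1.show()
result2.show()

// Keep the DataFrame in finalResult; show() itself returns Unit.
val finalResult = result1.join(
  result2,
  result1("placeName") <=> result2("placeName") && result1("customerId") <=> result2("customerId"),
  "outer")
finalResult.show()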

The rows of result1 hold the totals, while result2 holds the totals after filtering. Now I am trying to do:

 sqlContext.sql("select top/total as percentile from temp groupBy placeName")

But finalResult has duplicate placeName and customerId columns. Can someone tell me what I am doing wrong here? Also, is there a way to do this without a join?

My data:

 
{
    "country": "France",
    "customerId": "France001",
    "visited": [
        {
            "placeName": "US",
            "rating": "2",
            "famousRest": "N/A",
            "placeId": "AVBS34"
        },
        {
            "placeName": "US",
            "rating": "3",
            "famousRest": "SeriousPie",
            "placeId": "VBSs34"
        },
        {
            "placeName": "Canada",
            "rating": "3",
            "famousRest": "TimHortons",
            "placeId": "AVBv4d"
        }
    ]
}


US top = 1 count = 3
Canada top = 1 count = 3



{
    "country": "Canada",
    "customerId": "Canada012",
    "visited": [
        {
            "placeName": "UK",
            "rating": "3",
            "famousRest": "N/A",
            "placeId": "XSdce2"
        }
    ]
}

UK top = 1 count = 1



{
    "country": "France",
    "customerId": "France001",
    "visited": [
        {
            "placeName": "US",
            "rating": "4.3",
            "famousRest": "N/A",
            "placeId": "AVBS34"
        },
        {
            "placeName": "US",
            "rating": "3.3",
            "famousRest": "SeriousPie",
            "placeId": "VBSs34"
        },
        {
            "placeName": "Canada",
            "rating": "4.3",
            "famousRest": "TimHortons",
            "placeId": "AVBv4d"
        }
    ]
}

US top = 2 count = 3
Canada top = 1 count = 3

PlaceName   percentile
US          (1+1+2)/(3+1+3) * 100
Canada      (1+1)/(3+3) * 100
UK          1 * 100
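To make the top/total ratio concrete, here is a minimal plain-Scala sketch (no Spark) over the visits in the three sample records. It assumes that "top" means visits rated 3 or higher and "total" means all visits to that place. That is only one reading of the question, so its figures need not match the hand calculation above. The `Visit` case class and the `percentByPlace` helper are hypothetical names introduced for illustration.

```scala
// Hypothetical record type mirroring the fields used in the question.
final case class Visit(customerId: String, placeName: String, rating: Double)

object PercentByPlace {
  // The visits from the three sample records, flattened one per row
  // (the shape that explode produces in the Spark code).
  val visits: Seq[Visit] = Seq(
    Visit("France001", "US", 2.0),
    Visit("France001", "US", 3.0),
    Visit("France001", "Canada", 3.0),
    Visit("Canada012", "UK", 3.0),
    Visit("France001", "US", 4.3),
    Visit("France001", "US", 3.3),
    Visit("France001", "Canada", 4.3)
  )

  // Assumption: top = visits rated >= threshold, total = all visits to the place.
  def percentByPlace(vs: Seq[Visit], threshold: Double = 3.0): Map[String, Double] =
    vs.groupBy(_.placeName).map { case (place, group) =>
      val total = group.size
      val top = group.count(_.rating >= threshold)
      place -> 100.0 * top / total
    }

  def main(args: Array[String]): Unit =
    percentByPlace(visits).toSeq.sortBy(_._1).foreach { case (place, pct) =>
      println(s"$place: $pct")
    }
}
```

Under these assumptions US comes out at 75.0 (3 of 4 visits), and Canada and UK at 100.0.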

Schema:

root
|-- country: string (nullable = true)
|-- customerId: string (nullable = true)
|-- visited: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |   |-- placeId: string (nullable = true)
|    |   |-- placeName: string (nullable = true) 
|    |   |-- famousRest: string (nullable = true)
|    |   |-- rating: string (nullable = true)

【Question comments】:

【Answer 1】:

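Also, is there a way to do this without a join?

No.

Can someone tell me what I am doing wrong here?

Nothing. If you don't need both copies of the join columns, use: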

result1.join(result2, List("placeName","customerId"), "outer")

【Comments】:

I get:
:43: error: type mismatch; found: List[String] required: org.apache.spark.sql.Column
I am using Spark 1.5.
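The type mismatch suggests that the three-argument `join(right, usingColumns, joinType)` overload is not available in Spark 1.5; as far as I know it was added in 1.6. A possible workaround on 1.5 is to keep the Column-based outer join and then select a single copy of each key explicitly. The sketch below assumes the column names from the question; `OuterJoinSketch`, `joinKeepingOneKey`, and the rows in `main` are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.{coalesce, col}

object OuterJoinSketch {
  // Outer-joins on (customerId, placeName) and keeps one copy of each key.
  // Assumes result1 has a "total" column and result2 has a "top" column.
  def joinKeepingOneKey(result1: DataFrame, result2: DataFrame): DataFrame = {
    val a = result1.as("a")
    val b = result2.as("b")
    a.join(
        b,
        col("a.customerId") <=> col("b.customerId") &&
          col("a.placeName") <=> col("b.placeName"),
        "outer")
      // coalesce picks whichever side is non-null, so rows present on
      // only one side of the outer join still carry their key values.
      .select(
        coalesce(col("a.customerId"), col("b.customerId")).alias("customerId"),
        coalesce(col("a.placeName"), col("b.placeName")).alias("placeName"),
        col("a.total"),
        col("b.top"))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("outer-join-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Made-up stand-ins for result1 and result2 from the question.
    val result1 = Seq(("France001", "US", 4L), ("France001", "Canada", 2L))
      .toDF("customerId", "placeName", "total")
    val result2 = Seq(("France001", "US", 3L))
      .toDF("customerId", "placeName", "top")

    joinKeepingOneKey(result1, result2).show()
    sc.stop()
  }
}
```

The result has exactly four columns (customerId, placeName, total, top), so a later `top / total` expression no longer runs into ambiguous column names. Rows that exist only in result1 get a null `top`.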
