k-means Clustering geolocated data using Spark/Scala

Posted: 2017-11-20 19:34:27

Question:

How can I handle geolocated data with the k-means clustering algorithm here? Could someone share their thoughts? Thanks in advance.

 Project_2_Dataset.txt file entries look like this 
 =================================================

            33.68947543 -117.5433083
            37.43210889 -121.4850296
            39.43789083 -120.9389785
            39.36351868 -119.4003347
            33.19135811 -116.4482426
            33.83435437 -117.3300009

    Please review my Code here:
    ============================         
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.clustering.KMeans
    val data = sc.textFile("Project_2_Dataset.txt")             
    val parsedData = data.map( line => Vectors.dense(line.split(',').map(_.toDouble)))
    val kmmodel = KMeans.train(parsedData, 3, 5)   // 3 clusters, 5 iterations
    17/06/17 13:12:20 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 2)
    java.lang.NumberFormatException: For input string: "33.68947543 -117.5433083"
            at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
            at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
            at java.lang.Double.parseDouble(Double.java:538)
            at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)

Thanks, Amit K

Comments:

Answer 1:

I think this is because you are trying to split each line on the character ',' instead of ' ':

@ "33.19135811 -116.4482426".toDouble 
java.lang.NumberFormatException: For input string: "33.19135811 -116.4482426"
  ...

@ "33.19135811 -116.4482426".split(',').map(_.toDouble) 
java.lang.NumberFormatException: For input string: "33.19135811 -116.4482426"
  ...

@ "33.19135811 -116.4482426".split(' ').map(_.toDouble) 
res3: Array[Double] = Array(33.19135811, -116.4482426)
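Putting that together, the fix to the original snippet is essentially a one-character change (a sketch, assuming the file layout shown in the question; splitting on runs of whitespace is slightly more robust than split(' ')):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.KMeans

// Split each line on runs of whitespace. The original code split on ',',
// which left the whole line as a single token that toDouble cannot parse.
def parseLine(line: String): Array[Double] =
  line.trim.split("\\s+").map(_.toDouble)

val data = sc.textFile("Project_2_Dataset.txt")
val parsedData = data.map(l => Vectors.dense(parseLine(l))).cache()
val kmmodel = KMeans.train(parsedData, 3, 5) // 3 clusters, 5 iterations
```

Caching parsedData is optional but avoids re-parsing the file on each of KMeans' iterations.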

Comments:

Hi Dnomyar, yes, it worked after following your suggestion to use ' ': "33.19135811 -116.4482426".split(' ').map(_.toDouble) gives res3: Array[Double] = Array(33.19135811, -116.4482426). I have another query: suppose we have a set of data with multiple columns, something like this:

    2014-03-15:10:10:20 Sorrento 8cc3b47e-bd01-4482-b500-28f2342679af 33.68947543 -117.5433083
    2014-03-15:10:10:20 MeeToo ef8c7564-0a1a-4650-A655-c8bbd5f8f943 37.43210889 -121.4850296
    2014-03-15:10:10:20 MeeToo 23eba027-b95a-4729-9a4b-a3cca51c5548 39.43789083 -120.9389785
    2014-03-15:10:10:20 Sorrento 707daba1-5640-4d60-a6d9-1d6fa0645be0 39.36351868 -119.4003347

What if I select only the latitude and longitude columns and apply the k-means model in Scala?

Maybe this can help you: ***.com/questions/6647166/…

Answer 2:
In the previous case we were able to apply the split to a single line of data ("33.19135811 -116.4482426".split(' ').map(_.toDouble)), but it seems that when we apply the same split across the whole dataset, I get this error:

                33.68947543 -117.5433083
                37.43210889 -121.4850296
                39.43789083 -120.9389785
                39.36351868 -119.4003347

    scala> val kmmodel= KMeans.train(parsedData,3,5)
    17/06/29 19:14:36 ERROR Executor: Exception in task 1.0 in stage 6.0 (TID 8)
    java.lang.NumberFormatException: empty String
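A NumberFormatException on an empty string usually means the file contains blank lines, or consecutive spaces that split(' ') turns into empty tokens. A minimal sketch of a parse step that guards against both (plain Scala, no Spark, so it can be tried in any REPL; the sample lines are hypothetical):

```scala
// Sample lines as they might appear in Project_2_Dataset.txt,
// including a blank line and doubled spaces between columns.
val lines = Seq(
  "33.68947543 -117.5433083",
  "",                               // blank line -> "empty String" on toDouble
  "  37.43210889  -121.4850296"     // extra spaces -> split(' ') yields "" tokens
)

val parsed = lines
  .map(_.trim)                           // drop leading/trailing whitespace
  .filter(_.nonEmpty)                    // skip blank lines entirely
  .map(_.split("\\s+").map(_.toDouble))  // split on runs of whitespace

// parsed now holds two clean Array[Double] rows
```

In Spark the same guards apply unchanged inside data.map/filter before building the Vectors.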

Comments:
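The follow-up comment above also asks how to keep only the latitude and longitude columns from a multi-column file. A hedged sketch, assuming whitespace-separated columns with latitude and longitude as the last two fields:

```scala
// One line of the multi-column format from the comment:
// timestamp device uuid latitude longitude
val line = "2014-03-15:10:10:20 Sorrento 8cc3b47e-bd01-4482-b500-28f2342679af 33.68947543 -117.5433083"

val fields = line.trim.split("\\s+")           // 5 whitespace-separated fields
val latLon = fields.takeRight(2).map(_.toDouble) // keep only the last two
// latLon: Array(33.68947543, -117.5433083)
```

In Spark this becomes data.map(l => Vectors.dense(l.trim.split("\\s+").takeRight(2).map(_.toDouble))) before calling KMeans.train.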
