How to de-identify specific words in text by list using Apache Spark?

Posted 2023-03-12


【Title】: How to de-identify specific words in the text by list using apache spark? 【Posted】: 2017-08-02 04:06:32 【Question】:

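I want to identify the sentences that contain specific words. As you will see in my code, I have defined some terms and some sentences. I want to print every sentence that contains any of the defined terms.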

Here is my code:

import scala.math.random
import org.apache.spark._

object Clasifying {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Classification")
      .setMaster("local")

    val sc = new SparkContext(conf)

    val terms = Array("this", "is", "my", "pen")

    val sentences = Array("this Date is mine",
                          "is there something",
                          "there are big dogs",
                          "The Date is mine",
                          "there may be something",
                          "where are pen",
                          "there is a dog",
                          "there are big cats",
                          "I am not able to to do it")

    val rdd = sc.parallelize(sentences) // create RDD
    val keys = terms.toSet              // words required as keys

    val result = rdd.flatMap { sen =>
        val words = sen.split(" ").toSet
        val common = keys & words          // intersect
        common.map(x => (x, sen))          // map as key -> sen
      }
      .groupByKey.mapValues(_.toArray)     // group values for a key
      .collect

    println("*********************************")
    result.foreach(println)
    println("*********************************")
    sc.stop()
  }
}

My code gives this result:

*********************************
(pen,[Ljava.lang.String;@4cc76301)
(this,[Ljava.lang.String;@2f08c4b)
(is,[Ljava.lang.String;@3f19b8b3)
*********************************

whereas I want a result like this:

 *********************************
 this, is,(this Date is mine)
 is,(is there something)
 is,(the Date is mine)
 is,(is there something)
 pen,where are pen)
 *********************************
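The `[Ljava.lang.String;@...` entries in the actual output appear because `println` on a JVM array prints its type and hash code rather than its elements. A minimal sketch of one way to get readable lines, using a plain Scala array as a stand-in for the collected RDD result; the `PrintGrouped` object, the `format` helper, and the sample data are illustrative assumptions, not part of the original code:

```scala
object PrintGrouped {
  // Hypothetical helper: render one (term, sentences) pair as readable text.
  def format(term: String, sentences: Array[String]): String =
    sentences.map(s => s"($s)").mkString(s"$term, ", ", ", "")

  def main(args: Array[String]): Unit = {
    // Stand-in for the array returned by `.collect` in the question.
    val result = Array(
      ("pen", Array("where are pen")),
      ("is", Array("this Date is mine", "is there something"))
    )

    // Printing the tuple directly falls back to Array's default toString,
    // which is the JVM type name plus a hash code.
    println(result(0))

    // Formatting the elements explicitly gives readable output.
    result.foreach { case (term, sens) => println(format(term, sens)) }
  }
}
```

In the Spark job itself, the same effect could probably be had by replacing `.mapValues(_.toArray)` with `.mapValues(_.toList)` or `.mapValues(_.mkString(", "))`, since Scala lists and strings print their contents.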

Thanks in advance. I am new to Spark and Stack Overflow, so please forgive my mistakes and feel free to edit my question.

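One more thing: instead of defining the terms and sentences inline, what if I use a real terms.txt file for the terms and a document.txt file for the sentences? What would the code look like in that situation?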
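A sketch of how the file-based variant might look, assuming `terms.txt` holds one term per line, `document.txt` holds one sentence per line, and the term list is small enough to collect on the driver. The object name `ClassifyFromFiles` and the helper `matchedTerms` are hypothetical names introduced for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ClassifyFromFiles {
  // Hypothetical helper: the subset of `terms` that occur as whole words in `sentence`.
  def matchedTerms(terms: Set[String], sentence: String): Set[String] =
    terms & sentence.split("\\s+").toSet

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ClassificationFromFiles").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Assumption: one term per line; blank lines are ignored.
    val terms = sc.textFile("terms.txt").map(_.trim).filter(_.nonEmpty).collect().toSet
    // Ship the (small) term set to every executor once.
    val termsBc = sc.broadcast(terms)

    // Assumption: one sentence per line.
    val result = sc.textFile("document.txt")
      .flatMap(sen => matchedTerms(termsBc.value, sen).map(t => (t, sen)))
      .groupByKey()
      .mapValues(_.toList) // a List prints its elements, unlike an Array
      .collect()

    result.foreach { case (term, sens) => println(s"$term -> ${sens.mkString("; ")}") }
    sc.stop()
  }
}
```

The matching is case-sensitive and splits on whitespace only, as in the original code, so "The" and "the" count as different words.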

【Comments】:

【Answer 1】:

This depends mainly on the size of the documents and the size of the word list.

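If the complete word list fits in memory, and each complete document fits in a container, you can do this easily with a map-only UDF. If not, you can first collect all the words in each document and then join them with your word list to "anonymize" the words.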
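The two options could be sketched roughly as follows. This uses plain RDD operations rather than a DataFrame UDF, which is a substitution on my part; the `Mask` placeholder, the object and function names, and the sample data are all assumptions made for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DeIdentify {
  // Assumed replacement token for a de-identified word.
  val Mask = "***"

  // Option 1 (map-only): the whole word list is available in memory.
  def maskLine(sensitive: Set[String], line: String): String =
    line.split(" ").map(w => if (sensitive.contains(w.toLowerCase)) Mask else w).mkString(" ")

  // Option 2 (join): explode each document into words, join against the
  // word list, then reassemble each document in its original word order.
  def maskByJoin(docs: RDD[(Long, String)], words: RDD[String]): RDD[(Long, String)] = {
    val tokens = docs.flatMap { case (id, line) =>
      line.split(" ").zipWithIndex.map { case (w, pos) => (w.toLowerCase, (id, pos, w)) }
    }
    val flagged = words.map(w => (w.toLowerCase, true)).distinct()
    tokens.leftOuterJoin(flagged)
      .map { case (_, ((id, pos, w), hit)) => (id, (pos, if (hit.isDefined) Mask else w)) }
      .groupByKey()
      .mapValues(_.toSeq.sortBy(_._1).map(_._2).mkString(" "))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DeIdentify").setMaster("local[*]"))

    // Sample data; each document gets a numeric id so it can be reassembled.
    val docs = sc.parallelize(Seq("this Date is mine", "where are pen"))
      .zipWithIndex().map(_.swap)
    val words = sc.parallelize(Seq("date", "pen"))

    // Option 1: broadcast the word list and mask inside a single map step.
    val wordsBc = sc.broadcast(words.collect().map(_.toLowerCase).toSet)
    docs.mapValues(maskLine(wordsBc.value, _)).collect().foreach(println)

    // Option 2: no collect of the word list; Spark shuffles both sides instead.
    maskByJoin(docs, words).collect().foreach(println)

    sc.stop()
  }
}
```

The broadcast version avoids a shuffle and is usually the simpler choice when the word list is small; the join version trades a shuffle for not having to hold the word list in memory on every executor.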

Be careful not to burn yourself :D

【Comments】:

Both documents will be quite large.
