How to extract sentences between point and brackets with R?

Posted: 2019-03-01 11:02:11

【Question】:

I have:

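Stringa="This is different from primary data created specifically by researchers to reflect concepts that are higher-order and more abstract(Lee,1991;Walsham,1995). Given the major differences between big data and research-collected data, it is surprising how little discussion has arisen about how using big data should change the practice of theory-informed IS research. Some scholars have noted that the very nature of inquiry is likely to change, given that large data sets, advanced algorithms, and powerful computing capabilities can initiate and refine questions without human intervention (Agarwal & Dhar, 2014). Other commentators argue that the scientific method is likely to become obsolete, as with the “availability of huge amounts of data, along with the statistical tools to crunch these numbers … science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson, 2008). Perhaps “scientists no longer have to make educated guesses, construct hypotheses and models, test them in data-based experiments andexamples. Instead, they canmine thecomplete setof data forpatterns that reveal effects, producing scientificconclusions without further experimentation” (Prensky, 2009). "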

Desired output:

[1]This is different from primary data created specifically by researchers to reflect concepts that are higher-order and more abstract(Lee,1991;Walsham,1995).
[2]Some scholars have noted that the very nature of inquiry is likely to change, given that large data sets, advanced algorithms, and powerful computing capabilities can initiate and refine questions without human intervention (Agarwal & Dhar, 2014)
[3] Other commentators argue that the scientific method is likely to become obsolete, as with the “availability of huge amounts of data, along with the statistical tools to crunch these numbers … science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson, 2008)
[4]Instead, they canmine thecomplete setof data forpatterns that reveal effects, producing scientific conclusions without further experimentation” (Prensky, 2009)

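I used unlist(str_extract_all(string = Stringa, pattern = "\\. [A-Za-z][^()]+ \\(")), but it does not work.

I do not want to extract the sentences 'Given the major differences between big data and research-collected data, it is surprising how little discussion has arisen about how using big data should change the practice of theory-informed IS research.' and 'Perhaps “scientists no longer have to make educated guesses, construct hypotheses and models, test them in data-based experiments andexamples.'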


【Answer 1】:

If the text contains no abbreviations (that is, no periods inside a sentence), you can use

regmatches(Stringa, gregexpr("[^.?!\\s][^.!?]*?\\([^()]*\\)", Stringa, perl=TRUE))
[[1]]
[1] "This is different from primary data created specifically by researchers to reflect concepts that are higher-order and more abstract(Lee,1991;Walsham,1995)"                                                                                                                                                                         
[2] "Some scholars have noted that the very nature of inquiry is likely to change, given that large data sets, advanced algorithms, and powerful computing capabilities can initiate and refine questions without human intervention (Agarwal & Dhar, 2014)"                                                                           
[3] "Other commentators argue that the scientific method is likely to become obsolete, as with the “availability of huge amounts of data, along with the statistical tools to crunch these numbers … science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson, 2008)"
[4] "Instead, they canmine thecomplete setof data forpatterns that reveal effects, producing scientificconclusions without further experimentation” (Prensky, 2009)"                                                                                                                                                                    

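See the regex demo and the R demo.

Details

[^.?!\\s] - any character other than ., ?, ! and whitespace
[^.!?]*? - zero or more characters other than ., ! and ?, as few as possible
\\([^()]*\\) - a (, then zero or more characters other than ( and ), then a )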
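To see how the three parts interact, here is a minimal sketch on a short made-up text. The names `extract_cited`, `txt` and `abbr` are hypothetical and not part of the answer. The second call illustrates the abbreviation caveat: because the lazy middle part cannot cross a period, a sentence containing "et al." is cut at that period.

```r
# Illustrative sketch; `extract_cited`, `txt` and `abbr` are hypothetical names.
extract_cited <- function(x) {
  # First character: not ., ?, ! or whitespace; then a lazy run without
  # sentence-ending punctuation; then one (...) group without nested parentheses.
  pattern <- "[^.?!\\s][^.!?]*?\\([^()]*\\)"
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

txt <- "First claim here (Lee, 1991). No citation in this one. Second claim follows (Dhar, 2014)."
res <- extract_cited(txt)
print(res)
# expected elements: "First claim here (Lee, 1991)" and "Second claim follows (Dhar, 2014)"

# Caveat: a period inside the sentence truncates the match.
abbr <- "Smith et al. argue otherwise (Smith, 2010)."
res_abbr <- extract_cited(abbr)
print(res_abbr)
# expected element: "argue otherwise (Smith, 2010)"
```

The sentence without a citation is skipped because every candidate start inside it runs into a period before reaching an opening parenthesis.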


【Answer 2】:

We can handle this with gregexpr and regmatches, using the following regex pattern:

.*?\([^)]+\).*?(?=\w|$)

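This captures everything up to the first opening parenthesis, followed by the (...) term, plus any trailing punctuation and whitespace up to the next word character or the end of the string. The script below captures all such matches in the source text, where x holds the input string (Stringa in the question).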

m <- gregexpr(".*?\\([^)]+\\).*?(?=\\w|$)", x, perl=TRUE)
regmatches(x, m)

[[1]]
[1] "This is different from primary data created specifically by researchers to reflect concepts that are higher-order and more abstract(Lee,1991;Walsham,1995)."                                                                                                                                                                                                                                                                                                              
[2] "Given the major differences between big data and research-collected data, it is surprising how little discussion has arisen about how using big data should change the practice of theory-informed IS research. Some scholars have noted that the very nature of inquiry is likely to change, given that large data sets, advanced algorithms, and powerful computing capabilities can initiate and refine questions without human intervention (Agarwal & Dhar, 2014). "
[3] "Other commentators argue that the scientific method is likely to become obsolete, as with the “availability of huge amounts of data, along with the statistical tools to crunch these numbers … science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson, 2008). "
[4] "Perhaps “scientists no longer have to make educated guesses, construct hypotheses and models, test them in data-based experiments andexamples. Instead, they canmine thecomplete setof data forpatterns that reveal effects, producing scientificconclusions without further experimentation”(Prensky, 2009). "
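As a rough check of how this pattern behaves, the sketch below runs it on a short made-up text (`txt` and `res2` are hypothetical names) and strips the trailing whitespace with trimws. It suggests that, as in result [2] above, a sentence without a citation is carried along with the next cited sentence rather than dropped.

```r
# Illustrative sketch; `txt` and `res2` are hypothetical names.
txt <- "First claim here (Lee, 1991). No citation in this one. Second claim follows (Dhar, 2014)."

# Lazy prefix up to the first "(", the parenthesised group, then a lazy tail
# that stops right before the next word character or at the end of the string.
m <- gregexpr(".*?\\([^)]+\\).*?(?=\\w|$)", txt, perl = TRUE)

# trimws() removes the whitespace the lazy tail leaves at the end of each match.
res2 <- trimws(regmatches(txt, m)[[1]])
print(res2)
# expected elements:
#   "First claim here (Lee, 1991)."
#   "No citation in this one. Second claim follows (Dhar, 2014)."
```

If the citation-free sentences must be excluded, a pattern whose prefix cannot cross sentence-ending punctuation, such as the one in Answer 1, is the better fit.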

