Separating column using separate (tidyr) via dplyr on a first encountered digit

Posted 2023-02-14


【Posted】: 2016-04-22 21:41:41 【Question】:

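I'm trying to separate a rather messy column into two columns, one containing the period and the other the description. My data resembles the extract below: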

set.seed(1)
dta <- data.frame(indicator=c("someindicator2001", "someindicator2011",
                              "some text 20022008", "another indicator 2003"),
                  values = runif(n = 4))

Desired results

The desired results should look like this:

          indicator   period    values
1     someindicator     2001 0.2655087
2     someindicator     2011 0.3721239
3         some text 20022008 0.5728534
4 another indicator     2003 0.9082078

Characteristics

Code

require(dplyr); require(tidyr); require(magrittr)
dta %<>%
  separate(col = indicator, into = c("indicator", "period"),
           sep = "^[^\\d]*(2+)", remove = TRUE)

This, of course, does not work:

> head(dta, 2)
  indicator period    values
1              001 0.2655087
2              011 0.3721239
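One way to see why: separate() discards whatever sep matches, and this pattern matches everything from the start of the string through the first run of 2s. A minimal base-R sketch, assuming sub() with perl = TRUE is a fair stand-in for how the separator is matched (the "|" marker is only there for illustration):

```r
# Same strings as in dta$indicator
indicator <- c("someindicator2001", "someindicator2011",
               "some text 20022008", "another indicator 2003")

# Replace whatever the attempted separator matches with a visible marker.
# Everything left of "|" would become `indicator`, everything right `period`.
marked <- sub("^[^\\d]*(2+)", "|", indicator, perl = TRUE)
print(marked)
```

Nothing is left in front of the marker, and the leading 2 of each year is swallowed by the separator, which is consistent with the empty indicator column and the "001" / "011" periods shown above.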

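Other attempts

I have also tried the default separation method, sep = "[^[:alnum:]]", but it breaks the column into too many columns, as it appears to match all of the available digits. sep = "2*" does not work either, as there are sometimes too many 2s (for example: 20032006).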
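A quick check in base R suggests the extra columns may come from the spaces rather than from the digits, since [^[:alnum:]] matches any single non-alphanumeric character. The sketch below uses strsplit() as an approximation of how the default separator splits each string; it is not tidyr's own code path:

```r
indicator <- c("someindicator2001", "someindicator2011",
               "some text 20022008", "another indicator 2003")

# Split on every non-alphanumeric character (here: the spaces)
pieces <- strsplit(indicator, "[^[:alnum:]]")
print(pieces)

# Number of pieces per row; rows containing spaces yield more than two
print(lengths(pieces))
```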

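What I'm trying to do boils down to:

Identifying the first digit in the string, then separating on that character. As a matter of fact, I would be happy to preserve that particular character as well.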
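Those two steps can be sketched in base R without any regex lookarounds. This is only an illustration under the assumption that every value contains at least one digit; the names first_digit, description and period are made up for the example:

```r
indicator <- c("someindicator2001", "someindicator2011",
               "some text 20022008", "another indicator 2003")

# Step 1: position of the first digit in each string
# (regexpr() returns -1 when there is no digit; not handled here)
first_digit <- as.integer(regexpr("[0-9]", indicator))

# Step 2: cut at that position, keeping the digit in the second part
description <- trimws(substr(indicator, 1, first_digit - 1))
period      <- substr(indicator, first_digit, nchar(indicator))

print(data.frame(description, period))
```

Because the cut happens at the digit's position rather than on a matched separator, the first digit is preserved in period.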

【Comments】:

【Answer 1】:

I think this might do it.

library(tidyr)
separate(dta, indicator, c("indicator", "period"), "(?<=[a-z]) ?(?=[0-9])")
#           indicator   period    values
# 1     someindicator     2001 0.2655087
# 2     someindicator     2011 0.3721239
# 3         some text 20022008 0.5728534
# 4 another indicator     2003 0.9082078

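Below is an explanation of the regular expression, brought to you by regex101.

(?<=[a-z]) is a positive lookbehind: it asserts that [a-z] (a single character in the range a to z, case sensitive) can be matched immediately before the current position.
" ?" matches the space character literally, between zero and one time, as many times as possible, giving back as needed (greedy).
(?=[0-9]) is a positive lookahead: it asserts that [0-9] (a single character in the range 0 to 9) can be matched immediately after the current position.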
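To see where this pattern matches, a small base-R sketch can replace the match with a visible marker. It assumes sub() with perl = TRUE interprets the lookarounds the same way separate() does, and the "|" marker is purely illustrative:

```r
indicator <- c("someindicator2001", "someindicator2011",
               "some text 20022008", "another indicator 2003")

# The lookbehind and lookahead consume no characters, so only the
# optional space (if present) is replaced by the marker.
marked <- sub("(?<=[a-z]) ?(?=[0-9])", "|", indicator, perl = TRUE)
print(marked)
```

The letters and digits on either side of the marker stay intact, which is why both the description and the full period survive the split.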

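【Comments】:

Thanks, this is great; it seems to match the results correctly, and thank you very much for the explanation. It occurred to me that solving this would probably involve lookbehind/lookahead, but I don't find them easy to use.

【Answer 2】:

You can also use unglue::unglue_unnest():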

dta <- data.frame(indicator=c("someindicator2001", "someindicator2011",
                              "some text 20022008", "another indicator 2003"),
                  values = runif(n = 4))

# remotes::install_github("moodymudskipper/unglue")
library(unglue)
unglue_unnest(dta, indicator, "{indicator}{=\\s*}{period=\\d*}")
#>       values         indicator   period
#> 1 0.43234262     someindicator     2001
#> 2 0.65890900     someindicator     2011
#> 3 0.93576805         some text 20022008
#> 4 0.01934736 another indicator     2003

Created on 2019-09-14 by the reprex package (v0.3.0)

【Comments】:
