gather function in R to match patterns in character strings

Posted 2023-02-19


【Asked】: 2020-06-22 21:35:15

【Question】:

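I want to reshape a wide table into a long one with gather. The columns I want to gather follow a pattern. Right now I can only gather them by position. How can I change this so that they are gathered by a pattern in the column names? Please use only the gather function.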

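I have included a sample dataset, but the actual dataset has many more columns. So I want to gather all columns whose names start with f or m, followed by one OR two digits.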

dput(head(test1, 1))
structure(list(startdate = "2019-11-06", id = "POL55", m0_9 = NA_real_,
    m10_19 = NA_real_, m20_29 = NA_real_, m30_39 = NA_real_,
    m40_49 = 32, m50_59 = NA_real_, m60_69 = NA_real_, m70 = NA_real_,
    f0_9 = 32, f10_19 = NA_real_, f20_29 = NA_real_, f30_39 = NA_real_,
    f40_49 = NA_real_, f50_59 = NA_real_, f60_69 = NA_real_,
    f70 = NA_real_), row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))

df_age2 <- test1 %>%
  gather(age_cat, count, m0_9:f70)
df_age2

Expected output (there will be more columns that are not gathered). count should of course hold the counts...

 startdate  id    age_cat count
   <chr>      <chr> <chr>      <dbl>
 1 2019-11-06 POL55 m0_9          NA
 2 2019-11-06 POL56 m0_9          NA
 3 2019-11-06 POL57 m0_9          NA
 4 2019-11-06 POL58 m0_9          NA
 5 2019-11-06 POL59 m0_9          NA
 6 2019-11-06 POL60 m0_9          NA
 7 2019-11-06 POL61 m0_9          NA
 8 2019-11-06 POL62 m0_9          NA
 9 2019-11-06 POL63 m0_9          NA
10 2019-11-06 POL64 m0_9          NA

【Comments on the question】:

【Answer 1】:

We can use pivot_longer from tidyr:

 library(dplyr)
 library(tidyr)
 test1 %>% 
    pivot_longer(cols = -c(startdate, id), names_to = c('.value', 'grp'), names_sep="_")

Alternatively, with names_pattern:

test1 %>% 
  pivot_longer(cols = -c(startdate, id),
      names_to = c( '.value', 'grp'), names_pattern = "^([a-z])(.*)")
# A tibble: 8 x 5
#  startdate  id    grp       m     f
#  <chr>      <chr> <chr> <dbl> <dbl>
#1 2019-11-06 POL55 0_9      NA    32
#2 2019-11-06 POL55 10_19    NA    NA
#3 2019-11-06 POL55 20_29    NA    NA
#4 2019-11-06 POL55 30_39    NA    NA
#5 2019-11-06 POL55 40_49    32    NA
#6 2019-11-06 POL55 50_59    NA    NA
#7 2019-11-06 POL55 60_69    NA    NA
#8 2019-11-06 POL55 70       NA    NA

Or perhaps, with the roles of the two capture groups swapped:

test1 %>% 
  pivot_longer(cols = -c(startdate, id), 
     names_to = c( 'grp',  '.value'), names_pattern = "^([a-z])(.*)")
# A tibble: 2 x 11
#   startdate  id    grp   `0_9` `10_19` `20_29` `30_39` `40_49` `50_59` `60_69`  `70`
#  <chr>      <chr> <chr> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>
#1 2019-11-06 POL55 m        NA      NA      NA      NA      32      NA      NA    NA
#2 2019-11-06 POL55 f        32      NA      NA      NA      NA      NA      NA    NA

Or, selecting only the columns whose names match the pattern:

test1 %>% 
  pivot_longer(cols = matches("^(f|m)\\d+_?\\d*$"), names_to = 'age_bucket',
        values_to = 'count')
# A tibble: 16 x 4
#   startdate  id    age_bucket count
#   <chr>      <chr> <chr>      <dbl>
# 1 2019-11-06 POL55 m0_9          NA
# 2 2019-11-06 POL55 m10_19        NA
# 3 2019-11-06 POL55 m20_29        NA
# 4 2019-11-06 POL55 m30_39        NA
# 5 2019-11-06 POL55 m40_49        32
# 6 2019-11-06 POL55 m50_59        NA
# 7 2019-11-06 POL55 m60_69        NA
# 8 2019-11-06 POL55 m70           NA
# 9 2019-11-06 POL55 f0_9          32
#10 2019-11-06 POL55 f10_19        NA
#11 2019-11-06 POL55 f20_29        NA
#12 2019-11-06 POL55 f30_39        NA
#13 2019-11-06 POL55 f40_49        NA
#14 2019-11-06 POL55 f50_59        NA
#15 2019-11-06 POL55 f60_69        NA
#16 2019-11-06 POL55 f70           NA
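
To see which column names the regular expression in the last option would pick up, it can be tested against a plain character vector first. This is a minimal sketch: the names month and female_total are hypothetical decoys (they are not in the question's data) standing in for unrelated columns that also start with m or f.

```r
# Column names from the question plus two hypothetical decoys
# (`month`, `female_total`) that should NOT be selected.
cols <- c("startdate", "id", "m0_9", "m10_19", "m70",
          "f0_9", "f60_69", "f70", "month", "female_total")

# Same pattern as in the answer: f or m, one or more digits,
# an optional underscore, then zero or more digits.
pattern <- "^(f|m)\\d+_?\\d*$"

matched <- cols[grepl(pattern, cols)]
print(matched)
```

Because the pattern requires a digit directly after the leading letter, names such as month are left out even though they start with m.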

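【Comments】:

This only relies on the column names starting with f or m, right?

@Danka For the last option, yes.

As I mentioned in the question, I will have more columns that I don't want to match, and some of them will start with m or f.

@Danka From your example it isn't clear what the expected output is, i.e. your description says "start with an f or m".

@Danka To avoid non-specific matches, if the pattern is 'f' or 'm' followed by one or more digits: test1 %>% pivot_longer(cols = matches("^(f|m)\\d+_?\\d*$"), names_to = 'age_bucket', values_to = 'count')

【Answer 2】:

Use starts_with: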

test1 %>% 
  gather(age_bucket, count, c(starts_with("m"), starts_with("f")))
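
Since the question asks for gather specifically and mentions other columns starting with m or f, the same call can take a matches() helper instead of starts_with(). The following is a sketch under assumptions: the small tibble is a cut-down stand-in for test1, the month column is a hypothetical decoy, and the regular expression is one possible reading of "f or m followed by one or two digits" with an optional second number after an underscore.

```r
library(dplyr)
library(tidyr)

# Cut-down stand-in for test1; `month` is a hypothetical column
# that starts with "m" but must not be gathered.
test1 <- tibble(
  startdate = "2019-11-06", id = "POL55",
  m0_9 = NA_real_, m40_49 = 32, m70 = NA_real_,
  f0_9 = 32, f70 = NA_real_,
  month = "Nov"
)

# gather() accepts tidyselect helpers, so the columns can be chosen
# by a regular expression on their names rather than by position.
long <- test1 %>%
  gather(age_cat, count, matches("^[fm]\\d{1,2}(_\\d{1,2})?$"))

print(long)
```

Note that matches() ignores case by default, so a column named M0_9 would also be selected unless ignore.case = FALSE is passed.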

【Comments】:

I have updated the question, because it was incomplete and misleading.
