How to divide a data factor into subsets in R

Posted 2023-02-14

【Published】: 2022-01-09 05:55:15

【Question】:

I have a factor column like this:

id               
2000.1.ABC0123
2010.11.BCD3652

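The structure is "year", "month" and "identifier". I just want to split it into two columns, where the month must be numeric with a leading "0" for single-digit months, as shown below.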

Identifier     yearmonth    
ABC0123        200001
BCD3652        201011

I have played around with paste0 and substr, but could not get it to work.

Any help is much appreciated.

Thanks, best, D.

【Comments】:

Something like this: str_split('2000.1.ABC0123', pattern = "\\.(?![:digit:])")

【Answer 1】:

In base R:

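# as.character() is needed because read.table(text = ...) does not accept a factor
read.table(text = as.character(df$id), sep = '.', col.names = c('year', 'month', 'Identifier')) |>
  transform(yearmonth = sprintf("%d%02d", year, month))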

  year month Identifier yearmonth
1 2000     1    ABC0123    200001
2 2010    11    BCD3652    201011

Tidyverse:

library(tidyr)
library(dplyr)

df %>%
  separate(id, c('year', 'month', 'Identifier'), convert = TRUE) %>%
  mutate(month = sprintf('%02d', month)) %>%
  unite('yearmonth', year, month, sep='')

  yearmonth Identifier
  <chr>     <chr>     
1 200001    ABC0123   
2 201011    BCD3652

【Comments】:

【Answer 2】:

Here is an example using stringr and dplyr:

library(tidyverse)

df_example <- tribble(~id,
                      '2000.1.ABC0123',
                      '2010.11.BCD3652')


df_example |> 
  mutate(split_cols = str_split(id,pattern = "\\.(?![:digit:])"),
         yearmonth  = split_cols |> map_chr(pluck(1)) |>  str_remove('\\.'),
         Identifier =  split_cols |> map_chr(pluck(2))
         )
#> # A tibble: 2 x 4
#>   id              split_cols yearmonth Identifier
#>   <chr>           <list>     <chr>     <chr>     
#> 1 2000.1.ABC0123  <chr [2]>  20001     ABC0123   
#> 2 2010.11.BCD3652 <chr [2]>  201011    BCD3652

Created on 2021-12-02 by the reprex package (v2.0.1)
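
Note that the output above shows 20001 for the first row, because removing the dot does not zero-pad a single-digit month. A minimal sketch of one way to get the padded form with stringr, assuming the same two example strings (the variable names `pieces`, `yearmonth` and `Identifier` are illustrative, and `str_split_fixed` plus `str_pad` are swapped in for the lookahead split used above):

```r
library(stringr)

# Example ids taken from the question
id <- c("2000.1.ABC0123", "2010.11.BCD3652")

# Split each string into exactly three pieces on the literal dot
pieces <- str_split_fixed(id, fixed("."), n = 3)

# Left-pad the month with "0" to width 2, then append it to the year
yearmonth  <- str_c(pieces[, 1], str_pad(pieces[, 2], width = 2, pad = "0"))
Identifier <- pieces[, 3]

data.frame(Identifier, yearmonth)
```

Padding the month as a string avoids a round trip through numeric conversion; if `id` is stored as a factor, wrapping it in `as.character()` first is a safe precaution.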

【Comments】:

【Answer 3】:

A combination of strsplit and sprintf should give you the desired output.

x = unlist(strsplit('2000.1.ABC0123', split='\\.'))
y = as.numeric(x[1:2])
sprintf('%4d%02d', y[1], y[2])
x[3]
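
The snippet above handles a single string. A minimal sketch of the same idea applied to a whole factor column, assuming a data frame named `df` with a factor column `id` as described in the question (the names `df`, `parts` and `result` are illustrative):

```r
# Hypothetical example data: a factor column like the one in the question
df <- data.frame(id = factor(c("2000.1.ABC0123", "2010.11.BCD3652")))

# Convert the factor to character first, then split on literal dots
parts <- strsplit(as.character(df$id), split = ".", fixed = TRUE)

# Pull out each piece by position across all rows
year  <- as.integer(vapply(parts, `[`, character(1), 1))
month <- as.integer(vapply(parts, `[`, character(1), 2))

result <- data.frame(
  Identifier = vapply(parts, `[`, character(1), 3),
  yearmonth  = sprintf("%d%02d", year, month)  # zero-pad the month to two digits
)
result
```

Using `fixed = TRUE` sidesteps regex escaping of the dot, and `vapply` makes the sketch fail loudly if a row does not contain the expected pieces.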

【Comments】:

【Answer 4】:

library(data.table)
DT <- fread("id               
2000.1.ABC0123
2010.11.BCD3652 ")

DT[, c("year", "month", "Identifier") := tstrsplit(id, ".", fixed = TRUE)]
DT[, yearmonth := paste0(year, sprintf("%02d", as.numeric(month)))]
#                 id year month Identifier yearmonth
# 1:  2000.1.ABC0123 2000     1    ABC0123    200001
# 2: 2010.11.BCD3652 2010    11    BCD3652    201011

【Comments】: