Splitting column text by delimiter into multiple different columns in R

【Title】Splitting column text by delimiter into multiple different columns in R 【Posted】2022-01-17 02:37:56 【Question】:

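I have the following problem. I need to split the text below into separate columns. The data was extracted by web scraping, and I need to transform it for analysis. As an example, I copied one row; the only information I need from it is "id":357 and "slug":"journalism/audio". Do you know how I can transform it in R? The text below comes from a column of the df: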

"id":357,"name":"Audio","analytics_name":"Audio","slug":"journalism/audio","position":1,"parent_id":13,"parent_name":"Journalism","color":1228010,"urls":"web":"discover":"http://www.kickstarter.com/discover/categories/journalism/audio"

A screenshot sample table of the data I want to transform

【Comments】:

【Answer 1】:

Starting with a string like this

stri
[1] "\"id\":357,\"name\":\"Audio\",\"analytics_name\":\"Audio\",\"slug\":\"journalism/audio\",\"position\":1,\"parent_id\":13,\"parent_name\":\"Journalism\",\"color\":1228010,\"urls\":\"web\":\"discover\":\"http://www.kickstarter.com/discover/categories/journalism/audio"

First split the string into chunks at the commas with strsplit and remove the quotes

d <- gsub( "\"","", strsplit(stri, ",")[[1]] )
d
[1] "id:357"
[2] "name:Audio"
[3] "analytics_name:Audio"
[4] "slug:journalism/audio"
[5] "position:1"
[6] "parent_id:13"
[7] "parent_name:Journalism"
[8] "color:1228010"
[9] "urls:web:discover:http://www.kickstarter.com/discover/categories/journalism/audio"

Finally, build the data frame

dat <- data.frame( strsplit( d[grep("^id|^slug",d)], ":" ) )[2,]

colnames( dat ) <- data.frame( strsplit( d[grep("^id|^slug",d)], ":" ) )[1,]
dat
   id             slug
2 357 journalism/audio
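
The same idea can be applied to a whole column at once by pulling out each field with a regular expression instead of splitting on commas. This is only a minimal base R sketch: the helper name extract_field and the second row of the toy data are made up for illustration, and it assumes every "id" is a run of digits and every "slug" is a quoted string.

```r
# Hypothetical helper (not part of the answer above): return the first
# capture group of `pattern` for each element of x, or NA if no match.
extract_field <- function(x, pattern) {
  m <- regmatches(x, regexec(pattern, x))
  vapply(m, function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1))
}

# Toy data: row 1 is the sample from the question, row 2 is invented
# and deliberately has no "slug" field.
df <- data.frame(
  category = c(
    '"id":357,"name":"Audio","analytics_name":"Audio","slug":"journalism/audio","position":1,"parent_id":13,"parent_name":"Journalism","color":1228010,"urls":"web":"discover":"http://www.kickstarter.com/discover/categories/journalism/audio"',
    '"id":13,"name":"Journalism","position":2'
  ),
  stringsAsFactors = FALSE
)

# The leading quote in '"id":' keeps "parent_id" from matching.
res <- data.frame(
  id   = as.integer(extract_field(df$category, '"id":([0-9]+)')),
  slug = extract_field(df$category, '"slug":"([^"]*)"'),
  stringsAsFactors = FALSE
)
res
```

Rows where a field is missing end up as NA rather than raising an error, which may be convenient for scraped data.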

Data

stri <- "\"id\":357,\"name\":\"Audio\",\"analytics_name\":\"Audio\",\"slug\":\"journalism/audio\",\"position\":1,\"parent_id\":13,\"parent_name\":\"Journalism\",\"color\":1228010,\"urls\":\"web\":\"discover\":\"http://www.kickstarter.com/discover/categories/journalism/audio"

【Comments】:

Thanks, this solution works for my table.

【Answer 2】:

Going off the one row of data you provided here, maybe something like this?

library(magrittr)
library(stringr)
library(tidyr)
library(dplyr)

#Toy data.
df <- data.frame(category = '"id":357,"name":"Audio","analytics_name":"Audio","slug":"journalism/audio","position":1,"parent_id":13,"parent_name":"Journalism","color":1228010,"urls":"web":"discover":"http://www.kickstarter.com/discover/categories/journalism/audio"')
df[2, ] <- df[1, ] 


df %>% 
  mutate(ucol = row_number()) %>%
  separate_rows(category, sep = ",") %>% 
  mutate(category = str_replace_all(category, '[\\"\\\\]', "")) %>% 
  filter(str_detect(category, "^id|^slug")) %>%
  separate(category, sep = ":", into = c("key", "val")) %>%
  pivot_wider(names_from = key, values_from = val)

# # A tibble: 2 × 3
#    ucol id    slug            
#   <int> <chr> <chr>           
# 1     1 357   journalism/audio
# 2     2 357   journalism/audio
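
If only these two fields are needed, a shorter route might be tidyr::extract(), which fills new columns from regex capture groups in a single step. This is a sketch under assumptions rather than a drop-in replacement: it assumes "id" always appears before "slug" within each string and that both are present in every row.

```r
library(tidyr)

# Toy data: the single sample row from the question.
df <- data.frame(
  category = '"id":357,"name":"Audio","analytics_name":"Audio","slug":"journalism/audio","position":1,"parent_id":13,"parent_name":"Journalism","color":1228010,"urls":"web":"discover":"http://www.kickstarter.com/discover/categories/journalism/audio"',
  stringsAsFactors = FALSE
)

# One capture group per output column; convert = TRUE turns id into an integer.
res <- tidyr::extract(
  df, category,
  into    = c("id", "slug"),
  regex   = '"id":(\\d+).*"slug":"([^"]*)"',
  convert = TRUE
)
res
```

The function is called with the tidyr:: prefix because magrittr also exports a function named extract.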

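【Comments】:

Thanks, this solution helped me a lot. In the future I will try to provide better data for testing. Your code helped me build my understanding of using pipes and transforming data into meaningful information.

@user16268585 You're welcome!!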
