Identify patterns and turn them as a new column
【Title】: Identify patterns and turn them as a new column 【Posted】: 2018-05-18 12:26:03 【Question】: I am working on a project that involves a large number of tables stored in HTML. While scraping them, I ran into the following problem.
Some of the tables that I am scraping look like this
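When I import them into a data frame, I have to pass the fill = TRUE argument in the following code because of the rows with merged cells ("chicken" and "chicken without bones"):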
library(rvest)

# `link` is the page URL and `node` the CSS selector of the table
read_html(link) %>%
  html_nodes(node) %>%
  html_table(fill = TRUE, header = TRUE, dec = ",")
But this gives me a table like this:
df <- data.frame(year = c("chicken",2000,2001,2002,"chicken without bones",2003,2004,2005, "chicken without bones and feet", 2006, 2007, 2008),
weight = c("chicken",5,6,4,"chicken without bones",2,1,3,"chicken without bones and feet", 1, 1.5, 2)
)
I am trying to find a way to make my table look like this:
df2 <- data.frame(year = c(2000,2001,2002, 2003, 2004, 2005,2006,2007, 2008), number = c(5,6,4,2,1,3,1,1.5, 2),
new_variable = c("chicken","chicken","chicken","chicken without bones","chicken without bones",
"chicken without bones","chicken without bones and feet","chicken without bones and feet","chicken without bones and feet" )
)
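I have been struggling with this in R, but I still do not know how to do it for the 1.028.974 tables I scraped. Obs.: the tables do not follow a regular pattern for where this happens; therefore I need code that identifies the filled rows, takes their value as character, and uses it as the value of a new column until the next filled row occurs.
Thanks for your attention!!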
【Question comments】:
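Welcome to ***! Please read about how to ask a good question and how to provide a reproducible example. This will make it much easier for others to help you.
Oh, thank you for your attention, and sorry for my poor English. I will try to fix it and make a reproducible example!
This may be overfitted to the example you provided, but try cbind(df[c(FALSE, TRUE, TRUE, TRUE),], new_var = rep(as.character(df[c(TRUE, FALSE, FALSE, FALSE),]$year), each = 3))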
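Oh, thank you! It works, but as you said, it is very specific to this example! I need something that automatically identifies the repeated rows and turns them into a new column, because I have many tables and each one has its own format. How can I obtain this vector c(FALSE, TRUE, TRUE, TRUE)?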
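One possible way to derive that logical vector instead of hard-coding it, sketched in base R. It assumes that fill = TRUE repeats a merged cell's text in every column of its row (as in the example df) and that the table starts with such a row. The names is_label, group_id, and result are illustrative only.

```r
# Example data from the question; mixing text and numbers makes both columns character
df <- data.frame(
  year = c("chicken", 2000, 2001, 2002,
           "chicken without bones", 2003, 2004, 2005,
           "chicken without bones and feet", 2006, 2007, 2008),
  weight = c("chicken", 5, 6, 4,
             "chicken without bones", 2, 1, 3,
             "chicken without bones and feet", 1, 1.5, 2),
  stringsAsFactors = FALSE
)

# Assumption: a merged (label) row holds the same text in every column
is_label <- df$year == df$weight

# Every label row opens a new group, so the running sum numbers the groups 1, 2, 3
group_id <- cumsum(is_label)

# Keep only the data rows and attach the label of the group each one belongs to
result <- cbind(
  df[!is_label, ],
  new_variable = df$year[is_label][group_id[!is_label]]
)
```

Here !is_label plays the role of c(FALSE, TRUE, TRUE, TRUE), but it is computed from the data, so the groups do not need to have the same number of rows.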
【Answer 1】:
You can try this:
library(dplyr)
library(zoo)

df %>%
  mutate_if(is.factor, as.character) %>%
  mutate(new_variable = ifelse(grepl("\\D+", year), year, NA),
         new_variable = na.locf(new_variable)) %>%
  filter(!grepl("\\D+", year))
The output is:
  year weight                   new_variable
1 2000      5                        chicken
2 2001      6                        chicken
3 2002      4                        chicken
4 2003      2          chicken without bones
5 2004      1          chicken without bones
6 2005      3          chicken without bones
7 2006      1 chicken without bones and feet
8 2007    1.5 chicken without bones and feet
9 2008      2 chicken without bones and feet
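Since the question mentions many tables with different layouts, the same idea can be wrapped in a function and applied to a list of tables. The following is a minimal base-R sketch, not part of the original answer. The helper name label_rows_to_column and its new_col parameter are hypothetical, and it assumes that a label row repeats the same text in every column and that each table starts with a label row. It also converts the remaining columns back to numbers, because they stay character after the factor conversion.

```r
# Hypothetical helper: move label rows of one scraped table into a new column
label_rows_to_column <- function(tbl, new_col = "new_variable") {
  tbl[] <- lapply(tbl, as.character)              # factors -> character
  # A row is a label row when every column equals the first column
  is_label <- Reduce(`&`, lapply(tbl, function(col) col == tbl[[1]]))
  group_id <- cumsum(is_label)                    # group number of each row
  out <- tbl[!is_label, , drop = FALSE]           # keep the data rows only
  out[[new_col]] <- tbl[[1]][is_label][group_id[!is_label]]
  out[] <- lapply(out, type.convert, as.is = TRUE)  # "2000" -> 2000L, "1.5" -> 1.5
  rownames(out) <- NULL
  out
}

# Example data from the question, stored as factors like the sample data
df <- data.frame(
  year = c("chicken", 2000, 2001, 2002,
           "chicken without bones", 2003, 2004, 2005,
           "chicken without bones and feet", 2006, 2007, 2008),
  weight = c("chicken", 5, 6, 4,
             "chicken without bones", 2, 1, 3,
             "chicken without bones and feet", 1, 1.5, 2),
  stringsAsFactors = TRUE
)

# Apply the helper to a list of tables (here the same table twice)
cleaned <- lapply(list(df, df), label_rows_to_column)
```

Comparing whole rows rather than matching a regular expression on year keeps the helper independent of column names, but it would misfire on a data row whose cells all happen to be equal, so it is worth spot-checking a few tables.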
Sample data:
df <- structure(list(year = structure(c(10L, 1L, 2L, 3L, 11L, 4L, 5L,
6L, 12L, 7L, 8L, 9L), .Label = c("2000", "2001", "2002", "2003",
"2004", "2005", "2006", "2007", "2008", "chicken", "chicken without bones",
"chicken without bones and feet"), class = "factor"), weight = structure(c(8L,
6L, 7L, 5L, 9L, 3L, 1L, 4L, 10L, 1L, 2L, 3L), .Label = c("1",
"1.5", "2", "3", "4", "5", "6", "chicken", "chicken without bones",
"chicken without bones and feet"), class = "factor")), class = "data.frame", row.names = c(NA,
-12L))
#                              year                         weight
#1                          chicken                        chicken
#2                             2000                              5
#3                             2001                              6
#4                             2002                              4
#5            chicken without bones          chicken without bones
#6                             2003                              2
#7                             2004                              1
#8                             2005                              3
#9   chicken without bones and feet chicken without bones and feet
#10                            2006                              1
#11                            2007                            1.5
#12                            2008                              2
【Comments】:
@Pedro Perhaps you should accept the answer if it helped you solve your problem, so that this question can be considered closed.