Find string in data.frame

Posted 2023-03-29

【Title】Find string in data.frame 【Posted】2017-01-19 20:57:03 【Question】:

How can I search for a string in a data.frame? As a minimal example, how do I find the locations (column and row) of 'horse' in this data.frame?

> df = data.frame(animal=c('goat','horse','horse','two', 'five'), level=c('five','one','three',30,'horse'), length=c(10, 20, 30, 'horse', 'eight'))
> df
  animal level length
1   goat  five     10
2  horse   one     20
3  horse three     30
4    two    30  horse
5   five horse  eight

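...so rows 4 and 5 are in the wrong order. Any output that lets me identify that 'horse' has shifted to the level column in row 5 and to the length column in row 4 would be fine. Perhaps: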

> magic_function(df, 'horse')
col       row
'animal', 2
'animal', 3
'length', 4
'level',  5

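This is what I want to use it for: I have a very large data frame (about 60 columns, 20,000 rows) in which some columns have been scrambled for some rows. It is too large to spot by eye the different ways the order might be wrong, so a search would be useful. I will use this information to move the data into the correct columns for those rows.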

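【Comments】:

What order is correct, and what result do you want?

@TimBiegeleisen I have updated the question. By "location" I mean column and row (a data frame is two-dimensional). I am not asking how to change the order; I know how to do that. It is only there to give the background to my question. I figured there was no value in pasting my actual 60 columns, since everyone would ask for a minimal example ;-) But in my minimal example, assume the first three rows are in the correct order and the data are in the wrong columns in rows 4 and 5.

Re "that 'horse' has shifted to the level column in row 5": there is no horse in row 5 of your example. I'm lost there.

@RonakShah, I have updated the question with an example output.

@Tensibai, thanks for noticing. I have updated the example.

【Answer 1】: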

How about:

which(df == "horse", arr.ind = TRUE)
#      row col
# [1,]   2   1
# [2,]   3   1
# [3,]   5   2
# [4,]   4   3
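The index matrix above can be turned into the column-name and row listing that the question asks for. A minimal sketch, assuming a hypothetical helper name `find_string` (not from the thread) and exact, whole-cell matching:

```r
# Hypothetical helper: the name find_string is illustrative, not from the thread.
find_string <- function(df, pattern) {
  # Matrix of (row, col) positions where a cell equals the pattern exactly
  hits <- which(df == pattern, arr.ind = TRUE)
  data.frame(col = colnames(df)[hits[, "col"]],  # translate column index to name
             row = unname(hits[, "row"]),
             stringsAsFactors = FALSE)
}

# The example data from the question
df <- data.frame(animal = c('goat', 'horse', 'horse', 'two', 'five'),
                 level  = c('five', 'one', 'three', 30, 'horse'),
                 length = c(10, 20, 30, 'horse', 'eight'),
                 stringsAsFactors = FALSE)

res <- find_string(df, "horse")
print(res)
```

For the example data this should list animal at rows 2 and 3, level at row 5 and length at row 4, ordered column by column.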

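【Comments】:

Nice. Given the OP's problem, I would combine it with count(df$animal), called for every column of the data frame. That returns how many times each level occurs in the column, which makes outliers easier to spot. To use count, you first need library(plyr).

@larsen, colSums(df=='horse') is a more concise way.

@Jonas Thanks, but I think that solves a slightly different problem. Maybe I misunderstood the question, but I imagine that with malformed data one would not know in advance which column values are misplaced. The technique I proposed is meant to detect those misplaced values.

【Answer 2】: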

Another way:

l <- sapply(colnames(df), function(x) grep("horse", df[,x]))

$animal
[1] 2 3

$level
[1] 5

$length
[1] 4

If you want the output as a matrix:

sapply(l,'[',1:max(lengths(l)))

     animal level length
[1,]      2     5      4
[2,]      3    NA     NA
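Note that grep() matches a pattern inside a cell rather than comparing whole cells, and that sapply() may simplify its result to a vector or matrix when every column yields the same number of matches. A hedged sketch of a variant that uses lapply() and fixed = TRUE (both choices are assumptions, not part of the original answer) and stacks the result into one row per match:

```r
# The example data from the question
df <- data.frame(animal = c('goat', 'horse', 'horse', 'two', 'five'),
                 level  = c('five', 'one', 'three', 30, 'horse'),
                 length = c(10, 20, 30, 'horse', 'eight'),
                 stringsAsFactors = FALSE)

# lapply always returns a list, one integer vector of row numbers per column.
# fixed = TRUE treats "horse" as a literal string instead of a regular expression.
hits <- lapply(df, function(column) grep("horse", column, fixed = TRUE))

# Repeat each column name once per match and flatten the row numbers
long <- data.frame(col = rep(names(hits), lengths(hits)),
                   row = unlist(hits, use.names = FALSE),
                   stringsAsFactors = FALSE)
print(long)
```

Because grep() looks for substrings, a cell such as "seahorse" would also be reported; the equality-based approach in Answer 1 would skip it.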

【Comments】:

【Answer 3】:

Another approach is as follows:

library(data.table)
library(zoo)
library(dplyr)
library(timeDate)
library(reshape2)
# the data frame is named tbl_Account

First, transpose it:

temp = t(tbl_Account)

Then, put it into a list:

temp = list(temp)

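This essentially collects every observation in the data frame into one large matrix held in a single list element, so you can search the whole data frame in one go.

Then run the search: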

temp[[1]][grep("Horse",temp[[1]])] #brings back the actual value occurrences
grep("Horse", temp[[1]]) # brings back the position of the element in a list it occurs in
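The positions returned by grep() refer to the transposed matrix, so they need converting before they point at a row and column of the original data. A minimal sketch, assuming the question's example df rather than tbl_Account, and lowercase "horse" to match that data:

```r
# The example data from the question (stands in for tbl_Account)
df <- data.frame(animal = c('goat', 'horse', 'horse', 'two', 'five'),
                 level  = c('five', 'one', 'three', 30, 'horse'),
                 length = c(10, 20, 30, 'horse', 'eight'),
                 stringsAsFactors = FALSE)

m   <- t(df)                   # ncol(df) x nrow(df) character matrix
pos <- grep("horse", m)        # linear positions within the transposed matrix
ind <- arrayInd(pos, dim(m))   # column 1: row of m (= column of df)
                               # column 2: column of m (= row of df)

res <- data.frame(col = colnames(df)[ind[, 1]],
                  row = ind[, 2],
                  stringsAsFactors = FALSE)
print(res)
```

Since the transposed matrix is scanned one original row at a time, the matches come back ordered by row of df rather than by column.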

Hope this helps :)

【Comments】:

【Answer 4】:

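We can get the linear indices of the values equal to horse. R numbers the cells column by column, so subtracting 1 from an index and integer-dividing by the number of rows (nrow) gives the zero-based column, while the remainder of the same division gives the zero-based row. Adding 1 to each yields the column index and the row index.

We use colnames to get the column names instead of the indices.

idx <- which(df == "horse")   # column-major linear indices of the matches
data.frame(col = colnames(df)[(idx - 1) %/% nrow(df) + 1],
           row = (idx - 1) %% nrow(df) + 1)

#      col row
# 1 animal   2
# 2 animal   3
# 3  level   5
# 4 length   4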
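The conversion from a column-major linear index to row and column positions can be cross-checked against base R's arrayInd(), which performs the same conversion. A small sketch using the question's example data (the variable names idx, manual and builtin are illustrative):

```r
# The example data from the question
df <- data.frame(animal = c('goat', 'horse', 'horse', 'two', 'five'),
                 level  = c('five', 'one', 'three', 30, 'horse'),
                 length = c(10, 20, 30, 'horse', 'eight'),
                 stringsAsFactors = FALSE)

idx <- which(df == "horse")    # linear indices, counted down each column in turn

# Zero-based arithmetic: remainder gives the row, integer quotient gives the column
manual <- cbind((idx - 1L) %% nrow(df) + 1L,
                (idx - 1L) %/% nrow(df) + 1L)

# Built-in equivalent: column 1 holds rows, column 2 holds columns
builtin <- arrayInd(idx, dim(df))

stopifnot(all(manual == builtin))
```

If the two disagree for some data, the arithmetic is wrong, which makes arrayInd() a convenient guard when hand-rolling the index conversion.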

【讨论】：
