R language: how to return and print a list of missing entries based on two columns

Posted 2023-02-14


【Posted】: 2022-01-21 15:18:17 【Question】:

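I'm struggling to write R code that prints "the list of dates with no data, between a given start date and end date, for every possible value of another variable/column in the table". It's a little hard to explain in words, so I'll give a heavily simplified example that I hope makes clear what I'm trying to do.

You are the manager of a pet store and are responsible for checking the quality of the pet food sales data. The data comes from a csv file with four columns: date, animal food type, sale price, and quantity sold. The animal_type column can take 3 possible values: dog, cat, or bird, stored as strings.

Below I've mocked up, in a very simplified way, the data for the first three days of December. The price and quantity columns are not relevant, so I've left them blank.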

date	animal_type	price	quantity
2021-12-01	dog
2021-12-01	dog
2021-12-01	cat
2021-12-01	bird
2021-12-02	dog
2021-12-02	bird
2021-12-03	cat
2021-12-03	cat
2021-12-03	cat

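What I want to do is print/return the dates that do not have an entry for every possible value of the animal_type column. So, for my example, what I'd like to print is something like...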

2021-12-02  :  ['cat']
2021-12-03  :  ['dog', 'bird']

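because [2021-12-02] has no entry for 'cat', and [2021-12-03] has no entries for 'dog' or 'bird'. However, so far I've only managed to count the number of unique animal_type values for each date, using the following functions.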

library(tidyverse) # also attaches dplyr

# number of distinct animal_type values appearing on each date
df %>% group_by(date) %>% summarise(n = n_distinct(animal_type))
# number of distinct dates on which each animal_type appears
df %>% group_by(animal_type) %>% summarise(num_dates = n_distinct(date))

# output: number of distinct animal_type values appearing on each date
   date            n
   <date>       <int>
1 2021-12-01       3
2 2021-12-02       2
3 2021-12-03       1

# output: number of distinct dates on which each animal_type appears
  animal_type   num_dates
  <chr>         <int>
1 dog             2
2 cat             2
3 bird            2

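This tells me which dates are missing animal_type values, but not which specific values are missing. I've tried looking around but couldn't find many similar questions, so I'm wondering how feasible this is. I'm also rusty with R and am relearning most of the syntax, packages, and libraries, so I may be missing something simple. As you can probably tell from my code, I'm open to both tidyverse / dplyr and base R suggestions. I'd appreciate any help, and thank you for your time!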


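【Answer 1】:

You can use the tidyr::complete function together with an anti-join.

First, make the implicit missing values explicit with complete, then anti-join the completed tibble against the tibble you currently have.

See the example below.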

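library(tidyverse)

# Build every Date/Pet combination (3 dates x 3 pets = 9 rows)
example <- crossing("Date" = c("2021-12-01", "2021-12-02", "2021-12-03"),
                    "Pet" = c("Bird", "Cat", "Dog"))

# Drop rows 5, 7 and 9 to reproduce the gaps in the question's data
op_example <- example %>% slice(-c(5, 7, 9))

# Make the implicit gaps explicit, then keep only rows absent from the original
op_example %>%
  complete(Date, Pet) %>%
  anti_join(op_example, by = c("Date", "Pet"))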
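The same complete-then-anti-join idea can be applied directly to the data from the question and formatted one line per date. This is a minimal sketch, not the answerer's own code: the object names `df`, `missing`, and `missing_by_date` are made up for illustration, and the price and quantity columns are left out because the question says they are irrelevant.

```r
library(dplyr)
library(tidyr)

# The question's sample data (price and quantity omitted).
df <- tibble::tribble(
  ~date,        ~animal_type,
  "2021-12-01", "dog",
  "2021-12-01", "dog",
  "2021-12-01", "cat",
  "2021-12-01", "bird",
  "2021-12-02", "dog",
  "2021-12-02", "bird",
  "2021-12-03", "cat",
  "2021-12-03", "cat",
  "2021-12-03", "cat"
) %>%
  mutate(date = as.Date(date))

# Every date/animal_type combination that never occurs in df.
missing <- df %>%
  distinct(date, animal_type) %>%
  complete(date, animal_type) %>%
  anti_join(df, by = c("date", "animal_type"))

# Collapse to one row per date, listing the missing types as a string.
missing_by_date <- missing %>%
  group_by(date) %>%
  summarise(missing_types = toString(animal_type), .groups = "drop")

# Print in a "date : [types]" layout similar to the one asked for.
cat(sprintf("%s : [%s]\n",
            format(missing_by_date$date),
            missing_by_date$missing_types),
    sep = "")
```

For the sample data this should report cat as missing on 2021-12-02, and bird and dog as missing on 2021-12-03. Dates on which every type appears, such as 2021-12-01, drop out of the result entirely.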

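【Comments】:

Thanks for your help! I tried your solution and it worked. But I'm a bit confused about what op_example <- example %>% slice(-c(5, 7, 9)) is doing. Could you explain it, if possible?

I was just replicating the data you provided! You don't need that part. If the answer fits your needs, feel free to upvote it.

Do you know how to generalize what the op_example <- example %>% slice(-c(5, 7, 9)) part of the code does? I've been using your solution, but removing the missing entries by hand with slice feels too specific to the example data, because there you already know which entries are missing, as opposed to trying to find them. I'd like to see whether I can find the missing entries in other similar but larger datasets. I tried using group_by(), but its syntax doesn't work well with complete() and anti_join().

That doesn't matter. Even if the missing parts are random, the approach is the same. You use complete on the dataset for the columns whose combinations you want to find. Then you anti-join that data frame with the original one.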
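The last comment can be sketched as a reusable function. One caveat: complete only knows about dates that already appear in the data, so a day with no rows at all would go unnoticed. Since the question asks about a given start and end date, the sketch below builds the expected grid from an explicit date sequence instead. The helper name `find_missing_entries`, its arguments, and the `sales` data are all hypothetical, and it assumes the columns are called `date` (of class Date) and `animal_type`.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: returns every (date, animal_type) pair with no row in
# `data` between `start` and `end`, including days with no rows at all.
find_missing_entries <- function(data, start, end,
                                 types = unique(data$animal_type)) {
  # Full grid: every calendar day in the range crossed with every type.
  expected <- expand_grid(
    date = seq(as.Date(start), as.Date(end), by = "day"),
    animal_type = types
  )
  # Keep only the expected combinations that are absent from the data.
  anti_join(expected, data, by = c("date", "animal_type"))
}

# Made-up data: nothing at all was recorded on 2021-12-02.
sales <- tibble::tibble(
  date = as.Date(c("2021-12-01", "2021-12-01", "2021-12-01", "2021-12-03")),
  animal_type = c("dog", "cat", "bird", "cat")
)

result <- find_missing_entries(sales, "2021-12-01", "2021-12-03",
                               types = c("dog", "cat", "bird"))
print(result)
```

Passing `types` explicitly is a deliberate choice: if a type never appears anywhere in the data, deriving the list from the data itself would silently leave it out.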
