In R, prevent unlist from removing NULL values, by replacing NULL with NA in nested list

Posted: 2021-08-17 18:49:13

Question:

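Here is 1 row of data that we fetch from a sports API, which arrives as a nested list. Our fetch_results$data is a list containing one such nested list for each of many matches, since the data covers many soccer games. The list-of-lists nesting can go 3-4 levels deep, with inner lists for scores, time, visitorTeam and so on below.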

> dput(fetch_results$data[1])
list(list(id = 11984409L, league_id = 1326L, season_id = 15733L, 
    stage_id = 77442469L, round_id = 186274L, group_id = 225400L, 
    aggregate_id = NULL, venue_id = 7189L, referee_id = NULL, 
    localteam_id = 18716L, visitorteam_id = 18658L, winner_team_id = NULL, 
    weather_report = NULL, commentaries = FALSE, attendance = NULL, 
    pitch = NULL, details = "Match 1", neutral_venue = FALSE, 
    winning_odds_calculated = FALSE, formations = list(localteam_formation = NULL, 
        visitorteam_formation = NULL), scores = list(localteam_score = 0L, 
        visitorteam_score = 0L, localteam_pen_score = NULL, visitorteam_pen_score = NULL, 
        ht_score = NULL, ft_score = NULL, et_score = NULL, ps_score = NULL), 
    time = list(status = "NS", starting_at = list(date_time = "2021-06-11 19:00:00", 
        date = "2021-06-11", time = "19:00:00", timestamp = 1623438000L, 
        timezone = "UTC"), minute = NULL, second = NULL, added_time = NULL, 
        extra_minute = NULL, injury_time = NULL), coaches = list(
        localteam_coach_id = 455836L, visitorteam_coach_id = 784486L), 
    standings = list(localteam_position = 3L, visitorteam_position = 1L), 
    assistants = list(first_assistant_id = NULL, second_assistant_id = NULL, 
        fourth_official_id = NULL), leg = "1/1", colors = NULL, 
    deleted = FALSE, is_placeholder = FALSE, localTeam = list(
        data = list(id = 18716L, legacy_id = 213L, name = "Turkey", 
            short_code = "TUR", twitter = NULL, country_id = 404L, 
            national_team = TRUE, founded = 1923L, logo_path = "https://cdn.sportmonks.com/images//soccer/teams/28/18716.png", 
            venue_id = 9634L, current_season_id = 15733L, is_placeholder = NULL)), 
    visitorTeam = list(data = list(id = 18658L, legacy_id = 205L, 
        name = "Italy", short_code = "ITA", twitter = NULL, country_id = 251L, 
        national_team = TRUE, founded = 1898L, logo_path = "https://cdn.sportmonks.com/images//soccer/teams/2/18658.png", 
        venue_id = 7189L, current_season_id = 15733L, is_placeholder = NULL))))

To flatten this into a data frame, we use:

library(dplyr)  # provides %>% and as_tibble()

zed <- fetch_results$data %>%
    purrr::map(unlist) %>% 
    purrr::map(t) %>% 
    purrr::map(as_tibble) %>% 
    dplyr::bind_rows() %>%
    readr::type_convert()

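One row of our data frame output looks like this:

If you look closely at the list of lists, many objects with the value NULL are dropped from the main data frame. The entire score list and all of its keys are dropped. According to a related post, it looks like unlist() dropping NULL values is the culprit...

The solution posted in that thread only handles NULL values nested 1 level deep, whereas the list above has many nested lists, as you can see by searching for list() above.

What is the best way to flatten this list of lists without dropping any of the columns that have NULL values? If the best approach is to first replace NULL with NA, what is the best way to do that? Our existing code does the flattening and gets close, but it does not keep the columns with NULL.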
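
The behavior described above can be reproduced with a much smaller list. The following is a minimal base R sketch, not part of the original question; the list `nested` and its values are made up for illustration and only mimic the shape of one match record.

```r
# Hypothetical miniature of one match record (names borrowed from the dput above)
nested <- list(
  id = 1L,
  referee_id = NULL,
  scores = list(localteam_score = 0L, ht_score = NULL)
)

# unlist() silently drops every NULL leaf, at any nesting depth
flat <- unlist(nested)

names(flat)   # "id" "scores.localteam_score"
length(flat)  # 2
```

Because the NULL leaves never reach the flattened vector, the later steps have no way to create columns for them.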

Answer 1:

One option is to convert the NULL values to NA before doing anything else. This can be done recursively with rrapply:

library(rrapply)
library(purrr)
library(dplyr)
# Recursively replace each NULL with NA, then flatten every match into a one-row tibble
zed <-  rrapply(fetch_results$data, f = function(x) 
               replace(x, is.null(x), NA)) %>% 
        map(unlist) %>%
        map(t) %>%
        map(as_tibble) %>%
        bind_rows() %>% 
        type.convert(as.is = TRUE)

-output

# A tibble: 1 x 75
        id league_id season_id stage_id round_id group_id aggregate_id venue_id referee_id localteam_id visitorteam_id winner_team_id weather_report
     <int>     <int>     <int>    <int>    <int>    <int> <lgl>           <int> <lgl>             <int>          <int> <lgl>          <lgl>         
1 11984409      1326     15733 77442469   186274   225400 NA               7189 NA                18716          18658 NA             NA            
# … with 62 more variables: commentaries <lgl>, attendance <lgl>, pitch <lgl>, details <chr>, neutral_venue <lgl>, winning_odds_calculated <lgl>,
#   formations.localteam_formation <lgl>, formations.visitorteam_formation <lgl>, scores.localteam_score <int>, scores.visitorteam_score <int>,
#   scores.localteam_pen_score <lgl>, scores.visitorteam_pen_score <lgl>, scores.ht_score <lgl>, scores.ft_score <lgl>, scores.et_score <lgl>,
#   scores.ps_score <lgl>, time.status <chr>, time.starting_at.date_time <chr>, time.starting_at.date <chr>, time.starting_at.time <chr>,
#   time.starting_at.timestamp <int>, time.starting_at.timezone <chr>, time.minute <lgl>, time.second <lgl>, time.added_time <lgl>, time.extra_minute <lgl>,
#   time.injury_time <lgl>, coaches.localteam_coach_id <int>, coaches.visitorteam_coach_id <int>, standings.localteam_position <int>,
#   standings.visitorteam_position <int>, assistants.first_assistant_id <lgl>, assistants.second_assistant_id <lgl>, assistants.fourth_official_id <lgl>,
#   leg <chr>, colors <lgl>, deleted <lgl>, is_placeholder <lgl>, localTeam.data.id <int>, localTeam.data.legacy_id <int>, localTeam.data.name <chr>,
#   localTeam.data.short_code <chr>, localTeam.data.twitter <lgl>, localTeam.data.country_id <int>, localTeam.data.national_team <lgl>,
#   localTeam.data.founded <int>, localTeam.data.logo_path <chr>, localTeam.data.venue_id <int>, localTeam.data.current_season_id <int>,
#   localTeam.data.is_placeholder <lgl>, visitorTeam.data.id <int>, visitorTeam.data.legacy_id <int>, visitorTeam.data.name <chr>,
#   visitorTeam.data.short_code <chr>, visitorTeam.data.twitter <lgl>, visitorTeam.data.country_id <int>, visitorTeam.data.national_team <lgl>,
#   visitorTeam.data.founded <int>, visitorTeam.data.logo_path <chr>, visitorTeam.data.venue_id <int>, visitorTeam.data.current_season_id <int>,
#   visitorTeam.data.is_placeholder <lgl>
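
If adding a package is not an option, the same NULL-to-NA step can probably be sketched in base R with a small recursive function. This is an illustration under assumptions, not the answer's method: `null_to_na` and `match_row` are hypothetical names, and the sketch assumes the input contains only plain nested lists and atomic values, as in the dput above.

```r
# Hypothetical helper (not from the original answer): walk a nested list
# and return NA wherever a NULL is found, leaving everything else untouched.
null_to_na <- function(x) {
  if (is.null(x)) return(NA)
  if (is.list(x)) return(lapply(x, null_to_na))  # lapply keeps the names
  x
}

# Made-up miniature of one match record
match_row <- list(
  id = 1L,
  referee_id = NULL,
  scores = list(localteam_score = 0L, ht_score = NULL)
)

flat_fixed <- unlist(null_to_na(match_row))

names(flat_fixed)
# "id" "referee_id" "scores.localteam_score" "scores.ht_score"
```

Applied to each element of fetch_results$data before the map(unlist) step, a helper like this should keep the NULL-valued keys as NA columns. Note that an empty list() has no leaf to replace, so it would still vanish in unlist().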

Comments:

Looks like a nice option.

Note that the unnesting can also be done directly in rrapply() with how = "bind", i.e. rrapply(fetch_results$data[[1]], f = function(x) replace(x, is.null(x), NA), how = "bind")
