Can't unlist an XML file after parsing


【Title】: Can't unlist an XML file after parsing 【Posted】: 2021-11-16 20:29:21 【Question】:

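I'm trying to get data from the French open database of fuel prices. The data is available here and is in XML format. The variable types (node or attribute) can be found here (section 4) or in the picture below.

My problem is that when I parse the data and then convert it to a list, the nodes are no longer treated as nodes, so the data becomes unreadable. This is the code I use (found here):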

library(XML)
temp <- XML::xmlParse("Z:/temp/PrixCarburants_annuel_2021.xml")
temp2 <- XML::xmlToList(temp)

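Does anyone know a way to get the data into the right shape? I know the XML package has a way to specify nodes, but I couldn't work out how to do it. Ideally I would retrieve the data as a data table or data frame rather than a list.

Thanks a lot!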

【Comments】:

【Answer 1】:

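Tidying up nested lists like this is always an annoying problem. My approach is to build a custom function that works on a single element, then use purrr::map() to tidy each element individually.

I've built a custom function below to get you started. It works on the "instantanee" data from the link you provided, since that was the fastest to download. The same principles (and possibly even the same code) should apply to the other datasets.

Here is some code that loads the data for the first five gas stations: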

data_list <- list(pdv = structure(list(adresse = list("RD 93 GRANDE RUE"), 
    ville = list("Camphin-en-Pévèle"), horaires = structure(list(
        jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "1", nom = "Lundi", ferme = "1"), 
        jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "2", nom = "Mardi", ferme = "1"), 
        jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "3", nom = "Mercredi", ferme = "1"), 
        jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "4", nom = "Jeudi", ferme = "1"), 
        jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "5", nom = "Vendredi", ferme = "1"), 
        jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "6", nom = "Samedi", ferme = "1"), 
        jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "7", nom = "Dimanche", ferme = "1")), "`automate-24-24`" = "1"), 
    services = list(service = list("Station de gonflage"), service = list(
        "Laverie"), service = list("Lavage automatique"), service = list(
        "Automate CB 24/24")), prix = structure(list(), nom = "Gazole", id = "1", maj = "2021-09-21 13:38:39", valeur = "1.443"), 
    prix = structure(list(), nom = "E85", id = "3", maj = "2021-08-17 11:35:16", valeur = "0.659"), 
    prix = structure(list(), nom = "E10", id = "5", maj = "2021-09-21 13:38:39", valeur = "1.526"), 
    prix = structure(list(), nom = "SP98", id = "6", maj = "2021-09-21 13:38:39", valeur = "1.607")), id = "59780003", latitude = "5059477.455", longitude = "325781.84717474", cp = "59780", pop = "R"), 
    pdv = structure(list(adresse = list("AIRE DE LACQ AUDEJOS SUD"), 
        ville = list("LACQ AUDEJOS SUD"), horaires = structure(list(
            jour = structure(list(horaire = structure(list(), ouverture = "00.00", fermeture = "23.59")), id = "1", nom = "Lundi", ferme = ""), 
            jour = structure(list(horaire = structure(list(), ouverture = "00.00", fermeture = "23.59")), id = "2", nom = "Mardi", ferme = ""), 
            jour = structure(list(horaire = structure(list(), ouverture = "00.00", fermeture = "23.59")), id = "3", nom = "Mercredi", ferme = ""), 
            jour = structure(list(horaire = structure(list(), ouverture = "00.00", fermeture = "23.59")), id = "4", nom = "Jeudi", ferme = ""), 
            jour = structure(list(horaire = structure(list(), ouverture = "00.00", fermeture = "23.59")), id = "5", nom = "Vendredi", ferme = ""), 
            jour = structure(list(horaire = structure(list(), ouverture = "00.00", fermeture = "23.59")), id = "6", nom = "Samedi", ferme = ""), 
            jour = structure(list(horaire = structure(list(), ouverture = "00.00", fermeture = "23.59")), id = "7", nom = "Dimanche", ferme = "")), "`automate-24-24`" = ""), 
        services = list(service = list("Carburant additivé"), 
            service = list("Toilettes publiques"), service = list(
                "Bar"), service = list("Boutique alimentaire"), 
            service = list("Station de gonflage"), service = list(
                "Espace bébé"), service = list("Piste poids lourds")), 
        prix = structure(list(), nom = "Gazole", id = "1", maj = "2021-09-23 00:01:00", valeur = "1.689"), 
        prix = structure(list(), nom = "GPLc", id = "4", maj = "2021-09-23 00:01:00", valeur = "0.969"), 
        prix = structure(list(), nom = "E10", id = "5", maj = "2021-09-23 00:01:00", valeur = "1.789"), 
        prix = structure(list(), nom = "SP98", id = "6", maj = "2021-09-23 00:01:00", valeur = "1.899")), id = "64170012", latitude = "4342142.6", longitude = "-59899.6", cp = "64170", pop = "A"), 
    pdv = structure(list(adresse = list("52 Avenue Léo Lagrange"), 
        ville = list("THIERS"), horaires = structure(list(jour = structure(list(
            horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "1", nom = "Lundi", ferme = "1"), 
            jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "2", nom = "Mardi", ferme = "1"), 
            jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "3", nom = "Mercredi", ferme = "1"), 
            jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "4", nom = "Jeudi", ferme = "1"), 
            jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "5", nom = "Vendredi", ferme = "1"), 
            jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "6", nom = "Samedi", ferme = "1"), 
            jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "7", nom = "Dimanche", ferme = "1")), "`automate-24-24`" = "1"), 
        services = list(service = list("DAB (Distributeur automatique de billets)"), 
            service = list("Automate CB 24/24")), prix = structure(list(), nom = "Gazole", id = "1", maj = "2021-07-01 18:00:00", valeur = "1.530"), 
        prix = structure(list(), nom = "E10", id = "5", maj = "2021-09-09 18:00:00", valeur = "1.654")), id = "63300003", latitude = "4584800", longitude = "353000", cp = "63300", pop = "R"), 
    pdv = structure(list(adresse = list("Avenue de Garossos"), 
        ville = list("Beauzelle"), services = list(service = list(
            "Boutique alimentaire"), service = list("Station de gonflage"), 
            service = list("Vente de gaz domestique (Butane, Propane)"), 
            service = list("Piste poids lourds"), service = list(
                "DAB (Distributeur automatique de billets)"), 
            service = list("Lavage automatique"), service = list(
                "Lavage manuel"), service = list("Vente de fioul domestique"), 
            service = list("Vente de pétrole lampant")), prix = structure(list(), nom = "Gazole", id = "1", maj = "2021-09-19 06:17:34", valeur = "1.432"), 
        prix = structure(list(), nom = "E85", id = "3", maj = "2021-09-19 06:17:35", valeur = "0.649"), 
        prix = structure(list(), nom = "E10", id = "5", maj = "2021-09-19 06:17:35", valeur = "1.559"), 
        prix = structure(list(), nom = "SP98", id = "6", maj = "2021-09-19 06:17:35", valeur = "1.639")), id = "31700007", latitude = "4366800", longitude = "136500", cp = "31700", pop = "R"), 
    pdv = structure(list(adresse = list("Avenue de Brommat"), 
        ville = list("Mur-de-Barrez"), services = list(service = list(
            "Carburant additivé"), service = list("DAB (Distributeur automatique de billets)")), 
        prix = structure(list(), nom = "Gazole", id = "1", maj = "2021-09-22 14:43:59", valeur = "1.510"), 
        prix = structure(list(), nom = "SP95", id = "2", maj = "2021-09-22 14:44:00", valeur = "1.690"), 
        prix = structure(list(), nom = "SP98", id = "6", maj = "2021-09-22 14:44:00", valeur = "1.740")), id = "12600002", latitude = "4484071", longitude = "266470", cp = "12600", pop = "R"))

What a mess.

Here is a function that, when applied to each element of the list, returns a tidy result:

# This function is applied to each entry in the big list. It extracts the
# data you're interested in and returns it as a tidy data frame.
# I've shown how to extract a few values to get you started.
# You will need to build the rest of this function by hand, based
# on the specific structure of the data.
# It relies on the tidyverse being attached (library(tidyverse)).
parse_vals <- function(x) {

  # get the address for this gas station
  address <- pluck(x, "adresse", 1)

  # get the latitude and longitude (stored as attributes of the entry)
  lat <- attr(x, "latitude")
  lon <- attr(x, "longitude")

  # get gas data in a data frame
  # note that some gas stations have several list items with the same
  # name ("prix" in this case), so we need to index as done below: just
  # doing `x$prix` would return only the first entry named `prix`
  gas <- purrr::map_dfr(x[names(x) == "prix"], attributes)

  # put all of our results together; the one-row station tibble is
  # recycled across the price rows
  tibble(address = address,
         lat = lat,
         lon = lon) %>%
    bind_cols(gas)
}

I use the standard tidyverse suite and the xml2 package to load the file. You can then use the function like this:

library(tidyverse)
library(xml2)

# Note this is how I loaded the full dataset: if you're using the definition of data_list I posted above using `dput()`, keep this commented out.
#data <- xml2::read_xml(filename)
#data_list <- xml2::as_list(data)[[1]]

data_list %>%
  head(5) %>%
  purrr::map_dfr(parse_vals)

It should give you nice output like this:

# A tibble: 17 x 7
   address                  lat         lon             nom    id    maj                 valeur
   <chr>                    <chr>       <chr>           <chr>  <chr> <chr>               <chr> 
 1 RD 93 GRANDE RUE         5059477.455 325781.84717474 Gazole 1     2021-09-21 13:38:39 1.443 
 2 RD 93 GRANDE RUE         5059477.455 325781.84717474 E85    3     2021-08-17 11:35:16 0.659 
 3 RD 93 GRANDE RUE         5059477.455 325781.84717474 E10    5     2021-09-21 13:38:39 1.526 
 4 RD 93 GRANDE RUE         5059477.455 325781.84717474 SP98   6     2021-09-21 13:38:39 1.607 
 5 AIRE DE LACQ AUDEJOS SUD 4342142.6   -59899.6        Gazole 1     2021-09-23 00:01:00 1.689 
 6 AIRE DE LACQ AUDEJOS SUD 4342142.6   -59899.6        GPLc   4     2021-09-23 00:01:00 0.969 
 7 AIRE DE LACQ AUDEJOS SUD 4342142.6   -59899.6        E10    5     2021-09-23 00:01:00 1.789 
 8 AIRE DE LACQ AUDEJOS SUD 4342142.6   -59899.6        SP98   6     2021-09-23 00:01:00 1.899 
 9 52 Avenue Léo Lagrange   4584800     353000          Gazole 1     2021-07-01 18:00:00 1.530 
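10 52 Avenue Léo Lagrange   4584800     353000          E10    5     2021-09-09 18:00:00 1.654 
11 Avenue de Garossos       4366800     136500          Gazole 1     2021-09-19 06:17:34 1.432 
12 Avenue de Garossos       4366800     136500          E85    3     2021-09-19 06:17:35 0.649 
13 Avenue de Garossos       4366800     136500          E10    5     2021-09-19 06:17:35 1.559 
14 Avenue de Garossos       4366800     136500          SP98   6     2021-09-19 06:17:35 1.639 
15 Avenue de Brommat        4484071     266470          Gazole 1     2021-09-22 14:43:59 1.510 
16 Avenue de Brommat        4484071     266470          SP95   2     2021-09-22 14:44:00 1.690 
17 Avenue de Brommat        4484071     266470          SP98   6     2021-09-22 14:44:00 1.740 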
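
Every column in that output is character, because XML attributes always come back as strings. If numeric prices and real timestamps are needed, a conversion step along these lines might help. This is only a sketch in base R on a two-row stand-in for the result; the `prices` data frame is made up for illustration, and the `tz = "UTC"` choice is an assumption, since the data shown here does not state a time zone.

```r
# Stand-in for two rows of the tidy result (values copied from the output above).
prices <- data.frame(
  nom = c("Gazole", "E85"),
  maj = c("2021-09-21 13:38:39", "2021-08-17 11:35:16"),
  valeur = c("1.443", "0.659"),
  stringsAsFactors = FALSE
)

# Prices are decimal strings, so as.numeric() is enough.
prices$valeur <- as.numeric(prices$valeur)

# Timestamps follow "YYYY-MM-DD HH:MM:SS"; the time zone is an assumption.
prices$maj <- as.POSIXct(prices$maj, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

str(prices)
```

The same two assignments should work unchanged on the tibble returned by the `map_dfr()` call.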

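If you want more of the data, inspect the structure of data_list (for example with str()) and add to the function parse_vals().

Note that this data is awkwardly shaped for R, because the list it returns contains many entries with the same name, such as prix. If you just write x$prix, you only get the first entry named prix. That's why I index it with x[names(x) == "prix"]. You may need this trick again.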
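
As one possible way of extending the function, the sketch below pulls the city, the postal code and the list of services out of a single entry. The helper name `station_extras` and the `example_station` object are hypothetical; the field names (`ville`, `services`, `cp`) are taken from the `dput()` output above. It assumes every entry has a `ville` element, which holds for the five stations shown but is not guaranteed for the full file.

```r
# Hypothetical helper: extract a few extra station-level fields from one entry.
station_extras <- function(x) {
  # Each service is itself a one-element list, so flatten them to a character
  # vector. If an entry has no services, unlist(NULL) is NULL and paste()
  # below returns an empty string.
  services <- unlist(x[["services"]], use.names = FALSE)

  data.frame(
    ville = x[["ville"]][[1]],   # assumes the entry has a `ville` element
    cp = attr(x, "cp"),          # the postal code is stored as an attribute
    services = paste(services, collapse = "; "),
    stringsAsFactors = FALSE
  )
}

# A cut-down entry with the same shape as the third station above.
example_station <- structure(
  list(
    ville = list("THIERS"),
    services = list(
      service = list("DAB (Distributeur automatique de billets)"),
      service = list("Automate CB 24/24")
    )
  ),
  id = "63300003", cp = "63300"
)

extras <- station_extras(example_station)
print(extras)
```

Inside `parse_vals()`, the same expressions could be added as extra columns in the `tibble()` call, so that they are repeated on every price row for the station.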
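
The difference between the two ways of indexing is easy to check on a toy list. The `station` object below is a made-up, shortened entry with two elements both named `prix`; only base R is used.

```r
# A small list with repeated names, mimicking one station entry.
station <- list(
  adresse = list("RD 93 GRANDE RUE"),
  prix = structure(list(), nom = "Gazole", valeur = "1.443"),
  prix = structure(list(), nom = "E85", valeur = "0.659")
)

# `$` stops at the first match, so only the Gazole entry comes back.
first_only <- station$prix
attr(first_only, "nom")

# Logical indexing on the names keeps every matching entry.
all_prix <- station[names(station) == "prix"]
length(all_prix)

# Collect one attribute from each of the matching entries.
noms <- vapply(all_prix, attr, character(1), which = "nom", USE.NAMES = FALSE)
noms
```

The same pattern should apply to any other repeated element name in the list, such as `jour` inside `horaires` or `service` inside `services`.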

【Comments】:

Damn... what a great answer! It works on part of the data, so I guess extending it to the rest of the full dataset will be easy! Thanks a lot!
