Can't unlist an XML file after parsing
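Posted: 2021-11-16 20:29:21
Question: I'm trying to get data from France's open database of fuel prices. The data is available here and is in XML format. The variable types (node or attribute) can be found here (section 4) or in the picture below.
My problem is that when I parse the data and then convert it to a list, the nodes are no longer treated as nodes, so the data becomes unreadable. Here is the code I used (found here):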
library(XML)
temp <- XML::xmlParse("Z:/temp/PrixCarburants_annuel_2021.xml")
temp2 <- XML::xmlToList(temp)
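Does anyone know a way to get the data into the right shape? I know the XML package has a way to specify nodes, but I couldn't work out how to do it. Ideally I would retrieve the data as a data table or data frame rather than a list.
Thanks a lot!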
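Answer 1: Tidying nested lists like this is always an annoying problem. My approach is to build a custom function that works on a single element, then use purrr::map() to tidy each element individually.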
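As a minimal sketch of that pattern, here it is on a toy nested list. The `stations` object and the `tidy_one()` helper are made up for illustration and are not part of the fuel data; `map_dfr()` is used because it row-binds the per-element results.

```r
library(purrr)
library(tibble)

# Hypothetical toy data: each element is one record with nested fields.
stations <- list(
  list(name = list("A"), info = list(price = "1.5")),
  list(name = list("B"), info = list(price = "1.7"))
)

# Tidy a single element into a one-row tibble.
tidy_one <- function(x) {
  tibble(
    name  = pluck(x, "name", 1),
    price = as.numeric(pluck(x, "info", "price"))
  )
}

# Apply the function to every element and row-bind the results.
tidy <- map_dfr(stations, tidy_one)
```

Writing the per-element function first means you can debug it on `stations[[1]]` alone before mapping it over the whole list.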
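I've built a custom function below to get you started. It works on the "instantanee" data from the link you provided, since that was the quickest to download. The same principles (and possibly even the same code) should apply to the other datasets.
Here is some code that loads the data for the first five gas stations: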
data_list <- list(pdv = structure(list(adresse = list("RD 93 GRANDE RUE"),
ville = list("Camphin-en-Pévèle"), horaires = structure(list(
jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "1", nom = "Lundi", ferme = "1"),
jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "2", nom = "Mardi", ferme = "1"),
jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "3", nom = "Mercredi", ferme = "1"),
jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "4", nom = "Jeudi", ferme = "1"),
jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "5", nom = "Vendredi", ferme = "1"),
jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "6", nom = "Samedi", ferme = "1"),
jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "7", nom = "Dimanche", ferme = "1")), "`automate-24-24`" = "1"),
services = list(service = list("Station de gonflage"), service = list(
"Laverie"), service = list("Lavage automatique"), service = list(
"Automate CB 24/24")), prix = structure(list(), nom = "Gazole", id = "1", maj = "2021-09-21 13:38:39", valeur = "1.443"),
prix = structure(list(), nom = "E85", id = "3", maj = "2021-08-17 11:35:16", valeur = "0.659"),
prix = structure(list(), nom = "E10", id = "5", maj = "2021-09-21 13:38:39", valeur = "1.526"),
prix = structure(list(), nom = "SP98", id = "6", maj = "2021-09-21 13:38:39", valeur = "1.607")), id = "59780003", latitude = "5059477.455", longitude = "325781.84717474", cp = "59780", pop = "R"),
pdv = structure(list(adresse = list("AIRE DE LACQ AUDEJOS SUD"),
ville = list("LACQ AUDEJOS SUD"), horaires = structure(list(
jour = structure(list(horaire = structure(list(), ouverture = "00.00", fermeture = "23.59")), id = "1", nom = "Lundi", ferme = ""),
jour = structure(list(horaire = structure(list(), ouverture = "00.00", fermeture = "23.59")), id = "2", nom = "Mardi", ferme = ""),
jour = structure(list(horaire = structure(list(), ouverture = "00.00", fermeture = "23.59")), id = "3", nom = "Mercredi", ferme = ""),
jour = structure(list(horaire = structure(list(), ouverture = "00.00", fermeture = "23.59")), id = "4", nom = "Jeudi", ferme = ""),
jour = structure(list(horaire = structure(list(), ouverture = "00.00", fermeture = "23.59")), id = "5", nom = "Vendredi", ferme = ""),
jour = structure(list(horaire = structure(list(), ouverture = "00.00", fermeture = "23.59")), id = "6", nom = "Samedi", ferme = ""),
jour = structure(list(horaire = structure(list(), ouverture = "00.00", fermeture = "23.59")), id = "7", nom = "Dimanche", ferme = "")), "`automate-24-24`" = ""),
services = list(service = list("Carburant additivé"),
service = list("Toilettes publiques"), service = list(
"Bar"), service = list("Boutique alimentaire"),
service = list("Station de gonflage"), service = list(
"Espace bébé"), service = list("Piste poids lourds")),
prix = structure(list(), nom = "Gazole", id = "1", maj = "2021-09-23 00:01:00", valeur = "1.689"),
prix = structure(list(), nom = "GPLc", id = "4", maj = "2021-09-23 00:01:00", valeur = "0.969"),
prix = structure(list(), nom = "E10", id = "5", maj = "2021-09-23 00:01:00", valeur = "1.789"),
prix = structure(list(), nom = "SP98", id = "6", maj = "2021-09-23 00:01:00", valeur = "1.899")), id = "64170012", latitude = "4342142.6", longitude = "-59899.6", cp = "64170", pop = "A"),
pdv = structure(list(adresse = list("52 Avenue Léo Lagrange"),
ville = list("THIERS"), horaires = structure(list(jour = structure(list(
horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "1", nom = "Lundi", ferme = "1"),
jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "2", nom = "Mardi", ferme = "1"),
jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "3", nom = "Mercredi", ferme = "1"),
jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "4", nom = "Jeudi", ferme = "1"),
jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "5", nom = "Vendredi", ferme = "1"),
jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "6", nom = "Samedi", ferme = "1"),
jour = structure(list(horaire = structure(list(), ouverture = "01.00", fermeture = "01.00")), id = "7", nom = "Dimanche", ferme = "1")), "`automate-24-24`" = "1"),
services = list(service = list("DAB (Distributeur automatique de billets)"),
service = list("Automate CB 24/24")), prix = structure(list(), nom = "Gazole", id = "1", maj = "2021-07-01 18:00:00", valeur = "1.530"),
prix = structure(list(), nom = "E10", id = "5", maj = "2021-09-09 18:00:00", valeur = "1.654")), id = "63300003", latitude = "4584800", longitude = "353000", cp = "63300", pop = "R"),
pdv = structure(list(adresse = list("Avenue de Garossos"),
ville = list("Beauzelle"), services = list(service = list(
"Boutique alimentaire"), service = list("Station de gonflage"),
service = list("Vente de gaz domestique (Butane, Propane)"),
service = list("Piste poids lourds"), service = list(
"DAB (Distributeur automatique de billets)"),
service = list("Lavage automatique"), service = list(
"Lavage manuel"), service = list("Vente de fioul domestique"),
service = list("Vente de pétrole lampant")), prix = structure(list(), nom = "Gazole", id = "1", maj = "2021-09-19 06:17:34", valeur = "1.432"),
prix = structure(list(), nom = "E85", id = "3", maj = "2021-09-19 06:17:35", valeur = "0.649"),
prix = structure(list(), nom = "E10", id = "5", maj = "2021-09-19 06:17:35", valeur = "1.559"),
prix = structure(list(), nom = "SP98", id = "6", maj = "2021-09-19 06:17:35", valeur = "1.639")), id = "31700007", latitude = "4366800", longitude = "136500", cp = "31700", pop = "R"),
pdv = structure(list(adresse = list("Avenue de Brommat"),
ville = list("Mur-de-Barrez"), services = list(service = list(
"Carburant additivé"), service = list("DAB (Distributeur automatique de billets)")),
prix = structure(list(), nom = "Gazole", id = "1", maj = "2021-09-22 14:43:59", valeur = "1.510"),
prix = structure(list(), nom = "SP95", id = "2", maj = "2021-09-22 14:44:00", valeur = "1.690"),
prix = structure(list(), nom = "SP98", id = "6", maj = "2021-09-22 14:44:00", valeur = "1.740")), id = "12600002", latitude = "4484071", longitude = "266470", cp = "12600", pop = "R"))
It's quite a mess.
Here is a function that, when applied to each element of the list, returns a tidy result:
# This function is applied to each entry in the big list. It extracts the
# data you're interested in and returns it as a tidy data frame.
# I've shown how to extract a few values to get you started.
# You will need to build the rest of this function by hand, based
# on the specific structure of the data.
parse_vals <- function(x) {
  # get the address for this gas station
  address <- pluck(x, "adresse", 1)
  # get the latitude and longitude (stored as attributes of the entry)
  lat <- attr(x, "latitude")
  lon <- attr(x, "longitude")
  # get gas data in a data frame
  # note that for some gas stations there are several list items with the same
  # name ("prix" in this case) so we need to index in the way done below--just
  # doing `x$prix` will return only the first entry named `prix`
  gas <- purrr::map_dfr(x[names(x) == "prix"], attributes)
  # put all of our results together
  tibble(address = address,
         lat = lat,
         lon = lon) %>%
    bind_cols(gas)
}
I used the standard tidyverse suite and the xml2 package to load the file. You can then use the function like this:
library(tidyverse)
library(xml2)
# Note this is how I loaded the full dataset: if you're using the definition of data_list I posted above using `dput()`, keep this commented out.
#data <- xml2::read_xml(filename)
#data_list <- xml2::as_list(data)[[1]]
data_list %>%
head(5) %>%
purrr::map_dfr(parse_vals)
It should give you nice output like this:
# A tibble: 17 x 7
   address                  lat         lon             nom    id    maj                 valeur
   <chr>                    <chr>       <chr>           <chr>  <chr> <chr>               <chr>
 1 RD 93 GRANDE RUE         5059477.455 325781.84717474 Gazole 1     2021-09-21 13:38:39 1.443
 2 RD 93 GRANDE RUE         5059477.455 325781.84717474 E85    3     2021-08-17 11:35:16 0.659
 3 RD 93 GRANDE RUE         5059477.455 325781.84717474 E10    5     2021-09-21 13:38:39 1.526
 4 RD 93 GRANDE RUE         5059477.455 325781.84717474 SP98   6     2021-09-21 13:38:39 1.607
 5 AIRE DE LACQ AUDEJOS SUD 4342142.6   -59899.6        Gazole 1     2021-09-23 00:01:00 1.689
 6 AIRE DE LACQ AUDEJOS SUD 4342142.6   -59899.6        GPLc   4     2021-09-23 00:01:00 0.969
 7 AIRE DE LACQ AUDEJOS SUD 4342142.6   -59899.6        E10    5     2021-09-23 00:01:00 1.789
 8 AIRE DE LACQ AUDEJOS SUD 4342142.6   -59899.6        SP98   6     2021-09-23 00:01:00 1.899
 9 52 Avenue Léo Lagrange   4584800     353000          Gazole 1     2021-07-01 18:00:00 1.530
10 52 Avenue Léo Lagrange   4584800     353000          E10    5     2021-09-09 18:00:00 1.654
# ... with 7 more rows
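Every column in that output is character (`<chr>`), since the values come straight from XML attributes. A possible follow-up step is to convert the price and timestamp columns, sketched here on a hand-built two-row tibble rather than the real result. The `prices` object is illustrative, and the time zone is an assumption because the source does not state one.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for two rows of the parsed result.
prices <- tibble(
  nom    = c("Gazole", "E85"),
  maj    = c("2021-09-21 13:38:39", "2021-08-17 11:35:16"),
  valeur = c("1.443", "0.659")
)

prices_typed <- prices %>%
  mutate(
    # Prices become numeric so they can be averaged, plotted, etc.
    valeur = as.numeric(valeur),
    # Assumption: the timestamps are local French time.
    maj    = as.POSIXct(maj, format = "%Y-%m-%d %H:%M:%S", tz = "Europe/Paris")
  )
```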
If you want more of the data, you can inspect the structure of data_list and add to the parse_vals() function.
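For instance, one possible extension is to collapse each station's services into a single string. This is only a sketch: the helper name `get_services()` and the "; " separator are illustrative choices, and the small `x` object is a hand-built miniature shaped like the entries of data_list.

```r
# Hypothetical miniature of one station, shaped like the entries in data_list.
x <- structure(
  list(
    adresse = list("Avenue de Brommat"),
    services = list(
      service = list("Carburant additivé"),
      service = list("DAB (Distributeur automatique de billets)")
    )
  ),
  id = "12600002"
)

# Collapse all service entries of one station into a single string.
# Returns NA when the station lists no services.
get_services <- function(x) {
  services <- unlist(x[["services"]], use.names = FALSE)
  if (length(services) == 0) NA_character_ else paste(services, collapse = "; ")
}

station_services <- get_services(x)
```

Inside parse_vals(), the result could then be added as one more column, for example `services = get_services(x)` in the tibble() call.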
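Note that this data is awkwardly shaped for R, because the list it produces contains many entries with the same name, such as prix. If you just do x$prix, you only get the first entry named prix. That's why I index it with x[names(x) == "prix"]. You may need to use this trick again.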
Comments:
Damn... what a great answer! It works on part of the data, so I think extending it to the rest of the full dataset will be easy! Thanks a lot!