How can I crawl/scrape (using R) the non-table EPA CompTox Dashboard?
Posted: 2022-01-12 15:52:35
Question: The EPA CompTox Chemical Dashboard received an update, and my old code can no longer retrieve the boiling points of chemicals. Could someone help me scrape the experimental average boiling point? I need to be able to write R code that can loop over multiple chemicals.
Example pages:
Acetone: https://comptox.epa.gov/dashboard/chemical/properties/DTXSID8021482
Methane: https://comptox.epa.gov/dashboard/chemical/properties/DTXSID8025545
I tried read_html() and xmlParse(), both without success. The experimental average boiling point (ExpAvBP) value does not appear in the XML.
I tried ContentScraper() from Rcrawler, but whatever I try it only returns NA. This would also only work for the first page listed, because the cell ID changes with each chemical.
ContentScraper(Url="https://comptox.epa.gov/dashboard/chemical/properties/DTXSID8021482", XpathPatterns = "//*[@id='cell-225']")
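I tried readLines(), but the information is all crammed into the last script tag, and I am not sure how to isolate just the ExpAvBP value. It looks like the values are stored somewhere else? For example, below is what I believe is the boiling point information in the last script tag.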
Acetone:
unit:c_,name:"Boiling Point",predicted:rawData:[value:c$,minValue:e,maxValue:e,source:am,description:an,modelName:"TEST_BP",modelId:T,hasOpera:d,globalApplicability:e,hasQmrfPdf:d,details:value:B,link:"https:\u002F\u002Fs3.amazonaws.com\u002Fepa-comptox\u002Ftest-reports\u002FDTXCID101482-TEST_BP.html",showLink:a,qmrf:value:e,link:e,showLink:d,value:44.8,minValue:e,maxValue:e,source:ci,description:cj,modelName:"EPISUITE_BP",modelId:dV,hasOpera:d,globalApplicability:e,hasQmrfPdf:d,details:value:M,link:e,showLink:d,qmrf:value:e,link:e,showLink:d,value:46.458,minValue:e,maxValue:e,source:ad,description:V,modelName:"ACD_BP",modelId:135,hasOpera:d,globalApplicability:e,hasQmrfPdf:d,details:value:M,link:e,showLink:d,qmrf:value:e,link:e,showLink:d,value:da,minValue:e,maxValue:e,source:aL,description:bo,modelName:"OPERA_BP",modelId:dS,hasOpera:a,globalApplicability:q,hasQmrfPdf:a,details:value:B,link:"http:\u002F\u002Fcomptox-dev.epa.gov\u002Fdashboard\u002Fdsstoxdb\u002Fcalculation_details?model_id=27&search=21482",showLink:a,qmrf:value:B,link:"http:\u002F\u002Fcomptox-dev.epa.gov\u002Fdashboard\u002Fdsstoxdb\u002Fdownload_qmrf_pdf?model=27",showLink:a],count:bu,mean:47.06289999999999,min:c$,max:da,range:[c$,da],median:45.629,experimental:rawData:[value:db,minValue:e,maxValue:e,source:aN,description:aO,experimentalDetails:[],value:ak,minValue:ak,maxValue:ak,source:ck,description:cl,experimentalDetails:[],value:ak,minValue:ak,maxValue:ak,source:ck,description:cl,experimentalDetails:[],value:ak,minValue:ak,maxValue:ak,source:"Food and Agriculture Organization of the United Nations",description:"The Joint FAO\u002FWHO Expert Committee on Food Additives (JECFA) is an international expert scientific committee that is administered jointly by the Food and Agriculture Organization of the United Nations (FAO) and the World Health Organization (WHO). Website: \u003Ca href="http:\u002F\u002Fwww.fao.org\u002Fhome\u002F" target="_blank"\u003Ehttp:\u002F\u002Fwww.fao.org\u002Fhome\u002F\u003C\u002Fa\u003E",experimentalDetails:[],value:56.05,minValue:e,maxValue:e,source:"Abooali et al. Int. J. Refrig. 2014, 40, 282–293",description:"Abooali, D.; Sobati, M. A. Novel method for prediction of normal boiling point and enthalpy of vaporization at normal boiling point of pure refrigerants: A QSPR approach. (\u003Ca href="http:\u002F\u002Fdx.doi.org\u002F10.1016\u002Fj.ijrefrig.2013.12.007" target="_blank"\u003EInt. J. Refrig. 2014, 40, 282–293\u003C\u002Fa\u003E)\r\n",experimentalDetails:[],value:bO,minValue:bO,maxValue:bO,source:hI,description:hJ,experimentalDetails:[]],count:dK,mean:55.98518333333333,min:db,max:bO,range:[db,bO],median:ak,arrKey:"BOILING_POINT"
Methane:
unit:cO,name:"Boiling Point",predicted:rawData:[value:at,minValue:f,maxValue:f,source:bB,description:bb,modelName:"ACD_BP",modelId:135,hasOpera:d,globalApplicability:f,hasQmrfPdf:d,details:value:ag,link:f,showLink:d,qmrf:value:f,link:f,showLink:d,value:hl,minValue:f,maxValue:f,source:aF,description:ba,modelName:"OPERA_BP",modelId:dv,hasOpera:a,globalApplicability:s,hasQmrfPdf:a,details:value:O,link:"http:\u002F\u002Fcomptox-dev.epa.gov\u002Fdashboard\u002Fdsstoxdb\u002Fcalculation_details?model_id=27&search=25545",showLink:a,qmrf:value:O,link:"http:\u002F\u002Fcomptox-dev.epa.gov\u002Fdashboard\u002Fdsstoxdb\u002Fdownload_qmrf_pdf?model=27",showLink:a,value:cP,minValue:f,maxValue:f,source:bZ,description:b_,modelName:"EPISUITE_BP",modelId:dy,hasOpera:d,globalApplicability:f,hasQmrfPdf:d,details:value:ag,link:f,showLink:d,qmrf:value:f,link:f,showLink:d],count:bH,mean:-129.25300000000001,min:at,max:cP,range:[at,cP],median:hl,experimental:rawData:[value:at,minValue:at,maxValue:at,source:hm,description:hn,experimentalDetails:[],value:cQ,minValue:f,maxValue:f,source:bC,description:bD,experimentalDetails:[]],count:H,mean:ho,min:at,max:cQ,range:[at,cQ],median:ho,arrKey:"BOILING_POINT"
Any help or insight would be greatly appreciated!
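A note on the two excerpts: in the acetone block the experimental mean appears as a literal number (mean:55.98518333333333), while in the methane block it is a short placeholder (mean:ho) whose value must be defined elsewhere in the script. The following is a minimal sketch, not a complete solution, of how the token after mean: in the experimental boiling-point block could be isolated. It assumes script_text holds the contents of the last script tag as a single string; the helper name experimental_bp_mean is hypothetical. It only yields a usable number when the value is a literal.

```r
# Hypothetical helper: return the token that follows "mean:" in the
# experimental block that ends with arrKey:"BOILING_POINT".
experimental_bp_mean <- function(script_text) {
  # Locate the end marker of the boiling-point entry
  end <- regexpr('arrKey:"BOILING_POINT"', script_text, fixed = TRUE)
  if (end[1] < 0) return(NA_character_)
  before <- substr(script_text, 1, end[1] - 1)

  # Use the LAST "experimental:" before the marker, so that blocks belonging
  # to other properties earlier in the script are skipped
  starts <- gregexpr("experimental:", before, fixed = TRUE)[[1]]
  if (starts[1] < 0) return(NA_character_)
  block <- substr(before, starts[length(starts)], nchar(before))

  # First "mean:" inside that block; capture up to the next comma or brace
  hit <- regmatches(block, regexec("mean:([^,}]+)", block))[[1]]
  if (length(hit) < 2) NA_character_ else hit[2]
}

# Shortened excerpts from the question (illustration only)
acetone <- 'mean:47.06289999999999,median:45.629,experimental:rawData:[value:db,experimentalDetails:[]],count:dK,mean:55.98518333333333,min:db,arrKey:"BOILING_POINT"'
methane <- 'experimental:rawData:[value:at,experimentalDetails:[]],count:H,mean:ho,min:at,arrKey:"BOILING_POINT"'

experimental_bp_mean(acetone)  # "55.98518333333333"
experimental_bp_mean(methane)  # "ho" (a placeholder, not a number)
```

For acetone the result can be passed to as.numeric(). For methane the placeholder would still have to be resolved, which is one reason to read the rendered page instead.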
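Answer 1:
Since the data is not in tabular form, we have to extract the page text and pull out the boiling temperature by matching the pattern BoilingPoint.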
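library(rvest)
library(dplyr)
library(stringr)   # provides str_remove_all() and str_replace_all(), used further down
library(RSelenium)
url = 'https://comptox.epa.gov/dashboard/chemical/properties/DTXSID8025545'
# Start a Selenium-driven Firefox session and open the page
driver = rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
remDr$navigate(url)
# Parse the rendered page source and extract the text of the properties block
df = remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="__layout"]/div/div[5]/div[2]/main/div/div[3]/div[2]/div/div[2]/div[2]/div[3]') %>%
  html_text()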
Now extract the boiling temperature. See https://***.com/a/35936065/12135618
df1 = df %>% str_remove_all( '\n') %>% str_replace_all( ' ', '')
as.numeric(sub(".*?BoilingPoint.*?(\\d+).*", "\\1", df1))
[1] 163
You may need to fine-tune this further to capture the decimal part of the boiling temperature.
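One possible refinement is sketched below. The pattern (\\d+) only captures a run of digits, so a leading minus sign and any decimals are lost. The helper name extract_bp and the sample strings are hypothetical. The sketch assumes, as in the code above, that whitespace has already been stripped and that the first number after BoilingPoint is the value of interest.

```r
# Hypothetical helper: return the first signed decimal number that follows
# "BoilingPoint" in each element of txt, or NA when nothing matches.
extract_bp <- function(txt) {
  # -?            optional minus sign
  # \\d+          integer part
  # (?:\\.\\d+)?  optional decimal part
  pattern <- "BoilingPoint.*?(-?\\d+(?:\\.\\d+)?)"
  hits <- regmatches(txt, regexec(pattern, txt, perl = TRUE))
  vapply(
    hits,
    function(h) if (length(h) >= 2) as.numeric(h[2]) else NA_real_,
    numeric(1)
  )
}

# Made-up sample strings, for illustration only
extract_bp("BoilingPointExperimentalaverage-161.5(2)")  # -161.5
extract_bp(c("BoilingPoint56.05", "nothing here"))      # 56.05 NA
```

With the page text from above, this would be called as extract_bp(df1).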
Comments:
Is there an alternate xpath that can be used for different web pages? (The span ID is not always "cell-225".)
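One way to sidestep per-chemical cell IDs, offered as an untested sketch rather than a confirmed fix, is to skip the specific XPath and take the text of the whole rendered body, then search that text for the BoilingPoint pattern. The helpers property_url and page_text are hypothetical names. The URL pattern is taken from the two example pages in the question, and the fixed five-second wait is an assumption about how long the page needs to render.

```r
# Build the properties-page URL for one or more DTXSIDs
# (URL pattern taken from the example pages in the question).
property_url <- function(dtxsid) {
  paste0("https://comptox.epa.gov/dashboard/chemical/properties/", dtxsid)
}

# Hypothetical helper: load one page in an existing RSelenium client and
# return the text of the whole rendered <body>, so no cell ID is needed.
page_text <- function(remDr, dtxsid, wait = 5) {
  remDr$navigate(property_url(dtxsid))
  Sys.sleep(wait)  # crude pause for rendering; the duration is an assumption
  html <- rvest::read_html(remDr$getPageSource()[[1]])
  rvest::html_text(rvest::html_node(html, "body"))
}

# Usage, with a running Selenium session (remDr) as in the answer:
# ids   <- c(acetone = "DTXSID8021482", methane = "DTXSID8025545")
# texts <- vapply(ids, function(id) page_text(remDr, id), character(1))
```

Whether the first number after the label in the full body text is really the experimental average would need to be checked against the rendered page for a few chemicals.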