How to scrape the affiliation of each professor (author) of a particular journal article
【Posted】: 2020-03-10 17:02:36
【Question】: The website I want to scrape is ScienceDirect. The affiliations only become available after clicking the "Show more" button. I can click the button, but I cannot scrape the affiliations that are loaded after the click. Here is the code; the for loop does not print the dl tag that contains the affiliation.
import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
driver = webdriver.Firefox()
driver.get('https://www.sciencedirect.com/science/article/pii/S1571065308000656')
# Note: the soup is built here, before the "Show more" button is clicked
soup = BeautifulSoup(driver.page_source, 'html.parser')
time.sleep(7)
try:
    # find_element(By.CSS_SELECTOR, ...) replaces find_element_by_css_selector,
    # which was deprecated in Selenium 4 and later removed
    element = driver.find_element(By.CSS_SELECTOR, '.show-hide-details.u-font-sans')
    element.click()
    time.sleep(15)
    for data in soup.find(id='author-group'):
        print(data)
        print('---')
except NoSuchElementException:
    pass
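【Answer 1】: I think you need to move the soup instantiation to after you have clicked the "Show more" button. A soup built before the click is a snapshot of the page source at that moment, so it never contains the affiliations that load afterwards.
If I run the following code: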
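import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
driver = webdriver.Firefox()
driver.get('https://www.sciencedirect.com/science/article/pii/S1571065308000656')
time.sleep(3)
try:
    # find_element(By.CSS_SELECTOR, ...) replaces find_element_by_css_selector,
    # which was deprecated in Selenium 4 and later removed
    element = driver.find_element(By.CSS_SELECTOR, '.show-hide-details.u-font-sans')
    element.click()
    time.sleep(9)
    # Build the soup only after the click, so it includes the newly loaded affiliations
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for data in soup.find(id='author-group'):
        print(data)
        print('---')
except NoSuchElementException:
    pass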
My output is:
<span class="sr-only">Author links open overlay panel</span>
---
<a class="author size-m workspace-trigger" href="#!" name="baep-author-id6"><span class="content"><span class="text given-name">Ignaz</span><span class="text surname">Rutter</span><span class="author-ref" id="bfn001"><sup>1</sup></span><svg class="icon icon-envelope" focusable="false" viewbox="0 0 102 128" ><path d="m55.8 57.2c-1.78 1.31-5.14 1.31-6.9 0l-31.32-23.2h69.54l-31.32 23.19zm-55.8-24.78l42.94 32.62c2.64 1.95 6.02 2.93 9.4 2.93s6.78-0.98 9.42-2.93l40.24-30.7v-10.34h-102zm92 56.48l-18.06-22.74-8.04 5.95 17.38 21.89h-64.54l18.38-23.12-8.04-5.96-19.08 24.02v-37.58l-1e1 -8.46v61.1h102v-59.18l-1e1 8.46v35.62"></path></svg></span></a>
---
<dl class="affiliation"><dd>Fakultät für Informatik, Universität Karlsruhe, Germany</dd></dl>
---
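Once the soup is built after the click, the name and affiliation can be pulled out as plain text instead of printing raw tags. The following is only a minimal sketch, assuming the markup matches the output above: the class names `given-name`, `surname`, and `affiliation` are taken from that output, while `SAMPLE_HTML`, `extract_authors`, and the wrapping `div` element are hypothetical and used purely for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup printed above; the <div> wrapper is an assumption.
SAMPLE_HTML = """
<div id="author-group">
  <span class="sr-only">Author links open overlay panel</span>
  <a class="author size-m workspace-trigger" href="#!" name="baep-author-id6">
    <span class="content">
      <span class="text given-name">Ignaz</span>
      <span class="text surname">Rutter</span>
    </span>
  </a>
  <dl class="affiliation"><dd>Fakultät für Informatik, Universität Karlsruhe, Germany</dd></dl>
</div>
"""


def extract_authors(html):
    """Return (author names, affiliations) found inside #author-group."""
    soup = BeautifulSoup(html, "html.parser")
    group = soup.find(id="author-group")
    if group is None:  # element missing, e.g. the page layout changed
        return [], []
    names = []
    for link in group.select("a.author"):
        given = link.select_one(".given-name")
        surname = link.select_one(".surname")
        parts = [t.get_text(strip=True) for t in (given, surname) if t is not None]
        names.append(" ".join(parts))
    affiliations = [dd.get_text(strip=True) for dd in group.select("dl.affiliation dd")]
    return names, affiliations


names, affiliations = extract_authors(SAMPLE_HTML)
print(names)
print(affiliations)
```

In the Selenium script, the same function could be called with `driver.page_source` taken after the click. Returning two flat lists keeps the sketch short; it does not try to match each author to a specific affiliation.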
【Answer 2】: The data is loaded from a script tag, which means you can use requests alone: extract the script content and parse it with the json library.
import requests, json
from bs4 import BeautifulSoup as bs
# headers must be a dict; the braces were missing
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.sciencedirect.com/science/article/pii/S1571065308000656'
r = requests.get(url, headers=headers)
soup = bs(r.content, 'lxml')
# The article data is embedded as JSON in a script tag
data = json.loads(soup.select_one('[type="application/json"]').text)
for author in data['authors']['content']:
    # Given name and surname
    print(' '.join([author['$$'][0]['$$'][0]['_'], author['$$'][0]['$$'][1]['_']]))
    # Affiliation
    print(author['$$'][1]['$$'][0]['_'])
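The positional lookups above raise `IndexError` or `KeyError` as soon as one entry is missing a surname or an affiliation. A more defensive traversal might look like the sketch below. The payload `SAMPLE_JSON` is hypothetical and shaped only to match the index paths used in the answer (the real page JSON may contain more keys and deeper nesting), and `dig` and `author_rows` are illustrative helper names, not part of any library.

```python
import json

# Hypothetical payload shaped only to match the index paths used above.
SAMPLE_JSON = """
{"authors": {"content": [
  {"$$": [
    {"$$": [{"_": "Ignaz"}, {"_": "Rutter"}]},
    {"$$": [{"_": "Fakultät für Informatik, Universität Karlsruhe, Germany"}]}
  ]},
  {"$$": [
    {"$$": [{"_": "Anon"}]}
  ]}
]}}
"""


def dig(node, *path):
    """Follow the keys/indices in `path`; return None if any step is missing."""
    for step in path:
        try:
            node = node[step]
        except (KeyError, IndexError, TypeError):
            return None
    return node


def author_rows(data):
    """Yield (full name, affiliation or None) for each entry."""
    for author in dig(data, "authors", "content") or []:
        given = dig(author, "$$", 0, "$$", 0, "_")
        surname = dig(author, "$$", 0, "$$", 1, "_")
        name = " ".join(part for part in (given, surname) if part)
        affiliation = dig(author, "$$", 1, "$$", 0, "_")
        yield name, affiliation


rows = list(author_rows(json.loads(SAMPLE_JSON)))
for name, affiliation in rows:
    print(name, "|", affiliation)
```

With the live page, `json.loads(SAMPLE_JSON)` would be replaced by the `data` object parsed from the script tag. Entries with missing pieces then yield `None` for the affiliation instead of aborting the loop.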