Executing Javascript Submit form functions using scrapy in python
【Posted】: 2012-05-25 19:01:32
【Question】: I am using the Scrapy framework to scrape a website, and I am having trouble clicking a JavaScript link that opens another page.
I can identify the code on the page as:
<a class="Page" title="Click to view job description" href="javascript:sysSubmitForm('frmSR1');">Accountant </a>
Can anyone suggest how to execute that JavaScript in Scrapy and reach the other page, so that I can fetch data from it?
Thanks in advance.
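【Answer 1】: Check out the snippet below on how to use Scrapy with Selenium. Crawling will be slower, since you are not just downloading the HTML but getting full access to the DOM.
Note: I have copy-pasted this snippet because the links previously provided no longer work.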
# Snippet imported from snippets.scrapy.org (which no longer works)
import time

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy.item import Item
from selenium import selenium


class SeleniumSpider(CrawlSpider):
    name = "SeleniumSpider"
    start_urls = ["http://www.domain.com"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'\.html', )),
             callback='parse_page', follow=True),
    )

    def __init__(self):
        CrawlSpider.__init__(self)
        self.verificationErrors = []
        # Connect to a Selenium server listening on localhost:4444
        self.selenium = selenium("localhost", 4444, "*chrome", "http://www.domain.com")
        self.selenium.start()

    def __del__(self):
        # Shut the browser session down when the spider is destroyed
        self.selenium.stop()
        print(self.verificationErrors)

    def parse_page(self, response):
        item = Item()

        hxs = HtmlXPathSelector(response)
        # Do some XPath selection with Scrapy
        hxs.select('//div').extract()

        sel = self.selenium
        sel.open(response.url)

        # Wait for JavaScript to load in Selenium
        time.sleep(2.5)

        # Do some crawling of JavaScript-created content with Selenium
        sel.get_text("//div")
        yield item
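A fixed `time.sleep(2.5)` is either wasted time or too short on a slow page. One common alternative is to poll until the content is actually there. The following is a minimal sketch using only the standard library; `wait_until`, its parameters, and `content_ready` are hypothetical names for illustration and are not part of Scrapy or Selenium (Selenium's WebDriver bindings ship their own `WebDriverWait` for the same purpose).

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.25):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises RuntimeError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise RuntimeError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Stand-in for "has the JavaScript finished rendering?":
# this fake check becomes true on the third poll.
calls = {"count": 0}


def content_ready():
    calls["count"] += 1
    return calls["count"] >= 3


wait_until(content_ready, timeout=2.0, poll_interval=0.01)
```

In a spider, `condition` would be a callable that asks the browser whether the element you need is present, so the wait ends as soon as the page is ready.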
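【Comments】:
Neither of those links works anymore, which is why this site asks you to at least summarize the page here. Can you say a bit more, or find the original answer? Thanks!
【Answer 2】: If you would like to see a fairly large, working code base that uses Scrapy and Selenium, check out https://github.com/nicodjimenez/bus_catchers. Here is a simpler example.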
# stripped down BoltBus script
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Response
from scrapy.http import TextResponse
import time

# set dates, origin, destination
cityOrigin="Baltimore"
cityDeparture="New York"
day_array=[0]

# Create a new instance of the Firefox driver
browser = webdriver.Firefox()

# we are going the day of the days of the month from 15,16,...,25
# there is a discrepancy between the index of the calendar days and the day itself: for example day[10] may correspond to Feb 7th
for day in day_array:

    # load the BoltBus home page
    browser.get("http://www.boltbus.com")

    # click on "region" tab
    elem_0=browser.find_element_by_id("ctl00_cphM_forwardRouteUC_lstRegion_textBox")
    elem_0.click()
    time.sleep(5)

    # select Northeast
    elem_1=browser.find_element_by_partial_link_text("Northeast")
    elem_1.click()
    time.sleep(5)

    # click on origin city
    elem_2=browser.find_element_by_id("ctl00_cphM_forwardRouteUC_lstOrigin_textBox")
    elem_2.click()
    time.sleep(5)

    # select origin city
    elem_3=browser.find_element_by_partial_link_text(cityOrigin)
    elem_3.click()
    time.sleep(5)

    # click on destination city
    elem_4=browser.find_element_by_id("ctl00_cphM_forwardRouteUC_lstDestination_textBox")
    elem_4.click()
    time.sleep(5)

    # select destination city
    elem_5=browser.find_element_by_partial_link_text(cityDeparture)
    elem_5.click()
    time.sleep(5)

    # click on travel date
    travel_date_elem=browser.find_element_by_id("ctl00_cphM_forwardRouteUC_imageE")
    travel_date_elem.click()

    # gets day rows of table
    date_rows=browser.find_elements_by_class_name("daysrow")

    # select actual day (use variable day)
    # NOTE: you must make sure these day elements are "clickable"
    days=date_rows[0].find_elements_by_xpath("..//td")
    days[day].click()
    time.sleep(3)

    # retrieve actual departure date from browser
    depart_date_elem=browser.find_element_by_id("ctl00_cphM_forwardRouteUC_txtDepartureDate")
    depart_date=str(depart_date_elem.get_attribute("value"))

    # PARSE TABLE

    # convert html to "nice format"
    text_html=browser.page_source.encode('utf-8')
    html_str=str(text_html)

    # this is a hack that initiates a "TextResponse" object (taken from the Scrapy module)
    resp_for_scrapy=TextResponse('none',200,{},html_str,[],None)

    # takes a "TextResponse" object and feeds it to a scrapy function which will convert the raw HTML to a XPath document tree
    hxs=HtmlXPathSelector(resp_for_scrapy)

    # the | sign means "or"
    table_rows=hxs.select('//tr[@class="fareviewrow"] | //tr[@class="fareviewaltrow"]')

    for cur_node_elements in table_rows:
        # I use a mixture of xpath selectors to get me to the right location in the document, and regular expressions to get the exact data
        travel_price=cur_node_elements.select('.//td[@class="faresColumn0"]/text()').re(r"\d{1,3}\.\d\d")

        # actual digits of time
        depart_time_num=cur_node_elements.select('.//td[@class="faresColumn1"]/text()').re(r"\d{1,2}:\d\d")

        # AM or PM (time signature)
        depart_time_sig=cur_node_elements.select('.//td[@class="faresColumn1"]/text()').re(r"[AP]M")

        # actual digits of time
        arrive_time_num=cur_node_elements.select('.//td[@class="faresColumn2"]/text()').re(r"\d{1,2}:\d\d")

        # AM or PM (time signature)
        arrive_time_sig=cur_node_elements.select('.//td[@class="faresColumn2"]/text()').re(r"[AP]M")

        print("Depart date: " + depart_date)
        print("Depart time: " + depart_time_num[0] + " " + depart_time_sig[0])
        print("Arrive time: " + arrive_time_num[0] + " " + arrive_time_sig[0])
        print("Cost: " + "$" + travel_price[0])
        print("\n")
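The row parsing in this script combines XPath (to reach the right table cell) with regular expressions (to pull out the exact value). Here is a minimal sketch of the regex step on its own, using the standard `re` module. The sample cell texts are made up for illustration and are not real BoltBus data; the patterns assume a price like `15.00` and a time like `7:30 AM`.

```python
import re

# Hypothetical cell texts, standing in for what the XPath selectors return.
price_cell = " $15.00 "
depart_cell = " 7:30 AM "

PRICE_RE = re.compile(r"\d{1,3}\.\d\d")  # one to three digits, a dot, two digits
TIME_RE = re.compile(r"\d{1,2}:\d\d")    # hour (one or two digits), colon, minutes
AM_PM_RE = re.compile(r"[AP]M")          # time signature

# findall returns a list of matches, so index 0 is the first hit.
travel_price = PRICE_RE.findall(price_cell)
depart_time_num = TIME_RE.findall(depart_cell)
depart_time_sig = AM_PM_RE.findall(depart_cell)

print("Depart time: " + depart_time_num[0] + " " + depart_time_sig[0])
print("Cost: $" + travel_price[0])
```

Indexing with `[0]` raises `IndexError` when a cell does not match, so a real scraper would check that each list is non-empty first.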
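【Comments】:
Hey @nicodjimenez, thanks for the code. I got it to work, except for what seems to be the date selection. I don't understand what you mean by "# we are going the day of the days from 15,16,...,25". Also, you note: "you must make sure these day elements are "clickable"". Could you elaborate?
【Answer 3】: As far as I know, Scrapy spiders fetch pages over plain HTTP, the same way urllib2 and urllib do, so they do not execute JavaScript. To work with JavaScript you can use, for example, Qt WebKit or Selenium. Alternatively, you can find all the AJAX links on the page, work out how the data exchange with the server is implemented, and send the requests to the server API yourself.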
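The last suggestion, talking to the server without a browser, can be sketched for the link in the question. This assumes that `sysSubmitForm('frmSR1')` does nothing more than submit the form named `frmSR1`, which the question does not confirm. The sample HTML, its field names, and the helper names below are hypothetical; only the `<a>` tag comes from the question. The sketch uses the standard library to pull the form name out of the `javascript:` href and to collect the form's fields into a POST body.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical page fragment; only the <a> tag is taken from the question.
SAMPLE_HTML = """
<form name="frmSR1" method="post" action="/jobs/detail">
  <input type="hidden" name="jobId" value="1234">
  <input type="hidden" name="page" value="1">
</form>
<a class="Page" title="Click to view job description"
   href="javascript:sysSubmitForm('frmSR1');">Accountant </a>
"""


def form_name_from_href(href):
    """Extract 'frmSR1' from "javascript:sysSubmitForm('frmSR1');"."""
    match = re.search(r"sysSubmitForm\('([^']+)'\)", href)
    return match.group(1) if match else None


class FormCollector(HTMLParser):
    """Record the action, method, and input fields of one named form."""

    def __init__(self, form_name):
        super().__init__()
        self.form_name = form_name
        self.inside = False
        self.action = None
        self.method = "get"
        self.fields = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("name") == self.form_name:
            self.inside = True
            self.action = attrs.get("action")
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self.inside and attrs.get("name"):
            self.fields.append((attrs["name"], attrs.get("value") or ""))

    def handle_endtag(self, tag):
        if tag == "form":
            self.inside = False


name = form_name_from_href("javascript:sysSubmitForm('frmSR1');")
collector = FormCollector(name)
collector.feed(SAMPLE_HTML)
body = urlencode(collector.fields)  # URL-encoded body for the POST request
print(collector.method, collector.action, body)
```

Inside a Scrapy spider the same idea is available through `FormRequest.from_response(response, formname='frmSR1', callback=...)`, which builds the request from the form found in the downloaded page. If the JavaScript function also sets hidden fields before submitting, those values would have to be supplied by hand through `formdata`.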