Scrape websites using scrapy
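【Title】: Scrape websites using scrapy 【Posted】: 2013-05-09 17:20:46
【Question】: I am trying to scrape a website using scrapy, but I can't scrape all the products from the site because it uses infinite scrolling.
I can only scrape the data below for 52 items, but there are 3824 items in total.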
hxs.select("//span[@class='itm-Catbrand strong']").extract()
hxs.select("//span[@class='itm-price ']").extract()
hxs.select("//span[@class='itm-title']").extract()
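If I use hxs.select("//div[@id='content']/div/div/div").extract(), it extracts the whole item list but does not filter any further. How can I scrape all the items?
I have tried this, but the result is the same. Where am I going wrong?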
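One way to filter further is to select each item node first and then run relative queries inside it, so the brand, title and price of one product stay together. The sketch below uses the standard library's `xml.etree.ElementTree` on a small, well-formed stand-in for the listing markup; the second item, the price value and the helper names `text_of` and `extract_items` are assumptions made for illustration. Real pages are rarely well-formed XML, so in a spider the same pattern would be written with Scrapy selectors, calling a relative XPath such as `.//span[@class='itm-title']/text()` on each item node.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the listing markup.
SAMPLE = """
<ul>
  <li class="itm">
    <span class="itm-Catbrand strong">Phosphorus</span>
    <span class="itm-title"> Black Moccasins </span>
    <span class="itm-price ">1699.00</span>
  </li>
  <li class="itm">
    <span class="itm-Catbrand strong">BrandB</span>
    <span class="itm-title">Brown Loafers</span>
  </li>
</ul>
"""


def text_of(node, xpath):
    """Return the stripped text of the first match below node, or None."""
    found = node.find(xpath)
    if found is None or not found.text:
        return None
    return found.text.strip()


def extract_items(markup):
    """Build one dict per <li>, querying relative to that <li> only."""
    root = ET.fromstring(markup)
    items = []
    for li in root.iter("li"):
        items.append({
            "brand": text_of(li, ".//span[@class='itm-Catbrand strong']"),
            "title": text_of(li, ".//span[@class='itm-title']"),
            "price": text_of(li, ".//span[@class='itm-price ']"),
        })
    return items
```

Because every query starts at the item node, a missing field (the price of the second item here) shows up as `None` for that item instead of shifting the values of the items after it.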
    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
        for n in [2, 3, 4, 5, 6]:
            req = Request(url="http://www.jabong.com/men/shoes/?page=" + str(n),
                          headers={"Referer": "http://www.jabong.com/men/shoes/",
                                   "X-Requested-With": response.header['X-Requested-With']})
            return req
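【Comments】:
I'm not sure that response.header['X-Requested-With'] equals "XMLHttpRequest", so the website may redirect you to (or serve) the original item page. Also, you should probably use yield req or put all the requests in a list.
How do I set the header as you mentioned earlier? And with yield/return I get this error: ERROR: Spider must return Request, BaseItem or None, got 'Request' in <GET http://www.jabong.com/men/shoes/>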
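The difference between `return` and `yield` inside the loop can be seen without Scrapy at all. In this minimal sketch, plain dicts stand in for `Request` objects, and the names `first_page_only` and `every_page` are hypothetical. It does not explain the error message quoted above; it only shows why a `return` inside the `for` loop can never produce more than one request.

```python
BASE = "http://www.jabong.com/men/shoes/"
AJAX_HEADERS = {"Referer": BASE, "X-Requested-With": "XMLHttpRequest"}


def first_page_only(pages):
    for n in pages:
        # `return` leaves the function on the first iteration,
        # so pages 3, 4, 5 and 6 are never reached.
        return {"url": BASE + "?page=" + str(n), "headers": AJAX_HEADERS}


def every_page(pages):
    for n in pages:
        # `yield` hands back one description per page and keeps looping.
        yield {"url": BASE + "?page=" + str(n), "headers": AJAX_HEADERS}
```

Calling `list(every_page([2, 3, 4, 5, 6]))` gives five request descriptions, while `first_page_only` with the same input gives only the one for page 2.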
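【Answer 1】:
As you guessed, the site uses JavaScript to load more items when you scroll the page.
Using the developer tools built into my browser (Ctrl-Shift-I in Chromium), I saw in the "Network" tab that a JavaScript script included in the page performs the following request to load more items: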
GET http://www.website-your-are-crawling.com/men/shoes/?page=2 # 2,3,4,5,6 etc...
The web server responds with documents of the following type:
<li id="PH969SH70HPTINDFAS" class="itm hasOverlay unit size1of4 ">
<div id="qa-quick-view-btn" class="quickviewZoom itm-quickview ui-buttonQuickview l-absolute pos-t" title="Quick View" data-url ="phosphorus-Black-Moccasins-233629.html" data-sku="PH969SH70HPTINDFAS" onClick="_gaq.push(['_trackEvent', 'BadgeQV','Shown','OFFER INSIDE']);">Quick view</div>
<div class="itm-qlInsert tooltip-qlist highlightStar"
onclick="javascript:Rocket.QuickList.insert('PH969SH70HPTINDFAS', 'catalog');
return false;" >
<div class="starHrMsg">
<span class="starHrMsgArrow"> </span>
Save for later </div>
</div>
<a id='cat_105_PH969SH70HPTINDFAS' class="itm-link sobrTxt" href="/phosphorus-Black-Moccasins-233629.html"
onclick="fireGaq('_trackEvent', 'Catalog to PDP', 'men--Shoes--Moccasins', 'PH969SH70HPTINDFAS--1699.00--', this),fireGaq('_trackEvent', 'BadgePDP','Shown','OFFER INSIDE', this);">
<span class="lazyImage">
<span style="width:176px;height:255px;" class="itm-imageWrapper itm-imageWrapper-PH969SH70HPTINDFAS" id="http://static4.jassets.com/p/Phosphorus-Black-Moccasins-6668-926332-1-catalog.jpg" itm-img- itm-img- itm-img-sprites="4">
<noscript><img src="http://static4.jassets.com/p/Phosphorus-Black-Moccasins-6668-926332-1-catalog.jpg" class="itm-img"></noscript>
</span>
</span>
<span class="itm-budgeFlag offInside"><span class="flagBrdLeft"></span>OFFER INSIDE</span>
<span class="itm-Catbrand strong">Phosphorus</span>
<span class="itm-title">
Black Moccasins </span>
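These documents contain more items.
So, to get the full list of items, you have to return Request objects from your Spider's parse method (see the Spider class documentation) to tell scrapy that it should load more data: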
    import re
    from scrapy.http import Request

    def parse(self, response):
        # ... Extract items in the page using extractors
        # n is the number of the next "page" to parse: take the page number
        # at the end of response.url (the first page has none) and add 1.
        match = re.search(r"[?&]page=(\d+)", response.url)
        n = int(match.group(1)) + 1 if match else 2
        # It is VERY IMPORTANT to set the Referer and X-Requested-With headers
        # here because that's how the website detects whether the request was
        # made by javascript or directly by following a link.
        req = Request(url="http://www.website-your-are-crawling.com/men/shoes/?page=" + str(n),
                      headers={"Referer": "http://www.website-your-are-crawling.com/men/shoes/",
                               "X-Requested-With": "XMLHttpRequest"})
        return req  # and your items
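A spider that always requests the next page needs some point at which to stop. As a rough sketch, assuming every page returns the same 52 items as the first one (the thread does not confirm this), 3824 items would span ceil(3824 / 52) = 74 pages. The helpers `page_count` and `next_page_url` below are hypothetical names. They use only the standard library, so the page arithmetic can be checked separately from the spider.

```python
import math
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse


def page_count(total_items, items_per_page):
    """Number of pages needed to show total_items."""
    return math.ceil(total_items / items_per_page)


def next_page_url(url, last_page):
    """Return the URL of the following page, or None after last_page.

    A URL without a `page` parameter is treated as page 1.
    """
    parts = urlparse(url)
    query = parse_qs(parts.query)
    current = int(query.get("page", ["1"])[0])
    if current >= last_page:
        return None
    query["page"] = [str(current + 1)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))
```

Inside `parse`, the result of `next_page_url(response.url, last_page)` could be used as the `url` of the next `Request`, with no request issued when it is `None`. Stopping as soon as a response contains no items is another option that does not depend on the assumed page size.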
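Oh, and by the way (in case you want to test): you can't just load http://www.website-your-are-crawling.com/men/shoes/?page=2 in your browser to see what it returns, because the website redirects you to the global page (i.e. http://www.website-your-are-crawling.com/men/shoes/) if the X-Requested-With header is anything other than XMLHttpRequest.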
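To test outside the browser, the same two headers can be attached to a request built by hand. This is a minimal sketch using the standard library's `urllib.request`; the helper name `build_ajax_request` is an assumption, and the placeholder host from this answer has to be replaced with the real site before anything is sent.

```python
import urllib.request

BASE = "http://www.website-your-are-crawling.com/men/shoes/"


def build_ajax_request(page, base=BASE):
    """Build (but do not send) a request that looks like the page's own
    JavaScript call: same URL pattern, Referer and X-Requested-With."""
    return urllib.request.Request(
        base + "?page=%d" % page,
        headers={
            "Referer": base,
            "X-Requested-With": "XMLHttpRequest",
        },
    )
```

Passing the result to `urllib.request.urlopen` would send it. Note that `urllib` stores header names in capitalized form, so the header is looked up as `req.get_header("X-requested-with")`; HTTP header names are case-insensitive, so the server sees the same header.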