Scrapy xpath returns empty list in scrapy shell

Posted 2023-02-22

Asked: 2021-11-24 16:59:57

Question:

I am trying to scrape the startup articles on this page in scrapy shell, using the following XPath command:

n = response.xpath('//article[contains(@class, "post-block post-block--image")]/header/h2/a/text()').getall()

n
[]

The command returns 0 articles instead of the 18 I see when I try

//article[contains(@class, "post-block post-block--image")]/header/h2/a/text()

in the Chrome inspector. How can I get the articles in scrapy shell?

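Comments:

You have asked 4 questions but have not accepted any answers. If you are satisfied with an answer, please accept it, since people spend time and effort answering you. Thanks.

Gosh, I have so much to learn. I didn't know I had to accept an answer. My apologies.

Answer 1: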

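The articles are presumably loaded by JavaScript after the page loads, so they are missing from the raw HTML that Scrapy downloads even though the Chrome inspector shows them. You can get them from the site's JSON endpoint instead: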

scrapy shell

In [1]: url = 'https://techcrunch.com/wp-json/tc/v1/magazine?page=1&_embed=true&_envelope=true&categories=20429&cachePrevention=0'

In [2]: headers = {
   ...: "Accept": "*/*",
   ...: "Accept-Encoding": "gzip, deflate, br",
   ...: "Accept-Language": "en-US,en;q=0.5",
   ...: "Cache-Control": "no-cache",
   ...: "Connection": "keep-alive",
   ...: "Content-Type": "application/json; charset=utf-8",
   ...: "DNT": "1",
   ...: "Host": "techcrunch.com",
   ...: "Pragma": "no-cache",
   ...: "Referer": "https://techcrunch.com/startups/",
   ...: "Sec-Fetch-Dest": "empty",
   ...: "Sec-Fetch-Mode": "cors",
   ...: "Sec-Fetch-Site": "same-origin",
   ...: "Sec-GPC": "1",
   ...: "TE": "trailers",
   ...: "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
   ...: "X-KL-Ajax-Request": "Ajax_Request",
   ...: "X-TC-EC-Auth-Token": "",
   ...: "X-TC-UUID": ""
   ...: }

In [3]: req = scrapy.Request(url=url, headers=headers)

In [4]: fetch(req)
[scrapy.core.engine] INFO: Spider opened
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://techcrunch.com/wp-json/tc/v1/magazine?page=1&_embed=true&_envelope=true&categories=20429&cachePrevention=0> (referer: https://techcrunch.com/startups/)

In [5]: view(response)
Out[5]: True

In [6]: body = response.json()['body']

In [7]: for b in body:
   ...:     print(b['slug'])
   ...:
how-to-claim-a-student-discount-for-techcrunch
chimes-chris-britt-and-menlo-ventures-shawn-carolan-to-talk-fintech-on-techcrunch-live
graphwear-closes-20-5m-series-b-for-a-needle-free-nanotech-powered-glucose-monitor
investors-share-how-infrastructure-as-code-is-taking-over-devops
informaticas-ipo-will-test-public-markets-appetite-for-slower-growing-tech-offerings
index-sequoia-and-canvas-investors-weigh-in-on-how-to-raise-your-first-dollars
lawpath-gets-7-5m-aud-to-become-the-asia-pacifics-legalzoom
equity-monday-byjus-raises-more-money-somehow-as-tech-stocks-fall
stories-as-a-service-storyteller-lets-anyone-add-stories-to-their-own-apps-or-website
rich-and-worried-about-the-world-put-your-money-where-your-concern-is
made-of-air-a-maker-of-carbon-negative-thermoplastics-locks-in-5-8m
insurtech-stable-raises-46-5m-in-greycroft-led-round-to-help-businesses-manage-volatile-commodity-prices
as-apple-messes-with-attribution-what-does-growth-marketing-look-like-in-2021
yc-grads-wasp-land-1-5m-seed-to-help-developers-build-web-apps-faster
ladder-raises-100m-on-a-900m-valuation-for-a-platform-selling-flexible-term-life-insurance
devops-market-demand-drives-quick-series-c-turnaround-for-esper
elevate-launches-its-approach-to-managing-pre-tax-benefits-with-12m-series-a
to-the-market-takes-on-funding-to-create-ethical-sustainable-work-environments-for-women
indian-edtech-giant-byjus-valued-at-18-billion-in-new-funding
komunidad-a-philippines-based-environmental-intelligence-platform-lands-seed-round
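
Outside the shell, the same idea can be sketched in plain Python. This is only a rough illustration, not a tested scraper: `build_page_url` and `extract_slugs` are hypothetical helper names, the assumption that raising `page` returns further pages is unverified, and the sample payload is hand-written to mimic the `body` list seen in the session above.

```python
import json
from urllib.parse import urlencode

# Endpoint and query parameters copied from the shell session above.
BASE_URL = "https://techcrunch.com/wp-json/tc/v1/magazine"


def build_page_url(page, category=20429):
    """Build the listing URL for one page (hypothetical helper).

    Assumption: increasing `page` walks through older articles.
    """
    params = {
        "page": page,
        "_embed": "true",
        "_envelope": "true",
        "categories": category,
        "cachePrevention": 0,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def extract_slugs(payload):
    """Return the slug of every item under the envelope's 'body' key."""
    data = json.loads(payload) if isinstance(payload, (str, bytes)) else payload
    return [item["slug"] for item in data.get("body", [])]


# Hand-written payload in the assumed envelope shape, for illustration only.
sample = '{"body": [{"slug": "first-post"}, {"slug": "second-post"}]}'
print(build_page_url(1))
print(extract_slugs(sample))
```

In a spider, the dictionary returned by `response.json()` could be passed straight to `extract_slugs`, and `build_page_url(page + 1)` could feed the next request if the pagination assumption holds.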
