Scrapy xpath returns empty list in scrapy shell
Posted: 2021-11-24 16:59:57
Question: I am trying to scrape the startup articles on this page from the scrapy shell, using the following XPath command:
n = response.xpath('//article[contains(@class, "post-block post-block--image")]/header/h2/a/text()').getall()
n
[]
The command returns 0 articles instead of the 18 I can see when I try
//article[contains(@class, "post-block post-block--image")]/header/h2/a/text()
in the Chrome inspector. How can I get the articles in the scrapy shell?
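Comments:
You have asked 4 questions but have not accepted any answers. If you are satisfied with an answer, please accept it, since people spend time and effort answering you. Thanks.
Oh my, I have so much to learn. I did not know I had to accept an answer. My apologies for that.
Answer 1:
You can get the articles from the JSON endpoint instead: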
scrapy shell
In [1]: url = 'https://techcrunch.com/wp-json/tc/v1/magazine?page=1&_embed=true&_envelope=true&categories=20429&cachePrevention=0'
In [2]: headers = {
...:     "Accept": "*/*",
...:     "Accept-Encoding": "gzip, deflate, br",
...:     "Accept-Language": "en-US,en;q=0.5",
...:     "Cache-Control": "no-cache",
...:     "Connection": "keep-alive",
...:     "Content-Type": "application/json; charset=utf-8",
...:     "DNT": "1",
...:     "Host": "techcrunch.com",
...:     "Pragma": "no-cache",
...:     "Referer": "https://techcrunch.com/startups/",
...:     "Sec-Fetch-Dest": "empty",
...:     "Sec-Fetch-Mode": "cors",
...:     "Sec-Fetch-Site": "same-origin",
...:     "Sec-GPC": "1",
...:     "TE": "trailers",
...:     "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
...:     "X-KL-Ajax-Request": "Ajax_Request",
...:     "X-TC-EC-Auth-Token": "",
...:     "X-TC-UUID": ""
...: }
In [3]: req = scrapy.Request(url=url, headers=headers)
In [4]: fetch(req)
[scrapy.core.engine] INFO: Spider opened
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://techcrunch.com/wp-json/tc/v1/magazine?page=1&_embed=true&_envelope=true&categories=20429&cachePrevention=0> (referer: https://techcrunch.com/startups/)
In [5]: view(response)
Out[5]: True
In [6]: body = response.json()['body']
In [7]: for b in body:
...:     print(b['slug'])
...:
how-to-claim-a-student-discount-for-techcrunch
chimes-chris-britt-and-menlo-ventures-shawn-carolan-to-talk-fintech-on-techcrunch-live
graphwear-closes-20-5m-series-b-for-a-needle-free-nanotech-powered-glucose-monitor
investors-share-how-infrastructure-as-code-is-taking-over-devops
informaticas-ipo-will-test-public-markets-appetite-for-slower-growing-tech-offerings
index-sequoia-and-canvas-investors-weigh-in-on-how-to-raise-your-first-dollars
lawpath-gets-7-5m-aud-to-become-the-asia-pacifics-legalzoom
equity-monday-byjus-raises-more-money-somehow-as-tech-stocks-fall
stories-as-a-service-storyteller-lets-anyone-add-stories-to-their-own-apps-or-website
rich-and-worried-about-the-world-put-your-money-where-your-concern-is
made-of-air-a-maker-of-carbon-negative-thermoplastics-locks-in-5-8m
insurtech-stable-raises-46-5m-in-greycroft-led-round-to-help-businesses-manage-volatile-commodity-prices
as-apple-messes-with-attribution-what-does-growth-marketing-look-like-in-2021
yc-grads-wasp-land-1-5m-seed-to-help-developers-build-web-apps-faster
ladder-raises-100m-on-a-900m-valuation-for-a-platform-selling-flexible-term-life-insurance
devops-market-demand-drives-quick-series-c-turnaround-for-esper
elevate-launches-its-approach-to-managing-pre-tax-benefits-with-12m-series-a
to-the-market-takes-on-funding-to-create-ethical-sustainable-work-environments-for-women
indian-edtech-giant-byjus-valued-at-18-billion-in-new-funding
komunidad-a-philippines-based-environmental-intelligence-platform-lands-seed-round
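The session above can be condensed into two small helpers. This is only a sketch: `build_url` and `extract_slugs` are hypothetical names, not part of Scrapy, and it assumes that the `page` query parameter selects the result page and that every response keeps the `body` / `slug` layout shown in the session.

```python
from urllib.parse import urlencode

# Endpoint and query parameters copied from the shell session above.
BASE_URL = "https://techcrunch.com/wp-json/tc/v1/magazine"


def build_url(page):
    """Return the endpoint URL for one page.

    Assumption: changing `page` returns the next batch of articles.
    """
    params = {
        "page": page,
        "_embed": "true",
        "_envelope": "true",
        "categories": 20429,
        "cachePrevention": 0,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def extract_slugs(payload):
    """Collect the 'slug' of every item under the envelope's 'body' key.

    Items without a 'slug' are skipped; a missing 'body' gives an empty list.
    """
    return [item["slug"] for item in payload.get("body", []) if "slug" in item]
```

In the scrapy shell, after `fetch(req)`, calling `extract_slugs(response.json())` should give the same slugs as the loop above, as a list rather than printed lines.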