Using CSS selectors to grab a nested tag on a Reactjs page using scrapy

Posted 2023-03-06


Asked: 2018-04-19 05:59:31

Question:

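I am trying to use a CSS selector to get the highlighted href value, but so far without success.

I am using the scrapy shell and tried this:

response.css('body > span > section'), but it returns []

I also tried response.css('div') to see whether it could grab any div tag at all, but it still returns []

The CSS selector copied from Chrome's DevTools is: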

#react-root > section > main > article > div > div._cmdpi > div:nth-child(1) > div:nth-child(2) > a

I passed the selector Chrome gave me to response.css(), but it also returned []

However, when I tried:

response.css('body, span, section, main, article, div, div, div')

I got this:

[<Selector xpath='descendant-or-self::body | descendant-or-self::span | descendant-or-self::section | descendant-or-self::main | descendant-or-self::article | descendant-or-self::div | descendant-or-self::div | descendant-or-self::div' data='<body class="">\n        \n    <span id="r'>, <Selector xpath='descendant-or-self::body | descendant-or-self::span | descendant-or-self::section | descendant-or-self::main | descendant-or-self::article | descendant-or-self::div | descendant-or-self::div | descendant-or-self::div' data='<span id="react-root"></span>'>]

I am confused about why some CSS selectors work and others do not, for example div versus body, span, section, main, article, div, div, div

Comments:

Could you use bs4? select_one('a[href*=taken-by]')

Answer 1:

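I think the reason is that the HTML you see in the browser is probably generated on the client side by JavaScript, so it is not in the response scrapy receives. I suggest inspecting the HTML that scrapy actually received (you can save response.body to a file if needed) or checking it in the shell. The , in a CSS selector works like an OR: body, span, section, main, article, div, div, div matches any element that is one of the listed tags, which is why it returned only the body and the span that exist in the raw HTML. The data needed for that href may be in JSON embedded in the HTML.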
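The points above can be sketched with the standard library alone. This is only an illustration under stated assumptions: RAW_HTML is a made-up stand-in for what response.text might hold before any JavaScript runs, and the window.__DATA__ variable, its JSON layout, and the href value are hypothetical, not taken from the real page.

```python
import json
import re
from html.parser import HTMLParser

# Hypothetical stand-in for response.text: the markup the server sends
# before the browser runs any JavaScript. The script payload is invented.
RAW_HTML = """<html><head></head>
<body class="">
    <span id="react-root"></span>
    <script type="text/javascript">window.__DATA__ = {"links": [{"href": "/p/abc123/?taken-by=someuser"}]};</script>
</body></html>"""


class TagCollector(HTMLParser):
    """Record the name of every start tag found in the markup."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(RAW_HTML)

# A selector group like 'body, span, section, main, article, div' matches an
# element if ANY member of the group matches, so it succeeds on <body> and
# <span> even though the raw HTML contains no <div> at all.
group = {"body", "span", "section", "main", "article", "div"}
matched = [tag for tag in collector.tags if tag in group]
print(matched)
print("div" in collector.tags)

# The data may instead sit in JSON inside a <script> tag. The variable name
# window.__DATA__ is an assumption; check the real response for the actual one.
match = re.search(r"window\.__DATA__\s*=\s*(\{.*?\});</script>", RAW_HTML, re.DOTALL)
data = json.loads(match.group(1))
hrefs = [link["href"] for link in data["links"]]
print(hrefs)
```

In the scrapy shell, the same check would start from response.text instead of RAW_HTML; the regular expression and the keys used to walk the JSON would have to be adapted to whatever the real response contains.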
