Using XPath in Scrapy to select any text below a paragraph

Published: 2016-09-16 01:49:28

Question:

OK, my initial code works, but it misses some odd formatting on the site:

response.xpath("//*[contains(., 'Description:')]/following-sibling::p/text()").extract()


  <div id="body">
  <a name="main_content" id="main_content"></a>
  <!-- InstanceBeginEditable name="main_content" -->
<div class="return_to_div"><a href="../../index.html">HOME</a>  | <a href="../index.html">DEATH ROW</a>  | <a href="index.html">INFORMATION</a>  | text</div>
<h1>text</h1>
<h2>text</h2>
<p class="text_bold">text:</p>
<p>text</p>
<p class="text_bold">text:</p>
<p>text</p>
<p class="text_bold">Description:</p>
<p>Line1</p>
<p>Line2</p>
Line3  <!-- InstanceEndEditable -->  
  </div>

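I have no problem pulling Line1 and Line2. However, Line3 is not a sibling <p> of my text_bold <p>. This only happens on some of the pages I am trying to scrape.

Here is the link: https://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html

Sorry, XPath confuses me. Is there a way to extract all the data after the condition //*[contains(., 'Description:')], without requiring it to be a sibling element?

Thanks in advance.

Edited: changed the example to better reflect the actual page. Added a link to the original page.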

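Comments:

What do you want to get from the web page?

Answer 1:

You can select all sibling nodes (elements and text nodes) that follow the <p> containing "Description:" (following-sibling::node()), and then get all of their text nodes (descendant-or-self::text()):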

>>> import scrapy
>>> response = scrapy.Selector(text="""<div>
...  <p> Name </p>
...  <p> Age  </p>
...  <p class="text-bold"> Description: </p>
...  <p> Line 1 </p>
...  <p> Line 2 </p>
... Line 3
... </div>""", type="html")
>>> response.xpath("""//div/p[contains(., 'Description:')]
...      /following-sibling::node()
...         /descendant-or-self::text()""").extract()
[u'\n ', u' Line 1 ', u'\n ', u' Line 2 ', u'\nLine 3\n']
>>> 

Let's break it down.

You already know how to find the right <p> containing "Description:" (using the XPath //div/p[contains(., 'Description:')]):

>>> response.xpath("//div/p[contains(., 'Description:')]").extract()
[u'<p class="text-bold"> Description: </p>']

You want the <p> elements that come after it (following-sibling:: axis + p element selection):

>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::p").extract()
[u'<p> Line 1 </p>', u'<p> Line 2 </p>']

That does not give you the third line. So you read up on XPath and try the "catch-all" *:

>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::*").extract()
[u'<p> Line 1 </p>', u'<p> Line 2 </p>']

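Still no luck. Why? Because * only selects elements (commonly called "tags", to simplify).

The third line is a text node, a child of the parent <div> element. But a text node is also a node (!), so you can select it as a sibling of that famous <p> above: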

>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()").extract()
[u'\n ', u'<p> Line 1 </p>', u'\n ', u'<p> Line 2 </p>', u'\nLine 3\n']
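
To make the difference between element nodes and text nodes concrete, here is a minimal sketch that walks the same sample markup with the standard library's xml.dom.minidom instead of Scrapy. This is only an illustration of the node model: the SNIPPET constant and the following_siblings helper are names made up for this example, and minidom needs well-formed XML, which the sample happens to be.

```python
from xml.dom import minidom

# Assumption: the same sample markup as in the answer, which is well-formed XML.
SNIPPET = """<div>
 <p> Name </p>
 <p> Age  </p>
 <p class="text-bold"> Description: </p>
 <p> Line 1 </p>
 <p> Line 2 </p>
Line 3
</div>"""


def following_siblings(node):
    """Yield every node after `node` under the same parent,
    similar in spirit to following-sibling::node()."""
    sibling = node.nextSibling
    while sibling is not None:
        yield sibling
        sibling = sibling.nextSibling


div = minidom.parseString(SNIPPET).documentElement
description = div.getElementsByTagName("p")[2]  # the <p> holding "Description:"

siblings = list(following_siblings(description))

# Keeping only element nodes is what following-sibling::* does: "Line 3" is lost.
elements_only = [n.toxml() for n in siblings if n.nodeType == n.ELEMENT_NODE]

# The bare "Line 3" shows up here, as a text node sibling of the <p> elements.
text_siblings = [n.data for n in siblings if n.nodeType == n.TEXT_NODE]

print(elements_only)
print(text_siblings)
```

The element-only list contains just the two <p> elements, while the text-node list ends with the fragment that holds "Line 3".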

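OK, now it looks like we have the nodes we want ("tag" elements and text nodes). But you still get those "<p>" in the output of .extract() (the XPath selected the elements, not their "inner" text).

So you read some more about XPath and use a //text() step (roughly "all descendant text nodes from here"):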

>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()//text()").extract()
[u' Line 1 ', u' Line 2 ']

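Uh, wait, where did the third line go?

In fact, this // is short for /descendant-or-self::node()/, so ./descendant-or-self::node()/text() only selects the child text nodes of the following <p> elements (text nodes have no children, so the trailing /text() step never matches anything when the context node is itself a text node):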

>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()/descendant-or-self::node()/text()").extract()
[u' Line 1 ', u' Line 2 ']

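What you can do here is use the handy descendant-or-self axis together with the text() node test: if following-sibling::node() reaches a text node, the "self" in descendant-or-self matches that text node, and the text() node test is true: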

>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()/descendant-or-self::text()").extract()
[u'\n ', u' Line 1 ', u'\n ', u' Line 2 ', u'\nLine 3\n']
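
The two variants above can also be imitated by hand, which may help show why one loses "Line 3" and the other keeps it. The following is a sketch under assumptions, not how Scrapy evaluates XPath: it uses the standard library's xml.dom.minidom on the same sample markup, and descendant_or_self and is_text are hypothetical helpers written for this illustration.

```python
from xml.dom import minidom

# Assumption: the same sample markup as in the answer, which is well-formed XML.
SNIPPET = """<div>
 <p> Name </p>
 <p> Age  </p>
 <p class="text-bold"> Description: </p>
 <p> Line 1 </p>
 <p> Line 2 </p>
Line 3
</div>"""


def descendant_or_self(node):
    """Yield `node` itself, then all of its descendants in document order."""
    yield node
    for child in node.childNodes:
        yield from descendant_or_self(child)


def is_text(node):
    return node.nodeType == node.TEXT_NODE


div = minidom.parseString(SNIPPET).documentElement
description = div.getElementsByTagName("p")[2]  # the <p> holding "Description:"

# Collect the following siblings (elements and text nodes alike).
siblings = []
current = description.nextSibling
while current is not None:
    siblings.append(current)
    current = current.nextSibling

# Imitates following-sibling::node()/descendant-or-self::node()/text():
# for each node reached, only its *child* text nodes are taken.
# A text node sibling has no children, so it contributes nothing.
child_text_only = [
    child.data
    for sibling in siblings
    for reached in descendant_or_self(sibling)
    for child in reached.childNodes
    if is_text(child)
]

# Imitates following-sibling::node()/descendant-or-self::text():
# a text node sibling matches as "self", so "Line 3" is kept.
self_or_descendant_text = [
    reached.data
    for sibling in siblings
    for reached in descendant_or_self(sibling)
    if is_text(reached)
]

print(child_text_only)
print(self_or_descendant_text)
```

The first list mirrors the two-item result shown earlier, and the second mirrors the five-item result that includes the third line.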

Using the example URL from the OP's edited question:

$ scrapy shell https://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html
2016-05-19 13:14:44 [scrapy] INFO: Scrapy 1.1.0 started (bot: scrapybot)
2016-05-19 13:14:44 [scrapy] INFO: Overridden settings: 'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'
(...)
2016-05-19 13:14:48 [scrapy] INFO: Spider opened
2016-05-19 13:14:50 [scrapy] DEBUG: Crawled (200) <GET https://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html> (referer: None)

>>> t = response.xpath("""
...     //div/p[contains(., 'Last Statement:')]
...         /following-sibling::node()
...             /descendant-or-self::text()""").extract()
>>> 
>>> 
>>> print(''.join(t))

I would like to thank everyone that has showed up on my  behalf, Kathryn Cox, I love you dearly.  Thank you Randy Cannon for  showing up and being a lifelong friend.  Thank you Dr. Steve Ball for  trying to bring the right out.  There are a lot of injustices that are  happening with this.  This is wrong.  Thank you Reverend Leon  Harrison for showing me the grace of God.  Thank you for all of my friends  that are out there.  This is not a capital case.  I never had  intended to do anything.  I feel very grieved for the loss of Walker, and  for Donovan and Marissa Walker.  I hope they can find peace and be  productive in society.  I would like to thank all of my friends on the row  even though everything didn’t work, close isn’t good enough.  I hope that  positive change will come out of this.
I would like to thank my father and mother for everything  that they showed me.  I would like to apologize for putting them through  this.  I would like to ask for the truth to come out and make positive  changes.  Above all else Donovan and Marissa can find love and  peace.  I hope they overcome the loss of their father.  At no time  did I intend to hurt him.
When  the truth comes out I hope that they can find closure.  There are a lot of  things that are not right in this world, I have had to overcome them  myself.  I hope all that are on the row, I hope they find peace and solace  in their life. Everyone can find peace in a Christian God or whatever God they  believe in.  I thank you mom and dad for everything, I love you  dearly.  One last thing, I thank all of my friends that showed loyalty and  graced my life with more positive.  I would also like to thank Gustav’s  mother for having such a great son, and showing me much love.  I have met  good people on the row, not all of them are bad.  I hope everyone can see  that.  I just want to thank everybody that came to witness this.  I  thank everyone, I am sorry things didn’t work out.  May God forgive us  all?  I am sorry mother and I am sorry father.  I hope you find peace  and solace in your heart.  I know there is something else I need to  say.  I feel that.    
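
The joined text keeps the page's original line breaks and doubled spaces. If a single clean string is preferred, one possible follow-up step is to collapse the whitespace after joining. This is a small sketch, not part of the original answer: clean_text is a hypothetical helper name, and the fragments list is the sample output shown earlier rather than the live page.

```python
def clean_text(fragments):
    """Join extracted text fragments and collapse every run of
    whitespace (spaces, newlines, tabs) into a single space."""
    return " ".join("".join(fragments).split())


# Sample fragments, copied from the small example earlier in the answer.
fragments = ['\n ', ' Line 1 ', '\n ', ' Line 2 ', '\nLine 3\n']

print(clean_text(fragments))  # Line 1 Line 2 Line 3
print(repr(clean_text([])))   # '' (an empty list simply gives an empty string)
```

Note that this also flattens paragraph breaks, so it only fits cases where the text is wanted as one line.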

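Comments:

Do you think you could comment further, based on the link provided?

Any idea where I can get a breakdown of /following-sibling::node() ... /descendant-or-self::text()""").extract() >>> ? Also, for that part of the code, will it raise an error if there are no text nodes? Sorry, I can't wait to test it, but I still don't have a console available.

Are you asking what following-sibling::node() and descendant-or-self::text() mean? I gave a talk on XPath that may help. Note that "..." is just a display artifact of the Python interpreter for continuation lines. The XPath to use can be written inline as //div/p[contains(., 'Last Statement:')]/following-sibling::node()/descendant-or-self::text() (whitespace is not significant in XPath expressions).

It will take me some time to digest. Thank you very much for the explanation!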
