Xpath to get information in next sibling tag using Scrapy
【Posted】: 2015-01-24 04:48:50 【Question】: I am trying out Scrapy, and right now I am trying to extract information from an etymology website: http://www.etymonline.com For now, I only want to get the words and their origin descriptions. This is how a typical block of HTML looks on etymonline:
<dt>
<a href="/index.php?term=address&allowed_in_frame=0">address (n.)</a>
<a href="http://dictionary.reference.com/search?q=address" class="dictionary" title="Look up address at Dictionary.com">
<img src="graphics/dictionary.gif" title="Look up address at Dictionary.com"/>
</a>
</dt>
<dd>
1530s, "dutiful or courteous approach," from <a href="/index.php?term=address&allowed_in_frame=0" class="crossreference">address</a> (v.) and from French <span class="foreign">adresse</span>. Sense of "formal speech" is from 1751. Sense of "superscription of a letter" is from 1712 and led to the meaning "place of residence" (1888).
</dd>
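The word is contained in the <dt> tag, and its description in the next sibling <dd> tag.
To get the list of words on a page such as http://www.etymonline.com/index.php?l=a&p=9&allowed_in_frame=0, one can write word = sel.xpath('//dl/dt/a/text()').extract().
I then tried to loop over this list of words and extract the related information with info = selInfo.xpath("//dl/dt[a='"+word[i]+"']/following-sibling::dd"), but this does not seem to work. Any ideas?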
【Comments on the question】:
Does word[i] hold the full content of the <a> element (for example "address (n.)" rather than "address")? If not, you may want to use starts-with.
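The comment's point can be sketched with lxml instead of Scrapy (an assumption, chosen so the snippet runs without a crawler). The markup is a shortened, hypothetical version of the etymonline block, and the helper names `definition_exact` and `headwords_starting_with` are illustrative only. The word is passed as an XPath variable rather than concatenated into the expression, so quotes inside a word cannot break the query.

```python
from lxml import html

# Hypothetical, shortened markup modelled on the snippet in the question.
PAGE = """
<dl>
  <dt><a href="/index.php?term=address">address (n.)</a></dt>
  <dd>1530s, "dutiful or courteous approach."</dd>
  <dt><a href="/index.php?term=addressee">addressee (n.)</a></dt>
  <dd>1810; see address (v.) + -ee.</dd>
</dl>
"""

tree = html.fromstring(PAGE)


def definition_exact(word):
    # dt[a=$w] matches only when the link text equals `word` exactly.
    # $w is an XPath variable supplied as a keyword argument.
    return tree.xpath("string(//dl/dt[a=$w]/following-sibling::dd[1])", w=word)


def headwords_starting_with(prefix):
    # starts-with() tolerates the "(n.)" / "(v.)" suffix after the bare word.
    return tree.xpath("//dl/dt[starts-with(a, $p)]/a/text()", p=prefix)


exact_hit = definition_exact("address (n.)")
exact_miss = definition_exact("address")  # bare word: no dt matches
prefix_hits = headwords_starting_with("address")

print(exact_hit)
print(repr(exact_miss))
print(prefix_hits)
```

With the bare word "address" the exact comparison finds nothing, while the prefix test matches both headwords that begin with it.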
【Answer 1】:
A solution using following-sibling.
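import scrapy


class SingleSpider(scrapy.Spider):
    name = "etym"
    allowed_domains = ["etymonline.com"]
    start_urls = [
        "http://www.etymonline.com/index.php?l=d&allowed_in_frame=0"]

    def parse(self, response):
        # Each <dl> holds alternating <dt> (word) and <dd> (definition) elements.
        for nodes in response.xpath('//dl'):
            for i in nodes.xpath('dt'):
                # Text of the link inside the <dt>: the headword.
                print(i.xpath('a/text()').extract())
                # Direct text nodes of the first <dd> after this <dt>.
                print(i.xpath('following-sibling::dd[1]/text()').extract())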
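Basically, the spider takes the dt elements one by one, prints the text contained in the link, then moves to the next sibling and prints the text it contains.
Here is an excerpt of the output: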
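[u'daiquiri (n.)'] [u'type of alcoholic drink, 1920 (first recorded in F. Scott Fitzgerald), from ', u', name of a district or village in eastern Cuba.']
[u'dairy (n.)'] [u'late 13c., "building for making butter and cheese; dairy farm," formed with Anglo-French ', u' affixed to Middle English ', u' (in ', u' "dairymaid"), from Old English ', u' "kneader of bread, housekeeper, female servant" (see ', u' (n.1)). The purely native word was ', u'.']
[u'dais (n.)'] [u'mid-13c., from Anglo-French ', u', Old French ', u' "table, platform," from Latin ', u' "disk-shaped object," also, by medieval times, "table," from Greek ', u' "quoit, disk, dish" (see ', u' (n.)). Died out in English c.1600, preserved in Scotland, revived 19c. by antiquarians.']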
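The gaps in this output (for example "from ', u', name of") appear because text() returns only the text nodes that are direct children of the dd, so words wrapped in nested a or span tags are skipped. A minimal sketch of the difference, using lxml and a shortened hypothetical snippet rather than the live page:

```python
from lxml import html

# Hypothetical, shortened markup modelled on an etymonline entry.
SNIPPET = """
<dl>
  <dt><a href="/index.php?term=address">address (n.)</a></dt>
  <dd>1530s, from <a href="/index.php?term=address">address</a> (v.) and from French <span class="foreign">adresse</span>.</dd>
</dl>
"""

dd = html.fromstring(SNIPPET).xpath("//dd")[0]

# text(): direct text nodes only, so linked words are missing.
fragments = dd.xpath("text()")
# string(.): the text of the element and all of its descendants, concatenated.
full_text = dd.xpath("string(.)")

print(fragments)
print(full_text)
```

In a Scrapy spider the same XPath function can be used inside the selector call, for instance string(following-sibling::dd[1]).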
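【Answer 2】: To reach the <dd> that follows a <dt>, the following-sibling axis is indeed the right tool.
following-sibling::dd selects all dd elements that are following siblings of the context node, so you need the positional predicate [1] to restrict the XPath to the first one only.
For each dt element you get from //dl/dt, select following-sibling::dd[1].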
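The effect of the [1] predicate can be checked on a small list. This is a sketch with lxml and placeholder definitions (both are assumptions made for brevity, not the real page content):

```python
from lxml import html

# Hypothetical list with three entries and placeholder definitions.
PAGE = """
<dl>
  <dt><a>address (n.)</a></dt><dd>first definition</dd>
  <dt><a>addressee (n.)</a></dt><dd>second definition</dd>
  <dt><a>address (v.)</a></dt><dd>third definition</dd>
</dl>
"""

first_dt = html.fromstring(PAGE).xpath("//dl/dt")[0]

# Without a predicate: every dd sibling that comes after this dt.
all_following = first_dt.xpath("following-sibling::dd/text()")
# With [1]: only the nearest following dd, i.e. this word's own definition.
only_next = first_dt.xpath("following-sibling::dd[1]/text()")

print(all_following)
print(only_next)
```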
Here is a sample scrapy shell session for the word "address":
$ scrapy shell "http://www.etymonline.com/index.php?allowed_in_frame=0&search=address&searchmode=none"
...
2014-11-26 10:34:53+0100 [default] DEBUG: Crawled (200) <GET http://www.etymonline.com/index.php?allowed_in_frame=0&search=address&searchmode=none> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7f1396cc6950>
[s] item
[s] request <GET http://www.etymonline.com/index.php?allowed_in_frame=0&search=address&searchmode=none>
[s] response <200 http://www.etymonline.com/index.php?allowed_in_frame=0&search=address&searchmode=none>
[s] settings <scrapy.settings.Settings object at 0x7f1397399bd0>
[s] spider <Spider 'default' at 0x7f13966c05d0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [1]: for dt in response.xpath('//dl/dt'):
   ...:     print "Word:", dt.xpath('string(a)').extract()
   ...:     print "Definition:", dt.xpath('string(following-sibling::dd[1])').extract()
   ...:     print
   ...:
Word: [u'address (n.)']
Definition: [u'1530s, "dutiful or courteous approach," from address (v.) and from French adresse. Sense of "formal speech" is from 1751. Sense of "superscription of a letter" is from 1712 and led to the meaning "place of residence" (1888).']
Word: [u'addressee (n.)']
Definition: [u'1810; see address (v.) + -ee.']
Word: [u'address (v.)']
Definition: [u'early 14c., "to guide or direct," from Old French adrecier "go straight toward; straighten, set right; point, direct" (13c.), from Vulgar Latin *addirectiare "make straight," from Latin ad "to" (see ad-) + *directiare, from Latin directus "straight, direct" (see direct (v.)). Late 14c. as "to set in order, repair, correct." Meaning "to write as a destination on a written message" is from mid-15c. Meaning "to direct spoken words (to someone)" is from late 15c. Related: Addressed; addressing.']
Word: [u'salutatorian (n.)']
Definition: [u'1841, American English, from salutatory "of the nature of a salutation," here in the specific sense "designating the welcoming address given at a college commencement" (1702) + -ian. The address was originally usually in Latin and given by the second-ranking graduating student.']
...
Word: [u'reverend (adj.)']
Definition: [u'early 15c., "worthy of respect," from Middle French reverend, from Latin reverendus "(he who is) to be respected," gerundive of revereri (see reverence). As a form of address for clergymen, it is attested from late 15c.; earlier reverent (late 14c. in this sense). Abbreviation Rev. is attested from 1721, earlier Revd. (1690s). Very Reverend is used of deans, Right Reverend of bishops, Most Reverend of archbishops.']
Word: [u'nun (n.)']
Definition: [u'Old English nunne "nun, vestal, pagan priestess, woman devoted to religious life under vows," from Late Latin nonna "nun, tutor," originally (along with masc. nonnus) a term of address to elderly persons, perhaps from children\'s speech, reminiscent of nana (compare Sanskrit nona, Persian nana "mother," Greek nanna "aunt," Serbo-Croatian nena "mother," Italian nonna, Welsh nain "grandmother;" see nanny).']
In [2]:
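The session above uses Python 2 syntax (print statements, u'' strings). A rough Python 3 equivalent that collects the pairs instead of printing them might look like the sketch below. It uses lxml rather than a live Scrapy response, the function name `parse_entries` and the sample markup are hypothetical, and normalize-space() is swapped in for string() to trim the surrounding whitespace.

```python
from lxml import html


def parse_entries(page_source):
    """Return (word, definition) pairs from <dt>/<dd> markup."""
    tree = html.fromstring(page_source)
    entries = []
    for dt in tree.xpath("//dl/dt"):
        # normalize-space() takes the full string value and collapses whitespace.
        word = dt.xpath("normalize-space(a)")
        definition = dt.xpath("normalize-space(following-sibling::dd[1])")
        entries.append((word, definition))
    return entries


# Hypothetical sample modelled on one entry from the session output.
SAMPLE = """
<dl>
  <dt><a href="/index.php?term=addressee">addressee (n.)</a></dt>
  <dd>
    1810; see <a href="/index.php?term=address">address</a> (v.) + -ee.
  </dd>
</dl>
"""

entries = parse_entries(SAMPLE)
for word, definition in entries:
    print(word, "->", definition)
```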
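【Answer 3】: The idea is not to loop over the extracted list of words, but to run relative XPath queries from within each parent node.
I don't have Scrapy on my Mac at the moment, but the same technique should apply, as shown here with lxml: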
# I use lxml for loose html string parsing
from lxml import html
s = '''<dt><a href="/index.php?term=address&allowed_in_frame=0">address (n.)</a> <a href="http://dictionary.reference.com/search?q=address" class="dictionary" title="Look up address at Dictionary.com"><img src="graphics/dictionary.gif" title="Look up address at Dictionary.com" /></a></dt>
<dd>1530s, "dutiful or courteous approach," from <a href="/index.php?term=address&allowed_in_frame=0" class="crossreference">address</a> (v.) and from French <span class="foreign">adresse</span>. Sense of "formal speech" is from 1751. Sense of "superscription of a letter" is from 1712 and led to the meaning "place of residence" (1888).</dd>'''
sel = html.fromstring(s)
# rather than extracting the words straight away, you loop from the parent xpath
for nodes in sel.xpath('//dt'):
    # then access the a node to get the text
    print(nodes.xpath('a/text()'))
    # and go back to the parent and search for the dd node
    print(nodes.xpath('../dd/text()'))
# sample results
['address (n.)']
['1530s, "dutiful or courteous approach," from ', ' (v.) and from French ', '. Sense of "formal speech" is from 1751. Sense of "superscription of a letter" is from 1712 and led to the meaning "place of residence" (1888).']
Hope this helps.
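One caveat worth checking: ../dd works for the single-entry sample because the parent has only one dd. On a page where many dt/dd pairs share a parent, going up to the parent returns every dd for every dt. A small sketch with placeholder markup (an assumption, not the real page) comparing it with following-sibling::dd[1]:

```python
from lxml import html

# Hypothetical list with two entries sharing one parent <dl>.
PAGE = """
<dl>
  <dt><a>address (n.)</a></dt><dd>noun definition</dd>
  <dt><a>address (v.)</a></dt><dd>verb definition</dd>
</dl>
"""

tree = html.fromstring(PAGE)

# Going up to the parent selects all of its dd children, for every dt.
via_parent = [dt.xpath("../dd/text()") for dt in tree.xpath("//dt")]
# The sibling axis with [1] stays with the dd that belongs to each dt.
via_sibling = [dt.xpath("following-sibling::dd[1]/text()") for dt in tree.xpath("//dt")]

print(via_parent)
print(via_sibling)
```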