XPath for Python Web Scraping

Posted by wuwen19940508


Basic Syntax

Expression  Description
node        Selects all child nodes of this node; a tag name or * matches any tag
/           Selects from the root node; direct children only, excluding deeper descendants (grandchildren and below)
//          Selects matching nodes anywhere in the document, regardless of their position; includes all descendants
.           Selects the current node
..          Selects the parent of the current node
@           Selects attributes
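A minimal sketch of these axes, assuming lxml is installed, using a small in-memory document:

```python
from lxml import etree

doc = etree.HTML(
    "<html><body><div id='a'><p>one</p><span><p>two</p></span></div></body></html>"
)

# / walks direct children only: <body> has no direct <p> child,
# so this returns an empty list
print(doc.xpath('/html/body/p'))           # []

# // matches descendants at any depth, so both <p> elements are found
print([p.text for p in doc.xpath('//p')])  # ['one', 'two']

# @ selects attribute values
print(doc.xpath('//div/@id'))              # ['a']
```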

@ Attributes

XPath queries nodes in the DOM tree by path.

Attributes are selected with the @ symbol.

<a rel="nofollow" class="external text" href="http://google.ac">goole<wbr/>.ac</a>

rel, class, and href are all attributes; the corresponding element can be selected with "//*[@class='external text']".

The = operator requires an exact attribute match; the contains() function allows partial matching. For example,

"//*[contains(@class, 'external')]" matches the element above, while

"//*[@class='external']" does not.
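A short sketch (lxml assumed installed) contrasting exact @class matching with contains() on the anchor tag from above:

```python
from lxml import etree

doc = etree.HTML(
    '<a rel="nofollow" class="external text" href="http://google.ac">google.ac</a>'
)

# Exact match: the full attribute value must equal the string
print(len(doc.xpath("//*[@class='external text']")))       # 1
print(len(doc.xpath("//*[@class='external']")))            # 0

# Partial match with contains() succeeds on a substring
print(len(doc.xpath("//*[contains(@class, 'external')]"))) # 1
```

Note that contains() is plain substring matching, so 'external' would also match a class like 'external-link'.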

Operators

The and and or operators:

  Select elements that are p, span, or h1 tags:

soup = tree.xpath('//td[@class="editor bbsDetailContainer"]//*[self::p or self::span or self::h1]')

  Select elements whose class is editor or tag:

soup = tree.xpath('//td[@class="editor" or @class="tag"]')
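Both uses of or can be sketched on a small self-contained table (lxml assumed installed; the markup here is made up for illustration):

```python
from lxml import etree

doc = etree.HTML("""
<table>
  <tr><td class="editor"><p>p1</p><span>s1</span><h1>h1</h1><em>skip</em></td></tr>
  <tr><td class="tag">t</td></tr>
  <tr><td class="other">o</td></tr>
</table>
""")

# self::p or self::span or self::h1 keeps only those three tags
# among all descendants; the <em> is filtered out
hits = doc.xpath('//td[@class="editor"]//*[self::p or self::span or self::h1]')
print([e.tag for e in hits])  # ['p', 'span', 'h1']

# or between attribute tests matches either class value
tds = doc.xpath('//td[@class="editor" or @class="tag"]')
print(len(tds))  # 2
```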

 Demo

import lxml.html
from lxml import etree

from bs4 import BeautifulSoup

with open('jd.com_2131674.html', 'r', encoding='utf-8') as f:
    content = f.read()

tree = etree.HTML(content)

print('--------------------------------------------')
print("# different quote //*[@class='p-price J-p-2131674']")
print('--------------------------------------------')
print(tree.xpath("//*[@class='p-price J-p-2131674']"))
print()

print('--------------------------------------------')
print("# partial match //*[@class='J-p-2131674']")
print('--------------------------------------------')
print(tree.xpath("//*[@class='J-p-2131674']"))
print()

print('--------------------------------------------')
print('# exactly match class string //*[@class="p-price J-p-2131674"]')
print('--------------------------------------------')
print(tree.xpath('//*[@class="p-price J-p-2131674"]'))
print()

print('--------------------------------------------')
print("# use contains //*[contains(@class, 'J-p-2131674')]")
print('--------------------------------------------')
print(tree.xpath("//*[contains(@class, 'J-p-2131674')]"))
print()

print('--------------------------------------------')
print("# specify tag name //strong[contains(@class, 'J-p-2131674')]")
print('--------------------------------------------')
print(tree.xpath("//strong[contains(@class, 'J-p-2131674')]"))
print()

print('--------------------------------------------')
print("# css selector with tag: cssselect('strong.J-p-2131674')")
print('--------------------------------------------')
htree = lxml.html.fromstring(content)
print(htree.cssselect('strong.J-p-2131674'))
print()

print('--------------------------------------------')
print("# css selector without tag, partial match: cssselect('.J-p-2131674')")
print('--------------------------------------------')
print(htree.cssselect('.J-p-2131674'))
print()

print('--------------------------------------------')
print('# attrib and text')
print('--------------------------------------------')
for element in tree.xpath("//strong[contains(@class, 'J-p-2131674')]"):
    print(element.text)
    print(element.attrib)
print()

print('--------------------------------------------')
print('########## use BeautifulSoup ##############')
print('--------------------------------------------')
print('# loading content into BeautifulSoup')
soup = BeautifulSoup(content, 'html.parser')
print('# loaded, show result')
print(soup.find(attrs={'class': 'J-p-2131674'}).text)

 
