Basic syntax
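Expression | Description |
node | Selects child elements of the current node by name: give a tag name, or * to match any tag |
/ | At the start of a path, selects from the root node; between steps, selects direct children only, not deeper descendants (grandchildren, great-grandchildren) |
// | Selects matching nodes anywhere below the current node, regardless of their position, including all descendants |
. | Selects the current node |
.. | Selects the parent of the current node |
@ | Selects attributes |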
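The expressions in the table can be tried out with lxml. The following is a minimal sketch; the XML document and its element names (root, item, group, name) are invented for illustration and do not come from any real page.

```python
from lxml import etree

# Hypothetical document, invented for illustration only.
doc = etree.fromstring(
    "<root>"
    "<item id='a'><name>first</name></item>"
    "<group><item id='b'><name>second</name></item></group>"
    "</root>"
)

# "/" steps to direct children only: just the <item> directly under <root>.
direct_ids = doc.xpath("/root/item/@id")

# "//" searches all descendants, so the nested <item> is found as well.
all_ids = doc.xpath("//item/@id")

# "." is the current node and ".." is its parent.
name = doc.xpath("//name")[0]
same_node = name.xpath(".")[0]
parent_tag = name.xpath("..")[0].tag

# "*" matches any tag: here, the direct children of <root>.
child_tags = [el.tag for el in doc.xpath("/root/*")]

print(direct_ids, all_ids, parent_tag, child_tags)
```

Note that "@id" at the end of a path returns the attribute values as strings rather than elements.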
@ attributes
XPath queries nodes in the DOM tree using path expressions.
Use the @ symbol to select attributes.
<a rel="nofollow" class="external text" href="http://google.ac">goole<wbr/>.ac</a>
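rel, class, and href are all attributes. The element can be selected by attribute with "//*[@class='external text']".
The = operator requires the attribute value to match exactly; use the contains function for a partial match. For example,
"//*[contains(@class,'external')]" matches, whereas
"//*[@class='external']" does not.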
Operators
The and and or operators:
Select elements whose tag is p, span, or h1:
soup = tree.xpath('//td[@class="editor bbsDetailContainer"]//*[self::p or self::span or self::h1]')
Select elements whose class is editor or tag:
soup = tree.xpath('//td[@class="editor" or @class="tag"]')
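Both queries can be exercised on a small table. The markup below is hypothetical and only reuses the class names from the text; the last query is an added illustration of and, which the examples above do not show.

```python
from lxml import etree

# Hypothetical markup, invented for illustration; class names mirror the text.
page = etree.HTML(
    "<table><tr>"
    "<td class='editor'><p>para</p><div>skip</div>"
    "<span>inline</span><h1>title</h1></td>"
    "<td class='tag'>label</td>"
    "<td class='other'>ignored</td>"
    "</tr></table>"
)

# self::p or self::span or self::h1 keeps only descendants with those tags.
picked = page.xpath("//td[@class='editor']//*[self::p or self::span or self::h1]")
picked_tags = [el.tag for el in picked]

# or between two attribute tests: class is exactly 'editor' or exactly 'tag'.
cells = page.xpath("//td[@class='editor' or @class='tag']")
cell_classes = [td.get("class") for td in cells]

# and requires both conditions: class 'editor' and at least one <p> child.
both = page.xpath("//td[@class='editor' and p]")

print(picked_tags, cell_classes, len(both))
```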
Demo
import lxml.html
from lxml import etree
from bs4 import BeautifulSoup

SEPARATOR = '--------------------------------------------'


def banner(title):
    # Print a section title framed by separator lines.
    print(SEPARATOR)
    print(title)
    print(SEPARATOR)


# Read the saved page as bytes; decode explicitly for etree.HTML.
with open('jd.com_2131674.html', 'rb') as f:
    content = f.read()
tree = etree.HTML(content.decode('utf-8'))

banner('# different quote //*[@class="p-price J-p-2131674"]')
print(tree.xpath("//*[@class='p-price J-p-2131674']"))
print('')

# "=" compares the whole attribute value, so one class name alone is not enough.
banner("# partial match //*[@class='J-p-2131674']")
print(tree.xpath("//*[@class='J-p-2131674']"))
print('')

banner('# exactly match class string //*[@class="p-price J-p-2131674"]')
print(tree.xpath('//*[@class="p-price J-p-2131674"]'))
print('')

banner("# use contain //*[contains(@class, 'J-p-2131674')]")
print(tree.xpath("//*[contains(@class, 'J-p-2131674')]"))
print('')

banner("# specify tag name //strong[contains(@class, 'J-p-2131674')]")
print(tree.xpath("//strong[contains(@class, 'J-p-2131674')]"))
print('')

# cssselect() needs the separate cssselect package to be installed.
htree = lxml.html.fromstring(content)
banner("# css selector with tag cssselect('strong.J-p-2131674')")
print(htree.cssselect('strong.J-p-2131674'))
print('')

banner("# css selector without tag, partial match cssselect('.J-p-2131674')")
elements = htree.cssselect('.J-p-2131674')
print(elements)
print('')

banner('# attrib and text')
for element in tree.xpath("//strong[contains(@class, 'J-p-2131674')]"):
    print(element.text)
    print(element.attrib)
print('')

banner('########## use BeautifulSoup ##############')
print('# loading content to BeautifulSoup')
soup = BeautifulSoup(content, 'html.parser')
print('# loaded, show result')
# BeautifulSoup matches against each class name individually.
print(soup.find(attrs={'class': 'J-p-2131674'}).text)
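The demo depends on a saved copy of the page, so here is a self-contained sketch of the same comparison. The markup and the price values are assumptions, not the real page. It also shows one common way to make XPath match a whole class name, as the CSS selector .J-p-2131674 does, instead of a substring.

```python
from lxml import etree

# Hypothetical snippet imitating the price element; values are made up.
sample = (
    '<div>'
    '<strong class="p-price J-p-2131674">99.00</strong>'
    '<strong class="p-price J-p-21316749">12.00</strong>'
    '</div>'
)
tree = etree.HTML(sample)

# Substring test: also catches the longer class name J-p-21316749.
substring_hits = tree.xpath("//strong[contains(@class, 'J-p-2131674')]")

# Whole-name test: pad the class list with spaces and look for ' name '.
token_hits = tree.xpath(
    "//strong[contains(concat(' ', normalize-space(@class), ' '), "
    "' J-p-2131674 ')]"
)

texts_substring = [el.text for el in substring_hits]
texts_token = [el.text for el in token_hits]

# .text is the element's leading text; .attrib behaves like a dict.
attrs = dict(token_hits[0].attrib)

print(texts_substring, texts_token, attrs)
```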