Python Web Scraping Day 8
Posted by 国民好姐姐
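XPath
Basics
1. Definition
XPath (XML Path Language) is a language for addressing parts of an XML document: it navigates the document's tree structure to select nodes (elements or attributes) and the data they hold.
2. Purpose
Data parsing, that is, extracting data from a page. It works best when the page structure is clear and regular.
3. html, xml, lxml
(1) html is the HyperText Markup Language.
(2) xml is the Extensible Markup Language. HTML is not a subset of XML; the two are related markup languages, and XHTML is the XML-conformant variant of HTML.
(3) lxml is a third-party Python library. It parses HTML text into an element tree and evaluates XPath expressions against that tree.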
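To make point (3) concrete, here is a minimal sketch of lxml turning HTML text into an element tree and running XPath on it. The fragment and the variable names are invented for illustration; the only assumption is that lxml is installed.

```python
from lxml import etree

# A deliberately incomplete fragment: no <html> or <body> wrapper.
fragment = '<p class="note">hello</p>'

# etree.HTML parses the text and returns the root element of the tree.
root = etree.HTML(fragment)

# The parser supplies the missing wrapper elements, so the root is <html>.
print(root.tag)
print(etree.tostring(root, encoding="unicode"))

# Once the text is a tree, XPath can address elements and attributes.
print(root.xpath('/html/body/p/text()'))
print(root.xpath('//p/@class'))
```

Because the parser repairs the markup, an absolute path such as /html/body/p works even though the input never mentioned html or body.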
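Getting started with XPath
1. Steps
(1) Import the library: from lxml import etree
(2) Send a request, receive the response, and get the page source (html)
(3) Build the tree object: tree = etree.HTML(html)
(4) Call tree.xpath() with a path expression such as '/html' to locate the data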
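The four steps can be sketched roughly as follows. This is an illustration rather than the author's code: extract_title and page_source are hypothetical names, and a literal string stands in for the downloaded page so the sketch runs without a network connection.

```python
from lxml import etree  # step 1: import the library


def extract_title(html):
    """Return the text of the page's <title>, or None if there is none."""
    tree = etree.HTML(html)                         # step 3: build the tree
    found = tree.xpath('/html/head/title/text()')   # step 4: evaluate a path
    return found[0] if found else None


# Step 2 stand-in: in a real crawler this string would be the body of an
# HTTP response; a literal is used here so the example works offline.
page_source = (
    "<html><head><title>Day 8</title></head>"
    "<body><p>hello</p></body></html>"
)

print(extract_title(page_source))
```

xpath() always returns a list for node selections like this one, so the function checks for an empty result before indexing.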
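2. The XPath-helper tool
It is only an auxiliary tool, used to check whether a path to the data is correct.
(1) Unzip the extension into a folder named xpath-helper
(2) Open Google Chrome: click the three dots in the top-right corner --> More tools --> Extensions --> make sure Developer mode is on (toggle switched to the right) --> Load unpacked --> select the xpath-helper folder --> restart the browser
(3) Open it with the shortcut Ctrl+Shift+X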
Code
(1) Code XPath_1
from lxml import etree
html = """
<html>
<head>
<title>测试</title>
</head>
<body>
<li class="item-0">first item</li>
<li class="item-1">second item</li>
<li class="item-inactive">third item</li>
<li class="item-1">fourth item</li>
<div>
<li class="item-0">fifth item</li>
</div>
<span>
<li class="item-0">sixth item</li>
<div>
<li class="item-0">eighth item</li>
</div>
</span>
</body>
</html>
"""
tree = etree.HTML(html)
result = tree.xpath('/html')  # / separates levels; the leading / starts at the root node
print(result)  # [<Element html at 0x173ca61f140>]
result2 = tree.xpath('/html/head/title/text()')  # text() returns the text content
print(result2)
li_lst1 = tree.xpath('/html/body/li/text()')
print(li_lst1)  # only the first four: ['first item', 'second item', 'third item', 'fourth item']
li_lst2 = tree.xpath('/html/body/div/li/text()')
print(li_lst2)  # only the fifth: ['fifth item']
li_lst3 = tree.xpath('/html/body/span/li/text()')
print(li_lst3)  # only the sixth: ['sixth item']
li_lst4 = tree.xpath('/html/body/span/div/li/text()')
print(li_lst4)  # only the eighth: ['eighth item']
# / selects direct children; // selects all descendants
li_lst = tree.xpath('/html/body//li/text()')
print(li_lst)  # gets all of them: ['first item', 'second item', 'third item', 'fourth item', 'fifth item', 'sixth item', 'eighth item']
# * is a wildcard that matches any element name at that level
li_lst5 = tree.xpath('/html/body/*/li/text()')
print(li_lst5)  # only the fifth and sixth: ['fifth item', 'sixth item']; equivalent to /html/body/div/li plus /html/body/span/li
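As a small variation on the queries above, the following sketch contrasts /, // and * on a tidier document. The HTML snippet and variable names are made up for illustration and are not part of the original example.

```python
from lxml import etree

html = """
<html>
<body>
<ul><li>one</li><li>two</li></ul>
<div><ul><li>three</li></ul></div>
</body>
</html>
"""
tree = etree.HTML(html)

# A leading // searches the whole document, so no absolute prefix is needed.
all_items = tree.xpath('//li/text()')

# An absolute path with single slashes reaches only the list directly
# under <body>, not the one nested inside <div>.
direct_items = tree.xpath('/html/body/ul/li/text()')

# Each * stands for exactly one element level of any name.
one_level = tree.xpath('/html/body/*/li/text()')
two_levels = tree.xpath('/html/body/*/*/li/text()')

print(all_items)     # ['one', 'two', 'three']
print(direct_items)  # ['one', 'two']
print(one_level)     # ['one', 'two']
print(two_levels)    # ['three']
```

The point of the comparison is that * does not skip levels: the nested item is only reached when the number of wildcards matches its depth, whereas // ignores depth entirely.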
(2) Code XPath_2
from lxml import etree
html = '''
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>test2</title>
</head>
<body>
<div id="songs-list">
<h2 class="title">经典老歌</h2>
<p class="introduction">经典老歌列表</p>
<ul id="list" class="list-group">
<li>123<a class="new a" href="/1.mp3" singer="任贤齐">沧海一声笑</a></li>
<li>456<a href="/2.mp3" singer="齐秦">往事随风</a></li>
<li>789<a href="/3.mp3" singer="beyond">光辉岁月</a></li>
<li>012<a href="/4.mp3" singer="陈慧琳">记事本</a></li>
<li>345<a href="/5.mp3" singer="邓丽君">但愿人长久</a></li>
</ul>
</div>
</body>
</html>
'''
tree = etree.HTML(html)
# Get the text inside a tag
a_lst = tree.xpath('/html/body/div/ul/li/a/text()')
print(a_lst)  # ['沧海一声笑', '往事随风', '光辉岁月', '记事本', '但愿人长久']
# Get an attribute of a tag
href_lst = tree.xpath('/html/body/div/ul/li/a/@href')
print(href_lst)  # ['/1.mp3', '/2.mp3', '/3.mp3', '/4.mp3', '/5.mp3']
# Select one specific tag by matching an attribute value
result = tree.xpath('/html/body/div/ul/li/a[@href="/2.mp3"]/text()')
print(result)  # ['往事随风']
result2 = tree.xpath('/html/body/div/ul/li/a[@href="/2.mp3"]/@singer')
print(result2)  # ['齐秦']
result3 = tree.xpath('/html/body/div/ul/li/a[@class="new a"]/text()')
print(result3)  # ['沧海一声笑']
# Attribute holding several values: contains() does a substring test on the attribute
result4 = tree.xpath('/html/body/div/ul/li/a[contains(@class, "n")]/text()')
print(result4)  # ['沧海一声笑']
# Matching on several attributes: combine conditions with or / and
result5 = tree.xpath('/html/body/div/ul/li/a[contains(@class, "n") or @href="/2.mp3"]/text()')
print(result5)  # ['沧海一声笑', '往事随风']
result6 = tree.xpath('/html/body/div/ul/li/a[contains(@class, "n") and @href="/1.mp3"]/text()')
print(result6)  # ['沧海一声笑']
li_lst = tree.xpath('/html/body/div/ul/li')
# print(li_lst)
for li in li_lst:
    # print(li)
    # li.xpath with ./ is a path relative to the current li node
    r = li.xpath('./a/text()')
    # print(r)
    s = li.xpath('./a/@singer')
    print(s)
li_data = tree.xpath('/html/body/div/ul/li/text()')
for li in li_data:
    print(li)  # prints the digits (123, 456, ...), which were added to the sample HTML later
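The relative-path loop above is usually the step where scraped fields get combined into one record per item. The sketch below shows one way that might look; the shortened HTML, the songs list, and the dictionary keys are illustrative choices, not something the original code defines.

```python
from lxml import etree

html = """
<ul id="list">
<li><a href="/1.mp3" singer="任贤齐">沧海一声笑</a></li>
<li><a href="/2.mp3" singer="齐秦">往事随风</a></li>
<li>no link here</li>
</ul>
"""
tree = etree.HTML(html)

songs = []
for li in tree.xpath('//ul[@id="list"]/li'):
    # ./ makes each query relative to the current <li> element.
    titles = li.xpath('./a/text()')
    if not titles:
        # xpath() returned an empty list: this <li> has no <a>, so skip it.
        continue
    songs.append({
        "title": titles[0],
        "href": li.xpath('./a/@href')[0],
        "singer": li.xpath('./a/@singer')[0],
    })

for song in songs:
    print(song)
```

Querying relative to each li keeps the title, link, and singer of one song together, which separate absolute queries cannot guarantee when some items are missing a field.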