Python: The lxml Parsing Module

Posted 样子2018

lxml is an HTML/XML parser. Its main job is to parse HTML/XML documents and extract data from them.

I. lxml examples

1. Getting started

# Use the etree module from lxml
from lxml import etree

# Note: the last <li> below is deliberately missing its closing </li> tag
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''

# Parse the string into an HTML document with etree.HTML
html = etree.HTML(text)

# Serialize the HTML document back to a string (tostring returns bytes)
result = etree.tostring(html)

print(result.decode('utf-8'))

Result (lxml adds the missing </li> and wraps the fragment in <html> and <body>)

<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
 </div>
</body></html>
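
The repair step described above can be checked on a smaller fragment. This is a minimal sketch: the variable names `broken`, `root` and `repaired` are illustrative, and `encoding="unicode"` is used so that `tostring` returns a `str` rather than bytes.

```python
from lxml import etree

# Fragment with an unclosed <li>, mirroring the example above.
broken = '<div><ul><li class="item-0"><a href="link5.html">fifth item</a></ul></div>'

root = etree.HTML(broken)

# tostring() returns bytes by default; encoding="unicode" asks for a str instead.
repaired = etree.tostring(root, encoding="unicode")
print(repaired)
```

The parsed tree always has an <html> root, even when the input was only a fragment.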

2. Reading content from a file

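from lxml import etree

# Read the external file hello.html
# (etree.parse uses the XML parser by default, so the file must be well-formed)
html = etree.parse('./hello.html')
result = etree.tostring(html, pretty_print=True)

print(result.decode('utf-8'))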
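
Real HTML files are often not well-formed, and in that case the default XML parser used by etree.parse raises an error. A possible workaround, sketched below, is to pass etree.HTMLParser() as the second argument. The temporary file and its markup are made up for the sake of a self-contained example.

```python
import os
import tempfile
from lxml import etree

# Write a small HTML file with an unclosed <li>; the path is a throwaway temp file.
markup = '<div><ul><li><a href="link1.html">first item</a></ul></div>'
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False, encoding="utf-8") as f:
    f.write(markup)
    path = f.name

try:
    # The default XML parser would reject this file; HTMLParser recovers from it.
    tree = etree.parse(path, etree.HTMLParser())
    hrefs = tree.xpath('//li/a/@href')
finally:
    os.remove(path)

print(hrefs)
```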

3. Contents of hello.html

<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
 </div>
</body></html>

@1. Get all <li> tags

from lxml import etree

html = etree.parse('hello.html')
print(type(html))  # show the type returned by etree.parse()

result = html.xpath('//li')

print(result)  # print the list of <li> elements
print(len(result))
print(type(result))
print(type(result[0]))


Result
<class 'lxml.etree._ElementTree'>
[<Element li at 0x1014e0e18>, <Element li at 0x1014e0ef0>, <Element li at 0x1014e0f38>, <Element li at 0x1014e0f80>, <Element li at 0x1014e0fc8>]
5
<class 'list'>
<class 'lxml.etree._Element'>
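
Since xpath returns an ordinary list of elements, the usual next step is to loop over it. The sketch below uses an inline two-item document instead of hello.html so that it runs on its own. `get()` and `find()` are standard element methods, and the name `rows` is illustrative.

```python
from lxml import etree

doc = etree.HTML('''
<ul>
  <li class="item-0"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
</ul>
''')

items = doc.xpath('//li')

# Each result is an element: get() reads an attribute, find() locates a child.
rows = [(li.get('class'), li.find('a').text) for li in items]
print(rows)
```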

@2. Get the class attribute of every <li> tag

from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li/@class')

print(result)

Result
['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']
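
Attribute queries return string values, so they can be passed directly to ordinary Python tools. As a small illustration (an assumed use case, not part of the original walkthrough), the sketch below counts how often each class occurs.

```python
from collections import Counter
from lxml import etree

doc = etree.HTML('''
<ul>
  <li class="item-0">a</li>
  <li class="item-1">b</li>
  <li class="item-inactive">c</li>
  <li class="item-1">d</li>
  <li class="item-0">e</li>
</ul>
''')

classes = doc.xpath('//li/@class')

# Convert to plain str before counting; lxml returns a str subclass.
counts = Counter(str(c) for c in classes)
print(counts)
```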

@3. Get the <a> tag under <li> whose href is link1.html

from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li/a[@href="link1.html"]')

print(result)

Result

[<Element a at 0x10ffaae18>]
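
When the value to match comes from a variable, it does not need to be pasted into the expression string. lxml accepts XPath variables as keyword arguments to xpath(). In the sketch below, `$href` and `target` are illustrative names.

```python
from lxml import etree

doc = etree.HTML('''
<ul>
  <li class="item-0"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
</ul>
''')

target = 'link1.html'

# $href in the expression is bound to the keyword argument of the same name.
matches = doc.xpath('//li/a[@href=$href]', href=target)
print([a.text for a in matches])
```

This avoids quoting problems when the value itself contains quotation marks.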

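@4. Get all <span> tags under <li>

from lxml import etree

# Note: the hello.html listing in section 3 shows no <span>. The results in
# @4, @5 and @8 presuppose a <span class="bold"> nested inside one of the <a> elements.
html = etree.parse('hello.html')

# result = html.xpath('//li/span')
# The line above is wrong:
# a single / selects direct children only, and <span> is not a direct child
# of <li>, so a double slash is needed to search all descendants.

result = html.xpath('//li//span')

print(result)

Result

[<Element span at 0x10d698e18>]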
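
The difference between / and // can be seen side by side on a one-item document. The markup below is an assumption made for illustration: it nests a <span> inside the <a>, which is the structure the result above implies.

```python
from lxml import etree

doc = etree.HTML(
    '<ul><li><a href="link3.html"><span class="bold">third item</span></a></li></ul>'
)

# Single slash: only direct children of <li>, so the nested <span> is missed.
direct = doc.xpath('//li/span')

# Double slash: any descendant of <li>, however deeply nested.
nested = doc.xpath('//li//span')

print(direct, nested)
```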

@5. Get every class attribute inside the <a> tags under <li>

from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li/a//@class')

print(result)

Result

['bold']

@6. Get the href of the <a> in the last <li>

from lxml import etree

html = etree.parse('hello.html')

result = html.xpath('//li[last()]/a/@href')
# the predicate [last()] selects the last element

print(result)

Result

['link5.html']
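
One detail worth knowing is that a positional predicate such as [last()] is evaluated per parent, not over the whole document. With a single <ul> the two readings give the same result, but with several lists they differ. The two-list document below is made up to show this.

```python
from lxml import etree

doc = etree.HTML('''
<div>
  <ul><li><a href="a.html">a</a></li><li><a href="b.html">b</a></li></ul>
  <ul><li><a href="c.html">c</a></li><li><a href="d.html">d</a></li></ul>
</div>
''')

# [last()] is evaluated per parent: the last <li> of each <ul>.
last_per_list = doc.xpath('//li[last()]/a/@href')

# Parenthesising first collects all <li> elements, then picks the final one.
last_overall = doc.xpath('(//li)[last()]/a/@href')

# XPath positions start at 1, not 0.
first_per_list = doc.xpath('//li[1]/a/@href')

print(last_per_list, last_overall, first_per_list)
```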

@7. Get the content of the second-to-last element

from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li[last()-1]/a')

# the text attribute holds the element's text content
print(result[0].text)

Result

fourth item
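
The text attribute only covers the text that comes directly before an element's first child, so it is None when the text sits inside a nested tag. The sketch below, on an assumed two-item document, compares it with itertext() and the XPath string() function, both of which gather text from descendants as well.

```python
from lxml import etree

doc = etree.HTML(
    '<ul>'
    '<li><a href="link3.html"><span class="bold">third item</span></a></li>'
    '<li><a href="link4.html">fourth item</a></li>'
    '</ul>'
)

links = doc.xpath('//li/a')

# .text is None for the first link because its text lives inside <span>.
direct_text = [a.text for a in links]

# itertext() walks all descendant text nodes.
full_text = [''.join(a.itertext()) for a in links]

# string() in XPath gives the concatenated text of the first matched node.
via_xpath = doc.xpath('string(//li[1]/a)')

print(direct_text, full_text, via_xpath)
```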

@8. Get the tag name of the element whose class is bold

from lxml import etree

html = etree.parse('hello.html')

result = html.xpath('//*[@class="bold"]')

# the tag attribute holds the tag name
print(result[0].tag)

Result

span
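
The test @class="bold" compares the whole attribute value, so it misses elements that carry several classes. A common XPath 1.0 idiom for matching one class token is sketched below. The sample markup is invented, and the idiom is offered as one possible approach rather than anything the original examples rely on.

```python
from lxml import etree

doc = etree.HTML(
    '<p>'
    '<span class="bold">one</span>'
    '<span class="bold large">two</span>'
    '<span class="semibold">three</span>'
    '</p>'
)

# Exact comparison: only the element whose class is exactly "bold".
exact = doc.xpath('//*[@class="bold"]')

# Token match: pad the class list with spaces and look for " bold ".
token = doc.xpath(
    '//*[contains(concat(" ", normalize-space(@class), " "), " bold ")]'
)

print([e.text for e in exact], [e.text for e in token])
```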

 
