A Beginner's Web Scraper
Here is a first snippet:
import requests

url = "https://en.wikipedia.org/wiki/Steve_Jobs"
res = requests.get(url)
print(res.status_code)  # 200 means the request succeeded
# Write the decoded page text to a local file as UTF-8.
with open('a.html', 'w', encoding='utf-8') as f:
    f.write(res.text)
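This saves a web page to disk. On Windows, Python's open() defaults to the system locale encoding rather than UTF-8, so a page containing characters outside that code page can fail to write. Passing encoding='utf-8' to open() avoids the problem.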
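The encoding point above can be checked without any network access. The following is a minimal sketch, not part of the original post: the sample string, the temporary file path, and the choice of cp1252 as a stand-in for a legacy Windows code page are all assumptions made for illustration. The actual default encoding depends on the system.

```python
import os
import tempfile

# Sample text mixing ASCII and Chinese characters (hypothetical data).
text = "Steve Jobs 史蒂夫·乔布斯"

# cp1252 is used here only as an example of a legacy Windows code page.
# It cannot represent the Chinese characters, so encoding fails.
try:
    text.encode("cp1252")
    legacy_ok = True
except UnicodeEncodeError:
    legacy_ok = False

# Writing and reading with an explicit UTF-8 encoding round-trips cleanly.
path = os.path.join(tempfile.mkdtemp(), "a.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

print(legacy_ok, round_trip == text)
```

Reading the file back should use the same encoding that was used to write it, which is why the second snippet also passes encoding="utf-8".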
Here is a second snippet:
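import re

from lxml import etree

# Load the page saved by the first snippet.
with open("a.html", "r", encoding="utf-8") as f:
    c = f.read()

tree = etree.HTML(c)
table_element = tree.xpath("//table[@class='infobox biography vcard']")
table_row = tree.xpath("//table[@class='infobox biography vcard'][1]/tbody/tr")
# Non-greedy pattern that matches any HTML tag, used to strip markup.
pattern_attrib = re.compile("<.*?>")
# print(table_element)
# infobox biography vcard
for row in table_row:
    try:
        # Header cell of the infobox row.
        thead = row.xpath("th")[0]
        title = etree.tostring(thead).decode("utf-8")
        title = pattern_attrib.sub(" ", title)
        # Data cell of the infobox row.
        desc = row.xpath("td")[0]
        desc = etree.tostring(desc).decode("utf-8")
        desc = pattern_attrib.sub(" ", desc)
        print(title + ":" + desc)
        print("=========")
    except Exception as err:
        # Rows without both a th and a td raise IndexError; report and continue.
        print(err)
        # pass

# Print each heading and paragraph of the article body.
content = tree.xpath("//div[@id='mw-content-text'][1]//*[self::h2 or self::p]")
for line in content:
    # line.text stops at the first child element and may be None,
    # so join all text nodes of the element instead.
    print("".join(line.itertext()))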
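The tag-stripping step in the second snippet can be tried in isolation. This is a minimal sketch under stated assumptions, not the original author's code: strip_tags is a hypothetical helper, the sample fragments are made up, and the entity decoding and whitespace collapsing are additions beyond what the snippet does. They are included because etree.tostring() returns serialized markup, which can contain entities such as &amp;amp;, and because replacing each tag with a space leaves extra blanks.

```python
import html
import re

# Same non-greedy idea as the snippet's pattern; DOTALL lets a tag span lines.
TAG_PATTERN = re.compile(r"<.*?>", re.DOTALL)


def strip_tags(fragment: str) -> str:
    """Replace each tag with a space, decode entities, collapse whitespace."""
    without_tags = TAG_PATTERN.sub(" ", fragment)
    decoded = html.unescape(without_tags)
    return " ".join(decoded.split())


# Hypothetical fragments shaped like serialized th/td cells.
print(strip_tags('<th scope="row">Born</th>'))
print(strip_tags("<td>a<br/>b &amp; c</td>"))
```

A regular expression is adequate for stripping the simple cell markup shown here, but it is not a general HTML parser. For nested or malformed markup, reading text through the parsed tree (for example with itertext()) tends to be more reliable.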