python提取页面信息beautifulsoup正则lxml
Posted
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了python提取页面信息beautifulsoup正则lxml相关的知识,希望对你有一定的参考价值。
beautifulsoup正则lxml
# -*- coding: utf-8 -*- import re from urllib.request import urlopen from urllib.request import Request from bs4 import BeautifulSoup from lxml import etree #添加模拟浏览器协议头 headers = {‘User-Agent‘:‘Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6‘} url = "http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E5%8C%97%E4%BA%AC&kw=python&sm=0&p=1" req_timeout = 5 req = Request(url=url,headers=headers) f = urlopen(req,None,req_timeout) s = f.read() s = s.decode(‘utf-8‘) ss = str(s) #lxml提取 selector = etree.HTML(ss) links = selector.xpath(‘//tr/td[@class="zwmc"]/div/a/@href|//tr/td[@class="zwmc"]/div/a/text()‘) for link in links: print(link) ‘‘‘ #beautifulsoup提取 soup = BeautifulSoup(ss,‘html.parser‘) aList = soup.find_all("tr") for item in aList: aList1 = item.find_all("a") for item1 in aList1: print(item1.get(‘href‘)) print(item1.get_text()) break #print(item) #print(item.get(‘href‘)) #print(item.get_text()) ‘‘‘ #正则提取 ‘‘‘ mm = re.findall(‘<div style="width: 224px;*width: 218px; _width:200px; float: left"><a style=\"font-weight: bold\" par=\"(.*)\" href=\"(.*)\" target=\"_blank\">(.*)</a>‘,ss) print(mm) ‘‘‘
以上是关于python提取页面信息beautifulsoup正则lxml的主要内容,如果未能解决你的问题,请参考以下文章
python爬虫之Beautiful Soup库,基本使用以及提取页面信息
python爬虫之Beautiful Soup库,基本使用以及提取页面信息
Python网络爬虫与信息提取—— BeautifulSoup