Learning BeautifulSoup4: document parsing flowchart

Posted by kmingspirit


url = http://zst.aicai.com/ssq/openInfo/

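Sports lottery draw results: one approach is to match the HTML with regular expressions; the other uses what amounts to a framework, an XML-style parser, to parse the HTML. Neither method is strictly better or worse than the other, and the one that feels more convenient or takes less code is not necessarily the easier one. If you have the energy, it is still worth building a solid foundation in regular expressions.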

import http.cookiejar
import urllib.request


def getHtml(url):
    # Build an opener that keeps cookies and sends browser-like headers.
    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'), ('Cookie', '4564564564564564565646540')]

    urllib.request.install_opener(opener)

    # Download the page and decode the bytes as UTF-8.
    html_bytes = urllib.request.urlopen(url).read()
    html_string = html_bytes.decode('utf-8')
    return html_string

# url = http://zst.aicai.com/ssq/openInfo/
# Final output looks like: 2015075 draw numbers: 6,11,13,19,21,32, blue ball: 4
html = getHtml("http://zst.aicai.com/ssq/openInfo/")
# <table class="fzTab nbt"> </table>

start = html.find('<table class="fzTab nbt">')
# Look for the closing tag after the table start, so an earlier </table> on the page is not picked up.
table = html[start : html.find('</table>', start)]
# print(table)
# <tr onmouseout="this.style.background=''" onmouseover="this.style.background='#fff7d8'">
# <tr \r\n\t\t                  onmouseout=
tmp = table.split('<tr \r\n\t\t                  onmouseout=', 1)
# print(tmp)
# print(len(tmp))
trs = tmp[1]
tr = trs[: trs.find('</tr>')]
# print(tr)
number = tr.split('<td   >')[1].split('</td>')[0]
print(number + ' draw numbers: ', end='')
redtmp = tr.split('<td  class="redColor sz12" >')
reds = redtmp[1:len(redtmp)-1]  # drop the first and last elements, which are not needed
# print(reds)
for redstr in reds:
    print(redstr.split('</td>')[0] + ",", end='')
print(' blue ball: ', end='')
blue = tr.split('<td  class="blueColor sz12" >')[1].split('</td>')[0]
print(blue)
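
The listing above relies on `find` and `split` rather than the `re` module. For comparison, here is a minimal sketch of the same extraction written with regular expressions. It runs on a hypothetical sample row (`SAMPLE_ROW`) modeled on the markup strings used above, not on the live page, and the helper names `parse_row` and `format_result` are made up for illustration.

```python
import re

# Hypothetical sample row, modeled on the markup strings in the listing above.
# The real page may differ in whitespace and in the number of columns.
SAMPLE_ROW = """
<tr onmouseout="this.style.background=''" onmouseover="this.style.background='#fff7d8'">
<td   >2015075</td>
<td  class="redColor sz12" >6</td>
<td  class="redColor sz12" >11</td>
<td  class="redColor sz12" >13</td>
<td  class="redColor sz12" >19</td>
<td  class="redColor sz12" >21</td>
<td  class="redColor sz12" >32</td>
<td  class="blueColor sz12" >4</td>
</tr>
"""


def parse_row(row_html):
    """Return (issue, reds, blue) extracted from one <tr> fragment."""
    # First attribute-less <td> that holds only digits: the issue number.
    issue = re.search(r'<td\s*>\s*(\d+)\s*</td>', row_html)
    # Every cell whose class attribute starts with redColor / blueColor.
    reds = re.findall(r'<td\s+class="redColor[^"]*"\s*>\s*(\d+)\s*</td>', row_html)
    blue = re.search(r'<td\s+class="blueColor[^"]*"\s*>\s*(\d+)\s*</td>', row_html)
    if issue is None or blue is None:
        raise ValueError("row does not match the assumed layout")
    return issue.group(1), reds, blue.group(1)


def format_result(issue, reds, blue):
    """Build the one-line summary from the parsed values."""
    return issue + " draw numbers: " + ",".join(reds) + ", blue ball: " + blue


if __name__ == "__main__":
    print(format_result(*parse_row(SAMPLE_ROW)))
```

Using `\s*` and `\s+` in the patterns makes the match tolerant of the irregular spacing inside the tags, which the exact-string `split` calls are not.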
from bs4 import BeautifulSoup

import http.cookiejar
import urllib.request


def getHtml(url):
    # Build an opener that keeps cookies and sends browser-like headers.
    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-Agent',
                          'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'),
                         ('Cookie', '4564564564564564565646540')]

    urllib.request.install_opener(opener)

    # Download the page and decode the bytes as UTF-8.
    html_bytes = urllib.request.urlopen(url).read()
    html_string = html_bytes.decode('utf-8')
    return html_string

html_doc = getHtml("http://zst.aicai.com/ssq/openInfo/")
soup = BeautifulSoup(html_doc, 'html.parser')

# print(soup.title)
# table = soup.find_all('table', class_='fzTab')
# print(table)  # rows like <tr onmouseout="this.style.background=''" were missing from this result

# Find the first row directly by its onmouseout attribute.
tr = soup.find('tr', attrs={"onmouseout": "this.style.background=''"})
# print(tr)
tds = tr.find_all('td')
opennum = tds[0].get_text()
# print(opennum)

# Cells 2 to 7 hold the six red balls; cell 8 holds the blue ball.
reds = []
for i in range(2, 8):
    reds.append(tds[i].get_text())
# print(reds)
blue = tds[8].get_text()
# print(blue)

# Convert a list to a string: (',').join(list)
# Final output looks like: 2015075 draw numbers: 6,11,13,19,21,32, blue ball: 4
print(opennum + ' draw numbers: ' + (',').join(reds) + ", blue ball: " + blue)
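
The positional indexes in the listing above (`tds[2]` to `tds[8]`) depend on the exact column order of the page. As a sketch of an alternative, the cells can be selected by the `redColor` and `blueColor` classes that appear in the page markup. The example below runs offline on a hypothetical `SAMPLE_HTML` fragment, so the structure of the real page is an assumption here, and `parse_first_draw` is a name made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the class names seen in the page markup.
# The live page likely has more columns and rows.
SAMPLE_HTML = """
<table class="fzTab nbt">
<tr onmouseout="this.style.background=''">
<td>2015075</td>
<td class="redColor sz12">6</td>
<td class="redColor sz12">11</td>
<td class="redColor sz12">13</td>
<td class="redColor sz12">19</td>
<td class="redColor sz12">21</td>
<td class="redColor sz12">32</td>
<td class="blueColor sz12">4</td>
</tr>
</table>
"""


def parse_first_draw(html_text):
    """Return (issue, reds, blue) for the first data row, or None if no row matches."""
    soup = BeautifulSoup(html_text, "html.parser")
    # class_ matches any one of the space-separated classes, so "fzTab" finds "fzTab nbt".
    table = soup.find("table", class_="fzTab")
    if table is None:
        return None
    for tr in table.find_all("tr"):
        first_td = tr.find("td")
        reds = [td.get_text(strip=True) for td in tr.find_all("td", class_="redColor")]
        blue_td = tr.find("td", class_="blueColor")
        # Skip header rows or anything else without ball cells.
        if first_td is not None and reds and blue_td is not None:
            return first_td.get_text(strip=True), reds, blue_td.get_text(strip=True)
    return None


if __name__ == "__main__":
    result = parse_first_draw(SAMPLE_HTML)
    if result is not None:
        issue, reds, blue = result
        print(issue + " draw numbers: " + ",".join(reds) + ", blue ball: " + blue)
```

Returning `None` instead of indexing blindly avoids an `AttributeError` or `IndexError` when the page layout changes or the request returns an unexpected document.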

 
