Python3 Crawler 04 (more examples: processing fetched web page content)

Posted by 新美好时代


#!/usr/bin/env python
# -*- coding:utf-8 -*-

import os
import re
import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.qiushibaike.com/")
qiushi = res.content
soup = BeautifulSoup(qiushi, "html.parser")
duanzis = soup.find_all(class_="content")
for i in duanzis:
    duanzi = i.span.contents[0]
    # duanzi = i.span.string  # equivalent here, since the span has one child
    print(duanzi)
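The difference between `i.span.contents[0]` and `i.span.string` used above is easiest to see on a small fragment (a self-contained sketch with made-up HTML, no network needed):

```python
from bs4 import BeautifulSoup

# a tiny stand-in for one qiushibaike "content" block (made-up HTML)
html = '<div class="content"><span>first joke</span></div>'
soup = BeautifulSoup(html, "html.parser")
span = soup.find("span")

print(span.contents[0])  # first joke  (a NavigableString)
print(span.string)       # first joke  (same thing: span has exactly one child)
print(span.get_text())   # first joke  (joins all descendant text)

# .string becomes None as soon as a tag has more than one child
multi = BeautifulSoup('<span>a<b>b</b></span>', "html.parser").span
print(multi.string)      # None
```

So `contents[0]` and `.string` agree only on single-child tags; `get_text()` is the safe choice when the tag may contain nested markup.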


res = requests.get("http://699pic.com/sousuo-218808-13-1-0-0-0.html")
image = res.content
soup = BeautifulSoup(image, "html.parser")
images = soup.find_all(class_="lazy")

for i in images:
    original = i["data-original"]
    title = i["title"]
    # print(title)
    # print(original)
    try:
        with open(os.getcwd() + "\\jpg\\" + title + '.jpg', 'wb') as file:
            file.write(requests.get(original).content)
    except Exception:
        # skip broken download links and titles that are not valid file names
        pass
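The `try`/`except` above mostly guards against titles that cannot be used as Windows file names. A sketch of sanitizing the title up front instead (the `safe_name` helper and its replacement character are my own assumptions, not from the original):

```python
import os
import re

def safe_name(title, ext=".jpg"):
    # replace characters Windows forbids in file names: \ / : * ? " < > |
    # (hypothetical helper, not part of the original script)
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return (cleaned or "untitled") + ext

# os.path.join avoids hard-coding "\\" separators into the path
target = os.path.join(os.getcwd(), "jpg", safe_name('风景图: 山*水?'))
print(safe_name('风景图: 山*水?'))  # 风景图_ 山_水_.jpg
```

With a sanitized name, the `except` only needs to cover genuine download failures, and the same code works on non-Windows systems too.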

r = requests.get("http://699pic.com/sousuo-218808-13-1.html")
fengjing = r.content
soup = BeautifulSoup(fengjing, "html.parser")
# find all tags with the lazy-load class
images = soup.find_all(class_="lazy")
# print(images)  # find_all returns a list of Tag objects

for i in images:
    jpg_rl = i["data-original"]  # the image URL
    title = i["title"]           # the image title
    print(title)
    print(jpg_rl)
    print("")

r = requests.get("http://www.cnblogs.com/nicetime/")
blog = r.content
soup = BeautifulSoup(blog, "html.parser")
# the lxml parser is a faster alternative, but requires the lxml package:
# soup = BeautifulSoup(blog, features="lxml")
print(soup.contents[0].contents)


tag = soup.find('div')                            # first <div> in the page
tag = soup.find(class_="menu-bar menu clearfix")  # first tag with these classes
tag = soup.find(id="menu")                        # first tag with id="menu"; each call overwrites tag
print(list(tag))
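Each `find` call above returns only the first match and overwrites `tag`; CSS selectors via `select` express the same lookups more compactly (a self-contained sketch on a made-up menu fragment, not the real cnblogs markup):

```python
from bs4 import BeautifulSoup

# made-up fragment shaped like the menu looked up above
html = '''
<div id="menu" class="menu-bar menu clearfix">
  <a href="/a">Home</a>
  <a href="/b">Blog</a>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match (or None); select() returns a list
print(soup.find(id="menu")["class"])                   # ['menu-bar', 'menu', 'clearfix']
print(soup.select_one("#menu a")["href"])              # /a
print([a.get_text() for a in soup.select("#menu a")])  # ['Home', 'Blog']
```

Note that the `class` attribute comes back as a list, because a tag can carry several classes at once.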

tag01 = soup.find(class_="c_b_p_desc")

print(len(list(tag01.contents)))
print(len(list(tag01.children)))
print(len(list(tag01.descendants)))

print(tag01.contents)
print(tag01.children)  # children is a generator, so this prints the generator object
for i in tag01.children:
    print(i)


print(len(tag01.contents))

for i in tag01:  # iterating a tag directly also yields its children
    print(i)

print(tag01.contents[0].string)
print(tag01.contents[1])
print(tag01.contents[1].string)
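Why `contents`, `children`, and `descendants` can give different lengths: the first two only cover direct children, while `descendants` walks the whole subtree (a self-contained sketch on made-up HTML):

```python
from bs4 import BeautifulSoup

html = '<div>intro<p>hello <b>world</b></p></div>'
div = BeautifulSoup(html, "html.parser").div

print(len(div.contents))           # 2: the text 'intro' and the <p> tag
print(len(list(div.children)))     # 2: same nodes, as a generator
print(len(list(div.descendants)))  # 5: 'intro', <p>, 'hello ', <b>, 'world'
```

`contents` is a plain list, `children` is a lazy generator over the same nodes, and `descendants` flattens every tag and text node below the element.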


url = "http://www.dygod.net/html/tv/oumeitv/109673.html"
s = requests.get(url)
# the server omits the charset, so requests decodes the body as iso-8859-1;
# re-encode and decode as gbk to recover the Chinese text
print(s.text.encode("iso-8859-1").decode('gbk'))
res = re.findall('href="(.*?)">ftp', s.text)
for resi in res:
    a = resi.encode("iso-8859-1").decode('gbk')
    print(a)
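The `encode("iso-8859-1").decode('gbk')` trick works because iso-8859-1 maps every byte to a character losslessly, so the wrong decoding can be undone. The round trip can be checked without any network access (a minimal stdlib sketch; the sample string is made up):

```python
# a gbk-encoded page body, as bytes on the wire
raw = "电影天堂".encode("gbk")

# what you get when the body is wrongly decoded as iso-8859-1: mojibake
wrong = raw.decode("iso-8859-1")
print(wrong)  # unreadable, but no information was lost

# undo the wrong decoding, then decode with the real charset
fixed = wrong.encode("iso-8859-1").decode("gbk")
print(fixed)  # 电影天堂
```

In practice, setting `s.encoding = s.apparent_encoding` before reading `s.text` lets requests guess the real charset and avoids the manual round trip.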
