Python crawler: replacing image links in an online HTML page with local links and downloading the HTML file

Posted 2023-03-25

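Foreword: this article was compiled by the editors of 小常识网 (cha138.com). It covers how a Python crawler can replace the image links in an online HTML page with local links and save the HTML file locally. We hope it is a useful reference.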

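As the title says: I can already download the HTML file and the images, but the images in the page are still referenced by link. How can I change the image links in the code to local links while downloading the page?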

Answer A

import os, re
from bs4 import BeautifulSoup

def check_flag(flag):
    # Note: re.match only matches at the start of the string, so this is
    # True only when src begins with "images/".
    regex = re.compile(r'images\/')
    return bool(regex.match(flag))

# soup = BeautifulSoup(open('index.html'))
# Sample markup left over from testing; it is not used below.
html_content = '''
<a href="https://xxx.com">测试01</a>
<a href="https://yyy.com/123">测试02</a>
<a href="https://xxx.com">测试01</a>
<a href="https://xxx.com">测试01</a>
'''
with open(r'favour-en.html', 'r', encoding="UTF-8") as file:
    soup = BeautifulSoup(file, 'html.parser')
for element in soup.find_all('img'):
    if 'src' in element.attrs:
        print(element.attrs['src'])
        if check_flag(element.attrs['src']):
            # if element.attrs['src'].find("png"):
            element.attrs['src'] = "michenxxxxxxxxxxxx" + '/' + element.attrs['src']

print("##################################")
with open('index.html', 'w', encoding="UTF-8") as fp:
    fp.write(soup.prettify())  # prettify() formats the soup so the output is readable

Answer B

This Python code parses an HTML file and modifies the src attribute of its img tags. It imports Python's built-in os and re libraries (os is never actually used) and the third-party library BeautifulSoup.
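First, the code opens an HTML file named 'favour-en.html' and parses it with BeautifulSoup, producing a BeautifulSoup object stored in the variable soup. It then iterates over every img tag in soup and checks whether the tag has a 'src' attribute.
Next, it calls check_flag, which uses a regular expression to test the 'src' value against 'images/' and returns True or False. With re.match the pattern must match at the start of the value; with re.search it may match anywhere. If check_flag returns True, the code prepends 'michenxxxxxxxxxxxx' and '/' to the 'src' attribute. Each original src value is printed as it is visited, and the final HTML is written to a file named 'index.html'.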
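The difference between re.match and re.search matters for check_flag, so here is a small sketch of it. The helper names and sample paths are made up for illustration and are not from the answers.

```python
import re

pattern = re.compile(r'images/')

def starts_with_images(src):
    # re.match only succeeds if the pattern matches at position 0.
    return bool(pattern.match(src))

def contains_images(src):
    # re.search succeeds if the pattern matches anywhere in the string.
    return bool(pattern.search(src))

# Hypothetical src values, chosen to show where the two calls disagree.
samples = ['images/logo.png', 'static/images/logo.png', 'img/logo.png']
for src in samples:
    print(src, starts_with_images(src), contains_images(src))
```

Which one is appropriate depends on whether the page's src values are expected to begin with images/ or merely to contain it.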
Note that the file is read and written with UTF-8 encoding specified, so that Chinese characters are handled correctly.

Answer C

import os, re
from bs4 import BeautifulSoup

def check_flag(flag):
    # re.search matches anywhere in the string, so this is True
    # whenever src contains "images/".
    regex = re.compile(r'images\/')
    return bool(regex.search(flag))

# soup = BeautifulSoup(open('index.html'))
# Sample markup left over from testing; it is not used below.
html_content = '''
<a href="https://xxx.com">测试01</a>
<a href="https://yyy.com/123">测试02</a>
<a href="https://xxx.com">测试01</a>
<a href="https://xxx.com">测试01</a>
'''
with open(r'E:\test\favour-fr.html', 'r', encoding="UTF-8") as file:
    soup = BeautifulSoup(file, 'html.parser')
for element in soup.find_all('img'):
    if 'src' in element.attrs:
        print(element.attrs['src'])
        if check_flag(element.attrs['src']):
            # if element.attrs['src'].find("png"):
            element.attrs['src'] = "michenxxxxxxxxxxxx" + '/' + element.attrs['src']

print("##################################")
with open('index.html', 'w', encoding="UTF-8") as fp:
    fp.write(soup.prettify())  # prettify() formats the soup so the output is readable

Answer D

Just match the original links with a regular expression and replace them with local paths.

Follow-up question

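If I call replace for every image, won't that add to the time it takes to parse the page? And can't xpath modify the file directly the way BeautifulSoup does?

Follow-up answer

In that case I recommend Beautiful Soup. For extracting data from HTML or XML files it is quicker and more convenient than regular expressions.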
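For comparison, the regex approach from Answer D might look roughly like the sketch below. It uses only the standard library. The function name localize_image_links, the images directory, and the example.com URLs are assumptions made for illustration; a single re.sub pass handles all images, so there is no per-image replace call.

```python
import os
import re
from urllib.parse import urljoin, urlparse

# Matches the src attribute (single- or double-quoted) inside an <img ...> tag.
IMG_SRC = re.compile(r'(<img\b[^>]*?\ssrc\s*=\s*)(["\'])(.*?)\2', re.IGNORECASE)

def localize_image_links(html, page_url, local_dir='images'):
    """Return (new_html, mapping), where mapping is {absolute_url: local_path}."""
    mapping = {}

    def rewrite(match):
        prefix, quote, src = match.groups()
        # Resolve relative and protocol-relative links against the page URL.
        absolute = urljoin(page_url, src)
        filename = os.path.basename(urlparse(absolute).path)
        if not filename:
            return match.group(0)  # no usable file name; leave the tag unchanged
        local_path = local_dir + '/' + filename
        mapping[absolute] = local_path
        return prefix + quote + local_path + quote

    return IMG_SRC.sub(rewrite, html), mapping

# Hypothetical page fragment and URL, for illustration only.
page = '<p><img class="a" src="//cdn.example.com/pic/a.jpg"></p>'
new_page, to_download = localize_image_links(page, 'https://example.com/post/1.html')
print(new_page)
print(to_download)
```

The returned mapping lists which URL to save under which local path, for example with urllib.request.urlretrieve, and new_page is what would be written to disk. Two images with the same file name would collide in this sketch, and a regex will miss unusual markup, which is one reason the follow-up answer prefers Beautiful Soup.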

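Python 3 web crawler: handling invalid links when downloading images with try/except

The code is rough. It mainly serves as a memo of the places where mistakes are easy to make, for my own future reference.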

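# Image download
import re
import urllib.request  # in Python 3 the module name differs from 2.x (urllib)

site = 'https://world.taobao.com/item/530762904536.htm?spm=a21bp.7806943.topsale_XX.4.jcjxZC'
page = urllib.request.urlopen(site)
html = page.read()
html = html.decode('utf-8')  # the downloaded bytes must be decoded as UTF-8
reg = r'src="//(gd.*?jpg)'
imgre = re.compile(reg)
imglist = re.findall(imgre, html)
trueurls = []
for i in imglist:
    # Prepend the scheme. i.replace('gd', 'http://gd') would also rewrite
    # any later 'gd' inside the URL.
    trueurls.append('http://' + i)
# Presumably a deliberately invalid link, to exercise the except branch
# (this assumes at least three matches were found).
trueurls[2] = 'http://wlgsad.com.jpg'
print(trueurls)
x = 200
for j in trueurls:
    try:
        urllib.request.urlretrieve(j, '%s.jpg' % x)
    except Exception:  # except Exception as e:
        pass  # print(e)
        # print('invalid link found')
    x = x + 1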

The except clause can print a message.

When downloading images, try/except lets you skip an invalid link and carry on with the next image.
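A minimal sketch of that idea, with the message actually printed and the failures collected, could look like this. The function name download_all and the fetch parameter are assumptions added for illustration; fetch defaults to urllib.request.urlretrieve and exists only so the loop can be tried without network access.

```python
import urllib.request

def download_all(urls, start_index=200, fetch=urllib.request.urlretrieve):
    """Try every URL and return (saved, failed) instead of stopping at the first error."""
    saved, failed = [], []
    for index, url in enumerate(urls, start=start_index):
        filename = '%s.jpg' % index
        try:
            fetch(url, filename)
        except Exception as e:  # broad on purpose: any failure means "skip this link"
            print('skipping invalid link %s: %s' % (url, e))
            failed.append((url, e))
        else:
            saved.append(filename)
    return saved, failed
```

As in the code above, the file number advances even when a download fails, so the numbering of the remaining images does not shift.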
