批量爬取网站上的文本和图片，并保存至word中

Posted 2020-12-13 cooper-73

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了批量爬取网站上的文本和图片，并保存至word中相关的知识，希望对你有一定的参考价值。

 1 from pyquery import PyQuery as pq
 2 import requests as rs
 3 from docx import Document
 4 from docx.shared import RGBColor  
 5 
 6 
 7 html = ‘‘‘
 8 https://zhs.moo0.com/software/AudioTypeConverter/pad.xml
 9 https://zhs.moo0.com/software/AudioEffecter/pad.xml
10 https://zhs.moo0.com/software/AudioPlayer/pad.xml
11 https://zhs.moo0.com/software/Mp3InfoEditor/pad.xml
12 https://zhs.moo0.com/software/VoiceRecorder/pad.xml
13 https://zhs.moo0.com/software/Magnifier/pad.xml
14 https://zhs.moo0.com/software/MultiDesktop/pad.xml
15 https://zhs.moo0.com/software/ScreenShot/pad.xml
16 https://zhs.moo0.com/software/SimpleTimer/pad.xml
17 https://zhs.moo0.com/software/TransparentMenu/pad.xml
18 https://zhs.moo0.com/software/WindowMenuPlus/pad.xml
19 https://zhs.moo0.com/software/WorldTime/pad.xml
20 https://zhs.moo0.com/software/AntiRecovery/pad.xml
21 https://zhs.moo0.com/software/DiskCleaner/pad.xml
22 https://zhs.moo0.com/software/FileMonitor/pad.xml
23 https://zhs.moo0.com/software/FileShredder/pad.xml
24 https://zhs.moo0.com/software/HashCode/pad.xml
25 https://zhs.moo0.com/software/TimeStamp/pad.xml
26 https://zhs.moo0.com/software/ColorPicker/pad.xml
27 https://zhs.moo0.com/software/FontViewer/pad.xml
28 https://zhs.moo0.com/software/ImageInColors/pad.xml
29 https://zhs.moo0.com/software/ImageTypeConverter/pad.xml
30 https://zhs.moo0.com/software/ImageSharpener/pad.xml
31 https://zhs.moo0.com/software/ImageSizer/pad.xml
32 https://zhs.moo0.com/software/ImageThumbnailer/pad.xml
33 https://zhs.moo0.com/software/ImageViewer/pad.xml
34 https://zhs.moo0.com/software/ImageViewer/pad.xml
35 https://zhs.moo0.com/software/ConnectionWatcher/pad.xml
36 https://zhs.moo0.com/software/RightClicker/pad.xml
37 https://zhs.moo0.com/software/RightClicker/pad.xml
38 https://zhs.moo0.com/software/SystemCloser/pad.xml
39 https://zhs.moo0.com/software/SystemMonitor/pad.xml
40 https://zhs.moo0.com/software/XpDesktopHeap/pad.xml
41 https://zhs.moo0.com/software/VideoConverter/pad.xml
42 https://zhs.moo0.com/software/VideoCutter/pad.xml
43 https://zhs.moo0.com/software/VideoInfo/pad.xml
44 https://zhs.moo0.com/software/VideoMinimizer/pad.xml
45 https://zhs.moo0.com/software/VideoToAudio/pad.xml
46 https://zhs.moo0.com/software/FFmpeg/pad.xml
47 ‘‘‘
48 
49 def get_info(address):#爬取网页上的信息
50     add = rs.get(address)
51     res = pq(add.content, parser=‘html‘)
52     name = res(‘Program_Name‘).text()
53     versize = ‘软件版本：‘ + res(‘Program_Version‘).text() + ‘  	‘ + ‘软件大小：‘ + res(‘File_Size_MB‘).text()
54     introduce = ‘软件介绍：‘ + ‘
‘ + res(‘ChineseSimplified‘).find(‘Char_Desc_45‘).text() + ‘
‘ + res(
55             ‘ChineseSimplified‘).find(‘Char_Desc_2000‘).text()
56     photourl = res(‘Application_Screenshot_URL‘).text()
57     download = res(‘Primary_Download_URL‘).text()
58     return [name,versize,introduce,photourl,download]
59 
60 def get_pic(pic_url):#docx无法直接插入网络图片，因此在这里先将网络图片下载到本地
61     pic = rs.get(pic_url)
62     with open(‘pic_tmp.png‘, "wb")as f:
63         f.write(pic.content)
64 
65 def insert_doc(n,res): #将爬取的内容导入document。
66     name = str(n)+‘、‘+ res[0]
67     document.add_heading(name)
68     document.add_paragraph(res[1])
69     document.add_paragraph(res[2])
70     document.add_picture(‘pic_tmp.png‘)
71     p = document.add_paragraph()
72     run = p.add_run(‘下载地址：
‘)
73     run = p.add_run(res[4])
74     run.italic = True
75     run.underline = True
76     run.font.color.rgb = RGBColor(31,77,225)
77 
78 def main():
79     n = 0
80     for address in html.split():
81         n += 1
82         print(n)
83         respon = get_info(address)
84         get_pic(respon[3])
85         insert_doc(n, respon)
86 
87 if __name__== ‘__main__‘:
88     document = Document()
89     main()
90     document.save(‘./demo.docx‘)

效果预览：

技术图片

以上是关于批量爬取网站上的文本和图片，并保存至word中的主要内容，如果未能解决你的问题，请参考以下文章

Node批量爬取头条视频并保存方法

Python爬虫爬取网页上的所有图片

今日头条图片ajax异步加载爬取，并保存至mongodb，以及代码写法的改进

python爬取某个网站的图片并保存到本地

python爬取MM图片

《爬取知网文献信息》中代码的一些优化