Web Scraping 3: PDF Pages + the pdfminer Module + Demo

Posted 2020-07-09 rongyux

This post covers scraping PDF pages, which requires the pdfminer module.

General flow of the demo:

1) Set the URL

url = 'http://www.------' + '.PDF'

2) Fetch the URL with the requests module

import requests
r = requests.get(url)

3) Write the response to a .pdf file

# i is the current record in the surrounding loop (not shown here)
with open("PDF/" + i['associateAnnouncement'] + '.pdf', "wb") as myFile:
    myFile.write(r.content)
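Steps 1 to 3 can be folded into one small helper. The following is only a minimal sketch: the function name `download_pdf`, its parameters, and the `get` hook (there so the function can be exercised without network access) are hypothetical and not part of the original demo.

```python
import os

import requests


def download_pdf(url, out_dir, name, get=requests.get):
    """Download `url` and save the body as <out_dir>/<name>.pdf.

    `get` defaults to requests.get; it is an illustrative hook so a
    stand-in can be passed when no network is available.
    """
    response = get(url, timeout=30)
    # Fail early on 4xx/5xx instead of saving an error page as a PDF.
    response.raise_for_status()

    # Create the target directory if it does not exist yet.
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, name + ".pdf")

    # response.content holds the raw bytes, so open the file in binary mode.
    with open(path, "wb") as pdf_file:
        pdf_file.write(response.content)
    return path
```

With the names used in this post, a call would look like `download_pdf(url, "PDF", i['associateAnnouncement'])`.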

4) Import the pdfminer module

import pdfminer

5) Parse the HTML with BeautifulSoup

from bs4 import BeautifulSoup

html = open('PDF/1202268749.html').read()
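What to extract from the HTML depends on the document. As a rough sketch, assuming the goal is simply to reduce the converted page to its text fragments, the HTML string can be handed to BeautifulSoup like this. The helper name `html_to_lines` is hypothetical, and the built-in `html.parser` backend is an arbitrary choice.

```python
from bs4 import BeautifulSoup


def html_to_lines(html):
    """Return the non-empty text fragments of an HTML string, in order."""
    soup = BeautifulSoup(html, "html.parser")
    # stripped_strings yields each text node with surrounding whitespace
    # removed and skips nodes that are whitespace only.
    return list(soup.stripped_strings)
```

Applied to the `html` variable read above, `html_to_lines(html)` gives a plain list of strings that can be filtered or searched further.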

To be continued (off to sleep for now). The idea: pdfminer converts the PDF page into an HTML page, and BeautifulSoup then parses that HTML.