Python: reading doc, docx, and pdf files
Preface: this article was compiled by the editors of 小常识网 (cha138.com). It covers reading doc, docx, and pdf files with Python, and we hope it serves as a useful reference.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 script: it relies on cStringIO, reload(sys), file() and print statements.
import sys
from cStringIO import StringIO

import docx
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from win32com import client

# Python 2 only: make gb2312 the default codec for implicit str/unicode conversion.
reload(sys)
sys.setdefaultencoding('gb2312')


def readDocx(docxPath):
    # Collect the stripped text of every paragraph in a .docx file.
    fullText = []
    doc = docx.Document(docxPath)
    paras = doc.paragraphs
    for p in paras:
        fullText.append(p.text.strip())
    return '\n'.join(fullText)


def readPdf(pdfPath):
    # Extract the text of every page with pdfminer.
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(pdfPath, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # empty set means all pages
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    text = retstr.getvalue()  # renamed from `str` to avoid shadowing the builtin
    retstr.close()
    return text


def readDoc(docPath):
    # Drives Microsoft Word over COM, so this needs Windows with Word installed.
    fullText = []
    word = client.Dispatch('Word.Application')
    # Open an existing file
    doc = word.Documents.Open(docPath)
    #print doc.Content
    #print text
    # Save a plain-text copy; 2 is wdFormatText
    doc.SaveAs('c:/temp.txt', 2)
    # Close
    doc.Close()
    word.Quit()
    f = open(r'c:/temp.txt', 'r')
    for line in f.readlines():
        #if len(line)!=line.count('\n'):
        fullText.append(line.decode('gbk').strip())
    f.close()
    return '\n'.join(fullText)


if __name__ == '__main__':
    #docxValue=readDocx('d:/1.docx')
    #print docxValue
    #pdfValue = readPdf('d:/3.pdf')
    #print pdfValue
    docValue = readDoc('d:/2.doc')
    print docValue