Python: reading doc, docx, and pdf files


#!/usr/bin/env python
# -*- coding: utf-8 -*-
import docx

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

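from win32com import client  # pywin32; needs Windows with Microsoft Word installed
import sys

# Python 2 only: make gb2312 the default codec for implicit str/unicode conversion
reload(sys)
sys.setdefaultencoding('gb2312')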

def readDocx(docxPath):
    # Collect the stripped text of every paragraph in a .docx file (python-docx)
    fullText = []
    doc = docx.Document(docxPath)
    paras = doc.paragraphs
    for p in paras:
        fullText.append(p.text.strip())
    return '\n'.join(fullText)

def readPdf(pdfPath):
    # Extract the text of every page of a PDF with pdfminer
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(pdfPath, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0     # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    text = retstr.getvalue()  # renamed from `str` to avoid shadowing the builtin
    retstr.close()
    return text

def readDoc(docPath):
    # Convert a legacy .doc file to plain text through Word's COM interface
    fullText = []
    word = client.Dispatch('Word.Application')
    # Open an existing file
    doc = word.Documents.Open(docPath)
    #print doc.Content
    #print text
    # Save as plain text (2 is wdFormatText)
    doc.SaveAs('c:/temp.txt', 2)
    # Close the document and quit Word
    doc.Close()
    word.Quit()
    f = open(r'c:/temp.txt', 'r')
    for line in f.readlines():
        #if len(line) != line.count('\n'):
        fullText.append(line.decode('gbk').strip())
    f.close()
    return '\n'.join(fullText)

if __name__ == '__main__':
    #docxValue = readDocx('d:/1.docx')
    #print docxValue
    #pdfValue = readPdf('d:/3.pdf')
    #print pdfValue
    docValue = readDoc('d:/2.doc')
    print docValue
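
The main block above switches between the three readers by commenting calls in and out. One possible alternative is to choose the reader from the file extension. The sketch below is only an illustration: `read_any` and the `readers` mapping are hypothetical names that do not appear in the original script, and it assumes every reader takes a path and returns text, as `readDocx`, `readPdf`, and `readDoc` do.

```python
import os


def read_any(path, readers):
    """Call the reader registered for the extension of `path`.

    `readers` maps a lowercase extension such as '.docx' to a callable
    that takes a path and returns text. (Hypothetical helper, not part
    of the original script.)
    """
    # splitext('d:/1.DOCX') -> ('d:/1', '.DOCX'); compare case-insensitively
    ext = os.path.splitext(path)[1].lower()
    try:
        reader = readers[ext]
    except KeyError:
        raise ValueError('unsupported file type: %r' % ext)
    return reader(path)
```

With the functions from the script, a call would look like `read_any('d:/2.doc', {'.docx': readDocx, '.pdf': readPdf, '.doc': readDoc})`. Passing the mapping in as an argument keeps the helper free of the Windows-only `win32com` dependency.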

 
