Converting PDF to Excel / csv / xlsx
My intention is to convert the strings in a PDF into an Excel / CSV file, as shown below:
PDF File : (Source File)
#_________________________________________________________________________
appliance
n. 1. See server appliance. 2. See information appliance. 3. A device with a single or limited ......
appliance server
n. 1. An inexpensive computing .....2. See server appliance.
application
n. A program designed ......
#________________________________________________________________________
Excel File : (Target File)
#________________________________________________________________________
appliance , n. , 1. See server appliance ,
appliance server , n. , 1. An inexpensive co ,
application , n. , A program designed ...... ,
#________________________________________________________________________
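I have converted the PDF to text and tried to split it on "," so that the text file can then be turned into a CSV file. But I am stuck after converting the PDF to a text file.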
import os
from os import path
from time import strftime

import PyPDF2


def check_path(prompt):
    ''' (str) -> str
    Verifies if the provided absolute path does exist.
    '''
    abs_path = input(prompt)
    while not path.exists(abs_path):
        print("\nThe specified path does not exist.\n")
        abs_path = input(prompt)
    return abs_path


print("\n")
folder = check_path("Provide absolute path for the folder: ")

# Collect every PDF under the folder, including subfolders
pdf_paths = []
for root, dirs, files in os.walk(folder):
    for filename in files:
        if filename.endswith('.pdf'):
            # Join with root (not the top folder) so nested files resolve
            pdf_paths.append(os.path.join(root, filename))

for pdf_path in pdf_paths:
    # Write the text next to the PDF, with a .txt extension
    name = path.splitext(pdf_path)[0] + ".txt"
    content = ""
    with open(pdf_path, "rb") as pdf_file:
        # Load PDF into PyPDF2 (legacy names; PyPDF2 3.x renamed them
        # to PdfReader, reader.pages and page.extract_text())
        pdf = PyPDF2.PdfFileReader(pdf_file)
        # Iterate pages
        for page_number in range(pdf.getNumPages()):
            # Extract text from page and add to content
            content += pdf.getPage(page_number).extractText() + "\n"
    print(strftime("%H:%M:%S"), " pdf -> txt ")
    with open(name, 'w', encoding="UTF-8") as f:
        f.write(content)
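For the step the question gets stuck on (turning the extracted text into rows), here is a minimal sketch using only the standard library. It assumes the extracted text keeps the layout shown in the source sample: a headword on one line, followed by a line that starts with a part-of-speech abbreviation such as "n.". The names POS_LINE, parse_entries and to_csv, and the list of abbreviations, are illustrative assumptions and not part of the original post. A definition that wraps onto a second line would be misread as a headword, so real extractor output may need extra handling.

```python
import csv
import io
import re

# Assumption: a definition line starts with one of these abbreviations.
POS_LINE = re.compile(r"^(n|v|adj|adv|abbr)\.\s+(.*)$")


def parse_entries(text):
    """Return [headword, part_of_speech, definition] rows from the text."""
    rows = []
    headword = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the "#____" rulers
        match = POS_LINE.match(line)
        if match and headword is not None:
            # Definition line: pair it with the pending headword
            rows.append([headword, match.group(1) + ".", match.group(2)])
            headword = None
        elif not match:
            # Anything else is treated as the next headword
            headword = line
    return rows


def to_csv(rows):
    """Serialize rows as CSV text; csv.writer quotes fields when needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Sample copied from the source file shown in the question
sample = """appliance
n. 1. See server appliance. 2. See information appliance. 3. A device with a single or limited ......
appliance server
n. 1. An inexpensive computing .....2. See server appliance.
application
n. A program designed ......
"""

rows = parse_entries(sample)
print(to_csv(rows))
```

Using csv.writer instead of joining on "," by hand matters here because definitions may themselves contain commas, which the writer quotes automatically.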
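Answer
It may be worth converting the PDF to CSV first, and then manipulating the CSV into the layout you want.
This API can be used with Python to convert one or more PDFs to CSV: https://pdftables.com/pdf-to-excel-api.
To convert a single PDF: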
import pdftables_api

c = pdftables_api.Client('my-api-key')
# csv() writes CSV output; xlsx() would write an Excel workbook instead
c.csv('input.pdf', 'output.csv')
Or to convert multiple PDFs:
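import os
import pdftables_api

c = pdftables_api.Client('MY-API-KEY')
# Raw string so the backslashes are not treated as escape sequences
file_path = r"C:\Users\MyName\Documents\PDFTablesCode"
for file in os.listdir(file_path):
    if file.endswith(".pdf"):
        # Each CSV is written to the current working directory
        c.csv(os.path.join(file_path, file), file + '.csv')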
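One detail of the loop above: file + '.csv' produces names such as report.pdf.csv, and the endswith(".pdf") check skips files named with an uppercase extension. The sketch below shows one way to tidy both up with the standard library. The helper names csv_path_for and pdfs_in are hypothetical, and the sketch deliberately makes no API calls.

```python
import os


def csv_path_for(pdf_path):
    """Return the same path with its final extension replaced by .csv."""
    base, _ext = os.path.splitext(pdf_path)
    return base + ".csv"


def pdfs_in(folder):
    """Return the PDF file names in folder, matching .pdf case-insensitively."""
    return sorted(
        name for name in os.listdir(folder) if name.lower().endswith(".pdf")
    )


print(csv_path_for("report.pdf"))
```

In the batch loop, csv_path_for(os.path.join(file_path, file)) could be passed as the output argument so that each CSV lands next to its PDF.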