Reading PDF Files in Java
Posted tropica
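To read the content of a PDF file in Java, we can rely on third-party software. A common choice is xpdf. This article briefly covers how to install xpdf on Linux and how to use it from Java to read the content of PDF files.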
I. Installing xpdf
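On Fedora Core (fc) releases a manual install is not strictly needed, because xpdf can be installed directly with yum. I still recommend downloading and installing it yourself. I once ran into a case where xpdf had been installed with yum on a customer's server and certain unusual PDF files could not be previewed. After uninstalling the yum package and reinstalling xpdf from the downloaded archive, those files worked.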
1. Download
Three xpdf packages are needed:
Main program: ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.01pl2-linux.tar.gz
Simplified Chinese support: ftp://ftp.foolabs.com/pub/xpdf/xpdf-chinese-simplified.tar.gz
Traditional Chinese support: ftp://ftp.foolabs.com/pub/xpdf/xpdf-chinese-traditional.tar.gz
2. Installation and deployment
(1) Change to the download directory and extract the main program into /usr. Any other location works too, depending on your setup.
    # tar zvfx xpdf-3.01pl2-linux.tar.gz -C /usr
    # cd /usr
Then rename the directory so it is easier to work with:
    # mv xpdf-3.01pl2-linux/ xpdf
(2) Add Chinese support. Go back to the download directory and run the following in order:
    # tar zvfx xpdf-chinese-simplified.tar.gz -C /usr/xpdf
    # mv /usr/xpdf/xpdf-chinese-simplified /usr/xpdf/chinese-simplified
    # tar zvfx xpdf-chinese-traditional.tar.gz -C /usr/xpdf
    # mv /usr/xpdf/xpdf-chinese-traditional /usr/xpdf/chinese-traditional
(3) Configure the environment
    # vi /etc/bashrc
Add the following line:
    export PATH=/usr/xpdf/:$PATH
After restarting the machine, typing xpdf at the console should no longer report that the command or file cannot be found.
(4) Resource configuration
    # cd /usr/xpdf
    # cp sample-xpdfrc xpdfrc
    # vi xpdfrc
Add the following at the top of the file (replace /usr/xpdf with the actual xpdf path):
    #----- begin Chinese Simplified support package (2004-jul-27)
    cidToUnicode Adobe-GB1 "/usr/xpdf/chinese-simplified/Adobe-GB1.cidToUnicode"
    unicodeMap ISO-2022-CN "/usr/xpdf/chinese-simplified/ISO-2022-CN.unicodeMap"
    unicodeMap EUC-CN "/usr/xpdf/chinese-simplified/EUC-CN.unicodeMap"
    unicodeMap GBK "/usr/xpdf/chinese-simplified/GBK.unicodeMap"
    cMapDir Adobe-GB1 "/usr/xpdf/chinese-simplified/CMap"
    toUnicodeDir "/usr/xpdf/chinese-simplified/CMap"
    #displayCIDFontTT Adobe-GB1 /usr/..../gkai00mp.ttf
    #----- end Chinese Simplified support package
    #----- begin Chinese Traditional support package (2004-jul-27)
    cidToUnicode Adobe-CNS1 "/usr/xpdf/chinese-traditional/Adobe-CNS1.cidToUnicode"
    unicodeMap Big5 "/usr/xpdf/chinese-traditional/Big5.unicodeMap"
    unicodeMap Big5ascii "/usr/xpdf/chinese-traditional/Big5ascii.unicodeMap"
    cMapDir Adobe-CNS1 "/usr/xpdf/chinese-traditional/CMap"
    toUnicodeDir "/usr/xpdf/chinese-traditional/CMap"
    #displayCIDFontTT Adobe-CNS1 /usr/..../bkai00mp.ttf
    #----- end Chinese Traditional support package
Then run:
    # cp xpdfrc /usr/local/etc/
That completes the installation. The next section shows how to use xpdf to read the content of a PDF file.
II. Reading PDF content with xpdf
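Because the Java code launches pdftotext by name, it may be worth confirming first that the JVM process can actually see it on its PATH (the value exported in /etc/bashrc only reaches processes started from a shell that has read that file). The following is a minimal sketch under that assumption; the class name `PathCheck` and the method `findOnPath` are hypothetical helpers, not part of xpdf or the JDK.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper: checks whether a tool is reachable through a PATH-style string.
class PathCheck {

    /**
     * Looks for an executable file called name in each directory listed in
     * pathValue and returns the first match, if any.
     */
    static Optional<Path> findOnPath(String name, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty entries such as a trailing separator
            }
            Path candidate = Paths.get(dir, name);
            if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Uses the PATH that this JVM inherited, which is what Runtime.exec will search.
        Optional<Path> hit = findOnPath("pdftotext", System.getenv("PATH"));
        System.out.println(hit.map(p -> "pdftotext found at " + p)
                .orElse("pdftotext not found on PATH"));
    }
}
```

If the check reports that pdftotext is missing even though it works in a terminal, the JVM was probably started with a different environment, and passing the full path (for example under /usr/xpdf) to the command is one way around it.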
The approach is simple: run pdftotext through the well-known Runtime.getRuntime(), as follows:
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    // The class name is only a placeholder wrapper so the method compiles on its own.
    public class PdfReader {

        /**
         * Extracts the text of a PDF file by running xpdf's pdftotext.
         *
         * @param filePath path to the PDF file
         * @return the extracted text with lines joined by a single space,
         *         or an empty string if pdftotext could not be started
         */
        public String getPdfContent(String filePath) {
            // "-" as the output file makes pdftotext write to standard output;
            // -q suppresses messages and -enc selects the output encoding.
            String[] cmd = {"pdftotext", "-enc", "UTF-8", "-q", filePath, "-"};
            StringBuilder sb = new StringBuilder();
            Process p;
            try {
                p = Runtime.getRuntime().exec(cmd);
            } catch (IOException e) {
                // Without a process there is nothing to read, so return early
                // instead of hitting a NullPointerException further down.
                e.printStackTrace();
                return sb.toString();
            }
            // try-with-resources closes the reader and the underlying stream.
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = br.readLine()) != null) {
                    sb.append(line).append(' ');
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
            return sb.toString();
        }
    }
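The method above returns whatever it managed to read and never looks at how pdftotext exited. One possible variant, sketched here with `ProcessBuilder`, keeps the line breaks and turns a non-zero exit code into an exception. The class name `PdfTextExtractor` and its methods are hypothetical; the pdftotext arguments are the same ones used above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical variant of the reader above; not part of xpdf.
class PdfTextExtractor {

    /** Builds the pdftotext argument list; "-" sends the text to standard output. */
    static List<String> buildCommand(String filePath) {
        return Arrays.asList("pdftotext", "-enc", "UTF-8", "-q", filePath, "-");
    }

    /** Reads a UTF-8 stream to the end, writing one '\n' after every line. */
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    /** Runs pdftotext on filePath and returns its output, or throws if it fails. */
    static String extract(String filePath) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(filePath))
                // Send the tool's stderr to the JVM's stderr so that pipe is never left unread.
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        String text = readAll(p.getInputStream());
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("pdftotext exited with code " + exit + " for " + filePath);
        }
        return text;
    }
}
```

A call would look like `String text = PdfTextExtractor.extract("/path/to/file.pdf");`, where the path is a placeholder. Keeping the command construction and the stream reading in separate methods also lets both be exercised without xpdf installed.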