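Reading PDF documents in .Net [closed]
【Posted】: 2010-09-10 03:14:43
【Question】: Is there an open-source library that will help me read/parse PDF documents in .NET/C#?
【Comments】:
The answer provided by Brock Nusser looks like the most up-to-date solution and should be considered the correct answer to this question.
A more recent iTextSharp answer is here, since this question is closed.
【Answer 1】: Take a look at the Docotic.Pdf library. It does not require you to open-source your application (unlike iTextSharp, with its viral AGPL 3 license).
Docotic.Pdf can be used to read PDF files and extract text with or without formatting. Have a look at the article that shows how to extract text from PDFs.
Disclaimer: I work for Bit Miracle, the vendor of the library.
【Comments】:
It is only free for 30 days. Not a good option...
【Answer 2】: PDFClown might help, but I would not recommend it for large or heavily used applications.
【Comments】:
It is licensed under the LGPL, so it can be used to create commercial proprietary software.
【Answer 3】: iTextSharp is the best bet. I used it to make a spider for lucene.Net so that it could crawl PDFs.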
using System;
using System.IO;
using iTextSharp.text.pdf;
using System.Text.RegularExpressions;

namespace Spider.Utils
{
    /// <summary>
    /// Parses a PDF file and extracts the text from it.
    /// </summary>
    public class PDFParser
    {
        /// BT = Beginning of a text object operator
        /// ET = End of a text object operator
        /// Td move to the start of next line
        /// 5 Ts = superscript
        /// -5 Ts = subscript

        #region Fields
        #region _numberOfCharsToKeep
        /// <summary>
        /// The number of characters to keep, when extracting text.
        /// </summary>
        private static int _numberOfCharsToKeep = 15;
        #endregion
        #endregion

        #region ExtractText
        /// <summary>
        /// Extracts the text from a PDF file and writes it to a UTF-8 text file.
        /// </summary>
        /// <param name="inFileName">the full path to the pdf file.</param>
        /// <param name="outFileName">the output file name.</param>
        /// <returns>true if the extraction succeeded, false otherwise</returns>
        public bool ExtractText(string inFileName, string outFileName)
        {
            StreamWriter outFile = null;
            try
            {
                // Create a reader for the given PDF file
                PdfReader reader = new PdfReader(inFileName);
                //outFile = File.CreateText(outFileName);
                outFile = new StreamWriter(outFileName, false, System.Text.Encoding.UTF8);

                Console.Write("Processing: ");

                // Width of the console progress bar, in '#' characters
                int totalLen = 68;
                float charUnit = ((float)totalLen) / (float)reader.NumberOfPages;
                int totalWritten = 0;
                float curUnit = 0;

                for (int page = 1; page <= reader.NumberOfPages; page++)
                {
                    outFile.Write(ExtractTextFromPDFBytes(reader.GetPageContent(page)) + " ");

                    // Write the progress.
                    if (charUnit >= 1.0f)
                    {
                        for (int i = 0; i < (int)charUnit; i++)
                        {
                            Console.Write("#");
                            totalWritten++;
                        }
                    }
                    else
                    {
                        curUnit += charUnit;
                        if (curUnit >= 1.0f)
                        {
                            for (int i = 0; i < (int)curUnit; i++)
                            {
                                Console.Write("#");
                                totalWritten++;
                            }
                            curUnit = 0;
                        }
                    }
                }

                // Pad the progress bar to its full width
                if (totalWritten < totalLen)
                {
                    for (int i = 0; i < (totalLen - totalWritten); i++)
                    {
                        Console.Write("#");
                    }
                }
                return true;
            }
            catch
            {
                return false;
            }
            finally
            {
                if (outFile != null) outFile.Close();
            }
        }
        #endregion

        #region ExtractTextFromPDFBytes
        /// <summary>
        /// This method processes an uncompressed Adobe (text) object
        /// and extracts text.
        /// </summary>
        /// <param name="input">uncompressed</param>
        /// <returns>the extracted text, or an empty string on failure</returns>
        public string ExtractTextFromPDFBytes(byte[] input)
        {
            if (input == null || input.Length == 0) return "";

            try
            {
                string resultString = "";

                // Flag showing if we are currently inside a text object
                bool inTextObject = false;

                // Flag showing if the next character is literal
                // e.g. '\\' to get a '\' character or '\(' to get '('
                bool nextLiteral = false;

                // () Bracket nesting level. Text appears inside ()
                int bracketDepth = 0;

                // Keep previous chars to extract numbers etc.:
                char[] previousCharacters = new char[_numberOfCharsToKeep];
                for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' ';

                for (int i = 0; i < input.Length; i++)
                {
                    char c = (char)input[i];
                    if (input[i] == 213)
                    {
                        c = '\'';
                    }

                    if (inTextObject)
                    {
                        // Position the text
                        if (bracketDepth == 0)
                        {
                            if (CheckToken(new string[] { "TD", "Td" }, previousCharacters))
                            {
                                resultString += "\n\r";
                            }
                            else if (CheckToken(new string[] { "'", "T*", "\"" }, previousCharacters))
                            {
                                resultString += "\n";
                            }
                            else if (CheckToken(new string[] { "Tj" }, previousCharacters))
                            {
                                resultString += " ";
                            }
                        }

                        // End of a text object, also go to a new line.
                        if (bracketDepth == 0 &&
                            CheckToken(new string[] { "ET" }, previousCharacters))
                        {
                            inTextObject = false;
                            resultString += " ";
                        }
                        // Start outputting text
                        else if ((c == '(') && (bracketDepth == 0) && (!nextLiteral))
                        {
                            bracketDepth = 1;
                        }
                        // Stop outputting text
                        else if ((c == ')') && (bracketDepth == 1) && (!nextLiteral))
                        {
                            bracketDepth = 0;
                        }
                        // Just a normal text character:
                        else if (bracketDepth == 1)
                        {
                            // Only print out next character no matter what.
                            // Do not interpret.
                            if (c == '\\' && !nextLiteral)
                            {
                                resultString += c.ToString();
                                nextLiteral = true;
                            }
                            else
                            {
                                if (((c >= ' ') && (c <= '~')) ||
                                    ((c >= 128) && (c < 255)))
                                {
                                    resultString += c.ToString();
                                }
                                nextLiteral = false;
                            }
                        }
                    }

                    // Store the recent characters for
                    // when we have to go back for a check
                    for (int j = 0; j < _numberOfCharsToKeep - 1; j++)
                    {
                        previousCharacters[j] = previousCharacters[j + 1];
                    }
                    previousCharacters[_numberOfCharsToKeep - 1] = c;

                    // Start of a text object
                    if (!inTextObject && CheckToken(new string[] { "BT" }, previousCharacters))
                    {
                        inTextObject = true;
                    }
                }

                return CleanupContent(resultString);
            }
            catch
            {
                return "";
            }
        }

        // Replaces escaped parentheses and octal escapes (e.g. \351) with the characters they stand for.
        private string CleanupContent(string text)
        {
            string[] patterns = { @"\\\(", @"\\\)", @"\\226", @"\\222", @"\\223", @"\\224", @"\\340", @"\\342", @"\\344", @"\\300", @"\\302", @"\\304", @"\\351", @"\\350", @"\\352", @"\\353", @"\\311", @"\\310", @"\\312", @"\\313", @"\\362", @"\\364", @"\\366", @"\\322", @"\\324", @"\\326", @"\\354", @"\\356", @"\\357", @"\\314", @"\\316", @"\\317", @"\\347", @"\\307", @"\\371", @"\\373", @"\\374", @"\\331", @"\\333", @"\\334", @"\\256", @"\\231", @"\\253", @"\\273", @"\\251", @"\\221" };
            string[] replace = { "(", ")", "-", "'", "\"", "\"", "à", "â", "ä", "À", "Â", "Ä", "é", "è", "ê", "ë", "É", "È", "Ê", "Ë", "ò", "ô", "ö", "Ò", "Ô", "Ö", "ì", "î", "ï", "Ì", "Î", "Ï", "ç", "Ç", "ù", "û", "ü", "Ù", "Û", "Ü", "®", "™", "«", "»", "©", "'" };

            for (int i = 0; i < patterns.Length; i++)
            {
                string regExPattern = patterns[i];
                Regex regex = new Regex(regExPattern, RegexOptions.IgnoreCase);
                text = regex.Replace(text, replace[i]);
            }
            return text;
        }
        #endregion

        #region CheckToken
        /// <summary>
        /// Check if a certain 2 character token just came along (e.g. BT)
        /// </summary>
        /// <param name="tokens">the searched token</param>
        /// <param name="recent">the recent character array</param>
        /// <returns>true if one of the tokens was just seen, delimited by whitespace</returns>
        private bool CheckToken(string[] tokens, char[] recent)
        {
            foreach (string token in tokens)
            {
                // Note: token[1] assumes a two-character token; the one-character
                // tokens "'" and "\"" passed in above make this index go out of range
                // whenever their first character matches.
                if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
                    (recent[_numberOfCharsToKeep - 2] == token[1]) &&
                    ((recent[_numberOfCharsToKeep - 1] == ' ') ||
                     (recent[_numberOfCharsToKeep - 1] == 0x0d) ||
                     (recent[_numberOfCharsToKeep - 1] == 0x0a)) &&
                    ((recent[_numberOfCharsToKeep - 4] == ' ') ||
                     (recent[_numberOfCharsToKeep - 4] == 0x0d) ||
                     (recent[_numberOfCharsToKeep - 4] == 0x0a))
                    )
                {
                    return true;
                }
            }
            return false;
        }
        #endregion
    }
}
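One thing worth checking in CheckToken: it always reads token[1], but the caller also passes the one-character tokens "'" and "\"". When the character in the window happens to match one of those, the index goes out of range, and because ExtractTextFromPDFBytes wraps everything in a catch that returns "", the page can silently come out empty. Below is a minimal sketch of a length-aware check. The class and method names (TokenCheck, EndsWithToken) are made up for illustration and are not part of the original code; it assumes the same rule as the original, namely that a token counts only when it has whitespace on both sides.

```csharp
using System;

public static class TokenCheck
{
    // Hypothetical helper: returns true when the window ends with
    // <delimiter><token><delimiter>, for tokens of any length that fit.
    public static bool EndsWithToken(string[] tokens, char[] recent)
    {
        int last = recent.Length - 1;
        if (last < 0 || !IsDelimiter(recent[last])) return false;

        foreach (string token in tokens)
        {
            // Index of the token's first character inside the window
            int start = last - token.Length;

            // Skip empty tokens and tokens that leave no room for the leading delimiter
            if (token.Length == 0 || start < 1) continue;
            if (!IsDelimiter(recent[start - 1])) continue;

            bool match = true;
            for (int k = 0; k < token.Length; k++)
            {
                if (recent[start + k] != token[k])
                {
                    match = false;
                    break;
                }
            }
            if (match) return true;
        }
        return false;
    }

    // Same delimiters the original checks for: space, CR, LF
    private static bool IsDelimiter(char c)
    {
        return c == ' ' || c == '\r' || c == '\n';
    }
}
```

For two-character tokens and a 15-character window this looks at the same positions as the original (the token at the third- and second-to-last slots, delimiters at the last and fourth-to-last), so it could be dropped in as the body of CheckToken. Whether that is enough to get sensible text out of a given PDF is a separate question.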
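【Comments】:
Hello ceetheman, I tried using the code you provided above... but ran into a problem. Some of my PDF files are read correctly, but for others I get an "Index Out of Range" error in the function "CheckToken". Can you help me solve this?
It is good etiquette to cite the source of your example. In this case, the same source code can be found at codeproject.com/KB/cs/PDFToText.aspx
There is something wrong with this code; it returns gobbledygook made up of the letters r and n. I ended up using PDFBox.
So weird... I plugged in my PDF file and got 1627 blank lines in my text file...
The answer provided by Brock Nusser looks like the most up-to-date solution and should be considered the correct answer to this question.
【Answer 4】: iTextSharp has improved its API significantly since this question was last answered in 2008. If you download the latest version of the API from http://sourceforge.net/projects/itextsharp/, you can use the following snippet to extract all of the text from a PDF into a string.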
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
namespace PdfParser
{
    public static class PdfTextExtractor
    {
        public static string pdfText(string path)
        {
            PdfReader reader = new PdfReader(path);
            string text = string.Empty;
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Fully qualified: inside this class the bare name PdfTextExtractor resolves to this class itself.
                text += iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, page);
            }
            reader.Close();
            return text;
        }
    }
}
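【Comments】:
You probably shouldn't name your class PdfTextExtractor, since it will clash with the class of the same name in iTextSharp.text.pdf.parser.
iTextSharp has moved to GitHub: github.com/itext/itextsharp
Perhaps the people who answered here can help out here?
It is now paid for commercial projects.
iTextSharp has been deprecated and replaced by iText 7: github.com/itext/itext7-dotnet.
【Answer 5】:
itext?
http://www.itextpdf.com/terms-of-use/index.php
Guide
http://www.vogella.com/articles/JavaPDF/article.html
【Comments】: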
【Answer 6】:
// Needs: using System.Data; using System.Text; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
public string ReadPdfFile(object Filename, DataTable ReadLibray)
{
    // reader2 is only used to get the page count
    PdfReader reader2 = new PdfReader((string)Filename);
    string strText = string.Empty;
    for (int page = 1; page <= reader2.NumberOfPages; page++)
    {
        ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
        // A fresh reader is opened and closed for every page
        PdfReader reader = new PdfReader((string)Filename);
        String s = PdfTextExtractor.GetTextFromPage(reader, page, its);
        s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
        strText = strText + s;
        reader.Close();
    }
    reader2.Close();
    return strText;
}
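A note on the re-encoding line inside the loop: it turns the extracted string into bytes in Encoding.Default, converts those bytes to UTF-8, and decodes them again. As far as I can tell, that round trip can only leave the text unchanged or lose characters the legacy code page cannot represent; it does not add information. The sketch below reproduces the same three steps with the encoding passed in explicitly so the effect is easy to try out. EncodingRoundTrip and RoundTrip are illustrative names, not part of the answer or of iTextSharp.

```csharp
using System;
using System.Text;

public static class EncodingRoundTrip
{
    // Same shape as the conversion in the answer, but with the legacy
    // encoding as a parameter instead of Encoding.Default.
    public static string RoundTrip(string s, Encoding legacy)
    {
        // string -> legacy bytes (characters outside the code page are replaced here)
        byte[] legacyBytes = legacy.GetBytes(s);

        // legacy bytes -> UTF-8 bytes
        byte[] utf8Bytes = Encoding.Convert(legacy, Encoding.UTF8, legacyBytes);

        // UTF-8 bytes -> string
        return Encoding.UTF8.GetString(utf8Bytes);
    }
}
```

Calling RoundTrip("café", Encoding.GetEncoding("iso-8859-1")) should hand back the same string, while text containing characters outside Latin-1 comes back altered. On .NET Core and later, Encoding.Default is UTF-8, so there the line in the answer would be a no-op; on .NET Framework it depends on the system's ANSI code page.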
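【Comments】:
The only approach that worked for me! Thanks, dude!
PdfReader? Please add some information.
@DT See iTextSharp
【Answer 7】: Aspose PDF works quite well. Then again, you have to pay for it.
【Comments】:
【Answer 8】: http://www.c-sharpcorner.com/UploadFile/psingh/PDFFileGenerator12062005235236PM/PDFFileGenerator.aspx is open source and may be a good starting point for you.
【Comments】:
【Answer 9】: iText is the best library I know of. It was originally written in Java, and there is also a .NET port.
See http://www.ujihara.jp/iTextdotNET/en/
【Comments】:
That is not the official port, and the link is broken anyway. iTextSharp, the official .NET port of iText, can be found on GitHub: github.com/itext/itextsharp
【Answer 10】: There is also LibHaru
http://libharu.org/wiki/Main_Page
【Comments】:
The link is broken. libharu.org Also: "Currently libHaru does not support reading and editing existing PDF files, and it is unlikely that this support will ever appear." Is this really relevant?
【Answer 11】: You could take a look at this: http://www.codeproject.com/KB/showcase/pdfrasterizer.aspx It is not completely free, but it looks very nice.
Alex
【Comments】:
Does this help convert a PDF to raw text? It seems the tool converts it to images, so I would need an OCR library :-)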