Reading PDF documents in .Net [closed]

Posted 2023-02-24

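【Title】: Reading PDF documents in .Net [closed] 【Posted】: 2010-09-10 03:14:43 【Question】:

Is there an open source library that will help me with reading/parsing PDF documents in .NET/C#?

【Question comments】:

The answer provided by Brock Nusser appears to be the most up-to-date solution and should be considered the correct answer to this question.

For more recent iTextSharp answers, see here, since this question is closed.

【Answer 1】: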

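Have a look at the Docotic.Pdf library. It does not require you to open the source code of your application (unlike, for example, iTextSharp with its viral AGPL 3 license).

Docotic.Pdf can be used to read PDF files and extract text with or without formatting. Please have a look at the article that shows how to extract text from PDFs.

Disclaimer: I work for Bit Miracle, the vendor of the library.

【Comments】:

It is only free for 30 days. Not a good option...

【Answer 2】: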

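PDFClown might help, but I would not recommend it for a large or heavily used application.

【Comments】:

It is licensed under the LGPL, so it can be used to create commercial proprietary software.

【Answer 3】:

iTextSharp is the best bet. I used it to build a spider for lucene.Net so that it could crawl PDFs.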

using System;
using System.IO;
using iTextSharp.text.pdf;
using System.Text.RegularExpressions;

namespace Spider.Utils
{
    /// <summary>
    /// Parses a PDF file and extracts the text from it.
    /// </summary>
    public class PDFParser
    {
        /// BT = Beginning of a text object operator
        /// ET = End of a text object operator
        /// Td move to the start of next line
        ///  5 Ts = superscript
        /// -5 Ts = subscript

        #region Fields

        #region _numberOfCharsToKeep
        /// <summary>
        /// The number of characters to keep, when extracting text.
        /// </summary>
        private static int _numberOfCharsToKeep = 15;
        #endregion

        #endregion

        #region ExtractText
        /// <summary>
        /// Extracts the text from a PDF file and writes it to a text file.
        /// </summary>
        /// <param name="inFileName">the full path to the pdf file.</param>
        /// <param name="outFileName">the output file name.</param>
        /// <returns>true if the text was extracted and written, false on any error</returns>
        public bool ExtractText(string inFileName, string outFileName)
        {
            StreamWriter outFile = null;
            try
            {
                // Create a reader for the given PDF file
                PdfReader reader = new PdfReader(inFileName);
                //outFile = File.CreateText(outFileName);
                outFile = new StreamWriter(outFileName, false, System.Text.Encoding.UTF8);

                Console.Write("Processing: ");

                // Width of the progress bar in '#' characters.
                int totalLen = 68;
                float charUnit = ((float)totalLen) / (float)reader.NumberOfPages;
                int totalWritten = 0;
                float curUnit = 0;

                for (int page = 1; page <= reader.NumberOfPages; page++)
                {
                    outFile.Write(ExtractTextFromPDFBytes(reader.GetPageContent(page)) + " ");

                    // Write the progress.
                    if (charUnit >= 1.0f)
                    {
                        for (int i = 0; i < (int)charUnit; i++)
                        {
                            Console.Write("#");
                            totalWritten++;
                        }
                    }
                    else
                    {
                        curUnit += charUnit;
                        if (curUnit >= 1.0f)
                        {
                            for (int i = 0; i < (int)curUnit; i++)
                            {
                                Console.Write("#");
                                totalWritten++;
                            }
                            curUnit = 0;
                        }
                    }
                }

                // Pad the progress bar to its full width.
                if (totalWritten < totalLen)
                {
                    for (int i = 0; i < (totalLen - totalWritten); i++)
                    {
                        Console.Write("#");
                    }
                }
                return true;
            }
            catch
            {
                return false;
            }
            finally
            {
                if (outFile != null) outFile.Close();
            }
        }
        #endregion

        #region ExtractTextFromPDFBytes
        /// <summary>
        /// This method processes an uncompressed Adobe (text) object
        /// and extracts text.
        /// </summary>
        /// <param name="input">uncompressed</param>
        /// <returns>the extracted text, or an empty string on error</returns>
        public string ExtractTextFromPDFBytes(byte[] input)
        {
            if (input == null || input.Length == 0) return "";

            try
            {
                string resultString = "";

                // Flag showing if we are currently inside a text object
                bool inTextObject = false;

                // Flag showing if the next character is literal
                // e.g. '\\' to get a '\' character or '\(' to get '('
                bool nextLiteral = false;

                // () Bracket nesting level. Text appears inside ()
                int bracketDepth = 0;

                // Keep previous chars to extract numbers etc.:
                char[] previousCharacters = new char[_numberOfCharsToKeep];
                for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' ';

                for (int i = 0; i < input.Length; i++)
                {
                    char c = (char)input[i];
                    if (input[i] == 213)
                        c = '\'';

                    if (inTextObject)
                    {
                        // Position the text
                        if (bracketDepth == 0)
                        {
                            if (CheckToken(new string[] { "TD", "Td" }, previousCharacters))
                            {
                                resultString += "\n\r";
                            }
                            else if (CheckToken(new string[] { "'", "T*", "\"" }, previousCharacters))
                            {
                                resultString += "\n";
                            }
                            else if (CheckToken(new string[] { "Tj" }, previousCharacters))
                            {
                                resultString += " ";
                            }
                        }

                        // End of a text object, also go to a new line.
                        if (bracketDepth == 0 &&
                            CheckToken(new string[] { "ET" }, previousCharacters))
                        {
                            inTextObject = false;
                            resultString += " ";
                        }
                        else if ((c == '(') && (bracketDepth == 0) && (!nextLiteral))
                        {
                            // Start outputting text
                            bracketDepth = 1;
                        }
                        else if ((c == ')') && (bracketDepth == 1) && (!nextLiteral))
                        {
                            // Stop outputting text
                            bracketDepth = 0;
                        }
                        else if (bracketDepth == 1)
                        {
                            // Just a normal text character.
                            // After a backslash, print the next character no matter what.
                            // Do not interpret.
                            if (c == '\\' && !nextLiteral)
                            {
                                resultString += c.ToString();
                                nextLiteral = true;
                            }
                            else
                            {
                                if (((c >= ' ') && (c <= '~')) ||
                                    ((c >= 128) && (c < 255)))
                                {
                                    resultString += c.ToString();
                                }

                                nextLiteral = false;
                            }
                        }
                    }

                    // Store the recent characters for
                    // when we have to go back for a check
                    for (int j = 0; j < _numberOfCharsToKeep - 1; j++)
                    {
                        previousCharacters[j] = previousCharacters[j + 1];
                    }
                    previousCharacters[_numberOfCharsToKeep - 1] = c;

                    // Start of a text object
                    if (!inTextObject && CheckToken(new string[] { "BT" }, previousCharacters))
                    {
                        inTextObject = true;
                    }
                }

                return CleanupContent(resultString);
            }
            catch
            {
                return "";
            }
        }

        private string CleanupContent(string text)
        {
            // Escaped parentheses and octal escapes, each paired with the entry at the same index in replace.
            string[] patterns = { @"\\\(", @"\\\)", @"\\226", @"\\222", @"\\223", @"\\224", @"\\340", @"\\342", @"\\344", @"\\300", @"\\302", @"\\304", @"\\351", @"\\350", @"\\352", @"\\353", @"\\311", @"\\310", @"\\312", @"\\313", @"\\362", @"\\364", @"\\366", @"\\322", @"\\324", @"\\326", @"\\354", @"\\356", @"\\357", @"\\314", @"\\316", @"\\317", @"\\347", @"\\307", @"\\371", @"\\373", @"\\374", @"\\331", @"\\333", @"\\334", @"\\256", @"\\231", @"\\253", @"\\273", @"\\251", @"\\221" };
            string[] replace = { "(", ")", "-", "'", "\"", "\"", "à", "â", "ä", "À", "Â", "Ä", "é", "è", "ê", "ë", "É", "È", "Ê", "Ë", "ò", "ô", "ö", "Ò", "Ô", "Ö", "ì", "î", "ï", "Ì", "Î", "Ï", "ç", "Ç", "ù", "û", "ü", "Ù", "Û", "Ü", "®", "™", "«", "»", "©", "'" };

            for (int i = 0; i < patterns.Length; i++)
            {
                string regExPattern = patterns[i];
                Regex regex = new Regex(regExPattern, RegexOptions.IgnoreCase);
                text = regex.Replace(text, replace[i]);
            }

            return text;
        }

        #endregion

        #region CheckToken
        /// <summary>
        /// Check if a certain one or two character token just came along (e.g. BT),
        /// surrounded by whitespace on both sides.
        /// </summary>
        /// <param name="tokens">the searched tokens</param>
        /// <param name="recent">the recent character array</param>
        /// <returns>true if any of the tokens ends just before the newest character</returns>
        private bool CheckToken(string[] tokens, char[] recent)
        {
            int last = _numberOfCharsToKeep - 1;
            foreach (string token in tokens)
            {
                // Operators such as ' and " are one character long; indexing token[1]
                // unconditionally throws for them, so align each token by its own length.
                int start = last - token.Length;
                if (token.Length == 0 || start < 1 ||
                    !IsDelimiter(recent[last]) || !IsDelimiter(recent[start - 1]))
                {
                    continue;
                }

                bool matches = true;
                for (int k = 0; k < token.Length && matches; k++)
                {
                    matches = recent[start + k] == token[k];
                }
                if (matches) return true;
            }
            return false;
        }

        private static bool IsDelimiter(char c)
        {
            return c == ' ' || c == 0x0d || c == 0x0a;
        }
        #endregion
    }
}

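CleanupContent above works from a fixed table of octal escapes, so any escape missing from the table is left in the output. As a minimal sketch of a more general approach, the helper below decodes every `\ddd` escape in one pass. The names `PdfEscapes` and `DecodeOctalEscapes` are hypothetical and belong neither to iTextSharp nor to the answer. It assumes each byte is Latin-1, so the table entries in the 0x80 to 0x9F range (for example `\222` or `\231`) would come out as control characters rather than punctuation, and it does not treat an escaped backslash before the digits specially.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of iTextSharp or the answer's code.
public static class PdfEscapes
{
    // Matches a backslash followed by one to three octal digits, e.g. \351.
    private static readonly Regex OctalEscape = new Regex(@"\\([0-7]{1,3})");

    public static string DecodeOctalEscapes(string text)
    {
        return OctalEscape.Replace(text, match =>
        {
            // Parse the digits as base 8 and keep only the low byte.
            int code = Convert.ToInt32(match.Groups[1].Value, 8) & 0xFF;
            // Assumption: the byte is Latin-1, so it maps straight to a Unicode code point.
            return ((char)code).ToString();
        });
    }
}
```

For example, `PdfEscapes.DecodeOctalEscapes(@"caf\351")` returns "café", since octal 351 is 0xE9.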
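【Comments】:

Hello ceetheman, I tried the code you provided above, but ran into a problem. Some of my PDF files are read correctly, but for others I get an "Index Out of Range" error in the function "CheckToken". Can you help me solve this?

Citing the source of your example would be a good, courteous idea. In this case, the same source code can be found here: codeproject.com/KB/cs/PDFToText.aspx

This code has a problem: it returns gobbledygook made up of the letters r and n. I ended up using PDFBox.

So strange... I fed in my PDF file and got 1627 blank lines in my text file...

The answer provided by Brock Nusser appears to be the most up-to-date solution and should be considered the correct answer to this question.

【Answer 4】:

Since this question was last answered in 2008, iTextSharp has improved its API significantly. If you download the latest version of the API from http://sourceforge.net/projects/itextsharp/, you can use the following code snippet to extract all the text from a PDF into a string.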

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PdfParser
{
    public static class PdfTextExtractor
    {
        public static string pdfText(string path)
        {
            PdfReader reader = new PdfReader(path);
            string text = string.Empty;
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Fully qualified, because this class shares its name with iTextSharp's PdfTextExtractor.
                text += iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, page);
            }
            reader.Close();
            return text;
        }
    }
}

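iTextSharp was later succeeded by iText 7 for .NET, where the same page loop goes through a `PdfDocument` instead of passing the reader around. The following is only a sketch under that assumption: the class name `Itext7TextReader` and the method `ReadAllText` are hypothetical, and the iText 7 package has to be referenced for it to compile.

```csharp
using System.Text;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;

// Hypothetical wrapper for illustration; only the iText.* types come from the library.
public static class Itext7TextReader
{
    public static string ReadAllText(string path)
    {
        var text = new StringBuilder();
        PdfDocument pdf = new PdfDocument(new PdfReader(path));
        try
        {
            // Pages are numbered from 1, as in iTextSharp.
            for (int page = 1; page <= pdf.GetNumberOfPages(); page++)
            {
                text.AppendLine(PdfTextExtractor.GetTextFromPage(pdf.GetPage(page)));
            }
        }
        finally
        {
            // Closing the document also releases the underlying reader.
            pdf.Close();
        }
        return text.ToString();
    }
}
```

Collecting the pages in a StringBuilder avoids the repeated string concatenation of the loop above, which matters mostly for long documents.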
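【Comments】:

You probably shouldn't name your class PdfTextExtractor, as it will clash with the class of the same name in iTextSharp.text.pdf.parser.

iTextSharp has moved to GitHub: github.com/itext/itextsharp

Maybe someone who answered here can help here?

It is now paid for commercial projects.

iTextSharp has been deprecated and replaced by iText 7: github.com/itext/itext7-dotnet.

【Answer 5】:

itext?

http://www.itextpdf.com/terms-of-use/index.php

Tutorial

http://www.vogella.com/articles/JavaPDF/article.html

【Comments】:

【Answer 6】: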

// Requires: using System; using System.Data; using System.Text;
//           using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
public string ReadPdfFile(object Filename, DataTable ReadLibray)
{
    // Open the file once and reuse the reader for every page.
    PdfReader reader = new PdfReader((string)Filename);
    string strText = string.Empty;

    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
        String s = PdfTextExtractor.GetTextFromPage(reader, page, its);

        s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
        strText = strText + s;
    }
    reader.Close();
    return strText;
}

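【Comments】:

The only method that worked for me! Thanks, dude!

PdfReader? Please add some information.

@DT See iTextSharp

【Answer 7】:

Aspose PDF works pretty well. Then again, you have to pay for it.

【Comments】:

【Answer 8】:

http://www.c-sharpcorner.com/UploadFile/psingh/PDFFileGenerator12062005235236PM/PDFFileGenerator.aspx is open source and may be a good starting point for you.

【Comments】:

【Answer 9】:

iText is the best library I know of. It was originally written in Java, and there is a .NET port as well.

See http://www.ujihara.jp/iTextdotNET/en/

【Comments】:

That is not the official port, and the link is broken anyway. The official .NET port of iText, iTextSharp, can be found on GitHub: github.com/itext/itextsharp

【Answer 10】: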

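There is also LibHaru

http://libharu.org/wiki/Main_Page

【Comments】:

The link is broken. libharu.org

Also: "At the moment libHaru does not support reading and editing existing PDF files, and it is unlikely that this support will ever appear." Is this really relevant?

【Answer 11】:

You could have a look at this: http://www.codeproject.com/KB/showcase/pdfrasterizer.aspx It is not completely free, but it looks very nice.

Alex

【Comments】:

Does this help with converting a PDF to raw text? It seems the tool converts it to an image, so I would need an OCR library :-)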
