How to read PDF contents from the browser in Selenium C#

Posted 2023-03-07


【Asked】: 2020-06-25 08:22:23
【Question】:

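I am automating the following scenario:

1. Click a button, which opens a PDF file in a new window.
2. Read the contents of the PDF file opened in the new window.

Help needed: after switching to the window that shows the PDF, I don't know how to proceed.

Note: this file cannot be downloaded.

I tried the following: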

    public void verifypdf()
    {
        var browserTabs = driver.WindowHandles;
        // Index 1 is used below, so at least two windows must be open.
        Assert.True(browserTabs.Count > 1, "Form not open in new window");
        driver.SwitchTo().Window(browserTabs[1]);
        string pdfUrl = driver.Url;
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
        string test = readPDFContent(pdfUrl);
        driver.Close();
        driver.SwitchTo().Window(browserTabs[0]);
    }

    public String readPDFContent(String pdfUrl)
    {
        Uri uri = new Uri(pdfUrl);

        //how to proceed from here??
    }

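【Comments】:

Please clarify: what do you mean by "read the contents of the PDF file"? Extract the text, take a screenshot? If you need to parse something, you have to download the file.

I am reading the text. And I don't think the file has to be downloaded: in Java this content can be read without downloading the file, so it should be possible in C# too. Please correct me if I'm wrong.

Probably you used some PDF library in Java that downloads the PDF content implicitly. You need to download the file content in any case; the programming language does not matter here. For example, this is how it looks in Java: ***.com/questions/40738373/…

I have submitted a C# code sample as an answer.

【Answer 1】: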

You need to fetch the complete PDF content and then use a PDF library to extract the text.

Here is sample code that uses the standard .NET WebClient class and the Docotic.Pdf library:

    using (var client = new WebClient())
    {
        // Fetch the raw PDF bytes from the URL shown in the browser tab.
        byte[] data = client.DownloadData(pdfUrl);
        using (var pdf = new PdfDocument(data))
        {
            string text = pdf.GetText();
            // ... do something with the extracted text
        }
    }
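
If the PDF is served only to the logged-in browser session, a plain WebClient request may receive a login page instead of the file. The following is a minimal sketch, assuming the site authenticates requests by cookies alone: it copies the cookies of the current Selenium session into the download request. `PdfDownloadHelper` and `DownloadWithSessionCookies` are hypothetical names introduced for illustration; they are not part of Selenium or Docotic.Pdf.

```csharp
using System.Linq;
using System.Net;
using OpenQA.Selenium;

// Hypothetical helper, for illustration only.
public static class PdfDownloadHelper
{
    // Downloads pdfUrl while sending the cookies of the current Selenium session.
    // Assumption: the server identifies the session by cookies only.
    public static byte[] DownloadWithSessionCookies(IWebDriver driver, string pdfUrl)
    {
        // Build a "name=value; name=value" Cookie header from the cookies
        // visible to the page that the driver is currently on.
        string cookieHeader = string.Join("; ",
            driver.Manage().Cookies.AllCookies.Select(c => c.Name + "=" + c.Value));

        using (var client = new WebClient())
        {
            if (cookieHeader.Length > 0)
            {
                client.Headers.Add(HttpRequestHeader.Cookie, cookieHeader);
            }
            return client.DownloadData(pdfUrl);
        }
    }
}
```

The returned bytes can be passed to `new PdfDocument(data)` exactly as in the sample above. Call the helper while the driver is switched to the PDF window, since Selenium reports cookies for the current page only.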

【Comments】:

【Answer 2】:

    // Download the file to the local machine.
    // Default download path on Windows: Environment.GetEnvironmentVariable("USERPROFILE") + @"\Downloads\"

    using System.IO;
    using System.Net;

    public static class FileDownloader
    {
        public static string Download(this string downloadFrom, string fileName = "newFile.pdf", string downloadTo = "")
        {
            // BrowserConfig.DownloadsPath is the answerer's own helper (not shown here).
            if (downloadTo.Length == 0) downloadTo = BrowserConfig.DownloadsPath;
            // Build the path once so the returned value matches the file that was written.
            string filePath = Path.Combine(downloadTo, fileName);
            using var client = new WebClient();
            var arr = client.DownloadData(downloadFrom);
            File.WriteAllBytes(filePath, arr);
            return filePath;
        }
    }

    // Read the text of the PDF.

    using System;
    using System.IO;
    using System.Net;
    using System.Text;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    // Extension methods must live in a static class; this class name is illustrative.
    public static class PdfTextReader
    {
        public static string ReadPdfFile(this string fileName)
        {
            StringBuilder text = new StringBuilder();

            // fileName is passed to WebRequest, so it can be the URL of the PDF.
            var request = WebRequest.Create(fileName);
            using (var response = request.GetResponse())
            using (var responseStream = response.GetResponseStream())
            {
                byte[] bytes = iTextSharp.text.io.StreamUtil.InputStreamToArray(responseStream);
                PdfReader pdfReader = new PdfReader(bytes);
                for (int page = 1; page <= pdfReader.NumberOfPages; page++)
                {
                    // A fresh strategy per page keeps text from accumulating across pages.
                    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
                    // GetTextFromPage already returns a .NET string, so no re-encoding is needed.
                    text.Append(currentText);
                }
                pdfReader.Close();
            }
            return text.ToString();
        }
    }
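
For context, here is a rough sketch of how the `ReadPdfFile` extension method above might be wired into the scenario from the question. It assumes that `ReadPdfFile` is in scope through a static class, that exactly one extra window is opened by the button click, and that the PDF window exposes a URL that can be requested directly. `PdfTabReader` and `ReadPdfFromNewWindow` are hypothetical names.

```csharp
using System.Linq;
using OpenQA.Selenium;

// Hypothetical helper, for illustration only.
public static class PdfTabReader
{
    // Switches to the window that is not the current one, reads the PDF text
    // from its URL, then closes that window and returns to the original one.
    public static string ReadPdfFromNewWindow(IWebDriver driver)
    {
        string original = driver.CurrentWindowHandle;
        // Throws InvalidOperationException if no second window is open.
        string pdfWindow = driver.WindowHandles.First(h => h != original);

        driver.SwitchTo().Window(pdfWindow);
        try
        {
            // ReadPdfFile is the extension method from the answer above.
            return driver.Url.ReadPdfFile();
        }
        finally
        {
            driver.Close();
            driver.SwitchTo().Window(original);
        }
    }
}
```

Selecting the window by comparing handles avoids relying on the order of `WindowHandles`, which is not guaranteed to match the order in which windows were opened.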

【Comments】:

Please explain what your code does and how it does it.
