C# iTextPdf: reading Arabic text from a PDF in the correct format


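I am developing an application that extracts Arabic text from a PDF into a string variable. Each word comes out in reverse order (وسيم instead of ميسو), and sometimes the order is correct but the characters are separated (ميسو), the way English characters are, whereas in Arabic the characters are joined together. Is there any solution? I am using a Visual Studio 2017 C# MVC application locally on Windows 10, with iTextSharp to read the text from the PDF.

The text looks fine in a PDF viewer, but when I run the following code: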

using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

private string GetTextFromPDF(string Path)
{
    StringBuilder text = new StringBuilder();
    using (PdfReader reader = new PdfReader(Path))
    {
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
        }
    }

    return text.ToString();
}

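In the result, every word is written in reverse order.

When the same code is run on a PDF with English content, the text comes out correctly.

Note: the problem is not only the reversed order. If I reverse the character order manually (reversing the array), the text is displayed with separated characters.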
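
To make the note above concrete, here is a minimal sketch of the manual reversal it describes, plus one possible explanation for the separated characters. It assumes, without confirmation from the question, that the extractor returns Arabic presentation forms (U+FE70 to U+FEFF) in visual order; in that case Unicode NFKC normalization folds them back to the base letters, which a renderer can then join. `ArabicTextDemo`, `ReverseChars`, and `ToLogicalOrder` are hypothetical names used only for illustration.

```csharp
using System;
using System.Text;

public static class ArabicTextDemo
{
    // Reverses the UTF-16 code units of a string: the manual
    // "reverse the array" attempt described in the note.
    public static string ReverseChars(string s)
    {
        char[] chars = s.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }

    // Assumption: the input holds Arabic presentation forms in visual order.
    // After reversal, NFKC maps each presentation form to its base letter.
    public static string ToLogicalOrder(string visual)
    {
        return ReverseChars(visual).Normalize(NormalizationForm.FormKC);
    }

    public static void Main()
    {
        // Final meem, medial yeh, initial seen, isolated waw (visual order).
        string visual = "\uFEE2\uFEF4\uFEB3\uFEED";
        string logical = ToLogicalOrder(visual);

        // Print code points rather than glyphs so the console font does not matter.
        foreach (char c in logical)
        {
            Console.Write("U+{0:X4} ", (int)c);
        }
        Console.WriteLine();
    }
}
```

A plain reversal like this also flips embedded numbers and Latin words and misplaces combining marks, which is why a full bidirectional algorithm is normally needed rather than a per-line reverse.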

Answer

You need to mirror the glyphs (MirrorGlyphs); that is how most PDF writers create Arabic PDFs. I have ported bidi.js to C# to solve this problem:

public class BidiResult
{
    public string Text { set; get; }
    public bool IsRtl { set; get; }

    public BidiResult(string text, bool isRtl)
    {
        this.Text = text;
        this.IsRtl = isRtl;
    }
}


/// <summary>
/// Ported from https://github.com/mozilla/pdf.js/blob/master/src/core/bidi.js
/// </summary>
public static class Bidi
{
    /// <summary>
    /// Character types for symbols from 0000 to 00FF.
    /// </summary>
    public static string[] BaseTypes = new[] {
                                "BN", "BN", "BN", "BN", "BN", "BN", "BN", "BN", "BN", "S", "B", "S", "WS",
                                "B", "BN", "BN", "BN", "BN", "BN", "BN", "BN", "BN", "BN", "BN", "BN", "BN",
                                "BN", "BN", "B", "B", "B", "S", "WS", "ON", "ON", "ET", "ET", "ET", "ON",
                                "ON", "ON", "ON", "ON", "ON", "CS", "ON", "CS", "ON", "EN", "EN", "EN",
                                "EN", "EN", "EN", "EN", "EN", "EN", "EN", "ON", "ON", "ON", "ON", "ON",
                                "ON", "ON", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L",
                                "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "ON", "ON",
                                "ON", "ON", "ON", "ON", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L",
                                "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L",
                                "L", "ON", "ON", "ON", "ON", "BN", "BN", "BN", "BN", "BN", "BN", "B", "BN",
                                "BN", "BN", "BN", "BN", "BN", "BN", "BN", "BN", "BN", "BN", "BN", "BN",
                                "BN", "BN", "BN", "BN", "BN", "BN", "BN", "BN", "BN", "BN", "BN", "BN",
                                "BN", "CS", "ON", "ET", "ET", "ET", "ET", "ON", "ON", "ON", "ON", "L", "ON",
                                "ON", "ON", "ON", "ON", "ET", "ET", "EN", "EN", "ON", "L", "ON", "ON", "ON",
                                "EN", "L", "ON", "ON", "ON", "ON", "ON", "L", "L", "L", "L", "L", "L", "L",
                                "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L",
                                "L", "ON", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L",
                                "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L",
                                "L", "L", "L", "ON", "L", "L", "L", "L", "L", "L", "L", "L"
                            };

    /// <summary>
    /// Character types for symbols from 0600 to 06FF
    /// </summary>
    public static string[] ArabicTypes = new[] {
                                "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",
                                "CS", "AL", "ON", "ON", "NSM", "NSM", "NSM", "NSM", "NSM", "NSM", "AL",
                                "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",
                                "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",
                                "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",
                                "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",
                                "AL", "AL", "AL", "AL", "NSM", "NSM", "NSM", "NSM", "NSM", "NSM", "NSM",
                                "NSM", "NSM", "NSM", "NSM", "NSM", "NSM", "NSM", "AL", "AL", "AL", "AL",
                                "AL", "AL", "AL", "AN", "AN", "AN", "AN", "AN", "AN", "AN", "AN", "AN",
                                "AN", "ET", "AN", "AN", "AL", "AL", "AL", "NSM", "AL", "AL", "AL", "AL",
                                "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",
                                "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",
                                "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",
                                "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",
                                "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",
                                "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",
                                "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",
                                "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",
                                "AL", "NSM", "NSM", "NSM", "NSM", "NSM", "NSM", "NSM", "NSM", "NSM", "NSM",
                                "NSM", "NSM", "NSM", "NSM", "NSM", "NSM", "NSM", "NSM", "NSM", "ON", "NSM",
                                "NSM", "NSM", "NSM", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",
                                "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"
                            };

    public static bool IsOdd(int i)
    {
        return (i & 1) != 0;
    }

    public static bool IsEven(int i)
    {
        return (i & 1) == 0;
    }

    public static int FindUnequal(string[] arr, int start, string value)
    {
        int j;
        var jj = arr.Length;
        for (j = start; j < jj; ++j)
        {
            if (arr[j] != value)
                return j;
        }
        return j;
    }

    public static void SetValues(string[] arr, int start, int end, string value)
    {
        for (var j = start; j < end; ++j)
        {
            arr[j] = value;
        }
    }

    public static char[] ReverseValues(char[] arr, int start, int end)
    {
        var j = end - 1;
        for (var i = start; i < j; ++i, --j)
        {
            var temp = arr[i];
            arr[i] = arr[j];
            arr[j] = temp;
        }
        return arr;
    }

    public static char MirrorGlyphs(char c)
    {
        /*
         # BidiMirroring-1.txt
         0028; 0029 # LEFT PARENTHESIS
         0029; 0028 # RIGHT PARENTHESIS
         003C; 003E # LESS-THAN SIGN
         003E; 003C # GREATER-THAN SIGN
         005B; 005D # LEFT SQUARE BRACKET
         005D; 005B # RIGHT SQUARE BRACKET
         007B; 007D # LEFT CURLY BRACKET
         007D; 007B # RIGHT CURLY BRACKET
         00AB; 00BB # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
         00BB; 00AB # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
         */
        switch (c)
        {
            case '(':
                return ')';
            case ')':
                return '(';
            case '<':
                return '>';
            case '>':
                return '<';
            case ']':
                return '[';
            case '[':
                return ']';
            case '}':
                return '{';
            case '{':
                return '}';
            case '\u00AB':
                return '\u00BB';
            case '\u00BB':
                return '\u00AB';
            default:
                return c;
        }
    }

    public static BidiResult BidiText(string str, int startLevel)
    {
        var isLtr = true;
        var strLength = str.Length;
        if (strLength == 0)
            return new BidiResult(str, false);

        // get types, fill arrays

        var chars = new char[strLength];
        var types = new string[strLength];
        var oldtypes = new string[strLength];
        var numBidi = 0;

        for (var i = 0; i < strLength; ++i)
        {
            chars[i] = str[i];

            var charCode = str[i];
            string charType = "L";
            if (charCode <= 0x00ff)
                charType = BaseTypes[charCode];
            else if (0x0590 <
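
As a small, self-contained illustration of why a helper such as MirrorGlyphs is needed alongside ReverseValues, the sketch below reverses a half-open range of characters and then swaps paired brackets inside it. Without the second step, a reversed run such as "(cba)" would come out as ")abc(". `MirrorDemo`, `Mirror`, and `ReverseAndMirror` are hypothetical names; this is not the ported algorithm, only the reverse-then-mirror idea under simplified assumptions.

```csharp
using System;

public static class MirrorDemo
{
    // Standalone copy of the ASCII bracket pairs handled by MirrorGlyphs.
    private static char Mirror(char c)
    {
        switch (c)
        {
            case '(': return ')';
            case ')': return '(';
            case '<': return '>';
            case '>': return '<';
            case '[': return ']';
            case ']': return '[';
            case '{': return '}';
            case '}': return '{';
            default: return c;
        }
    }

    // Reverses the characters in [start, end) and mirrors each one,
    // leaving the rest of the string untouched.
    public static string ReverseAndMirror(string s, int start, int end)
    {
        char[] chars = s.ToCharArray();
        Array.Reverse(chars, start, end - start);
        for (int i = start; i < end; i++)
        {
            chars[i] = Mirror(chars[i]);
        }
        return new string(chars);
    }

    public static void Main()
    {
        // Characters 2..6 hold "(cba)"; the surrounding text is left alone.
        Console.WriteLine(ReverseAndMirror("x (cba) y", 2, 7)); // prints "x (abc) y"
    }
}
```

The range is half-open, matching the `start`/`end` convention of ReverseValues in the listing, where `end` is one past the last character to reverse.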
