itextsharp: words are broken when splitting textchunk into words

[Posted]: 2015-12-16 17:30:43

[Question]:

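I want to highlight several keywords in a set of PDF files. First, we have to identify the individual words and match them against my keywords. I found an example: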

using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

// RectAndText is a small helper class (not shown) that pairs a Rectangle (Rect) with a string (Text).
class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
    // Holds the bounding rectangle and text of each chunk
    public List<RectAndText> myPoints = new List<RectAndText>();

    List<string> topicTerms;

    public MyLocationTextExtractionStrategy(List<string> topicTerms)
    {
        this.topicTerms = topicTerms;
    }

    // Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo)
    {
        base.RenderText(renderInfo);

        // Get the bounding box for the chunk of text
        var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        var topRight = renderInfo.GetAscentLine().GetEndPoint();

        // Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
            bottomLeft[Vector.I1],
            bottomLeft[Vector.I2],
            topRight[Vector.I1],
            topRight[Vector.I2]);

        // Add this to our main collection
        // (meaningless words are not filtered out yet)
        string text = renderInfo.GetText();
        this.myPoints.Add(new RectAndText(rect, text));
    }
}

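However, I found that many words are broken. For example, "stop" comes out as "st" and "op". Is there another way to identify individual words and their positions?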

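[Comments]:

I'd always recognize my old code! Anyway, see mkl's answer here about using IsChunkAtWordBoundary() to determine whether two "chunks" should be one "word".

Thanks for your old code. It really helps a lot. I will try your suggestion later. Thanks again.

Anyway, I found that a better place to collect individual words is GetResultantText() rather than RenderText().

[Answer 1]: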

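When you want to collect individual words together with their coordinates, a better approach is to override the existing LocationTextExtractionStrategy. Here is my code: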

// Assumption: DUMP_STATE, DumpState, locationalResult and filterTextChunks are accessible here,
// for example because this method lives in a copy of the LocationTextExtractionStrategy source.
public virtual String GetResultantText(ITextChunkFilter chunkFilter)
{
    if (DUMP_STATE)
    {
        DumpState();
    }

    List<TextChunk> filteredTextChunks = filterTextChunks(locationalResult, chunkFilter);
    filteredTextChunks.Sort();

    // Chunks that belong to the word currently being assembled
    List<RectAndText> tmpList = new List<RectAndText>();

    StringBuilder sb = new StringBuilder();
    TextChunk lastChunk = null;
    foreach (TextChunk chunk in filteredTextChunks)
    {
        if (lastChunk != null)
        {
            if (chunk.SameLine(lastChunk))
            {
                // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
                if (IsChunkAtWordBoundary(chunk, lastChunk) && !StartsWithSpace(chunk.Text) && !EndsWithSpace(lastChunk.Text))
                {
                    sb.Append(' ');
                    mergeAndStoreChunk(tmpList);
                }
            }
            else
            {
                sb.Append('\n');
                // A line break also ends the current word
                mergeAndStoreChunk(tmpList);
            }
        }

        sb.Append(chunk.Text);

        // StartLocation and EndLocation should lie on the text baseline in the stock strategy,
        // so this rectangle marks the horizontal extent of the chunk and has little or no height.
        var startLocation = chunk.StartLocation;
        var endLocation = chunk.EndLocation;
        var rect = new iTextSharp.text.Rectangle(startLocation[0], startLocation[1], endLocation[0], endLocation[1]);
        tmpList.Add(new RectAndText(rect, chunk.Text));

        lastChunk = chunk;
    }

    // Store the final word, which has no following boundary to trigger the merge
    mergeAndStoreChunk(tmpList);

    return sb.ToString();
}

// Merges the collected chunks into one word, stores it in myPoints and empties the list
private void mergeAndStoreChunk(List<RectAndText> tmpList)
{
    if (tmpList.Count == 0)
    {
        return;
    }

    RectAndText mergedChunk = tmpList[0];
    for (int i = 1; i < tmpList.Count; i++)
    {
        RectAndText nowChunk = tmpList[i];
        // Extend the rectangle to the right edge of this chunk and append its text
        mergedChunk.Rect.Right = nowChunk.Rect.Right;
        mergedChunk.Text += nowChunk.Text;
    }
    this.myPoints.Add(mergedChunk);
    tmpList.Clear();
}

myPoints is the list that ends up holding everything we want.
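
To see the merging idea in isolation, here is a minimal sketch that runs without iTextSharp. Fragment, WordMerger and the maxGap threshold are hypothetical names and values made up for this illustration; they are not part of the library, and the gap rule is only a rough stand-in for what IsChunkAtWordBoundary() decides. It assumes the fragments are already sorted left to right on a single line.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a text chunk: its text plus left/right x positions on one line.
public class Fragment
{
    public string Text;
    public float Left;
    public float Right;

    public Fragment(string text, float left, float right)
    {
        Text = text;
        Left = left;
        Right = right;
    }
}

public static class WordMerger
{
    // Merges fragments (sorted left to right, single line) into words.
    // maxGap is an assumed threshold: a larger horizontal gap starts a new word.
    public static List<Fragment> Merge(IList<Fragment> fragments, float maxGap)
    {
        var words = new List<Fragment>();
        Fragment current = null;
        foreach (var f in fragments)
        {
            if (current != null && f.Left - current.Right <= maxGap)
            {
                // Close enough to the previous fragment: same word, so extend it
                current.Text += f.Text;
                current.Right = f.Right;
            }
            else
            {
                // Gap too large (or first fragment): store the finished word and start a new one
                if (current != null) words.Add(current);
                current = new Fragment(f.Text, f.Left, f.Right);
            }
        }
        // Store the last word, which has no following gap to trigger the flush
        if (current != null) words.Add(current);
        return words;
    }
}

public static class Demo
{
    public static void Main()
    {
        var fragments = new List<Fragment>
        {
            new Fragment("st", 10f, 20f),
            new Fragment("op", 20.2f, 30f),
            new Fragment("here", 34f, 55f),
        };

        foreach (var w in WordMerger.Merge(fragments, 1.0f))
        {
            Console.WriteLine("{0} [{1} .. {2}]", w.Text, w.Left, w.Right);
        }
    }
}
```

With the sample input, "st" and "op" are joined into "stop" because the gap between them is below the threshold, while "here" stays a separate word. In a real PDF the threshold would have to depend on the font's space width rather than being a fixed number.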
