itextsharp: words are broken when splitting textchunk into words
【Posted】: 2015-12-16 17:30:43
【Question】: I want to highlight several keywords in a set of PDF files. First, we have to identify the individual words and match them against my keywords. I found an example:
class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();
    List<string> topicTerms;

    public MyLocationTextExtractionStrategy(List<string> topicTerms)
    {
        this.topicTerms = topicTerms;
    }

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo)
    {
        base.RenderText(renderInfo);

        //Get the bounding box for the chunk of text
        var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        var topRight = renderInfo.GetAscentLine().GetEndPoint();

        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
            bottomLeft[Vector.I1],
            bottomLeft[Vector.I2],
            topRight[Vector.I1],
            topRight[Vector.I2]
        );

        //Add this to our main collection
        //TODO: filter the meaningless words
        string text = renderInfo.GetText();
        this.myPoints.Add(new RectAndText(rect, text));
    }
}
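However, I found that many words come out broken. For example, "stop" arrives as "st" and "op". Is there another way to identify individual words and their positions?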
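【Comments】:
I can always recognize my old code! Anyway, see mkl's answer here about using IsChunkAtWordBoundary() to determine whether two "chunks" should be one "word".
Thanks for your old code. It really helped a lot. I will try your suggestion later. Thanks again.
Anyway, I found that a better place to collect individual words is GetResultantText() rather than RenderText().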
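The IsChunkAtWordBoundary() check mentioned in the comments can be pictured as a comparison between the horizontal gap separating two chunks and the width of a space in the current font. The sketch below is self-contained and does not use iTextSharp: `ChunkSketch`, `WordBoundarySketch`, and the exact thresholds are assumptions made for illustration, not the library's actual code.

```csharp
using System;

// Hypothetical stand-in for iTextSharp's TextChunk: it keeps only the
// horizontal extent along the baseline and the width of a space character.
public sealed class ChunkSketch
{
    public string Text { get; }
    public float StartX { get; }
    public float EndX { get; }
    public float SpaceWidth { get; }

    public ChunkSketch(string text, float startX, float endX, float spaceWidth)
    {
        Text = text;
        StartX = startX;
        EndX = endX;
        SpaceWidth = spaceWidth;
    }
}

public static class WordBoundarySketch
{
    // Returns true when the gap between two chunks on the same line is wide
    // enough (or overlaps enough) to count as a word break.
    // The thresholds are assumptions chosen for this sketch.
    public static bool IsAtWordBoundary(ChunkSketch chunk, ChunkSketch previous)
    {
        float gap = chunk.StartX - previous.EndX;
        return gap < -chunk.SpaceWidth || gap > chunk.SpaceWidth / 2.0f;
    }
}
```

With "st" ending at x = 20 and "op" starting at x = 20.2, the gap is far below half a space width, so the two fragments would be treated as one word rather than two.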
【Answer 1】:
When you want to collect individual words and their coordinates, a better approach is to override the existing LocationTextExtractionStrategy. Here is my code:
public virtual String GetResultantText(ITextChunkFilter chunkFilter)
{
    if (DUMP_STATE)
    {
        DumpState();
    }

    List<TextChunk> filteredTextChunks = filterTextChunks(locationalResult, chunkFilter);
    filteredTextChunks.Sort();

    //Chunks that make up the word currently being assembled
    List<RectAndText> tmpList = new List<RectAndText>();
    StringBuilder sb = new StringBuilder();
    TextChunk lastChunk = null;
    foreach (TextChunk chunk in filteredTextChunks)
    {
        if (lastChunk != null)
        {
            if (chunk.SameLine(lastChunk))
            {
                // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
                if (IsChunkAtWordBoundary(chunk, lastChunk) && !StartsWithSpace(chunk.Text) && !EndsWithSpace(lastChunk.Text))
                {
                    sb.Append(' ');
                    //A new word starts here, so store the one collected so far
                    mergeAndStoreChunk(tmpList);
                }
            }
            else
            {
                sb.Append('\n');
                //A line break also ends the current word
                mergeAndStoreChunk(tmpList);
            }
        }

        sb.Append(chunk.Text);
        var startLocation = chunk.StartLocation;
        var endLocation = chunk.EndLocation;
        var rect = new iTextSharp.text.Rectangle(startLocation[0], startLocation[1], endLocation[0], endLocation[1]);
        tmpList.Add(new RectAndText(rect, chunk.Text));
        lastChunk = chunk;
    }

    //Store the final word, which no later boundary would flush
    mergeAndStoreChunk(tmpList);
    return sb.ToString();
}

private void mergeAndStoreChunk(List<RectAndText> tmpList)
{
    //Nothing has been collected yet
    if (tmpList.Count == 0)
    {
        return;
    }

    //Extend the first chunk so that it covers every chunk of the word
    RectAndText mergedChunk = tmpList[0];
    for (int i = 1; i < tmpList.Count; i++)
    {
        RectAndText nowChunk = tmpList[i];
        mergedChunk.Rect.Right = nowChunk.Rect.Right;
        mergedChunk.Text += nowChunk.Text;
    }
    this.myPoints.Add(mergedChunk);
    tmpList.Clear();
}
myPoints is the list that ends up holding everything we want: each whole word together with its rectangle.
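Once myPoints holds whole words, the keyword matching described in the question becomes a simple lookup. The following is a minimal, self-contained sketch of that step. `WordBox` is a hypothetical stand-in for RectAndText, and `KeywordMatchSketch`, its case-insensitive comparison, and the punctuation it trims are all assumptions rather than anything from the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for RectAndText: a word plus its bounding box.
public sealed class WordBox
{
    public string Text { get; }
    public float Left { get; }
    public float Bottom { get; }
    public float Right { get; }
    public float Top { get; }

    public WordBox(string text, float left, float bottom, float right, float top)
    {
        Text = text;
        Left = left;
        Bottom = bottom;
        Right = right;
        Top = top;
    }
}

public static class KeywordMatchSketch
{
    // Returns the words whose text, after trimming whitespace and common
    // punctuation, equals one of the topic terms (ignoring case).
    public static List<WordBox> FindKeywords(IEnumerable<WordBox> words, IEnumerable<string> topicTerms)
    {
        var terms = new HashSet<string>(topicTerms, StringComparer.OrdinalIgnoreCase);
        return words.Where(w => terms.Contains(Normalize(w.Text))).ToList();
    }

    private static string Normalize(string text)
    {
        return text.Trim().Trim('.', ',', ';', ':', '!', '?', '"', '(', ')');
    }
}
```

The rectangles of the returned words are the regions that would then be highlighted; how the highlight annotation is drawn is outside the scope of this sketch.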