Handbook of Document Image Processing and Recognition
Posted 2008nmj
Editors: David Doermann (University of Maryland)
Karl Tombre (University of Lorraine)
Foreword
In the beginning, there was only OCR. After some false starts, OCR became a competitive commercial enterprise in the 1950’s. A decade later there were more than 50 manufacturers in the US alone. With the advent of microprocessors and inexpensive optical scanners, the price of OCR dropped from tens and hundreds of thousands of dollars to that of a bottle of wine. Software displaced the racks of electronics. By 1985 anybody could program and test their ideas on a PC, and then write a paper about it (and perhaps even patent it).
We know, however, very little about current commercial methods or in-house experimental results. Competitive industries have scarce motivation to publish (and their patents may only be part of their legal arsenal). The dearth of industrial authors in our publications is painfully obvious. Herbert Schantz’s book, The History of OCR, was an exception: he traced the growth of REI, which was one of the major success stories of the 1960’s and 1970’s. He also told the story, widely mirrored in sundry wikis and treatises on OCR, of the previous fifty years’ attempts to mechanize reading. Among other manufacturers of the period, IBM may have stood alone in publishing detailed (though often delayed) information about its products.
Of the 4000-8000 articles published since 1900 on character recognition (my estimate), at most a few hundred really bear on OCR (construed as machinery - now software - that converts visible language to a searchable digital format). The rest treat character recognition as a prototypical classification problem. It is, of course, researchers’ universal familiarity with at least some script that turned character recognition into the pre-eminent vehicle for demonstrating and illustrating new ideas in pattern recognition. Even though some of us cannot tell an azalea from a begonia, a sharp sign from a clef, a loop from a tented arch, an erythrocyte from a leukocyte, or an alluvium from an anticline, all of us know how to read.
Until about 30 years ago, OCR meant recognizing mono-spaced OCR fonts and typewritten scripts one character at a time – eventually at the rate of several thousand characters per second. Word recognition followed for reading difficult-to-segment typeset matter. The value of language models more elaborate than letter n-gram frequencies and lexicons without word frequencies gradually became clear. Because more than half of the world population is polyglot, OCR too became multilingual (as Henry Baird predicted that it must). This triggered a movement to post all the cultural relics of the past on the Web. Much of the material awaiting conversion, ancient and modern, stretches the limits of human readability. Like humans, OCR must take full advantage of syntax, style, context, and semantics.
Although many academic researchers are aware that OCR is much more than classification, they have yet to develop a viable, broad-range, end-to-end OCR system (but they may be getting close). A complete OCR system, with language and script recognition, colored print capability, column and line layout analysis, accurate character/word, numeric, symbol and punctuation recognition, language models, document-wide consistency, tuneability and adaptability, graphics subsystems, effectively embedded interactive error correction, and multiple output formats, is far more than the sum of its parts. Furthermore, specialized systems - for postal address reading, check reading, litigation, and bureaucratic forms processing - also require high throughput and different error-reject trade-offs. Real OCR simply isn’t an appropriate PhD dissertation project.
I never know whether to call hand print recognition and handwriting recognition “OCR,” but abhor “intelligent” as a qualifier for the latest wrinkle. No matter: they are here to stay until tracing glyphs with a stylus goes the way of the quill. Both human and machine legibility of manuscripts depend significantly on the motivation of the writer: a hand-printed income tax return requesting a refund is likely to be more legible than one reporting an underpayment. Immediate feedback, the main advantage of on-line recognition, is a powerful form of motivation. Humans still learn better than machines.
Document Image Analysis (DIA) is a superset of OCR, but many of its other popular subfields require OCR. Almost all line drawings contain text. An E-sized telephone company drawing, for instance, has about 3000 words and numbers (including revision notices). Music scores contain numerals and instructions like pianissimo. A map without place names and elevations would have limited use. Mathematical expressions abound in digits and alphabetic fragments like log, limit, tan or argmin. Good lettering used to be a prime job qualification for the draftsmen who drew the legacy drawings that we are now converting to CAD. Unfortunately, commercial OCR systems, tuned to paragraph-length segments of text, do poorly on the alphanumeric fragments typical of such applications. When Open Source OCR matures, it will provide a fine opportunity for customization to specialized applications that have not yet attracted heavy-weight developers. In the meantime, the conversion of documents containing a mix of text and line art has given rise to distinct sub-disciplines with their own conference sessions and workshops that target graphics techniques like vectorization and complex symbol configurations.
Another subfield of DIA investigates what to do with automatically or manually transcribed books, technical journals, magazines and newspapers. Although Information Retrieval (IR) is not generally considered part of DIA or vice-versa, the overlap between them includes “logical” document segmentation, extraction of tables of contents, linking figures and illustrations to textual references, and word spotting. A recurring topic is assessing the effect of OCR errors on downstream applications. One factor that keeps the two disciplines apart is that IR experiments (e.g., TREC) typically involve orders of magnitude more documents than DIA experiments because the number of characters in any collection is far smaller than the number of pixels.
Computer vision used to be easily distinguished from the image processing aspects of DIA by its emphasis on illumination and camera position. The border is blurring because even cellphone cameras now offer sufficient spatial resolution for document image capture at several hundred dpi as well as for legible text in large scene images. The correction of the contrast and geometric distortions in the resulting images goes well beyond what is required for scanned documents.
This collection suggests that we are still far from a unified theory of DIA or even OCR. The Handbook is all the more useful because we have no choice except to rely on heuristics or algorithms based on questionable assumptions. The most useful methods available to us were all invented rather than derived from prime principles. When the time is ripe, many alternative methods are invented to fill the same need. They all remain entrenched candidates for “best practice”. This Handbook presents them fairly, but generally avoids picking winners and losers.
“Noise” appears to be the principal obstacle to better results. This is all the more irritating because many types of noise (e.g. skew, bleed-through, underscore) barely slow down human readers. We have not yet succeeded in characterizing and quantifying signal and noise to the extent that communications science has. Although OCR and DIA are prime examples of information transfer, information-theoretic concepts are seldom invoked. Are we moving in the right direction by accumulating empirical midstream comparisons – often on synthetic data – from contests organized by individual research groups in conjunction with our conferences?
Be that as it may, as one is getting increasingly forgetful it is reassuring to have most of the elusive information about one’s favorite topics at arm’s reach in a fat tome like this one. Much as on-line resources have improved over the past decade, I like to turn down the corner of the page and scribble a note in the margin. Younger folks, who prefer search-directed saccades to an old-fashioned linear presentation, may want the on-line version.
David Doermann and Karl Tombre were exceptionally well qualified to plan, select, solicit, and edit this compendium. Their contributions to DIA cover a broad swath and, as far as I know, they have never let the song of the sirens divert them from the muddy and winding channels of DIA. Their technical contributions are well referenced by the chapter authors and their voice is heard at the beginning of each section.
Dave is the co-founding-editor of IJDAR, which became our flagship journal when PAMI veered towards computer vision and machine learning. Along with the venerable PR and the high-speed, high-volume PRL, IJDAR has served us well with a mixture of special issues, surveys, experimental reports, and new theories. Even earlier, with the encouragement of Azriel Rosenfeld, Dave organized and directed the Language and Media Processing Laboratory, which has become a major resource of DIA data sets, code, bibliographies, and expertise.
Karl, another IJDAR co-founder, put Nancy on the map as one of the premier global centers of DIA research and development. Beginning with a sustained drive to automate the conversion of legacy drawings to CAD formats (drawings for a bridge or a sewer line may have a lifetime of over a hundred years, and the plans for the still-flying Boeing 747 were drawn by hand), Karl brought together and expanded the horizons of University and INRIA researchers to form a critical mass of DIA.
Dave and Karl have also done more than their share to bring our research community together, find common terminology and data, create benchmarks, and advance the state of the art. These big patient men have long been a familiar sight at our conferences, always ready to resolve a conundrum, provide a missing piece of information, fill in for an absentee session chair or speaker, or introduce folks who should know each other.
The DIA community has every reason to be grateful to the editors and authors of this timely and comprehensive collection. Enjoy, and work hard to make a contribution to the next edition!
Part A Introduction, Background, Fundamentals
1 A Brief History of Documents and Writing Systems
2 Document Creation, Image Acquisition and Document Quality
3 The Evolution of Document Image Analysis
4 Imaging Techniques in Document Analysis Processes
Part B Page Analysis
5 Page Segmentation Techniques in Document Analysis
6 Analysis of the Logical Layout of Documents
7 Page Similarity and Classification
Part C Text Recognition
8 Text Segmentation for Document Recognition
9 Language, Script, and Font Recognition
10 Machine-Printed Character Recognition
11 Handprinted Character and Word Recognition
12 Continuous Handwritten Script Recognition
13 Middle Eastern Character Recognition
14 Asian Character Recognition
Volume 2
Part D Processing of Non-textual Information
15 Graphics Recognition Techniques
16 An Overview of Symbol Recognition
17 Analysis and Interpretation of Graphical Documents
18 Logo and Trademark Recognition
19 Recognition of Tables and Forms
20 Processing Mathematical Notation
Part E Applications
21 Document Analysis in Postal Applications and Check Processing
22 Analysis and Recognition of Music Scores
23 Analysis of Documents Born Digital
24 Image Based Retrieval and Keyword Spotting in Documents
25 Text Localization and Recognition in Images and Video
Part F Analysis of Online Data
26 Online Handwriting Recognition
27 Online Signature Verification
28 Sketching Interfaces
Part G Evaluation and Benchmarking
29 Datasets and Annotations for Document Analysis and Recognition
30 Tools and Metrics for Document Analysis Systems Evaluation
Index