Document layout analysis

The bottom-up approaches are the traditional ones, and they have the advantage that they require no assumptions on the overall structure of the document. On the other hand, bottom-up approaches require iterative segmentation and clustering, which can be time-consuming. Top-down approaches have the advantage that they parse the global structure of a document directly, thus eliminating the need to iteratively cluster together the possibly hundreds or even thousands of characters/symbols which appear on a document. They tend to be faster, but in order for them to operate robustly they typically require a number of assumptions to be made about the layout of the document. Examples of top-down approaches include the recursive X-Y cut algorithm, which decomposes the document into rectangular sections.
For each symbol, determine its k nearest neighbors, where k is an integer greater than or equal to four. O'Gorman suggests k=5 in his paper as a good compromise between robustness and speed. The reason to use at least k=4 is that for a symbol in a document, the two or three nearest symbols are the ones right next to it on the same text line.
For each symbol, look at its nearest neighbors and flag any of them whose distance is within some tolerance of the between-character spacing distance or between-word spacing distance. For each nearest neighbor symbol which is flagged, draw a line segment connecting their centroids.
Bottom-up approaches iteratively parse a document based on the raw pixel data. These approaches typically first parse a document into connected regions of black and white, then these regions are grouped into words, then into text lines, and finally into text blocks.
OCRopus – A free document layout analysis and OCR system, implemented in C++ and Python for FreeBSD, Linux, and Mac OS X. This software supports a plug-in architecture which allows the user to select from a variety of different document layout analysis and OCR algorithms.
For each pair of text lines, one can compute a minimum distance between their corresponding line segments. If this distance is within some tolerance of the between-line spacing calculated in step 7, then the two text lines are grouped into the same text block.
Using all the centroids in a text line, one can compute an actual line segment representing the text line with linear regression. This is important because it is unlikely that all the centroids of symbols in a text line are actually collinear.
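As a minimal sketch of this fitting step, the following pure-Python function fits a least-squares line through the centroids of one text line and returns the two endpoints of the resulting segment. The name `fit_text_line` and the `(x, y)` tuple representation are illustrative assumptions rather than part of the published method, and the sketch assumes the text line is not vertical, which should hold once skew has been removed.

```python
def fit_text_line(centroids):
    """Fit y = slope * x + intercept through (x, y) centroids by least squares.

    Returns the two endpoints of the fitted segment, spanning the
    leftmost to the rightmost centroid. Assumes at least two centroids
    with distinct x coordinates (hypothetical helper for illustration).
    """
    n = len(centroids)
    mean_x = sum(x for x, _ in centroids) / n
    mean_y = sum(y for _, y in centroids) / n
    # Sums of squared deviations and cross deviations.
    sxx = sum((x - mean_x) ** 2 for x, _ in centroids)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in centroids)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    xs = [x for x, _ in centroids]
    x_left, x_right = min(xs), max(xs)
    return ((x_left, slope * x_left + intercept),
            (x_right, slope * x_right + intercept))


# Centroids that wobble slightly around a nearly horizontal line.
noisy_line = [(0, 10), (10, 11), (20, 10), (30, 11)]
segment = fit_text_line(noisy_line)
```

The returned segment can then stand in for the text line when distances between lines are measured.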
Each nearest neighbor pair of symbols is related by a vector pointing from one symbol’s centroid to the other symbol’s centroid. If these vectors are plotted for every pair of nearest neighbor symbols, then one gets what is called the docstrum for the document.
The nearest-neighbor distance histogram has several peaks, and these peaks typically represent between-character spacing, between-word spacing, and between-line spacing. Calculate these values from the histogram and set them aside.
Using the nearest-neighbor angle histogram, the skew of the document can be calculated. If the skew is acceptably low, continue to the next step. If it is not, rotate the image so as to remove the skew and return to step 3.
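In computer vision or natural language processing, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order. Detection and labeling of the different zones (or blocks) as text body, illustrations, math symbols, and tables embedded in a document is called geometric layout analysis.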
Since text lines are assumed to be horizontal, it is important to rotate the document image so as to remove any skew that is present.
One can also use the angle Θ from the horizontal and the distance D between two nearest neighbor symbols to create a nearest-neighbor angle histogram and a nearest-neighbor distance histogram.
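Skew refers to the fact that a document image may be rotated in a way so that the text lines are not perfectly horizontal. It is a common assumption in both document layout analysis algorithms and optical character recognition algorithms that the characters in the document image are oriented so that text lines are horizontal.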
Cattoni, R.; Coianiz, T.; Messelodi, S.; Modena, C. M. "Geometric Layout Analysis Techniques for Document Image Understanding: a Review". ITC-irst Technical Report TR#9703-09.
Preprocess the image to remove Gaussian and salt-and-pepper noise. Note that some noise removal filters may consider commas and periods as noise, so some care must be taken.
In this section we will walk through the steps of a bottom-up document layout analysis algorithm developed in 1993 by O'Gorman. The steps in this approach are as follows:
The fourth-nearest symbol is typically on a line right above or below, and it is important to include these symbols in the nearest neighbor calculation for the following.
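The nearest-neighbor search described here can be sketched in a few lines of pure Python. This is a brute-force illustration, assuming symbols are represented only by their `(x, y)` centroids; the function name `k_nearest_neighbors` is hypothetical, and a real system would likely use a spatial index instead of comparing every pair.

```python
import math


def k_nearest_neighbors(centroids, k=5):
    """Return, for each centroid, the indices of its k nearest other centroids.

    Brute-force O(n^2) search; fine for a sketch, slow for large pages.
    """
    neighbors = []
    for i, (xi, yi) in enumerate(centroids):
        # Euclidean distance from symbol i to every other symbol.
        distances = [
            (math.hypot(xj - xi, yj - yi), j)
            for j, (xj, yj) in enumerate(centroids)
            if j != i
        ]
        distances.sort()
        neighbors.append([j for _, j in distances[:k]])
    return neighbors


# Two short "text lines": three symbols on top, two on a line below.
centroids = [(0, 0), (10, 0), (20, 0), (0, 30), (10, 30)]
nn = k_nearest_neighbors(centroids, k=2)
```

With k=2 every neighbor found for the top-left symbol lies on its own text line, which illustrates why a larger k is needed before neighbors on adjacent lines show up.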
It follows that the first steps in any document layout analysis code are to remove image noise and to come up with an estimate for the skew angle of the document.
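One possible way to obtain such a skew estimate, in the spirit of the nearest-neighbor angle histogram used by the bottom-up method in this article, is sketched below. The function name `estimate_skew`, the one-degree bin width, and the input format (pairs of neighboring centroids) are assumptions made for illustration only.

```python
import math
from collections import Counter


def estimate_skew(pairs, bin_width=1.0):
    """Estimate document skew in degrees from nearest-neighbor centroid pairs.

    Each pair is ((xa, ya), (xb, yb)). The angle of every connecting vector
    is folded into [-90, 90) so that its direction does not matter, binned,
    and the most populated bin is returned as the skew estimate.
    """
    histogram = Counter()
    for (xa, ya), (xb, yb) in pairs:
        angle = math.degrees(math.atan2(yb - ya, xb - xa))
        angle = (angle + 90.0) % 180.0 - 90.0
        histogram[round(angle / bin_width)] += 1
    peak_bin, _ = histogram.most_common(1)[0]
    return peak_bin * bin_width


# Three same-line neighbor pairs tilted by about 5 degrees,
# plus one pair pointing to a neighbor on another line.
pairs = [
    ((0, 0), (100, 8.75)),
    ((0, 40), (100, 48.75)),
    ((100, 88.75), (0, 80)),
    ((0, 0), (0, 50)),
]
skew = estimate_skew(pairs)
```

The peak of the histogram dominates because most nearest neighbors lie on the same text line, so the occasional neighbor on a different line has little influence on the estimate.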
OCRFeeder – An OCR suite for Linux, written in Python, which also supports document layout analysis. This software is actively being developed, and is free and open-source.
But text zones play different logical roles inside the document (titles, captions, footnotes, etc.), and this kind of semantic labeling is the scope of logical layout analysis.

Document layout analysis is the union of geometric and logical labeling. It is typically performed before a document image is sent to an OCR engine, but it can also be used to detect duplicate copies of the same document in large archives, or to index documents by their structure or pictorial content.
Top-down approaches attempt to iteratively cut up a document into columns and blocks based on white space and geometric information.
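The recursive X-Y cut mentioned in this article is a well-known example of this idea. The sketch below is a simplified version that works on a small binary grid, where 1 means ink and 0 means background; the function name `xy_cut`, the `min_gap` threshold, and the rule of always cutting at the widest gap are illustrative assumptions, not the exact published algorithm.

```python
def xy_cut(image, top, bottom, left, right, min_gap=2):
    """Recursively split the region [top, bottom) x [left, right) at white gaps.

    Returns a list of (top, bottom, left, right) rectangles, one per block.
    """
    # Rows and columns of the region that contain at least one ink pixel.
    rows = [r for r in range(top, bottom) if any(image[r][left:right])]
    cols = [c for c in range(left, right)
            if any(image[r][c] for r in range(top, bottom))]
    if not rows or not cols:
        return []  # Region is entirely white.
    # Shrink the region to the tight bounding box of its ink.
    top, bottom, left, right = rows[0], rows[-1] + 1, cols[0], cols[-1] + 1

    def widest_gap(occupied):
        """Return (width, start, end) of the widest run of empty indices."""
        best = (0, None, None)
        for prev, nxt in zip(occupied, occupied[1:]):
            if nxt - prev - 1 > best[0]:
                best = (nxt - prev - 1, prev + 1, nxt)
        return best

    row_gap = widest_gap(rows)
    col_gap = widest_gap(cols)
    if max(row_gap[0], col_gap[0]) < min_gap:
        return [(top, bottom, left, right)]  # No gap wide enough: leaf block.
    if row_gap[0] >= col_gap[0]:
        _, start, end = row_gap  # Horizontal cut through the white rows.
        return (xy_cut(image, top, start, left, right, min_gap)
                + xy_cut(image, end, bottom, left, right, min_gap))
    _, start, end = col_gap  # Vertical cut through the white columns.
    return (xy_cut(image, top, bottom, left, start, min_gap)
            + xy_cut(image, top, bottom, end, right, min_gap))


# Two small columns above a full-width strip.
page = [
    [1, 1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],
]
blocks = xy_cut(page, 0, len(page), 0, len(page[0]))
```

The reliance on clean white gaps is the kind of layout assumption that top-down methods need in order to work robustly.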
There are two issues common to any approach at document layout analysis: noise and skew. Noise refers to image noise, such as salt-and-pepper noise or Gaussian noise.
Geometric Layout Analysis Techniques for Document Image Understanding: a Review, ITC-irst Technical Report TR#9703-09
Finally, one can calculate a bounding box for each text block, and the document layout analysis is complete.
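This last step amounts to taking the union of the bounding boxes of all symbols in a block. A minimal sketch, assuming each symbol box is an `(x_min, y_min, x_max, y_max)` tuple and using the hypothetical helper name `block_bounding_box`:

```python
def block_bounding_box(symbol_boxes):
    """Return the smallest box enclosing every (x_min, y_min, x_max, y_max) box."""
    return (
        min(box[0] for box in symbol_boxes),
        min(box[1] for box in symbol_boxes),
        max(box[2] for box in symbol_boxes),
        max(box[3] for box in symbol_boxes),
    )


# Bounding boxes of three symbols that were grouped into one text block.
symbol_boxes = [(0, 0, 5, 8), (7, 1, 12, 9), (2, 12, 10, 20)]
block_box = block_bounding_box(symbol_boxes)
```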
Simon, A.; Pret, J.-C.; Johnson, A.P. (1997). "A fast algorithm for bottom-up document layout analysis". IEEE Transactions on Pattern Analysis and Machine Intelligence. 19 (3): 273–277. doi:10.1109/34.584106.
Proceedings of the Third International Conference on Document Analysis and Recognition (ICDAR '95)
Seong-Whan Lee; Dae-Seok Ryu (2001). "Parameter-free geometric document layout analysis". IEEE Transactions on Pattern Analysis and Machine Intelligence. 23 (11): 1240–1256. doi:10.1109/34.969115.
High Performance Document Layout Analysis by Thomas M. Breuel, at PARC, Palo Alto, CA, USA
There are two main approaches to document layout analysis: bottom-up and top-down.
Convert the image into a binary image, i.e. convert each pixel value to completely white or completely black.
Segment the image into connected components of black pixels. These are the symbols of the image.
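O'Gorman, L. (1993). "The document spectrum for page layout analysis". IEEE Transactions on Pattern Analysis and Machine Intelligence. 15 (11): 1162–1173. doi:10.1109/34.244677.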
Dengel, Andreas; Barth, Gerhard (1989). "ANASTASIL: hybrid knowledge-based system for document layout analysis". Ijcai'89: 1249–1254.
For each symbol, compute a bounding box and centroid.
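The segmentation into symbols, together with the bounding box and centroid of each one, can be sketched with a breadth-first flood fill. The sketch assumes a binary image stored as a list of rows in which 1 marks a black pixel, and it treats diagonal neighbors as connected (8-connectivity); both choices, like the name `find_symbols`, are assumptions made for illustration.

```python
from collections import deque


def find_symbols(image):
    """Label 8-connected components of 1-pixels and describe each one.

    Returns a list of dicts with 'bbox' as (x_min, y_min, x_max, y_max)
    and 'centroid' as (x, y), in the order the components are first met
    when scanning row by row.
    """
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    symbols = []
    for r in range(height):
        for c in range(width):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Flood-fill one component starting from this unvisited pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            symbols.append({
                "bbox": (min(xs), min(ys), max(xs), max(ys)),
                "centroid": (sum(xs) / len(xs), sum(ys) / len(ys)),
            })
    return symbols


# A 2x2 blob on the left and a 1x2 vertical stroke on the right.
image = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
symbols = find_symbols(image)
```

The centroids produced here are the points that the later nearest-neighbor and line-fitting steps operate on.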
646: 129:
Document layout is formally defined in the international standard
641: 572:"Recursive X-Y Cut using Bounding Boxes of Connected Components" 592: 427:
Baird, K.S. (July 1992). "Anatomy of a versatile page reader".
IEEE Transactions on Pattern Analysis and Machine Intelligence
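Ha, Jaekyu; Haralick, Robert M.; Phillips, Ihsin T. (1995). "Recursive X-Y Cut using Bounding Boxes of Connected Components". Proceedings of the Third International Conference on Document Analysis and Recognition (ICDAR '95).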
Symbols connected to their neighbors by line segments form text lines.
Comparison of optical character recognition software


Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.