Information extraction - Knowledge (XXG)

(NLP), which has tackled the problem of modelling human language processing with considerable success, given the magnitude of the task. In terms of both difficulty and emphasis, IE deals with tasks that lie between IR and NLP. In terms of input, IE assumes the existence of a set of documents in which each document follows a template, i.e. describes one or more entities or events in a manner that is similar to those in other documents but differs in the details. As an example, consider a group of newswire articles on Latin American terrorism, with each article presumed to be based upon one or more terrorist acts. We also define for any given IE task a template, which is a case frame (or a set of case frames) to hold the information contained in a single document. For the terrorism example, a template would have slots corresponding to the perpetrator, victim, and weapon of the terrorist act, and the date on which the event happened. An IE system for this problem is required to "understand" an attack article only enough to find data corresponding to the slots in this template.

: recognition of known entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions, by employing existing knowledge of the domain or information extracted from other sentences. Typically the recognition task involves assigning a unique identifier to the extracted entity. A simpler task is

links between text entities. In IE tasks, this is typically restricted to finding links between previously extracted named entities. For example, "International Business Machines" and "IBM" refer to the same real-world entity. If we take the two sentences "M. Smith likes fishing. But he doesn't like

IE on non-text documents is becoming an increasingly interesting topic in research, and information extracted from multimedia documents can now be expressed in a high level structure as it is done on text. This naturally leads to the fusion of extracted information from multiple kinds of documents and sources.

A recent development is Visual Information Extraction, which relies on rendering a webpage in a browser and creating rules based on the proximity of regions in the rendered web page. This helps in extracting entities from complex web pages that may exhibit a visual pattern but lack a discernible pattern in the HTML source code.

that are available online. Systems that perform IE from online text should meet the requirements of low cost, flexibility in development and easy adaptation to new domains. MUC systems fail to meet those criteria. Moreover, linguistic analysis performed for unstructured text does not exploit the

Note that this list is not exhaustive and that the exact meaning of IE activities is not commonly accepted and that many approaches combine multiple sub-tasks of IE in order to achieve a wider goal. Machine learning, statistical analysis and/or natural language processing are often used in IE.

Table information extraction: extracting information in a structured manner from tables. This task is more complex than table extraction, as table extraction is only the first step, while understanding the roles of the cells, rows, columns, linking the information inside the table and

motivates the development of IE systems that can handle different types of text, from well-structured to almost free text (where common wrappers fail), including mixed types. Such systems can exploit shallow natural language knowledge and thus can also be applied to less structured texts.

Template-based music extraction: finding relevant characteristics in an audio signal taken from a given repertoire; for instance, time indexes of occurrences of percussive sounds can be extracted in order to represent the essential rhythmic component of a music piece.

Recent advances in NLP techniques have allowed for significantly improved performance compared to previous years. An example is the extraction from newswire reports of corporate mergers, such as denoted by the formal relation:

typically handle highly structured collections of web pages, such as product catalogs and telephone directories. They fail, however, when the text type is less structured, which is also common on the Web. Recent effort on

tags. An intelligent agent monitoring a news data feed requires IE to transform unstructured data into something that can be reasoned with. A typical application of IE is to scan a set of documents written in a

in order to create a structured view of the information present in free text. The overall goal is to create text that is more easily machine-readable, so that the sentences can be processed. Typical IE tasks and subtasks include:

Information extraction is part of a larger puzzle that deals with the problem of devising automatic methods for text management, beyond its transmission, storage and display. The discipline of

, which are sets of highly accurate rules that extract a particular page's content. Manually developing wrappers has proved to be a time-consuming task, requiring a high level of expertise.

(IR) has developed automatic methods, typically of a statistical flavor, for indexing large document collections and classifying documents. Another complementary approach is that of

Semi-structured information extraction, which may refer to any IE that tries to restore some kind of information structure that has been lost through publication, such as:

, which aims at detecting entities without having any existing knowledge about the entity instances. For example, in processing the sentence "M. Smith likes fishing",

Template filling: Extracting a fixed set of fields from a document, e.g. extract perpetrators, victims, time, etc. from a newspaper article about a terrorist attack.
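Template filling as described above can be sketched with a small table of slot patterns. The example below is only an illustration under stated assumptions: the `AttackTemplate` case frame, the `SLOT_PATTERNS` table, and the sample report are all hypothetical, and the hand-written patterns fit little beyond the invented sentence. A real system would need far more robust extraction.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttackTemplate:
    """Hypothetical case frame for one attack report (names are assumptions)."""
    perpetrator: Optional[str] = None
    victim: Optional[str] = None
    weapon: Optional[str] = None
    date: Optional[str] = None


# Toy slot patterns, assumed for illustration; each captures one slot value.
SLOT_PATTERNS = {
    "perpetrator": re.compile(r"[Mm]embers of (?P<value>[A-Z][\w ]+?) attacked"),
    "victim": re.compile(r"attacked (?P<value>[\w ]+?) with"),
    "weapon": re.compile(r"with (?P<value>[\w ]+?) on"),
    "date": re.compile(r"on (?P<value>\d{1,2} \w+ \d{4})"),
}


def fill_template(text: str) -> AttackTemplate:
    """Fill each slot whose pattern matches; unmatched slots stay None."""
    template = AttackTemplate()
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(text)
        if match:
            setattr(template, slot, match.group("value"))
    return template


# Invented sample sentence, not taken from any real report.
report = "Members of Group X attacked a police station with explosives on 12 March 1990."
filled = fill_template(report)
```

Slots that cannot be found remain `None`, which mirrors the idea that the system only needs to understand an article well enough to find data for the slots.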

tags and the layout formats that are available in online texts. As a result, less linguistically intensive approaches have been developed for IE on the Web using

Event extraction: Given an input document, output zero or more event templates. For instance, a newspaper article might describe multiple terrorist attacks.

(CRF) are commonly used in conjunction with IE for tasks as varied as extracting information from research papers to extracting navigation instructions.

Comment extraction: extracting comments from the actual content of articles in order to restore the link between each sentence and its author.

Knowledge Base Population: Fill a database of facts given a set of documents. Typically the database is in the form of triplets, (entity 1, relation, entity 2), e.g. (Barack Obama, Spouse, Michelle Obama)

document processing like automatic annotation and content extraction out of images/audio/video/documents could be seen as information extraction.

Information extraction dates back to the late 1970s in the early days of NLP. An early commercial system from the mid-1980s was JASPER, built for Reuters

of the input data. Structured data is semantically well-defined data from a chosen target domain, interpreted with respect to category and

is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language

), who wished to automate mundane tasks performed by government analysts, such as scanning newspapers for possible links to terrorism.

Numerous other approaches exist for IE including hybrid approaches that combine some of the standard approaches previously listed.

. Knowledge contained within these documents can be made more accessible for machine processing by means of transformation into

A broad goal of IE is to allow computation to be done on the previously unstructured data. A more specific goal is to allow

documents and other electronically represented sources. Typically, this involves processing human language texts by means of

that the phrase "M. Smith" does refer to a person, but without necessarily having (or using) any knowledge about a certain

understanding the information presented in the table are additional tasks necessary for table information extraction.

is an open source tool in Java/Scala (and free web service) that can be used for named entity recognition and

biking", it would be beneficial to detect that "he" is referring to the previously detected person "M. Smith".
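The "M. Smith" example can be illustrated with a deliberately naive resolver that links each pronoun to the most recently mentioned known person. This is a sketch under strong assumptions, not a real coreference algorithm: the function name `resolve_pronouns`, the two-pronoun list, and the "most recent antecedent" rule are all hypothetical choices made for illustration.

```python
import re

# Pronouns the toy resolver handles; an assumed, deliberately tiny list.
PRONOUN = r"\b(?P<pronoun>[Hh]e|[Ss]he)\b"


def resolve_pronouns(text, persons):
    """Link each pronoun to the most recently mentioned known person.

    `persons` holds previously extracted person names. Pronouns that
    appear before any person mention are left unresolved.
    """
    if not persons:
        return []
    names = "|".join(re.escape(name) for name in persons)
    scanner = re.compile(rf"(?P<name>{names})|{PRONOUN}")
    links, last_person = [], None
    for match in scanner.finditer(text):
        if match.group("name"):
            last_person = match.group("name")
        elif last_person is not None:
            links.append((match.group("pronoun"), last_person))
    return links


links = resolve_pronouns(
    "M. Smith likes fishing. But he doesn't like biking.", ["M. Smith"]
)
```

The heuristic ignores gender, number, and syntax, so it is only meant to show how links between previously extracted named entities and later mentions can be represented.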

is a Java-based package for a variety of natural language processing tasks, including information extraction.

The present significance of IE pertains to the growing amount of information available in unstructured form.

A listing of academic toolkits and industrial toolkits for natural language information extraction.

Considerable support came from the U.S. Defense Advanced Research Projects Agency (

PERSON works for ORGANIZATION (extracted from the sentence "Bill works for IBM.")

who is (or, "might be") the specific person whom that sentence is talking about.

MUC is a competition-based conference that focused on the following domains:

PERSON located in LOCATION (extracted from the sentence "Bill is in France.")
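The two relation types above can be turned into (entity 1, relation, entity 2) triples with one pattern per relation. The sketch below assumes hypothetical relation labels (`works_for`, `located_in`) and treats any capitalised word as an entity, which is a simplification rather than real named entity recognition.

```python
import re

# One toy pattern per relation type; labels and patterns are assumptions.
RELATION_PATTERNS = [
    ("works_for", re.compile(r"(?P<subj>[A-Z]\w+) works for (?P<obj>[A-Z]\w+)")),
    ("located_in", re.compile(r"(?P<subj>[A-Z]\w+) is in (?P<obj>[A-Z]\w+)")),
]


def extract_relations(text):
    """Return (subject, relation, object) triples found in the text."""
    triples = []
    for relation, pattern in RELATION_PATTERNS:
        for match in pattern.finditer(text):
            triples.append((match.group("subj"), relation, match.group("obj")))
    return triples


triples = extract_relations("Bill works for IBM. Bill is in France.")
```

Triples in this shape are the kind of record that could then be used to populate a database of facts.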

"Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."

Hand-written regular expressions (or nested group of regular expressions)
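As an illustration of the hand-written, nested regular expression approach, the sketch below extracts a MergerBetween(company1, company2, date) record from the article's example sentence about Foo Inc. and Bar Corp. The `COMPANY` sub-pattern, its suffix list, and the function name `extract_merger` are assumptions made for this example; the date is kept as the raw expression ("Yesterday") rather than resolved to a calendar date.

```python
import re
from typing import NamedTuple, Optional


class MergerBetween(NamedTuple):
    company1: str
    company2: str
    date: str  # raw temporal expression, left unresolved in this sketch


# Inner pattern: capitalised words followed by an assumed company suffix.
COMPANY = r"[A-Z]\w*(?: [A-Z]\w*)* (?:Inc|Corp|Ltd)\."

# Outer pattern nests COMPANY twice, once for each party.
MERGER = re.compile(
    rf"(?P<date>\w+), (?:[\w ]+ based )?(?P<c1>{COMPANY}) "
    rf"announced (?:their|its) acquisition of (?P<c2>{COMPANY})"
)


def extract_merger(sentence: str) -> Optional[MergerBetween]:
    """Return a MergerBetween record, or None if the pattern does not match."""
    match = MERGER.search(sentence)
    if match is None:
        return None
    return MergerBetween(match.group("c1"), match.group("c2"), match.group("date"))


merger = extract_merger(
    "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."
)
```

Patterns like this are precise on the phrasing they were written for and miss everything else, which is one reason statistical classifiers and sequence models are also listed among the standard approaches.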

IE has been the focus of the MUC conferences. The proliferation of the

244: 1085: 1083: 191:{\displaystyle \mathrm {MergerBetween} (company_{1},company_{2},date)} 37:) is the task of automatically extracting structured information from 1529: 1524: 1174:. Lecture Notes in Computer Science. Vol. 21. pp. 162–174. 342:

Applying information extraction to text is linked to the problem of

is a Java machine learning toolkit for natural language processing

MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries.

Table extraction: finding and extracting tables from documents.
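For documents that happen to be HTML, the table extraction step can be sketched with Python's standard `html.parser` module. This is a minimal illustration, assuming well-formed, non-nested tables; the class name `TableExtractor` and the helper `extract_tables` are hypothetical, and tables in PDFs or scanned documents would need quite different techniques.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text chunks of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Return all tables found in an HTML string (nested tables unsupported)."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


page = (
    "<p>Staff</p><table><tr><th>Name</th><th>Role</th></tr>"
    "<tr><td>Bill</td><td>Engineer</td></tr></table>"
)
tables = extract_tables(page)
```

The output is only the raw grid of cells. Working out which cells are headers and how the cells relate to one another is the separate, harder task of table information extraction.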

and advocates that more of the content be made available as a

(GATE) is bundled with a free Information Extraction system

The following standard approaches are now widely accepted:

: identification of relations between entities, such as:

Detailed description of the information extraction task.

is an automated information extraction web service from

and populate a database with the information extracted.

MUC-1 (1987), MUC-2 (1989): Naval operations messages.

, have been used to induce such rules automatically.

by the Carnegie Group Inc with the aim of providing real-time financial news to financial traders.

History

Beginning in 1987, IE was spurred by a series of Message Understanding Conferences.

MUC-5 (1993): Joint ventures and microelectronics domain.

MUC-6 (1995): News articles on management changes.

MUC-7 (1998): Satellite launch reports.

Tasks and subtasks

Language and vocabulary analysis

Terminology extraction: finding the relevant terms for a given corpus

Approaches

Using classifiers

Generative: naïve Bayes classifier

Discriminative: maximum entropy models such as multinomial logistic regression

Sequence models

Recurrent neural network

Hidden Markov model

Conditional Markov model (CMM) / Maximum-entropy Markov model (MEMM)

Conditional random fields

Free or open source software and services

General Architecture for Text Engineering

Apache OpenNLP

OpenCalais, Thomson Reuters (Free limited version)

Machine Learning for Language Toolkit (Mallet)

DBpedia Spotlight

Natural Language Toolkit

See also CRF implementations

See also

Extraction: Data extraction, Keyword extraction, Knowledge extraction, Ontology extraction, Open information extraction, Table extraction, Terminology extraction

Mining, crawling, scraping, and recognition: Apache Nutch, web crawler, Concept mining, Named entity recognition, Textmining, Web scraping

Search and translation: Enterprise search, Faceted search, Semantic translation

General: Applications of artificial intelligence, DARPA TIPSTER Program

Lists: List of emerging technologies, Outline of artificial intelligence
