Knowledge (XXG)

Document classification


Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science. The problems are overlapping, however, and there is therefore interdisciplinary research on document classification.

The documents to be classified may be texts, images, music, etc. Each kind of document possesses its special classification problems. When not otherwise specified, text classification is implied.

Documents may be classified according to their subjects or according to other attributes (such as document type, author, printing year etc.). In the rest of this article only subject classification is considered. There are two main philosophies of subject classification of documents: the content-based approach and the request-based approach.

"Content-based" versus "request-based" classification

Content-based classification is classification in which the weight given to particular subjects in a document determines the class to which the document is assigned. It is, for example, a common rule for classification in libraries that at least 20% of the content of a book should be about the class to which the book is assigned. In automatic classification it could be the number of times given words appear in a document.

Request-oriented classification (or -indexing) is classification in which the anticipated request from users is influencing how documents are being classified. The classifier asks themself: "Under which descriptors should this entity be found?" and "think of all the possible queries and decide for which ones the entity at hand is relevant" (Soergel, 1985, p. 230).

Request-oriented classification may be classification that is targeted towards a particular audience or user group. For example, a library or a database for feminist studies may classify/index documents differently when compared to a historical library. It is probably better, however, to understand request-oriented classification as policy-based classification: The classification is done according to some ideals and reflects the purpose of the library or database doing the classification. In this way it is not necessarily a kind of classification or indexing based on user studies. Only if empirical data about use or users are applied should request-oriented classification be regarded as a user-based approach.

Classification versus indexing

Sometimes a distinction is made between assigning documents to classes ("classification") versus assigning subjects to documents ("subject indexing") but as Frederick Wilfrid Lancaster has argued, this distinction is not fruitful. "These terminological distinctions," he writes, "are quite meaningless and only serve to cause confusion" (Lancaster, 2003, p. 21). The view that this distinction is purely superficial is also supported by the fact that a classification system may be transformed into a thesaurus and vice versa (cf. Aitchison, 1986, 2004; Broughton, 2008; Riesthuis & Bliedung, 1991). Therefore, the act of labeling a document (say by assigning a term from a controlled vocabulary to a document) is at the same time to assign that document to the class of documents indexed by that term (all documents indexed or classified as X belong to the same class of documents). In other words, labeling a document is the same as assigning it to the class of documents indexed under that label.

Automatic document classification (ADC)

Automatic document classification tasks can be divided into three sorts: supervised document classification, where some external mechanism (such as human feedback) provides information on the correct classification for documents; unsupervised document classification (also known as document clustering), where the classification must be done entirely without reference to external information; and semi-supervised document classification, where parts of the documents are labeled by the external mechanism. There are several software products under various license models available.

Techniques

Automatic document classification techniques include:

Artificial neural network
Concept Mining
Decision trees such as ID3 or C4.5
Expectation maximization (EM)
Instantaneously trained neural networks
Latent semantic indexing
Multiple-instance learning
Naive Bayes classifier
Natural language processing approaches
Rough set-based classifier
Soft set-based classifier
Support vector machines (SVM)
K-nearest neighbour algorithms
tf–idf

Applications

Classification techniques have been applied to:

spam filtering, a process which tries to discern e-mail spam messages from legitimate emails
email routing, forwarding an email sent to a general address to a specific address or mailbox depending on topic
language identification, automatically determining the language of a text
genre classification, automatically determining the genre of a text
readability assessment, automatically determining the degree of readability of a text, either to find suitable materials for different age groups or reader types or as part of a larger text simplification system
sentiment analysis, determining the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document
health-related classification using social media in public health surveillance
article triage, selecting articles that are relevant for manual literature curation, for example as is being done as the first step to generate manually curated annotation databases in biology

See also

Categorization
Classification (disambiguation)
Compound term processing
Concept-based image indexing
Content-based image retrieval
Decimal section numbering
Document
Document retrieval
Document clustering
Information retrieval
Knowledge organization
Knowledge Organization System
Library classification
Machine learning
Native Language Identification
String metrics
Subject (documents)
Subject indexing
Supervised learning, unsupervised learning
Text mining, web mining, concept mining

References

Library of Congress (2008). The subject headings manual. Washington, DC.: Library of Congress, Policy and Standards Division. (Sheet H 180: "Assign headings only for topics that comprise at least 20% of the work.")
Soergel, Dagobert (1985). Organizing information: Principles of data base and retrieval systems. Orlando, FL: Academic Press.
Lancaster, F. W. (2003). Indexing and abstracting in theory and practice. Library Association, London.
Aitchison, J. (1986). "A classification as a source for thesaurus: The Bibliographic Classification of H. E. Bliss as a source of thesaurus terms and structure." Journal of Documentation, Vol. 42 No. 3, pp. 160-181.
Aitchison, J. (2004). "Thesauri from BC2: Problems and possibilities revealed in an experimental thesaurus derived from the Bliss Music schedule." Bliss Classification Bulletin, Vol. 46, pp. 20-26.
Broughton, V. (2008). "A faceted classification as the basis of a faceted terminology: Conversion of a classified structure to thesaurus format in the Bliss Bibliographic Classification (2nd Ed.)." Axiomathes, Vol. 18 No. 2, pp. 193-210.
Riesthuis, G. J. A., & Bliedung, St. (1991). "Thesaurification of the UDC." Tools for knowledge organization and the human interface, Vol. 2, pp. 109-117. Index Verlag, Frankfurt.
Rossi, R. G., Lopes, A. d. A., and Rezende, S. O. (2016). Optimization and label propagation in bipartite heterogeneous networks to improve transductive classification of texts. Information Processing & Management, 52(2):217–257.
"An Interactive Automatic Document Classification Prototype" (PDF). Archived from the original (PDF) on 2017-11-15. Retrieved 2017-11-14.
Interactive Automatic Document Classification Prototype. Archived April 24, 2015, at the Wayback Machine.
Document Classification - Artsyl
ABBYY FineReader Engine 11 for Windows
Classifier - Antidot
"3 Document Classification Methods for Tough Projects". www.bisok.com. Retrieved 2021-08-04.
Stephan Busemann, Sven Schmeier and Roman G. Arens (2000). Message classification in the call center. In Sergei Nirenburg, Douglas Appelt, Fabio Ciravegna and Robert Dale, eds., Proc. 6th Applied Natural Language Processing Conf. (ANLP'00), pp. 158–165, ACL.
Santini, Marina; Rosso, Mark (2008), Testing a Genre-Enabled Application: A Preliminary Assessment (PDF), BCS IRSG Symposium: Future Directions in Information Access, London, UK, pp. 54–63, archived from the original (PDF) on 2019-11-15, retrieved 2011-10-21.
X. Dai, M. Bikdash and B. Meyer, "From social media to public health surveillance: Word embedding based clustering method for twitter classification," SoutheastCon 2017, Charlotte, NC, 2017, pp. 1-7. doi:10.1109/SECON.2017.7925400
Krallinger, M; Leitner, F; Rodriguez-Penagos, C; Valencia, A (2008). "Overview of the protein-protein interaction annotation extraction task of BioCreative II". Genome Biology. 9 (Suppl 2): S4. doi:10.1186/gb-2008-9-s2-s4. PMC 2559988. PMID 18834495.

Further reading

Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.
Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, 2010.

External links

Introduction to document classification
Bibliography on Automated Text Categorization
Bibliography on Query Classification
Text Classification analysis page
Learning to Classify Text - Chap. 6 of the book Natural Language Processing with Python (available online)
TechTC - Technion Repository of Text Categorization Datasets
David D. Lewis's Datasets
BioCreative III ACT (article classification task) dataset

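
Worked example: supervised classification with a naive Bayes classifier

Supervised document classification, as described in the article, relies on an external mechanism that supplies the correct class for some documents. The sketch below illustrates this with a small multinomial naive Bayes classifier (one of the listed techniques) applied to a spam-filtering toy problem. The class name NaiveBayesTextClassifier, the training sentences, and the smoothing parameter alpha are assumptions chosen for the example; it uses only the Python standard library and is not meant as a production implementation.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def fit(self, documents, labels):
        """Count words per label from the human-labeled training documents."""
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        """Return the label with the highest log posterior score."""
        # Words never seen in training carry no evidence and are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            score = math.log(self.doc_counts[label] / total_docs)  # log prior
            denom = sum(self.word_counts[label].values()) + self.alpha * vocab_size
            for token in tokens:
                score += math.log((self.word_counts[label][token] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Tiny invented training set standing in for human feedback.
    train_docs = [
        "win a free prize now",
        "free money click now",
        "meeting agenda for monday",
        "please review the project report",
    ]
    train_labels = ["spam", "spam", "ham", "ham"]
    model = NaiveBayesTextClassifier().fit(train_docs, train_labels)
    print(model.predict("claim your free prize"))
    print(model.predict("project meeting on monday"))
```

Working in log space avoids numerical underflow when many word probabilities are multiplied, and the smoothing term keeps a word that was seen under only one label from forcing a zero probability for the other.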

Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.
