Document clustering

Document clustering (or text clustering) is the application of cluster analysis to textual documents. It has applications in automatic document organization, topic extraction and fast information retrieval or filtering.

Overview

Document clustering involves the use of descriptors and descriptor extraction. Descriptors are sets of words that describe the contents within the cluster. Document clustering is generally considered to be a centralized process. Examples of document clustering include web document clustering for search users.

The applications of document clustering can be categorized into two types, online and offline. Online applications are usually constrained by efficiency problems when compared to offline applications. Text clustering may be used for different tasks, such as grouping similar documents (news, tweets, etc.) and analysing customer or employee feedback to discover meaningful implicit subjects across all documents.

In general, two classes of algorithms are common. The first is hierarchical clustering, which includes single linkage, complete linkage, group average and Ward's method. By successively merging or splitting groups of documents, these methods produce a hierarchical structure that is well suited to browsing; however, they usually suffer from efficiency problems. The second class is built around the K-means algorithm and its variants. Generally, hierarchical algorithms produce more in-depth information for detailed analyses, while algorithms based on variants of K-means are more efficient and provide sufficient information for most purposes.
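As an illustrative sketch of the hierarchical family (scikit-learn is an assumed implementation choice; the article itself does not prescribe one, and the toy corpus is hypothetical), agglomerative clustering supports the linkage methods named above:

```python
# Agglomerative (hierarchical) clustering of tf-idf document vectors.
# scikit-learn is an assumed library choice; `docs` is a toy corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell on market news",
    "the market rallied after the news",
]

# Dense tf-idf matrix; AgglomerativeClustering does not accept sparse input.
X = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()

# linkage can be "single", "complete", "average" or "ward", mirroring the
# methods named above; documents are merged bottom-up into a hierarchy.
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
print(labels)
```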
These algorithms can further be classified as hard or soft clustering algorithms. Hard clustering computes a hard assignment: each document is a member of exactly one cluster. The assignment of a soft clustering algorithm is soft: a document's assignment is a distribution over all clusters, so a document has fractional membership in several clusters. Dimensionality reduction methods can be considered a subtype of soft clustering; for documents, these include latent semantic indexing (truncated singular value decomposition on term histograms) and topic models.
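To make the soft-assignment idea concrete, here is a minimal sketch of latent semantic indexing (the scikit-learn API and toy corpus are assumed choices): a truncated SVD of the term histogram matrix yields a low-dimensional representation in which each document carries graded weight on every latent dimension rather than a single cluster label.

```python
# Latent semantic indexing sketch: truncated SVD over tf-idf term histograms.
# scikit-learn assumed; `docs` is a hypothetical toy corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell on market news",
]

tfidf = TfidfVectorizer().fit_transform(docs)   # sparse term histogram matrix
lsi = TruncatedSVD(n_components=2, random_state=0)
weights = lsi.fit_transform(tfidf)

# Each row gives the document's graded weights over the latent dimensions,
# i.e. a soft assignment rather than a single hard cluster id.
print(weights)
```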
Other algorithms involve graph-based clustering, ontology-supported clustering and order-sensitive clustering.

Given a clustering, it can be beneficial to automatically derive human-readable labels for the clusters. Various methods exist for this purpose.
Clustering in search engines

A web search engine often returns thousands of pages in response to a broad query, making it difficult for users to browse or to identify relevant information. Clustering methods can be used to automatically group the retrieved documents into a list of meaningful categories.
Procedures

In practice, document clustering often takes the following steps:

1. Tokenization

Tokenization is the process of parsing text data into smaller units (tokens) such as words and phrases. Commonly used tokenization methods include the bag-of-words model and the N-gram model.
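As a rough illustration (plain regular expressions are an assumed stand-in for a real tokenizer), word tokens and word-level bigrams can be produced as follows:

```python
# Minimal tokenization sketch: word tokens and word-level bigrams.
# Uses only the Python standard library; `text` is a hypothetical input.
import re

text = "Document clustering groups similar documents."

# Word tokens; the bag-of-words model treats these as an unordered multiset.
words = re.findall(r"[a-z0-9']+", text.lower())

# Word-level bigrams, i.e. an N-gram model with n = 2.
bigrams = list(zip(words, words[1:]))

print(words)    # ['document', 'clustering', 'groups', 'similar', 'documents']
print(bigrams)  # [('document', 'clustering'), ('clustering', 'groups'), ...]
```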
2. Stemming and lemmatization

Different tokens can carry similar information (e.g. "tokenization" and "tokenizing"). We can avoid computing this similar information repeatedly by reducing all tokens to their base forms using stemming and lemmatization dictionaries.
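The following toy suffix-stripper illustrates the idea only; it is not a real stemmer, and practical systems rely on dictionary-backed tools such as the Porter stemmer or WordNet-based lemmatizers:

```python
# Toy suffix-stripping "stemmer" for illustration only; real pipelines use
# dictionary-backed stemmers or lemmatizers (e.g. Porter, WordNet).
def crude_stem(token: str) -> str:
    for suffix in ("ization", "izing", "ized", "izes", "ize", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = ["tokenization", "tokenizing", "tokenized", "cluster", "clusters"]
print([crude_stem(t) for t in tokens])
# ['token', 'token', 'token', 'cluster', 'cluster']
```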
3. Removing stop words and punctuation

Some tokens are less important than others. For instance, common words such as "the" might not be very helpful for revealing the essential characteristics of a text, so it is usually a good idea to eliminate stop words and punctuation marks before doing further analysis.
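A minimal sketch of this filtering step (the stop-word list here is a tiny assumed sample; real systems use lists of hundreds of words):

```python
# Remove stop words and punctuation from a token stream.
# The stop-word list is a small illustrative sample, not a complete one.
import string

STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

tokens = ["the", "cat", "sat", "on", "the", "mat", ".", "and", "slept"]
filtered = [
    t for t in tokens
    if t not in STOP_WORDS and t not in string.punctuation
]
print(filtered)  # ['cat', 'sat', 'on', 'mat', 'slept']
```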
4. Computing term frequencies or tf-idf

After pre-processing the text data, we can proceed to generate features. For document clustering, one of the most common ways to generate features for a document is to calculate the term frequencies of all its tokens. Although not perfect, these frequencies can usually provide some clues about the topic of the document. It is sometimes also useful to weight the term frequencies by the inverse document frequencies; see tf-idf for a detailed discussion.
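For concreteness, one common tf-idf formulation weights a term's frequency in a document by log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. A hand-rolled sketch follows (libraries typically add smoothing terms, so treat this exact formula as one assumed variant):

```python
# Compute tf-idf by hand for a toy corpus: tf(t, d) * log(N / df(t)).
# One common formulation; libraries often add smoothing terms.
import math
from collections import Counter

docs = [
    ["cat", "sat", "mat"],
    ["cat", "cat", "dog"],
    ["stock", "market", "fell"],
]

n_docs = len(docs)
df = Counter(term for doc in docs for term in set(doc))  # document frequency

def tfidf(doc):
    tf = Counter(doc)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

for doc in docs:
    print(tfidf(doc))
# "cat" appears in 2 of 3 documents, so its idf (and weight) is lower
# than that of terms unique to a single document, such as "mat".
```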
5. Clustering

We can then cluster the documents based on the features we have generated. See the algorithm section in cluster analysis for different types of clustering methods.
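A minimal end-to-end sketch with K-means (scikit-learn and the toy corpus are assumed choices; any implementation of the algorithm would do):

```python
# Cluster tf-idf document vectors with K-means; scikit-learn assumed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell on market news",
    "the market rallied after the news",
]

features = TfidfVectorizer(stop_words="english").fit_transform(docs)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
print(kmeans.labels_)  # hard assignment: one cluster id per document
```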
6. Evaluation and visualization

Finally, the clustering models can be assessed by various metrics, and it is sometimes helpful to visualize the results by plotting the clusters in a low-dimensional (for instance, two-dimensional) space. See multidimensional scaling as a possible approach.
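As one possible assessment pipeline (the silhouette score and multidimensional scaling are assumed choices among many metrics and projections; scikit-learn assumed):

```python
# Score a clustering and project documents to 2-D via multidimensional scaling.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.manifold import MDS

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell on market news",
    "the market rallied after the news",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))           # internal quality metric in [-1, 1]
coords = MDS(n_components=2, random_state=0).fit_transform(X.toarray())
print(coords)                                # 2-D points, ready for a scatter plot
```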
Clustering v. Classifying

In computational text analysis, clustering algorithms group a set of documents into subsets, called clusters, where the algorithm's goal is to create clusters that are internally coherent but distinct from one another. Classification, on the other hand, is a form of supervised learning where the features of the documents are used to predict the "type" of each document.
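The distinction can be seen directly in code: a clusterer consumes only features, while a classifier additionally consumes labels. A sketch (scikit-learn assumed; the documents and labels are hypothetical):

```python
# Contrast sketch: clustering needs no labels; classification learns from them.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

docs = ["cats purr", "dogs bark", "stocks rose", "markets fell"]
types = ["animals", "animals", "finance", "finance"]  # hypothetical labels

X = TfidfVectorizer().fit_transform(docs)

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
classifier = LogisticRegression().fit(X, types)   # supervised: uses `types`
print(clusters)                                   # cluster ids with no inherent meaning
print(classifier.predict(X))                      # predicted document "types"
```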
See also

Cluster analysis
Fuzzy clustering

Bibliography

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. "Flat Clustering": http://nlp.stanford.edu/IR-book/pdf/16flat.pdf

Christopher Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Nicholas O. Andrews and Edward A. Fox. Recent Developments in Document Clustering. October 16, 2007.

Claudio Carpineto, Stanislaw Osiński, Giovanni Romano, and Dawid Weiss. A survey of Web clustering engines. ACM Computing Surveys, Volume 41, Issue 3 (July 2009), Article No. 17.

Wui Lee Chang, Kai Meng Tay, and Chee Peng Lim. A New Evolving Tree-Based Model with Local Re-learning for Document Clustering and Visualization. Neural Processing Letters. DOI: 10.1007/s11063-017-9597-3. https://link.springer.com/article/10.1007/s11063-017-9597-3