Document clustering (or text clustering) is the application of cluster analysis to textual documents. It has applications in automatic document organization, topic extraction and fast information retrieval or filtering.

Overview

Document clustering involves the use of descriptors and descriptor extraction. Descriptors are sets of words that describe the contents within the cluster. Document clustering is generally considered to be a centralized process. Examples of document clustering include web document clustering for search users.

The application of document clustering can be categorized into two types, online and offline. Online applications are usually constrained by efficiency problems when compared to offline applications. Text clustering may be used for different tasks, such as grouping similar documents (news, tweets, etc.) and the analysis of customer/employee feedback, discovering meaningful implicit subjects across all documents.

In general, there are two common algorithms. The first one is the hierarchical based algorithm, which includes single link, complete linkage, group average and Ward's method. By aggregating or dividing, documents can be clustered into a hierarchical structure, which is suitable for browsing. However, such an algorithm usually suffers from efficiency problems. The other algorithm is developed using the K-means algorithm and its variants. Generally, hierarchical algorithms produce more in-depth information for detailed analyses, while algorithms based around variants of the K-means algorithm are more efficient and provide sufficient information for most purposes.

These algorithms can further be classified as hard or soft clustering algorithms. Hard clustering computes a hard assignment – each document is a member of exactly one cluster. The assignment of soft clustering algorithms is soft – a document's assignment is a distribution over all clusters; in a soft assignment, a document has fractional membership in several clusters. Dimensionality reduction methods can be considered a subtype of soft clustering; for documents, these include latent semantic indexing (truncated singular value decomposition on term histograms) and topic models.
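A minimal sketch of these families in Python (scikit-learn is assumed to be available; the toy corpus and the choice of two clusters are purely illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import TruncatedSVD

docs = [
    "stock markets fell sharply on inflation fears",
    "central bank raises interest rates again",
    "local team wins the championship final",
    "star striker injured before the derby match",
]
X = TfidfVectorizer().fit_transform(docs)

# Hierarchical clustering (Ward's method): builds a cluster tree that is
# suitable for browsing, but scales poorly. Ward requires dense input.
print(AgglomerativeClustering(n_clusters=2, linkage="ward")
      .fit_predict(X.toarray()))

# K-means: flat but efficient; a hard assignment (one cluster per document).
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))

# Latent semantic indexing (truncated SVD on the term matrix): each document
# receives continuous loadings over latent dimensions, i.e. a soft,
# fractional kind of membership.
print(TruncatedSVD(n_components=2, random_state=0).fit_transform(X))
```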
Other algorithms involve graph based clustering, ontology supported clustering and order sensitive clustering.

Given a clustering, it can be beneficial to automatically derive human-readable labels for the clusters. Various methods exist for this purpose.
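One simple labeling heuristic (an illustrative assumption of this sketch, not a method prescribed here) is to describe each cluster by the highest-weighted terms of its centroid:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["stock markets fell on inflation fears",
        "central bank raises interest rates",
        "local team wins the championship final",
        "star striker injured before the match"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Label each cluster with the three terms nearest its centroid.
terms = np.array(vec.get_feature_names_out())
for i, center in enumerate(km.cluster_centers_):
    top = terms[np.argsort(center)[::-1][:3]]
    print(f"cluster {i}: {', '.join(top)}")
```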
Clustering in search engines

A web search engine often returns thousands of pages in response to a broad query, making it difficult for users to browse or to identify relevant information. Clustering methods can be used to automatically group the retrieved documents into a list of meaningful categories.
Procedures

In practice, document clustering often takes the following steps:

1. Tokenization

Tokenization is the process of parsing text data into smaller units (tokens) such as words and phrases. Commonly used tokenization methods include the bag-of-words model and the N-gram model.
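A minimal tokenization sketch in plain Python (the regular expression and the sample sentence are illustrative assumptions):

```python
import re

text = "Document clustering groups similar documents together."
tokens = re.findall(r"[a-z0-9']+", text.lower())
print(tokens)            # word-level tokens, the basis of a bag of words

# Word n-grams extend single tokens to short sequences (here, bigrams):
print(list(zip(tokens, tokens[1:])))
```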
2. Stemming and lemmatization

Different tokens might carry similar information (e.g. "tokenization" and "tokenizing"). We can avoid computing similar information repeatedly by reducing all tokens to their base forms using various stemming and lemmatization dictionaries.
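A sketch using NLTK (assumed installed; the lemmatizer additionally needs the WordNet data, e.g. via nltk.download("wordnet")):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["tokenization", "tokenizing"]])
# -> ['token', 'token']: both variants reduce to the same stem

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("documents"))  # -> 'document'
```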
3. Removing stop words and punctuation

Some tokens are less important than others. For instance, common words such as "the" might not be very helpful for revealing the essential characteristics of a text, so it is usually a good idea to eliminate stop words and punctuation marks before doing further analysis.
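A sketch of this filtering step (the small stop list is an illustrative assumption; libraries such as NLTK ship fuller lists):

```python
import string

stop_words = {"the", "a", "an", "is", "of", "and"}
tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

cleaned = [t for t in tokens
           if t not in stop_words and t not in string.punctuation]
print(cleaned)  # -> ['cat', 'sat', 'on', 'mat']
```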
4. Computing term frequencies or tf-idf

After pre-processing the text data, we can then proceed to generate features. For document clustering, one of the most common ways to generate features for a document is to calculate the term frequencies of all its tokens. Although not perfect, these frequencies can usually provide some clues about the topic of the document. It is sometimes also useful to weight the term frequencies by the inverse document frequencies. See tf-idf for detailed discussions.
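Both feature schemes in a short scikit-learn sketch (the two-document corpus is illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the markets fell", "the team won the match"]

counts = CountVectorizer().fit_transform(docs)  # raw term frequencies
tfidf = TfidfVectorizer().fit_transform(docs)   # frequencies reweighted by
                                                # inverse document frequency
print(counts.toarray())
print(tfidf.toarray().round(2))
```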
5. Clustering

We can then cluster different documents based on the features we have generated. See the algorithm section in cluster analysis for different types of clustering methods.
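The preceding steps can be chained end to end; a sketch (scikit-learn's built-in English stop list and k = 2 are illustrative choices):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

docs = ["stocks fell on inflation fears",
        "the bank raised interest rates",
        "the team won the final",
        "the striker scored twice"]

pipe = make_pipeline(TfidfVectorizer(stop_words="english"),
                     KMeans(n_clusters=2, n_init=10, random_state=0))
print(pipe.fit_predict(docs))  # one cluster id per document
```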
6. Evaluation and visualization

Finally, the clustering models can be assessed by various metrics, and it is sometimes helpful to visualize the results by plotting the clusters into a low (e.g., two) dimensional space. See multidimensional scaling as a possible approach.
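A sketch of both ideas (the silhouette score is one of several possible metrics, an assumption of this example; MDS is the projection mentioned above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.manifold import MDS

docs = ["stocks fell on inflation fears",
        "the bank raised interest rates",
        "the team won the final",
        "the striker scored twice"]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))  # closer to 1 means tighter clusters

coords = MDS(n_components=2, random_state=0).fit_transform(X.toarray())
print(coords)  # 2-D coordinates, ready for a scatter plot
```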
Clustering v. Classifying

Clustering algorithms in computational text analysis group a set of texts into what are called subsets or clusters, where the algorithm's goal is to create internally coherent clusters that are distinct from one another. Classification, on the other hand, is a form of supervised learning where the features of the documents are used to predict the "type" of documents.
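The contrast in code (a sketch; the labels and the choice of a naive Bayes classifier are illustrative assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["stocks fell", "rates rose", "team won", "striker scored"]
types = ["finance", "finance", "sports", "sports"]  # supervision: known types

vec = TfidfVectorizer().fit(docs)
clf = MultinomialNB().fit(vec.transform(docs), types)

# Unlike clustering, the model predicts a predefined type for new text.
print(clf.predict(vec.transform(["markets and rates"])))  # -> ['finance']
```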
See also

Fuzzy clustering
References

1. "Introduction to Information Retrieval". nlp.stanford.edu. p. 349. Retrieved 2016-05-03.
2. Manning, Chris, and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press. Cambridge, MA: May 1999.
Bibliography

- Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. "Flat Clustering" in Introduction to Information Retrieval. Cambridge University Press, 2008. http://nlp.stanford.edu/IR-book/pdf/16flat.pdf
- Nicholas O. Andrews and Edward A. Fox. Recent Developments in Document Clustering. October 16, 2007.
- Claudio Carpineto, Stanislaw Osiński, Giovanni Romano, and Dawid Weiss. A survey of Web clustering engines. ACM Computing Surveys, Volume 41, Issue 3 (July 2009), Article No. 17. ISSN 0360-0300.
- Wui Lee Chang, Kai Meng Tay, and Chee Peng Lim. A New Evolving Tree-Based Model with Local Re-learning for Document Clustering and Visualization. Neural Processing Letters. DOI: 10.1007/s11063-017-9597-3. https://link.springer.com/article/10.1007/s11063-017-9597-3