Concept mining

Concept mining is an activity that results in the extraction of concepts from artifacts. Solutions to the task typically involve aspects of artificial intelligence and statistics, such as data mining and text mining. Because artifacts are typically a loosely structured sequence of words and other symbols (rather than concepts), the problem is nontrivial, but it can provide powerful insights into the meaning, provenance and similarity of documents.

Methods

Traditionally, the conversion of words to concepts has been performed using a thesaurus, and computational techniques tend to do the same. The thesauri used are either specially created for the task or taken from a pre-existing language model, usually related to Princeton's WordNet.

The mappings of words to concepts are often ambiguous. Typically each word in a given language will relate to several possible concepts. Humans use context to disambiguate the various meanings of a given piece of text, whereas available machine translation systems cannot easily infer context.

For the purposes of concept mining, however, these ambiguities tend to be less important than they are with machine translation, for in large documents the ambiguities tend to even out, much as is the case with text mining.

There are many techniques for disambiguation that may be used. Examples are linguistic analysis of the text and the use of word and concept association frequency information that may be inferred from large text corpora. Recently, techniques based on semantic similarity between the possible concepts and the context have appeared and gained interest in the scientific community.

Applications

Detecting and indexing similar documents in large corpora

One of the spin-offs of calculating document statistics in the concept domain, rather than the word domain, is that concepts form natural tree structures based on hypernymy and meronymy. These structures can be used to generate simple tree membership statistics, which in turn can be used to locate any document in a Euclidean concept space. If the size of a document is also considered as another dimension of this space, then an extremely efficient indexing system can be created. This technique is currently in commercial use locating similar legal documents in a 2.5 million document corpus.

Clustering documents by topic

Standard numeric clustering techniques may be used in "concept space" as described above to locate and index documents by the inferred topic. These are numerically far more efficient than their text mining cousins, and tend to behave more intuitively, in that they map better to the similarity measures a human would generate.

See also

Formal concept analysis
Information extraction
Compound term processing

References

Yuen-Hsien Tseng, Chun-Yen Chang, Shu-Nu Chang Rundgren, and Carl-Johan Rundgren, "Mining Concept Maps from News Stories for Measuring Civic Scientific Literacy in Media", Computers and Education, Vol. 55, No. 1, August 2010, pp. 165-177.
Li, Keqian; Zha, Hanwen; Su, Yu; Yan, Xifeng (November 2018). "Concept Mining via Embedding". 2018 IEEE International Conference on Data Mining (ICDM). IEEE. pp. 267–276. doi:10.1109/icdm.2018.00042. ISBN 978-1-5386-9159-5. S2CID 52841398.
Yuen-Hsien Tseng, "Automatic Thesaurus Generation for Chinese Documents", Journal of the American Society for Information Science and Technology, Vol. 53, No. 13, Nov. 2002, pp. 1130-1138.
Yuen-Hsien Tseng, "Generic Title Labeling for Clustered Documents", Expert Systems With Applications, Vol. 37, No. 3, 15 March 2010, pp. 2247-2254.
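The thesaurus lookup and context-based disambiguation described under Methods can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy `THESAURUS`, the concept names, and the helper functions are hypothetical, and the overlap count is a simplified Lesk-style stand-in for a real semantic similarity measure, not the method of any particular system.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy thesaurus: each word maps to its candidate concepts, and
# each concept carries a small set of words that tend to occur near it.
THESAURUS: Dict[str, Dict[str, Set[str]]] = {
    "bank": {
        "financial_institution": {"money", "loan", "account", "deposit"},
        "river_edge": {"river", "water", "shore", "fishing"},
    },
    "court": {
        "law_court": {"judge", "trial", "lawyer", "ruling"},
        "sports_court": {"tennis", "ball", "net", "match"},
    },
}


def disambiguate(word: str, context: List[str]) -> Optional[str]:
    """Return the candidate concept whose word set overlaps the context most."""
    candidates = THESAURUS.get(word.lower())
    if not candidates:
        return None  # the thesaurus has no concept for this word
    context_words = {w.lower() for w in context}
    # max() keeps the first candidate on ties, so the order of the thesaurus
    # entry acts as the default sense when the context gives no evidence.
    return max(candidates, key=lambda c: len(candidates[c] & context_words))


def words_to_concepts(tokens: List[str]) -> List[str]:
    """Map every token known to the thesaurus to a single concept."""
    concepts = []
    for i, token in enumerate(tokens):
        context = tokens[:i] + tokens[i + 1:]  # all other tokens in the text
        concept = disambiguate(token, context)
        if concept is not None:
            concepts.append(concept)
    return concepts
```

Calling `words_to_concepts("he sat on the bank of the river fishing".split())` selects the river sense of "bank" because two of its associated words appear in the context. In a long document, occasional wrong choices of this kind matter less, since the concept counts are aggregated.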
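The idea of locating a document in a Euclidean concept space from tree membership statistics, with document size as an extra dimension, can be sketched as follows. The hypernym tree `PARENT`, the normalisation by concept count, and the logarithmic size axis are all assumptions chosen for illustration; the article does not specify how the commercial system computes its statistics.

```python
import math
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical hypernym tree: each concept points to its parent (None = root).
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "law": "entity",
    "finance": "entity",
    "contract": "law",
    "lawsuit": "law",
    "loan": "finance",
}

# Fixed axis order for the concept space, one axis per tree node.
AXES: List[str] = sorted(PARENT)


def ancestors_and_self(concept: str) -> List[str]:
    """Walk up the tree from a concept to the root."""
    chain = []
    node: Optional[str] = concept
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def concept_vector(concepts: List[str], size_weight: float = 1.0) -> List[float]:
    """Tree membership statistics plus one document-size dimension."""
    counts: Counter = Counter()
    for concept in concepts:
        # A concept counts towards its own node and every node above it.
        counts.update(ancestors_and_self(concept))
    total = len(concepts) or 1
    # Assumption: normalise by concept count so topic and size stay separate.
    vector = [counts[axis] / total for axis in AXES]
    # Assumption: document size enters as a log-scaled extra axis.
    vector.append(size_weight * math.log1p(len(concepts)))
    return vector


def nearest(query: List[float], index: Dict[str, List[float]]) -> str:
    """Return the id of the indexed document closest to the query vector."""
    return min(index, key=lambda doc_id: math.dist(query, index[doc_id]))
```

A linear scan is used in `nearest` only to keep the sketch short. Because the vectors are low-dimensional and numeric, a spatial index could replace it, which is presumably where the efficiency the article mentions comes from.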
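Clustering documents by topic in concept space can use any standard numeric method. The sketch below uses plain k-means (Lloyd's algorithm) as one such method; the choice of k-means, the function names, and the fixed iteration count are assumptions for illustration, since the article names no specific algorithm.

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]


def mean(points: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def k_means(points: List[Vector], k: int, iterations: int = 20, seed: int = 0) -> List[int]:
    """Return a cluster label for each point using plain Lloyd iterations."""
    rng = random.Random(seed)
    # Start from k distinct documents chosen at random as initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each document to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = mean(members)
    return labels
```

Each input point would be a document's concept-space vector. Documents that share a label are then treated as belonging to the same inferred topic.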