
Concept mining


Concept mining is an activity that results in the extraction of concepts from artifacts. Solutions to the task typically involve aspects of artificial intelligence and statistics, such as data mining and text mining. Because artifacts are typically a loosely structured sequence of words and other symbols (rather than concepts), the problem is nontrivial, but it can provide powerful insights into the meaning, provenance and similarity of documents.

Methods

Traditionally, the conversion of words to concepts has been performed using a thesaurus, and for computational techniques the tendency is to do the same. The thesauri used are either specially created for the task, or a pre-existing language model, usually related to Princeton's WordNet.

The mappings of words to concepts are often ambiguous. Typically each word in a given language will relate to several possible concepts. Humans use context to disambiguate the various meanings of a given piece of text, whereas available machine translation systems cannot easily infer context.

For the purposes of concept mining, however, these ambiguities tend to be less important than they are with machine translation, for in large documents the ambiguities tend to even out, much as is the case with text mining.

There are many techniques for disambiguation that may be used. Examples are linguistic analysis of the text and the use of word and concept association frequency information that may be inferred from large text corpora. Recently, techniques based on semantic similarity between the possible concepts and the context have appeared and gained interest in the scientific community.

Applications

Detecting and indexing similar documents in large corpora

One of the spin-offs of calculating document statistics in the concept domain, rather than the word domain, is that concepts form natural tree structures based on hypernymy and meronymy. These structures can be used to generate simple tree membership statistics that can be used to locate any document in a Euclidean concept space. If the size of a document is also considered as another dimension of this space, then an extremely efficient indexing system can be created. This technique is currently in commercial use locating similar legal documents in a 2.5 million document corpus.

Clustering documents by topic

Standard numeric clustering techniques may be used in "concept space" as described above to locate and index documents by the inferred topic. These are numerically far more efficient than their text mining cousins, and tend to behave more intuitively, in that they map better to the similarity measures a human would generate.
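
As a rough illustration of the thesaurus-based conversion of words to concepts and of locating documents in a Euclidean concept space, the following minimal Python sketch may help. Everything in it is an assumption made for illustration: the toy THESAURUS and HYPERNYM tables, the choice of two top-level concepts as axes, and the rule of splitting an ambiguous word's weight evenly across its candidate concepts are all hypothetical, and they stand in for a real thesaurus such as WordNet and for a real disambiguation technique. It is not the method used by any particular system.

```python
import math
from collections import Counter

# Hypothetical toy thesaurus: each word maps to one or more candidate concepts.
THESAURUS = {
    "bank": ["financial_institution", "riverside"],  # ambiguous word
    "loan": ["debt"],
    "mortgage": ["debt"],
    "river": ["watercourse"],
    "stream": ["watercourse"],
    "shore": ["riverside"],
}

# Hypothetical hypernym links (child concept -> parent concept).
HYPERNYM = {
    "financial_institution": "finance",
    "debt": "finance",
    "riverside": "geography",
    "watercourse": "geography",
}

# Top-level concepts used as the axes of the concept space (an assumption).
AXES = ["finance", "geography"]


def concept_weights(text):
    """Map words to concepts; an ambiguous word's weight is split evenly."""
    weights = Counter()
    for word in text.lower().split():
        candidates = THESAURUS.get(word, [])  # unknown words contribute nothing
        for concept in candidates:
            weights[concept] += 1.0 / len(candidates)
    return weights


def tree_membership(weights):
    """Roll each concept's weight up to the root of its hypernym tree."""
    totals = Counter()
    for concept, weight in weights.items():
        while concept in HYPERNYM:
            concept = HYPERNYM[concept]
        totals[concept] += weight
    return totals


def concept_vector(text):
    """Return the share of concept weight that falls under each axis."""
    totals = tree_membership(concept_weights(text))
    mass = sum(totals.values()) or 1.0  # avoid dividing by zero
    return [totals[axis] / mass for axis in AXES]


def distance(a, b):
    """Euclidean distance between two concept vectors."""
    return math.dist(a, b)


doc_a = concept_vector("bank loan mortgage")
doc_b = concept_vector("river bank stream shore")
doc_c = concept_vector("mortgage loan")
print(distance(doc_a, doc_c), distance(doc_a, doc_b))
```

In this toy setup the ambiguous word "bank" contributes half its weight to each sense, yet the first document still lies much closer to the third than to the second, which loosely mirrors the point that ambiguities tend to even out over a whole document. Document length could be appended as a further coordinate, and the resulting vectors could be passed to any standard numeric clustering routine.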

See also

Formal concept analysis
Information extraction
Compound term processing

References

Yuen-Hsien Tseng, Chun-Yen Chang, Shu-Nu Chang Rundgren, and Carl-Johan Rundgren, "Mining Concept Maps from News Stories for Measuring Civic Scientific Literacy in Media", Computers and Education, Vol. 55, No. 1, August 2010, pp. 165-177.
Li, Keqian; Zha, Hanwen; Su, Yu; Yan, Xifeng (November 2018). "Concept Mining via Embedding". 2018 IEEE International Conference on Data Mining (ICDM). IEEE. pp. 267–276. doi:10.1109/icdm.2018.00042. ISBN 978-1-5386-9159-5. S2CID 52841398.
Yuen-Hsien Tseng, "Automatic Thesaurus Generation for Chinese Documents", Journal of the American Society for Information Science and Technology, Vol. 53, No. 13, Nov. 2002, pp. 1130-1138.
Yuen-Hsien Tseng, "Generic Title Labeling for Clustered Documents", Expert Systems With Applications, Vol. 37, No. 3, 15 March 2010, pp. 2247-2254.


Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.