Compound-term processing, in information retrieval, is search result matching on the basis of compound terms. Compound terms are built by combining two or more simple terms; for example, "triple" is a single-word term, but "triple heart bypass" is a compound term.

Compound-term processing is a new approach to an old problem: how can one improve the relevance of search results while maintaining ease of use? Using this technique, a search for "survival rates following a triple heart bypass in elderly people" will locate documents about this topic even if this precise phrase is not contained in any document. This can be performed by a concept search, which itself uses compound-term processing. This will extract the key concepts automatically (in this case "survival rates", "triple heart bypass" and "elderly people") and use these concepts to select the most relevant documents.

Techniques

In August 2003, Concept Searching Limited introduced the idea of using statistical compound-term processing.

CLAMOUR is a European collaborative project which aims to find a better way to classify when collecting and disseminating industrial information and statistics. CLAMOUR appears to use a linguistic approach, rather than one based on statistical modelling.

History

Techniques for probabilistic weighting of single-word terms date back to at least 1976 in the landmark publication by Stephen E. Robertson and Karen Spärck Jones. Robertson stated that the assumption of word independence is not justified and exists as a matter of mathematical convenience. His objection to term independence is not a new idea, dating back to at least 1964 when H. H. Williams stated that "The assumption of independence of words in a document is usually made as a matter of mathematical convenience".

In 2004, Anna Lynn Patterson filed patents on "phrase-based searching in an information retrieval system" to which Google subsequently acquired the rights.

Adaptability

Statistical compound-term processing is more adaptable than the process described by Patterson. Her process is targeted at searching the World Wide Web, where an extensive statistical knowledge of common searches can be used to identify candidate phrases. Statistical compound-term processing is more suited to enterprise search applications where such a priori knowledge is not available.

Statistical compound-term processing is also more adaptable than the linguistic approach taken by the CLAMOUR project, which must consider the syntactic properties of the terms (i.e. part of speech, gender, number, etc.) and their combinations. CLAMOUR is highly language-dependent, whereas the statistical approach is language-independent.

Applications

Compound-term processing allows information-retrieval applications, such as search engines, to perform their matching on the basis of multi-word concepts, rather than on single words in isolation, which can be highly ambiguous.

Early search engines looked for documents containing the words entered by the user into the search box. These are known as keyword search engines. Boolean search engines add a degree of sophistication by allowing the user to specify additional requirements. For example, "Tiger NEAR Woods AND (golf OR golfing) NOT Volkswagen" uses the operators "NEAR", "AND", "OR" and "NOT" to specify that these words must follow certain requirements. A phrase search is simpler to use, but requires that the exact phrase specified appear in the results.

See also

Concept Searching Limited
Enterprise search
Information retrieval

References

1. "Lateral Thinking in Information Retrieval" (PDF). Information Management and Technology. 36 PART 4. Archived from the original (PDF) on 2017-11-15. Retrieved 2008-06-20. The British Library Direct catalogue entry is archived (2012-02-10) at the Wayback Machine.
2. National Statistics CLAMOUR project.
3. Robertson, S. E.; Spärck Jones, K. (1976). "Relevance weighting of search terms". Journal of the American Society for Information Science. 27 (3): 129. doi:10.1002/asi.4630270302.
4. Williams, J. H. (1965). "Results of classifying documents with multiple discriminant functions". Statistical Association Methods for Mechanized Documentation, National Bureau of Standards. Washington: 217–224. Archived from the original on 2011-07-17. Retrieved 2015-05-21.
5. US 20060031195.
6. Google Acquires Cuil Patent Applications.
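
The article does not say how compound terms are identified statistically, so the following Python sketch is only an illustration of the general idea of matching on multi-word concepts rather than on an exact phrase. It assumes that a run of two or three adjacent non-stop-words that recurs in at least `min_count` documents is a compound term. The stop-word list, the function names (`candidate_terms`, `extract_compound_terms`, `rank`) and the scoring by simple overlap are all hypothetical choices made for this example; they are not the method of Concept Searching Limited, Patterson, or CLAMOUR.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for this illustration only.
STOP_WORDS = {"a", "after", "an", "and", "following", "for", "in", "of", "the", "were"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def candidate_terms(text, max_len=3):
    """Return all runs of 2..max_len adjacent non-stop-words found in the text."""
    runs, run = [], []
    for token in tokenize(text):
        if token in STOP_WORDS:
            if run:
                runs.append(run)
            run = []
        else:
            run.append(token)
    if run:
        runs.append(run)

    terms = []
    for run in runs:
        for n in range(2, max_len + 1):
            for i in range(len(run) - n + 1):
                terms.append(" ".join(run[i:i + n]))
    return terms


def extract_compound_terms(documents, min_count=2, max_len=3):
    """Keep candidates that occur in at least min_count documents (assumed criterion)."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(candidate_terms(doc, max_len)))
    return {term for term, count in doc_freq.items() if count >= min_count}


def rank(query, documents, compound_terms, max_len=3):
    """Score each document by how many of the query's compound terms it contains."""
    query_terms = set(candidate_terms(query, max_len)) & compound_terms
    scores = []
    for index, doc in enumerate(documents):
        doc_terms = set(candidate_terms(doc, max_len))
        scores.append((len(query_terms & doc_terms), index))
    # Highest score first; ties keep the original document order.
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


DOCS = [
    "Survival rates after a triple heart bypass were studied in elderly people.",
    "Elderly people and triple heart bypass outcomes: survival rates.",
    "Tiger Woods won the golf tournament.",
]
QUERY = "survival rates following a triple heart bypass in elderly people"

TERMS = extract_compound_terms(DOCS)
RANKING = rank(QUERY, DOCS, TERMS)
print(sorted(TERMS))
print(RANKING)
```

Neither of the first two sample documents contains the query as an exact phrase, so a phrase search would miss both, while the overlap of compound terms still ranks them above the unrelated third document. Sub-phrases such as "triple heart" are also kept by this naive criterion; a real system would presumably weight or filter them.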
Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.