Knowledge (XXG)

tf–idf

Source 📝

2345:(in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and tf–idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf–idf closer to 0. 5244:
different domains. Another derivate is TF–IDuF. In TF–IDuF, idf is not calculated based on the document corpus that is to be searched or recommended. Instead, idf is calculated on users' personal document collections. The authors report that TF–IDuF was equally effective as tf–idf but could also be applied in situations when, e.g., a user modeling system has no access to a global document corpus.
1934: 5217: 5235:
addition, tf–idf was applied to "visual words" with the purpose of conducting object matching in videos, and entire sentences. However, the concept of tf–idf did not prove to be more effective in all cases than a plain tf scheme (without idf). When tf–idf was applied to citations, researchers could find no improvement over a simple citation-count weight that had no idf component.
3777: 4972: 3170: 3448: 3486: 3782:
This expression shows that summing the Tf–idf of all possible terms and documents recovers the mutual information between documents and term taking into account all the specificities of their joint distribution. Each Tf–idf hence carries the "bit of information" attached to a term x document pair.
2659: 5243:
A number of term-weighting schemes have derived from tf–idf. One of them is TF–PDF (term frequency * proportional document frequency). TF–PDF was introduced in 2001 in the context of identifying emerging topics in the media. The PDF component measures the difference of how often a term occurs in
5234:
The idea behind tf–idf also applies to entities other than terms. In 1998, the concept of idf was applied to citations. The authors argued that "if a very uncommon citation is shared by two documents, this should be weighted more highly than a citation made by a large number of documents". In
6024:
MATLAB toolbox that can be used for various tasks in text mining (TM) specifically i) indexing, ii) retrieval, iii) dimensionality reduction, iv) clustering, v) classification. The indexing step offers the user the ability to apply local and global weighting methods, including
4978: 4722: 2871: 4187: 4611: 4076: 4733: 1133: 2135: 3233: 4510: 4404: 4298: 3984: 3772:{\displaystyle M({\cal {T}};{\cal {D}})=\sum _{t,d}p_{t|d}\cdot p_{d}\cdot \mathrm {idf} (t)=\sum _{t,d}\mathrm {tf} (t,d)\cdot {\frac {1}{|D|}}\cdot \mathrm {idf} (t)={\frac {1}{|D|}}\sum _{t,d}\mathrm {tf} (t,d)\cdot \mathrm {idf} (t).} 881: 258:", and "salad" appears in very few plays, so seeing these words, one could get a good idea as to which play it might be. In contrast, "good" and "sweet" appears in every play and are completely uninformative as to which play it is. 1656: 1472: 619: 2336: 2485: 735: 455: 5719: 1301: 3891:
In its raw frequency form, tf is just the frequency of the "this" for each document. In each document, the word "this" appears once; but as the document 2 has more words, its relative frequency is smaller.
1548:
inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient):
2218: 2474: 1369: 2813: 2019: 2490: 5212:{\displaystyle \mathrm {tfidf} ({\mathsf {''example''}},d_{2},D)=\mathrm {tf} ({\mathsf {''example''}},d_{2})\times \mathrm {idf} ({\mathsf {''example''}},D)=0.429\times 0.301\approx 0.129} 3165:{\displaystyle H({\cal {D}}|{\cal {T}}=t)=-\sum _{d}p_{d|t}\log p_{d|t}=-\log {\frac {1}{|\{d\in D:t\in d\}|}}=\log {\frac {|\{d\in D:t\in d\}|}{|D|}}+\log |D|=-\mathrm {idf} (t)+\log |D|} 1533: 6012: 1217: 1841: 1927: 1775: 4967:{\displaystyle \mathrm {tfidf} ({\mathsf {''example''}},d_{1},D)=\mathrm {tf} ({\mathsf {''example''}},d_{1})\times \mathrm {idf} ({\mathsf {''example''}},D)=0\times 0.301=0} 512: 93:. A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries used tf–idf. Variations of the tf–idf weighting scheme were often used by 3443:{\displaystyle M({\cal {T}};{\cal {D}})=H({\cal {D}})-H({\cal {D}}|{\cal {T}})=\sum _{t}p_{t}\cdot (H({\cal {D}})-H({\cal {D}}|W=t))=\sum _{t}p_{t}\cdot \mathrm {idf} (t)} 2030: 3221: 3197: 6044: 1007:
augmented frequency, to prevent a bias towards longer documents, e.g. raw frequency divided by the raw frequency of the most frequently occurring term in the document:
4617: 1720: 363: 5730: 2724: 3478: 324: 2691:; it helps to understand why their product has a meaning in terms of joint informational content of a document. A characteristic assumption about the distribution 1867: 1014: 2863: 2843: 1795: 1682: 4091: 4516: 1555: 2654:{\displaystyle {\begin{aligned}\mathrm {idf} &=-\log P(t|D)\\&=\log {\frac {1}{P(t|D)}}\\&=\log {\frac {N}{|\{d\in D:t\in d\}|}}\end{aligned}}} 3990: 2231: 123:(1972) conceived a statistical interpretation of term-specificity called Inverse Document Frequency (idf), which became a cornerstone of term weighting: 777: 2671:. However, applying such information-theoretic notions to problems in information retrieval leads to problems when trying to define the appropriate 1380: 4415: 4304: 523: 4198: 4085:
for the ratio of documents that include the word "this". In this case, we have a corpus of two documents and all of them include the word "this".
3898: 5561: 5531: 5846: 5781: 5676: 5633: 5496: 5349: 630: 5304: 1237: 5720:"Evaluating the CC-IDF citation-weighting scheme – How effectively can 'Inverse Document Frequency' (IDF) be applied to references?" 112:
is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.
98: 2146: 66:, adjusted for the fact that some words appear more frequently in general. Like the bag-of-words model, it models a document as a 374: 5259: 2361:, its theoretical foundations have been troublesome for at least three decades afterward, with many researchers trying to find 2390: 1312: 2732: 4192:
So tf–idf is zero for the word "this", which implies that the word is not very informative as it appears in all documents.
1965: 1843:). If the term is not in the corpus, this will lead to a division-by-zero. It is therefore common to adjust the numerator 5947:
Wu, H. C.; Luk, R.W.P.; Wong, K.F.; Kwok, K.L. (2008). "Interpreting TF-IDF term weights as making relevance decisions".
6003: 5756:
Proceedings Third International Workshop on Advanced Issues of E-Commerce and Web-Based Information Systems. WECWIS 2001
1544:
is a measure of how much information the word provides, i.e., how common or rare it is across all documents. It is the
6054: 5657:
Sivic, Josef; Zisserman, Andrew (2003-01-01). "Video Google: A text retrieval approach to object matching in videos".
5264: 6049: 5701: 1483: 127:
The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs.
5269: 2676: 1155: 920:(counting each occurrence of the same term separately). There are various other ways to define term frequency: 6028: 3480:, the unconditional probability to draw a term, with respect to the (random) choice of a document, to obtain: 1800: 3791:
Suppose that we have term count tables of a corpus consisting of only two documents, as listed on the right.
1872: 5759: 5424: 2818:
This assumption and its implications, according to Aizawa: "represent the heuristic that tf–idf employs."
2342: 131:
For example, the df (document frequency) and idf for some words in Shakespeare's 37 plays are as follows:
1726: 5807:"TF-IDuF: A Novel Term-Weighting Scheme for User Modeling based on Users' Personal Document Collections" 102: 31: 5412: 2354: 120: 466: 5429: 2130:{\displaystyle \left(0.5+0.5{\frac {f_{t,q}}{\max _{t}f_{t,q}}}\right)\cdot \log {\frac {N}{n_{t}}}} 278:
A formula that aims to define the importance of a keyword or phrase within a document or a web page.
5764: 5512: 2822: 2664:
Namely, the inverse document frequency is the logarithm of "inverse" relative document frequency.
5974: 5935: 5886: 5787: 5682: 5639: 5598: 5442: 5394: 5294: 5274: 4409:
The word "example" is more interesting - it occurs three times, but only in the second document:
3224: 2688: 2373: 2362: 75: 5998: 5902: 5806: 4717:{\displaystyle \mathrm {idf} ({\mathsf {''example''}},D)=\log \left({\frac {2}{1}}\right)=0.301} 3202: 3178: 396: 5842: 5777: 5672: 5629: 5492: 5386: 5345: 5223: 1545: 976: 950: 5459:
Speech and Language Processing (3rd ed. draft), Dan Jurafsky and James H. Martin, chapter 14.
5415:(1972). "A Statistical Interpretation of Term Specificity and Its Application in Retrieval". 1687: 335: 5964: 5956: 5925: 5917: 5876: 5868: 5769: 5664: 5621: 5590: 5543: 5484: 5434: 5378: 5337: 2694: 2668: 109: 82: 3456: 1128:{\displaystyle \mathrm {tf} (t,d)=0.5+0.5\cdot {\frac {f_{t,d}}{\max\{f_{t',d}:t'\in d\}}}} 301: 4182:{\displaystyle \mathrm {idf} ({\mathsf {''this''}},D)=\log \left({\frac {2}{2}}\right)=0} 1846: 5754:
Khoo Khyou Bun; Bun, Khoo Khyou; Ishizuka, M. (2001). "Emerging Topic Tracking System".
4606:{\displaystyle \mathrm {tf} ({\mathsf {''example''}},d_{2})={\frac {3}{7}}\approx 0.429} 1937:
Plot of different inverse document frequency functions: standard, smooth, probabilistic.
1651:{\displaystyle \mathrm {idf} (t,D)=\log {\frac {N}{|\{d:d\in D{\text{ and }}t\in d\}|}}} 5254: 2848: 2828: 2368:
Spärck Jones's own explanation did not propose much theory, aside from a connection to
1780: 1667: 5594: 5534:(2004). "Understanding inverse document frequency: On theoretical arguments for IDF". 5366: 6038: 5921: 5898: 5890: 5856: 5831: 5826: 5398: 2369: 255: 94: 90: 5978: 5686: 5618:
Proceedings of the second international conference on Autonomous agents - AGENTS '98
5602: 4071:{\displaystyle \mathrm {tf} ({\mathsf {''this''}},d_{2})={\frac {1}{7}}\approx 0.14} 3223:
are "random variables" corresponding to respectively draw a document or a term. The
6016: 5939: 5791: 5643: 5446: 2331:{\displaystyle \mathrm {tfidf} (t,d,D)=\mathrm {tf} (t,d)\cdot \mathrm {idf} (t,D)} 5658: 1933: 5581:
Aizawa, Akiko (2003). "An information-theoretic perspective of tf–idf measures".
5488: 5473: 5460: 5341: 2687:
Both term frequency and inverse document frequency can be formulated in terms of
5838: 5279: 2679:: not only documents need to be taken into account, but also queries and terms. 2672: 86: 63: 5773: 876:{\displaystyle \mathrm {tf} (t,d)={\frac {f_{t,d}}{\sum _{t'\in d}{f_{t',d}}}}} 5702:"Sentence Extraction by tf/idf and Position Weighting from Newspaper Articles" 5668: 5547: 5382: 5299: 5284: 1467:{\displaystyle \log \left({\frac {\max _{\{t'\in d\}}n_{t'}}{1+n_{t}}}\right)} 71: 5616:
Bollacker, Kurt D.; Lawrence, Steve; Giles, C. Lee (1998-01-01). "CiteSeer".
5390: 5326: 4505:{\displaystyle \mathrm {tf} ({\mathsf {''example''}},d_{1})={\frac {0}{5}}=0} 4399:{\displaystyle \mathrm {tfidf} ({\mathsf {''this''}},d_{2},D)=0.14\times 0=0} 614:{\displaystyle 0.5+0.5\cdot {\frac {f_{t,d}}{\max _{\{t'\in d\}}{f_{t',d}}}}} 275:. There are various ways for determining the exact values of both statistics. 5995:
is a Python library for vector space modeling and includes tf–idf weighting.
5960: 4293:{\displaystyle \mathrm {tfidf} ({\mathsf {''this''}},d_{1},D)=0.2\times 0=0} 3979:{\displaystyle \mathrm {tf} ({\mathsf {''this''}},d_{1})={\frac {1}{5}}=0.2} 2358: 6021: 5625: 2865:(and assuming that all documents have equal probability to be chosen) is: 5289: 2667:
This probabilistic interpretation in turn takes the same form as that of
67: 59: 5969: 5872: 5859:; Fox, E. A.; Wu, H. (1983). "Extended Boolean information retrieval". 3888:
The calculation of tf–idf for the term "this" is performed as follows:
1947:
Variants of term frequency-inverse document frequency (tf–idf) weights
914:. Note the denominator is simply the total number of terms in document 730:{\displaystyle K+(1-K){\frac {f_{t,d}}{\max _{\{t'\in d\}}{f_{t',d}}}}} 78:, by allowing the weight of words to depend on the rest of the corpus. 17: 6007: 5992: 5930: 5881: 5438: 5660:
Proceedings Ninth IEEE International Conference on Computer Vision
1932: 251: 1296:{\displaystyle \log {\frac {N}{n_{t}}}=-\log {\frac {n_{t}}{N}}} 5365:
Breitinger, Corinna; Gipp, Bela; Langer, Stefan (2015-07-26).
2376:
footing, by estimating the probability that a given document
2213:{\displaystyle (1+\log f_{t,d})\cdot \log {\frac {N}{n_{t}}}} 906:
of a term in a document, i.e., the number of times that term
3508: 3498: 3369: 3350: 3305: 3293: 3274: 3255: 3245: 3208: 3184: 2895: 2883: 450:{\displaystyle f_{t,d}{\Bigg /}{\sum _{t'\in d}{f_{t',d}}}} 5367:"Research-paper recommender systems: a literature survey" 2469:{\displaystyle P(t|D)={\frac {|\{d\in D:t\in d\}|}{N}},} 1364:{\displaystyle \log \left({\frac {N}{1+n_{t}}}\right)+1} 5903:"Term-weighting approaches in automatic text retrieval" 2808:{\displaystyle p(d|t)={\frac {1}{|\{d\in D:t\in d\}|}}} 2845:, conditional to the fact it contains a specific term 97:
as a central tool in scoring and ranking a document's
5474:"Scoring, term weighting, and the vector space model" 4981: 4736: 4620: 4519: 4418: 4307: 4201: 4094: 3993: 3901: 3489: 3459: 3236: 3205: 3181: 2874: 2851: 2831: 2735: 2697: 2488: 2393: 2234: 2149: 2033: 1968: 1875: 1849: 1803: 1783: 1729: 1690: 1670: 1558: 1486: 1383: 1315: 1240: 1158: 1017: 780: 633: 526: 469: 377: 338: 304: 2014:{\displaystyle f_{t,d}\cdot \log {\frac {N}{n_{t}}}} 1144:
Variants of inverse document frequency (idf) weight
5576: 5574: 5830: 5211: 4966: 4716: 4605: 4504: 4398: 4292: 4181: 4070: 3978: 3771: 3472: 3442: 3215: 3191: 3164: 2857: 2837: 2807: 2718: 2653: 2468: 2357:in a 1972 paper. Although it has worked well as a 2341:A high weight in tf–idf is reached by a high term 2330: 2212: 2129: 2013: 1921: 1861: 1835: 1789: 1769: 1714: 1676: 1650: 1527: 1466: 1363: 1295: 1211: 1127: 875: 729: 613: 506: 449: 357: 318: 27:Estimate of the importance of a word in a document 5472:Manning, C.D.; Raghavan, P.; Schutze, H. (2008). 2068: 1398: 1075: 675: 559: 5526: 5524: 5522: 125: 5461:https://web.stanford.edu/~jurafsky/slp3/14.pdf 2825:of a "randomly chosen" document in the corpus 1528:{\displaystyle \log {\frac {N-n_{t}}{n_{t}}}} 267:The tf–idf is the product of two statistics, 58:, is a measure of importance of a word to a 8: 5833:Introduction to modern information retrieval 3066: 3042: 3014: 2990: 2794: 2770: 2636: 2612: 2449: 2425: 2353:Idf was introduced as "term specificity" by 1911: 1887: 1759: 1735: 1637: 1605: 1419: 1402: 1201: 1177: 1119: 1078: 696: 679: 580: 563: 1777: : number of documents where the term 5718:Beel, Joeran; Breitinger, Corinna (2017). 5371:International Journal on Digital Libraries 2372:. Attempts have been made to put idf on a 1684:: total number of documents in the corpus 5968: 5929: 5880: 5763: 5428: 5146: 5145: 5131: 5119: 5076: 5075: 5064: 5046: 5003: 5002: 4982: 4980: 4901: 4900: 4886: 4874: 4831: 4830: 4819: 4801: 4758: 4757: 4737: 4735: 4694: 4636: 4635: 4621: 4619: 4587: 4575: 4532: 4531: 4520: 4518: 4486: 4474: 4431: 4430: 4419: 4417: 4363: 4329: 4328: 4308: 4306: 4257: 4223: 4222: 4202: 4200: 4159: 4110: 4109: 4095: 4093: 4052: 4040: 4006: 4005: 3994: 3992: 3960: 3948: 3914: 3913: 3902: 3900: 3746: 3720: 3708: 3696: 3688: 3682: 3659: 3648: 3640: 3634: 3608: 3596: 3569: 3560: 3543: 3539: 3523: 3507: 3506: 3497: 3496: 3488: 3464: 3458: 3420: 3411: 3401: 3374: 3368: 3367: 3349: 3348: 3330: 3320: 3304: 3303: 3298: 3292: 3291: 3273: 3272: 3254: 3253: 3244: 3243: 3235: 3207: 3206: 3204: 3183: 3182: 3180: 3157: 3149: 3120: 3109: 3101: 3084: 3076: 3069: 3037: 3034: 3017: 2985: 2979: 2957: 2953: 2933: 2929: 2919: 2894: 2893: 2888: 2882: 2881: 2873: 2850: 2830: 2797: 2765: 2759: 2745: 2734: 2696: 2639: 2607: 2601: 2571: 2556: 2529: 2493: 2489: 2487: 2452: 2420: 2417: 2403: 2392: 2302: 2276: 2235: 2233: 2202: 2193: 2169: 2148: 2119: 2110: 2081: 2071: 2054: 2048: 2032: 2003: 1994: 1973: 1967: 1942:Term frequency–inverse document frequency 1914: 1882: 1874: 1848: 1804: 1802: 1782: 1762: 1730: 1728: 1706: 1698: 1697: 1689: 1669: 1640: 1623: 1600: 1594: 1559: 1557: 1517: 1506: 1493: 1485: 1451: 1428: 1401: 1394: 1382: 1342: 1326: 1314: 1282: 1276: 1256: 1247: 1239: 1212:{\displaystyle n_{t}=|\{d\in D:t\in d\}|} 1204: 1172: 1163: 1157: 1085: 1062: 1056: 1018: 1016: 852: 847: 830: 813: 807: 781: 779: 706: 701: 678: 661: 655: 632: 590: 585: 562: 545: 539: 525: 489: 468: 428: 423: 406: 401: 395: 394: 382: 376: 343: 337: 305: 303: 56:term frequency–inverse document frequency 3840: 3793: 1945: 1836:{\displaystyle \mathrm {tf} (t,d)\neq 0} 1478:probabilistic inverse document frequency 1142: 281: 133: 6045:Statistical natural language processing 5949:ACM Transactions on Information Systems 5910:Information Processing & Management 5317: 283:Variants of term frequency (tf) weight 5173: 5169: 5166: 5163: 5160: 5157: 5154: 5103: 5099: 5096: 5093: 5090: 5087: 5084: 5030: 5026: 5023: 5020: 5017: 5014: 5011: 4928: 4924: 4921: 4918: 4915: 4912: 4909: 4858: 4854: 4851: 4848: 4845: 4842: 4839: 4785: 4781: 4778: 4775: 4772: 4769: 4766: 4663: 4659: 4656: 4653: 4650: 4647: 4644: 4559: 4555: 4552: 4549: 4546: 4543: 4540: 4458: 4454: 4451: 4448: 4445: 4442: 4439: 4347: 4343: 4340: 4337: 4241: 4237: 4234: 4231: 4128: 4124: 4121: 4118: 4024: 4020: 4017: 4014: 3932: 3928: 3925: 3922: 85:in searches of information retrieval, 5583:Information Processing and Management 5566:Introduction to Information Retrieval 5481:Introduction to Information Retrieval 1922:{\displaystyle 1+|\{d\in D:t\in d\}|} 74:. It is a refinement over the simple 7: 5987:External links and suggested reading 5707:. National Institute of Informatics. 5325:Rajaraman, A.; Ullman, J.D. (2011). 2384:as the relative document frequency, 759:, is the relative frequency of term 5805:Langer, Stefan; Gipp, Bela (2017). 5727:Proceedings of the 12th IConference 4081:An idf is constant per corpus, and 1770:{\displaystyle |\{d\in D:t\in d\}|} 5305:SMART Information Retrieval System 5138: 5135: 5132: 5068: 5065: 4995: 4992: 4989: 4986: 4983: 4893: 4890: 4887: 4823: 4820: 4750: 4747: 4744: 4741: 4738: 4628: 4625: 4622: 4524: 4521: 4423: 4420: 4321: 4318: 4315: 4312: 4309: 4215: 4212: 4209: 4206: 4203: 4102: 4099: 4096: 3998: 3995: 3906: 3903: 3753: 3750: 3747: 3724: 3721: 3666: 3663: 3660: 3612: 3609: 3576: 3573: 3570: 3427: 3424: 3421: 3127: 3124: 3121: 2500: 2497: 2494: 2309: 2306: 2303: 2280: 2277: 2248: 2245: 2242: 2239: 2236: 1808: 1805: 1566: 1563: 1560: 1022: 1019: 785: 782: 25: 5562:Probability estimates in practice 1307:inverse document frequency smooth 507:{\displaystyle \log(1+f_{t,d})} 6022:Text to Matrix Generator (TMG) 6004:tf–idf and related definitions 5188: 5142: 5125: 5072: 5058: 4999: 4943: 4897: 4880: 4827: 4813: 4754: 4678: 4632: 4581: 4528: 4480: 4427: 4375: 4325: 4269: 4219: 4143: 4106: 4046: 4002: 3954: 3910: 3763: 3757: 3740: 3728: 3697: 3689: 3676: 3670: 3649: 3641: 3628: 3616: 3586: 3580: 3544: 3513: 3493: 3437: 3431: 3391: 3388: 3375: 3364: 3355: 3345: 3339: 3310: 3299: 3288: 3279: 3269: 3260: 3240: 3158: 3150: 3137: 3131: 3110: 3102: 3085: 3077: 3070: 3038: 3018: 2986: 2958: 2934: 2906: 2889: 2878: 2798: 2766: 2753: 2746: 2739: 2713: 2701: 2640: 2608: 2579: 2572: 2565: 2537: 2530: 2523: 2453: 2421: 2411: 2404: 2397: 2325: 2313: 2296: 2284: 2270: 2252: 2181: 2150: 1915: 1883: 1824: 1812: 1763: 1731: 1707: 1699: 1641: 1601: 1582: 1570: 1375:inverse document frequency max 1205: 1173: 1038: 1026: 801: 789: 652: 640: 501: 476: 1: 6031:Explanation of term-frequency 5595:10.1016/S0306-4573(02)00021-3 2479:so that we can define idf as 2225:Then tf–idf is calculated as 5922:10.1016/0306-4573(88)90021-0 5663:. ICCV '03. pp. 1470–. 5513:"TFIDF statistics | SAX-VSM" 5489:10.1017/CBO9780511809071.007 5342:10.1017/CBO9781139058452.002 2683:Link with information theory 5265:Latent Dirichlet allocation 5260:Kullback–Leibler divergence 3453:The last step is to expand 6071: 5999:Anatomy of a search engine 5774:10.1109/wecwis.2001.933900 5334:Mining of Massive Datasets 3216:{\displaystyle {\cal {T}}} 3192:{\displaystyle {\cal {D}}} 1542:inverse document frequency 1232:inverse document frequency 1139:Inverse document frequency 273:inverse document frequency 5861:Communications of the ACM 5669:10.1109/ICCV.2003.1238663 5548:10.1108/00220410410560582 5383:10.1007/s00799-015-0156-0 2677:probability distributions 2025:double normalization-idf 6029:Term-frequency explained 5829:; McGill, M. J. (1986). 5536:Journal of Documentation 5417:Journal of Documentation 5270:Latent semantic analysis 518:double normalization 0.5 5961:10.1145/1361684.1361686 2365:justifications for it. 1715:{\displaystyle N={|D|}} 358:{\displaystyle f_{t,d}} 81:It was often used as a 5901:; Buckley, C. (1988). 5213: 4968: 4718: 4607: 4506: 4400: 4294: 4183: 4072: 3980: 3773: 3474: 3444: 3217: 3193: 3175:In terms of notation, 3166: 2859: 2839: 2809: 2720: 2719:{\displaystyle p(d,t)} 2655: 2470: 2332: 2214: 2141:log normalization-idf 2131: 2015: 1938: 1923: 1863: 1837: 1791: 1771: 1716: 1678: 1652: 1546:logarithmically scaled 1529: 1468: 1365: 1297: 1213: 1129: 977:logarithmically scaled 924:the raw count itself: 877: 731: 625:double normalization K 615: 508: 451: 359: 320: 129: 5626:10.1145/280765.280786 5214: 4969: 4719: 4608: 4507: 4401: 4295: 4184: 4073: 3981: 3774: 3475: 3473:{\displaystyle p_{t}} 3445: 3218: 3194: 3167: 2860: 2840: 2810: 2721: 2656: 2471: 2363:information theoretic 2333: 2215: 2132: 2016: 1936: 1924: 1864: 1838: 1792: 1772: 1717: 1679: 1653: 1530: 1469: 1366: 1298: 1214: 1130: 878: 732: 616: 509: 452: 360: 321: 319:{\displaystyle {0,1}} 32:information retrieval 5620:. pp. 116–123. 4979: 4734: 4618: 4517: 4416: 4305: 4199: 4092: 3991: 3899: 3487: 3457: 3234: 3227:can be expressed as 3203: 3179: 2872: 2849: 2829: 2733: 2695: 2486: 2391: 2349:Justification of idf 2232: 2147: 2031: 1966: 1873: 1847: 1801: 1781: 1727: 1688: 1668: 1556: 1484: 1381: 1313: 1238: 1156: 1015: 778: 631: 524: 467: 375: 336: 302: 108:One of the simplest 3843: 3796: 2823:conditional entropy 1948: 1869:and denominator to 1862:{\displaystyle 1+N} 1145: 910:occurs in document 284: 135: 62:in a collection or 6055:Vector space model 5873:10.1145/182.358466 5295:Vector space model 5275:Mutual information 5209: 4964: 4714: 4603: 4502: 4396: 4290: 4179: 4068: 3976: 3841: 3794: 3769: 3719: 3607: 3534: 3470: 3440: 3406: 3325: 3225:mutual information 3213: 3189: 3162: 2924: 2855: 2835: 2805: 2716: 2689:information theory 2651: 2649: 2466: 2355:Karen Spärck Jones 2328: 2210: 2127: 2076: 2011: 1946: 1939: 1919: 1859: 1833: 1787: 1767: 1712: 1674: 1648: 1525: 1464: 1423: 1361: 1293: 1209: 1143: 1125: 873: 846: 727: 700: 611: 584: 504: 447: 422: 355: 316: 282: 134: 121:Karen Spärck Jones 76:bag-of-words model 70:of words, without 6050:Ranking functions 5867:(11): 1022–1036. 5848:978-0-07-054484-0 5783:978-0-7695-1224-2 5758:. pp. 2–11. 5678:978-0-7695-1950-0 5635:978-0-89791-983-8 5498:978-0-511-80907-1 5351:978-1-139-05845-2 5336:. pp. 1–17. 5224:base 10 logarithm 4702: 4595: 4494: 4167: 4060: 3968: 3886: 3885: 3839: 3838: 3787:Example of tf–idf 3704: 3702: 3654: 3592: 3519: 3397: 3316: 3090: 3023: 2915: 2858:{\displaystyle t} 2838:{\displaystyle D} 2803: 2675:for the required 2645: 2583: 2461: 2223: 2222: 2208: 2125: 2094: 2067: 2009: 1952:weighting scheme 1790:{\displaystyle t} 1677:{\displaystyle N} 1646: 1626: 1538: 1537: 1523: 1458: 1397: 1349: 1291: 1262: 1149:weighting scheme 1123: 871: 826: 740: 739: 725: 674: 609: 558: 461:log normalization 402: 288:weighting scheme 248: 247: 110:ranking functions 16:(Redirected from 6062: 6013:TfidfTransformer 5982: 5972: 5943: 5933: 5907: 5894: 5884: 5852: 5836: 5818: 5817: 5811: 5802: 5796: 5795: 5767: 5751: 5745: 5744: 5742: 5741: 5735: 5729:. Archived from 5724: 5715: 5709: 5708: 5706: 5697: 5691: 5690: 5654: 5648: 5647: 5613: 5607: 5606: 5578: 5569: 5558: 5552: 5551: 5528: 5517: 5516: 5509: 5503: 5502: 5478: 5469: 5463: 5457: 5451: 5450: 5439:10.1108/eb026526 5432: 5413:Spärck Jones, K. 5409: 5403: 5402: 5362: 5356: 5355: 5331: 5322: 5218: 5216: 5215: 5210: 5181: 5180: 5179: 5153: 5141: 5124: 5123: 5111: 5110: 5109: 5083: 5071: 5051: 5050: 5038: 5037: 5036: 5010: 4998: 4973: 4971: 4970: 4965: 4936: 4935: 4934: 4908: 4896: 4879: 4878: 4866: 4865: 4864: 4838: 4826: 4806: 4805: 4793: 4792: 4791: 4765: 4753: 4723: 4721: 4720: 4715: 4707: 4703: 4695: 4671: 4670: 4669: 4643: 4631: 4612: 4610: 4609: 4604: 4596: 4588: 4580: 4579: 4567: 4566: 4565: 4539: 4527: 4511: 4509: 4508: 4503: 4495: 4487: 4479: 4478: 4466: 4465: 4464: 4438: 4426: 4405: 4403: 4402: 4397: 4368: 4367: 4355: 4354: 4353: 4336: 4324: 4299: 4297: 4296: 4291: 4262: 4261: 4249: 4248: 4247: 4230: 4218: 4188: 4186: 4185: 4180: 4172: 4168: 4160: 4136: 4135: 4134: 4117: 4105: 4077: 4075: 4074: 4069: 4061: 4053: 4045: 4044: 4032: 4031: 4030: 4013: 4001: 3985: 3983: 3982: 3977: 3969: 3961: 3953: 3952: 3940: 3939: 3938: 3921: 3909: 3844: 3797: 3778: 3776: 3775: 3770: 3756: 3727: 3718: 3703: 3701: 3700: 3692: 3683: 3669: 3655: 3653: 3652: 3644: 3635: 3615: 3606: 3579: 3565: 3564: 3552: 3551: 3547: 3533: 3512: 3511: 3502: 3501: 3479: 3477: 3476: 3471: 3469: 3468: 3449: 3447: 3446: 3441: 3430: 3416: 3415: 3405: 3378: 3373: 3372: 3354: 3353: 3335: 3334: 3324: 3309: 3308: 3302: 3297: 3296: 3278: 3277: 3259: 3258: 3249: 3248: 3222: 3220: 3219: 3214: 3212: 3211: 3198: 3196: 3195: 3190: 3188: 3187: 3171: 3169: 3168: 3163: 3161: 3153: 3130: 3113: 3105: 3091: 3089: 3088: 3080: 3074: 3073: 3041: 3035: 3024: 3022: 3021: 2989: 2980: 2966: 2965: 2961: 2942: 2941: 2937: 2923: 2899: 2898: 2892: 2887: 2886: 2864: 2862: 2861: 2856: 2844: 2842: 2841: 2836: 2814: 2812: 2811: 2806: 2804: 2802: 2801: 2769: 2760: 2749: 2725: 2723: 2722: 2717: 2669:self-information 2660: 2658: 2657: 2652: 2650: 2646: 2644: 2643: 2611: 2602: 2588: 2584: 2582: 2575: 2557: 2543: 2533: 2503: 2475: 2473: 2472: 2467: 2462: 2457: 2456: 2424: 2418: 2407: 2383: 2380:contains a term 2379: 2337: 2335: 2334: 2329: 2312: 2283: 2251: 2219: 2217: 2216: 2211: 2209: 2207: 2206: 2194: 2180: 2179: 2136: 2134: 2133: 2128: 2126: 2124: 2123: 2111: 2100: 2096: 2095: 2093: 2092: 2091: 2075: 2065: 2064: 2049: 2020: 2018: 2017: 2012: 2010: 2008: 2007: 1995: 1984: 1983: 1949: 1928: 1926: 1925: 1920: 1918: 1886: 1868: 1866: 1865: 1860: 1842: 1840: 1839: 1834: 1811: 1796: 1794: 1793: 1788: 1776: 1774: 1773: 1768: 1766: 1734: 1721: 1719: 1718: 1713: 1711: 1710: 1702: 1683: 1681: 1680: 1675: 1657: 1655: 1654: 1649: 1647: 1645: 1644: 1627: 1624: 1604: 1595: 1569: 1534: 1532: 1531: 1526: 1524: 1522: 1521: 1512: 1511: 1510: 1494: 1473: 1471: 1470: 1465: 1463: 1459: 1457: 1456: 1455: 1439: 1438: 1437: 1436: 1422: 1412: 1395: 1370: 1368: 1367: 1362: 1354: 1350: 1348: 1347: 1346: 1327: 1302: 1300: 1299: 1294: 1292: 1287: 1286: 1277: 1263: 1261: 1260: 1248: 1218: 1216: 1215: 1210: 1208: 1176: 1168: 1167: 1146: 1134: 1132: 1131: 1126: 1124: 1122: 1112: 1101: 1100: 1093: 1073: 1072: 1057: 1025: 1003: 973:and 0 otherwise; 972: 968: 964: 947: 919: 913: 909: 901: 882: 880: 879: 874: 872: 870: 869: 868: 867: 860: 845: 838: 824: 823: 808: 788: 770: 765:within document 764: 758: 747:Term frequency, 736: 734: 733: 728: 726: 724: 723: 722: 721: 714: 699: 689: 672: 671: 656: 620: 618: 617: 612: 610: 608: 607: 606: 605: 598: 583: 573: 556: 555: 540: 513: 511: 510: 505: 500: 499: 456: 454: 453: 448: 446: 445: 444: 443: 436: 421: 414: 400: 399: 393: 392: 364: 362: 361: 356: 354: 353: 325: 323: 322: 317: 315: 285: 136: 83:weighting factor 21: 6070: 6069: 6065: 6064: 6063: 6061: 6060: 6059: 6035: 6034: 5989: 5946: 5905: 5897: 5855: 5849: 5825: 5822: 5821: 5809: 5804: 5803: 5799: 5784: 5753: 5752: 5748: 5739: 5737: 5733: 5722: 5717: 5716: 5712: 5704: 5699: 5698: 5694: 5679: 5656: 5655: 5651: 5636: 5615: 5614: 5610: 5580: 5579: 5572: 5559: 5555: 5530: 5529: 5520: 5511: 5510: 5506: 5499: 5483:. p. 100. 5476: 5471: 5470: 5466: 5458: 5454: 5430:10.1.1.115.8343 5411: 5410: 5406: 5364: 5363: 5359: 5352: 5329: 5324: 5323: 5319: 5314: 5309: 5250: 5241: 5232: 5172: 5147: 5115: 5102: 5077: 5042: 5029: 5004: 4977: 4976: 4927: 4902: 4870: 4857: 4832: 4797: 4784: 4759: 4732: 4731: 4690: 4662: 4637: 4616: 4615: 4571: 4558: 4533: 4515: 4514: 4470: 4457: 4432: 4414: 4413: 4359: 4346: 4330: 4303: 4302: 4253: 4240: 4224: 4197: 4196: 4155: 4127: 4111: 4090: 4089: 4036: 4023: 4007: 3989: 3988: 3944: 3931: 3915: 3897: 3896: 3789: 3687: 3639: 3556: 3535: 3485: 3484: 3460: 3455: 3454: 3407: 3326: 3232: 3231: 3201: 3200: 3177: 3176: 3075: 3036: 2984: 2949: 2925: 2870: 2869: 2847: 2846: 2827: 2826: 2764: 2731: 2730: 2693: 2692: 2685: 2648: 2647: 2606: 2586: 2585: 2561: 2541: 2540: 2504: 2484: 2483: 2419: 2389: 2388: 2381: 2377: 2351: 2230: 2229: 2198: 2165: 2145: 2144: 2115: 2077: 2066: 2050: 2038: 2034: 2029: 2028: 1999: 1969: 1964: 1963: 1944: 1871: 1870: 1845: 1844: 1799: 1798: 1797:appears (i.e., 1779: 1778: 1725: 1724: 1686: 1685: 1666: 1665: 1625: and  1599: 1554: 1553: 1513: 1502: 1495: 1482: 1481: 1447: 1440: 1429: 1424: 1405: 1396: 1390: 1379: 1378: 1338: 1331: 1322: 1311: 1310: 1278: 1252: 1236: 1235: 1159: 1154: 1153: 1141: 1105: 1086: 1081: 1074: 1058: 1013: 1012: 1001: 980: 970: 966: 954: 953:"frequencies": 946: 925: 915: 911: 907: 900: 888: 853: 848: 831: 825: 809: 776: 775: 766: 760: 748: 745: 707: 702: 682: 673: 657: 629: 628: 591: 586: 566: 557: 541: 522: 521: 485: 465: 464: 429: 424: 407: 378: 373: 372: 369:term frequency 339: 334: 333: 300: 299: 264: 118: 28: 23: 22: 15: 12: 11: 5: 6068: 6066: 6058: 6057: 6052: 6047: 6037: 6036: 6033: 6032: 6026: 6019: 6010: 6001: 5996: 5988: 5985: 5984: 5983: 5944: 5916:(5): 513–523. 5895: 5853: 5847: 5820: 5819: 5797: 5782: 5765:10.1.1.16.7986 5746: 5710: 5692: 5677: 5649: 5634: 5608: 5570: 5553: 5542:(5): 503–520. 5518: 5504: 5497: 5464: 5452: 5404: 5377:(4): 305–338. 5357: 5350: 5316: 5315: 5313: 5310: 5308: 5307: 5302: 5297: 5292: 5287: 5282: 5277: 5272: 5267: 5262: 5257: 5255:Word embedding 5251: 5249: 5246: 5240: 5237: 5231: 5228: 5220: 5219: 5208: 5205: 5202: 5199: 5196: 5193: 5190: 5187: 5184: 5178: 5175: 5171: 5168: 5165: 5162: 5159: 5156: 5152: 5149: 5144: 5140: 5137: 5134: 5130: 5127: 5122: 5118: 5114: 5108: 5105: 5101: 5098: 5095: 5092: 5089: 5086: 5082: 5079: 5074: 5070: 5067: 5063: 5060: 5057: 5054: 5049: 5045: 5041: 5035: 5032: 5028: 5025: 5022: 5019: 5016: 5013: 5009: 5006: 5001: 4997: 4994: 4991: 4988: 4985: 4974: 4963: 4960: 4957: 4954: 4951: 4948: 4945: 4942: 4939: 4933: 4930: 4926: 4923: 4920: 4917: 4914: 4911: 4907: 4904: 4899: 4895: 4892: 4889: 4885: 4882: 4877: 4873: 4869: 4863: 4860: 4856: 4853: 4850: 4847: 4844: 4841: 4837: 4834: 4829: 4825: 4822: 4818: 4815: 4812: 4809: 4804: 4800: 4796: 4790: 4787: 4783: 4780: 4777: 4774: 4771: 4768: 4764: 4761: 4756: 4752: 4749: 4746: 4743: 4740: 4725: 4724: 4713: 4710: 4706: 4701: 4698: 4693: 4689: 4686: 4683: 4680: 4677: 4674: 4668: 4665: 4661: 4658: 4655: 4652: 4649: 4646: 4642: 4639: 4634: 4630: 4627: 4624: 4613: 4602: 4599: 4594: 4591: 4586: 4583: 4578: 4574: 4570: 4564: 4561: 4557: 4554: 4551: 4548: 4545: 4542: 4538: 4535: 4530: 4526: 4523: 4512: 4501: 4498: 4493: 4490: 4485: 4482: 4477: 4473: 4469: 4463: 4460: 4456: 4453: 4450: 4447: 4444: 4441: 4437: 4434: 4429: 4425: 4422: 4407: 4406: 4395: 4392: 4389: 4386: 4383: 4380: 4377: 4374: 4371: 4366: 4362: 4358: 4352: 4349: 4345: 4342: 4339: 4335: 4332: 4327: 4323: 4320: 4317: 4314: 4311: 4300: 4289: 4286: 4283: 4280: 4277: 4274: 4271: 4268: 4265: 4260: 4256: 4252: 4246: 4243: 4239: 4236: 4233: 4229: 4226: 4221: 4217: 4214: 4211: 4208: 4205: 4190: 4189: 4178: 4175: 4171: 4166: 4163: 4158: 4154: 4151: 4148: 4145: 4142: 4139: 4133: 4130: 4126: 4123: 4120: 4116: 4113: 4108: 4104: 4101: 4098: 4079: 4078: 4067: 4064: 4059: 4056: 4051: 4048: 4043: 4039: 4035: 4029: 4026: 4022: 4019: 4016: 4012: 4009: 4004: 4000: 3997: 3986: 3975: 3972: 3967: 3964: 3959: 3956: 3951: 3947: 3943: 3937: 3934: 3930: 3927: 3924: 3920: 3917: 3912: 3908: 3905: 3884: 3883: 3880: 3876: 3875: 3872: 3868: 3867: 3864: 3860: 3859: 3856: 3852: 3851: 3848: 3837: 3836: 3833: 3829: 3828: 3825: 3821: 3820: 3817: 3813: 3812: 3809: 3805: 3804: 3801: 3788: 3785: 3780: 3779: 3768: 3765: 3762: 3759: 3755: 3752: 3749: 3745: 3742: 3739: 3736: 3733: 3730: 3726: 3723: 3717: 3714: 3711: 3707: 3699: 3695: 3691: 3686: 3681: 3678: 3675: 3672: 3668: 3665: 3662: 3658: 3651: 3647: 3643: 3638: 3633: 3630: 3627: 3624: 3621: 3618: 3614: 3611: 3605: 3602: 3599: 3595: 3591: 3588: 3585: 3582: 3578: 3575: 3572: 3568: 3563: 3559: 3555: 3550: 3546: 3542: 3538: 3532: 3529: 3526: 3522: 3518: 3515: 3510: 3505: 3500: 3495: 3492: 3467: 3463: 3451: 3450: 3439: 3436: 3433: 3429: 3426: 3423: 3419: 3414: 3410: 3404: 3400: 3396: 3393: 3390: 3387: 3384: 3381: 3377: 3371: 3366: 3363: 3360: 3357: 3352: 3347: 3344: 3341: 3338: 3333: 3329: 3323: 3319: 3315: 3312: 3307: 3301: 3295: 3290: 3287: 3284: 3281: 3276: 3271: 3268: 3265: 3262: 3257: 3252: 3247: 3242: 3239: 3210: 3186: 3173: 3172: 3160: 3156: 3152: 3148: 3145: 3142: 3139: 3136: 3133: 3129: 3126: 3123: 3119: 3116: 3112: 3108: 3104: 3100: 3097: 3094: 3087: 3083: 3079: 3072: 3068: 3065: 3062: 3059: 3056: 3053: 3050: 3047: 3044: 3040: 3033: 3030: 3027: 3020: 3016: 3013: 3010: 3007: 3004: 3001: 2998: 2995: 2992: 2988: 2983: 2978: 2975: 2972: 2969: 2964: 2960: 2956: 2952: 2948: 2945: 2940: 2936: 2932: 2928: 2922: 2918: 2914: 2911: 2908: 2905: 2902: 2897: 2891: 2885: 2880: 2877: 2854: 2834: 2816: 2815: 2800: 2796: 2793: 2790: 2787: 2784: 2781: 2778: 2775: 2772: 2768: 2763: 2758: 2755: 2752: 2748: 2744: 2741: 2738: 2715: 2712: 2709: 2706: 2703: 2700: 2684: 2681: 2662: 2661: 2642: 2638: 2635: 2632: 2629: 2626: 2623: 2620: 2617: 2614: 2610: 2605: 2600: 2597: 2594: 2591: 2589: 2587: 2581: 2578: 2574: 2570: 2567: 2564: 2560: 2555: 2552: 2549: 2546: 2544: 2542: 2539: 2536: 2532: 2528: 2525: 2522: 2519: 2516: 2513: 2510: 2507: 2505: 2502: 2499: 2496: 2492: 2491: 2477: 2476: 2465: 2460: 2455: 2451: 2448: 2445: 2442: 2439: 2436: 2433: 2430: 2427: 2423: 2416: 2413: 2410: 2406: 2402: 2399: 2396: 2350: 2347: 2339: 2338: 2327: 2324: 2321: 2318: 2315: 2311: 2308: 2305: 2301: 2298: 2295: 2292: 2289: 2286: 2282: 2279: 2275: 2272: 2269: 2266: 2263: 2260: 2257: 2254: 2250: 2247: 2244: 2241: 2238: 2221: 2220: 2205: 2201: 2197: 2192: 2189: 2186: 2183: 2178: 2175: 2172: 2168: 2164: 2161: 2158: 2155: 2152: 2142: 2138: 2137: 2122: 2118: 2114: 2109: 2106: 2103: 2099: 2090: 2087: 2084: 2080: 2074: 2070: 2063: 2060: 2057: 2053: 2047: 2044: 2041: 2037: 2026: 2022: 2021: 2006: 2002: 1998: 1993: 1990: 1987: 1982: 1979: 1976: 1972: 1961: 1957: 1956: 1953: 1943: 1940: 1931: 1930: 1917: 1913: 1910: 1907: 1904: 1901: 1898: 1895: 1892: 1889: 1885: 1881: 1878: 1858: 1855: 1852: 1832: 1829: 1826: 1823: 1820: 1817: 1814: 1810: 1807: 1786: 1765: 1761: 1758: 1755: 1752: 1749: 1746: 1743: 1740: 1737: 1733: 1722: 1709: 1705: 1701: 1696: 1693: 1673: 1659: 1658: 1643: 1639: 1636: 1633: 1630: 1622: 1619: 1616: 1613: 1610: 1607: 1603: 1598: 1593: 1590: 1587: 1584: 1581: 1578: 1575: 1572: 1568: 1565: 1562: 1536: 1535: 1520: 1516: 1509: 1505: 1501: 1498: 1492: 1489: 1479: 1475: 1474: 1462: 1454: 1450: 1446: 1443: 1435: 1432: 1427: 1421: 1418: 1415: 1411: 1408: 1404: 1400: 1393: 1389: 1386: 1376: 1372: 1371: 1360: 1357: 1353: 1345: 1341: 1337: 1334: 1330: 1325: 1321: 1318: 1308: 1304: 1303: 1290: 1285: 1281: 1275: 1272: 1269: 1266: 1259: 1255: 1251: 1246: 1243: 1233: 1229: 1228: 1225: 1221: 1220: 1207: 1203: 1200: 1197: 1194: 1191: 1188: 1185: 1182: 1179: 1175: 1171: 1166: 1162: 1150: 1140: 1137: 1136: 1135: 1121: 1118: 1115: 1111: 1108: 1104: 1099: 1096: 1092: 1089: 1084: 1080: 1077: 1071: 1068: 1065: 1061: 1055: 1052: 1049: 1046: 1043: 1040: 1037: 1034: 1031: 1028: 1024: 1021: 1009: 1008: 1005: 993: 974: 948: 938: 892: 885: 884: 866: 863: 859: 856: 851: 844: 841: 837: 834: 829: 822: 819: 816: 812: 806: 803: 800: 797: 794: 791: 787: 784: 744: 743:Term frequency 741: 738: 737: 720: 717: 713: 710: 705: 698: 695: 692: 688: 685: 681: 677: 670: 667: 664: 660: 654: 651: 648: 645: 642: 639: 636: 626: 622: 621: 604: 601: 597: 594: 589: 582: 579: 576: 572: 569: 565: 561: 554: 551: 548: 544: 538: 535: 532: 529: 519: 515: 514: 503: 498: 495: 492: 488: 484: 481: 478: 475: 472: 462: 458: 457: 442: 439: 435: 432: 427: 420: 417: 413: 410: 405: 398: 391: 388: 385: 381: 370: 366: 365: 352: 349: 346: 342: 331: 327: 326: 314: 311: 308: 297: 293: 292: 289: 280: 279: 276: 269:term frequency 263: 260: 246: 245: 242: 239: 235: 234: 231: 228: 224: 223: 220: 217: 213: 212: 209: 206: 202: 201: 198: 195: 191: 190: 187: 184: 180: 179: 176: 173: 169: 168: 165: 162: 158: 157: 154: 151: 147: 146: 143: 140: 117: 114: 95:search engines 26: 24: 14: 13: 10: 9: 6: 4: 3: 2: 6067: 6056: 6053: 6051: 6048: 6046: 6043: 6042: 6040: 6030: 6027: 6023: 6020: 6018: 6014: 6011: 6009: 6005: 6002: 6000: 5997: 5994: 5991: 5990: 5986: 5980: 5976: 5971: 5966: 5962: 5958: 5954: 5950: 5945: 5941: 5937: 5932: 5927: 5923: 5919: 5915: 5911: 5904: 5900: 5896: 5892: 5888: 5883: 5878: 5874: 5870: 5866: 5862: 5858: 5854: 5850: 5844: 5840: 5835: 5834: 5828: 5824: 5823: 5815: 5808: 5801: 5798: 5793: 5789: 5785: 5779: 5775: 5771: 5766: 5761: 5757: 5750: 5747: 5736:on 2020-09-22 5732: 5728: 5721: 5714: 5711: 5703: 5700:Seki, Yohei. 5696: 5693: 5688: 5684: 5680: 5674: 5670: 5666: 5662: 5661: 5653: 5650: 5645: 5641: 5637: 5631: 5627: 5623: 5619: 5612: 5609: 5604: 5600: 5596: 5592: 5588: 5584: 5577: 5575: 5571: 5567: 5563: 5557: 5554: 5549: 5545: 5541: 5537: 5533: 5532:Robertson, S. 5527: 5525: 5523: 5519: 5514: 5508: 5505: 5500: 5494: 5490: 5486: 5482: 5475: 5468: 5465: 5462: 5456: 5453: 5448: 5444: 5440: 5436: 5431: 5426: 5422: 5418: 5414: 5408: 5405: 5400: 5396: 5392: 5388: 5384: 5380: 5376: 5372: 5368: 5361: 5358: 5353: 5347: 5343: 5339: 5335: 5328: 5327:"Data Mining" 5321: 5318: 5311: 5306: 5303: 5301: 5298: 5296: 5293: 5291: 5288: 5286: 5283: 5281: 5278: 5276: 5273: 5271: 5268: 5266: 5263: 5261: 5258: 5256: 5253: 5252: 5247: 5245: 5238: 5236: 5229: 5227: 5225: 5206: 5203: 5200: 5197: 5194: 5191: 5185: 5182: 5176: 5150: 5148: 5128: 5120: 5116: 5112: 5106: 5080: 5078: 5061: 5055: 5052: 5047: 5043: 5039: 5033: 5007: 5005: 4975: 4961: 4958: 4955: 4952: 4949: 4946: 4940: 4937: 4931: 4905: 4903: 4883: 4875: 4871: 4867: 4861: 4835: 4833: 4816: 4810: 4807: 4802: 4798: 4794: 4788: 4762: 4760: 4730: 4729: 4728: 4711: 4708: 4704: 4699: 4696: 4691: 4687: 4684: 4681: 4675: 4672: 4666: 4640: 4638: 4614: 4600: 4597: 4592: 4589: 4584: 4576: 4572: 4568: 4562: 4536: 4534: 4513: 4499: 4496: 4491: 4488: 4483: 4475: 4471: 4467: 4461: 4435: 4433: 4412: 4411: 4410: 4393: 4390: 4387: 4384: 4381: 4378: 4372: 4369: 4364: 4360: 4356: 4350: 4333: 4331: 4301: 4287: 4284: 4281: 4278: 4275: 4272: 4266: 4263: 4258: 4254: 4250: 4244: 4227: 4225: 4195: 4194: 4193: 4176: 4173: 4169: 4164: 4161: 4156: 4152: 4149: 4146: 4140: 4137: 4131: 4114: 4112: 4088: 4087: 4086: 4084: 4065: 4062: 4057: 4054: 4049: 4041: 4037: 4033: 4027: 4010: 4008: 3987: 3973: 3970: 3965: 3962: 3957: 3949: 3945: 3941: 3935: 3918: 3916: 3895: 3894: 3893: 3889: 3881: 3878: 3877: 3873: 3870: 3869: 3865: 3862: 3861: 3857: 3854: 3853: 3849: 3846: 3845: 3834: 3831: 3830: 3826: 3823: 3822: 3818: 3815: 3814: 3810: 3807: 3806: 3802: 3799: 3798: 3792: 3786: 3784: 3766: 3760: 3743: 3737: 3734: 3731: 3715: 3712: 3709: 3705: 3693: 3684: 3679: 3673: 3656: 3645: 3636: 3631: 3625: 3622: 3619: 3603: 3600: 3597: 3593: 3589: 3583: 3566: 3561: 3557: 3553: 3548: 3540: 3536: 3530: 3527: 3524: 3520: 3516: 3503: 3490: 3483: 3482: 3481: 3465: 3461: 3434: 3417: 3412: 3408: 3402: 3398: 3394: 3385: 3382: 3379: 3361: 3358: 3342: 3336: 3331: 3327: 3321: 3317: 3313: 3285: 3282: 3266: 3263: 3250: 3237: 3230: 3229: 3228: 3226: 3154: 3146: 3143: 3140: 3134: 3117: 3114: 3106: 3098: 3095: 3092: 3081: 3063: 3060: 3057: 3054: 3051: 3048: 3045: 3031: 3028: 3025: 3011: 3008: 3005: 3002: 2999: 2996: 2993: 2981: 2976: 2973: 2970: 2967: 2962: 2954: 2950: 2946: 2943: 2938: 2930: 2926: 2920: 2916: 2912: 2909: 2903: 2900: 2875: 2868: 2867: 2866: 2852: 2832: 2824: 2819: 2791: 2788: 2785: 2782: 2779: 2776: 2773: 2761: 2756: 2750: 2742: 2736: 2729: 2728: 2727: 2710: 2707: 2704: 2698: 2690: 2682: 2680: 2678: 2674: 2670: 2665: 2633: 2630: 2627: 2624: 2621: 2618: 2615: 2603: 2598: 2595: 2592: 2590: 2576: 2568: 2562: 2558: 2553: 2550: 2547: 2545: 2534: 2526: 2520: 2517: 2514: 2511: 2508: 2506: 2482: 2481: 2480: 2463: 2458: 2446: 2443: 2440: 2437: 2434: 2431: 2428: 2414: 2408: 2400: 2394: 2387: 2386: 2385: 2375: 2374:probabilistic 2371: 2366: 2364: 2360: 2356: 2348: 2346: 2344: 2322: 2319: 2316: 2299: 2293: 2290: 2287: 2273: 2267: 2264: 2261: 2258: 2255: 2228: 2227: 2226: 2203: 2199: 2195: 2190: 2187: 2184: 2176: 2173: 2170: 2166: 2162: 2159: 2156: 2153: 2143: 2140: 2139: 2120: 2116: 2112: 2107: 2104: 2101: 2097: 2088: 2085: 2082: 2078: 2072: 2061: 2058: 2055: 2051: 2045: 2042: 2039: 2035: 2027: 2024: 2023: 2004: 2000: 1996: 1991: 1988: 1985: 1980: 1977: 1974: 1970: 1962: 1959: 1958: 1954: 1951: 1950: 1941: 1935: 1908: 1905: 1902: 1899: 1896: 1893: 1890: 1879: 1876: 1856: 1853: 1850: 1830: 1827: 1821: 1818: 1815: 1784: 1756: 1753: 1750: 1747: 1744: 1741: 1738: 1723: 1703: 1694: 1691: 1671: 1664: 1663: 1662: 1634: 1631: 1628: 1620: 1617: 1614: 1611: 1608: 1596: 1591: 1588: 1585: 1579: 1576: 1573: 1552: 1551: 1550: 1547: 1543: 1518: 1514: 1507: 1503: 1499: 1496: 1490: 1487: 1480: 1477: 1476: 1460: 1452: 1448: 1444: 1441: 1433: 1430: 1425: 1416: 1413: 1409: 1406: 1391: 1387: 1384: 1377: 1374: 1373: 1358: 1355: 1351: 1343: 1339: 1335: 1332: 1328: 1323: 1319: 1316: 1309: 1306: 1305: 1288: 1283: 1279: 1273: 1270: 1267: 1264: 1257: 1253: 1249: 1244: 1241: 1234: 1231: 1230: 1226: 1223: 1222: 1198: 1195: 1192: 1189: 1186: 1183: 1180: 1169: 1164: 1160: 1151: 1148: 1147: 1138: 1116: 1113: 1109: 1106: 1102: 1097: 1094: 1090: 1087: 1082: 1069: 1066: 1063: 1059: 1053: 1050: 1047: 1044: 1041: 1035: 1032: 1029: 1011: 1010: 1006: 1000: 996: 992: 989:) = log (1 + 988: 984: 978: 975: 962: 958: 952: 949: 945: 941: 937: 933: 929: 923: 922: 921: 918: 905: 899: 895: 891: 864: 861: 857: 854: 849: 842: 839: 835: 832: 827: 820: 817: 814: 810: 804: 798: 795: 792: 774: 773: 772: 769: 763: 756: 752: 742: 718: 715: 711: 708: 703: 693: 690: 686: 683: 668: 665: 662: 658: 649: 646: 643: 637: 634: 627: 624: 623: 602: 599: 595: 592: 587: 577: 574: 570: 567: 552: 549: 546: 542: 536: 533: 530: 527: 520: 517: 516: 496: 493: 490: 486: 482: 479: 473: 470: 463: 460: 459: 440: 437: 433: 430: 425: 418: 415: 411: 408: 403: 389: 386: 383: 379: 371: 368: 367: 350: 347: 344: 340: 332: 329: 328: 312: 309: 306: 298: 295: 294: 290: 287: 286: 277: 274: 270: 266: 265: 261: 259: 257: 253: 250:We see that " 243: 240: 237: 236: 232: 229: 226: 225: 221: 218: 215: 214: 210: 207: 204: 203: 199: 196: 193: 192: 188: 185: 182: 181: 177: 174: 171: 170: 166: 163: 160: 159: 155: 152: 149: 148: 144: 141: 138: 137: 132: 128: 124: 122: 115: 113: 111: 106: 104: 101:given a user 100: 96: 92: 91:user modeling 88: 84: 79: 77: 73: 69: 65: 61: 57: 54:), short for 53: 49: 45: 41: 37: 33: 19: 6017:scikit-learn 5952: 5948: 5913: 5909: 5864: 5860: 5832: 5813: 5800: 5755: 5749: 5738:. Retrieved 5731:the original 5726: 5713: 5695: 5659: 5652: 5617: 5611: 5589:(1): 45–65. 5586: 5582: 5565: 5556: 5539: 5535: 5507: 5480: 5467: 5455: 5423:(1): 11–21. 5420: 5416: 5407: 5374: 5370: 5360: 5333: 5320: 5242: 5233: 5230:Beyond terms 5221: 4726: 4408: 4191: 4082: 4080: 3890: 3887: 3790: 3781: 3452: 3174: 2820: 2817: 2686: 2673:event spaces 2666: 2663: 2478: 2367: 2352: 2340: 2224: 1660: 1541: 1539: 1152:idf weight ( 998: 994: 990: 986: 982: 960: 956: 943: 939: 935: 931: 927: 916: 903: 897: 893: 889: 886: 767: 761: 754: 750: 746: 272: 268: 249: 130: 126: 119: 107: 80: 55: 51: 47: 43: 39: 35: 29: 6006:as used in 5970:10397/10130 5839:McGraw-Hill 5814:IConference 5280:Noun phrase 5239:Derivatives 5222:(using the 3850:Term Count 3842:Document 1 3803:Term Count 3795:Document 2 979:frequency: 116:Motivations 87:text mining 6039:Categories 5899:Salton, G. 5857:Salton, G. 5740:2017-01-29 5312:References 5300:Word count 5285:Okapi BM25 2370:Zipf's law 1960:count-idf 969:occurs in 291:tf weight 262:Definition 72:word order 5931:1813/6721 5891:207180535 5882:1813/6351 5827:Salton, G 5760:CiteSeerX 5560:See also 5425:CiteSeerX 5399:207035184 5391:1432-5012 5204:≈ 5198:× 5129:× 4953:× 4884:× 4727:Finally, 4688:⁡ 4598:≈ 4385:× 4279:× 4153:⁡ 4063:≈ 3744:⋅ 3706:∑ 3657:⋅ 3632:⋅ 3594:∑ 3567:⋅ 3554:⋅ 3521:∑ 3418:⋅ 3399:∑ 3359:− 3337:⋅ 3318:∑ 3283:− 3147:⁡ 3118:− 3099:⁡ 3061:∈ 3049:∈ 3032:⁡ 3009:∈ 2997:∈ 2977:⁡ 2971:− 2947:⁡ 2917:∑ 2913:− 2789:∈ 2777:∈ 2726:is that: 2631:∈ 2619:∈ 2599:⁡ 2554:⁡ 2518:⁡ 2512:− 2444:∈ 2432:∈ 2359:heuristic 2343:frequency 2300:⋅ 2191:⁡ 2185:⋅ 2163:⁡ 2108:⁡ 2102:⋅ 1992:⁡ 1986:⋅ 1906:∈ 1894:∈ 1828:≠ 1754:∈ 1742:∈ 1632:∈ 1618:∈ 1592:⁡ 1500:− 1491:⁡ 1414:∈ 1388:⁡ 1320:⁡ 1274:⁡ 1268:− 1245:⁡ 1196:∈ 1184:∈ 1114:∈ 1054:⋅ 904:raw count 840:∈ 828:∑ 691:∈ 647:− 575:∈ 537:⋅ 474:⁡ 416:∈ 404:∑ 330:raw count 172:Falstaff 99:relevance 5979:18303048 5955:(3): 1. 5687:14457153 5603:45793141 5290:PageRank 5248:See also 5177:″ 5151:″ 5107:″ 5081:″ 5034:″ 5008:″ 4932:″ 4906:″ 4862:″ 4836:″ 4789:″ 4763:″ 4667:″ 4641:″ 4563:″ 4537:″ 4462:″ 4436:″ 4351:″ 4334:″ 4245:″ 4228:″ 4132:″ 4115:″ 4083:accounts 4028:″ 4011:″ 3936:″ 3919:″ 3832:example 3824:another 1434:′ 1410:′ 1110:′ 1091:′ 858:′ 836:′ 712:′ 687:′ 596:′ 571:′ 434:′ 412:′ 256:Falstaff 68:multiset 60:document 6025:tf–idf. 5940:7725217 5792:1049263 5644:3526393 5447:2996187 3879:sample 1955:tf-idf 951:Boolean 902:is the 194:battle 183:forest 6008:Lucene 5993:Gensim 5977:  5938:  5889:  5845:  5790:  5780:  5762:  5685:  5675:  5642:  5632:  5601:  5495:  5445:  5427:  5397:  5389:  5348:  887:where 296:binary 238:sweet 222:0.012 211:0.037 200:0.246 189:0.489 178:0.967 161:salad 150:Romeo 89:, and 64:corpus 52:Tf–idf 48:TF–IDF 40:TF*IDF 38:(also 36:tf–idf 18:Tf-idf 5975:S2CID 5936:S2CID 5906:(PDF) 5887:S2CID 5810:(PDF) 5788:S2CID 5734:(PDF) 5723:(PDF) 5705:(PDF) 5683:S2CID 5640:S2CID 5599:S2CID 5477:(PDF) 5443:S2CID 5395:S2CID 5330:(PDF) 5207:0.129 5201:0.301 5195:0.429 4956:0.301 4712:0.301 4601:0.429 3847:Term 3800:Term 1661:with 1224:unary 963:) = 1 252:Romeo 227:good 216:fool 167:1.27 156:1.57 139:Word 103:query 50:, or 44:TFIDF 5843:ISBN 5778:ISBN 5673:ISBN 5630:ISBN 5493:ISBN 5387:ISSN 5346:ISBN 4382:0.14 4066:0.14 3855:this 3808:this 3199:and 2821:The 1540:The 934:) = 271:and 254:", " 205:wit 145:idf 6015:in 5965:hdl 5957:doi 5926:hdl 5918:doi 5877:hdl 5869:doi 5770:doi 5665:doi 5622:doi 5591:doi 5564:in 5544:doi 5485:doi 5435:doi 5379:doi 5338:doi 5226:). 4685:log 4276:0.2 4150:log 3974:0.2 3863:is 3816:is 3144:log 3096:log 3029:log 2974:log 2944:log 2596:log 2551:log 2515:log 2188:log 2160:log 2105:log 2069:max 2046:0.5 2040:0.5 1989:log 1589:log 1488:log 1399:max 1385:log 1317:log 1271:log 1242:log 1076:max 1051:0.5 1045:0.5 981:tf( 965:if 955:tf( 926:tf( 771:, 749:tf( 676:max 560:max 534:0.5 528:0.5 471:log 241:37 230:37 219:36 208:34 197:21 186:12 142:df 30:In 6041:: 5973:. 5963:. 5953:26 5951:. 5934:. 5924:. 5914:24 5912:. 5908:. 5885:. 5875:. 5865:26 5863:. 5841:. 5837:. 5812:. 5786:. 5776:. 5768:. 5725:. 5681:. 5671:. 5638:. 5628:. 5597:. 5587:39 5585:. 5573:^ 5540:60 5538:. 5521:^ 5491:. 5479:. 5441:. 5433:. 5421:28 5419:. 5393:. 5385:. 5375:17 5373:. 5369:. 5344:. 5332:. 3882:1 3874:2 3871:a 3866:1 3858:1 3835:3 3827:2 3819:1 3811:1 1227:1 1219:) 244:0 233:0 175:4 164:2 153:1 105:. 46:, 42:, 34:, 5981:. 5967:: 5959:: 5942:. 5928:: 5920:: 5893:. 5879:: 5871:: 5851:. 5816:. 5794:. 5772:: 5743:. 5689:. 5667:: 5646:. 5624:: 5605:. 5593:: 5568:. 5550:. 5546:: 5515:. 5501:. 5487:: 5449:. 5437:: 5401:. 5381:: 5354:. 5340:: 5192:= 5189:) 5186:D 5183:, 5174:e 5170:l 5167:p 5164:m 5161:a 5158:x 5155:e 5143:( 5139:f 5136:d 5133:i 5126:) 5121:2 5117:d 5113:, 5104:e 5100:l 5097:p 5094:m 5091:a 5088:x 5085:e 5073:( 5069:f 5066:t 5062:= 5059:) 5056:D 5053:, 5048:2 5044:d 5040:, 5031:e 5027:l 5024:p 5021:m 5018:a 5015:x 5012:e 5000:( 4996:f 4993:d 4990:i 4987:f 4984:t 4962:0 4959:= 4950:0 4947:= 4944:) 4941:D 4938:, 4929:e 4925:l 4922:p 4919:m 4916:a 4913:x 4910:e 4898:( 4894:f 4891:d 4888:i 4881:) 4876:1 4872:d 4868:, 4859:e 4855:l 4852:p 4849:m 4846:a 4843:x 4840:e 4828:( 4824:f 4821:t 4817:= 4814:) 4811:D 4808:, 4803:1 4799:d 4795:, 4786:e 4782:l 4779:p 4776:m 4773:a 4770:x 4767:e 4755:( 4751:f 4748:d 4745:i 4742:f 4739:t 4709:= 4705:) 4700:1 4697:2 4692:( 4682:= 4679:) 4676:D 4673:, 4664:e 4660:l 4657:p 4654:m 4651:a 4648:x 4645:e 4633:( 4629:f 4626:d 4623:i 4593:7 4590:3 4585:= 4582:) 4577:2 4573:d 4569:, 4560:e 4556:l 4553:p 4550:m 4547:a 4544:x 4541:e 4529:( 4525:f 4522:t 4500:0 4497:= 4492:5 4489:0 4484:= 4481:) 4476:1 4472:d 4468:, 4459:e 4455:l 4452:p 4449:m 4446:a 4443:x 4440:e 4428:( 4424:f 4421:t 4394:0 4391:= 4388:0 4379:= 4376:) 4373:D 4370:, 4365:2 4361:d 4357:, 4348:s 4344:i 4341:h 4338:t 4326:( 4322:f 4319:d 4316:i 4313:f 4310:t 4288:0 4285:= 4282:0 4273:= 4270:) 4267:D 4264:, 4259:1 4255:d 4251:, 4242:s 4238:i 4235:h 4232:t 4220:( 4216:f 4213:d 4210:i 4207:f 4204:t 4177:0 4174:= 4170:) 4165:2 4162:2 4157:( 4147:= 4144:) 4141:D 4138:, 4129:s 4125:i 4122:h 4119:t 4107:( 4103:f 4100:d 4097:i 4058:7 4055:1 4050:= 4047:) 4042:2 4038:d 4034:, 4025:s 4021:i 4018:h 4015:t 4003:( 3999:f 3996:t 3971:= 3966:5 3963:1 3958:= 3955:) 3950:1 3946:d 3942:, 3933:s 3929:i 3926:h 3923:t 3911:( 3907:f 3904:t 3767:. 3764:) 3761:t 3758:( 3754:f 3751:d 3748:i 3741:) 3738:d 3735:, 3732:t 3729:( 3725:f 3722:t 3716:d 3713:, 3710:t 3698:| 3694:D 3690:| 3685:1 3680:= 3677:) 3674:t 3671:( 3667:f 3664:d 3661:i 3650:| 3646:D 3642:| 3637:1 3629:) 3626:d 3623:, 3620:t 3617:( 3613:f 3610:t 3604:d 3601:, 3598:t 3590:= 3587:) 3584:t 3581:( 3577:f 3574:d 3571:i 3562:d 3558:p 3549:d 3545:| 3541:t 3537:p 3531:d 3528:, 3525:t 3517:= 3514:) 3509:D 3504:; 3499:T 3494:( 3491:M 3466:t 3462:p 3438:) 3435:t 3432:( 3428:f 3425:d 3422:i 3413:t 3409:p 3403:t 3395:= 3392:) 3389:) 3386:t 3383:= 3380:W 3376:| 3370:D 3365:( 3362:H 3356:) 3351:D 3346:( 3343:H 3340:( 3332:t 3328:p 3322:t 3314:= 3311:) 3306:T 3300:| 3294:D 3289:( 3286:H 3280:) 3275:D 3270:( 3267:H 3264:= 3261:) 3256:D 3251:; 3246:T 3241:( 3238:M 3209:T 3185:D 3159:| 3155:D 3151:| 3141:+ 3138:) 3135:t 3132:( 3128:f 3125:d 3122:i 3115:= 3111:| 3107:D 3103:| 3093:+ 3086:| 3082:D 3078:| 3071:| 3067:} 3064:d 3058:t 3055:: 3052:D 3046:d 3043:{ 3039:| 3026:= 3019:| 3015:} 3012:d 3006:t 3003:: 3000:D 2994:d 2991:{ 2987:| 2982:1 2968:= 2963:t 2959:| 2955:d 2951:p 2939:t 2935:| 2931:d 2927:p 2921:d 2910:= 2907:) 2904:t 2901:= 2896:T 2890:| 2884:D 2879:( 2876:H 2853:t 2833:D 2799:| 2795:} 2792:d 2786:t 2783:: 2780:D 2774:d 2771:{ 2767:| 2762:1 2757:= 2754:) 2751:t 2747:| 2743:d 2740:( 2737:p 2714:) 2711:t 2708:, 2705:d 2702:( 2699:p 2641:| 2637:} 2634:d 2628:t 2625:: 2622:D 2616:d 2613:{ 2609:| 2604:N 2593:= 2580:) 2577:D 2573:| 2569:t 2566:( 2563:P 2559:1 2548:= 2538:) 2535:D 2531:| 2527:t 2524:( 2521:P 2509:= 2501:f 2498:d 2495:i 2464:, 2459:N 2454:| 2450:} 2447:d 2441:t 2438:: 2435:D 2429:d 2426:{ 2422:| 2415:= 2412:) 2409:D 2405:| 2401:t 2398:( 2395:P 2382:t 2378:d 2326:) 2323:D 2320:, 2317:t 2314:( 2310:f 2307:d 2304:i 2297:) 2294:d 2291:, 2288:t 2285:( 2281:f 2278:t 2274:= 2271:) 2268:D 2265:, 2262:d 2259:, 2256:t 2253:( 2249:f 2246:d 2243:i 2240:f 2237:t 2204:t 2200:n 2196:N 2182:) 2177:d 2174:, 2171:t 2167:f 2157:+ 2154:1 2151:( 2121:t 2117:n 2113:N 2098:) 2089:q 2086:, 2083:t 2079:f 2073:t 2062:q 2059:, 2056:t 2052:f 2043:+ 2036:( 2005:t 2001:n 1997:N 1981:d 1978:, 1975:t 1971:f 1929:. 1916:| 1912:} 1909:d 1903:t 1900:: 1897:D 1891:d 1888:{ 1884:| 1880:+ 1877:1 1857:N 1854:+ 1851:1 1831:0 1825:) 1822:d 1819:, 1816:t 1813:( 1809:f 1806:t 1785:t 1764:| 1760:} 1757:d 1751:t 1748:: 1745:D 1739:d 1736:{ 1732:| 1708:| 1704:D 1700:| 1695:= 1692:N 1672:N 1642:| 1638:} 1635:d 1629:t 1621:D 1615:d 1612:: 1609:d 1606:{ 1602:| 1597:N 1586:= 1583:) 1580:D 1577:, 1574:t 1571:( 1567:f 1564:d 1561:i 1519:t 1515:n 1508:t 1504:n 1497:N 1461:) 1453:t 1449:n 1445:+ 1442:1 1431:t 1426:n 1420:} 1417:d 1407:t 1403:{ 1392:( 1359:1 1356:+ 1352:) 1344:t 1340:n 1336:+ 1333:1 1329:N 1324:( 1289:N 1284:t 1280:n 1265:= 1258:t 1254:n 1250:N 1206:| 1202:} 1199:d 1193:t 1190:: 1187:D 1181:d 1178:{ 1174:| 1170:= 1165:t 1161:n 1120:} 1117:d 1107:t 1103:: 1098:d 1095:, 1088:t 1083:f 1079:{ 1070:d 1067:, 1064:t 1060:f 1048:+ 1042:= 1039:) 1036:d 1033:, 1030:t 1027:( 1023:f 1020:t 1004:; 1002:) 999:d 997:, 995:t 991:f 987:d 985:, 983:t 971:d 967:t 961:d 959:, 957:t 944:d 942:, 940:t 936:f 932:d 930:, 928:t 917:d 912:d 908:t 898:d 896:, 894:t 890:f 883:, 865:d 862:, 855:t 850:f 843:d 833:t 821:d 818:, 815:t 811:f 805:= 802:) 799:d 796:, 793:t 790:( 786:f 783:t 768:d 762:t 757:) 755:d 753:, 751:t 719:d 716:, 709:t 704:f 697:} 694:d 684:t 680:{ 669:d 666:, 663:t 659:f 653:) 650:K 644:1 641:( 638:+ 635:K 603:d 600:, 593:t 588:f 581:} 578:d 568:t 564:{ 553:d 550:, 547:t 543:f 531:+ 502:) 497:d 494:, 491:t 487:f 483:+ 480:1 477:( 441:d 438:, 431:t 426:f 419:d 409:t 397:/ 390:d 387:, 384:t 380:f 351:d 348:, 345:t 341:f 313:1 310:, 307:0 20:)

Index

Tf-idf
information retrieval
document
corpus
multiset
word order
bag-of-words model
weighting factor
text mining
user modeling
search engines
relevance
query
ranking functions
Karen Spärck Jones
Romeo
Falstaff
Boolean
logarithmically scaled
logarithmically scaled

frequency
Karen Spärck Jones
heuristic
information theoretic
Zipf's law
probabilistic
self-information
event spaces
probability distributions

Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.