Probabilistic latent semantic analysis

Probabilistic latent semantic analysis (PLSA), also known as probabilistic latent semantic indexing (PLSI, especially in information retrieval circles), is a statistical technique for the analysis of two-mode and co-occurrence data. In effect, one can derive a low-dimensional representation of the observed variables in terms of their affinity to certain hidden variables, just as in latent semantic analysis, from which PLSA evolved. Compared to standard latent semantic analysis, which stems from linear algebra and downsizes the occurrence tables (usually via a singular value decomposition), probabilistic latent semantic analysis is based on a mixture decomposition derived from a latent class model.

Model

[Figure: plate notation representing the PLSA model ("asymmetric" formulation). d is the document index variable, c is a word's topic drawn from the document's topic distribution, P(c|d), and w is a word drawn from the word distribution of this word's topic, P(w|c). The d and w are observable variables, while the topic c is a latent variable.]

Considering observations in the form of co-occurrences (w, d) of words and documents, PLSA models the probability of each co-occurrence as a mixture of conditionally independent multinomial distributions:

P(w,d) = \sum_c P(c) P(d|c) P(w|c) = P(d) \sum_c P(c|d) P(w|c)

with c being the words' topic. Note that the number of topics is a hyperparameter that must be chosen in advance and is not estimated from the data. The first formulation is the symmetric formulation, where w and d are both generated from the latent class c in similar ways (using the conditional probabilities P(d|c) and P(w|c)), whereas the second formulation is the asymmetric formulation, where, for each document d, a latent class is chosen conditionally to the document according to P(c|d), and a word is then generated from that class according to P(w|c). Although we have used words and documents in this example, the co-occurrence of any couple of discrete variables may be modelled in exactly the same way.
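
The two factorisations above describe the same joint distribution. The following Python sketch (an illustrative example with assumed toy dimensions and randomly initialised parameter tables, not code from any referenced implementation) evaluates P(w,d) both ways and checks that they agree:

    # Minimal sketch of the two equivalent PLSA factorisations (toy sizes assumed).
    import numpy as np

    rng = np.random.default_rng(0)
    n_topics, n_docs, n_words = 3, 5, 8          # assumed toy dimensions

    def normalize(a, axis):
        return a / a.sum(axis=axis, keepdims=True)

    P_c = normalize(rng.random(n_topics), 0)                      # P(c)
    P_d_given_c = normalize(rng.random((n_docs, n_topics)), 0)    # P(d|c): each column sums to 1
    P_w_given_c = normalize(rng.random((n_words, n_topics)), 0)   # P(w|c): each column sums to 1

    # Symmetric formulation: P(w,d) = sum_c P(c) P(d|c) P(w|c)
    P_wd_symmetric = np.einsum("c,dc,wc->wd", P_c, P_d_given_c, P_w_given_c)

    # Asymmetric formulation: P(w,d) = P(d) sum_c P(c|d) P(w|c),
    # with P(d) and P(c|d) obtained from the same joint by Bayes' rule.
    P_dc = P_d_given_c * P_c                     # joint P(d,c)
    P_d = P_dc.sum(axis=1)                       # marginal P(d)
    P_c_given_d = P_dc / P_d[:, None]            # P(c|d)
    P_wd_asymmetric = P_d[None, :] * (P_w_given_c @ P_c_given_d.T)

    assert np.allclose(P_wd_symmetric, P_wd_asymmetric)   # the two factorisations agree
    assert np.isclose(P_wd_symmetric.sum(), 1.0)          # a proper joint distribution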

So, the number of parameters is equal to cd + wc. The number of parameters grows linearly with the number of documents. In addition, although PLSA is a generative model of the documents in the collection it is estimated on, it is not a generative model of new documents. The parameters are learned using the EM algorithm.
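
The update equations are not spelled out above; the sketch below shows one standard form of the EM iteration for the asymmetric formulation (E-step responsibilities P(c|d,w) proportional to P(c|d) P(w|c), M-step re-normalisation of expected counts). The function name, the dense document-by-word count matrix, the fixed iteration count, and the smoothing constant are illustrative assumptions:

    # Minimal EM sketch for asymmetric PLSA; `counts` is an assumed dense
    # document-by-word co-occurrence matrix n(d, w) (toy scale only).
    import numpy as np

    def plsa_em(counts, n_topics, n_iters=50, seed=0):
        rng = np.random.default_rng(seed)
        n_docs, n_words = counts.shape

        # Random, row-normalised initial parameters P(c|d) and P(w|c).
        P_c_given_d = rng.random((n_docs, n_topics))
        P_c_given_d /= P_c_given_d.sum(axis=1, keepdims=True)
        P_w_given_c = rng.random((n_topics, n_words))
        P_w_given_c /= P_w_given_c.sum(axis=1, keepdims=True)

        for _ in range(n_iters):
            # E-step: responsibilities P(c|d,w), shape (doc, topic, word).
            post = P_c_given_d[:, :, None] * P_w_given_c[None, :, :]
            post /= post.sum(axis=1, keepdims=True) + 1e-12

            # M-step: re-estimate parameters from expected counts n(d,w) P(c|d,w).
            weighted = counts[:, None, :] * post
            P_w_given_c = weighted.sum(axis=0)
            P_w_given_c /= P_w_given_c.sum(axis=1, keepdims=True) + 1e-12
            P_c_given_d = weighted.sum(axis=2)
            P_c_given_d /= P_c_given_d.sum(axis=1, keepdims=True) + 1e-12

        return P_c_given_d, P_w_given_c

    # Parameter count from the text, cd + wc, for a toy corpus of
    # 5 documents, 8 words and 3 topics: 3*5 + 8*3 = 39 entries.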

Application

PLSA may be used in a discriminative setting, via Fisher kernels. PLSA has applications in information retrieval and filtering, natural language processing, machine learning from text, bioinformatics, and related areas. It is reported that the aspect model used in probabilistic latent semantic analysis has severe overfitting problems.

Extensions

Hierarchical extensions:
Asymmetric: MASHA ("Multinomial ASymmetric Hierarchical Analysis")
Symmetric: HPLSA ("Hierarchical Probabilistic Latent Semantic Analysis")

Generative models: The following models have been developed to address an often-criticized shortcoming of PLSA, namely that it is not a proper generative model for new documents.
Latent Dirichlet allocation – adds a Dirichlet prior on the per-document topic distribution.

Higher-order data: Although this is rarely discussed in the scientific literature, PLSA extends naturally to higher-order data (three modes and higher), i.e. it can model co-occurrences over three or more variables. In the symmetric formulation above, this is done simply by adding conditional probability distributions for these additional variables. This is the probabilistic analogue to non-negative tensor factorisation; an illustrative three-mode factorisation is given after this list.
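
As an illustration of the higher-order case (not written out in the article), adding a third observed discrete variable a to the symmetric formulation introduces one more conditional distribution per topic:

    P(w, d, a) = \sum_c P(c) P(w|c) P(d|c) P(a|c)

Estimation can then proceed as in the two-mode case, with the E-step responsibilities conditioned on the triple (w, d, a).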

History

This is an example of a latent class model (see references therein), and it is related to non-negative matrix factorization. The present terminology was coined in 1999 by Thomas Hofmann.

See also

Compound term processing
Pachinko allocation
Vector space model

References and notes

Thomas Hofmann, Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization, Advances in Neural Information Processing Systems 12, pp. 914–920, MIT Press, 2000
Thomas Hofmann, Probabilistic Latent Semantic Indexing, Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99), 1999
Blei, David M.; Ng, Andrew Y.; Jordan, Michael I. (2003). "Latent Dirichlet Allocation". Journal of Machine Learning Research 3: 993–1022. doi:10.1162/jmlr.2003.3.4-5.993
Pinoli, Pietro; et al. (2013). "Enhanced probabilistic latent semantic analysis with weighting schemes to predict genomic annotations". Proceedings of IEEE BIBE 2013, The 13th IEEE International Conference on BioInformatics and BioEngineering. IEEE. pp. 1–4. doi:10.1109/BIBE.2013.6701702. ISBN 978-147993163-7.
Alexei Vinokourov and Mark Girolami, A Probabilistic Framework for the Hierarchic Organisation and Classification of Document Collections, Information Processing and Management, 2002
Eric Gaussier, Cyril Goutte, Kris Popat and Francine Chen, A Hierarchical Model for Clustering and Categorising Documents, in "Advances in Information Retrieval -- Proceedings of the 24th BCS-IRSG European Colloquium on IR Research (ECIR-02)", 2002
Chris Ding, Tao Li, Wei Peng (2006). "Nonnegative Matrix Factorization and Probabilistic Latent Semantic Indexing: Equivalence Chi-Square Statistic, and a Hybrid Method". AAAI 2006
Chris Ding, Tao Li, Wei Peng (2008). "On the equivalence between Non-negative Matrix Factorization and Probabilistic Latent Semantic Indexing"

External links

Probabilistic Latent Semantic Analysis
Complete PLSA DEMO in C#
