
Online content analysis

Online content analysis or online textual analysis refers to a collection of research techniques used to describe and make inferences about online material through systematic coding and interpretation. Online content analysis is a form of content analysis for analysis of Internet-based communication.

History and definition

Content analysis as a systematic examination and interpretation of communication dates back to at least the 17th century. However, it was not until the rise of the newspaper in the early 20th century that the mass production of printed material created a demand for quantitative analysis of printed words.

Berelson's (1952) definition provides an underlying basis for textual analysis as a "research technique for the objective, systematic and quantitative description of the manifest content of communication." Content analysis consists of categorizing units of text (i.e. sentences, quasi-sentences, paragraphs, documents, web pages, etc.) according to their substantive characteristics in order to construct a dataset that allows the analyst to interpret texts and draw inferences. While content analysis is often quantitative, researchers conceptualize the technique as inherently mixed methods because textual coding requires a high degree of qualitative interpretation. Social scientists have used the technique to investigate research questions concerning mass media, media effects and agenda setting.

With the rise of online communication, content analysis techniques have been adapted and applied to internet research. As with the rise of newspapers, the proliferation of online content provides an expanded opportunity for researchers interested in content analysis. While the use of online sources presents new research problems and opportunities, the basic research procedure of online content analysis outlined by McMillan (2000) is virtually indistinguishable from content analysis using offline sources:

1. Formulate a research question with a focus on identifying testable hypotheses that may lead to theoretical advancements.
2. Define a sampling frame that a sample will be drawn from, and construct a sample (often called a 'corpus') of content to be analyzed.
3. Develop and implement a coding scheme that can be used to categorize content in order to answer the question identified in step 1. This necessitates specifying a time period, a context unit in which content is embedded, and a coding unit which categorizes the content.
4. Train coders to consistently implement the coding scheme and verify reliability among coders. This is a key step in ensuring replicability of the analysis.
5. Analyze and interpret the data. Test hypotheses advanced in step 1 and draw conclusions about the content represented in the dataset.

Content analysis in internet research

Since the rise of online communication, scholars have discussed how to adapt textual analysis techniques to study web-based content. The nature of online sources necessitates particular care in many of the steps of a content analysis compared to offline sources.

While offline content such as printed text remains static once produced, online content can frequently change. The dynamic nature of online material, combined with the large and increasing volume of online content, can make it challenging to construct a sampling frame from which to draw a random sample. The content of a site may also differ across users, requiring careful specification of the sampling frame. Some researchers have used search engines to construct sampling frames. This technique has disadvantages because search engine results are unsystematic and non-random, making them unreliable for obtaining an unbiased sample. The sampling frame issue can be circumvented by using an entire population of interest, such as tweets by particular Twitter users or the online archived content of certain newspapers, as the sampling frame. Changes to online material can make categorizing content (step 3) more challenging. Because online content can change frequently, it is particularly important to note the time period over which the sample is collected. A useful step is to archive the sample content in order to prevent changes from being made.

Online content is also non-linear. Printed text has clearly delineated boundaries that can be used to identify context units (e.g., a newspaper article). The bounds of online content to be used in a sample are less easily defined. Early online content analysts often specified a 'Web site' as a context unit, without a clear definition of what they meant. Researchers recommend clearly and consistently defining what a 'web page' consists of, or reducing the size of the context unit to a feature on a website. Researchers have also made use of more discrete units of online communication, such as web comments or tweets.

King (2008) used an ontology of terms trained from many thousands of pre-classified documents to analyse the subject matter of a number of search engines.

Automatic content analysis

The rise of online content has dramatically increased the amount of digital text that can be used in research. The quantity of text available has motivated methodological innovations in order to make sense of textual datasets that are too large to be practically hand-coded, as had been the conventional methodological practice. Advances in methodology, together with the increasing capacity and decreasing expense of computation, have allowed researchers to use techniques that were previously unavailable to analyze large sets of textual content.

Automatic content analysis represents a slight departure from McMillan's online content analysis procedure in that human coders are being supplemented by a computational method, and some of these methods do not require categories to be defined in advance. Quantitative textual analysis models often employ 'bag of words' methods that remove word ordering, delete words that are very common and very uncommon, and simplify words through lemmatisation or stemming that reduces the dimensionality of the text by reducing complex words to their root word. While these methods are fundamentally reductionist in the way they interpret text, they can be very useful if they are correctly applied and validated.

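The sketch below illustrates this kind of preprocessing as a bag-of-words pipeline; the three-document corpus, the choice of a Porter stemmer, and the frequency cut-offs are illustrative assumptions rather than part of any particular study.

```python
# Illustrative only: a tiny corpus and ad-hoc frequency cut-offs, not a prescribed recipe.
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

documents = [
    "The council voted to increase funding for public schools.",
    "Public schools face new budget cuts after the council's vote.",
    "The new stadium opened downtown to large crowds of fans.",
]

stemmer = PorterStemmer()

def stemmed_tokens(text):
    # Keep alphabetic tokens, drop very common function words, reduce words to their roots.
    words = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(w) for w in words if w not in ENGLISH_STOP_WORDS]

# min_df/max_df discard very rare and very frequent terms; word order is discarded entirely.
vectorizer = CountVectorizer(tokenizer=stemmed_tokens, lowercase=False, min_df=1, max_df=0.9)
dtm = vectorizer.fit_transform(documents)   # document-term matrix
print(dtm.shape)
print(vectorizer.get_feature_names_out())
```
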
Grimmer and Stewart (2013) identify two main categories of automatic textual analysis: supervised and unsupervised methods.

Supervised methods

Supervised methods involve creating a coding scheme and manually coding a sub-sample of the documents that the researcher wants to analyze. Ideally, the sub-sample, called a 'training set', is representative of the sample as a whole. The coded training set is then used to 'teach' an algorithm how the words in the documents correspond to each coding category. The algorithm can then be applied to automatically analyze the remainder of the documents in the corpus.

Dictionary methods: the researcher pre-selects a set of keywords (n-grams) for each category. The machine then uses these keywords to classify each text unit into a category.

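A minimal sketch of a dictionary method follows; the two categories and their keyword lists are invented for illustration, whereas applied work relies on much larger, validated dictionaries.

```python
# A toy dictionary classifier: the categories and keyword lists are invented for illustration.
dictionary = {
    "economy": {"tax", "budget", "inflation", "unemployment", "deficit"},
    "environment": {"climate", "emissions", "pollution", "wildlife", "renewable"},
}

def classify(text):
    tokens = [tok.strip(".,;:!?") for tok in text.lower().split()]
    # Count keyword hits per category and return the category with the most hits.
    scores = {cat: sum(tok in words for tok in tokens) for cat, words in dictionary.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "uncategorized"

print(classify("The budget deficit and inflation dominated the debate."))   # economy
print(classify("New rules aim to cut emissions and protect wildlife."))     # environment
```
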
Individual methods: the researcher pre-labels a sample of texts and trains a machine learning algorithm (i.e. an SVM algorithm) using those labels. The machine labels the remainder of the observations by extrapolating information from the training set.

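The following sketch shows one possible realization of this approach, using a hand-coded toy training set and a support vector machine from scikit-learn; the texts, labels and model choice are illustrative assumptions.

```python
# Sketch of a supervised classifier: the hand-coded training texts and labels are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "Parliament debated the new tax bill today.",
    "The central bank raised interest rates again.",
    "The striker scored twice in the final match.",
    "Fans celebrated the championship win downtown.",
]
train_labels = ["politics", "politics", "sports", "sports"]   # the hand-coded 'training set'

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

# The fitted model extrapolates the hand coding to the rest of the corpus.
unlabeled = ["The senate vote on the budget was postponed."]
print(model.predict(unlabeled))
```
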
Ensemble methods: instead of using only one machine-learning algorithm, the researcher trains a set of them and uses the resulting multiple labels to label the rest of the observations (see Collingwood and Wilkerson 2011 for more details).

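A hard-voting ensemble over several standard classifiers is one way to realize this idea; in the sketch below the toy texts, labels and the particular base learners are illustrative assumptions.

```python
# Sketch of an ensemble of text classifiers combined by majority vote (toy data).
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "Parliament debated the new tax bill today.",
    "The central bank raised interest rates again.",
    "The striker scored twice in the final match.",
    "Fans celebrated the championship win downtown.",
]
labels = ["politics", "politics", "sports", "sports"]

ensemble = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[
            ("svm", LinearSVC()),
            ("nb", MultinomialNB()),
            ("logit", LogisticRegression(max_iter=1000)),
        ],
        voting="hard",   # each model casts one vote per document
    ),
)
ensemble.fit(texts, labels)
print(ensemble.predict(["The budget vote was postponed."]))
```
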
Supervised ideological scaling (i.e. wordscores): used to place different text units along an ideological continuum. The researcher selects two sets of texts that represent each ideological extreme, which the algorithm can use to identify words that belong to each extreme point. The remainder of the texts in the corpus are scaled depending on how many words of each extreme reference they contain.

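The sketch below re-implements the wordscores logic in simplified form; the two reference texts and their assigned scores (-1 and +1) are invented for illustration, and the scoring omits the refinements of the published method.

```python
# A simplified sketch of the wordscores idea with two invented reference texts.
from collections import Counter

def rel_freq(text):
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Reference texts chosen by the researcher to represent the ideological extremes.
references = {
    -1.0: "cut taxes shrink government free markets deregulation",
    +1.0: "expand welfare public healthcare regulation workers rights",
}
ref_freqs = {score: rel_freq(text) for score, text in references.items()}

# Each word's score is the average of the reference scores, weighted by how strongly
# the word is associated with each reference text.
vocab = set().union(*(f.keys() for f in ref_freqs.values()))
word_scores = {}
for w in vocab:
    total = sum(f.get(w, 0.0) for f in ref_freqs.values())
    word_scores[w] = sum(score * f.get(w, 0.0) / total for score, f in ref_freqs.items())

def scale(text):
    # A 'virgin' text is placed on the continuum using the scored words it contains.
    freqs = rel_freq(text)
    scored = {w: p for w, p in freqs.items() if w in word_scores}
    weight = sum(scored.values())
    return sum(p * word_scores[w] for w, p in scored.items()) / weight if weight else 0.0

print(scale("the party promises to cut taxes and roll back regulation"))
```
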
Unsupervised methods

Unsupervised methods can be used when a set of categories for coding cannot be well defined prior to analysis. Unlike supervised methods, human coders are not required to train the algorithm. One key choice for researchers when applying unsupervised methods is selecting the number of categories to sort documents into, rather than defining what the categories are in advance.

Single membership models: these models automatically cluster texts into different categories that are mutually exclusive, and documents are coded into one and only one category. As pointed out by Grimmer and Stewart, "each algorithm has three components: (1) a definition of document similarity or distance; (2) an objective function that operationalizes an ideal clustering; and (3) an optimization algorithm."

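One common instantiation of these three components is k-means clustering of tf-idf vectors (distance between tf-idf vectors as document similarity, within-cluster variance as the objective, Lloyd's algorithm as the optimizer); the corpus and the choice of two clusters below are illustrative assumptions.

```python
# Sketch of a single-membership clustering: k-means on tf-idf vectors (toy corpus, k=2).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The minister announced a new tax reform.",
    "Opposition parties rejected the tax proposal.",
    "The team won the cup after a dramatic penalty shootout.",
    "Injuries forced the coach to rotate the squad.",
]

# Document similarity: distance between tf-idf vectors.
X = TfidfVectorizer().fit_transform(documents)

# Objective: minimise within-cluster variance; optimisation: Lloyd's algorithm.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # each document is assigned to exactly one cluster
```
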
Mixed membership models: according to Grimmer and Stewart, mixed membership models "improve the output of single-membership models by including additional and problem-specific structure." Mixed membership FAC models classify individual words within each document into categories, allowing the document as a whole to be a part of multiple categories simultaneously. Topic models represent one example of mixed membership FAC that can be used to analyze changes in the focus of political actors or newspaper articles. One of the most widely used topic modeling techniques is LDA (latent Dirichlet allocation).

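A minimal topic-model sketch follows, using scikit-learn's LDA implementation on a toy corpus; the documents and the choice of two topics are illustrative assumptions.

```python
# Sketch of a mixed-membership topic model: LDA with scikit-learn (toy corpus, 2 topics).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "The parliament passed the budget after a long debate on taxes.",
    "The election campaign focused on taxes and public spending.",
    "The team signed a new striker before the transfer deadline.",
    "The coach praised the goalkeeper after the cup final.",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Each document receives a distribution over topics rather than a single label.
print(doc_topics.round(2))

# Top words per topic, to support the human validation step described below.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]
    print(f"topic {k}:", [terms[i] for i in top])
```
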
Unsupervised ideological scaling (i.e. wordfish): algorithms that allocate text units along an ideological continuum depending on shared grammatical content. Contrary to supervised scaling methods such as wordscores, methods such as wordfish do not require that the researcher provide samples of extreme ideological texts.

Validation

Results of supervised methods can be validated by drawing a distinct sub-sample of the corpus, called a 'validation set'. Documents in the validation set can be hand-coded and compared to the automatic coding output to evaluate how well the algorithm replicated human coding. This comparison can take the form of inter-coder reliability scores like those used to validate the consistency of human coders in traditional textual analysis.

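The sketch below compares automatic codes against a hand-coded validation set using raw agreement and Cohen's kappa; the label vectors are invented for illustration.

```python
# Sketch of validating automatic coding against a hand-coded validation set (toy labels).
from sklearn.metrics import cohen_kappa_score

human_codes   = ["politics", "politics", "sports", "economy", "sports", "economy"]
machine_codes = ["politics", "economy",  "sports", "economy", "sports", "politics"]

agreement = sum(h == m for h, m in zip(human_codes, machine_codes)) / len(human_codes)
kappa = cohen_kappa_score(human_codes, machine_codes)   # agreement corrected for chance

print(f"raw agreement: {agreement:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```
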
Validation of unsupervised methods can be carried out in several ways.

Semantic (or internal) validity represents how well documents in each identified cluster represent a distinct, categorical unit. In a topic model, this would be the extent to which the documents in each cluster represent the same topic. This can be tested by creating a validation set that human coders use to manually validate topic choice or the relatedness of within-cluster documents compared to documents from different clusters.

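As a rough numeric companion to such human validation (not a substitute for it), one can check that documents sharing a cluster are on average more similar to one another than to documents in other clusters; the corpus and cluster assignments below are illustrative assumptions.

```python
# Toy check of semantic validity: within-cluster similarity should exceed between-cluster similarity.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The minister announced a new tax reform.",
    "Opposition parties rejected the tax proposal.",
    "The team won the cup after a dramatic penalty shootout.",
    "Injuries forced the coach to rotate the squad.",
]
clusters = [0, 0, 1, 1]   # assignments from a topic model or clustering step

sim = cosine_similarity(TfidfVectorizer().fit_transform(documents))
labels = np.array(clusters)
same = labels[:, None] == labels[None, :]
off_diag = ~np.eye(len(documents), dtype=bool)

within = sim[same & off_diag].mean()
between = sim[~same].mean()
print(f"within-cluster similarity: {within:.2f}, between-cluster similarity: {between:.2f}")
```
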
Predictive (or external) validity is the extent to which shifts in the frequency of each cluster can be explained by external events. If clusters of topics are valid, the topics that are most prominent should respond across time in a predictable way as a result of outside events that occur.

Challenges in online textual analysis

Despite the continuous evolution of text analysis in the social sciences, some methodological concerns remain unresolved. This is a (non-exclusive) list of some of those concerns:

Random samples. On the one hand, it is extremely hard to know how many units of one type of text (for example, blog posts) exist at a given time on the Internet. Since the universe is most of the time unknown, how can researchers select a random sample? If in some cases it is almost impossible to get a random sample, should researchers work with samples, or should they try to collect all the text units that they observe? On the other hand, researchers sometimes have to work with samples provided by search engines (e.g. Google) and online companies (e.g. Twitter) without access to how those samples were generated or whether they are random. Should researchers use such samples?

When should researchers define their categories? Ex-ante, back-and-forth, or ad-hoc? Some social scientists argue that researchers should build their theory, expectations and methods (in this case, the specific categories they will use to classify different text units) before they start collecting and studying the data, whereas others hold that defining a set of categories is a back-and-forth process.

Validation. Although most researchers report validation measurements for their methods (i.e. inter-coder reliability, precision and recall estimates, confusion matrices, etc.), some do not. In particular, many academics are concerned that some topic modeling techniques can hardly be validated.

See also

Content analysis
Text mining

References

Barberá, Pablo; Bonneau, Richard; Egan, Patrick; Jost, John; Nagler, Jonathan; Tucker, Joshua (2014). "Leaders or Followers? Measuring Political Responsiveness in the U.S. Congress Using Social Media Data". Prepared for delivery at the Annual Meeting of the American Political Science Association.
Baumgartner, Frank; Jones, Bryan (1993). Agendas and Instability in American Politics. Chicago: University of Chicago Press. ISBN 9780226039534.
Chuang, Jason; Wilkerson, John D.; Weiss, Rebecca; Tingley, Dustin; Stewart, Brandon M.; Roberts, Margaret E.; Poursabzi-Sangdeh, Forough; Grimmer, Justin; Findlater, Leah; Boyd-Graber, Jordan; Heer, Jeffrey (2014). "Computer-Assisted Content Analysis: Topic Models for Exploring Multiple Subjective Interpretations". Paper presented at the Conference on Neural Information Processing Systems (NIPS), Workshop on Human-Propelled Machine Learning. Montreal, Canada.
Collingwood, Loren; Wilkerson, John (2011). "Tradeoffs in Accuracy and Efficiency in Supervised Learning Methods". Journal of Information Technology and Politics, Paper 4.
DiMaggio, Paul; Nag, Manish; Blei, David (December 2013). "Exploiting affinities between topic modeling and the sociological perspective on culture: Application to newspaper coverage of U.S. government arts funding". Poetics 41 (6): 570–606. doi:10.1016/j.poetic.2013.08.004.
Gerber, Elisabeth; Lewis, Jeff (2004). "Beyond the median: Voter preferences, district heterogeneity, and political representation". Journal of Political Economy 112 (6): 1364–83. doi:10.1086/424737.
Grimmer, Justin; Stewart, Brandon (2013). "Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts". Political Analysis 21 (3): 267–297. doi:10.1093/pan/mps028.
Herring, Susan C. (2009). "Web Content Analysis: Expanding the Paradigm". In Hunsinger, Jeremy (ed.). International Handbook of Internet Research. Springer Netherlands. pp. 233–249. doi:10.1007/978-1-4020-9789-8_14. ISBN 978-1-4020-9788-1.
King, Gary; Keohane, Robert O.; Verba, Sidney (1994). Designing Social Inquiry: Scientific Inference in Qualitative Research. Princeton: Princeton University Press.
King, John D. (2008). Search Engine Content Analysis (PhD thesis). Queensland University of Technology.
Krippendorff, Klaus (2012). Content Analysis: An Introduction to its Methodology. Thousand Oaks, CA: Sage.
McMillan, Sally J. (March 2000). "The Microscope and the Moving Target: The Challenge of Applying Content Analysis to the World Wide Web". Journalism and Mass Communication Quarterly 77 (1): 80–98. doi:10.1177/107769900007700107.
Mishne, Gilad; Glance, Natalie (2006). "Leave a reply: An analysis of weblog comments". Third Annual Conference on the Weblogging Ecosystem.
Riffe, Daniel; Lacy, Stephen; Fico, Frederick (1998). Analyzing Media Messages: Using Quantitative Content Analysis in Research. Mahwah, New Jersey, London: Lawrence Erlbaum.
Saldana, Johnny (2009). The Coding Manual for Qualitative Researchers. London: SAGE Publications Ltd.
Slapin, Jonathan; Proksch, Sven-Oliver (2008). "A scaling model for estimating time-series party positions from texts". American Journal of Political Science 52 (3): 705–22.
van Selm, Martine; Jankowski, Nick (2005). "Content Analysis of Internet-Based Documents". Unpublished manuscript.

Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.