Brown Corpus

The Brown University Standard Corpus of Present-Day American English, better known as simply the Brown Corpus, is an electronic collection of text samples of American English, the first major structured corpus of varied genres. This corpus first set the bar for the scientific study of the frequency and distribution of word categories in everyday language use. Compiled by Henry Kučera and W. Nelson Francis at Brown University, in Rhode Island, it is a general language corpus containing 500 samples of English, totaling roughly one million words, compiled from works published in the United States in 1961.

History

In 1967, Kučera and Francis published their classic work, entitled "Computational Analysis of Present-Day American English", which provided basic statistics on what is known today simply as the Brown Corpus.

The Brown Corpus was a carefully compiled selection of current American English, totaling about a million words drawn from a wide variety of sources. Kučera and Francis subjected it to a variety of computational analyses, from which they compiled a rich and variegated opus, combining elements of linguistics, psychology, statistics, and sociology. It has been very widely used in computational linguistics, and was for many years among the most-cited resources in the field.

Shortly after publication of the first lexicostatistical analysis, Boston publisher Houghton-Mifflin approached Kučera to supply a million-word, three-line citation base for its new American Heritage Dictionary. This ground-breaking dictionary, which first appeared in 1969, was the first dictionary to be compiled using corpus linguistics for word frequency and other information.

The initial Brown Corpus had only the words themselves, plus a location identifier for each. Over the following several years part-of-speech tags were applied. The Greene and Rubin tagging program (see under part of speech tagging) helped considerably in this, but the high error rate meant that extensive manual proofreading was required.

The tagged Brown Corpus used a selection of about 80 parts of speech, as well as special indicators for compound forms, contractions, foreign words and a few other phenomena, and formed the model for many later corpora such as the Lancaster-Oslo-Bergen Corpus (British English from 1961) and the Freiburg-Brown Corpus of American English (FROWN) (American English from the early 1990s). Tagging the corpus enabled far more sophisticated statistical analysis, such as the work programmed by Andrew Mackie, and documented in books on English grammar.

One interesting result is that even for quite large samples, graphing words in order of decreasing frequency of occurrence shows a hyperbola: the frequency of the n-th most frequent word is roughly proportional to 1/n. Thus "the" constitutes nearly 7% of the Brown Corpus, and "to" and "of" more than another 3% each, while about half the total vocabulary of about 50,000 words are hapax legomena: words that occur only once in the corpus. This simple rank-vs.-frequency relationship was noted for an extraordinary variety of phenomena by George Kingsley Zipf (for example, see his The Psychobiology of Language), and is known as Zipf's law.

Although the Brown Corpus pioneered the field of corpus linguistics, typical corpora today (such as the Corpus of Contemporary American English, the British National Corpus or the International Corpus of English) tend to be much larger, on the order of 100 million words.

Sample distribution

The Corpus consists of 500 samples, distributed across 15 genres in rough proportion to the amount published in 1961 in each of those genres. All works sampled were published in 1961; as far as could be determined they were first published then, and were written by native speakers of American English.

Each sample began at a random sentence boundary in the article or other unit chosen, and continued up to the first sentence boundary after 2,000 words. In a very few cases miscounts led to samples being just under 2,000 words.

The original data entry was done on upper-case only keypunch machines; capitals were indicated by a preceding asterisk, and various special items such as formulae also had special codes.

The corpus originally (1961) contained 1,014,312 words sampled from 15 text categories:

A. PRESS: Reportage (44 texts): Political, Sports, Society, Spot News, Financial, Cultural
B. PRESS: Editorial (27 texts): Institutional Daily, Personal, Letters to the Editor
C. PRESS: Reviews (17 texts): theatre, books, music, dance
D. RELIGION (17 texts): Books, Periodicals, Tracts
E. SKILL AND HOBBIES (36 texts): Books, Periodicals
F. POPULAR LORE (48 texts): Books, Periodicals
G. BELLES-LETTRES - Biography, Memoirs, etc. (75 texts): Books, Periodicals
H. MISCELLANEOUS: US Government & House Organs (30 texts): Government Documents, Foundation Reports, Industry Reports, College Catalog, Industry House organ
J. LEARNED (80 texts): Natural Sciences, Medicine, Mathematics, Social and Behavioral Sciences, Political Science, Law, Education, Humanities, Technology and Engineering
K. FICTION: General (29 texts): Novels, Short Stories
L. FICTION: Mystery and Detective Fiction (24 texts): Novels, Short Stories
M. FICTION: Science (6 texts): Novels, Short Stories
N. FICTION: Adventure and Western (29 texts): Novels, Short Stories
P. FICTION: Romance and Love Story (29 texts): Novels, Short Stories
R. HUMOR (9 texts): Novels, Essays, etc.

Part-of-speech tags used

Tag     Definition
CC      coordinating conjunction (and, or)
CD      cardinal numeral (one, two, 2, etc.)
CS      subordinating conjunction (if, although)
EX      existential there
IN      preposition (in, at, on)
JJ      adjective
JJA     adjective + auxiliary
JJC     adjective, comparative
JJCC    adjective + conjunction
JJS     semantically superlative adjective (chief, top)
JJF     adjective + female
JJM     adjective + male
NN      singular or mass noun
NNA     noun + auxiliary
NNC     noun + conjunction
NNS     plural noun
NNP     proper noun or part of name phrase
NNPC    proper noun + conjunction
PRP     personal pronoun, singular
PRPS    personal pronoun, plural
PRP$    possessive pronoun
RB      adverb
RBR     comparative adverb
RBS     superlative adverb
VB      verb, base form
VBA     verb + auxiliary, singular, present
VBD     verb, past tense
VBG     verb, present participle/gerund
VBN     verb, past participle
VBZ     verb, 3rd person singular present
FW      foreign words
SYM     symbols
PUN     all punctuation

See also

LOB Corpus, a corpus of British English based on the same parameters as the Brown Corpus
British National Corpus

References

Francis, W. Nelson & Henry Kucera. 1967. Computational Analysis of Present-Day American English. Providence, RI: Brown University Press.
Francis, W. Nelson & Henry Kucera. 1979. BROWN CORPUS MANUAL: Manual of Information to Accompany a Standard Corpus of Present-Day Edited American English for Use with Digital Computers. http://icame.uib.no/brown/bcm.html
Hundt, Marianne, Andrea Sand & Rainer Siemund. 1998. Manual of Information to Accompany the Freiburg-Brown Corpus of American English (FROWN). http://khnt.hit.uib.no/icame/manuals/frown/INDEX.HTM (archived 2014-04-03 at the Wayback Machine)
Leech, Geoffrey & Nicholas Smith. 2005. Extending the possibilities of corpus-based research on English in the twentieth century: A prequel to LOB and FLOB. ICAME Journal 29. 83–98.
Francis, Winthrop Nelson & Henry Kučera. 1983. Frequency Analysis of English Usage: Lexicon and Grammar. Houghton Mifflin.
Malmkjær, Kirsten. 2002. The Linguistics Encyclopedia, 2nd ed. Routledge. ISBN 0-415-22210-9, p. 87.
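The rank-vs.-frequency relationship described under History (the frequency of the n-th most frequent word being roughly proportional to 1/n) can be explored with a short Python sketch. This is an illustration only, not the method Kučera and Francis used: the function names rank_frequency, hapax_legomena and zipf_products are hypothetical, and the toy sentence stands in for corpus text rather than being drawn from the Brown Corpus.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count, share) rows, most frequent word first."""
    counts = Counter(t.lower() for t in tokens)
    total = sum(counts.values())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count, count / total)
            for rank, (word, count) in enumerate(ordered, start=1)]


def hapax_legomena(tokens):
    """Return the sorted list of words that occur exactly once."""
    counts = Counter(t.lower() for t in tokens)
    return sorted(word for word, count in counts.items() if count == 1)


def zipf_products(rows):
    """Under a 1/n law, rank * count stays roughly constant across ranks."""
    return [rank * count for rank, _word, count, _share in rows]


if __name__ == "__main__":
    # Toy input (an assumption for illustration, not Brown Corpus data).
    toy = "the cat saw the dog and the dog saw a bird".split()
    for row in rank_frequency(toy):
        print(row)
    print("hapax legomena:", hapax_legomena(toy))
```

On a sample as small as the toy sentence the products of rank and count will not look constant; the relationship described above only emerges on large samples, so any real check would need the tokens of a full corpus passed in as the tokens argument.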

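The sampling rule described under Sample distribution (start at a random sentence boundary, then continue up to the first sentence boundary after 2,000 words) can be sketched as follows. This is a minimal sketch under stated assumptions, not a reconstruction of the original 1961 procedure: the function name draw_sample and its parameters are hypothetical, the source unit is assumed to be already split into sentences (each a list of words), and a unit that ends before the target is reached simply yields a shorter sample.

```python
import random


def draw_sample(sentences, target_words=2000, rng=None):
    """Pick a random sentence boundary and collect whole sentences
    until at least target_words words have been gathered.

    sentences: list of sentences, each a list of word strings (assumed input).
    Returns (start_index, list_of_sentences_in_sample).
    """
    rng = rng or random.Random()
    # Every index is a sentence boundary, so any of them can be the start.
    start = rng.randrange(len(sentences))
    sample, n_words = [], 0
    for sentence in sentences[start:]:
        sample.append(sentence)
        n_words += len(sentence)
        # Stop at the first sentence boundary at or past the target.
        if n_words >= target_words:
            break
    return start, sample


if __name__ == "__main__":
    # Synthetic unit: 400 sentences of 12 placeholder words each.
    unit = [["word"] * 12 for _ in range(400)]
    start, sample = draw_sample(unit, rng=random.Random(0))
    print(start, sum(len(s) for s in sample))
```

Because whole sentences are kept, a sample drawn this way normally runs slightly over the target rather than stopping at exactly 2,000 words.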

Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.