The Brown University Standard Corpus of Present-Day American English, better known as simply the Brown Corpus, is an electronic collection of text samples of American English, the first major structured corpus of varied genres. This corpus first set the bar for the scientific study of the frequency and distribution of word categories in everyday language use. Compiled by Henry Kučera and W. Nelson Francis at Brown University, in Rhode Island, it is a general language corpus containing 500 samples of English, totaling roughly one million words, compiled from works published in the United States in 1961.

Figure: The Department of Cognitive Linguistic & Psychological Sciences at Brown University.

History

In 1967, Kučera and Francis published their classic work, entitled "Computational Analysis of Present-Day American English", which provided basic statistics on what is known today simply as the Brown Corpus.

The Brown Corpus was a carefully compiled selection of current American English, totaling about a million words drawn from a wide variety of sources. Kučera and Francis subjected it to a variety of computational analyses, from which they compiled a rich and variegated opus, combining elements of linguistics, psychology, statistics, and sociology. It has been very widely used in computational linguistics, and was for many years among the most-cited resources in the field.

Shortly after publication of the first lexicostatistical analysis, Boston publisher Houghton-Mifflin approached Kučera to supply a million-word, three-line citation base for its new American Heritage Dictionary. This ground-breaking new dictionary, which first appeared in 1969, was the first dictionary to be compiled using corpus linguistics for word frequency and other information.

The initial Brown Corpus had only the words themselves, plus a location identifier for each. Over the following several years part-of-speech tags were applied. The Greene and Rubin tagging program (see under part of speech tagging) helped considerably in this, but the high error rate meant that extensive manual proofreading was required.

The tagged Brown Corpus used a selection of about 80 parts of speech, as well as special indicators for compound forms, contractions, foreign words and a few other phenomena, and formed the model for many later corpora such as the Lancaster-Oslo-Bergen Corpus (British English from 1961) and the Freiburg-Brown Corpus of American English (FROWN) (American English from the early 1990s). Tagging the corpus enabled far more sophisticated statistical analysis, such as the work programmed by Andrew Mackie, and documented in books on English grammar.

One interesting result is that even for quite large samples, graphing words in order of decreasing frequency of occurrence shows a hyperbola: the frequency of the n-th most frequent word is roughly proportional to 1/n. Thus "the" constitutes nearly 7% of the Brown Corpus, "to" and "of" more than another 3% each; while about half the total vocabulary of about 50,000 words are hapax legomena: words that occur only once in the corpus. This simple rank-vs.-frequency relationship was noted for an extraordinary variety of phenomena by George Kingsley Zipf (for example, see his The Psychobiology of Language), and is known as Zipf's law.

Although the Brown Corpus pioneered the field of corpus linguistics, by now typical corpora (such as the Corpus of Contemporary American English, the British National Corpus or the International Corpus of English) tend to be much larger, on the order of 100 million words.

Sample distribution

The Corpus consists of 500 samples, distributed across 15 genres in rough proportion to the amount published in 1961 in each of those genres. All works sampled were published in 1961; as far as could be determined they were first published then, and were written by native speakers of American English.

Each sample began at a random sentence-boundary in the article or other unit chosen, and continued up to the first sentence boundary after 2,000 words. In a very few cases miscounts led to samples being just under 2,000 words.

The original data entry was done on upper-case-only keypunch machines; capitals were indicated by a preceding asterisk, and various special items such as formulae also had special codes.

The corpus originally (1961) contained 1,014,312 words sampled from 15 text categories:

A. PRESS: Reportage (44 texts). Political; Sports; Society; Spot News; Financial; Cultural
B. PRESS: Editorial (27 texts). Institutional Daily; Personal; Letters to the Editor
C. PRESS: Reviews (17 texts). theatre; books; music; dance
D. RELIGION (17 texts). Books; Periodicals; Tracts
E. SKILL AND HOBBIES (36 texts). Books; Periodicals
F. POPULAR LORE (48 texts). Books; Periodicals
G. BELLES-LETTRES - Biography, Memoirs, etc. (75 texts). Books; Periodicals
H. MISCELLANEOUS: US Government & House Organs (30 texts). Government Documents; Foundation Reports; Industry Reports; College Catalog; Industry House organ
J. LEARNED (80 texts). Natural Sciences; Medicine; Mathematics; Social and Behavioral Sciences; Political Science, Law, Education; Humanities; Technology and Engineering
K. FICTION: General (29 texts). Novels; Short Stories
L. FICTION: Mystery and Detective Fiction (24 texts). Novels; Short Stories
M. FICTION: Science (6 texts). Novels; Short Stories
N. FICTION: Adventure and Western (29 texts). Novels; Short Stories
P. FICTION: Romance and Love Story (29 texts). Novels; Short Stories
R. HUMOR (9 texts). Novels; Essays, etc.

Part-of-speech tags used

Tag     Definition
CC      coordinating conjunction (and, or)
CD      cardinal numeral (one, two, 2, etc.)
CS      subordinating conjunction (if, although)
EX      existential there
IN      preposition (in, at, on)
JJ      adjective
JJA     adjective + auxiliary
JJC     adjective, comparative
JJCC    adjective + conjunction
JJS     semantically superlative adjective (chief, top)
JJF     adjective + female
JJM     adjective + male
NN      singular or mass noun
NNA     noun + auxiliary
NNC     noun + conjunction
NNS     plural noun
NNP     proper noun or part of name phrase
NNPC    proper noun + conjunction
PRP     personal pronoun, singular
PRPS    personal pronoun, plural
PRP$    possessive pronoun
RB      adverb
RBR     comparative adverb
RBS     superlative adverb
VB      verb, base form
VBA     verb + auxiliary, singular, present
VBD     verb, past tense
VBG     verb, present participle/gerund
VBN     verb, past participle
VBZ     verb, 3rd singular present
FW      foreign words
SYM     symbols
PUN     all punctuation

See also

LOB Corpus, a corpus of British English based on the same parameters as the Brown Corpus
British National Corpus

References

Francis, W. Nelson & Henry Kučera. 1967. Computational Analysis of Present-Day American English. Providence, RI: Brown University Press.
Francis, W. Nelson & Henry Kučera. 1979. BROWN CORPUS MANUAL: Manual of Information to Accompany a Standard Corpus of Present-Day Edited American English for Use with Digital Computers. http://icame.uib.no/brown/bcm.html.
Hundt, Marianne, Andrea Sand & Rainer Siemund. 1998. Manual of Information to Accompany the Freiburg-Brown Corpus of American English (FROWN). http://khnt.hit.uib.no/icame/manuals/frown/INDEX.HTM. Archived 2014-04-03 at the Wayback Machine.
Leech, Geoffrey & Nicholas Smith. 2005. Extending the possibilities of corpus-based research on English in the twentieth century: A prequel to LOB and FLOB. ICAME Journal 29. 83–98.
Francis, Winthrop Nelson & Henry Kučera. 1983. Frequency Analysis of English Usage: Lexicon and Grammar. Houghton Mifflin.
Malmkjær, Kirsten. 2002. The Linguistics Encyclopedia, 2nd ed. Routledge. ISBN 0-415-22210-9, p. 87.

External links

Brown Corpus Manual
Download the Brown Corpus
Search, via Sketch Engine, in the Brown Corpus Annotated by the TreeTagger v2
More details on the Brown Corpus tagset
Python software for convenient access to the Brown Corpus
PHP (Part Of Speech Tagging)
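The rank-versus-frequency relationship described above (the frequency of the n-th most frequent word being roughly proportional to 1/n, with many words occurring only once) can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the toy sentence is invented rather than drawn from the Brown Corpus, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) triples, most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically so that ties are deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


def hapax_legomena(tokens):
    """Return the set of words that occur exactly once."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count == 1}


def zipf_prediction(top_count, rank):
    """Count expected at a given rank if frequency is proportional to 1/rank."""
    return top_count / rank


# Toy sample (hypothetical text, not taken from the Brown Corpus).
sample = ("the cat saw the dog and the dog saw the bird "
          "of the garden and the cat of the house").split()

table = rank_frequency(sample)
top_count = table[0][2]

# Compare observed counts with the 1/rank prediction for the first few ranks.
for rank, word, count in table[:4]:
    print(rank, word, count, round(zipf_prediction(top_count, rank), 1))

print("share of the most frequent word:", round(top_count / len(sample), 3))
print("hapax legomena:", sorted(hapax_legomena(sample)))
```

A sample this small will not follow the 1/n curve closely; the comparison only becomes meaningful on a corpus of realistic size, where the same functions can be applied to the full token list.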
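The sampling rule described under Sample distribution (start at a random sentence boundary, then continue to the first sentence boundary after 2,000 words) can be sketched roughly as follows. This is a minimal illustration, not the compilers' actual procedure: the function names, the representation of a source text as a list of tokenized sentences, and the decision to stop as soon as the running word count reaches the target are all assumptions.

```python
import random


def extend_to_boundary(sentences, start, target_words=2000):
    """Collect whole sentences from `start` until the target word count is reached.

    `sentences` is a list of sentences, each a list of word tokens.
    The sample always ends on a sentence boundary, so it is usually slightly
    longer than `target_words`; it is shorter only if the source text runs out.
    """
    sample, total = [], 0
    for sentence in sentences[start:]:
        sample.append(sentence)
        total += len(sentence)
        if total >= target_words:  # assumption: reaching the target counts as "after"
            break
    return sample


def draw_sample(sentences, target_words=2000, rng=None):
    """Pick a random sentence boundary as the start, then extend to the target."""
    rng = rng or random.Random()
    start = rng.randrange(len(sentences))
    return extend_to_boundary(sentences, start, target_words)


# Hypothetical source text: ten sentences of seven words each.
source = [["word"] * 7 for _ in range(10)]

# With a target of 20 words, three whole sentences (21 words) are needed.
demo = extend_to_boundary(source, start=2, target_words=20)
print(len(demo), sum(len(sentence) for sentence in demo))
```

Because whole sentences are kept, sample lengths vary a little from text to text, which is consistent with a corpus of 500 samples totaling slightly more than one million words.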