The Hamshahri Corpus (Persian: پیکره همشهری) is a sizable Persian corpus based on the Iranian newspaper Hamshahri, one of the first online Persian-language newspapers in Iran. It was initially collected and compiled by Ehsan Darrudi at the Database Research Group (DBRG) of the University of Tehran. Later, a team headed by Abolfazl AleAhmad built on this corpus and created the first Persian text collection suitable for information retrieval evaluation tasks.

This corpus was created by crawling the online news articles from Hamshahri's website and processing the HTML pages to create a standard text corpus for modern information retrieval experiments.

Version 1.0

The collection contains more than 160,000 articles covering subject categories such as politics, city news, economics, reports, editorials, literature, sciences, society, foreign news, and sports. Document sizes range from short news items (under 1 KB) to rather long articles (e.g. 140 KB), with an average size of 1.8 KB.

The corpus is available for download in several formats:
Tagged Text: 560 MB
SQL Server 2000 tables: 712 MB

Version 2.0

The second release of the Hamshahri Corpus was launched on 20 October 2008. It offers several new features and improvements:
More News: 323,616 text stories in 3,206 XML files (one file for each day)
Increased Time Span: from 22 June 1996 to 13 May 2007
Bigger in Size: 1.42 GB uncompressed
Standard Container: Unicode XML
Included Images: images have been extracted from the news and preserved (available in an additional package), which makes the corpus suitable for image retrieval tasks
Categorized News: the news stories have been categorized semi-automatically (appropriate for text categorization and classification tasks)

The corpus is available for download in XML format.

See also

Bijankhan Corpus
Persian Today Corpus
Tehran Monolingual Corpus
Text corpus
Information retrieval

External links

Hamshahri Corpus Homepage (archived 2017-05-14 at the Wayback Machine)
irBlogs Collection Homepage
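The Version 2.0 release is described as a set of Unicode XML files, one per day. The element names used inside those files are not given here, so the sketch below makes no assumption about them: it only walks a directory of XML files and counts how often each tag occurs, which may be a reasonable first step for inspecting an unfamiliar collection. The function name `summarize_xml_dir` and the `corpus_dir` parameter are hypothetical and are not part of the corpus distribution.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def summarize_xml_dir(corpus_dir):
    """Count element tags across every *.xml file directly under corpus_dir.

    Returns (number_of_files_parsed, Counter mapping tag name -> occurrences).
    No particular schema is assumed; tags are reported exactly as found.
    """
    tag_counts = Counter()
    files_parsed = 0
    for xml_path in sorted(Path(corpus_dir).glob("*.xml")):
        # iterparse reads the file incrementally instead of building
        # the whole tree before any element can be inspected.
        for _event, elem in ET.iterparse(str(xml_path), events=("end",)):
            tag_counts[elem.tag] += 1
            # Drop the element's text, attributes and children once counted.
            elem.clear()
        files_parsed += 1
    return files_parsed, tag_counts
```

Calling `summarize_xml_dir` on a local copy of the collection would return the number of files read and a tag histogram. Whichever tag turns out to wrap a single news story could then be compared against the 323,616 stories and 3,206 files reported for Version 2.0.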
Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.