Hamshahri Corpus

The Hamshahri Corpus (Persian: پیکره همشهری) is a sizable Persian corpus based on the Iranian newspaper Hamshahri, one of the first online Persian-language newspapers in Iran. It was initially collected and compiled by Ehsan Darrudi at the Database Research Group (DBRG) of the University of Tehran. Later, a team headed by Abolfazl AleAhmad built on this corpus and created the first Persian text collection suitable for information retrieval evaluation tasks.

The corpus was created by crawling the online news articles on Hamshahri's website and processing the HTML pages to create a standard text corpus for modern information retrieval experiments.

Version 1.0

The collection contains more than 160,000 articles covering the following subject categories: politics, city news, economics, reports, editorials, literature, sciences, society, foreign news, sports, etc. Document size varies from short news items (under 1 KB) to rather long articles (e.g. 140 KB), with an average size of 1.8 KB.

The corpus is available for download in several formats:

Tagged Text: 560 MB
In SQL Server 2000 Tables: 712 MB

Version 2.0

The second release of the Hamshahri Corpus was launched on 20 October 2008. It offers several new features and improvements:

More News: 323,616 text stories in 3206 XML files (one file for each day)
Increased Time Span: from 22 June 1996 to 13 May 2007
Bigger in Size: 1.42 GB uncompressed
Standard Container: Unicode XML
Included Images: images have been extracted from the news and preserved (available in an additional package), which makes the corpus suitable for image retrieval tasks
Categorized News: the news stories have been categorized semi-automatically (appropriate for text categorization and classification tasks)

The corpus is available for download in XML format.
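
The Version 2.0 figures can be cross-checked with a little arithmetic. The sketch below is illustrative only: it uses the story count, file count, and date range quoted for Version 2.0, and the function name `corpus_summary` and its dictionary keys are hypothetical names chosen for this example, not part of the corpus distribution.

```python
from datetime import date

# Figures quoted for Version 2.0 of the corpus.
STORIES = 323_616            # text stories
XML_FILES = 3_206            # one XML file per day of news
FIRST_DAY = date(1996, 6, 22)
LAST_DAY = date(2007, 5, 13)


def corpus_summary(stories=STORIES, files=XML_FILES,
                   first=FIRST_DAY, last=LAST_DAY):
    """Derive simple ratios from the published counts (hypothetical helper)."""
    # Count both endpoints of the time span as calendar days.
    calendar_days = (last - first).days + 1
    return {
        "calendar_days": calendar_days,
        "stories_per_file": stories / files,
        "day_coverage": files / calendar_days,
    }


if __name__ == "__main__":
    for name, value in corpus_summary().items():
        if isinstance(value, float):
            print(f"{name}: {value:.2f}")
        else:
            print(f"{name}: {value}")
```

On these numbers each daily file holds roughly 101 stories on average, and the 3206 files cover about four-fifths of the calendar days in the span. The article does not say why the remaining days have no file.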

See also

Bijankhan Corpus
Persian Today Corpus
Tehran Monolingual Corpus
Text corpus
Information retrieval

Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.