Text segmentation

When punctuation and similar clues are not consistently available, the segmentation task often requires fairly non-trivial techniques, such as statistical decision-making, large dictionaries, and consideration of syntactic and semantic constraints.
Some scholars have suggested that modern Chinese should be written with word segmentation, with spaces between words as in written English, because some texts are ambiguous in ways that only the author can resolve. For example, "美国会不同意。" may mean "美国 会 不同意。" (The US will not agree.) or "美 国会 不同意。" (The US Congress does not agree.). For more details, see Chinese word-segmented writing.
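The ambiguity in this example can be reproduced mechanically. The following is a minimal sketch, assuming a tiny hypothetical lexicon (`LEXICON`) built only for this sentence; the function name `segmentations` is likewise illustrative and not taken from any particular tool. It enumerates every way of splitting an unspaced string into lexicon words.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this word and segment the remainder recursively.
            for tail in segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results


# Hypothetical toy lexicon covering only the example sentence.
LEXICON = {"美", "美国", "国会", "会", "不同意"}

for words in segmentations("美国会不同意", LEXICON):
    print(" ".join(words))
```

With this lexicon the function finds both readings, which is why a dictionary alone cannot decide between them; some further source of evidence (context, statistics, or the author) is needed.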
While topic identification is a simple classification of a specific text, text segmentation implies that a document may contain multiple topics, and the task of computerized text segmentation may be to discover these topics automatically and segment the text accordingly. The topic boundaries may be apparent from section titles and paragraphs.
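However, the equivalent to the word space character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages which do not have a trivial word segmentation process include Chinese and Japanese, where sentences but not words are delimited; Thai and Lao, where phrases and sentences but not words are delimited; and Vietnamese, where syllables but not words are delimited.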
Effective natural language processing systems and text segmentation tools usually operate on text in specific domains and sources. For example, processing text used in medical records is a very different problem from processing news articles or real estate advertisements.
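In English and some other languages, using punctuation, particularly the full stop/period character, is a reasonable approximation of sentence boundaries. However, even in English this problem is not trivial due to the use of the full stop character for abbreviations, which may or may not also terminate a sentence. For example, "Mr." is not its own sentence in "Mr. Smith went to the shops in Jones Street."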
In English and all other languages, the core intent or desire is identified and becomes the cornerstone of keyphrase intent segmentation. The core product or service, idea, action, or thought anchors the keyphrase.
The problem is non-trivial, because while some written languages have explicit word boundary markers, such as the word spaces of written English and the distinctive initial, medial and final letter shapes of Arabic, such signals are sometimes ambiguous and not present in all written languages.
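Many English compound nouns are variably written (for example, ice box = ice-box = icebox; pig sty = pig-sty = pigsty) with a corresponding variation in whether speakers think of them as noun phrases or single nouns; there are trends in how norms are set, such as that open compounds often tend eventually to solidify by widespread convention, but variation remains systemic.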
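Many different approaches to topic segmentation have been tried, e.g. hidden Markov models (HMM), lexical chains, passage similarity using word co-occurrence, clustering, and topic modeling. Topic segmentation is quite an ambiguous task: people evaluating text segmentation systems often disagree about topic boundaries. Hence, evaluating text segmentation is also a challenging problem.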
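One way to propose topic boundaries automatically is to compare the vocabulary of adjacent passages and to place a boundary where they share few words. The following is a minimal sketch of that idea, not a method prescribed here: the function names `cosine` and `topic_boundaries`, the whitespace tokenization, and the `threshold` value are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def topic_boundaries(passages, threshold=0.1):
    """Return indices i where a boundary is proposed before passages[i]."""
    bags = [Counter(p.lower().split()) for p in passages]
    return [
        i + 1
        for i in range(len(bags) - 1)
        if cosine(bags[i], bags[i + 1]) < threshold
    ]


passages = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stock prices fell sharply",
    "investors sold stock quickly",
]
print(topic_boundaries(passages))
```

Because human judges themselves disagree on where topics change, the threshold has no single correct value and would have to be tuned against whatever reference segmentation is available.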
Some text segmentation systems take advantage of any markup, such as HTML, and of known document formats, such as PDF, to provide additional evidence for sentence and paragraph boundaries.
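As an illustration of using markup as boundary evidence, the sketch below reads paragraph boundaries directly from HTML `<p>` elements with Python's standard `html.parser` module. The class and function names (`ParagraphExtractor`, `paragraphs_from_html`) are hypothetical, and the sketch assumes that every `<p>` element is explicitly closed.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of each explicitly closed <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None


def paragraphs_from_html(markup):
    parser = ParagraphExtractor()
    parser.feed(markup)
    parser.close()
    return parser.paragraphs


print(paragraphs_from_html("<p>First <b>paragraph</b>.</p><p>Second\n paragraph.</p>"))
```

Real documents often omit closing tags or use other block elements, so a practical system would treat markup as one source of evidence among several rather than as ground truth.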
The improvement comes from indexing or recognizing documents more precisely, or from returning the specific part of a document that corresponds to the query. Topic segmentation is also needed in topic detection and tracking systems and in text summarization.
The process of developing text segmentation tools starts with collecting a large corpus of text in an application domain. There are two general approaches: a manual one and one based on machine learning.
As with word segmentation, not all written languages contain punctuation characters that are useful for approximating sentence boundaries.
When processing plain text, tables of abbreviations that contain periods can help prevent incorrect assignment of sentence boundaries.
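A minimal sketch of the abbreviation-table idea follows. The table `ABBREVIATIONS` is a small hypothetical sample, and `split_sentences` is an illustrative name rather than an existing library function. A whitespace-delimited token that ends in a full stop, exclamation mark or question mark closes a sentence unless it appears in the table.

```python
import re

# Hypothetical sample table; a real one would be far larger.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "st."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split plain text into sentences, skipping known abbreviations."""
    sentences = []
    start = 0
    for match in re.finditer(r"\S+", text):
        token = match.group()
        if token[-1] in ".!?" and token.lower() not in abbreviations:
            sentences.append(text[start:match.end()].strip())
            start = match.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)  # trailing text without final punctuation
    return sentences


print(split_sentences("Mr. Smith went to the shops in Jones Street. He bought milk."))
```

The sketch never ends a sentence at a listed abbreviation, so it will miss boundaries where an abbreviation also terminates the sentence; the table reduces false boundaries but does not resolve that case.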
Topic analysis consists of two main tasks: topic identification and text segmentation.
The term "text segmentation" applies both to mental processes used by humans when reading text, and to artificial processes implemented in computers, which are the subject of natural language processing.
In some writing systems, however, such as the Ge'ez script used for Amharic and Tigrinya among other languages, words are explicitly delimited (at least historically) with a non-whitespace character.
Intent segmentation is the problem of dividing written words into keyphrases (groups of two or more words).
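Word splitting is the process of parsing concatenated text (i.e. text that contains no spaces or other word separators) to infer where word breaks exist. Word splitting may also refer to the process of hyphenation.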
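Inferring word breaks in concatenated text can be sketched with dynamic programming over a word list. The example below is one possible approach under stated assumptions: `split_words` is a hypothetical helper, the vocabulary is a toy set, and ties between candidate splits are resolved by preferring the split with the fewest words, which is a heuristic choice rather than an established rule.

```python
def split_words(text, vocabulary):
    """Return a split of `text` into vocabulary words, or None if impossible."""
    # best[i] holds the shortest known split of text[:i], or None.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in vocabulary:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]


# Hypothetical toy vocabulary.
VOCABULARY = {"word", "words", "split", "splitting", "ting", "is", "hard"}

print(split_words("wordsplittingishard", VOCABULARY))
```

Practical systems usually replace the fewest-words heuristic with word frequencies or a language model, since the shortest split is not always the intended one.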
Word segmentation is the problem of dividing a string of written language into its component words.
Sentence segmentation is the problem of dividing a string of written language into its component sentences.
Processes may be required to segment text into units other than those mentioned, including morphemes (a task usually called morphological analysis) or paragraphs.
The usefulness of the space as a word divider has limits, however, because of the variability with which languages emically regard collocations and compounds.
Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics.
In contrast, German compound nouns show less orthographic variation, with solidification being a stronger norm.
Compare speech segmentation, the process of dividing speech into linguistically meaningful portions.
Segmenting the text into topics or discourse turns might be useful in some natural language processing tasks: it can improve information retrieval or speech recognition significantly.
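The machine learning approach consists of annotating the sample corpus with boundary information and using machine learning.
The Unicode Consortium has published a Standard Annex on Text Segmentation, exploring the issues of segmentation in multiscript texts.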
When topic boundaries are not apparent from section titles and paragraphs, one needs to use techniques similar to those used in document classification.
In English and many other languages using some form of the Latin alphabet, the space is a good approximation of a word divider (word delimiter).
Automatic segmentation is the problem in natural language processing of implementing a computer process to segment text.
The manual approach consists of analyzing the text by hand and writing custom software.

See also: hyphenation, natural language processing, speech segmentation, lexical analysis, word count, line breaking.


Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.