When punctuation and similar clues are not consistently available, the segmentation task often requires non-trivial techniques, such as statistical decision-making, large dictionaries, and consideration of syntactic and semantic constraints. Effective natural language processing systems
Some scholars have suggested that modern Chinese should be written with word segmentation, with spaces between words as in written English, because some texts are ambiguous and only the author knows the intended meaning. For example, "美国会不同意。" may mean "美国 会 不同意。" (The US will not agree.) or "美 国会
of a specific text, the latter case implies that a document may contain multiple topics, and the task of computerized text segmentation may be to discover these topics automatically and segment the text accordingly. The topic boundaries may be apparent from section titles and paragraphs. In other
However, the equivalent to the word space character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages which do not have a trivial word segmentation process include
Chinese, Japanese, where
and text segmentation tools usually operate on text in specific domains and sources. As an example, processing text used in medical records is a very different problem than processing news articles or real estate advertisements.
/period character is a reasonable approximation. However, even in English this problem is not trivial due to the use of the full stop character for abbreviations, which may or may not also terminate a sentence. For example,
In English and all other languages, the core intent or desire is identified and becomes the cornerstone of keyphrase intent segmentation. The core product or service, idea, action, or thought anchors the keyphrase.
. The problem is non-trivial, because while some written languages have explicit word boundary markers, such as the word spaces of written English and the distinctive initial, medial and final letter shapes of
or single nouns; there are trends in how norms are set, such as that open compounds often tend eventually to solidify by widespread convention, but variation remains systemic. In contrast,
The task is quite ambiguous: people evaluating text segmentation systems often disagree on where topic boundaries lie. Hence, evaluating text segmentation is also a challenging problem.
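One way to make such disagreement measurable is a window-based error rate. The sketch below implements a Pk-style measure, a metric commonly used for comparing a hypothesized segmentation against a reference one; it is not described in the text above and is offered only as an illustration. The function names, the representation of a segmentation as a list of segment start indices, and the default window size (half the mean reference segment length) are assumptions made for this example.

```python
def segment_ids(boundaries, n_units):
    """Map each unit (e.g. sentence) index to the id of the segment it belongs to.

    `boundaries` holds indices i such that a new segment starts at unit i.
    """
    starts = set(boundaries)
    ids, current = [], 0
    for i in range(n_units):
        if i in starts and i > 0:
            current += 1
        ids.append(current)
    return ids


def pk(reference, hypothesis, n_units, k=None):
    """Fraction of windows (i, i + k) on which the two segmentations disagree
    about whether both ends fall in the same segment. Lower is better."""
    if n_units < 2:
        raise ValueError("need at least two units")
    ref = segment_ids(reference, n_units)
    hyp = segment_ids(hypothesis, n_units)
    if k is None:
        # Assumed default: half the mean reference segment length.
        n_segments = ref[-1] + 1
        k = max(1, round(n_units / n_segments / 2))
    if k >= n_units:
        raise ValueError("window size must be smaller than the text")
    windows = n_units - k
    disagreements = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
        for i in range(windows)
    )
    return disagreements / windows
```

Two human annotators can be compared in the same way by treating one annotation as the reference and the other as the hypothesis.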
Some text segmentation systems take advantage of markup such as HTML and known document formats such as PDF to provide additional evidence for sentence and paragraph boundaries.
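As a minimal sketch of how markup can supply boundary evidence, the following uses Python's standard `html.parser` module to treat each HTML `<p>` element as one paragraph. The class and function names are hypothetical, and real documents would need handling for other block-level elements as well.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of each <p> element as one paragraph."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._buffer is not None:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
        self._buffer = None


def paragraphs_from_html(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a trailing paragraph that was never closed
    return parser.paragraphs
```

Inline tags such as `<b>` do not interrupt a paragraph here, because only `<p>` start and end tags are treated as boundaries.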
significantly (by indexing/recognizing documents more precisely or by giving the specific part of a document corresponding to the query as a result). It is also needed in
The process of developing text segmentation tools starts with collecting a large corpus of text in an application domain. There are two general approaches:
As with word segmentation, not all written languages contain punctuation characters that are useful for approximating sentence boundaries.
When processing plain text, tables of abbreviations that contain periods can help prevent incorrect assignment of sentence boundaries.
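A minimal sketch of this idea follows: sentence boundaries are proposed at `.`, `!` or `?` followed by whitespace, and a candidate is rejected when the token ending there appears in an abbreviation table. The table shown is a tiny hypothetical sample and the function name is an assumption. The sketch also never treats an abbreviation as sentence-final, which is a known limitation, since an abbreviation may in fact end a sentence.

```python
import re

# Hypothetical, deliberately tiny abbreviation table; a real one would be much larger.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "st.", "e.g.", "i.e."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split after ., ! or ? followed by whitespace or end of text,
    unless the token ending there is a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]+(?=\s|$)", text):
        end = match.end()
        # The whitespace-delimited token that ends at this punctuation mark.
        token = text[start:end].split()[-1].lower()
        if token in abbreviations:
            continue  # e.g. "Mr." does not end the sentence
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # text after the last accepted boundary
    return sentences


print(split_sentences("Mr. Smith went to the shops in Jones Street. He bought milk."))
# ['Mr. Smith went to the shops in Jones Street.', 'He bought milk.']
```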
Proceedings of the 1st
Meeting of the North American Chapter of the Association for Computational Linguistics (ANLP-NAACL-00)
Topic analysis consists of two main tasks: topic identification and text segmentation. While the first is a simple
used by humans when reading text, and to artificial processes implemented in computers, which are the subject of
1376:"也谈汉语书面语的分词问题——分词连写十大好处 (Written Chinese Word Segmentation Revisited: Ten advantages of word-segmented writing)"
among other languages, words are explicitly delimited (at least historically) with a non-whitespace character.
Intent segmentation is the problem of dividing written words into keyphrases (groups of two or more words).
text (i.e. text that contains no spaces or other word separators) to infer where word breaks exist.
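Inferring word breaks in concatenated text can be sketched as a dictionary lookup combined with memoized recursion. The example below enumerates every way to split a string into words from a given vocabulary; the function name and the toy vocabulary are assumptions for illustration. A practical system would also need a way to rank the candidate segmentations, for example with word frequencies.

```python
from functools import lru_cache


def segmentations(text, vocabulary):
    """Return every way to split `text` into words drawn from `vocabulary`."""
    vocab = frozenset(vocabulary)
    max_len = max(map(len, vocab), default=0)

    @lru_cache(maxsize=None)
    def split_from(i):
        # All segmentations of text[i:], each as a tuple of words.
        if i == len(text):
            return ((),)
        results = []
        for j in range(i + 1, min(len(text), i + max_len) + 1):
            word = text[i:j]
            if word in vocab:
                results.extend((word,) + rest for rest in split_from(j))
        return tuple(results)

    return [list(words) for words in split_from(0)]


print(segmentations("icebox", {"ice", "box", "icebox"}))
# [['ice', 'box'], ['icebox']]
```

More than one result means the string is ambiguous with respect to the vocabulary, and an empty list means no segmentation exists.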
Word segmentation is the problem of dividing a string of written language into its component words.
Sentence segmentation is the problem of dividing a string of written language into its component
Processes may be required to segment text into units other than those mentioned above, including
), although this concept has limits because of the variability with which languages
, such signals are sometimes ambiguous and not present in all written languages.
is the process of dividing written text into meaningful units, such as words,
show less orthographic variation, with solidification being a stronger norm.
. In English and some other languages, using punctuation, particularly the
, the process of dividing speech into linguistically meaningful portions.
turns might be useful in some natural language processing tasks: it can improve
) with a corresponding variation in whether speakers think of them as
, where phrases and sentences but not words are delimited, and
不同意。" (The US Congress does not agree). For more details, see
Annotate the sample corpus with boundary information and use
, exploring the issues of segmentation in multiscript texts.
cases, one needs to use techniques similar to those used in
In
English and many other languages using some form of the
1395:"Advances in domain independent linear text segmentation"
of implementing a computer process to segment text.
Manual analysis of text and writing custom software
1420:"Topic Segmentation: Algorithms and Applications"
Many different approaches have been tried: e.g.
Word splitting may also refer to the process of
, where syllables but not words are delimited.
Mr. Smith went to the shops in Jones Street."
In some writing systems however, such as the
1380:Journal of Chinese Information Processing
Automatic segmentation is the problem in
282:Standard Annex on Text Segmentation
218:are variably written (for example,
519:Automatic segmentation approaches
480:, passage similarity using word
396:Sentence boundary disambiguation
227:pig sty = pig-sty = pigsty
221:ice box = ice-box = icebox
310:Chinese word-segmented writing
246:but not words are delimited,
194:is a good approximation of a
412:is not its own sentence in "
145:. The term applies both to
563:Natural language processing
525:natural language processing
499:Other segmentation problems
179:Word § Word boundaries
151:natural language processing
1427:University of Pennsylvania
1393:Freddy Y. Y. Choi (2000).
1374:Zhang, Xiao-heng (1998).
465:and tracking systems and
445:Segmenting the text into
1382:. 12 (1998) (3): 58–64.
507:(a task usually called
440:document classification
429:Document classification
509:morphological analysis
216:English compound nouns
27:Human writing practice
455:information retrieval
390:Sentence segmentation
236:German compound nouns
168:Segmentation problems
568:Speech segmentation
316:Intent segmentation
162:speech segmentation
459:speech recognition
423:Topic segmentation
290:is the process of
278:Unicode Consortium
1416:Jeffrey C. Reynar
1404:. pp. 26–33.
173:Word segmentation
135:Text segmentation
573:Lexical analysis
543:machine learning
467:text summarizing
280:has published a
147:mental processes
463:topic detection
1425:. IRCS-98-21.
490:topic modeling
478:lexical chains
435:classification
386:". , , ."
288:Word splitting
188:Latin alphabet
583:Line breaking
482:co-occurrence
1430:. Retrieved
295:concatenated
263:Ge'ez script
232:noun phrases
208:collocations
196:word divider
558:Hyphenation
302:hyphenation
1432:8 November
1350:References
578:Word count
513:paragraphs
486:clustering
469:problems.
427:See also:
394:See also:
256:Vietnamese
177:See also:
505:morphemes
451:discourse
406:full stop
402:sentences
265:used for
244:sentences
212:compounds
200:delimiter
139:sentences
1469:Category
1418:(1998).
552:See also
271:Tigrinya
204:emically
160:Compare
1363:UAX #29
492:, etc.
335:may be
292:parsing
267:Amharic
214:. Many
206:regard
447:topics
198:(word
190:, the
155:Arabic
143:topics
1423:(PDF)
1398:(PDF)
511:) or
192:space
141:, or
1455:help
1434:2007
276:The
269:and
250:and
248:Thai
210:and
474:HMM
457:or
449:or
410:Mr.
252:Lao
1400:.
1378:.
515:.
488:,
484:,
476:,
442:.
312:.
304:.
224:;
1457:)
1453:(
1436:.
Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.