A mere twenty years ago, information technology systems expressed and stored data in a multitude of formats and systems. The
Internet and Web protocols have done much to overcome these sources of differences. While there is a large number of categories of semantic heterogeneity, these categories are
, among many others. From the conceptual level to the actual data, there are differences in perspective, vocabularies, measures and conventions once any two data sources are brought together. Explicit attention to these semantic heterogeneities is one means of getting the information to integrate or interoperate.
Michael
Bergman expanded upon this schema by adding a fourth major explicit category of language, and also added some examples of each kind of semantic heterogeneity, resulting in about 40 distinct potential categories. This table shows the combined 40 possible sources of semantic heterogeneities
conflicts refer to discrepancies among similar or related data values across multiple sources. Data conflicts can only be detected by comparing the underlying sources. The class of data conflicts includes ID-value, missing data, incorrect spelling, and naming conflicts between the element contents
conflicts arise when the semantics of the data sources that will be integrated exhibit discrepancies. Domain conflicts can be detected by looking at the information contained in the schema and using knowledge about the underlying data domains. The class of domain conflicts includes schematic
conflicts arise when the schema of the sources representing related or overlapping data exhibit discrepancies. Structural conflicts can be detected when comparing the underlying schema. The class of structural conflicts includes generalization conflicts, aggregation conflicts, internal path
et al. Under their concept, they split semantics into three forms: implicit, formal and powerful. Implicit semantics are what is either largely present or can easily be extracted; formal languages, though relatively scarce, occur in the form of
When single items in one schema are related to multiple items in another schema, or vice versa. For example, one schema may refer to "phone" but the other schema has multiple elements such as "home phone", "work phone" and "cell phone"
One of the most comprehensive classifications is from
Pluempitiwiriyawej and Hammer, "Classification Scheme for Semantic and Schematic Heterogeneities in XML Data Sources". They classify heterogeneities into three broad classes:
Fur) or where values for these attributes may be the same but refer to different actual attributes or where values may differ but be for the same attribute and putative value.
also patterned and can be anticipated and corrected. These patterned sources inform what kind of work must be done to overcome semantic differences where they still reside.
Can arise from different source-target retrieval paths in two different schemas (for example, hierarchical structures where the elements are at different levels of remove)
differences. Decomposing the various sources of semantic heterogeneities provides a basis for understanding how to map and transform data to overcome these differences.
discrepancy, missing items, element ordering, constraint and type mismatch, and naming conflicts between the element types and attribute names.
for the same domain are developed by independent parties, resulting in differences in meaning and interpretation of data values. Beyond
Moreover, mismatches or conflicts can occur between set elements (a "population" mismatch) or attributes (a "description" mismatch).
; and powerful (soft) semantics are fuzzy and not limited to rigid set-based assignments. Sheth et al.'s main point is that
URIs can be a particular problem here, owing to actual mismatches but also to inconsistent use of namespaces and to truncated URIs
When two types (classes or sets) are asserted as being the same when the scope and reference are not (for example,
Set members can be ordered or unordered, and if ordered, the sequences of individual members or values can differ
Differences in set enumerations, or in whether certain items (say, US territories) are included in a listing of US states
When attributes referring to the same thing have different cardinalities or disjointedness assertions
Yet, for multiple data sources to interoperate with one another, it is essential to reconcile these
When the same item is characterized by different types, such as a person being typed as an animal
When two individuals are asserted as being the same when they are actually distinct (for example,
When the same name refers to more than one attribute, such as Name referring to a person
commas; various date formats; using exponents or aggregate units (such as thousands or millions)
When the same name refers to more than one concept, such as Name referring to a person
is from
William Kent more than two decades ago. Kent's approach dealt more with structural
Many of the other semantic heterogeneities herein also contribute to schema discrepancies
(FOL) or description logic alone is inadequate to properly capture the needed semantics.
A different approach toward classifying semantics and integration approaches is taken by
"A classification scheme for semantic and schematic heterogeneities in XML data sources"
Mis-recognition of search tokens because they were not parsed with the proper encoding
Differences in scope coverage between two or more datasets for the same attribute
Differences in scope coverage between two or more datasets for the same concept
. Semantic heterogeneity is one of the more important sources of differences in
, the problem of semantic heterogeneity is compounded by the flexibility of
Fur) may refer to the same attribute, or when same attribute names (say, Hair
Mis-recognition of tokens because they were not parsed with the proper encoding
"Semantics for the semantic Web: the implicit, the formal and the powerful"
. Gainesville, Florida: University of Florida. Technical Report TR00-004.
Variations in how parsers handle, say, stemming, white spaces or hyphens
discrepancy, scale or unit, precision, and data representation conflicts.
This article is about semantic differences in data. For other uses, see
A common problem, more acute with closed world approaches than with
Differences in attribute completeness between two or more datasets
One of four errors that may occur when attribute names (say, Hair
When the same population is divided differently (such as, Census
Amit P. Sheth; Cartic
Ramakrishnan; Christopher Thomas (2005).
International
Journal on Semantic Web and Information Systems
that depend on reconciling semantic heterogeneities include
May occur when sums or counts are included as set members
Hair) may refer to different attribute scopes (say, Hair
One of the first known classification schemes applied to
"Sources and classification of semantic heterogeneities"
. Proceedings of the IEEE COMPCON. San Francisco. 13 pp.
issues than differences in meaning, which he pointed to
For example, a value of 4.1 inches in one dataset
Besides data interoperability, relevant areas in
Parsing / Morphological Analysis Errors (many)
Confusion often arises in the use of literals
William Kent (February 27 – March 3, 1989).
English measurement systems, or currencies
Attribute-value to Attribute-label Mapping
"Big structure and data interoperability"
Element-value to Attribute-label Mapping
Attribute-value to Element-label Mapping
Classification of semantic heterogeneity
Element-value to Element-label Mapping
Ambiguous sentence references, such as
United Kingdom, or full person names
Federal regions for states, England
Enterprise information integration
enterprise information integration
I'm glad I'm a man, and so is Lola
Romance languages (left-to-right)
Arabic languages (right-to-left)
Differences, say, in the metric
Generalization / Specialization
methods applied to documents or
M.K. Bergman (12 August 2014).
The many forms of a single fact
Ontology-based data integration
For example, currency symbols
Delimiting decimals by period
Heterogeneity (disambiguation)
Heterogeneous database system
M.K. Bergman (6 June 2006).
Differences, say, in meters
Attribute List Discrepancy
AI3:::Adaptive Information
AI3:::Adaptive Information
ID Mismatch or Missing ID
Internal Path Discrepancy
Syntactical Errors (many)
"Why your data won't mix"
Name referring to a book
For example, centimeters
4.106 in another dataset
the official city-state)
Name referring to a book
Ingest Encoding Mismatch
and the attribute values.
10.4018/jswis.2005010101
Semantics Errors (many)
Query Encoding Mismatch
Ingest Encoding Lacking
as potentially solving.
Query Encoding Lacking
information technology
Schematic Discrepancy
the aircraft carrier)
heterogeneous datasets
Semantic heterogeneity
Relevant applications
Knowledge management
Alon Halevy (2005).
Semantic integration
semantic integration
Primitive Data Type
Constraint Mismatch
Content Discrepancy
semi-structured data
Data representation
first-middle-last)
For example, ASCII
description logics
Missing Attribute
Inter-aggregation
Intra-aggregation
Semantic matching
first-order logic
Element Ordering
Case Sensitivity
Measurement Type
Item Equivalence
Case Sensitivity
data dictionaries
unstructured data
Interoperability
Interoperability
Data integration
Missing Content
Script Mismatch
UTF-8 in search
across sources:
Data management
Further reading
Schema matching
open world ones
currency names
Scale or Units
John F. Kennedy
John F. Kennedy
structured data
database schema
Type Mismatch
the president
Great Britain
United States
United States
data semantics
Classification
Missing Data
Misspellings
object types
Missing Item
Misspellings
For example,
28 September
. Retrieved
28 September
. Retrieved
Data mapping
data mapping
Data Format
millimeters
centimeters
human being
Aggregation
Great Satan
Subcategory
and various
(1): 1–18.
Camel case
lower case
Camel case
lower case
Categories
References
ontologies
As stated
Uppercase
Precision
As stated
Uncle Sam
Uppercase
Conceptual
billiards
Ray Davies
Languages
Structural
Semantics
Semantics
or other
Homonyms
Acronyms
Synonyms
the city
Examples
Category
See also
Homonyms
Acronyms
America
Synonyms
and the
Encoding
Language
semantic
datasets
is when
Naming
person
Naming
mapping
tagging
, and
Units
Domain
Berlin
Berlin
money
River
Class
Domain
(PDF)
Queue
Sheth
shot
Kinks
UTF-8
ASCII
2014
2014
(8).
Data
URIs
USA
USA
bank
bank
bank
Lola
Data
doi
cm
US
by
or
:
.
.
.
.
.
.
,
)
.
.
:
1
.
.
3
v
v
v
v
v
v
v
v
v
v
v
v
v
v
v
v
v
v
v
v
v
v
v
v
v
v
v
v
v
v
v
v
v
v
(
v
v
v
.