
Semantic heterogeneity


This article is about semantic differences in data. For other uses, see Heterogeneity (disambiguation).

Semantic heterogeneity is when database schema or datasets for the same domain are developed by independent parties, resulting in differences in meaning and interpretation of data values. Beyond structured data, the problem of semantic heterogeneity is compounded due to the flexibility of semi-structured data and various tagging methods applied to documents or unstructured data. Semantic heterogeneity is one of the more important sources of differences in heterogeneous datasets.

Yet, for multiple data sources to interoperate with one another, it is essential to reconcile these semantic differences. Decomposing the various sources of semantic heterogeneities provides a basis for understanding how to map and transform data to overcome these differences.

Classification

One of the first known classification schemes applied to data semantics is from William Kent more than two decades ago. Kent's approach dealt more with structural mapping issues than differences in meaning, which he pointed to data dictionaries as potentially solving.

One of the most comprehensive classifications is from Pluempitiwiriyawej and Hammer, "Classification Scheme for Semantic and Schematic Heterogeneities in XML Data Sources". They classify heterogeneities into three broad classes:

Structural conflicts arise when the schema of the sources representing related or overlapping data exhibit discrepancies. Structural conflicts can be detected when comparing the underlying schema. The class of structural conflicts includes generalization conflicts, aggregation conflicts, internal path discrepancy, missing items, element ordering, constraint and type mismatch, and naming conflicts between the element types and attribute names.

Domain conflicts arise when the semantics of the data sources that will be integrated exhibit discrepancies. Domain conflicts can be detected by looking at the information contained in the schema and using knowledge about the underlying data domains. The class of domain conflicts includes schematic discrepancy, scale or unit, precision, and data representation conflicts.

Data conflicts refer to discrepancies among similar or related data values across multiple sources. Data conflicts can only be detected by comparing the underlying sources. The class of data conflicts includes ID-value, missing data, incorrect spelling, and naming conflicts between the element contents and the attribute values.

Moreover, mismatches or conflicts can occur between set elements (a "population" mismatch) or attributes (a "description" mismatch).

Michael Bergman expanded upon this schema by adding a fourth major explicit category of language, and also added some examples of each kind of semantic heterogeneity, resulting in about 40 distinct potential categories. This table shows the combined 40 possible sources of semantic heterogeneities across sources:

| Class | Category | Subcategory | Examples |
|---|---|---|---|
| Language | Encoding | Ingest Encoding Mismatch | For example, ASCII v UTF-8 |
| Language | Encoding | Ingest Encoding Lacking | Mis-recognition of tokens because not being parsed with the proper encoding |
| Language | Encoding | Query Encoding Mismatch | For example, ASCII v UTF-8 in search |
| Language | Encoding | Query Encoding Lacking | Mis-recognition of search tokens because not being parsed with the proper encoding |
| Language | Languages | Script Mismatch | Variations in how parsers handle, say, stemming, white spaces or hyphens |
| Language | Languages | Parsing / Morphological Analysis Errors (many) | Arabic languages (right-to-left) v Romance languages (left-to-right) |
| Language | Languages | Syntactical Errors (many) | Ambiguous sentence references, such as I'm glad I'm a man, and so is Lola (Lola by Ray Davies and the Kinks) |
| Language | Languages | Semantics Errors (many) | River bank v money bank v billiards bank shot |
| Conceptual | Naming | Case Sensitivity | Uppercase v lower case v Camel case |
| Conceptual | Naming | Synonyms | United States v USA v America v Uncle Sam v Great Satan |
| Conceptual | Naming | Acronyms | United States v USA v US |
| Conceptual | Naming | Homonyms | Such as when the same name refers to more than one concept, such as Name referring to a person v Name referring to a book |
| Conceptual | Naming | Misspellings | As stated |
| Conceptual | Generalization / Specialization | | When single items in one schema are related to multiple items in another schema, or vice versa. For example, one schema may refer to "phone" but the other schema has multiple elements such as "home phone", "work phone" and "cell phone" |
| Conceptual | Aggregation | Intra-aggregation | When the same population is divided differently (such as, Census v Federal regions for states, England v Great Britain v United Kingdom, or full person names v first-middle-last) |
| Conceptual | Aggregation | Inter-aggregation | May occur when sums or counts are included as set members |
| Conceptual | Internal Path Discrepancy | | Can arise from different source-target retrieval paths in two different schemas (for example, hierarchical structures where the elements are different levels of remove) |
| Conceptual | Missing Item | Content Discrepancy | Differences in set enumerations or including items or not (say, US territories) in a listing of US states |
| Conceptual | Missing Item | Missing Content | Differences in scope coverage between two or more datasets for the same concept |
| Conceptual | Missing Item | Attribute List Discrepancy | Differences in attribute completeness between two or more datasets |
| Conceptual | Missing Item | Missing Attribute | Differences in scope coverage between two or more datasets for the same attribute |
| Conceptual | Item Equivalence | | When two types (classes or sets) are asserted as being the same when the scope and reference are not (for example, Berlin the city v Berlin the official city-state) |
| Conceptual | Item Equivalence | | When two individuals are asserted as being the same when they are actually distinct (for example, John F. Kennedy the president v John F. Kennedy the aircraft carrier) |
| Conceptual | Type Mismatch | | When the same item is characterized by different types, such as a person being typed as an animal v human being v person |
| Conceptual | Constraint Mismatch | | When attributes referring to the same thing have different cardinalities or disjointedness assertions |
| Domain | Schematic Discrepancy | Element-value to Element-label Mapping; Attribute-value to Element-label Mapping; Element-value to Attribute-label Mapping; Attribute-value to Attribute-label Mapping | One of four errors that may occur when attribute names (say, Hair v Fur) may refer to the same attribute, or when same attribute names (say, Hair v Hair) may refer to different attribute scopes (say, Hair v Fur) or where values for these attributes may be the same but refer to different actual attributes or where values may differ but be for the same attribute and putative value. Many of the other semantic heterogeneities herein also contribute to schema discrepancies |
| Domain | Scale or Units | Measurement Type | Differences, say, in the metric v English measurement systems, or currencies |
| Domain | Scale or Units | Units | Differences, say, in meters v centimeters v millimeters |
| Domain | Precision | | For example, a value of 4.1 inches in one dataset v 4.106 in another dataset |
| Domain | Data representation | Primitive Data Type | Confusion often arises in the use of literals v URIs v object types |
| Domain | Data representation | Data Format | Delimiting decimals by period v commas; various date formats; using exponents or aggregate units (such as thousands or millions) |
| Data | Naming | Case Sensitivity | Uppercase v lower case v Camel case |
| Data | Naming | Synonyms | For example, centimeters v cm |
| Data | Naming | Acronyms | For example, currency symbols v currency names |
| Data | Naming | Homonyms | Such as when the same name refers to more than one attribute, such as Name referring to a person v Name referring to a book |
| Data | Naming | Misspellings | As stated |
| Data | ID Mismatch or Missing ID | | URIs can be a particular problem here, due to actual mismatches but also use of name spaces or not and truncated URIs |
| Data | Missing Data | | A common problem, more acute with closed world approaches than with open world ones |
| Data | Element Ordering | | Set members can be ordered or unordered, and if ordered, the sequences of individual members or values can differ |

A different approach toward classifying semantics and integration approaches is taken by Sheth et al. Under their concept, they split semantics into three forms: implicit, formal and powerful. Implicit semantics are what is either largely present or can easily be extracted; formal languages, though relatively scarce, occur in the form of ontologies or other description logics; and powerful (soft) semantics are fuzzy and not limited to rigid set-based assignments. Sheth et al.'s main point is that first-order logic (FOL) or description logic is inadequate alone to properly capture the needed semantics.

Relevant applications

Besides data interoperability, relevant areas in information technology that depend on reconciling semantic heterogeneities include data mapping, semantic integration, and enterprise information integration, among many others. From the conceptual to actual data, there are differences in perspective, vocabularies, measures and conventions once any two data sources are brought together. Explicit attention to these semantic heterogeneities is one means to get the information to integrate or interoperate.

A mere twenty years ago, information technology systems expressed and stored data in a multitude of formats and systems. The Internet and Web protocols have done much to overcome these sources of differences. While there is a large number of categories of semantic heterogeneity, these categories are also patterned and can be anticipated and corrected. These patterned sources inform what kind of work must be done to overcome semantic differences where they still reside.

See also

Data integration
Data mapping
Enterprise information integration
Heterogeneous database system
Interoperability
Ontology-based data integration
Schema matching
Semantic integration
Semantic matching
Semantics

References

Alon Halevy (2005). "Why your data won't mix". Queue. 3 (8).
William Kent (February 27 – March 3, 1989). The many forms of a single fact. Proceedings of the IEEE COMPCON. San Francisco. 13 pp.
Charnyote Pluempitiwiriyawej and Joachim Hammer (September 2000). "A classification scheme for semantic and schematic heterogeneities in XML data sources" (PDF). Gainesville, Florida: University of Florida. Technical Report TR00-004.
M.K. Bergman (6 June 2006). "Sources and classification of semantic heterogeneities". AI3:::Adaptive Information. Retrieved 28 September 2014.
M.K. Bergman (12 August 2014). "Big structure and data interoperability". AI3:::Adaptive Information. Retrieved 28 September 2014.
Amit P. Sheth; Cartic Ramakrishnan; Christopher Thomas (2005). "Semantics for the semantic Web: the implicit, the formal and the powerful". International Journal on Semantic Web and Information Systems. 1 (1): 1–18. doi:10.4018/jswis.2005010101.

Further reading

Classification of semantic heterogeneity
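
The unit and precision conflicts in the domain class (meters v centimeters, or 4.1 inches v 4.106 inches) can be illustrated with a short Python sketch. The conversion table, the function names and the tolerance below are assumptions made for this illustration only; they are not taken from any of the classification schemes cited above.

```python
from decimal import Decimal

# Hypothetical conversion factors from each unit label to meters.
TO_METERS = {
    "m": Decimal("1"),
    "cm": Decimal("0.01"),
    "mm": Decimal("0.001"),
    "in": Decimal("0.0254"),
}


def to_meters(value: str, unit: str) -> Decimal:
    """Convert a reading given as text plus a unit label into meters."""
    return Decimal(value) * TO_METERS[unit]


def same_measurement(a, b, tolerance=Decimal("0.001")) -> bool:
    """Treat two (value, unit) readings as the same measurement when they
    differ by no more than `tolerance` meters (an assumed threshold)."""
    return abs(to_meters(*a) - to_meters(*b)) <= tolerance


# Precision conflict: same unit, different number of digits.
print(same_measurement(("4.1", "in"), ("4.106", "in")))
# Unit conflict: same length expressed in inches and in centimeters.
print(same_measurement(("4.1", "in"), ("10.414", "cm")))
# Same number, different unit: a genuinely different length.
print(same_measurement(("4.1", "in"), ("4.1", "cm")))
```

Converting to a single base unit before comparing handles the unit conflict, while the tolerance handles the precision conflict. An appropriate tolerance depends on the data and has to be chosen per attribute.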

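Naming conflicts such as case sensitivity, synonyms and acronyms (United States v USA v US, or centimeters v cm) are often handled by mapping variant labels to one canonical label. The following Python sketch shows one possible way to do this; the lookup table and the function names are hypothetical, and a real vocabulary would be far larger.

```python
# Hypothetical table mapping normalized variant labels to a canonical label.
CANONICAL = {
    "united states": "United States",
    "usa": "United States",
    "us": "United States",
    "america": "United States",
    "cm": "centimeters",
    "centimeters": "centimeters",
}


def canonical_label(label: str) -> str:
    """Collapse whitespace and case, then resolve known synonyms and acronyms.
    Labels missing from the table fall back to their normalized form."""
    key = " ".join(label.split()).casefold()
    return CANONICAL.get(key, key)


def same_concept(a: str, b: str) -> bool:
    """Two labels match when they resolve to the same canonical label."""
    return canonical_label(a) == canonical_label(b)


print(same_concept("USA", "united  states"))  # acronym and case variants
print(same_concept("cm", "Centimeters"))      # synonym and case variants
print(same_concept("US", "Uncle Sam"))        # synonym not in the table
```

A label table of this kind only covers the variants it lists, and it cannot resolve homonyms or item-equivalence conflicts such as Berlin the city v Berlin the official city-state. Those cases need context beyond the label itself.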

Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.