
Data profiling


Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics or informative summaries about that data. The purpose of these statistics may be to:

1. Find out whether existing data can be easily used for other purposes.
2. Improve the ability to search data by tagging it with keywords or descriptions, or by assigning it to a category.
3. Assess data quality, including whether the data conforms to particular standards or patterns.
4. Assess the risk involved in integrating data in new applications, including the challenges of joins.
5. Discover metadata of the source database, including value patterns and distributions, key candidates, foreign-key candidates, and functional dependencies.
6. Assess whether known metadata accurately describes the actual values in the source database.
7. Understand data challenges early in any data-intensive project, so that late project surprises are avoided. Finding data problems late in the project can lead to delays and cost overruns.
8. Have an enterprise view of all data, for uses such as master data management, where key data is needed, or data governance for improving data quality.

Introduction

Data profiling refers to the analysis of information for use in a data warehouse in order to clarify the structure, content, relationships, and derivation rules of the data. Profiling helps not only to understand anomalies and assess data quality, but also to discover, register, and assess enterprise metadata. The result of the analysis is used to determine the suitability of the candidate source systems, usually giving the basis for an early go/no-go decision, and also to identify problems for later solution design.

How data profiling is conducted

Data profiling uses methods of descriptive statistics such as minimum, maximum, mean, mode, percentile, standard deviation, frequency, and variation, aggregates such as count and sum, and additional metadata obtained during profiling such as data type, length, discrete values, uniqueness, occurrence of null values, typical string patterns, and abstract type recognition. The metadata can then be used to discover problems such as illegal values, misspellings, missing values, varying value representation, and duplicates.

Different analyses are performed for different structural levels. For example, single columns can be profiled individually to get an understanding of the frequency distribution of different values, and of the type and use of each column. Embedded value dependencies can be exposed in a cross-column analysis. Finally, overlapping value sets, possibly representing foreign-key relationships between entities, can be explored in an inter-table analysis.

Normally, purpose-built tools are used for data profiling to ease the process. The computational complexity increases when going from single-column, to single-table, to cross-table structural profiling. Therefore, performance is an evaluation criterion for profiling tools.

When is data profiling conducted?

According to Kimball, data profiling is performed several times and with varying intensity throughout the data warehouse development process. A light profiling assessment should be undertaken immediately after candidate source systems have been identified and DW/BI business requirements have been satisfied. The purpose of this initial analysis is to clarify at an early stage whether the correct data is available at the appropriate level of detail and whether anomalies can be handled subsequently. If this is not the case, the project may be terminated.

Additionally, more in-depth profiling is done prior to the dimensional modeling process in order to assess what is required to convert data into a dimensional model. Detailed profiling extends into the ETL system design process in order to determine the appropriate data to extract and which filters to apply to the data set.

Data profiling may also be conducted later in the data warehouse development process, after data has been loaded into staging, the data marts, etc. Conducting data profiling at these stages helps ensure that data cleaning and transformations have been done correctly and in compliance with requirements.

Benefits and examples

The benefits of data profiling are to improve data quality, shorten the implementation cycle of major projects, and improve users' understanding of data. Discovering business knowledge embedded in the data itself is one of the significant benefits derived from data profiling. Data profiling is one of the most effective technologies for improving data accuracy in corporate databases.
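As an illustration of the single-column analysis described under "How data profiling is conducted", the following is a minimal sketch using only the Python standard library. The function names `profile_column` and `pattern_of`, the choice of statistics, and the sample ZIP-code column are assumptions made for this example; dedicated profiling tools compute far more than this.

```python
from collections import Counter
import re
import statistics


def pattern_of(value: str) -> str:
    """Generalize a string: every letter becomes 'A', every digit becomes '9'."""
    return re.sub(r"[0-9]", "9", re.sub(r"[A-Za-z]", "A", value))


def profile_column(values):
    """Collect simple profiling metadata for one column.

    `values` is a list of cell values; None stands for a null.
    """
    non_null = [v for v in values if v is not None]
    numeric = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    strings = [v for v in non_null if isinstance(v, str)]
    profile = {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        # True when no non-null value repeats (a key candidate).
        "is_unique": len(set(non_null)) == len(non_null),
        # Most frequent values and most frequent string patterns.
        "frequencies": Counter(non_null).most_common(3),
        "patterns": Counter(pattern_of(s) for s in strings).most_common(3),
    }
    if numeric:
        profile["min"] = min(numeric)
        profile["max"] = max(numeric)
        profile["mean"] = statistics.mean(numeric)
    return profile


# Hypothetical sample column: one null, one duplicate, one malformed code.
zip_codes = ["90210", "10001", None, "1000A", "10001"]
zip_profile = profile_column(zip_codes)
print(zip_profile)
```

In this sample the rare pattern `9999A` next to the dominant `99999` points at a likely illegal value, and the null count and duplicate value show up directly in the metadata.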
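The cross-column analysis mentioned under "How data profiling is conducted" can be illustrated with a check for a candidate functional dependency. This is a simplified sketch: the helper name `fd_violations` and the sample rows are hypothetical, and real tools search for dependencies rather than testing one that is supplied by hand.

```python
from collections import defaultdict


def fd_violations(rows, determinant, dependent):
    """Return the determinant values that map to more than one dependent value.

    An empty result means the dependency determinant -> dependent holds
    for the given rows.
    """
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    return {key: vals for key, vals in seen.items() if len(vals) > 1}


# Hypothetical sample: the same ZIP code appears with two spellings of a city.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "90210", "city": "Beverly Hills"},
]
violations = fd_violations(rows, "zip", "city")
print(violations)
```

A violation such as the one for `10001` here is the kind of varying value representation that profiling metadata is meant to surface.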
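For the inter-table analysis mentioned under "How data profiling is conducted", one simple heuristic is to measure how much of one column's value set is contained in another column's value set. The sketch below assumes that approach; the names `inclusion_ratio`, `orders_customer_id`, and `customers_id` are invented for the example, and a high ratio only suggests a foreign-key candidate rather than proving one.

```python
def inclusion_ratio(child_values, parent_values):
    """Fraction of distinct non-null child values that also occur in the parent."""
    child = {v for v in child_values if v is not None}
    if not child:
        return 0.0
    parent = set(parent_values)
    return len(child & parent) / len(child)


# Hypothetical columns from two tables.
orders_customer_id = [1, 2, 2, 3, 7, None]
customers_id = [1, 2, 3, 4, 5]

ratio = inclusion_ratio(orders_customer_id, customers_id)
# Child values with no matching parent row ("orphans").
orphans = {v for v in orders_customer_id if v is not None} - set(customers_id)
print(ratio, orphans)
```

Comparing every column pair across tables this way is quadratic in the number of columns, which is one reason the cost of profiling rises when moving from single-column to cross-table analysis.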
See also

Data quality
Data governance
Master data management
Database normalization
Data visualization
Analysis paralysis
Data analysis

References

Johnson, Theodore (2009). "Data Profiling". In Springer, Heidelberg (ed.). Encyclopedia of Database Systems.
Woodall, Philip; Oberhofer, Martin; Borek, Alexander (2014). "A classification of data quality assessment and improvement methods". International Journal of Information Quality. 3 (4): 298. doi:10.1504/ijiq.2014.068656.
Kimball, Ralph; et al. (2008). The Data Warehouse Lifecycle Toolkit (Second ed.). Wiley. p. 376. ISBN 9780470149775.
Loshin, David (2009). Master Data Management. Morgan Kaufmann. pp. 94–96. ISBN 9780123742254.
Loshin, David (2003). Business Intelligence: The Savvy Manager's Guide, Getting Onboard with Emerging IT. Morgan Kaufmann. pp. 110–111. ISBN 9781558609167.
Rahm, Erhard; Hai Do, Hong (December 2000). "Data Cleaning: Problems and Current Approaches". Bulletin of the Technical Committee on Data Engineering. 23 (4). IEEE Computer Society.
Singh, Ranjit; Singh, Kawaljeet; et al. (May 2010). "A Descriptive Classification of Causes of Data Quality Problems in Data Warehousing". IJCSI International Journal of Computer Science Issue. 2. 7 (3).
Kimball, Ralph (2004). "Kimball Design Tip #59: Surprising Value of Data Profiling" (PDF). Kimball Group.
Olson, Jack E. (2003). Data Quality: The Accuracy Dimension. Morgan Kaufmann. pp. 140–142.


Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.