Data lineage

Data lineage includes the data origin, what happens to it, and where it moves over time. Data lineage provides visibility and simplifies tracing errors back to the root cause in a data analytics process. It also enables replaying specific portions or inputs of the data flow for step-wise debugging or regenerating lost output. Database systems use such information, called data provenance, to address similar validation and debugging challenges. Data provenance refers to records of the inputs, entities, systems, and processes that influence data of interest, providing a historical record of the data and its origins. The generated evidence supports forensic activities such as data-dependency analysis, error/compromise detection and recovery, auditing, and compliance analysis. Data lineage is a simple type of why provenance.
The first case can be debugged by tracing the data-flow. By using lineage and data-flow information together, a data scientist can figure out how the inputs are converted into outputs. During the process, actors that behave unexpectedly can be caught. Either these actors can be removed from the data-flow, or they can be augmented by new actors to change the data-flow. The improved data-flow can then be replayed to test its validity. Debugging faulty actors can also involve recursively performing coarse-grain replay on actors in the data-flow, which can be expensive in resources for long dataflows. Another approach is to manually inspect lineage logs to find anomalies, which can be tedious and time-consuming across several stages of a data-flow. Furthermore, these approaches work only when the data scientist can discover bad outputs. To debug analytics without known bad outputs, the data scientist needs to analyze the data-flow for suspicious behavior in general. However, a user often does not know the expected normal behavior and cannot specify predicates. Lineage can instead be analyzed retrospectively to identify faulty actors in a multi-stage data-flow: sudden changes in an actor's behavior, such as its average selectivity, processing rate or output size, are characteristic of an anomaly. Lineage can reflect such changes in actor behavior over time and across different actor instances, so mining lineage to identify such changes can be useful in debugging faulty actors in a data-flow.
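A minimal sketch of this kind of lineage mining, assuming (hypothetically) that each association log record carries an actor id, a time window, and input/output counts; an actor whose selectivity suddenly deviates from its own history is flagged:

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical record shape: (actor_id, window, n_inputs, n_outputs).
# Selectivity = outputs / inputs; a large shift relative to the actor's
# own history marks a potential anomaly.

def flag_anomalous_actors(associations, threshold=3.0):
    history = defaultdict(list)
    for actor, window, n_in, n_out in sorted(associations, key=lambda a: a[1]):
        history[actor].append((window, n_out / max(n_in, 1)))

    suspects = []
    for actor, series in history.items():
        values = [s for _, s in series]
        if len(values) < 3:
            continue  # not enough history to judge
        mu, sigma = mean(values), stdev(values)
        for window, sel in series:
            if sigma > 0 and abs(sel - mu) > threshold * sigma:
                suspects.append((actor, window, sel))
    return suspects
```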
A DISC system consists of several levels of operators and data, and different use cases of lineage can dictate the level at which lineage needs to be captured. Lineage can be captured at the level of the job, using files and giving lineage tuples of the form {IF_i, M_RJob, OF_i}; it can also be captured at the level of each task, using records and giving, for example, lineage tuples of the form {(k_rr, v_rr), map, (k_m, v_m)}. The first form of lineage is called coarse-grain lineage, while the second form is called fine-grain lineage. Integrating lineage across different granularities enables users to ask questions such as "Which file read by a MapReduce job produced this particular output record?" and can be useful in debugging across different operator and data granularities within a dataflow.
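A toy illustration of the two granularities under an assumed schema (the tuple shapes and names below are illustrative, not any particular system's): coarse-grain tuples relate whole files to a job, fine-grain tuples relate individual records, and joining the two answers the quoted question:

```python
# Coarse-grain: file-level lineage for a whole job.
coarse = [("in/part-0", "Job42", "out/part-0")]          # {IF_i, M_Job42, OF_i}
# Fine-grain: record-level lineage inside the job's tasks.
fine = [(("k_rr", "v_rr"), "map", ("k_m", "v_m")),
        (("k_m", "v_m"), "reduce", ("k_out", "v_out"))]
read_from = {("k_rr", "v_rr"): "in/part-0"}              # record -> source file

def source_file(output_record):
    """Which file read by the job produced this output record?"""
    pending = {output_record}
    while pending:
        rec = pending.pop()
        if rec in read_from:
            return read_from[rec]
        pending.update(i for i, _op, o in fine if o == rec)
    return None

assert source_file(("k_out", "v_out")) == "in/part-0"
```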
Replaying only specific inputs or portions of a data-flow is crucial for efficient debugging and for simulating what-if scenarios. Ikeda et al. present a methodology for lineage-based refresh, which selectively replays updated inputs to recompute affected outputs. This is useful during debugging for re-computing outputs when a bad input has been fixed. However, sometimes a user may want to remove the bad input and replay the lineage of outputs previously affected by the error to produce error-free outputs; this is called exclusive replay. Another use of replay in debugging involves replaying bad inputs for step-wise debugging, called selective replay. Current approaches to using lineage in DISC systems do not address these, so there is a need for a lineage system that can perform both exclusive and selective replays to address different debugging needs.
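A sketch of the distinction, assuming associations of the form (inputs, actor, outputs) and a hypothetical `run` callback that re-executes an actor on a set of inputs:

```python
def exclusive_replay(associations, run, bad_inputs):
    """Re-run affected actors with the bad inputs removed."""
    results = {}
    for inputs, actor, outputs in associations:
        if inputs & bad_inputs:
            results[actor] = run(actor, inputs - bad_inputs)
    return results

def selective_replay(associations, run, bad_outputs):
    """Re-run only the actors whose outputs were bad, for step-wise debugging."""
    results = {}
    for inputs, actor, outputs in associations:
        if outputs & bad_outputs:
            results[actor] = run(actor, inputs)
    return results
```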
The first stage of data flow reconstruction is the computation of the association tables. An association table exists for each actor in each local lineage store. The entire association table for an actor can be computed by combining these individual tables, generally using a series of equality joins based on the actors themselves. In a few scenarios the tables might also be joined using inputs as the key. Indexes can also be used to improve the efficiency of a join. The joined tables need to be stored on a single instance or machine for further processing. There are multiple schemes used to pick the machine where a join is computed; the simplest picks the machine with minimum CPU load. Space constraints should also be kept in mind while picking the instance where the join will happen.
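A minimal sketch of this first stage, assuming each machine's local store is a dict from actor id to partial association rows; the global table per actor is the union of the partial tables, merged on the actor id:

```python
from collections import defaultdict

def combine_association_tables(local_stores):
    """Merge per-machine partial tables into one table per actor."""
    global_tables = defaultdict(list)
    for store in local_stores:              # one store per machine
        for actor, rows in store.items():   # rows: (inputs, outputs) pairs
            global_tables[actor].extend(rows)
    return global_tables

stores = [
    {"map-1": [({"r1"}, {"m1"})]},
    {"reduce-1": [({"m1"}, {"o1"})]},
]
tables = combine_association_tables(stores)
```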
The best-case scenario is to use a local lineage store for every machine in the distributed system network. This allows the lineage store itself to scale horizontally. In this design, the lineage of data transformations applied to the data on a particular machine is stored on the local lineage store of that specific machine. The lineage store typically stores association tables: each actor is represented by its own association table, whose rows are the associations themselves and whose columns represent the inputs and outputs. This design solves two problems. It allows horizontal scaling of the lineage store, and it avoids the additional network latency that a single centralized lineage store would incur by carrying lineage information over the network.
The massive scale and unstructured nature of data, the complexity of these analytics pipelines, and long runtimes pose significant manageability and debugging challenges. Even a single error in these analytics can be extremely difficult to identify and remove. While one may debug them by re-running the entire analytics through a debugger for step-wise debugging, this can be expensive due to the amount of time and resources needed. Auditing and data validation are other major problems, due to the growing ease of access to relevant data sources for use in experiments, the sharing of data between scientific communities, and the use of third-party data in business enterprises. These problems will only become larger and more acute as these systems and data continue to grow. As such, more cost-efficient ways of analyzing data intensive scalable computing (DISC) are crucial to their continued effective use.
The information stored in terms of associations needs to be combined by some means to get the data flow of a particular job. In a distributed system a job is broken down into multiple tasks, and one or more instances run a particular task. The results produced on these individual machines are later combined together to finish the job. Tasks running on different machines perform multiple transformations on the data on those machines. All the transformations applied to the data on a machine are stored in the local lineage store of that machine. This information needs to be combined to get the lineage of the entire job. The lineage of the entire job should help the data scientist understand the data flow of the job, and the data flow can be used to debug the big data pipeline. The data flow is reconstructed in three stages.
An actor is an entity that transforms data; it may be a Dryad vertex, an individual map or reduce operator, a MapReduce job, or an entire dataflow pipeline. Actors act as black boxes, and the inputs and outputs of an actor are tapped to capture lineage in the form of associations, where an association is a triplet {i, T, o} that relates an input i with an output o for an actor T. The instrumentation thus captures lineage in a dataflow one actor at a time, piecing it into a set of associations for each actor. The system developer needs to capture the data an actor reads (from other actors) and the data an actor writes (to other actors). For example, a developer can treat the Hadoop Job Tracker as an actor by recording the set of files read and written by each job.
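A sketch of such instrumentation, with a hypothetical wrapper and `lineage_log` (not a real system's API) that taps an actor's inputs and outputs and records {i, T, o} triplets:

```python
lineage_log = []

def instrument(actor_name, actor_fn):
    """Wrap a black-box actor so its input/output pairs are recorded."""
    def wrapped(inputs):
        outputs = actor_fn(inputs)
        for i in inputs:
            for o in outputs:
                lineage_log.append((i, actor_name, o))
        return outputs
    return wrapped

word_count = instrument("word_count",
                        lambda lines: [(w, 1) for l in lines for w in l.split()])
word_count(["a b a"])  # lineage_log now relates the input line to each pair
```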
Intuitively, for an operator T producing output o, lineage consists of triplets of the form {I, T, o}, where I is the set of inputs to T used to derive o. Capturing lineage for each operator T in a dataflow enables users to ask questions such as "Which outputs were produced by an input i on operator T?" and "Which inputs produced output o in operator T?" A query that finds the inputs deriving an output is called a backward tracing query, while one that finds the outputs produced by an input is called a forward tracing query. Backward tracing is useful for debugging, while forward tracing is useful for tracking error propagation. Tracing queries also form the basis for replaying an original dataflow. However, to efficiently use lineage in a DISC system, we need to be able to capture lineage at multiple levels (or granularities) of operators and data, capture accurate lineage for DISC processing constructs, and be able to trace through multiple dataflow stages efficiently.
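A sketch of both query types over associations (i, T, o), following lineage transitively across stages; dataflows are assumed acyclic:

```python
def backward_trace(associations, output):
    """All inputs that (transitively) derived `output`."""
    frontier, sources = {output}, set()
    while frontier:
        data = frontier.pop()
        for i, _t, o in associations:
            if o == data and i not in sources:
                sources.add(i)
                frontier.add(i)
    return sources

def forward_trace(associations, source):
    """All outputs (transitively) produced from `source`."""
    frontier, sinks = {source}, set()
    while frontier:
        data = frontier.pop()
        for i, _t, o in associations:
            if i == data and o not in sinks:
                sinks.add(o)
                frontier.add(o)
    return sinks
```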
Unstructured data usually refers to information that doesn't reside in a traditional row-column database. Unstructured data files often include text and multimedia content. Examples include e-mail messages, word-processing documents, videos, photos, audio files, presentations, webpages and many other kinds of business documents. Note that while these sorts of files may have an internal structure, they are still considered "unstructured" because the data they contain doesn't fit neatly in a database. Experts estimate that 80 to 90 percent of the data in any organization is unstructured, and the amount of unstructured data in enterprises is growing significantly, often many times faster than structured databases are growing.
Tracing is essential for debugging, during which a user can issue multiple tracing queries, so it is important that tracing has fast turnaround times. Ikeda et al. can perform efficient backward tracing queries for MapReduce dataflows, but are not generic to different DISC systems and do not perform efficient forward queries. Lipstick, a lineage system for Pig, while able to perform both backward and forward tracing, is specific to Pig and SQL operators and can only perform coarse-grain tracing for black-box operators. Thus, there is a need for a lineage system that enables efficient forward and backward tracing for generic DISC systems and dataflows with black-box operators.
According to an EMC/IDC study, 2.8 ZB of data were created and replicated in 2012, the digital universe will double every two years between now and 2020, and there will be approximately 5.2 TB of data for every person in 2020. Working with this scale of data has become very challenging.

Distributed systems like Google Map Reduce, Microsoft Dryad, Apache Hadoop (an open-source project) and Google Pregel provide such platforms for businesses and users. However, even with these systems, big data analytics can take several hours, days or weeks to run, simply due to the data volumes involved. For example, a ratings prediction algorithm for the Netflix Prize challenge took nearly 20 hours to execute on 50 cores, and a large-scale image processing task to estimate geographic information took 3 days to complete using 400 cores. "The Large Synoptic Survey Telescope is expected to generate terabytes of data every night and eventually store more than 50 petabytes, while in the bioinformatics sector, the largest genome sequencing houses in the world now store petabytes of data apiece." It is very difficult for a data scientist to trace an unknown or an unanticipated result.
To capture end-to-end lineage in a DISC system, the Ibis model introduces the notion of containment hierarchies for operators and data. Specifically, Ibis proposes that an operator can be contained within another, and such a relationship between two operators is called operator containment. "Operator containment implies that the contained (or child) operator performs a part of the logical operation of the containing (or parent) operator." For example, a MapReduce task is contained in a job. Similar containment relationships exist for data as well, called data containment: data containment implies that the contained data is a subset of the containing data (its superset).

[Figure: a MapReduce job showing the containment hierarchy.]
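A toy model of operator containment (assumed, not Ibis's actual API), where task-level lineage can be rolled up to the containing job:

```python
containment = {"map-task-3": "Job42", "reduce-task-1": "Job42"}

def roll_up(association):
    """Lift a task-level association {i, T, o} to its containing operator."""
    i, task, o = association
    return (i, containment.get(task, task), o)

assert roll_up(("rec1", "map-task-3", "rec2")) == ("rec1", "Job42", "rec2")
```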
An association is a combination of the inputs, the outputs and the operation itself. The operation is represented as a black box, also known as the actor. Associations describe the transformations that are applied to the data, and they are stored in association tables. Each unique actor is represented by its own association table. An association itself looks like {i, T, o}, where i is the set of inputs to the actor T and o is the set of outputs produced by the actor. Associations are the basic units of data lineage. Individual associations are later combined to construct the entire history of transformations that were applied to the data.
Lineage capture systems must also be fault-tolerant, to avoid rerunning data flows to capture lineage. At the same time, they must accommodate failures in the DISC system. To do so, they must be able to identify a failed DISC task and avoid storing duplicate copies of lineage between the partial lineage generated by the failed task and the duplicate lineage produced by the restarted task. A lineage system should also be able to gracefully handle multiple instances of local lineage systems going down. This can be achieved by storing replicas of lineage associations on multiple machines; a replica acts like a backup in the event of the real copy being lost.
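A sketch of the de-duplication idea, under an assumed schema in which lineage records are keyed by task and association rather than by attempt, so a restarted task supersedes the failed attempt's partial records:

```python
store = {}

def record(task_id, attempt, association):
    key = (task_id, association)
    if key not in store or attempt > store[key]:
        store[key] = attempt  # later attempt supersedes the failed one

record("task-7", 0, ("i1", "map", "o1"))  # partial output of a failed attempt
record("task-7", 1, ("i1", "map", "o1"))  # restarted task; no duplicate kept
```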
The second problem, i.e. the existence of outliers, can also be identified by running the data-flow step-wise and looking at the transformed outputs. The data scientist finds a subset of outputs that are not in accordance with the rest of the outputs. The inputs causing these bad outputs are the outliers in the data. This problem can be solved by removing the set of outliers from the data and replaying the entire data-flow. It can also be solved by modifying the machine learning algorithm by adding, removing or moving actors in the data-flow. The changes to the data-flow are successful if the replayed data-flow does not produce bad outputs.
Lineage systems for DISC dataflows must be able to capture accurate lineage across black-box operators to enable fine-grain debugging. Current approaches to this include Prober, which seeks to find the minimal set of inputs that can produce a specified output for a black-box operator by replaying the data-flow several times to deduce the minimal set, and dynamic slicing, as used by Zhang et al. to capture lineage for NoSQL operators through binary rewriting to compute dynamic slices. Although producing highly accurate lineage, such techniques can incur significant time overheads for capture or tracing, and it may be preferable to instead trade some accuracy for better performance. Thus, there is a need for a lineage collection system for DISC dataflows that can capture lineage from arbitrary operators with reasonable accuracy, and without significant overheads in capture or tracing.
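A deliberately naive sketch of the Prober-style idea: repeatedly replay a black-box operator, greedily dropping inputs that are not needed to reproduce the target output (real systems deduce the minimal set far more efficiently):

```python
def minimal_inputs(operator, inputs, target_output):
    """Shrink `inputs` to a set still producing `target_output`."""
    needed = list(inputs)
    for candidate in list(needed):
        trial = [i for i in needed if i != candidate]
        if target_output in operator(trial):
            needed = trial  # candidate was not required
    return set(needed)
```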
DISC systems are primarily batch processing systems designed for high throughput. They execute several jobs per analytics task, with several tasks per job. The overall number of operators executing at any time in a cluster can range from hundreds to thousands, depending on the cluster size. Lineage capture for these systems must be able to scale to both large volumes of data and numerous operators to avoid becoming a bottleneck for the DISC analytics.
In distributed systems, sometimes there are implicit links, which are not specified during execution. For example, an implicit link exists between an actor that wrote to a file and another actor that read from it. Such links connect actors which use a common data set for execution. The dataset is the output of the first actor and the input of the actor following it.
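A sketch of inferring such implicit links, assuming I/O events of the form (actor, "read"/"write", dataset):

```python
from collections import defaultdict

def implicit_links(io_events):
    """Link every writer of a dataset to every reader of that dataset."""
    writers, readers = defaultdict(set), defaultdict(set)
    for actor, op, dataset in io_events:
        (writers if op == "write" else readers)[dataset].add(actor)
    links = set()
    for dataset, ws in writers.items():
        for w in ws:
            for r in readers.get(dataset, ()):
                links.add((w, r, dataset))
    return links
```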
Data lineage provides the audit trail of the data points at the highest granular level, but presentation of the lineage may be done at various zoom levels to simplify the vast information, similar to analytic web maps. Data lineage can be visualized at various levels based on the granularity of the view. At a very high level, data lineage shows which systems the data interacts with before it reaches its destination. As the granularity increases, it goes up to the data point level, where it can provide the details of the data point and its historical behavior, attribute properties, and trends and data quality of the data passed through that specific data point in the data lineage.
One of the primary debugging concerns in DISC systems is identifying faulty operators. In long dataflows with several hundred operators or tasks, manual inspection can be tedious and prohibitive. Even if lineage is used to narrow the subset of operators to examine, the lineage of a single output can still span several operators. There is a need for an inexpensive automated debugging system which can substantially narrow the set of potentially faulty operators, with reasonable accuracy, to minimize the amount of manual examination required.
Developers can attach data flow archetypes to each logical actor. A data flow archetype explains how the child types of an actor type arrange themselves in a data flow. With the help of this information, one can infer a link between each actor of a source type and a destination type. For example, in the MapReduce architecture, the map actor type is the source for reduce, and vice versa. The system infers this from the data flow archetypes and duly links map instances with reduce instances. However, there may be several MapReduce jobs in the data flow, and linking all map instances with all reduce instances can create false links. To prevent this, such links are restricted to actor instances contained within a common actor instance of a containing (or parent) actor type. Thus, map and reduce instances are only linked to each other if they belong to the same job.
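A sketch of archetype-based link inference with the containment restriction, using assumed instance records of the form (id, type, containing job):

```python
from itertools import product

archetype = [("map", "reduce")]  # source type -> destination type
instances = [
    ("map-1", "map", "job-1"), ("reduce-1", "reduce", "job-1"),
    ("map-2", "map", "job-2"), ("reduce-2", "reduce", "job-2"),
]

def inferred_links():
    links = []
    for src_type, dst_type in archetype:
        srcs = [i for i in instances if i[1] == src_type]
        dsts = [i for i in instances if i[1] == dst_type]
        for s, d in product(srcs, dsts):
            if s[2] == d[2]:  # same containing job only, to avoid false links
                links.append((s[0], d[0]))
    return links
```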
The simplest link is an explicitly specified link between two actors. These links are explicitly specified in the code of a machine learning algorithm. When an actor is aware of its exact upstream or downstream actor, it can communicate this information to the lineage API. This information is later used to link these actors during the tracing query. For example, in the MapReduce architecture, each map instance knows the exact record reader instance whose output it consumes.
The second step in data flow reconstruction is computing an association graph from the lineage information. The graph represents the steps in the data flow. The actors act as vertices and the associations act as edges. Each actor T is linked to its upstream and downstream actors in the data flow. An upstream actor of T is one that produced the input of T, while a downstream actor is one that consumes the output of T. Containment relationships are always considered while creating the links. The graph consists of three types of links or edges.
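A sketch of building the graph, assuming associations of the form (inputs, actor, outputs): an edge is drawn from the producer of a datum to each of its consumers:

```python
from collections import defaultdict

def build_graph(associations):
    producers = defaultdict(set)
    for inputs, actor, outputs in associations:
        for o in outputs:
            producers[o].add(actor)
    edges = set()
    for inputs, actor, outputs in associations:
        for i in inputs:
            for upstream in producers.get(i, ()):
                edges.add((upstream, actor))  # upstream -> downstream
    return edges
```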
Big data systems scale horizontally, i.e. they increase capacity by adding new hardware or software entities to the distributed system. The distributed system acts as a single entity at the logical level even though it comprises multiple hardware and software entities, and the system should continue to maintain this property after horizontal scaling. An important advantage of horizontal scalability is that it provides the ability to increase capacity on the fly. Another major advantage is that horizontal scaling can be done using commodity hardware.
Lazy lineage collection typically captures only coarse-grain lineage at run time. These systems incur low capture overheads due to the small amount of lineage they capture. However, to answer fine-grain tracing queries, they must replay the data flow on all (or a large part) of its input and collect fine-grain lineage during the replay. This approach is suitable for forensic systems, where a user wants to debug an observed bad output.
Provenance is defined as a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing. In particular, the provenance of information is crucial in deciding whether information is to be trusted, how it should be integrated with other diverse information sources, and how to give credit to its originators when reusing it. In an open and inclusive environment such as the Web, where users find information that is often contradictory or questionable, provenance can help those users to make trust judgements.
Representation broadly depends on the scope of the metadata management and the reference point of interest. Data lineage provides sources of the data and intermediate data flow hops from the reference point with backward data lineage, leading to the final destination's data points and its intermediate data flows with forward data lineage. These views can be combined with end-to-end lineage for a reference point that provides a complete audit trail of that data point of interest from sources to their final destinations. As the data points or hops increase, the complexity of such representation becomes incomprehensible. Thus, the best feature of the data lineage view is the ability to simplify the view by temporarily masking unwanted peripheral data points. Tools that have the masking feature enable scalability of the view and enhance analysis with the best user experience for both technical and business users. Data lineage also enables companies to trace sources of specific business data for the purposes of tracking errors, implementing changes in processes, and implementing system migrations to save significant amounts of time and resources, thereby tremendously improving BI efficiency.
Even though the use of data lineage approaches is a novel way of debugging big data pipelines, the process is not simple. The challenges include scalability of the lineage store, fault tolerance of the lineage store, accurate capture of lineage for black-box operators, and many others. These challenges must be considered carefully, and the trade-offs between them need to be evaluated to make a realistic design for data lineage capture.
Active collection systems capture the entire lineage of the data flow at run time. The kind of lineage they capture may be coarse-grain or fine-grain, but they do not require any further computations on the data flow after its execution. Active fine-grain lineage collection systems incur higher capture overheads than lazy collection systems; however, they enable sophisticated replay and debugging.
Data provenance provides a historical record of the data and its origins. The provenance of data which is generated by complex transformations such as workflows is of considerable value to scientists. From it, one can ascertain the quality of the data based on its ancestral data and derivations, track back sources of errors, allow automated re-enactment of derivations to update data, and provide attribution of data sources. Provenance is also essential to the business domain, where it can be used to drill down to the source of data in a data warehouse, track the creation of intellectual property, and provide an audit trail for regulatory purposes.
In today's competitive business environment, companies have to find and analyze the relevant data they need quickly. The challenge is going through the volumes of data and accessing the level of detail needed, all at a high speed. The challenge only grows as the degree of granularity increases. One possible solution is hardware: some vendors are using increased memory and parallel processing to crunch large volumes of data quickly. Another method is putting data in-memory but using a grid computing approach, where many machines are used to solve a problem. Both approaches allow organizations to explore huge data volumes. Even with this level of sophisticated hardware and software, some large-scale image processing tasks take a few days to a few weeks. Debugging of the data processing is extremely hard due to long run times.
Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness. The PROV Family of Documents defines a model, corresponding serializations and other supporting definitions to enable the interoperable interchange of provenance information in heterogeneous environments such as the Web.
The fundamental challenge of unstructured data sources is that they are difficult for non-technical business users and data analysts alike to unbox, understand, and prepare for analytic use. Beyond issues of structure is the sheer volume of this type of data. Because of this, current data mining techniques often leave out valuable information and make analyzing unstructured data laborious and expensive.
The use of data provenance is proposed in distributed systems to trace records through a dataflow, replay the dataflow on a subset of its original inputs, and debug data flows. To do so, one needs to keep track of the set of inputs to each operator which were used to derive each of its outputs. Although there are several forms of provenance, such as copy-provenance and how-provenance, the information needed here is a simple form of why-provenance, or lineage, as defined by Cui et al.
This is the most crucial step in big data debugging. The captured lineage is combined and processed to obtain the data flow of the pipeline. The data flow helps the data scientist or developer look deeply into the actors and their transformations, allowing them to identify the part of the algorithm that is generating the unexpected output.
Many certified compliance reports require provenance of the data flow as well as the end-state data for a specific instance. In these situations, any deviation from the prescribed path needs to be accounted for and potentially remediated. This marks a shift from purely "looking back" to a framework which is better suited to capture compliance workflows.
The terms 'data lineage' and 'provenance' generally describe the sequence of steps or processes through which a dataset has passed to reach its current state. However, looking back at audit or log correlations to determine the lineage from a forensic point of view fails for certain data management cases. For instance, it is impossible to determine with certainty whether the route a data workflow took was correct or in compliance without the logic model.
Data lineage can be represented visually to discover the data flow and movement from its source to destination via various changes and hops on its way in the enterprise environment: how the data gets transformed along the way, how the representation and parameters change, and how the data splits or converges after each hop. A simple representation of the data lineage can be shown with dots and lines, where a dot represents a data container for data points, and the lines connecting them represent the transformations the data points undergo between the data containers.
Big data platforms have a very complicated structure, in which data is distributed among several machines. Typically the jobs are mapped onto several machines and the results are later combined by reduce operations. Debugging a big data pipeline becomes very challenging because of the very nature of the system. It is not an easy task for the data scientist to figure out which machine's data has the outliers and unknown features that cause a particular algorithm to give unexpected results.
The final step in data flow reconstruction is the topological sorting of the association graph. The directed graph created in the previous step is topologically sorted to obtain the order in which the actors modified the data. This order of the actors defines the data flow of the big data pipeline or task.
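A sketch of this step using Kahn's algorithm over the association graph's (upstream, downstream) edges:

```python
from collections import defaultdict, deque

def topological_order(edges):
    indegree, adj = defaultdict(int), defaultdict(list)
    nodes = {n for e in edges for n in e}
    for u, v in edges:
        adj[u].append(v)
        indegree[v] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return order  # len(order) < len(nodes) would indicate a cycle
```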
Big data analytics is the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. These analytics apply machine learning algorithms and other transformations to the data. Due to the enormous size of the data, there could be unknown features in the data, possibly even outliers. It is quite difficult for a data scientist to actually debug an unexpected result.
Data lineage information includes technical metadata involving data transformations. Enriched data lineage information may include data quality test results, reference data values, data models, business vocabulary, data stewards, program management information, and enterprise information systems linked to the data points and transformations. A masking feature in the data lineage visualization allows the tools to incorporate all the enrichments that matter for the specific use case. To represent disparate systems in one common view, "metadata normalization" or standardization may be necessary.
Another method to track data lineage is through spreadsheet programs such as Excel, which offer users cell-level lineage, i.e. the ability to see which cells are dependent on another; however, the structure of the transformation is lost. Similarly, ETL or mapping software provide transform-level lineage, yet this view typically doesn't display data and is too coarse-grained to distinguish between transforms that are logically independent (e.g. transforms that operate on distinct columns) or dependent.
Master data management helps in enriching the data lineage with more business value. Even though the final representation of data lineage is provided in one interface, the way the metadata is harvested and exposed to the data lineage graphical user interface can be entirely different. Thus, data lineage can be broadly divided into three categories based on the way metadata is harvested: data lineage involving software packages for structured data, programming languages, and big data.
The horizontal scaling feature of big data systems should be taken into account while creating the architecture of the lineage store. This is essential because the lineage store itself should be able to scale in parallel with the big data system. The number of associations and the amount of storage required to store lineage will increase with the size and capacity of the system. The architecture of big data systems makes the use of a single, centralized lineage store inappropriate and impossible to scale. The immediate solution to this problem is to distribute the lineage store itself.
A third approach of advanced data discovery solutions combines self-service data preparation with visual data discovery, enabling analysts to simultaneously prepare and visualize data side-by-side in an interactive analysis environment offered by newer companies such as Trifacta, Alteryx and others.
A big data pipeline can go wrong in two broad ways. The first is the presence of a suspicious actor in the data-flow; the second is the existence of outliers in the data.
Data provenance or data lineage can be used to make the debugging of a big data pipeline easier. This necessitates the collection of data about data transformations, which is the role of data provenance.
1096:"New Digital Universe Study Reveals Big Data Gap Less Than 1 of World s Data is Analyzed Less Than 20 is Protected" 164:
The scope of the data lineage determines the volume of metadata required to represent its data lineage. Usually,
124: 1499: 1335:
The concept of prescriptive data lineage combines the logical model (entity) of how that data should flow with the actual lineage for that instance. Only by combining a logical model with atomic forensic events can proper activities be validated:

- Authorized copies, joins, or CTAS operations
- Mapping of processing to the systems those processes run on
- Ad-hoc versus established processing sequences
Data governance plays a key role in metadata management for guidelines, strategies, policies, and implementation.
Big data can include both structured and unstructured data, but IDC estimates that 90 percent of big data is unstructured data.

See also

- Directed acyclic graph
- Persistent staging area, a staging area which tracks the whole change history of a source table or query

References

1. Peter Buneman, Sanjeev Khanna, and Wang-Chiew Tan. Data provenance: Some basic issues. In Proceedings of the 20th Conference on Foundations of Software Technology and Theoretical Computer Science (FST TCS 2000), pages 87–93, London, UK, 2000. Springer-Verlag.
2. Y. Cui and J. Widom. Lineage tracing for general data warehouse transformations. VLDB Journal, 12(1), 2003.
3. Robert Ikeda and Jennifer Widom. Data lineage: A survey. Technical report, Stanford University, 2009.
4. Robert Ikeda, Hyunjung Park, and Jennifer Widom. Provenance for generalized map and reduce workflows. In Proc. of CIDR, January 2011.
5. Robert Ikeda, Semih Salihoglu, and Jennifer Widom. Provenance-based refresh in data-oriented workflows. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM '11), pages 1659–1668, New York, NY, USA, 2011.
6. Yael Amsterdamer, Susan B. Davidson, Daniel Deutch, Tova Milo, and Julia Stoyanovich. Putting lipstick on a pig: Enabling database-style workflow provenance. In Proc. of VLDB, August 2011.
7. C. Olston and A. Das Sarma. Ibis: A provenance manager for multi-layer systems. In Proc. of CIDR, January 2011.
8. Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig Latin: A not-so-foreign language for data processing. In Proc. of ACM SIGMOD, Vancouver, Canada, June 2008.
9. Anish Das Sarma, Alpa Jain, and Philip Bohannon. PROBER: Ad-Hoc Debugging of Extraction and Integration Pipelines. Technical report, Yahoo, April 2010.
10. Mingwu Zhang, Xiangyu Zhang, Xiang Zhang, and Sunil Prabhakar. Tracing lineage beyond relational operators. In Proc. Conference on Very Large Data Bases (VLDB), September 2007.
11. Dionysios Logothetis, Soumyarupa De, and Kenneth Yocum. Scalable lineage capture for debugging DISC analytics. In Proceedings of the 4th Annual Symposium on Cloud Computing (SOCC '13), Article 17, New York, NY, USA, 2013. ACM.
12. De, Soumyarupa. Newt: an architecture for lineage based replay and debugging in DISC systems. UC San Diego, 2012. https://escholarship.org/uc/item/3170p7zn
13. Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, January 2008.
14. Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems (EuroSys '07), pages 59–72, New York, NY, USA, 2007.
15. Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 International Conference on Management of Data (SIGMOD '10), pages 135–146, New York, NY, USA, 2010.
16. Shimin Chen and Steven W. Schlosser. Map-Reduce meets wider varieties of applications. Technical report, Intel Research, 2008.
17. Ian Foster, Jens Vockler, Michael Wilde, and Yong Zhao. Chimera: A virtual data system for representing, querying, and automating data derivation. In 14th International Conference on Scientific and Statistical Database Management, July 2002.
18. Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. Dapper, a large-scale distributed systems tracing infrastructure. Technical report, Google Inc, 2010.
19. Yogesh L. Simmhan, Beth Plale, and Dennis Gannon. A survey of data provenance in e-science. SIGMOD Rec., 34(3):31–36, September 2005.
20. Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, and Ion Stoica. X-Trace: A pervasive network tracing framework. In Proceedings of NSDI '07, 2007.
21. Wenchao Zhou, Qiong Fei, Arjun Narayan, Andreas Haeberlen, Boon Thau Loo, and Micah Sherr. Secure network provenance. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP), December 2011.
22. Pasquier, Thomas; Lau, Matthew K.; Trisovic, Ana; Boose, Emery R.; Couturier, Ben; Crosas, Mercè; Ellison, Aaron M.; Gibson, Valerie; Jones, Chris R.; Seltzer, Margo. "If these data could talk". Scientific Data, 4:170114, 5 September 2017.
23. "PROV-Overview, An Overview of the PROV Family of Documents". W3C.
24. "PROV-DM: The PROV Data Model". W3C.
25. "What is Data Lineage? - Definition from Techopedia".
26. Drori, Amanon (2020-05-18). "What is Data Lineage? - Octopai".
27. Hoang, Natalie (2017-03-16). "Data Lineage Helps Drives Business Value - Trifacta".
28. Kandel, Sean (2016-11-04). "Tracking Data Lineage in Financial Services - Trifacta".
29. Schaefer, Paige (2016-08-24). "Differences Between Structured & Unstructured Data". Trifacta.
30. "5 Requirements for Effective Self-Service Data Preparation". itbusinessedge.com, 18 February 2016.
31. "New Digital Universe Study Reveals Big Data Gap: Less Than 1% of World's Data is Analyzed; Less Than 20% is Protected". EMC/IDC.
32. "Five big data challenges". SAS. http://www.sas.com/resources/asset/five-big-data-challenges-article.pdf
33. "The data deluge in genomics". https://www-304.ibm.com/connections/blogs/ibmhealthcare/entry/data
34. Apache Hadoop. http://hadoop.apache.org
35. SEC Small Entity Compliance Guide.

Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.