language-icon Old Web
English
Sign In

Data lineage

Data lineage includes the data origin, what happens to it and where it moves over time. Data lineage gives visibility while greatly simplifying the ability to trace errors back to the root cause in a data analytics process. Data lineage includes the data origin, what happens to it and where it moves over time. Data lineage gives visibility while greatly simplifying the ability to trace errors back to the root cause in a data analytics process. It also enables replaying specific portions or inputs of the data flow for step-wise debugging or regenerating lost output. Database systems use such information, called data provenance, to address similar validation and debugging challenges. Data provenance refers to records of the inputs, entities, systems, and processes that influence data of interest, providing a historical record of the data and its origins. The generated evidence supports forensic activities such as data-dependency analysis, error/compromise detection and recovery, auditing, and compliance analysis. 'Lineage is a simple type of why provenance.' Data lineage can be represented visually to discover the data flow/movement from its source to destination via various changes and hops on its way in the enterprise environment, how the data gets transformed along the way, how the representation and parameters change, and how the data splits or converges after each hop. A simple representation of the Data Lineage can be shown with dots and lines, where dot represents a data container for data point(s) and lines connecting them represents the transformation(s) the data point under goes, between the data containers. Representation broadly depends on scope of the metadata management and reference point of interest. Data lineage provides sources of the data and intermediate data flow hops from the reference point with backward data lineage, leads to the final destination's data points and its intermediate data flows with forward data lineage. These views can be combined with end to end lineage for a reference point that provides complete audit trail of that data point of interest from source(s) to its final destination(s). As the data points or hops increases, the complexity of such representation becomes incomprehensible. Thus, the best feature of the data lineage view would be to be able to simplify the view by temporarily masking unwanted peripheral data points. Tools that have the masking feature enables scalability of the view and enhances analysis with best user experience for both technical and business users. The scope of the data lineage determines the volume of metadata required to represent its data lineage. Usually, data governance, and data management determines the scope of the data lineage based on their regulations, enterprise data management strategy, data impact, reporting attributes, and critical data elements of the organization. Data lineage provides the audit trail of the data points at the highest granular level, but presentation of the lineage may be done at various zoom levels to simplify the vast information, similar to analytic web maps. Data Lineage can be visualized at various levels based on the granularity of the view. At a very high level data lineage provides what systems the data interacts before it reaches destination. As the granularity increases it goes up to the data point level where it can provide the details of the data point and its historical behavior, attribute properties, and trends and data quality of the data passed through that specific data point in the data lineage. Data governance plays a key role in metadata management for guidelines, strategies, policies, implementation. Data quality, and master data management helps in enriching the data lineage with more business value. Even though the final representation of data lineage is provided in one interface but the way the metadata is harvested and exposed to the data lineage graphical user interface could be entirely different. Thus, data lineage can be broadly divided into three categories based on the way metadata is harvested: data lineage involving software packages for structured data, programming languages, and big data. Data lineage information includes technical metadata involving data transformations. Enriched data lineage information may include data quality test results, reference data values, data models, business vocabulary, data stewards, program management information, and enterprise information systems linked to the data points and transformations. Masking feature in the data lineage visualization allows the tools to incorporate all the enrichments that matter for the specific use case. To represent disparate systems into one common view, 'metadata normalization' or standardization may be necessary. The world of big data is changing rapidly. Statistics say that 90% of the world’s data has been created in the last two years alone. This explosion of data has resulted in the ever-growing number of systems and automation at all levels in all sizes of organizations.

[ "Database", "Data mining", "Provenance" ]
Parent Topic
Child Topic
    No Parent Topic